Psychology is in the middle of a sea change in its attitudes towards direct replication. Despite the value of direct replications in providing evidence for the reliability of a particular experimental finding, incentives to conduct them have typically been limited. Increasingly, however, journals and funding agencies value these sorts of efforts. One major challenge has been evaluating the success of direct replication studies. In short, how do we know if the finding is the same?
There has been limited consensus on this issue, so many projects have used a diversity of methods. The RP:P 100-study replication project reports several indicators of replication success, including 1) the statistical significance of the replication, 2) whether the original effect size lies within the confidence interval of the replication, 3) the relationship between the original and replication effect sizes, 4) the meta-analytic estimate of effect size combining both studies, and 5) a subjective assessment of replication success by the replicating team. Mostly these indicators hung together, though there were numerical differences among them.
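To make the first, second, and fourth of these indicators concrete, here is a minimal sketch in Python. The correlation effect sizes, sample sizes, and the Fisher-z / fixed-effect machinery are illustrative assumptions on my part, not the RP:P analysis code.

```python
# A minimal sketch (made-up numbers, not RP:P data) of indicators 1, 2, and 4
# for a pair of correlation effect sizes.
import numpy as np
from scipy import stats

def replication_indicators(r_orig, n_orig, r_rep, n_rep, alpha=0.05):
    # Fisher z-transform the correlations; the SE of z is 1 / sqrt(n - 3).
    z_orig, z_rep = np.arctanh(r_orig), np.arctanh(r_rep)
    se_orig, se_rep = 1 / np.sqrt(n_orig - 3), 1 / np.sqrt(n_rep - 3)

    # 1) Statistical significance of the replication on its own.
    p_rep = 2 * stats.norm.sf(abs(z_rep) / se_rep)

    # 2) Does the original effect size fall in the replication's 95% CI?
    crit = stats.norm.ppf(1 - alpha / 2)
    ci_lo, ci_hi = np.tanh([z_rep - crit * se_rep, z_rep + crit * se_rep])
    orig_in_rep_ci = ci_lo <= r_orig <= ci_hi

    # 4) Fixed-effect meta-analytic estimate pooling both studies.
    w_orig, w_rep = 1 / se_orig**2, 1 / se_rep**2
    r_meta = np.tanh((w_orig * z_orig + w_rep * z_rep) / (w_orig + w_rep))

    return {"p_rep": p_rep, "orig_in_rep_ci": orig_in_rep_ci, "r_meta": r_meta}

# Hypothetical example: a medium original effect, a smaller replication effect.
print(replication_indicators(r_orig=0.40, n_orig=30, r_rep=0.15, n_rep=120))
```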
Several of these criteria are flawed from a technical perspective. As Uri Simonsohn points out in his "Small Telescopes" paper, as the power of the replication study goes to infinity, the replication will always be statistically significant, even if it's detecting a very small effect that's quite different from the original. And similarly, as N in the original study shrinks (if it's very underpowered), it gets harder and harder to differentiate its effect size from any other, because of its wide confidence interval. So both the statistical significance of the replication and the comparison of effect sizes have notable flaws.*
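A quick simulation makes both failure modes concrete. This is a hedged sketch with arbitrary parameters, not anything from the Small Telescopes paper itself.

```python
# Rough simulation of the two failure modes described above (illustration only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Failure mode 1: a huge replication detects a tiny true effect (d = 0.05)
# as "significant," even though it is nothing like a large original effect.
tiny_true_effect = 0.05
for n in (100, 10_000, 1_000_000):
    sample = rng.normal(tiny_true_effect, 1, n)
    result = stats.ttest_1samp(sample, 0)
    print(f"replication n = {n:>9}: p = {result.pvalue:.2g}")

# Failure mode 2: a tiny original sample yields a confidence interval so wide
# that almost any replication effect size is "consistent" with it.
n_orig, d_orig = 12, 0.8
se_d = np.sqrt(1 / n_orig + d_orig**2 / (2 * n_orig))  # approximate SE of d
print(f"original (n = {n_orig}) 95% CI for d: "
      f"[{d_orig - 1.96 * se_d:.2f}, {d_orig + 1.96 * se_d:.2f}]")
```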
In addition, all this trouble is just for a single effect. In fact, one weakness of RP:P was that researchers were forced to choose just a single effect size as the key analysis in the original study. If you start looking at an experiment that has multiple important analyses, the situation gets way worse. Consider a simple 2x2 factorial design: even if the key test identified by a replicator is the interaction, if the replication study fails to see a main effect, or sees a new, unpredicted main effect, those findings might lead someone to say that the replication result was different from the original. And in practice it's even more complicated than that, because sometimes it's not straightforward to figure out whether it was the main effect or the interaction the authors cared about (or maybe it was both). Students in my class routinely struggle to find the key effect that they should focus on in their replication projects.
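Here's a small simulated illustration of the 2x2 situation, assuming hypothetical cell means and using statsmodels for the ANOVA. The point is just that the predicted interaction can "replicate" while an unpredicted main effect shows up alongside it.

```python
# Minimal sketch of the 2x2 problem: simulated data in which the predicted
# crossover interaction should come out reliably at this sample size, but so
# should a main effect of B that nobody's theory called for. Cell means and
# sample sizes are made up; A's marginal means are equal by construction.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(7)
n_per_cell = 50
cell_means = {("a1", "b1"): 0.0, ("a1", "b2"): 1.2,   # crossover interaction...
              ("a2", "b1"): 0.7, ("a2", "b2"): 0.5}   # ...plus a B main effect

df = pd.DataFrame([
    {"A": a, "B": b, "y": rng.normal(cell_means[(a, b)], 1)}
    for (a, b) in cell_means for _ in range(n_per_cell)
])

model = ols("y ~ C(A) * C(B)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # main effects and interaction side by side
```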
Recently we had a case that was yet more confounding than this. We did a direct replication of an influential paper and found that we were able to reproduce every single one of the statistical tests. The only issue was that we also found another significant result where the authors' theory would predict a null effect or even an effect in the opposite direction. (We were studying theory of mind reasoning, and we found that participants' responses were slower not only when the state of the world was incongruent with their own and others' beliefs, but also when it was congruent with that belief information). In this case, it was only the theoretical interpretation that allowed us to argue that our "successful replication" was in fact inconsistent with the authors' theory.
I think this case illustrates a broader generalization, namely that statistical methods for assessing replication success need to be considered secondary to the theoretical interpretation of the result. Instead, I propose that:
The primary measure of whether a replication result is congruent with the original finding is whether it provides support for the theoretical interpretation given to the original. And as in my discussion of publication bias, I think the key test is adversarial. In other words, a replication is unsuccessful if a knowledgeable and adversarial reviewer could reasonably argue that the new data fail to support the interpretation.
This argument apparently puts me in an odd position, because it seems like I'm advocating for giving up on an important family of quantitative approaches to reproducibility. In particular, the effect-size estimation approach to reproducibility emerges from the tradition of statistical meta-analysis. And meta-analysis is just about as good as it gets in terms of aggregating data across multiple studies right now. So is this an argument for vagueness?
No. The key point that emerges from this set of ideas is instead that the precision of the original theoretical specification is what governs whether a replication is successful or not. If the original theory is vague, it's simply hard to tell whether what you saw gives support to it, and all the statistics in the world won't really help. (This is of course the problem in all the discussion of context-sensitivity in replication). In contrast, if the original theory is precisely specified, it's very easy to assess support.
In other words, rather than arguing for a vaguer definition of replication success, what I'm instead arguing for is more precise theories. Replication is only well-defined if we know what we're looking for. The tools of meta-analysis provide a theory-neutral fix for the class of single-effect statistics (think of the effect size for an RCT). But once we get beyond the light shed by that small lamp post, theory is going to be the only way we find our keys.
---
* Simonsohn proposes another – perhaps more promising – criterion for distinguishing effect sizes that I won't go into here because it's limited to the single-effect domain.
Related to your point, I think that the biggest problem is that before we run a study we don't have any quantitative predictions about the effect of interest, not even ballpark guesses. We only have statements about the sign of the effect. So if my computational model of process Y predicts an 8 ms effect, with a 95% credible interval of 0-16 ms, I have some chance of quantitatively assessing that through replication. I have never been in that fortunate situation. This leads to the assessment approach you suggest, using expert judgment. One problem I see with that approach is that experts, at least in my field, have no sense of what a plausible vs. implausible effect is, because they have encoded their knowledge in terms of significant or not significant. So we are back to the problem of not thinking about magnitudes of effects.
Thanks, Shravan. I agree with you completely about the necessity for quantitative predictions - and I know both of us have worked to try to make those sorts of predictions (with only some measure of success, usually).
I also agree that the expert judgment criterion is not ideal. But I don't think you have to worry about study *plausibility* per se in this case, because the challenge is just judging whether two sets of results support the same theory. I think it's actually much harder for most people to judge the evidential value of the original study (and you'll get much less consensus). Most replication cases are easier, because you have a clear comparison you can make.
"Most replication cases are easier, because you have a clear comparison you can make."
I guess what I'm trying to say is that this comparison is not easy to make. If theory A predicts that theta > 0 and theory B predicts that theta = 0, and we find the estimate of theta to be positive in both studies but the credible interval crosses zero in one and not the other, you could find theory A proponents claiming a replication success and theory B proponents saying that the result is equivocal across the two studies. I suppose the problem is that the way people are taught to talk about a result is in binary terms: there is an effect or there isn't one. This, coupled with a lack of quantitative claims (surely the theory B proponents claiming theta = 0 don't think that theta is exactly 0, but rather that it is 0 +/- x), leads to ambiguity about whether a replication was a success or not.
I am mostly concerned with reading or EEG studies, where a continuous dependent measure is at issue; maybe the situation is different in your field.
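As a concrete illustration of the scenario Shravan describes, here is a minimal numerical sketch with entirely made-up values: two studies can both estimate a positive effect while the 95% interval excludes zero in only one of them.

```python
# Made-up numbers illustrating the scenario above: both studies estimate a
# positive effect, but the 95% interval excludes zero in only one of them.
from scipy import stats

def interval(effect_ms, se_ms, level=0.95):
    crit = stats.norm.ppf(0.5 + level / 2)
    return effect_ms - crit * se_ms, effect_ms + crit * se_ms

original = interval(effect_ms=12, se_ms=4)    # roughly 4 to 20 ms: excludes zero
replication = interval(effect_ms=7, se_ms=5)  # roughly -3 to 17 ms: crosses zero

print(f"original:    {original[0]:.1f} to {original[1]:.1f} ms")
print(f"replication: {replication[0]:.1f} to {replication[1]:.1f} ms")
# Theory A (theta > 0) sees two positive estimates and calls it a success;
# theory B (theta = 0) sees one "significant" and one "non-significant" result.
# The binary framing, not the data, generates the disagreement.
```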