Tuesday, June 21, 2016

Reproducibility and experimental methods posts

In celebration of the third anniversary of this blog, I'm collecting some of my posts on reproducibility. I didn't initially anticipate that methods and the "reproducibility crisis" in psychology would be my primary blogging topic, but it's become a huge part of what I write about on a day-to-day basis.

Here are my top four posts in this sequence:

Then I've also written substantially about a number of other topics, including publication incentives and the file-drawer problem:

The blog has been very helpful for me in organizing and communicating my thoughts, as well as in collecting materials for teaching reproducible research. I'm hoping to continue thinking about these topics in the future, even as I move back to discussing more developmental and cognitive science topics.

Sunday, June 5, 2016

An adversarial test for replication success

(tl;dr: I argue that the only way to tell if a replication study was successful is by considering the theory that motivated the original.)

Psychology is in the middle of a sea change in its attitudes towards direct replication. Despite the value of direct replications in providing evidence for the reliability of a particular experimental finding, incentives to conduct them have typically been limited. Increasingly, however, journals and funding agencies value these sorts of efforts. One major challenge has been evaluating the success of direct replication studies. In short, how do we know if the finding is the same?

There has been limited consensus on this issue, so different projects have used a diversity of methods. The RP:P 100-study replication project reports several indicators of replication success, including 1) the statistical significance of the replication, 2) whether the original effect size lies within the confidence interval of the replication, 3) the relationship between the original and replication effect sizes, 4) the meta-analytic estimate of effect size combining both, and 5) a subjective assessment of replication success by the team. These indicators mostly hung together, though there were numerical differences among them.
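To make a few of these indicators concrete, here's a minimal sketch of how criteria 1, 2, and 4 could be computed for correlation effect sizes using the standard Fisher z-transform (the function names and the fixed-effect pooling choice are mine for illustration, not RP:P's actual analysis code):

```python
import math

def fisher_z(r):
    """Fisher z-transform of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def inv_fisher_z(z):
    """Inverse Fisher z-transform (back to a correlation)."""
    return math.tanh(z)

def replication_checks(r_orig, n_orig, r_rep, n_rep):
    crit = 1.96  # two-sided 95% critical value
    z_orig, z_rep = fisher_z(r_orig), fisher_z(r_rep)
    se_orig = 1 / math.sqrt(n_orig - 3)  # SE of Fisher z
    se_rep = 1 / math.sqrt(n_rep - 3)

    # (1) statistical significance of the replication (H0: rho = 0)
    rep_significant = abs(z_rep / se_rep) > crit

    # (2) does the original effect size fall in the replication's 95% CI?
    lo = inv_fisher_z(z_rep - crit * se_rep)
    hi = inv_fisher_z(z_rep + crit * se_rep)
    orig_in_rep_ci = lo <= r_orig <= hi

    # (4) fixed-effect meta-analytic estimate pooling both studies,
    # weighting each by its inverse variance in z-space
    w_orig, w_rep = 1 / se_orig**2, 1 / se_rep**2
    z_meta = (w_orig * z_orig + w_rep * z_rep) / (w_orig + w_rep)

    return rep_significant, orig_in_rep_ci, inv_fisher_z(z_meta)

# e.g., a replication of r = .50 (n = 30) that finds r = .20 (n = 200)
# is itself significant, yet excludes the original from its CI
sig, in_ci, r_meta = replication_checks(0.50, 30, 0.20, 200)
```

Note that the indicators can disagree: in the usage example, the replication "succeeds" by criterion 1 but "fails" by criterion 2, which is exactly why a single study can be scored differently depending on the criterion chosen.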

Several of these criteria are flawed from a technical perspective. As Uri Simonsohn points out in his "Small Telescopes" paper, as the power of the replication study goes to infinity, the replication will always be statistically significant, even if it detects a very small effect that's quite different from the original. Similarly, as N in the original study shrinks (if it's very underpowered), its wide confidence interval makes it harder and harder to differentiate its effect size from any other. So both the statistical significance of the replication and the comparison of effect sizes have notable flaws.*
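The first flaw is easy to see numerically: for a correlation tested via the Fisher transform, the test statistic grows with the square root of N, so any nonzero effect, however tiny, becomes significant with enough data. A quick sketch (my own illustration, not Simonsohn's actual analysis):

```python
import math

def z_stat(r, n):
    """Fisher-z test statistic for H0: rho = 0."""
    return 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)

# A tiny effect (r = .02) is nowhere near significant at n = 1,000 ...
small_n = z_stat(0.02, 1_000)      # ~0.63, well below 1.96
# ... but overwhelmingly significant at n = 1,000,000
big_n = z_stat(0.02, 1_000_000)    # ~20
```

So "the replication was significant" tells us only that the effect isn't exactly zero, not that it resembles the original finding.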