Monday, June 2, 2014

Shifting our cultural understanding of replication

tl;dr - I agree that replication is important – very important! But the way to encourage it as a practice is to change values, not to shame or bully.

Introduction. 

The publication of the new special issue on replications in Social Psychology has prompted a huge volume of internet discussion. For one example, see the blogpost by Simone Schnall, one of the original authors of a paper in the replication issue – much of the drama has played out in the comment thread (and now she has posted a second response). My favorite recent additions to this conversation have focused on how to move the field forward. For example, Betsy Levy Paluck has written a very nice piece on the issue that also happens to recapitulate several points about data sharing that I strongly believe in. Jim Coan also has a good post on the dangers of negativity.

All of this discussion has made me consider two questions: First, what is the appropriate attitude towards replication? And second, can we make systematic cultural changes in psychology to encourage replication without the kind of negative feelings that have accompanied the recent discussion? Here are my thoughts:
  1. The methodological points made by proponents of replication are correct. Something is broken.
  2. Replication is a major part of the answer; calls for direct replication may even understate its importance if we focus on cumulative, model-driven science.
  3. Replication must not depend on personal communications with authors. 
  4. Scolding, shaming, and "bullying" will not create the cultural shifts we want. Instead, I favor technical and social solutions. 
I'll expand on each of these below.

1. Something is broken in psychology right now.

The Social Psychology special issue and the Reproducibility Project (which I'm somewhat involved in) both suggest that there may be systematic issues in our methodological, statistical, and reporting standards. My own experience confirms this generalization. I teach a course based on students conducting replications. A paper describing this approach is here, and my syllabus – along with lots of other replication education materials – is here.

In class last year, we conducted a series of replications of an ad-hoc set of findings that the students and I were interested in. Our reproducibility rate was shockingly low. I coded our findings on a scale from 0 to 1, with 1 denoting full replication (a reliable significance test on the main hypothesis of interest) and .5 denoting partial replication (a trend towards significance, or a replication of the general pattern but without a predicted interaction or with an unexpected moderator). We reproduced 8.5 / 19 results (~45%), with a somewhat higher probability of replication for more "cognitive" effects (~75%, N=6) and a somewhat lower probability for more "social" effects (~30%, N=11). Alongside the obvious possible explanation for our failures – that some of the findings we tried to reproduce were spurious to begin with – there are many other plausible explanations: we conducted our replications on the web, there may have been unknown moderators, we may have made methodological errors, and we could have been underpowered (though we aimed for 80% power relative to the reported effects).*
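For concreteness, here's a minimal sketch of how that 0 / .5 / 1 coding scheme turns into rates like the ones reported above. The individual scores below are hypothetical placeholders rather than the actual class data; only the scoring logic is the point.

```python
# 1.0 = full replication, 0.5 = partial replication, 0.0 = failure to replicate.
# These scores are hypothetical placeholders, not the actual class results.
cognitive_scores = [1.0, 1.0, 0.5, 1.0, 0.5, 0.5]                         # hypothetical
social_scores = [0.0, 0.5, 0.0, 1.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0]   # hypothetical

def replication_rate(scores):
    """Mean coded outcome: a rough estimate of the probability of replication."""
    return sum(scores) / len(scores)

all_scores = cognitive_scores + social_scores
print(f"overall:   {sum(all_scores):.1f} / {len(all_scores)} = {replication_rate(all_scores):.0%}")
print(f"cognitive: {replication_rate(cognitive_scores):.0%} (N = {len(cognitive_scores)})")
print(f"social:    {replication_rate(social_scores):.0%} (N = {len(social_scores)})")
```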

To state the obvious: Our numbers don't provide an estimate of the probability that these findings were incorrect. But they do estimate the probability that a tenacious graduate student could effectively reproduce a finding for a project, given the published record and an email – sometimes unanswered – to the original authors. Although we used Mechanical Turk as a platform out of convenience, preliminary results from the Reproducibility Project suggest that my course's estimate of reproducibility isn't that far off base.

When "prep" – the probability of replication – is so low (the true prep, not the other one), we need to fix something. If we don't, we run a large risk that students who want to build on previous work will end up wasting tremendous amounts of time and resources trying to reproduce findings that – even if they are real – are nevertheless so heavily moderated or so finicky that they will not form a solid basis for new work.

2. Replication is more important than even the replicators emphasize.  

Much of the so-called "replication debate" has been about the whether, how, who, and when of doing direct replications of binary hypothesis tests – the tests from paradigms in papers that support particular claims (e.g. that cleanliness is related to moral judgment, or that flags prime conservative attitudes). This null-hypothesis significance testing (NHST) approach – even combined with meta-analytic effect-size estimation, as in the Many Labs project – understates the importance of replication. That's because these effects typically aren't used as measurements supporting a quantitative theory.

Our goal as scientists (psychological, cognitive, or otherwise) should be to construct theories that make concrete, quantitative predictions. While verbal theories are useful up to a point, formal theories are a more reliable method for creating clear predictions; these formal theories are often – but don't have to be – instantiated in computational models. Some more discussion of this viewpoint, which I call "models as formal theories," here and here. If our theories are going to make quantitative predictions about the relationship between measurements, we need to be able to validate and calibrate our measurements. This validation and calibration is where replication is critical.

Validation. In the discussion to date (largely surrounding controversial findings in social psychology), it has been assumed that we should replicate simply to test the reliability of previous findings. But that's not why students in every methods class perform the Stroop task. They are not checking to see that it still works; they are testing their own reliability – validating their measurements.

Similarly, when I first set up my eye-tracker, I set out to replicate the developmental speedup in word processing shown by Anne Fernald and her colleagues (reviewed here). I didn't register this replication, and I didn't run it by her in advance. I wasn't trying to prove her wrong; as with students doing Stroop as a class exercise, I was trying to validate my equipment and methods. I believed so strongly in Fernald's finding that I figured that if I failed to replicate it, I was doing something wrong in my own methods. Replication isn't always adversarial. This kind of bread-and-butter replication is – or should be – much more common.

Calibration. If we want to make quantitative predictions about the performance of a new group of participants in tasks derived from previous work, we need to calibrate our measurements against those of other scientists. Consistent and reliable effects may nevertheless be scaled differently due to differences in participant populations. For example, studies of reading time among college undergraduates at selective institutions may end up finding overall faster reading than studies conducted with a sample from a broader educational background.

As one line of my work, I've studied artificial language learning in adults as a case study of language learning mechanisms that could have implications for the study of language learning in children. I've tried to provide computational models of these sorts of learning phenomena (e.g. here, here, and here). Fitting these models to data has been a big challenge because typical experiments have only a few datapoints – and given the overall scaling differences in learning described above, a model needs 1–2 extra parameters (minimally an intercept, but possibly also a slope) to integrate across experiment sets from different labs and populations.

As a result, I ended up doing a lot of replication studies of artificial language experiments so that I could vary parameters of interest and get quantitatively-comparable measures. I believed all of the original findings would replicate – and indeed they did, often precisely as specified. If you are at all curious about this literature, I replicated (all with adults): Saffran et al. (1996a; 1996b); Aslin et al. (1998); Marcus et al. (1999); Gomez (2002); Endress, Scholl, & Mehler (2005); and Yu & Smith (2007). All of these were highly robust. In addition, in all cases where there were previous adult data, I found differences in the absolute level of learning from the prior report (as you might expect, considering I was comparing participants on Mechanical Turk or at MIT with whatever population the original researchers had used). I wasn't surprised or worried by these differences. Instead, I just wanted to get calibrated – find out what the baseline measurements were for my particular participant population.
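To make the calibration idea concrete, here is a toy sketch of fitting a single learning model jointly to data from two populations, with a shared slope and per-population intercepts to absorb overall differences in level of learning. The logistic form and all of the numbers are illustrative assumptions, not the actual models or data from the papers above.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical mean accuracies at three exposure levels for two populations.
exposure = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
accuracy = np.array([0.62, 0.71, 0.78, 0.55, 0.63, 0.70])
population = np.array([0, 0, 0, 1, 1, 1])   # 0 = lab sample, 1 = web sample

def predict(params, exposure, population):
    # Shared slope, separate intercept for each population.
    slope, intercept_lab, intercept_web = params
    intercepts = np.where(population == 0, intercept_lab, intercept_web)
    return 1.0 / (1.0 + np.exp(-(intercepts + slope * exposure)))

def loss(params):
    # Least-squares fit of the model to both populations at once.
    return np.sum((accuracy - predict(params, exposure, population)) ** 2)

fit = minimize(loss, x0=[0.5, 0.0, 0.0])
slope, intercept_lab, intercept_web = fit.x
print(f"shared slope: {slope:.2f}")
print(f"intercepts:   lab = {intercept_lab:.2f}, web = {intercept_web:.2f}")
```

The fitted intercepts play the role of the calibration described above – they locate each population's baseline – while the shared slope carries the theoretical claim.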

In other words, even – or maybe even especially – when you assume that the binary claims of a paper are correct, replication plays a role by helping to validate empirical measurements and calibrate those measurements against prior data.

3. Replications can't depend on contacting the original authors for details. 

As Andrew Wilson argues in his nice post on the topic, we need to have the kind of standards that allow reproducibility – as best we can – without direct contact with the original authors. Of course, no researcher will always know exactly which factors matter to their finding, especially in complex social scenarios. But how are we ever supposed to get anything done if we can't just read the scientific literature, come up with new hypotheses, and test them? Should we have to contact the authors of every finding we're interested in, to find out whether they knew about important environmental moderators that they didn't report? In a world where replication is commonplace and unexceptional – where it is the typical starting point for new work rather than an act of unprovoked aggression – the extra load caused by these constant requests would be overwhelming, especially for authors with popular paradigms.

There's a different solution. Authors could make all of their materials (at least the digital ones) directly and publicly accessible as part of publication. Psycholinguists have been attaching their experimental stimulus items as appendices to their papers for years – there's no reason not to do this more widely. For most studies, posting code and materials will be enough. In fact, for most of my studies – which are now all run online – we can just link to the actual HTML/javascript paradigm so that interested parties can try it out. If researchers believe that their effects depend on very specific environmental factors, then they can go the extra mile and take photos or videos of the experimental setup. Sharing materials and data (whether using the Open Science Framework, github, or other tools) is free and costs almost nothing in time. Used properly, these tools can even improve the reliability of our own work, along with its reproducibility by others.

I don't mean to suggest that people shouldn't contact original authors, merely that they shouldn't be obliged to. Original authors are – by definition – experts in a phenomenon, and can be very helpful in revealing the importance of particular factors, providing links to followup literature both published and unpublished, and generally putting work in context. But a requirement to contact authors prior to performing a replication emphasizes two negative values: the possibility for perceived aggressiveness in the act of replication, and the incompleteness of methods reporting. I strongly advocate for the reverse default. We should be humbled and flattered when others build on our work by assuming that it is a strong foundation, and they should assume our report is complete and correct. Neither of these assumptions will always be true, but good defaults breed good cultures.

4. The way to shift to a culture of replication is not by shaming the authors of papers that don't replicate. 

No one likes it when people are mean to one another. There's been considerable discussion of tone on the SPSP blog and on twitter, and I think this is largely to the good. It's important to be professional in our discussions, or else we alienate many within the field and hurt our reputation outside it. But there's a larger reason why shaming and bullying shouldn't be our MO: they won't bring about the cultural changes we need. For that we need two ingredients: first, technical tools that decrease the barriers to replication; and second, role models who do cool research that moves the field forward by focusing on solid measurement and quantitative detail, not flashy demonstrations.

Technical tools. One of the things I have liked about the – otherwise somewhat acrimonious – discussion of Schnall et al.'s work is the use of the web to post data, discuss alternative theories, and iterate in real time on an important issue (three links here, here, and here, with a meta-analysis here). If nothing else comes of this debate, I hope it convinces its participants that posting data for reanalysis is a good thing.

More generally, my feeling is that there is a technical (and occasionally generational) gap at work in some of this discussion. On the data side, there is a sense that if all we do are t-tests on two sets of measurements from 12 people, then it's no big deal and no one needs to see the analysis. But datasets and analyses are getting larger and more sophisticated. People who code a lot accept that everyone makes errors. To fight error, we need open review of code and analysis. We also need reuse of code across projects. If we publish models or novel analyses, we need to give people the tools to reproduce them. We need sharing and collaborating, open-source style – enabled by tools like github and OSF. Accepting these ideas about data and analyses means that replication on the data side should be trivial: a matter of downloading and rerunning a script.
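As an illustration of what "downloading and rerunning a script" can look like, here is a minimal sketch of a self-contained reanalysis script. The URL, file name, and column names are hypothetical placeholders for whatever a lab actually posts.

```python
# A sketch of the "download and rerun" ideal: one script that fetches a posted
# dataset and reproduces the key analysis. The URL and column names below are
# hypothetical placeholders, not a real repository.
import csv
import urllib.request
from statistics import mean

DATA_URL = "https://example.org/some-lab/experiment1/data.csv"  # placeholder
LOCAL_FILE = "experiment1_data.csv"

# Step 1: download the publicly posted data file.
urllib.request.urlretrieve(DATA_URL, LOCAL_FILE)

# Step 2: rerun the (here, deliberately trivial) analysis.
with open(LOCAL_FILE, newline="") as f:
    rows = list(csv.DictReader(f))

by_condition = {}
for row in rows:
    by_condition.setdefault(row["condition"], []).append(float(row["score"]))

for condition, scores in sorted(by_condition.items()):
    print(f"{condition}: mean = {mean(scores):.2f} (n = {len(scores)})")
```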

On the experimental side, reproducibility should be facilitated by a combination of web-based experiments and code-sharing. There will always be costly and slow methods – think fMRI or habituating infants – but standard-issue social and cognitive psychology is relatively cheap and fast. With the advent of Mechanical Turk and other online study venues, an experiment is often just a web page, perhaps served by a tool like PsiTurk. And I think it should go without saying: if your experiment is a webpage, then I would like to see the webpage when I read your paper. That way, if I want to reproduce your findings, I should be able to make a good start by simply directing people – online or in the lab – to your webpage, measuring their responses, and rerunning your analysis code. Under this model, if I have $50 and a bit of curiosity about your findings, I can run a replication. No big deal.**

We aren't there yet. And perhaps we never will be for more involved social psychological interventions (though see PERTS, the Project for Education Research that Scales, for a great example of such interventions in a web context). But we are getting closer and closer. The more open we are with experimental materials, code, and data, the easier replication and reanalysis will be, the less we will have to imagine replication as a last-resort, adversarial move, and the more collecting new data will become part of a general ecosystem of scientific sharing and reuse.

Role models. These tools will only catch on if people think they are cool. For example, Betsy Levy Paluck's research on high-schoolers suggests something that we probably all know intuitively: we all want to be like the cool people, so the best way to change a culture is to have the cool kids endorse your value of choice. In other words, students and early-career psychologists will flock to new approaches if they see awesome science that's enabled by these methods. I think of this as a new kind of bling: instead of being wowed by the counterintuitiveness or unlikeliness of a study's conclusions, can we instead praise how thoroughly it nailed the question? Or the breadth and scope of its approach?

Conclusions. 

For what it's worth, some of the rush to publish high-profile tests of surprising hypotheses has to be due to pressures related to hiring and promotion in psychology. Here I'll again follow Betsy Levy Paluck and Brian Nosek in reiterating that, in the search committees I've sat on, the discussion turns over and over to how thorough, careful, and deep a candidate's work is – not how many publications they have. Our students have occasionally been shocked to see that a candidate with a huge, stellar CV doesn't get a job offer, and have asked, "What more does someone need to do in order to get hired?" My answer (and of course this is only my personal opinion, not the opinion of anyone else in the department): engage deeply with an interesting question and do work on that question that furthers the field by being precise, thorough, and clearly thought out. People who do this may pay a penalty in terms of CV length – but they are often the ones who get the job in the end.

I've argued here that something really is broken in psychology. It's not just that some of our findings don't (easily) replicate, it's also that we don't think of replication as core to the enterprise of making reliable and valid measurements to support quantitative theorizing. In order to move away from this problematic situation, we are going to need technical tools to support easier replication, reproduction of analyses, and sharing more generally. We will also need the role models to make it cool to follow these new scientific standards.


---
Thanks very much to Noah Goodman and Lera Boroditsky for quick and helpful feedback on a previous draft. (Minor typos fixed 6/2 afternoon).

* I recently posted about one of the projects from that course, an unsuccessful attempt to replicate Schnall, Benton, & Harvey (2008)'s cleanliness priming effect. As I wrote in that posting, there are many reasons why we might not have reproduced the original finding – including differences between online and lab administration. Simone Schnall wrote in her response that "If somebody had asked me whether it makes sense to induce cleanliness in an online study, I would have said 'no,' and they could have saved themselves some time and money." It's entirely possible that cleanliness priming specifically is hard to achieve online. That would surprise me given the large number of successes in online experimentation more broadly (including the IAT and many subtle cognitive phenomena, among other things – I also just posted data from a different social priming finding that does replicate nicely online). In addition, while the effectiveness of a prime should depend on the baseline availability of the concept being primed, I don't see why this extra noise would completely eliminate Schnall et al.'s effect in the two large online replications that have been conducted so far.

** There's been a lot of talk about why web-based experiments are "just as good" as in-lab experiments. In this sense, they're actually quite a bit better! The fact that a webpage is so easily shown to others around the world as an exemplar of your paradigm means that you can communicate and replicate much more easily.

Comments:

  1. Really nice post. The emphasis on making replication an everyday practice, as opposed to an antagonistic response, and the way modern web-based experiment tools can make this easier really resonated.

    I did have one minor statistical quibble about the many attempted replications and the comment that in class you attempt to reach 80% power based on the published effect sizes. Powering your replication this way doesn't take into account the fact that successfully published studies with small sample sizes are likely to have obtained effect sizes that are greater than the real size of the effect.

    That is, given that most effects are small (a premise I find preferable to "most effects are null," at least in cognitive and social psych), if I ran a study with low power to detect a small effect but obtained a significant result, it's likely that I got a bit lucky and that the real effect is smaller than the one I observed.

    If others power their replications at 80% based on my effect size, we'll see failures to replicate much more than 20% of the time. We'll also eventually get a decent estimate of the real effect size based on the effect sizes obtained across the many replications, but it may take a lot of them to get there.
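    A quick simulation sketch of this point (all of the numbers below are made up for illustration): with a small true effect and a low-powered original study, replications powered at 80% against the published effect size succeed far less than 80% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n_orig, n_sims = 0.2, 20, 2000   # small true effect, small original study
successes = 0

for _ in range(n_sims):
    # Original study: keep only runs that happened to come out significant,
    # mimicking publication of a "lucky" (inflated) effect size estimate.
    while True:
        orig = rng.normal(true_d, 1.0, n_orig)
        t, p = stats.ttest_1samp(orig, 0.0)
        if p < 0.05 and t > 0:
            break
    d_published = orig.mean() / orig.std(ddof=1)

    # Replication powered at 80% against the published (inflated) effect size,
    # using the standard normal-approximation sample-size formula.
    n_rep = int(np.ceil(((1.96 + 0.84) / d_published) ** 2))
    rep = rng.normal(true_d, 1.0, n_rep)
    t, p = stats.ttest_1samp(rep, 0.0)
    successes += (p < 0.05 and t > 0)

print(f"replication 'success' rate: {successes / n_sims:.0%}")   # well below 80%
```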

    It seems like it would be interesting and helpful to begin seeing more publication of not only effect sizes, but confidence or credible intervals around those effect sizes, *even for null results*. For results that *just* reached significance, we'll see a very wide range, with 0 just barely outside of the interval. In a lot of those cases, for the reasons described above, you can bet that the real effect size is in the lower part of that range.

    One thing I've found since moving into doing data analysis in industry, where others are likely to make decisions about what to do next based on the analysis, is that interval information is much more helpful and informative to people than p-values. It seems like that might be the case here as well.

    Replies
    1. Very good point, Daniel! I last ran the class (winter 2013) before I had learned as much about this issue. Since then, I have come to agree with you, in part because of this post:

      http://datacolada.org/2013/10/14/powering_replications/

      And yes, effect size CIs are a good approach that we should move towards.

  2. One other thing that I think would be very useful, and that I'd like to see much more often, is a culture that expects multiple replications of the same (or very similar) experiments within the same paper. Doing this is fairly easy with Mechanical Turk; indeed, as I develop an experimental method, I often find that I get several replications of only minimally-different paradigms without even trying hard. Yet when it comes to writing it up, I usually feel constrained to write up only one of them -- there is huge cultural pressure against making a paper too long or complicated. I think this is a mistake. Even if there were just an expectation that such replication studies would be attached as supplemental materials, rather than included in the paper itself, that would make a big difference. And if it were necessary to have actually achieved two or three replications of an effect before you could even publish it, I think there might be fewer spurious effects in the literature in the first place. As it is, it almost feels like you're punished if you try to include several self-replications in a single paper.

    It wouldn't solve all problems, of course, but it would be one additional thing that would help.

    Replies
    1. What are your thoughts on in-text reporting of replications? E.g. "we ran a minimally different version of this with smaller images prior to the current experiment and saw a qualitatively similar correlation (r = .63, p = .02)." These sorts of footnotes add to my confidence in a paper without forcing an APA-format write-up of an extra experiment that has essentially no salient differences or clear narrative justification. I've put them in a couple of papers without reviewers commenting or maybe even noticing, e.g. footnote 2 in

      http://langcog.stanford.edu/papers/FFLSG-cogpsych-inpress.pdf

      and footnote 5 in

      http://langcog.stanford.edu/papers/FG-underreview.pdf
