Tuesday, March 24, 2015

Estimating p(replication) in a practical setting

tl;dr - an estimate of the proportion of recent psychology findings that can be reproduced by an early-stage graduate student and some thoughts about the consequences of that estimate

I just finished reading the final projects for Psych 254, my graduate lab course. The centerpiece of the course is that each student chooses a published experiment and conducts a replication online. My view is that replication projects are a great way to learn the nuts and bolts of doing research. Rebecca Saxe (who developed this model) and I wrote a paper about this idea a couple of years ago.

The goal of the course is to teach skills related to experimental data collection and analysis. Nevertheless, as a result of this exercise, we get real live data on (mostly) well-conducted projects by smart, hard-working, and motivated students. Some of these projects have been contributed to the Open Science Framework's Reproducibility Project. One has even been published, albeit with lots of additional work, as a stand-alone contribution (Phillips, Ong, et al., in press). An additional benefit of this framework – and perhaps a point of contrast with other replication efforts – is that students choose projects that they want to build on in their own work.

I've been keeping a running tally of replications and non-replications from the course (see e.g., this previous blog post). I revisited these results and simplified my coding scheme a bit in order to consolidate them with this year's data. This post is my report of those findings. I'll first describe the methods for this informal study, then the results and some broader musings.

Before I get into the details, let me state up front that I do not think that these findings constitute an estimate of the reproducibility of psychological science (that's the goal of the Reproducibility Project, which has a random sample and perhaps a bit more external scrutiny of replication attempts). But I do think they provide an estimate of how likely it is that a sophisticated and motivated graduate student can reproduce a project within the scope of a course (and using online techniques). And that estimate is a very useful number as well – it tells us how much of our literature is possible for trainees to reproduce and to build on.


Methods

The initial sample was 40 studies total (N=19 and N=21, for 2013 and 2015, respectively). All target articles were published in the last 15 years, typically in major journals, e.g., Cognition, JPSP, Psych Sci, PNAS, Science. Nearly all the replicators were psychology graduate students at Stanford, though there were two master's students and two undergrads; this variable did not moderate replication success. All of the studies were run on Amazon Mechanical Turk using either Qualtrics or custom JavaScript. All confirmatory analyses were pre-reviewed by me and the TAs via the template used by the Reproducibility Project.

Studies varied in their power, but all were powered to at least the sample size of the original paper, and most were powered either to roughly 80% power according to post-hoc power analysis, or to 2.5x the original sample (see below for more discussion). The TAs and I pilot tested everyone's paradigms, and we would have excluded any study whose failure we knew to be due to experimenter error (e.g., bad programming); we caught no such mistakes after piloting, though some likely exist. We did exclude two studies for obvious failures to achieve the planned power (e.g., one study that attempted to compare Asian Americans and Caucasian Americans was only able to recruit 6 Asian Americans); the final sample after these exclusions was 38 studies.

In consultation with the TAs, I coded replication success on a 0–1 scale. Zero meant no evidence of the key effect from the original paper; one meant a solid, significant finding in the same direction and of approximately the same magnitude (see below). In between were many variously interpretable patterns of data, including .75 (roughly, reproducing most of the key statistical tests) and .25 (some hints of the same pattern, albeit much attenuated or substantially different).

Because of small numbers, I ended up splitting the studies into Cognitive (N=20) and Social (N=18) subfields. I initially tried a finer categorization, but the numbers for most subfields were simply too small. So in practice, the Social subfield included some cross-cultural work, some affective work, lots of priming (more on this later), and a grab-bag of other effects in fields like social perception. The Cognitive work included perception, cognition, language, and some decision-making. When papers were about social cognition (as several were), I judged on sociological grounds: if a target article was in JPSP, it went in the Social pile; if it was by people who studied cognitive development, it went in the Cognition pile.


Results

The mean replication code was .57 across all studies. Here is the general histogram of replication success codes across all data:

Overall, the modal outcome was a full replication, but the next most common was a complete failure. For more information, you can see the breakdown across subfields and years:

There was a slight increase in replication success from 2013 to 2015 that was not significant in an unpaired t-test (t(36) = -1.08, p = .29). In contrast, the difference between Social and Cognitive findings was significant (t(36) = -2.32, p = .03).

What happened with the social psych findings? The sample is too sparse to tell, really, but I can speculate. One major trend is that we tried 6 findings that I would classify as "social priming" – many involving a word-unscrambling task. The mean reproducibility rating for these studies was .21, with none receiving a code higher than .5. I don't know what to make of that, but minimally I wouldn't recommend doing social priming on Mechanical Turk. In addition, given my general belief in the relative similarity of Turk and other populations (motivation 1, motivation 2), I personally would be hesitant to take on one of these paradigms.

One other trend stood out to me. As someone trained in psycholinguistics, I am used to creating a set of experimental "items" – e.g., different sentences that all contain the phenomenon you're interested in. Clark (1973, the "Language-As-Fixed-Effect Fallacy") makes a strong argument that this kind of design – along with the appropriate statistical testing – is critical for ensuring the validity of our experiments. The same issue has been noted for social psychology (Wells & Windschitl, 1999).

Despite these arguments, seven of our studies had exactly one stimulus item per condition. Usually, a study of this type tests the idea that being exposed to some sort of idea or evidence leads to a difference in attitude, and the manipulation is that participants read a passage (different in each condition) that invokes this idea. Needless to say, there is a validity problem here; but in a post-hoc analysis, we also found that such studies were less likely to be reproducible. Only one study in this "single stimulus" group was coded above .25 for replication (t(9.74) = -2.60, p = .03, not assuming equal variances). This result is speculative given the small numbers and its post-hoc nature, but it's still striking. Perhaps the lack of reliability comes from different populations responding differently to those individual stimulus items.
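For readers curious about the mechanics, here is a minimal stdlib-only sketch of the unequal-variances (Welch) t-test used in the comparison above. The replication codes in the example are hypothetical, not the actual course data, and the numerical tail integration is a stand-in for a library routine like scipy.stats.t.sf:

```python
import math
from statistics import mean, stdev

def welch_t(xs, ys):
    """Welch's unequal-variances t statistic and Welch-Satterthwaite df
    for two independent samples."""
    n1, n2 = len(xs), len(ys)
    v1, v2 = stdev(xs) ** 2 / n1, stdev(ys) ** 2 / n2
    t = (mean(xs) - mean(ys)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def t_sf(t, df, steps=200000, upper=60.0):
    """One-tailed P(T > t): trapezoid-rule integration of the t density
    from t out to an effectively-infinite upper bound."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1.0 + x * x / df) ** (-(df + 1) / 2)
    h = (upper - t) / steps
    total = 0.5 * (pdf(t) + pdf(upper))
    total += sum(pdf(t + i * h) for i in range(1, steps))
    return total * h

# Hypothetical 0-1 replication codes (NOT the actual course data):
single_stim = [0.0, 0.0, 0.25, 0.0, 0.5, 0.25, 0.0]
multi_stim = [1.0, 0.75, 0.25, 1.0, 0.5, 1.0, 0.75, 0.0, 1.0, 0.75]
t, df = welch_t(single_stim, multi_stim)
p_two_sided = 2 * t_sf(abs(t), df)
```

Note that the Welch df (9.74 in the reported test) falls below n1 + n2 - 2 whenever the group variances differ, which is why the reported degrees of freedom are fractional.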

The other findings that we failed to reproduce were a real grab-bag. They include both complicated reaction-time paradigms and very simple surveys. Sometimes we had ideas about issues with the paradigms (perhaps things that the original authors had solved with clear instructions or unreported aspects of the paradigm). Sometimes we were completely mystified. It would be very interesting to find out which of these we would eventually be able to understand; but that's a tremendous amount of work – we did it in exactly one case and it took almost a dozen studies.

Broader Musings

This course makes all of us think hard about some knotty issues in replication. First, how do you decide how many participants to include in a replication in order to ensure sufficient statistical power? One solution is to assume that the target paper has a good estimate of the target effect size, then use that effect size to do a power analysis. The problem (as many folks have pointed out) is that post-hoc power is problematic. In addition, with complex ANOVAs or linear models, we almost never have the data to perform power analyses correctly.

Nevertheless, we don't have much else in the way of tools, with one exception. Based on some clever statistical analysis, Uri Simonsohn's "small telescopes" piece advocates simply running 2.5x the original sample. This is a nice idea and generally seems conservative. But when the initial study already appears overpowered, this approach is almost certainly overkill – and it's very rough on a limited course budget. In practice, this year we did a mix of post-hoc power, 2.5x, and budget-limited samples. There were few cases, however, where we worried about the power of the replications: the paradigms that didn't replicate tended to be the short ones that we were able to power most effectively given our budget. Even so, deciding on sample sizes is typically one of the trickiest parts of the project-planning process.
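As a rough illustration of the two sizing strategies above, here is a stdlib-only sketch. The first function uses the standard normal approximation for a two-sample t-test (it slightly underestimates the exact t-based answer at small n); the second is just Simonsohn's 2.5x rule of thumb. The example numbers are illustrative, not from any course study:

```python
import math
from statistics import NormalDist

def n_for_power(d, alpha=0.05, power=0.80):
    """Per-group n to detect effect size d (Cohen's d) in a two-sample t-test,
    via the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

def small_telescopes_n(n_original, multiplier=2.5):
    """Simonsohn's "small telescopes" rule of thumb: 2.5x the original sample."""
    return math.ceil(multiplier * n_original)

# A "medium" effect (d = 0.5) needs ~63 participants per group at 80% power,
# while the 2.5x rule applied to an original cell of 40 asks for 100 per group -
# which is why the rule can be rough on a course budget.
medium_n = n_for_power(0.5)
telescope_n = small_telescopes_n(40)
```

The two answers diverge most when the original study was small and the true effect large, which is exactly the case where the 2.5x rule feels like overkill.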

A second tricky issue is deciding when a particular effect replicated. Following the Reproducibility Project format, students in the course replicate key statistical tests, and so if all of these are reliable and of roughly the same magnitude, it's easy to say that the replication was positive. But there are nevertheless many edge cases where the pattern of magnitudes is theoretically different, or similar in size but nevertheless nonsignificant. The logic of "small telescopes" is very compelling for simple t-tests, but it's often quite hard to extend to the complex, theoretically-motivated interactions that we sometimes see in sophisticated studies – for example, we sometimes don't even know the effect size! As a result, I can't guarantee that the replication codes we used above are what the original author would assign – perhaps they'd say "oh yes, that pattern of findings is consistent with the spirit of the original piece" even if the results were quite different. But this kind of uncertainty is always going to be an issue – there's really no way to judge a replication except with respect to the theory the original finding was meant to support.


I love teaching this course. It's wonderful to see students put so much energy into so many exciting, new projects. Students don't choose replications to do "take downs" or to "bully or intimidate."* They choose projects they want to learn about and build on in their own research. As a consequence, it's very sad when they fail to reproduce these findings! It's sad because it creates major uncertainty around work that they like and admire. And it's also sad because this is essentially lost effort for them; they can't build on the work they did in creating a framework for running their study and analyzing the data. In contrast, their classmates – whose paradigms "worked" – are able to profit directly from the experience by using their new data and experimental tools to create cool new extensions.

When I read debates about whether replications are valuable, whether they should be published, whether they are fair or unfair, ad nauseam, I'm frustrated by the lack of consideration for this most fundamental use case for replication. Replication is most typically about the adoption of a paradigm for future research. If our scientific work can't be built on by the people doing the science – trainees – then we've really lost track of our basic purpose.

Major thanks to Long Ouyang and Desmond Ong, the TAs for the course!

* Added scare quotes here 3/25 to indicate that I don't think anyone really does experimental psychology to bully or intimidate, even if it sometimes feels that way! 
