Tuesday, March 24, 2015

Estimating p(replication) in a practical setting

tl;dr - an estimate of the proportion of recent psychology findings that can be reproduced by an early-stage graduate student and some thoughts about the consequences of that estimate

I just finished reading the final projects for Psych 254, my graduate lab course. The centerpiece of the course is that each student chooses a published experiment and conducts a replication online. My view is that replication projects are a great way to learn the nuts and bolts of doing research. Rebecca Saxe (who developed this model) and I wrote a paper about this idea a couple of years ago.

The goal of the course is to teach skills related to experimental data collection and analysis. Nevertheless, as a result of this exercise, we get real live data on (mostly) well-conducted projects by smart, hard-working, and motivated students. Some of these projects have been contributed to the Open Science Framework's Reproducibility Project. One has even been published, albeit with lots of additional work, as a stand-alone contribution (Philips, Ong, et al., in press). An additional benefit of this framework – and perhaps a point of contrast with other replication efforts – is that students choose projects that they want to build on in their own work.

I've been keeping a running tally of replications and non-replications from the course (see e.g., this previous blog post). I revisited these results and simplified my coding scheme a bit in order to consolidate them with this year's data. This post is my report of those findings. I'll first describe the methods for this informal study, then the results and some broader musings.

Before I get into the details, let me state up front that I do not think that these findings constitute an estimate of the reproducibility of psychological science (that's the goal of the Reproducibility Project, which has a random sample and perhaps a bit more external scrutiny of replication attempts). But I do think they provide an estimate of how likely it is that a sophisticated and motivated graduate student can reproduce a project within the scope of a course (and using online techniques). And that estimate is a very useful number as well – it tells us how much of our literature is possible for trainees to reproduce and to build on.

Methods


The initial sample was 40 total studies (N=19 and N=21, for 2013 and 2015, respectively). All target articles were published in the last 15 years, typically in major journals, e.g., Cognition, JPSP, Psych Sci, PNAS, Science. Pretty much all the replicators were psychology graduate students at Stanford, though there were two master's students and two undergrads; this variable was not a moderator of success. All of the studies were run on Amazon Mechanical Turk using either Qualtrics or custom JavaScript. All confirmatory analyses were pre-reviewed by me and the TAs via the template used by the Reproducibility Project.

Studies varied in their power, but all were powered to at least the sample of the original paper, and most were powered either to around 80% power according to post-hoc power analysis, or to 2.5x the original sample. See below for more discussion on this. The TAs and I pilot tested people's paradigms, and we would have excluded any study whose failure we knew to be due to experimenter error (e.g., bad programming), but we didn't catch any mistakes after piloting, even though some likely exist. We excluded two studies for obvious failures to achieve the planned power (e.g., one study that attempted to compare Asian Americans and Caucasian Americans was only able to recruit 6 Asian Americans); the final sample after these exclusions was 38 studies.

In consultation with the TAs, I coded replication success on a 0 – 1 scale. Zero was interpreted as no evidence of the key effect from the original paper; one was a solid, significant finding in the same direction and of approximately the same magnitude (see below). In between were many variously interpretable patterns of data, including .75 (roughly, reproducing most of the key statistical tests) and .25 (some hints of the same pattern, albeit much attenuated or substantially different). 

Because of numbers, I ended up splitting the studies into Cognitive (N=20) and Social (N=18) subfields. I initially tried to use a finer categorization but the numbers for most sub-fields were simply too small. So in practice, the Social subfield included some cross-cultural work, some affective work, lots of priming (more on this later), and a grab-bag of other effects in fields like social perception. The Cognitive work included perception, cognition, language, and some decision-making. When papers were about social cognition (as several were), I judged on sociological factors. If a target article was in JPSP it went in the Social pile; if it was by people who studied cognitive development, it went in the Cognition pile. 

Results


The mean replication code was .57 across all studies. Here is the general histogram of replication success codes across all data:

Overall, the modal outcome was a full replication, but the next most common was a complete failure. For more information, you can see the breakdown across subfields and years:

There was a slight increase in replications from 2013 to 2015 that was not significant in an unpaired t-test (t(36) = -1.08, p = .29). In contrast, the difference between Social and Cognitive findings was significant (t(36) = -2.32, p = .03).

What happened with the social psych findings? The sample is too sparse to tell, really, but I can speculate. One major trend is that we tried 6 findings that I would classify as "social priming" – many involving a word-unscrambling task. The mean reproducibility rating for these studies was .21, with none receiving a code higher than .5. I don't know what to make of that, but minimally I wouldn't recommend doing social priming on Mechanical Turk. In addition, given my general belief in the relative similarity of Turk and other populations (motivation 1, motivation 2), I personally would be hesitant to take on one of these paradigms.

One other trend stood out to me. As someone trained in psycholinguistics, I am used to creating a set of experimental "items" – e.g., different sentences that all contain the phenomenon you're interested in. Clark (1973, the "Language-As-Fixed-Effect Fallacy") makes a strong argument that this kind of design – along with the appropriate statistical testing – is critical for ensuring the validity of our experiments. The same issue has been noted for social psychology (Wells & Windschitl, 1999).

Despite these arguments, seven of our studies had exactly one stimulus item per condition. Usually, a study of this type tests the idea that being exposed to some sort of idea or evidence leads to a difference in attitude, and the manipulation is that participants read a passage (different in each condition) that invokes this idea. Needless to say, there is a problem here with validity; but in a post-hoc analysis, we also found that such studies were less likely to be reproducible. Only one in this "single stimulus" group was coded above .25 for replication (t(9.74) = -2.60, p = .03, not assuming equal variances). This result is speculative due to small numbers and the fact that it's post-hoc; but it's still striking. Maybe we're seeing the lack of reliability coming as a result of different populations responding differently to those individual stimulus items.
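A fractional degrees-of-freedom value like the one in t(9.74) comes from Welch's version of the t-test, which does not assume equal variances. Here is a minimal Python sketch of that calculation. The replication codes in it are made up for illustration and are NOT the course data, and `welch_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    na, nb = len(a), len(b)
    # Squared standard error contributed by each group.
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Hypothetical replication codes on the 0-1 scale (illustration only).
single_stimulus = [0.0, 0.0, 0.25, 0.25, 0.0, 0.75, 0.25]
multiple_items = [1.0, 0.75, 0.5, 1.0, 0.0, 1.0, 0.25, 0.75, 1.0, 0.5]

t, df = welch_t(single_stimulus, multiple_items)
print(f"t({df:.2f}) = {t:.2f}")
```

The degrees of freedom shrink toward the size of the smaller group when the two groups differ in variance, which is why the reported value is well below the 36 used in the equal-variance tests above.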

The other findings that we failed to reproduce were a real grab-bag. They include both complicated reaction-time paradigms and very simple surveys. Sometimes we had ideas for issues with the paradigms (perhaps, things that the original authors had solved by clear instructions or un-reported aspects of the paradigm). Sometimes we were completely mystified. It would be very interesting to find out which of these we would be eventually able to understand; but that's a tremendous amount of work – we did it in exactly one case and it took almost a dozen studies.

Broader Musings

This course makes all of us think hard about some knotty issues in replication. First, how do you decide how many participants to include in a replication in order to ensure sufficient statistical power? One solution is to assume that the target paper has a good estimate of the target effect size, then use that effect size to do power analysis. The problem (as many folks have pointed out) is that post-hoc power is problematic. In addition, with complex ANOVAs or linear models, we almost never have the data to perform power analyses correctly.

Nevertheless, we don't have much else in the way of tools, with one exception. Based on some clever statistical analysis, Uri Simonsohn's "small telescopes" piece advocates simply running 2.5x the sample. This is a nice idea and generally seems conservative. But when the initial study already appears overpowered, this approach is almost certainly overkill – and it's very rough on the limited course budget. In practice, this year we did a mix of post-hoc power, 2.5x, and budget-limited samples. There were few cases, however, where we worried about the power of the replications: the paradigms that didn't replicate tended to be the short paradigms that we were able to power most effectively given our budget. Even so, deciding on sample sizes is typically one of the trickiest parts of the project planning process.
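To make the options concrete, here is a rough Python sketch of the two sample-size rules discussed above: power analysis from the reported effect size, and the 2.5x rule. It assumes a simple two-group comparison and uses a normal approximation, so it will come out slightly below what dedicated power software reports. The function names and the example study are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Approximate n per group for a two-sided, two-sample t-test.

    Normal approximation: slightly underestimates the exact t-based n.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def small_telescopes_n(original_n_per_group, multiplier=2.5):
    """Replication sample size under the 2.5x rule of thumb."""
    return ceil(multiplier * original_n_per_group)


def candidate_sample_sizes(original_n_per_group, reported_d):
    """Collect the per-group sample size each rule would suggest."""
    power_n = n_per_group_for_power(reported_d)
    return {
        "original": original_n_per_group,
        # Never go below the original sample, even if power analysis allows it.
        "power_80": max(power_n, original_n_per_group),
        "small_telescopes": small_telescopes_n(original_n_per_group),
    }


# Hypothetical target study: 20 participants per group, reported d = 0.6.
for rule, n in candidate_sample_sizes(20, 0.6).items():
    print(f"{rule}: {n} per group")
```

Comparing the rules side by side makes the budget trade-off visible: when the reported effect is large relative to the original sample, the 2.5x rule asks for far more participants than the power calculation does.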

A second tricky issue is deciding when a particular effect replicated. Following the Reproducibility Project format, students in the course replicate key statistical tests, and so if all of these are reliable and of roughly the same magnitude, it's easy to say that the replication was positive. But there are nevertheless many edge cases where the pattern of magnitudes is theoretically different, or similar in size but nevertheless nonsignificant. The logic of "small telescopes" is very compelling for simple t-tests, but it's often quite hard to extend to the complex, theoretically-motivated interactions that we sometimes see in sophisticated studies – for example, we sometimes don't even know the effect size! As a result, I can't guarantee that the replication codes we used above are what the original author would assign – perhaps they'd say "oh yes, that pattern of findings is consistent with the spirit of the original piece" even if the results were quite different. But this kind of uncertainty is always going to be an issue – there's really no way to judge a replication except with respect to the theory the original finding was meant to support.


I love teaching this course. It's wonderful to see students put so much energy into so many exciting, new projects. Students don't choose replications to do "take downs" or to "bully or intimidate."* They choose projects they want to learn about and build on in their own research. As a consequence, it's very sad when they fail to reproduce these findings! It's sad because it creates major uncertainty around work that they like and admire. And it's also sad because this is essentially lost effort for them; they can't build on the work they did in creating a framework for running their study and analyzing the data. In contrast, their classmates – whose paradigms "worked" – are able to profit directly from the experience by using their new data and experimental tools to create cool new extensions.

When I read debates about whether replications are valuable, whether they should be published, whether they are fair or unfair, ad nauseam, I'm frustrated by the lack of consideration for this most fundamental use-case for replication. Replication is most typically about the adoption of a paradigm for future research. If our scientific work can't be built on by the people doing the science – trainees – then we've really lost track of our basic purpose.

Major thanks to Long Ouyang and Desmond Ong, the TAs for the course!

* Added scare quotes here 3/25 to indicate that I don't think anyone really does experimental psychology to bully or intimidate, even if it sometimes feels that way! 

Monday, March 23, 2015

Team up or slow down!

(This post is a draft of a talk I gave at SRCD last week, in a round-table discussion organized by Melanie Soderstrom on the topic of standardizing infancy methods.)

We were asked to consider what the big issues are in standardizing infancy methods. I think there's one big issue: statistical power. Infancy experiments are dramatically underpowered. So any given experiment we run tells us relatively little unless it has a lot of babies in it. This means that if we want good data, we need either to team up or else to slow down.

1. Statistical power.

The power of a test is the probability that the test will reject the null (at p < .05), given a true effect of a particular size. A general standard is that you want 80% power. The smaller the effect, the larger the sample you need to detect it. We're talking about standardized effect sizes here (Cohen's d), so if d = 1, then the two groups are a standard deviation apart. That's a really big effect.

A couple of facts to get you calibrated. The traditional sample size in infancy research is 16. Let's assume a within-subjects t-test. Then 16 infants gets you 80% power to detect an effect of around d = .75. That's a big effect, by most standards. But the important question is, how big is your average effect in infancy research?
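As a way to check numbers like these, here is a minimal sketch that computes the power of a within-subjects (one-sample) t-test from the noncentral t distribution, using only the Python standard library and plain numerical integration. The function names, the integration range, and the step count are all assumptions of this sketch; dedicated power software would be the usual choice.

```python
from math import erfc, exp, lgamma, log, sqrt


def upper_tail(t_crit, df, delta, steps=4000):
    """P(T > t_crit) for a noncentral t (df, noncentrality delta).

    Writes T = (Z + delta) / S with S = sqrt(chi2_df / df) and integrates
    over S with the midpoint rule.
    """
    s_max = 1 + 10 / sqrt(2 * df)  # far enough into the tail of S
    h = s_max / steps
    log_norm = -(df / 2) * log(2) - lgamma(df / 2)
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        v = df * s * s
        # Log density of S, derived from the chi-square density at v.
        log_density = log(2 * df * s) + (df / 2 - 1) * log(v) - v / 2 + log_norm
        # P(Z + delta > t_crit * s) for a standard normal Z.
        x = delta - t_crit * s
        total += 0.5 * erfc(-x / sqrt(2)) * exp(log_density)
    return total * h


def critical_t(df, alpha=0.05):
    """Two-sided critical value: solve P(T > t) = alpha / 2 by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if upper_tail(mid, df, 0.0) > alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def paired_t_power(d, n, alpha=0.05):
    """Power of a two-sided within-subjects t-test with n participants."""
    df = n - 1
    delta = d * sqrt(n)  # noncentrality parameter
    t_crit = critical_t(df, alpha)
    # Rejections in either tail count.
    return upper_tail(t_crit, df, delta) + upper_tail(t_crit, df, -delta)


print(f"power for d = .75, n = 16: {paired_t_power(0.75, 16):.2f}")
```

Calling `paired_t_power` with other values of `d` and `n` gives a quick sense of how fast power falls off as effects get smaller.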

2. Facts on power in infancy research.

Luckily, Sho Tsuji (another presenter in the roundtable), Christina Bergmann, and Alex Cristia have been getting together these lovely databases of infant speech perception work. And my student Molly Lewis is presenting a meta-analysis of the "mutual exclusivity phenomenon" in the posters on Saturday morning (edit: link). There's some good news and some bad news. 
  • Mutual exclusivity (ME) is really robust, at least with older kids. The effect size d is around 1, meaning you really can see this effect reliably with 16 kids (as you might have suspected). But if you go younger, you need substantially larger samples to have adequate power, and groups of 36 or more are warranted.
  • Phoneme recognition is also pretty good. Traditional methods like high-amplitude sucking and conditioned head turn yield effect sizes over 1. On the other hand, head-turn preference is only around d = .5. So again, you need more like 36 infants per group to have decent power.
  • Word segmentation, on the other hand, not so good. The median effect size is just above .25. So your power in those studies is actually pretty close to zero. 
Again thanks to Sho, Christina, and Alex for putting this stuff on the internet so I could play with it. I can't stress enough how important that is.
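One way to get a feel for what these effect sizes mean at the traditional sample size is to simulate it. The sketch below runs many fake within-subjects experiments with 16 infants each, at roughly the three effect sizes in the bullets above (1, .5, and .25), and counts how often a t-test comes out significant. Everything here is an illustration under simple assumptions (normally distributed difference scores with SD 1); the names are hypothetical and no real data are involved.

```python
import random
from math import sqrt

N_INFANTS = 16
# Two-sided .05 critical value of t with 15 df (standard table value).
T_CRIT_DF15 = 2.131


def simulated_power(d, n_sims=5000, seed=1):
    """Fraction of simulated 16-infant experiments with |t| above the cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Difference scores with true mean d and SD 1, so d is Cohen's d.
        diffs = [rng.gauss(d, 1.0) for _ in range(N_INFANTS)]
        m = sum(diffs) / N_INFANTS
        sd = sqrt(sum((x - m) ** 2 for x in diffs) / (N_INFANTS - 1))
        t = m / (sd / sqrt(N_INFANTS))
        if abs(t) > T_CRIT_DF15:
            hits += 1
    return hits / n_sims


for d in (1.0, 0.5, 0.25):
    print(f"d = {d}: simulated power with 16 infants = {simulated_power(d):.2f}")
```

Simulation is slower than an analytic power calculation, but it extends easily to designs where no closed-form answer is handy.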

3. What does this mean?

First, if you do underpowered experiments, you get junk. The best case is that you recognize that, but more often than not, what we do is over-interpret some kind of spurious finding that we got with one age group or stimulus but not another, or with boys but not girls. Theory can provide some guide – so if you didn't expect a result, be very skeptical unless you have high power!

Second, all of this is about power to reject the null. So that means the situation is much worse when you want to see an age by stimulus interaction, or a difference between groups in the magnitude of an effect (say, whether bilinguals show less mutual exclusivity than monolinguals). The power on these interaction tests will be very low, and you are likely to need many dozens or hundreds of children to be able to test this kind of hypothesis accurately. Let's say for the sake of argument that we're looking at a difference of d = .5 – that is, ME is .5 SDs stronger for monolinguals than bilinguals. That's a whopping difference. We need 200 kids to have 80% power on that interaction. For effects smaller than d = .4, don't even bother testing interactions because you won't have the power to detect them. That's just the harsh statistical calculus.

4. So where do we go from here?

There are a bunch of options, and no one option is the only way forward – in fact, these are very complementary options:

You can slow down and test more kids. This is a great option - we've started working with children's museums to try and collect much larger samples, and we've found these samples give us much more interesting and clear patterns of data. We can really see developmental change quantitatively with 200+ kids.

You can team up. Multiple labs working together can test the same number of kids but put the results together and get the power they need - with the added bonus of greater generalizability. You'll get papers with more authors, but that's the norm in many other sciences at this point. And if you team up, you will need to standardize – that means planning the stimuli out and sharing them across labs.

You can meta-analyze. That means people need to share their data (or at least publish enough about it so that we can do all the analyses we need). I don’t know about Sho et al., but our experience has been that reporting standards are still very bad for developmental papers. We need information about effect size and variability.
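To show why effect size and variability are the key pieces of information, here is a minimal sketch of the simplest kind of meta-analysis: a fixed-effect, inverse-variance weighted average of Cohen's d across studies. It assumes each study compares two independent groups and uses the usual large-sample approximation for the variance of d; within-subjects designs need a different variance formula, and a random-effects model is often more appropriate when labs differ. The lab results below are hypothetical.

```python
from math import sqrt


def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def pool_fixed_effect(studies):
    """Inverse-variance weighted mean effect and an approximate 95% CI.

    `studies` is a list of (d, n1, n2) tuples.
    """
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    pooled = sum(w * d for w, (d, _, _) in zip(weights, studies)) / sum(weights)
    se = sqrt(1 / sum(weights))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)


# Hypothetical results from three labs: (d, n in group 1, n in group 2).
labs = [(0.60, 16, 16), (0.20, 16, 16), (0.45, 24, 24)]

pooled, (low, high) = pool_fixed_effect(labs)
print(f"pooled d = {pooled:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Without a reported effect size and the group sizes (or some other measure of variability), a study cannot be given a weight, which is why incomplete reporting keeps papers out of analyses like this one.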

5. Conclusions

It isn't just infancy research that is having trouble with this issue. Social psychologists are struggling with the lack of reproducibility in their area; neuroimaging is in a similar place, trying to figure out what to do because more than 12 fMRI scans is too expensive but 12 subjects doesn't give you any power. We are together figuring out that this is a tough position to be in. But pursuing open methods, collaboration, and high-powered studies will certainly help.