Babies Learning Language: The ManyBabies Project

tl;dr: Introducing and organizing a ManyLabs study for infancy research. Please comment or email me (mcfrank (at) stanford.edu) if you would like to join the discussion list or contribute to the project.

Introduction

The last few years have seen increasing acknowledgement that there are flaws in the published scientific literature – in psychology and elsewhere (e.g., Ioannidis, 2005). Even more worrisome is that self-corrective processes are not as fast or as reliable as we might hope. For example, in the reproducibility project, which was published this summer (RPP, project page here), 100 papers were sampled from top journals, and one replication of each was conducted. This project revealed a disturbingly low rate of success for seemingly well-powered replications. And even more disturbing, although many of the target papers had a large impact, most still had not been replicated independently seven years later (outside of RPP).

I am worried that the same problems affect developmental research. The average infancy study – including many I've worked on myself – has the issues we've identified in the rest of the psychology literature: low power, small samples, and undisclosed analytic flexibility. Add to this the fact that many infancy findings are never replicated, and even those that are replicated may show variable results across labs. All of these factors lead to a situation where many of our empirical findings are too weak to build theories on.

In addition, there is a second, more infancy-specific problem that I am also worried about. Small decisions in infancy research – anything from the lighting in the lab to whether the research assistant has a beard – may potentially affect data quality, because of the sensitivity of infants to minor variations in the environment. In fact, many researchers believe that there is huge intrinsic variability between developmental labs, because of unavoidable differences in methods and populations (hidden moderators). These beliefs lead to the conclusion that replication research is more difficult and less reliable with infants, but we don't have data that bear one way or the other on this question.

The Depressing Math of Power Analysis

Even aside from issues of analytic flexibility, questionable research practices, or hidden moderators, the basic math of effect sizes and power analysis suggests that many of our findings in infancy research are likely not to be reproducible. To figure out how likely our findings are to be false positives under the classic statistical paradigm, we need to know our average sample size (N) and our average effect size (d). Then we can figure out how much power we have, and so whether our studies are dramatically under-powered.

Here are some numbers. A systematic review of infancy research from 2014 that I've been doing shows a median cell-size for looking-time research that's around N=20 babies/cell. And folks in my lab (Molly Lewis, Mika Braginsky) and Alex Cristia's in Paris (Sho Tsuji, Christina Bergmann, Page Piccinini) have been working on aggregation of meta-analyses of early language development. Eyeballing the "meta-meta-analysis" figure from our project, MetaLab, suggests that d=.5, the average effect size for adult research, is not too far off for much infant language research as well.

These numbers, d=.5 and N=20, mean we are in deep trouble. Here's a power-analysis app I created that you can play with, and here's the relevant power plot.

As you can see, these settings give you 60% power to detect a difference between conditions (red curve, with red dot indicating N=20). But once you add in a negative control group (which many people do) and look for an interaction (which fewer do but all should), power goes down to about 35% (green curve). And if you do a mediation analysis or a gender split, or look at two different analyses, power is negligible (< 20%, not shown). So on this (admittedly simplistic) analysis, a huge number of findings in the infancy literature may be unreliable, even if they report a statistically significant finding; they may be false positives or false negatives (for more on this statistical intuition, see an important paper by Button et al., 2013).
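For readers who want to check the arithmetic without the app, here is a minimal sketch of the calculation. It rests on assumptions that are mine, not taken from the app itself: it uses a normal approximation to the t-test, treats the "difference between conditions" as a one-sample (paired) test with N babies, and treats the interaction as a comparison of two independent groups of N babies each. The function names `approx_power` and `n_for_power` are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def approx_power(d, n_per_cell, n_groups=1, alpha=0.05):
    """Approximate two-sided power via a normal approximation (an assumption).

    n_groups=1: one-sample or paired comparison with n_per_cell babies.
    n_groups=2: two independent groups with n_per_cell babies each.
    """
    if n_groups not in (1, 2):
        raise ValueError("n_groups must be 1 or 2")
    # Expected test statistic under the alternative:
    # d * sqrt(n) for one group, d * sqrt(n / 2) for two groups.
    ncp = d * sqrt(n_per_cell / n_groups)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability of landing beyond either critical value.
    return _STD_NORMAL.cdf(ncp - z_crit) + _STD_NORMAL.cdf(-ncp - z_crit)


def n_for_power(d, target=0.80, n_groups=1, alpha=0.05):
    """Smallest per-cell N whose approximate power reaches the target."""
    n = 2
    while approx_power(d, n, n_groups, alpha) < target:
        n += 1
    return n


if __name__ == "__main__":
    print(round(approx_power(0.5, 20, n_groups=1), 2))  # simple difference
    print(round(approx_power(0.5, 20, n_groups=2), 2))  # with control group
    print(n_for_power(0.5, n_groups=1), n_for_power(0.5, n_groups=2))
```

Under these assumptions, d=.5 and N=20 come out close to the 60% and 35% figures quoted above. Because the normal approximation is slightly optimistic at small N, an exact t-test calculation would ask for a few more babies per cell than `n_for_power` reports.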

ManyBabies

How can we assess the reliability of the developmental literature, given these issues? Redoing the 100-study RPP with developmental samples will not be possible: powering even a small number of baby studies to the level necessary for definitive evidence will be very costly, and likely impossible in the context of a single lab. Instead, the idea for the ManyBabies project is that we can try to understand some of the issues of power and reliability by following a different approach. For any individual effect, a single failed replication doesn't constitute definitive evidence, because of all the different factors (variance in method, population, etc.) that could have led to failure. To address this issue, the ManyLabs projects were designed to quantify these moderators systematically (ManyLabs 1, 2, and 3). In these projects, many labs carried out the same standardized procedure, and a set of planned analyses was conducted to quantify variability due to sample, setting, and other factors. For example, ManyLabs 3 looked at variability across the academic semester.

I propose that we should take this approach for developmental research. We should select a small set of findings in infancy research and conduct preregistered replications, with many labs contributing data to each replication. The results will help us understand the methodological bases of infancy research. In addition, they will provide an unparalleled database for examining infant behavior, potentially leading to interesting insights about novelty/familiarity preferences, eye-movements, and other topics. We should select studies for replication based on methodological and topical diversity, as well as theoretical importance and evidence of prior replication. These studies should be divided across labs so that the resulting samples have sufficient power A) to test for heterogeneity of variance across labs (indicating the presence of some moderator), and B) to conduct a set of planned analyses of specific cross-lab moderators.
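To make the idea of testing for cross-lab heterogeneity concrete, here is one way it could be sketched. This is not the project's planned analysis, which has not been decided. It uses Cochran's Q and the I² statistic, which are standard meta-analytic summaries of how much labs disagree beyond what sampling error alone would produce. The function name `heterogeneity` and the per-lab numbers are hypothetical and purely illustrative, not real data.

```python
def heterogeneity(effects, variances):
    """Return (pooled effect, Cochran's Q, I^2) for per-lab effect sizes.

    effects:   one effect size estimate per lab
    variances: the sampling variance of each estimate
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching lists with at least two labs")
    # Inverse-variance weights: more precise labs count for more.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q sums the weighted squared deviations of labs from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is the share of Q in excess of its expectation (df) under
    # homogeneity, floored at zero.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


if __name__ == "__main__":
    # Hypothetical effect sizes (d) and sampling variances from five labs.
    lab_effects = [0.35, 0.60, 0.48, 0.10, 0.72]
    lab_variances = [0.05, 0.06, 0.05, 0.07, 0.05]
    pooled, q, i2 = heterogeneity(lab_effects, lab_variances)
    print(f"pooled d = {pooled:.2f}, Q = {q:.2f}, I^2 = {i2:.0%}")
```

A large Q relative to the number of labs minus one, or a high I², would point toward some moderator; a value near zero would suggest that labs agree about as well as sampling error allows.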

Objections and Replies

Standardizing methods across labs has been tried before, and it's very difficult. We may never be able to standardize infancy methods perfectly. But following the ManyLabs approach, variability is not a problem: it is the phenomenon of interest. Perfect standardization is not desirable, even if it were achievable, especially since it's non-standard aspects of procedure that are most interesting from the perspective of understanding variability across labs. For example, perhaps we would like to have as a planned experimental goal the comparison of manual coding of looking times vs. eye-tracking. This analysis would be very valuable for new labs assessing the financial and technical costs of eye-tracking.

Focusing on replication slows down the pace of discovery. Collecting replication data may have a short-term cost, but it will also likely have very powerful medium- and long-term benefits for the field as a whole. First, what we learn about moderators and methodological best practices may increase labs' abilities to collect good data. Second, if we show that we are taking replicability seriously as a field, then we are much more likely to attract funding and support. Conversely, if we fail to take these issues seriously, it's likely that funders – who are increasingly aware of these concerns – will decrease support. Finally, if our eventual goal is to create quantitative theories that allow us to make predictions for a wide range of phenomena, then it is almost never wasted effort to make more precise measurements. And if we design our study properly, we might also create a rich dataset that can be mined for insights about individual variation or used for longitudinal followups.

Time spent on this project may hurt trainees in participating labs. The eventual aim of the study is to provide a set of data collection and analytic practices that make research more efficient by decreasing time wasted chasing false positives, and these improvements will hopefully do lots of good for trainees in the long term. But there is still a short-term cost in terms of lost participants. In exchange for this, there are several ways to reap other benefits. First, and most importantly, the project will provide training on important statistical and methodological concepts. Second, the project itself may lead to publication opportunities. For example, students could get involved in the project and earn authorship, or plan secondary analyses of their own that address questions of interest. Third, labs could double down and run extra babies in theoretically-interesting (and publishable) comparison conditions, for example with different populations or different manipulations.* I hope the project creates many scientific opportunities of this sort.

Many successful replications of a classic finding won't tell us anything interesting. Actually, as in the earlier ManyLabs studies, I think at least one of the studies we include in ManyBabies (and perhaps all of them) should be a "positive control." A positive control in this context would be a study for which we have very strong expectations that it will replicate clearly. If we fail to reproduce this control study overall, we will believe that our protocol is likely flawed, rather than that the original results are incorrect. And even if this effect reaches significance in every lab that tests it (which is statistically very unlikely), we will still have a chance to test whether the effect size is moderated by differences in procedure and population across labs. Even a finding of no variability would be immensely interesting, since it would provide evidence against the presence of big hidden moderators that control outcomes in infancy studies.
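To see why significance in every lab is so unlikely, here is a rough back-of-the-envelope sketch. It assumes that labs are independent and that each has the same power; the lab count of 10 is a hypothetical number, not a project plan, and `prob_all_significant` is a made-up name.

```python
def prob_all_significant(power, n_labs):
    """Chance that every one of n_labs independent labs gets p < alpha,
    assuming each lab has the same power (a simplifying assumption)."""
    if not 0.0 <= power <= 1.0 or n_labs < 1:
        raise ValueError("power must be in [0, 1] and n_labs >= 1")
    return power ** n_labs


if __name__ == "__main__":
    for power in (0.6, 0.8, 0.9):
        chance = prob_all_significant(power, 10)
        print(f"power {power:.0%}: all 10 labs significant with p = {chance:.3f}")
```

With the roughly 60% power discussed earlier, ten labs would all reach significance well under 1% of the time, and even at 90% power per lab the chance is only about one in three.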

Conclusion

As I've discussed the ManyBabies project with others, we've cohered around a general vision of the project that I've described above. But there is much left to do in terms of planning the exact approach we take, as well as selecting the studies for inclusion and planning the analyses. If you are interested in infancy research and have opinions on any of these issues, please join the project by emailing me (mcfrank (at) stanford.edu). We would love to have you.

---

* Thanks to Janet Werker for this suggestion.

Many thanks to Alex Cristia, Emmanuel Dupoux, Melanie Soderstrom, Sho Tsuji, and everyone on the ManyBabies listserv for very valuable discussion.

Monday, December 14, 2015
