Monday, November 10, 2014

Comments on "reproducibility in developmental science"

A new article by Duncan et al. in the journal Developmental Psychology highlights best practices for reproducibility in developmental research. From the abstract:
Replications and robustness checks are key elements of the scientific method and a staple in many disciplines. However, leading journals in developmental psychology rarely include explicit replications of prior research conducted by different investigators, and few require authors to establish in their articles or online appendices that their key results are robust across estimation methods, data sets, and demographic subgroups. This article makes the case for prioritizing both explicit replications and, especially, within-study robustness checks in developmental psychology. 
I'm very interested in this topic in general and think that the broader message is on target. Nevertheless, I was surprised by the specific emphasis in this article on what they call "robustness checking" practices. In particular, all three of the robustness practices they describe – multiple estimation techniques, multiple datasets, and subgroup analyses – seem to be most useful for non-experimental studies that involve large correlational datasets (e.g. from nationally representative studies).

Multiple estimation techniques refers to the use of several different statistical models (e.g. standard regression, propensity matching, instrumental variable regression) to estimate the same effect. This is not a bad practice, but it is much more important when there are many different ways of controlling for confounders (e.g. in a large observational dataset). In a two-condition experiment, the menu of options is more limited. Similarly, subgroup estimation – estimating models on smaller populations within the main sample – is typically only possible with a large, multi-site dataset. And the use of multiple datasets presupposes that there are many datasets that bear on the question of interest, something that is not usually true when you are making experimental tests of a new theoretical question.

All this means that the article's primary empirical claim – that developmental psych is behind other disciplines (like applied economics) in these practices – is a bit unfair. Here's the key table from the article:

The main point we're supposed to take away from this table is that the econ articles are doing many more robustness checks than the developmental psych articles. But I'd bet that most of the developmental psych journals are filled with novel empirical studies that don't afford comparison with large, pre-existing datasets; subgroup analyses; or use of multiple estimation techniques. And I'm not sure that's a bad thing – at the very least, causal inference is far more straightforward in randomized experiments than in large-scale observational studies.

I think I have the same goals as the authors: making developmental (and other) research more reproducible. But I would start with a different set of recommendations to the developmental psych community. Here are three simple ones:
  • Larger samples. It is still common in the literature on infancy and early childhood to have extremely small sample sizes. N=16 is still the accepted standard in infancy research, believe it or not. Given the evidence that looking time is a quantitative variable (e.g. here and here), we need to start measuring it with precision. Infants are expensive, but not as expensive as false positives. And preschoolers are cheap, so there's really no excuse for tiny cell sizes.
  • Internal replication. There are many papers – again especially in infant research but also in work with older children – where the primary effect is demonstrated in Study 1 and then the rest of the reported findings are negative controls. A good practice for these studies is to pair each control with a de novo replication. This facilitates statistical comparison (e.g., equating for small aspects of population or testing setup that may change between studies) and also ensures robustness of the effect. 
  • Developmental comparison. This recommendation probably should go without saying. For developmental research – that is, work that tries to understand mechanisms of growth and change – it's critical to provide developmental comparisons and not just sample a single convenient age group. Developmental comparison groups also provide an important opportunity for internal replication. If 3-year-olds are above chance on your task and 4- and 5-year-olds aren't, then perhaps you've discovered an amazing phenomenon; but it's also possible you have a false positive. Our baseline hypotheses about development provide useful constraints on the pattern of results we expect, meaning that developmental comparison groups can provide both new data and a useful sanity check.
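To make the sample-size point concrete, here is a rough sketch of how precision and power change with N. The effect size (d = 0.5) and the candidate sample sizes are my own illustrative assumptions, not numbers from Duncan et al., and the calculation uses a normal approximation to a one-sample test rather than an exact t-test, so treat the outputs as ballpark figures only.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample test.

    Uses a normal approximation (not an exact t-test).
    d is the assumed standardized effect size, n the sample size.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)  # expected test statistic under the assumed effect
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def ci_half_width(n, sd=1.0, level=0.95):
    """Half-width of a normal-theory confidence interval for a mean, in SD units."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return z_crit * sd / sqrt(n)


if __name__ == "__main__":
    assumed_d = 0.5  # hypothetical effect size, chosen for illustration
    for n in (16, 32, 64):
        print(f"N={n:3d}  power~{approx_power(assumed_d, n):.2f}  "
              f"95% CI half-width~{ci_half_width(n):.2f} SD")
```

Under these assumptions, N=16 gives roughly even odds of detecting a medium-sized effect, and quadrupling the sample halves the width of the confidence interval. With smaller true effects, the picture for N=16 only gets worse.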
Perhaps this all just reflects my different orientation towards the field than Duncan et al.; but a quick flip through a recent issue of Child Development suggests that the modal article is not a large observational study but a much smaller-scale set of experiments. The recommendations Duncan et al. make are certainly reasonable, but we need to supplement them with guidelines for experimental research as well.

Duncan GJ, Engel M, Claessens A, & Dowsett CJ (2014). Replication and robustness in developmental research. Developmental Psychology, 50 (11), 2417-25. PMID: 25243330

(HT: Dan Yurovsky)
