Thursday, August 27, 2015

A moderate's view of the reproducibility crisis

(Part 1 of a series of two blogposts on this topic. The second part is here.)

Reproducibility is a major problem in psychology and elsewhere. Much of the published literature is not solid enough to build on: experiences from my class suggest that students can get interesting stuff to work about half the time, at best. The recent findings of the reproducibility project only add to this impression.* And awareness has been growing about all kinds of potential problems for reproducibility, including p-hacking, file-drawer effects, and deeper issues in the frequentist data analysis tools many of us were originally trained on. What should we do about this problem?

Many people advocate dramatic changes to our day-to-day scientific practices. While I believe deeply in some of these changes – open practices being one example – I also worry that some recommendations will hinder the process of normal science. I'm what you might call a "reproducibility moderate." A moderate acknowledges the problem, but believes that the solutions should not be too radical. Instead, solutions should be chosen to conserve the best parts of our current practice.

Here are my thoughts on three popular proposed solutions to the reproducibility crisis: preregistration, publication of null results, and Bayesian statistics. In each case, I believe these techniques should be part of our scientific arsenal – but adopting them wholesale would cause more problems than it would fix.

Pre-registration. Pre-registering a study is an important technique for removing analytic degrees of freedom. But it also ties the analyst's hands in ways that can be cumbersome and unnecessary early in a research program, where analytic freedom is critical for making sense of the data (the trick is just not to publish those exploratory analyses as though they are confirmatory). As I've argued, preregistration is a great tool to have in your arsenal for large-scale or one-off studies. In cases where subsequent replications are difficult or overly costly, prereg allows you to have confidence in your analyses. But in cases where you can run a sequence of studies that build on one another, each replicating the key finding and using the same analysis strategy, you don't need to pre-register because your previous work naturally constrains your analysis. So: rather than running more one-off studies but preregistering them, we should be doing more cumulative, sequential work where – for the most part – preregistration isn't needed.

Publication of null findings. File drawer biases – where negative results are not published and so effect sizes are inflated across a literature – are a real problem, especially in controversial areas. But the solution is not to publish everything, willy-nilly! Publishing a paper, even a short one or a preprint, is a lot of work. The time you spend writing up null results is time you are not doing new studies. What we need is thoughtful consideration of when it is ethical to suppress a result, and when there is a clear need to publish.
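To make the file-drawer effect concrete, here is a minimal simulation sketch in Python. It is an illustration under stated assumptions, not an analysis from this post: the true effect size (d = 0.2), the group size (20 per condition), and the rule that only significant positive results get "published" are all made up for the example, as are the function names.

```python
import random
import statistics


def simulate_study(true_d, n, rng):
    """Simulate one two-group study; return (observed Cohen's d, t statistic)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treatment = [rng.gauss(true_d, 1.0) for _ in range(n)]
    diff = statistics.mean(treatment) - statistics.mean(control)
    # Pooled SD for two groups of equal size.
    pooled_sd = ((statistics.variance(control) + statistics.variance(treatment)) / 2) ** 0.5
    d = diff / pooled_sd
    # For equal group sizes, t = d * sqrt(n / 2).
    t = d * (n / 2) ** 0.5
    return d, t


def file_drawer_demo(true_d=0.2, n=20, n_studies=5000, t_crit=2.024, seed=1):
    """Compare the mean effect across all studies with the mean across 'published' ones.

    Assumption: a study is published only if t > t_crit, where 2.024 is the
    two-tailed .05 critical value for 38 degrees of freedom (n = 20 per group).
    """
    rng = random.Random(seed)
    results = [simulate_study(true_d, n, rng) for _ in range(n_studies)]
    all_d = [d for d, _ in results]
    published_d = [d for d, t in results if t > t_crit]
    return (
        statistics.mean(all_d),
        statistics.mean(published_d),
        len(published_d) / n_studies,
    )


if __name__ == "__main__":
    mean_all, mean_published, publish_rate = file_drawer_demo()
    print(f"mean d, all studies:       {mean_all:.2f}")
    print(f"mean d, published studies: {mean_published:.2f}")
    print(f"share of studies published: {publish_rate:.2%}")
```

Under these assumptions, the average effect in the "published" subset has to be much larger than the true effect, because a study with 20 participants per group only reaches significance when its observed d exceeds roughly 0.64. That is the inflation a literature suffers when nulls stay in the drawer, whatever one decides about which nulls are worth writing up.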

Bayesian statistics. Frequentist statistical methods have deep conceptual flaws and are broken in any number of ways. But they can still be a useful tool for quantifying our uncertainty about data, and a wholesale abandonment of them in favor of Bayesian stats (or even worse, nothing!) risks several negative consequences. First, having a uniform statistical analysis paradigm facilitates evaluation of results. You don't have to be an expert to understand someone's ANOVA analysis. But if everyone uses one-off graphical models (as great as they are), then there are many mistakes we will never catch due to the complexity of the models. Second, the tools for Bayesian data analysis are getting better quickly, but they are nowhere near as easy to use as the frequentist ones. To pick on one system, as an experienced modeler, I love working with Stan. But until it stops crashing my R session, I will not recommend it as a tool for first-year graduate stats. In the meantime, I favor the Cumming solution: a gentler move towards confidence intervals, judicious use of effect size, and a decrease in reliance on inferences from individual instances of p < .05.
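As a small sketch of what reporting an interval and an effect size (rather than a lone p-value) can look like, here is a Python example using only the standard library. The data are toy numbers invented for illustration, the function names are hypothetical, and the interval uses a normal approximation; with samples this small a t-based interval would be somewhat wider.

```python
from statistics import NormalDist, mean, variance


def mean_diff_ci(a, b, level=0.95):
    """Difference in means with an approximate (normal-theory) confidence interval."""
    diff = mean(a) - mean(b)
    # Standard error of the difference, allowing unequal variances.
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    # Normal quantile; an approximation that is too narrow for small samples.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, (diff - z * se, diff + z * se)


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


if __name__ == "__main__":
    # Toy data, made up for this example only.
    treatment = [5, 6, 7, 8, 9]
    control = [3, 4, 5, 6, 7]
    diff, (low, high) = mean_diff_ci(treatment, control)
    print(f"mean difference: {diff:.2f}, approx. 95% CI [{low:.2f}, {high:.2f}]")
    print(f"Cohen's d: {cohens_d(treatment, control):.2f}")
```

Reporting the interval and the standardized effect together shows both how big the difference looks and how uncertain that estimate is, which a single p < .05 does not.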

Sometimes it looks like we've polarized into two groups: replicators and everyone else. This is crazy! Who wants to spend an entire career replicating other people's work, or even your own? Instead, replication needs to be part of our scientific process more generally. It needs to be a first step, where we build on pre-existing work, and a last step, where we confirm our findings prior to publication. But the steps in the middle – where you do the real discovery – are important as well. If we focus only on those first and last steps and make our recommendations in light of them alone, we forget the basic practice of science.

* I'm one of many, many authors of that project, having helped to contribute four replication projects from my graduate class.


  1. Some quick notes:

    1a. I don't think the advocates of preregistration want to disband exploratory research altogether.
    1b. > you don't need to pre-register because your previous work naturally constrains your analysis
    Only the previous *published* work constrains the analysis. A researcher may decide to leave 2 out of 6 studies out and publish the remaining 4 as a single paper. There is no way to know from the remaining work that 2 studies were left out. That's why pre-registration is necessary.
    2. > Publishing a paper, even a short one or a preprint, is a lot of work.
    Agreed. But I see this more as a problem of the current publishing system and the current publishing format.
    3a. > You don't have to be an expert to understand someone's ANOVA analysis. But if everyone uses one-off graphical models, then there are many mistakes we will never catch due to the complexity of the models.

    Is this not a consequence of the asymmetric focus of current education on freq stats? If people were taught Bayes stats, maybe they would find Bayes easy and be confused by freq methods, no?
    (Btw, Bayesian ANOVA does exist, and some researchers prefer to estimate graphical models with freq methods; the examples are not well chosen.)
    3b. Stan is very recent - 3 years old - so it's not surprising that it has bugs. More mature software like BUGS, JAGS, or PyMC exists and should be considered. Though, in my experience, virtually all problems I encountered in Stan were promptly solved by upgrading to the most recent Stan version, and I can recommend the software in good conscience.

  2. Thanks for the comments! A few followups:

    1b. Exclusion of studies from publication is a different (though related) problem. My point was that if you publish those four studies together, then their analytic decisions should constrain one another. If you swap e.g. exclusion criteria or dependent variables from study to study in a single paper, then it's pretty obvious what you're doing.

    2. Maybe I should have said *writing* a paper is a lot of work! I don't think it's just the publication process. We can lower the bars to publication, but it's still not trivial to craft a decent manuscript.

    3a/b. I'm arguing these points as someone whose career has been about using Bayesian methods! It's not that I don't like Bayesian statistics altogether. But the fact is that the frequentist linear model is easy to use and easy to reason about, and it will take a lot of education as well as theoretical development to change that. I've used BUGS and JAGS (though not PyMC) and can vouch for the fact that even though these tools generally work well, they are still *way* harder to use and to understand than lm(y ~ x), and the benefits are not always obvious.