Wednesday, November 19, 2014

Musings on the "file drawer" effect

tl;dr: Even if you love science, you don't have to publish every experiment you conduct.

I was talking with a collaborator a few days ago and discussing which of a series of experiments we should include in our writeup. In the course of this conversation, he expressed uncertainty about whether we were courting an ethical violation by choosing to exclude from a potential publication a set of generally consistent but somewhat more poorly executed studies. Publication bias is a major problem in the social sciences (and elsewhere).* Could we be contributing to the so-called "file drawer problem," in which meta-analytic estimates of effects are inflated by the failure to publish negative findings?

I'm pretty sure the answer is "no."

Some time during my first year of graduate school, I had run a few studies that produced positive findings (e.g., statistically significant differences between groups).  I went to my advisor and started saying all kinds of big things about how I would publish them and they'd be in this paper and that paper and the other; probably it came off as quite grandiose. After listening for a while, he said, "we don't publish every study we run."

His point was that a publishable study – or set of studies – is not one that produces a "significant" result. A publishable study is one that advances our knowledge, whether the result is statistically significant or not. If a study is uninteresting, it may not be worth publishing. Of course, the devil is in the details of what "worth publishing" means, so I've been thinking about how you might assess this. Here's my proposal:

It is unethical to avoid publishing a result if a knowledgeable and adversarial reviewer could make a reasonable case that your publication decision was due to a theoretical commitment to one outcome over another.

I'll walk through both sides of this proposal below. If you have feedback, or counterexamples, I'd be eager to hear them.

When it's fine not to publish. First, not everyone has an obligation to publish scientific research. For example, I've supervised some undergraduate honors theses that were quite good, but the students weren't interested in a career in science. I regret that they didn't do the work to write up their data for publication, but I don't think they were being unethical, at least from the perspective of publication bias (if they had discovered a lifesaving drug, the analysis might be different).

Second, publication has a cost. The cost is mostly in terms of time, but time is translatable directly into money (whether from salary or from research opportunity cost). Under the current publication system, publishing a peer-reviewed paper is extremely slow. In addition to the authors' writing time, a paper takes hours of time from editors and reviewers, and much thought and effort in responding to reviews. A discussion of the merits of peer review is a topic for another post (spoiler: I'm in favor of it).** But even the most radical alternatives – think generalized arXiv – do not eliminate the cost of writing a clear, useful manuscript. 

So on a cost-benefit analysis, there is a lot of work that shouldn't be written up. For example, cases of experimenter error are pretty clear cut. If I screw up my stimuli and Group A's treatment was contaminated with items that Group B should have seen, then what do we learn? The generalizable knowledge from that kind of experiment is pretty thin. It seems uncontroversial that results of this sort aren't worth publishing.

What about correct but boring experiments? What if I show that the Stroop effect is unaffected by font choice – or perhaps I show a tiny, statistically significant but not meaningful, effect of serif fonts on the Stroop effect?*** For either of these experiments, I imagine I could find someone to publish them. In principle, if they were well-executed, PLoS ONE would be a viable venue, since they do not referee for impact. But I am not sure why anyone would be particularly interested, and I don't think it'd be unethical not to publish them.

When it's NOT fine not to publish. First, when a finding is "null" – meaning, not statistically significant despite your expectation that it would be. Someone who held an alternative position (e.g., one that predicts no significant result) could say that you were biasing the literature due to your theoretical commitment. This is probably the most common case of publication bias.

Second, if your finding is inconsistent with a particular theory, this fact also should not be used in the decision about publication. Obviously, an adversarial critic could argue – rightly – that you suppressed the finding, which in turn leads to an exaggeration in the degree of published evidence for your preferred theory.

Third, when a finding (finding #1) is contradictory to another finding (finding #2) that you do intend to publish. Here, just imagine that your reviewer knew about #1 as well. Could you justify, on a priori grounds that are independent of the theory, that you should not publish #1? In my experience, the only time that is possible is if #1 is clearly a flawed experiment and does not have any evidential value for the question you're interested in.****
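
To make the first case concrete, here is a minimal simulation sketch of how filing away nonsignificant results can inflate the average published effect. It is not from any of the studies mentioned in this post: the function name, the true effect of 0.2, the 20 participants per group, and the crude "publish only if z > 1.96" rule are all illustrative assumptions.

```python
import random
import statistics


def simulate_file_drawer(true_effect=0.2, n_per_group=20, n_studies=2000, seed=0):
    """Simulate many two-group studies; 'publish' only the significant ones.

    All parameter values are illustrative assumptions. Scores have SD 1,
    so the raw mean difference is roughly a standardized effect size.
    Returns (mean effect over all studies, mean effect over published
    studies, number of published studies).
    """
    rng = random.Random(seed)
    all_effects = []
    published_effects = []
    for _ in range(n_studies):
        treatment = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = statistics.mean(treatment) - statistics.mean(control)
        # Standard error of the difference between two independent means.
        se = ((statistics.variance(treatment) + statistics.variance(control))
              / n_per_group) ** 0.5
        all_effects.append(diff)
        # Crude file-drawer rule: only positive, "significant" results get out.
        if diff / se > 1.96:
            published_effects.append(diff)
    return (statistics.mean(all_effects),
            statistics.mean(published_effects),
            len(published_effects))


if __name__ == "__main__":
    mean_all, mean_published, n_published = simulate_file_drawer()
    print(f"mean effect, all studies:       {mean_all:.2f}")
    print(f"mean effect, published studies: {mean_published:.2f}")
    print(f"studies published:              {n_published}")
```

Under these assumptions, the average over all studies stays near the true effect, while the average over the "published" subset comes out much larger, because a small study can only clear the significance bar by overestimating the effect. The filter here depends only on the outcome, which is exactly the kind of decision the adversarial reviewer test is meant to catch; dropping a study for a stimulus error decided before looking at the result is a different matter.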

Conclusions. Publication bias is a significant issue, and we need to use a variety of tools to combat it. Funnel plots are a useful tool, and some new work by Simonsohn et al. uses p-curve analysis. But the solution is certainly not to assume that researchers should publish all their experiments – that solution might be as bad as the problem, in terms of the cost for scientific productivity. Instead, to determine if they are suppressing evidence due to their own biases, researchers should consider applying an ethical test like the one I proposed above.

(The footnotes here got a little out of control).

* A recent, high-impact study used TESS (Time-Sharing Experiments in the Social Sciences, a resource for doing pre-peer reviewed experiments with large, representative samples) to estimate publication bias in the social sciences. I like this study a lot, but I am not sure how general the bias estimates are, because TESS is a special case. TESS is a limited resource, and experiments submitted to TESS undergo substantial additional scrutiny due to TESS's pre-data collection review. They are better vetted for potential theoretical impact, and substantially less likely to have basic errors, compared with a one-off study using a convenience sample. I suspect – based on no data except my own experience – that relatively more data is left unpublished than the TESS study estimates, but also that relatively less of it should be published.

** You could always say, hey, we should just put all our data online. We actually do something sort of like that. But you can't just go to and easily find out whether we conducted an experiment on your theoretical topic of choice. Reporting experiments is not just about putting the data out there – you need description, links to relevant literature, etc.

*** Actually, someone has done Stroop for fonts, though that's a different and slightly more interesting experiment.

**** Here's a trickier one. If a finding is consistent with a theory, could this consistency be grounds to avoid publishing it? A Popperian falsificationist scientist should never publish data that are simply consistent with a particular theory, because those data have no value. But basically no one operates in this way – we all routinely make predictions from theory and are excited when they are satisfied. For a Bayesian scientist of this type, data consistent with a theory are important. But some data may be consistent with many theories and hence provide little evidential value. Other data may be consistent with a theory that is already so well-supported that the experiments make little change in our overall degree of belief – consider the case of experiments supportive of Newton's laws, or of further Stroop replications. These cases also potentially work under the adversarial reviewer test, but only if we include the cost-benefit analysis above, and the logic is dicier. A reviewer could accuse you of bias against the Stroop effect, but you might respond that you just didn't think the incremental evidence was worth the effort. Nevertheless, this balance seems less straightforward. Reflecting this complexity, perhaps the failure to publish confirmatory evidence actually does matter. In a talk I heard last spring, John Ioannidis made the point that there are basically no medical interventions out there with d (standardized effect size) > 3 or so (I forget the exact number). I think this is actually a case of publication bias against confirmation of obvious effects. For example, I can't find a clinical trial of the rabies vaccine anywhere after Pasteur – because the mortality rate without the vaccine is apparently around 99%, and with the vaccine most people survive. The effect size there is just enormous – so big that you should just treat people! So actually the literature does have systematic bias against really big effects.


  1. Nice post. I have raised similar points in my discussions of publication bias:

    Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57, 153-169.

    Francis, G. (2013). We should focus on the biases that matter: A reply to commentaries. Journal of Mathematical Psychology, 57, 190-195.

    Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180-1187.

    (The second article is the shortest and most directly related to these issues).

    A side effect of these observations is that a file drawer bias is always relative to the theoretical interpretation of the experimental findings. A set of findings under one interpretation may appear biased but the same set of findings under another interpretation may appear unbiased. It's the relationship between data and theory that matters.

    A second side effect is that a large scale study demonstrating bias across the field is typically not as interesting as an investigation of bias relative to a theoretical interpretation. The latter might happen for perfectly good reasons, but the former indicates a problem with the support for the theoretical claims (relative to the data). This is why my investigations of bias mostly focus on individual articles rather than field-wide bias.

    I have a feeling many people have misunderstood this aspect of my investigations, so I really appreciate having a well-written post like yours that I can direct them to.

  2. Thanks, Greg. That actually puts your work in context in a way I hadn't thought about before. Very interesting!

  3. I just came across this post. It has always been my personal policy to publish all properly conducted experiments. But obviously we cannot publish every little datum we find. I have countless 1-2 subject pilot experiments (often on myself), often with fundamental methodological flaws etc., that shouldn't be of interest to anyone. Apart from the problems with the work/benefit ratio you discuss, it would just contaminate the scientific record with very noisy data.

    I only have one proper study in my file drawer, which is from the master's thesis of a student I had a few years ago. I really want to publish it but I also don't have the time to write it up myself. I keep thinking I will do it eventually... :P

    Recently another student finished their thesis project and we wondered whether it is worth trying to publish it. In my mind we should, but again there is the question of the work/benefit ratio. It was largely a low-powered exploratory experiment to see what happens when you do a certain manipulation. The way forward, as I see it, would be to collect substantially more data - obviously being honest about it by either (1) using an evidence-based cut-off or (2) repeating the experiment in a new, larger sample. But that would mean I need to give this project to someone who could do something more interesting instead. Not an appealing prospect.

    Maybe I will blog about this issue once I am fully rested and the whole replication debate has blown over... ;)