Monday, March 28, 2016

Should we always bring out our nulls?

tl;dr: Thinking about projects that aren't (and may never be) finished. Should they necessarily be published?

So, the other day there was a very nice conversation on Twitter, started by Micah Allen and focusing on people clearing out their file drawers and describing null findings. The original inspiration was a very interesting paper about one lab's file drawer, in which we got insight into the messy state of the evidence the lab had collected prior to its being packaged into conventional publications.

The broader idea, of course, is that – since they don't fit as easily into conventional narratives of discovery – null findings are much less often published than positive findings. This publication bias then leads to an inflation of effect sizes, with many negative consequences downstream. And the response to the problem of publication bias then appears to be simple: publish findings regardless of statistical significance, removing the bias in the literature. Hence, #bringoutyernulls.
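To make the inflation claim concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything taken from the paper or the Twitter discussion: a true standardized effect of 0.2, 20 participants per group, a critical t value of roughly 2.02, and a "journal" that publishes only significant results in the predicted direction. The function name `simulate_publication_bias` is likewise made up for this example.

```python
import random
import statistics


def simulate_publication_bias(true_d=0.2, n_per_group=20, n_studies=5000,
                              t_crit=2.02, seed=0):
    """Simulate many two-group studies and compare all effect size
    estimates with the subset that clears a significance filter.

    All parameter values are illustrative assumptions. t_crit is the
    approximate two-sided .05 critical value for 38 degrees of freedom.
    """
    rng = random.Random(seed)
    all_estimates = []
    published = []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treatment = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(treatment) - statistics.fmean(control)
        # Pooled standard deviation for two equal-sized groups.
        pooled_sd = ((statistics.variance(control)
                      + statistics.variance(treatment)) / 2) ** 0.5
        d_hat = diff / pooled_sd
        # With equal group sizes, t = d * sqrt(n / 2).
        t_stat = d_hat * (n_per_group / 2) ** 0.5
        all_estimates.append(d_hat)
        # Assumed filter: publish only significant positive results.
        if t_stat > t_crit:
            published.append(d_hat)
    return {
        "mean_all": statistics.fmean(all_estimates),
        "mean_published": statistics.fmean(published),
        "share_published": len(published) / n_studies,
    }


if __name__ == "__main__":
    for name, value in simulate_publication_bias().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the average over all studies should land close to the true effect, while every published estimate has to exceed about 0.64 simply to pass the filter, so the published average is necessarily several times the true value. Publishing everything removes that gap, which is the logic behind the hashtag.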

This narrative is a good one and an important one. But whenever the publication bias discussion comes up, I have a contrarian instinct that I have a hard time suppressing. I've written about this issue before, and in that previous piece I tried to articulate the cost-benefit calculation: while suppressing publication has a cost in terms of bias, publication itself also has a very significant cost to both authors (in writing, revising, and even funding publication) and readers (in sorting through and interpreting the literature). There really is junk, the publication of which would be a net negative – whether because of errors or irrelevance. But today I want to talk about something else that bothers me about the analysis of publication bias I described above.

This account of publication bias assumes a very particular story about experimental research: namely that the focus of our empirical investigation is the estimation of a single effect size associated with a phenomenon of interest. Now, there are certainly times when that model appears to be correct, especially in translational intervention studies where the question of interest is typically the difference in effect between treatment and control conditions.* And this is the paradigm that the Registered Replication Reports have followed, selecting a theoretically central finding and then measuring its size across labs.

But the majority of my projects simply do not follow this schema. Much of my work has focused on the development of quantitative theories of language use and language learning. Projects I work on typically do not look anything like running a two-condition experiment to show directional effects between treatment and control. They don't look like a classic 2x2 design with a predicted interaction that "shows mechanism." Instead, most of my work has focused on measuring inference or learning across a wide variety of conditions and then trying to understand sources of variability in the measurements, often using quantitative models – or failing that, competing verbal theories – to make predictions. In practice, this means papers are often either about gathering "descriptive" measurements, fitting models to these measurements, or – in the best case – both.

From that perspective it often makes very little sense to think about these projects as yielding simple "nulls." Instead, they yield complex patterns of data that are more or less well-understood – they are empirical explorations instead of point estimates of a particular effect. And when my collaborators and I discuss whether or not to publish them, we often do so on grounds of whether they provide – as a holistic package – compelling evidence for or against a particular theoretical position. That's the same generally problematic logic that can lead to publication bias in simple effect size estimation cases. But the key intuition here feels different to me.

Here are two examples of pieces of work that I have not published, despite very large investments of time and effort. The first is a followup of some work I did where we examined the numerical cognition of hunter-gatherers whose language had no words for numbers. The followup (with Tom Honeyman, at the time at ANU) looked at a group in Papua New Guinea who had been reported to have an interesting number system that was bounded at six. We did a bunch of experiments that hinted at the idea that different individuals understood the number system differently, and perhaps used it differently in a set of counting tasks. But in the end, we got data from only around 12 participants, and many were bilingual to different degrees, further confounding the findings. So we weren't able to make that particular claim, even though there was certainly a range of variation in numerical ability across the group.

We could still publish the data from this investigation. I think they are moderately interesting. But as far as I know they don't speak for or against a particular theoretical construct of interest. So a paper reporting them wouldn't have much of a theoretical upshot, other than providing an interesting and complicated – but inconclusive – story about the variability of people's number concepts.

The second piece was a followup of a model of early word learning that I worked on with Noah Goodman and Josh Tenenbaum. That model was successful in capturing a number of phenomena, but since it came out, a number of people have tried to work on it, mostly with limited success. The inference algorithm was complex and unwieldy, and the simulations required many hundreds of hours of computing time. So not a lot of direct followup work has used it, to my general chagrin. A couple of years ago, I spent a lot of time on a reimplementation of the model that had a cleaner mathematical description and a better inference scheme. (Code and partial writeup here, if you're interested.) But the results with this model weren't quite as good, though largely comparable, and even though the inference was somewhat faster, it wasn't really that fast. In addition, we couldn't make a new inference scheme – a particle filter – work as well as we thought it would. So I don't think this particular model advances our conversation about early word learning sufficiently to warrant publication. It's a theoretical artifact of moderate interest, but it fails to capture new phenomena or outperform previous models.

A theory is a way of creating expectations about observations under a particular set of conditions. In the "point estimate" regime, the theory leads to strong expectations about the experiment's result, regardless of outcome. In contrast, in the "empirical exploration" paradigm, it is very easy for the exploration to move outside of the conditions for which the theory makes strong predictions. The theories I was considering about numerical cognition had nothing much to say about bilingualism, and the sample size I could collect wouldn't allow the strong observations necessary for constructing a new theory. Similarly, nothing about the word learning theory I was interested in implementing in my model controlled whether particle filter inference schemes should work right. In both cases, the theory simply didn't have anything to say about the failure mode for the study.

So perhaps the fundamental generalization from these projects is this:

An experiment that has as its goal estimating a single effect will nearly always provide information relevant to that effect; in contrast, a study whose goal is to provide information relevant to deciding between multi-dimensional theories can fail uninformatively. 

I am not advocating for one or the other of these ways of working. They are synergistic with one another. Effect size estimation is important in a variety of cases, both when we are beginning to build theory and when theories are well-established. But as I've written before, concerns about replicability and estimation of individual effects are most important in the absence of strong quantitative theory.

* These are also precisely the cases where the classic NHST logic holds and the single p-value for the a priori between-groups t-test is a pretty good measure of what we care about for inference. In the context of critiques of NHST, we acknowledge that this match between the actual statistical paradigm and our research practice is a rare event. Why do we not acknowledge the same weaknesses of the estimation paradigm as well?
