Tuesday, December 3, 2024

Four papers I'm sad never to have published

One of the saddest things in academic research is an abandoned project. You pour time, effort, and sometimes money into a piece of research, only for it never to be released into the world to make an impact. Sometimes you don't finish an analysis or never write the paper. But I would argue that the saddest situations are the projects that came closest to being published – the "near misses."*

This sadness can also have practical consequences. If we abandon projects differentially because of their results – failing to report negative findings out of a belief that they would be uninteresting or hard to publish – then we get a bias in the published literature. We know this happens – but it's not what I want to focus on in this post. I'm thinking more about inadvertent near misses. The open science movement – and in particular the rise of preprints – has changed the field a lot in that these near misses can now remain visible. So I'm writing this post in part to promote and discuss four projects that never saw journal publication but that I still love...

I'm a researcher but I'm also (maybe primarily) an advisor and mentor, and so this kind of thing happens all the time: a trainee comes into my lab, does a great project, writes a paper about it, and then moves on to a new position. Sometimes they stay in academia, sometimes they don't. Even if we submit the manuscript before they leave, the reviews frequently come back only once they are absorbed in the next stage of their life. Unless I take over the writing process, things typically remain unpublished.

But the worst thing is when I abandon my own work because I'm too busy doing all that advising and teaching (and also getting grants to do the next shiny thing). Sadly this has happened many times over the past 15 years or so that I've been a faculty member. I simply didn't have the fortitude to get these papers through peer review, and so they linger as something interesting but unrevised – and perhaps fatally flawed (depending on whether you trust the reviewers). Here are my four biggest regrets.

1. A literature review on computational models of early language learning. This was originally the first chapter of my dissertation, and I revised it for a review journal, hoping to do something like Pinker's famous early review paper. It was reviewed by two people, one nativist and one empiricist. Both hated it, and I abandoned it in despair. I still like what I wrote, but it's very out of date now.

2. A huge dataset on children's free-viewing of naturalistic third-person dialogue and how it relates to their word learning. I loved this one. These experiments were my very first projects when I got to Stanford – we collected hundreds of kids' worth of eye-tracking data (with an eye-tracker bought with my very first grant), and we were able to show correlational relationships between free-viewing and word learning. We even saw a similar relationship in kids on the autism spectrum. The paper was rejected several times from good journals for reasonable reasons (too correlational, the kids with ASD were not well characterized). But I think it has a lot of value. (The data are now in Peekbank, at least.)

(Graph showing big developmental differences in free viewing, specifically for a moment at which you had to follow an actor's gaze to see what they were talking about in the video).

3. A large set of experiments on reference games. Noah Goodman and I created the Rational Speech Act (RSA) model of pragmatic processing, and this was a big part of my early research at Stanford. I spent a ton of time and money running Mechanical Turk experiments to try to learn more about the nature of the model. The manuscript includes a lot of methodological work on paradigms for studying pragmatic inference online, as well as some clever scenarios to probe the model's limits (there were 10 experiments overall!). (For readers who haven't seen RSA, there's a quick sketch of the basic setup after this list.) Sadly, I think I tried to make the manuscript more definitive than it should have been – by the time I finally submitted it, RSA already had many variants, and some of the formal work was not as strong as the empirical side. So reviewers who disliked RSA disliked it, and reviewers who liked RSA still thought it needed work.

4. A simplified formal model of teaching and learning. This one was an extension of the RSA model to teaching and learning scenarios, trying to get a handle on how teachers might change their messages based on the prior beliefs and/or knowledge of the learners. I was really proud of it, and it shapes my thinking about the dynamics of teaching to this day. Lawrence Liu started the project, but I did a ton more analysis several years later in the hopes of making a full paper. Sadly, it was rejected once – reviewers thought, perhaps reasonably, that the policy implications were too big a stretch. By the time I submitted it to another journal, a bunch of related formal work had appeared in the computer science literature. The second time around, reviewers asked for more simulations, but I was out of time and the code had gotten quite stale because it depended on a very specific tech stack.
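Since RSA comes up in both #3 and #4, here is a minimal sketch of the basic setup for readers who haven't seen it (this is the simplest vanilla version, glossing over the many variants mentioned above). A pragmatic listener reasons about a speaker, who in turn reasons about a literal listener:

P_{L_1}(m \mid u) \;\propto\; P_{S_1}(u \mid m)\, P(m)
P_{S_1}(u \mid m) \;\propto\; \exp\!\big(\alpha\,[\log P_{L_0}(m \mid u) - \mathrm{cost}(u)]\big)
P_{L_0}(m \mid u) \;\propto\; [\![u]\!](m)\, P(m)

Here [[u]](m) is the literal truth of utterance u for meaning m, P(m) is the prior over meanings, and \alpha is a rationality parameter. The teaching extension in #4 is about what an RSA-style speaker should do when choosing messages for learners whose prior beliefs or knowledge differ.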

I hope someone gets a little pleasure or knowledge from these pieces. I loved working on all four of them!

---- 

* I just learned that there is a whole literature on the psychology of near misses, for example in gambling or with respect to emotions like relief and regret.

Some thoughts on ManyBabies 4

 [repost from Bluesky]

Three ManyBabies projects - big collaborative replications of infancy phenomena - wrapped up this year. The first paper came out this fall. I thought I'd take this chance to comment on what I make of the non-replication result.

https://onlinelibrary.wiley.com/doi/full/10.1111/desc.13581

First off - this study was a SUCCESS! We got the community together to plan a replication study, and then we got 37 labs and 1000 babies to do a complicated study, and we pulled it off. That's a huge win for team science! Major kudos to Kelsey, Francis, and Kiley.

In case you're wondering about the status of the other projects, here's a summary slide (already shared right after ICIS). MB3-4 yielded null effects; MB2 is complicated... preliminary analysis shows the predicted effect but an even bigger, unpredicted effect in the control condition.


Turning back to MB4, we were interested in the classic "helper-hinderer" phenomenon. In these studies, babies have been shown to prefer an object that "helps" a third object over one that "hinders" it. A nice meta-analysis by Margoni & Surian (2018) confirms that this effect is variable across labs but has been found quite a few times. Data from this meta-analysis and an update by Alvin Tan are on metalab: langcog.github.io/metalab/. MB4 ran a "straightforward" best-practices replication, but with standardized video displays and both a social and a non-social condition. Overall, there were no preferences for helpers or hinderers at any age or in either condition.

So what's going on? Well, the initial finding (and the various replications in the meta-analysis) could have been a false positive, or could have contained some confound that produced the apparent effect. Or there might be some key difference in the replication that leads babies to fail in this particular version. There are other possibilities (bad implementation or bad measurement, for example), but I think these are less likely, given the general care that was taken in the project and the large sample size, which allows detection of effects much smaller than the original effect.
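To make the sample-size point concrete, here's a back-of-the-envelope sketch (illustrative only – the per-lab sample size and the test are my assumptions, not the MB4 design or analysis plan): with a simple two-sided test of "proportion of infants choosing the helper" against chance, a consortium-scale sample can detect preferences far closer to 50% than a typical single-lab sample can.

# Rough detectable-effect calculation for a two-sided one-sample z-test of a
# choice proportion against 50% (normal approximation). Illustrative numbers,
# not the actual MB4 design or analysis.
from scipy.stats import norm

def smallest_detectable_preference(n, alpha=0.05, power=0.80):
    """Smallest helper-choice proportion above 0.5 detectable with n infants."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    # standard error of a proportion is sqrt(p * (1 - p) / n), ~ sqrt(0.25 / n) near p = 0.5
    delta = (z_alpha + z_power) * (0.25 / n) ** 0.5
    return 0.5 + delta

print(smallest_detectable_preference(20))    # ~0.81 -- a typical single-lab sample
print(smallest_detectable_preference(1000))  # ~0.54 -- a multi-lab consortium

So even a helper preference much weaker than the original reports would have been detectable at this scale.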

Some people will jump to the interpretation that this study shows that the original finding was incorrect (and hence that the other replications were incorrect as well, and the earlier non-replications were right). This is one possibility - but we shouldn't be so quick to jump to conclusions. Another possibility is that the *particular* instantiation of helper-hinderer in MB4 is just not a good one. Maybe the stimuli are too fast, for example (some people have suggested this explanation). For all the size of MB4's participant sample, it is still a *single* stimulus sample.

Working on collaborative replication projects has given me an increasing appreciation of Tal Yarkoni's point about the critical need to sample stimuli (and paradigms) from the broader space in order to achieve generalizability. One stimulus or paradigm can always be idiosyncratic. In a recent paper, Holzmeister et al. break down heterogeneity across multi-lab studies into population, procedural, and analytic heterogeneity. They find that population heterogeneity is low, but that procedural and (likely) analytic heterogeneity are very high. That conclusion fits with what we saw in ManyBabies 1, where procedure really did matter - different methods yielded quite different effect sizes - but population didn't seem to matter as much, modulo known moderators like age and native language.
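One way I find helpful for seeing the stimulus-sampling point (a sketch only - this is not the Holzmeister et al. model or any actual ManyBabies analysis) is to imagine the helper preference y of infant i, tested in lab \ell with stimulus set s, in a crossed random-effects model:

y_{i \ell s} = \mu + u_{\ell} + v_{s} + \varepsilon_{i \ell s}, \qquad u_{\ell} \sim \mathcal{N}(0, \sigma^2_{\mathrm{lab}}), \quad v_{s} \sim \mathcal{N}(0, \sigma^2_{\mathrm{stim}})

With 37 labs, lab idiosyncrasies (the u_\ell) largely average out of the estimate of the overall effect \mu. With a single stimulus set, v_s is completely confounded with \mu: whatever is idiosyncratic about that one set of videos looks exactly like the presence - or absence - of the effect itself.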

A very reasonable alternative interpretation of MB4 - instead of the false-positive interpretation - is that we simply do not know *how* to elicit the helper-hinderer effect reliably, even if the effect is real. This "stimulus variability" explanation is not a very positive conclusion - lots of experts in the field sat around and tried to create a paradigm that would elicit this finding, and failed. At best, it means that we as a field don't have good processes for finding stimuli that elicit particular effects. The stimulus variability explanation is really different from saying that the original phenomenon is a false positive. But I think we really need to keep both explanations on the table at the moment, as uncomfortable as that may be.

In sum, I'm really enthusiastic about MB4. It's a key success for team science in infancy research, and it's also a valuable datapoint for understanding the helper-hinderer phenomenon. It's just not the end of the story...

PS: I think everyone should give HUGE props to Kiley Hamlin for pursuing this project to the end with massive dedication and openness to the result, even though it calls into question some of her previous work. That is what I call true scientific bravery.