Tuesday, December 3, 2024

Some thoughts on ManyBabies 4

[repost from Bluesky]

Three ManyBabies projects - big collaborative replications of infancy phenomena - wrapped up this year. The first paper came out this fall. I thought I'd take this chance to comment on what I make of the non-replication result.

https://onlinelibrary.wiley.com/doi/full/10.1111/desc.13581

First off: this study was a SUCCESS! We got the community together to plan a replication study, and then we got 37 labs and 1000 babies to do a complicated study and pulled it off. That's a huge win for team science! Major kudos to Kelsey, Francis, and Kiley.

In case you're wondering about the status of the other projects, here's a summary slide (already shared right after ICIS). MB3-4 yielded null effects; MB2 is complicated... preliminary analysis shows the predicted effect but an even bigger, unpredicted effect in the control condition.


Turning back to MB4: we were interested in the classic "helper hinderer" phenomenon. In these studies, babies have been shown to choose an object that "helps" over one that "hinders" a third object. A nice meta-analysis by Margoni & Surian (2018) confirms that the effect is variable across labs but has been found many times. Data from this meta-analysis and an update by Alvin Tan are on Metalab: langcog.github.io/metalab/. MB4 ran a "straightforward" best-practices replication, but with standardized video displays and both a social and a non-social condition. Overall, there was no preference for helpers or hinderers at any age, in either condition.

So what's going on? Well, the initial finding (and the various replications in the meta-analysis) could have been false positives, or could have shared some confound that produced the effect. Or there might be some key difference in the replication leading babies to fail in this particular version. There are other possibilities (bad implementation or bad measurement, for example), but I think these are less likely, given the general care that was taken in the project and the large sample size, which allows detection of effects much smaller than the original effect.
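(To make that power point concrete, here's a back-of-the-envelope sketch - mine, not an analysis from the paper - of what size of helper preference a simple one-sample test against chance could detect. The ~1000-baby figure comes from the project; the 16-baby comparison is an assumed typical original-scale lab sample.)

# Rough power sketch (my own illustration, not from the MB4 paper).
# Outcome: each baby's binary choice (helper vs. hinderer), tested
# against chance (50%) with a two-sided one-sample z-test.
# Sample sizes: ~1000 babies for MB4 (from the post); 16 babies is an
# assumed typical original-scale study, for comparison.
from math import sqrt
from scipy.stats import norm

def min_detectable_prop(n, alpha=0.05, power=0.80):
    """Smallest helper-choice rate distinguishable from 50%,
    using the normal approximation with worst-case SD of 0.5."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    delta = (z_alpha + z_beta) * 0.5 / sqrt(n)
    return 0.5 + delta

for n in (16, 1000):
    print(f"n={n:4d}: detectable helper preference ~ {min_detectable_prop(n):.1%}")
# n=  16: detectable helper preference ~ 85.0%
# n=1000: detectable helper preference ~ 54.4%

On these assumptions, MB4 had the power to detect even a modest ~54% helper preference, much smaller than the original effect.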

Some people will jump to the interpretation that this study shows that the original finding was incorrect (and hence that the other replications were incorrect as well, and the earlier non-replications were right). This is one possibility, but we shouldn't be so quick to jump to conclusions. Another possibility is that the *particular* instantiation of helper-hinderer in MB4 is just not a good one. Maybe the stimuli are too fast, for example (some people have suggested this explanation). However large the sample of participants in MB4, it's still a *single* stimulus sample.

In collaborative replication projects, I have an increasing appreciation of Tal Yarkoni's point about the critical need to sample stimuli (and paradigms) from the broader space in order to achieve generalizability. Any one stimulus or paradigm can be idiosyncratic. In a recent paper, Holzmeister et al. break down heterogeneity into population, procedural, and analytic components. They find that population heterogeneity is low, but procedural and (likely) analytic heterogeneity are very high across various multi-lab studies. That conclusion fits with what we saw in ManyBabies 1, where procedure really did matter - different methods yielded quite different effect sizes - but population didn't seem to matter as much, modulo known moderators like age and native language.

A very reasonable alternative to the false-positive interpretation of MB4 is that we simply do not know *how* to elicit the helper-hinderer effect reliably, even if it is real. This "stimulus variability" explanation is not a very positive conclusion either: lots of experts in the field sat around and tried to create a paradigm to elicit this finding, and failed. At best, it means that we as a field don't have good processes for finding stimuli that elicit particular effects. Still, the stimulus variability explanation is very different from saying that the original phenomenon is a false positive. I think we need to keep both explanations on the table for now, as uncomfortable as that may be.

In sum, I'm really enthusiastic about MB4. It's a key success for team science in infancy research, and it's also a valuable datapoint for understanding the helper-hinderer phenomenon. It's just not the end of the story...

PS: I think everyone should give HUGE props to Kiley Hamlin for pursuing this project to the end with massive dedication and openness to the result, even though it calls into question some of her previous work. That is what I call true scientific bravery.
