Tuesday, December 3, 2024

Four papers I'm sad never to have published

One of the saddest things in academic research is an abandoned project. You pour time, effort, and sometimes money into a piece of research, only to feel that it has not been released into the world to make an impact. Sometimes you don't finish an analysis or write a paper. But I would argue that the saddest situations are the projects that came closest to being published – these are "near misses."*

This sadness can also have practical consequences. If we abandon projects differentially because of their results – failing to report negative findings because of a belief that they would be uninteresting or hard to publish – then we get a bias in the published literature. We know this is true – but in this post I'm not going to focus on that. I'm thinking more about inadvertent near misses. The open science movement – and in particular the rise of preprints – has changed the field a lot in that these near misses are now visible again. So I'm writing this post in part to promote and discuss four projects that never saw journal publication but that I still love...

I'm a researcher but I'm also (maybe primarily) an advisor and mentor, and so this kind of thing happens all the time: a trainee comes into my lab, does a great project, writes a paper about it, and then moves on to a new position. Sometimes they stay in academia, sometimes they don't. Even if we submit the manuscript before they leave, however, it frequently happens that reviews come back after they are distracted by the next stage of their life. Unless I take over the writing process, things typically remain unpublished. 

But the worst thing is when I abandon my own work because I'm too busy doing all that advising and teaching (and also getting grants to do the next shiny thing). Sadly this has happened many times over the past 15 years or so that I've been a faculty member. I simply didn't have the fortitude to get the paper through peer review and so it lingers as something interesting but unrevised – and perhaps fatally flawed (depending on whether you trust the reviewers). Here are my four biggest regrets. 

1. A literature review on computational models of early language learning. This was the first chapter of my dissertation initially, and I revised it for a review journal, hoping to do something like Pinker's famous early review paper. It was reviewed by two people, one nativist and one empiricist. Both hated it, and I abandoned it in despair. I still like what I wrote, but it's very out of date now. 

2. A huge dataset on children's free-viewing of naturalistic third-person dialogue and how it relates to their word learning. I loved this one. These experiments were my very first projects when I got to Stanford – we collected hundreds of kids worth of eye-tracking data (with an eye-tracker bought with my very first grant) and we were able to show correlational relationships between free-viewing and word learning. We even saw a similar relationship in kids on the autism spectrum. This paper was rejected several times from good journals for reasonable reasons (too correlational, kids with ASD were not well characterized). But I think it has a lot of value. (The data are now in Peekbank, at least).

(Graph showing big developmental differences in free viewing, specifically for a moment at which you had to follow an actor's gaze to see what they were talking about in the video).

3. A large set of experiments on reference games. Noah Goodman and I created the Rational Speech Act  (RSA) model of pragmatic processing and this was a big part of my early research at Stanford. I spent a ton of time and money doing mechanical turk experiments to try to learn more about the nature of the model. This manuscript includes a lot of methodological work on paradigms for studying pragmatic inference online as well as some clever scenarios to probe the limits (there were 10 experiments overall!). Sadly I think I tried to make the manuscript more definitive than it should have been – by the time I finally submitted it, RSA already had many variants, and some of the formal work was not as strong as the empirical side. So reviewers who disliked RSA disliked it, and reviewers who liked RSA still thought it needed work. 

4. A simplified formal model of teaching and learning. This one was an extension of the RSA model for teaching and learning scenarios, trying to get a handle on how teachers might change their messages based on the prior beliefs and/or knowledge of the learners. I was really proud of it, and it shapes my thinking about the dynamics of teaching to this day. Lawrence Liu started the project, but I did a ton more analysis several years later in hopes of making a full paper. Sadly, it was rejected once – reviewers thought, perhaps reasonably, that the policy implications were too big a stretch. By the time I submitted it to another journal, a bunch of other related formal work had appeared in the computer science literature. Reviewers the second time asked for more simulations, but I was out of time and the code had gotten quite stale because it depended on a very specific tech stack. 

I hope someone gets a little pleasure or knowledge from these pieces. I loved working on all four of them!

---- 

* I just learned that there is a whole literature on the psychology of near misses, for example in gambling or with respect to emotions like relief and regret.

Some thoughts on ManyBabies 4

 [repost from Bluesky]

Three ManyBabies projects - big collaborative replications of infancy phenomena - wrapped up this year. The first paper came out this fall. I thought I'd take this chance to comment on what I make of the non-replication result.

https://onlinelibrary.wiley.com/doi/full/10.1111/desc.13581

First, off - this study was a SUCCESS! We got the community together to plan a replication study and then we got 37 labs and 1000 babies to do a complicated study and we pulled it off. That's a huge win for team science! Major kudos to Kelsey, Francis, and Kiley.

In case you're wondering about the status of the other projects, here's a summary slide (already shared right after ICIS). MB3-4 yielded null effects; MB2 is complicated... preliminary analysis shows the predicted effect but an even bigger, unpredicted effect in the control condition.


Turning back to MB4, we were interested in the classic "helper hinderer" phenomenon. In these studies, babies have been shown to choose an object that "helps" over one that "hinders" a third one. A nice meta-analysis by Margoni & Surian (2018) confirms that this effect is variable across labs but has been found quite a lot of times.  Data from this MA and an update by Alvin Tan are on metalab: langcog.github.io/metalab/. MB4 ran a "straightforward" best practices replication, but with standardized video displays and both a social and non-social condition. Overall, there were no preferences for helpers or hinderers at any age and for either condition.

So what's going on? Well, the initial success (and various replications in the meta-analysis) could have been false positives or contained some confound leading to success. Or there might be some key difference in the replication leading babies to fail in this particular version. There are other possibilities (bad implementation or bad measurement, for example) but I think these are less likely, given the general care that was taken in the project and the large sample size, which allows detection of effects much smaller than the original effect.

Some people will jump to the interpretation that this study shows that the original finding was incorrect (and hence that the other replications were incorrect as well, and the earlier non-replications were right). This one possibility - but we shouldn't be so quick to jump to conclusions. Another possibility is that the *particular* instantiation of helper-hinderer in MB4 is just not a good one. Maybe the stimuli are too fast, for example (some people have suggested this explanation). For all the size of the sample of participants in MB4, it's just a *single* stimulus sample.

In collaborative replication projects, I have an increasing appreciation of Tal Yarkoni's point about the critical need for sampling stimuli (and paradigms) from the broader space in order to achieve generalizability. One stimulus or paradigm can always be idiosyncratic. In a recent paper, Holzmeister et al. break down heterogeneity into population, procedural, and analytic heterogeneity. They find that population is low, but procedural and (likely) analytic heterogeneity is very high across various multi-lab studies.  That conclusion fits with what we saw in ManyBabies 1 where procedure did really matter - different methods yielded quite different effect sizes - but population didn't seem to matter as much, modulo known moderators like age and native language.

A very reasonable alternative interpretation of MB4 - instead of the false positive interpretation - is that we simply do not know *how* to elicit the helper-hinderer effect reliably, even if it is true. This "stimulus variability" explanation is not a very positive conclusion - lots of experts in the field sat around and tried to create a paradigm to elicit this finding and failed. The best case is that we don't as a field have processes for finding stimuli that elicit particular effects. The stimulus variability explanation is really different than saying that the original phenomenon is a false positive. But I think we really need to keep both explanations on the table at the moment, as uncomfortable as that may be.

In sum, I'm really enthusiastic about MB4. It's a key success for team science in infancy research, and it's also a valuable datapoint for understanding the helper-hinderer phenomenon. It's just not the end of the story...

PS: I think everyone should give HUGE props to Kiley Hamlin for pursuing this project to the end with massive dedication and openness to the result, even though it calls into question some of her previous work. That is what I call true scientific bravery.

Monday, March 27, 2023

Domain-specific data repositories for better data sharing in psychology!

Data sharing is a critical part of ensuring a reproducible and robust research literature. It's also increasingly the law of the land, with new federal mandates taking effect in the US this year. How should psychologists and other behavioral scientists share their data? 

Repositories should clearly be FAIR - findable, accessible, interoperable, and reusable. But here's the thing - most data on a FAIR repository like the Open Science Framework (which is great, btw), will never be reused. It's findable and accessible, but it's not really interoperable or reusable. The problem is that most psychological data are measurements of some stuff in some experimental context. The measures we use are all over the place. We do not standardize our measures, let alone our manipulations. The metadata are comprehensible but not machine readable. And there is no universal ontology that lets someone say "I want all the measurements of self-regulation on children that are posted on OSF." 

What makes a dataset reusable really depends on the particular constructs that it measures, which in turn depends on the subfield and community those data are being collected for. When I want to reuse data, I don't want data in general. I want data about a specific construct, from a specific instrument, with metadata particular to my use field. Such should be stored in repositories specific to that measure, construct, or instrument. Let's call these Domain Specific Data Repositories (DSDRs). DSDRs are a way to make sure data actually are interoperable and actually do get reused by the target community.

Why do LLMs learn so much slower than humans?


How do we compare the scale of language learning input for large language models vs. humans? I've been trying to come to grips with recent progress in AI. Let me explain two illustrations I made to help.

Recent progress in AI is truly astonishing, though somewhat hard to interpret. I don't want to reiterate recent discussion, but @spiantado has a good take in the first part of lingbuzz.net/lingbuzz/007180; l like this thoughtful piece by @MelMitchell1 as well: https://www.pnas.org/doi/10.1073/pnas.2300963120.

Many caveats still apply. LLMs are far from perfect, and I am still struggling with their immediate and eventual impacts on science (see prior thread). My goal in the current thread is to think about them as cognitive artifacts instead.

For cognitive scientists interested in the emergence of intelligent behavior, LLMs suggest that some wide range of interesting adaptive behaviors can emerge given enough scale. Obviously, there's huge debate over what counts as intelligent, and I'm not going to solve that here. 

But: for my money, we start seeing *really* interesting behaviors at the scale of GPT3. Prompting for few shot tasks felt radically unexpected and new, and suggested task abstractions underlying conditional language generation. At what scale do you see this? 

GPT-3 was trained on 500 billion tokens (= .75 words). So that gives us ~4e11 words. PaLM and Chinchilla are both trained on around 1e12 words. We don't know the corpus size for GP4-4 (!?!). How do these numbers compare with humans? 

Let’s start with an upper bound. A convenient approximation is 1e6 words per month for an upper bound on spoken language to a kid (arxiv.org/pdf/1607.08723…, appendix A or pnas.org/doi/abs/10.107…). That's 2e8 words for a 20 year old. How much could they read?

Assume they start reading when they’re 10, and read a 1e5-word book/week. That’s an extra 5e6 million words per year. Double that to be safe and it still only gets us to 3e8 words over 10 years.
Now let's do a rough lower bound. Maybe 1e5 words per month for kids growing up in a low-SES environment with limited speech to children (onlinelibrary.wiley.com/doi/epdf/10.11…). We don't get much of a literacy boost. So that gives us 5e6 by age 5 and 2e7 by age 20. 

That "lower bound" five year old can still reason about novel tasks based on verbal instructions - especially once they start kindergarten! 

The take-home here is that we are off by 4-5 orders of input magnitude in the emergence of adaptive behaviors.


The big cognitive science question is - which factors account for that gap? I'll think about four broad ones. 

Factor 1: innate knowledge. Humans have SOME innate perceptual and/or conceptual foundation. The strongest version posits "core knowledge" of objects, agents, events, sets, etc. which serve to bootstrap further learning. People disagree about whether this is true.

Factor 2: multi-modal grounding. Human language input is (often) grounded in one or more perceptual modalities, especially for young children. This grounding connects language to rich information for world models that can be used for broader reasoning.

Factor 3: active, social learning. Humans learn language in interactive social situations, typically curricularized to some degree by the adults around them. After a few years, they use conversation to elicit information relevant to them.

Factor 4: evaluation differences. We're expecting chatGPT to reason about/with all the internet's knowledge, and a five year old just understand a single novel theory of mind or causal reasoning task. Is comparison even possible?

So of course I don't know the answer! But here are a few scenarios for thinking this through. Scenario 1 is classic nativist dev psych: innate endowment plus input make the difference. You use core knowledge to bootstrap concepts from your experience. 


Scenario 2 is more like modern rational constructivism. Grounded experience plus a bunch of active and social learning allow kids to learn about the structure of the world even with limited innate knowledge.

I hear more about Scenario 3 in the AI community - once we ground these models in perceptual input, it's going to be easier for them to do common-sense reasoning with less data. And finally, of course, we could just be all wrong about the evaluation (Scenario 4).

As I said, I don't know the answer. But this set of questions is precisely why challenges like BabyLM are so important (babylm.github.io).

AI for psychology workflows hackathon - a report

[reposted from twitter]

My lab held a hackathon yesterday to play with places where large language models could help us with our research in cognitive science. The mandate was, "how can these models help us do what we do, but better and faster."

Some impressions:🧵

Whatever their flaws, chat-based LLMs are astonishing. My kids and I used ChatGPT to write birthday poems for their grandma. I would have bet money against this being possible even ten years ago.

But can they be used to improve research in cognitive science and psychology?

1. Using chat-based agents to retrieve factual knowledge is not effective. They are not trained for this and they do it poorly (the "hallucination problem"). Ask ChatGPT for a scientist bio, and the result will be similar but with random swaps of institutions, dates, facts, etc.

2. A new generation of retrieval-based agents are on their way but not here yet. These will have a true memory where they can look up individual articles, events, or entities rather than predicting general gestalts. Bing and Bard might be like this some day, but they aren't now.

3. Chat-based agents can accomplish pretty remarkable text formatting and analysis, which has applications in literature reading and data munging. E.g., they can pull out design characteristics from scientific papers, reformat numbers from tables, etc. Cool opportunities. These functions are critically dependent on long prompt windows. Despite GPT-4's notionally long prompt length, in practice we couldn't get more than 1.5k tokens consistently. That meant that pre-parsing inputs was critical, and this took too much manual work to be very useful. 

4. A massive weakness for scientific use is that cutting-edge agents cannot easily be placed in a reproducible scientific pipeline. Pasting pasting text into a window is not a viable route for science. You can get API access but without random seeds, this is not enough. (We got a huge object lesson in this reproducibility issue yesterday when OpenAI declared that they are retiring Codex, a model that is the foundation of a large number of pieces of work on code generation in the past year. This shouldn't happen to our scientific workflows.) Of course we could download Alpaca or some other open model, set it up, and run it as part of a pipeline. But we are cognitive scientists, not LLM engineers. We don't want to do that just to make our data munging slightly easier!

5. Chat agents are not that helpful in breaking new ground. The problem is that, if you don't know the solution for a problem, then you can't tell whether the AI did it right, or even is going in the right direction!  Instead, the primary use case seems to be helping people accomplish tasks they *already know how to do*, but to do them more effectively and faster. If you can check the answer, then the AI can produce a candidate answer to check.

6. It was very easy for us to come up with one-off use-cases that could be very helpful (e.g., help me debug this function, help me write this report or letter), and surprisingly hard to come up with cases that could benefit with creating automated workflows. At small scale, using chat AI to automate research tasks is trading one task (e.g., annotating data) for more menial and annoying ones (prompt engineering and data reformatting so that the AI can process it). This is ok for large problems, but not small and medium ones.

7. Confidence rating is a critical functionality that we couldn't automate reliably. We need AI to tell us when a particular output is low confidence so that it can be rechecked.

In sum: Chat AI is going to help us be faster at many tasks we already know how to do, and there are a few interesting scientific automation applications that we found. But for LLMs to change our research, we need better engineering around reliability and reproducibility.

Thursday, February 16, 2023

Why do hybrid meetings suck?

I tried rendering this post in Quarto, which is not blogger-compatible, but I'm including the link here: rpubs.com/mcfrank/hybrid.

Sunday, February 21, 2021

Methodological reforms, or, If we all want the same things, why can't we be friends?

 (tl;dr: "Ugh, can't we just get along?!" OR "aspirational reform meet actual policy?" OR "whither metascience?")


This post started out as a thread about the tribes of methodological reform in psychology, all of whom I respect and admire. Then it got too long, so it became a blogpost. 

As folks might know, I think methodological reform in psychology is critical (some of my views have been formed by my work with the ManyBabies consortium). For the last ~2 years, I've been watching two loose groups of methodological reformers get mad at each other. It has made me very sad to see these conflicts because I like all of the folks involved. I've actually felt like I've had to take a twitter holiday several times because I can't stand to see some of my favorite folks on the platform yelling at each other. 

This post is my - perhaps misguided - attempt to express appreciation for everyone involved and try to spell out some common ground.