Monday, April 20, 2026

Using AI to improve (not automate away) academic research

Everyone seems to be consumed with AI anxiety. Graduate students are wondering if they will be replaced by AI assistants, or whether they themselves are using AI enough, or using it "right". Researchers are wondering what it means to produce research if agents can write whole papers. Everyone is wondering how we will keep up with a literature that is moving ever faster.

Everyone is feeling the pressure to do *more*: do more projects, produce more papers, review more papers. This pressure has already had negative effects on the research ecosystem, for example the difficulty conferences face in getting quality, non-automated reviewing for the huge volume of submissions they receive.

We should think about what we can do that is *different.* We should try to use automation to be more efficient at the annoying parts of our jobs while leaving more time for discovering new knowledge. The key (fast-evolving, unresolved) issue is how AI models will change the frontier of what is scientifically possible. This varies from field to field and changes day by day, but my sense is that the rise of semi-autonomous agents will be very interesting for scaling up social and behavioral science.

Monday, February 16, 2026

An LLM-backed "Socratic tutor" to replace reading responses

My hot take on college-level teaching is that reading responses are mostly a terrible assignment, and they're even worse in the age of AI. I'm piloting something a bit different with my co-instructor right now: a "Socratic tutor" bot that asks students open-ended Socratic questions about a specific text and "passes" them when they show sufficient comprehension. Initial feedback from students in a first trial has been extremely positive, so I am thinking more about how this could be useful in the future, as well as about some of the potential problems. LLMs are far from a panacea for education – they cause way more problems than they solve at the moment! – but this might be an interesting use case.

As an instructor, one major challenge is that you want students to do the assigned reading and engage with it so that what you do in class can build on that content in a meaningful way; some students would prefer not to (or just don't have time, or whatever). How do you solve this problem? Weekly quizzes are one option, but they're time-consuming to make and administer, annoying to grade, and they reinforce a memorization mindset rather than inviting students to engage.

The humble reading response is a frequent alternative: you ask students to respond to, critique, or build on their readings, usually in a short piece ranging from a paragraph to a page. At their best, in a well-prepared seminar, the instructor reads these beforehand, synthesizes them, and calls on individual students to share their reactions. But in a larger course this synthesis is often impossible – and so the reading response becomes an assignment that no one wants to write and that is hard to read with the attention it deserves. Even worse, if you're never called on to share your reaction, it's possible to "respond" to a reading without having read it. And that's even before you can ask an AI to write a response to a text that it has ingested at some point (or that you've pasted into its chat window). What do we do?
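To make this concrete, here's a minimal sketch of the kind of tutor loop I have in mind, written against the OpenAI chat completions API. The model name, prompts, and pass criterion below are placeholders for illustration, not our actual implementation.

```python
# A minimal sketch of a Socratic tutor loop (illustrative only).
# Assumes the OpenAI Python SDK; the model name, prompts, and pass
# criterion are placeholders, not our actual implementation.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model choice

SYSTEM_PROMPT = (
    "You are a Socratic tutor. Ask one open-ended question at a time about "
    "the assigned reading below. Never give away answers. After each student "
    "reply, decide whether it shows real comprehension. When the student has "
    "shown sufficient comprehension across the conversation, respond with the "
    "single word PASS.\n\nREADING:\n{reading}"
)

def run_tutor(reading_text: str, max_turns: int = 8) -> bool:
    """Run a short tutoring dialogue in the terminal; return True on PASS."""
    messages = [{"role": "system",
                 "content": SYSTEM_PROMPT.format(reading=reading_text)}]
    for _ in range(max_turns):
        reply = client.chat.completions.create(model=MODEL, messages=messages)
        tutor_turn = reply.choices[0].message.content
        if "PASS" in tutor_turn:          # crude placeholder pass criterion
            return True
        print(f"\nTutor: {tutor_turn}")
        messages.append({"role": "assistant", "content": tutor_turn})
        messages.append({"role": "user", "content": input("You: ")})
    return False
```

In practice you'd want a real interface, logging, and a much more careful rubric than a single "PASS" token, but the basic conversational loop really is this simple.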

Wednesday, July 23, 2025

Book review: Elusive Cures

I'm normally an avid fiction reader, but this summer I've been on a non-fiction kick. I just finished listening to Nicole Rust's new book, Elusive Cures. The premise of the book is the simple, important question: why haven't we made more progress on understanding brain disorders using basic neuroscience? Rust's argument is that the kind of "domino chain" causal model that we use to understand many neural systems is simply mismatched to the nature of how complex systems work. Rust is a cognitive neuroscientist who is known for her work on vision and memory, but she does not lean on these areas in the book, instead broadly surveying the neuroscience of disorders including Alzheimer's, Parkinson's, and depression.

Although I'm mostly a cognitive scientist these days, Rust's description of the forward causal model for neuroscience immediately felt familiar from my grad school neuroscience training. These kinds of causal systems are the ones we've made the most progress on in cognition as well: we have pretty strong models of how visual object recognition, reading, and language processing unfold in time. In contrast, processes that unfold interactively over time, such as mood, are much harder to understand this way.

I have often been skeptical of the application of complex dynamical systems theory to cognition, though Rick Dale's nice intro for the Open Encyclopedia of Cognitive Science did win me over somewhat. I agree that cognition is a complex dynamical system, but in practice such formalisms can often feel unconstrained. Many researchers using dynamical systems theories don't – for whatever reason – engage in the kind of systematic model comparison and evaluation that I believe is critical for cognitive modeling.

I heard that same kind of skepticism in Rust's own writing, which made it even more compelling when she made the case for the critical importance of understanding the brain as a complex dynamical system. Her discussion of the role of homeostasis in brain systems in particular was inspiring. It made me wonder why we don't apply the concept of homeostasis more to reason about social systems as well – for example, how communities maintain their educational standards in the face of policy changes or interventions. It's always a pleasure when a book sparks this kind of reflection. 

In sum, I strongly recommend Elusive Cures. I found it thought-provoking and broad, with good descriptions of both individual research findings and sweeping trends. The book feels like the rare "popular" book that also effectively makes a forceful scientific argument. 

Wednesday, June 11, 2025

Two summer book recommendations

After a long stint primarily reading fiction, I've been on a non-fiction kick recently and just read two books that I would definitely recommend!

Persuasion in Parallel (2022) by Alexander Coppock, a political scientist at Yale, is a scholarly monograph on how political persuasion works. It's a delightful combination of large-scale replications, strong emphasis on effect estimation and causal inference, and really thoughtful discussion of mechanisms. It starts from a replication and re-analysis of Lord, Ross, and Lepper (1979), the seminal work on biased assimilation and attitude polarization, and goes on to replicate a whole host of more recent studies. Across all of them, the key take-home is that arguments about controversial topics (e.g., gun control, abortion) operate very similarly across people with very different views: they cause small changes in attitude in the direction of the arguments, regardless of the recipient's initial views.

In statistical terms, there's very limited heterogeneity across groups in the effect of persuasive arguments. I really appreciated the evidence on this heterogeneity question because students' intuition in psychology is often that everything differs based on sociodemographic characteristics, yet this intuition is rarely quantified or challenged. Coppock's analyses take a really important step in this direction.
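To make the heterogeneity idea concrete, here's a toy simulation (with made-up numbers, not Coppock's data or analysis): if persuasion really works "in parallel," the treatment-by-group interaction in a simple regression should hover around zero even when the groups differ a lot in their baseline attitudes.

```python
# Toy illustration of (a lack of) treatment effect heterogeneity.
# Simulated data with made-up effect sizes, not Coppock's analysis.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 4000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),   # saw a persuasive argument?
    "group": rng.integers(0, 2, n),     # e.g., prior support for the policy
})
# "Parallel" persuasion: both groups shift by the same small amount (+0.2 SD),
# even though the groups differ in their baseline attitudes.
df["attitude"] = (0.5 * df["group"] + 0.2 * df["treated"]
                  + rng.normal(0, 1, n))

fit = smf.ols("attitude ~ treated * group", data=df).fit()
print(fit.summary().tables[1])
# The treated:group interaction is the heterogeneity estimate; under parallel
# persuasion it should be near zero, while the main effect of treated is ~0.2.
```

Coppock's actual analyses are far more careful than this, of course, but the logic of estimating an interaction (or its absence) is the same.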

The book is short and quite readable (especially given how data-rich it is), and it's very up front about the limitations of the work. There's also a thought-provoking final chapter on Bayesian inference and rational models of belief change that makes a number of connections to computational cognitive science that I enjoyed. Despite being an academic, I am not the sort of person who will sit down on a weekend with a monograph from another discipline for fun; this book was an exception for me because of how interesting, important, and thorough the work is.

On a heavier note, Doctored (2025), by Charles Piller (an investigative reporter with Science), is an exposé of scientific misconduct in Alzheimer's research. I'm intimately familiar with replication issues in psychology, but I was still totally horrified to read about the impacts of scientific fraud in the Alzheimer's field. Piller makes a very well-researched and thorough case, working with experts on fraud and scientific reviewers. While a critique of the book by an Alzheimer's authority questions how central the fraudulent work was to the field (Lancet review), I was convinced by the later chapters of Doctored that show how pervasive image falsification has been within the Alzheimer's research enterprise. It's just awful to think that many people have been in dangerous clinical trials due to research misconduct.

The book was clearly written very fast, as there is some redundancy between chapters and a bit of unnecessary stage-setting around various researchers' grandparents (perhaps reflecting a pivot from an earlier vision of the book in which certain people were more central to the narrative). But the substance of the scientific critique is so compelling – and honestly terrifying – that I was more than happy to overlook a few minor weaknesses in the prose. Definitely recommend.

Tuesday, December 3, 2024

Four papers I'm sad never to have published

One of the saddest things in academic research is an abandoned project. You pour time, effort, and sometimes money into a piece of research, only to see it never released into the world to make an impact. Sometimes you don't finish an analysis or write up the paper. But I would argue that the saddest cases are the projects that came closest to being published – the "near misses."*

This sadness can also have practical consequences. If we abandon projects differentially because of their results – failing to report negative findings because of a belief that they would be uninteresting or hard to publish – then we get a bias in the published literature. We know this is true, but in this post I'm not going to focus on that; I'm thinking more about inadvertent near misses. The open science movement – and in particular the rise of preprints – has changed the field a lot in that these near misses are now at least visible. So I'm writing this post in part to promote and discuss four projects that never saw journal publication but that I still love...

I'm a researcher but I'm also (maybe primarily) an advisor and mentor, and so this kind of thing happens all the time: a trainee comes into my lab, does a great project, writes a paper about it, and then moves on to a new position. Sometimes they stay in academia, sometimes they don't. Even if we submit the manuscript before they leave, however, it frequently happens that reviews come back once they are already absorbed in the next stage of their life. Unless I take over the writing process, things typically remain unpublished.

But the worst thing is when I abandon my own work because I'm too busy doing all that advising and teaching (and also getting grants to do the next shiny thing). Sadly this has happened many times over the past 15 years or so that I've been a faculty member. I simply didn't have the fortitude to get the paper through peer review and so it lingers as something interesting but unrevised – and perhaps fatally flawed (depending on whether you trust the reviewers). Here are my four biggest regrets. 

1. A literature review on computational models of early language learning. This was initially the first chapter of my dissertation, and I revised it for a review journal, hoping to do something like Pinker's famous early review paper. It was reviewed by two people, one nativist and one empiricist. Both hated it, and I abandoned it in despair. I still like what I wrote, but it's very out of date now.

2. A huge dataset on children's free viewing of naturalistic third-person dialogue and how it relates to their word learning. I loved this one. These experiments were my very first projects when I got to Stanford – we collected hundreds of kids' worth of eye-tracking data (with an eye-tracker bought with my very first grant), and we were able to show correlational relationships between free viewing and word learning. We even saw a similar relationship in kids on the autism spectrum. This paper was rejected several times from good journals for reasonable reasons (too correlational, kids with ASD were not well characterized). But I think it has a lot of value. (The data are now in Peekbank, at least.)

(Graph showing big developmental differences in free viewing, specifically for a moment at which you had to follow an actor's gaze to see what they were talking about in the video).

3. A large set of experiments on reference games. Noah Goodman and I created the Rational Speech Act (RSA) model of pragmatic processing, and this was a big part of my early research at Stanford. I spent a ton of time and money running Mechanical Turk experiments to try to learn more about the nature of the model. This manuscript includes a lot of methodological work on paradigms for studying pragmatic inference online, as well as some clever scenarios to probe the limits (there were 10 experiments overall!). Sadly, I think I tried to make the manuscript more definitive than it should have been – by the time I finally submitted it, RSA already had many variants, and some of the formal work was not as strong as the empirical side. So reviewers who disliked RSA disliked it, and reviewers who liked RSA still thought it needed work.

4. A simplified formal model of teaching and learning. This one was an extension of the RSA model for teaching and learning scenarios, trying to get a handle on how teachers might change their messages based on the prior beliefs and/or knowledge of the learners. I was really proud of it, and it shapes my thinking about the dynamics of teaching to this day. Lawrence Liu started the project, but I did a ton more analysis several years later in hopes of making a full paper. Sadly, it was rejected once – reviewers thought, perhaps reasonably, that the policy implications were too big a stretch. By the time I submitted it to another journal, a bunch of other related formal work had appeared in the computer science literature. Reviewers the second time asked for more simulations, but I was out of time and the code had gotten quite stale because it depended on a very specific tech stack. 

I hope someone gets a little pleasure or knowledge from these pieces. I loved working on all four of them!

---- 

* I just learned that there is a whole literature on the psychology of near misses, for example in gambling or with respect to emotions like relief and regret.

Some thoughts on ManyBabies 4

[repost from Bluesky]

Three ManyBabies projects - big collaborative replications of infancy phenomena - wrapped up this year. The first paper came out this fall. I thought I'd take this chance to comment on what I make of the non-replication result.

https://onlinelibrary.wiley.com/doi/full/10.1111/desc.13581

First off: this study was a SUCCESS! We got the community together to plan a replication study, and then we got 37 labs and 1000 babies to do a complicated study, and we pulled it off. That's a huge win for team science! Major kudos to Kelsey, Francis, and Kiley.

In case you're wondering about the status of the other projects, here's a summary slide (already shared right after ICIS). MB3 and MB4 yielded null effects; MB2 is complicated... the preliminary analysis shows the predicted effect, but an even bigger, unpredicted effect in the control condition.


Turning back to MB4: we were interested in the classic "helper-hinderer" phenomenon. In these studies, babies have been shown to choose an object that "helps" over one that "hinders" a third one. A nice meta-analysis by Margoni & Surian (2018) confirms that this effect is variable across labs but has been found quite a lot of times. Data from this MA and an update by Alvin Tan are on metalab: langcog.github.io/metalab/. MB4 ran a "straightforward" best-practices replication, but with standardized video displays and both a social and a non-social condition. Overall, there were no preferences for helpers or hinderers at any age or in either condition.

So what's going on? Well, the initial success (and the various replications in the meta-analysis) could have been false positives, or could have contained some confound leading to success. Or there might be some key difference in the replication leading babies to fail in this particular version. There are other possibilities (bad implementation or bad measurement, for example), but I think these are less likely, given the general care that was taken in the project and the large sample size, which allows detection of effects much smaller than the original effect.
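As a rough back-of-the-envelope illustration of that last point (the numbers here are placeholders, not MB4's actual design or power analysis): with a simple binomial test of helper choices against chance, a pooled sample of 1000 infants can detect preferences far weaker than a typical small single-lab sample can.

```python
# Back-of-the-envelope: what helper preference (vs. chance = .5) can a
# one-sided exact binomial test detect at ~80% power? Illustrative numbers
# only; not the actual MB4 design or power analysis.
from scipy.stats import binom

def power(p: float, n: int, alpha: float = 0.05) -> float:
    """Power to detect a true helper-choice rate p against H0: p = 0.5."""
    crit = binom.isf(alpha, n, 0.5) + 1   # smallest count that rejects H0
    return binom.sf(crit - 1, n, p)       # P(count >= crit | true rate p)

for n in [20, 1000]:  # a hypothetical single-lab sample vs. the pooled sample
    detectable = next(p / 100 for p in range(51, 100)
                      if power(p / 100, n) >= 0.8)
    print(f"n = {n:4d}: ~80% power to detect a preference of about {detectable:.2f}")
```

The exact numbers depend on the design and analysis, but the point stands: a null with this much data rules out anything close to the originally reported effect size.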

Some people will jump to the interpretation that this study shows that the original finding was incorrect (and hence that the other replications were incorrect as well, and the earlier non-replications were right). This is one possibility – but we shouldn't be so quick to jump to conclusions. Another possibility is that the *particular* instantiation of helper-hinderer in MB4 is just not a good one. Maybe the stimuli are too fast, for example (some people have suggested this explanation). For all the size of the participant sample in MB4, it includes just a *single* stimulus sample.

In collaborative replication projects, I have an increasing appreciation of Tal Yarkoni's point about the critical need to sample stimuli (and paradigms) from the broader space in order to achieve generalizability. Any one stimulus or paradigm can be idiosyncratic. In a recent paper, Holzmeister et al. break heterogeneity down into population, procedural, and analytic heterogeneity. They find that population heterogeneity is low, but that procedural and (likely) analytic heterogeneity are very high across various multi-lab studies. That conclusion fits with what we saw in ManyBabies 1, where procedure really did matter – different methods yielded quite different effect sizes – but population didn't seem to matter as much, modulo known moderators like age and native language.

A very reasonable alternative interpretation of MB4 – instead of the false-positive interpretation – is that we simply do not know *how* to elicit the helper-hinderer effect reliably, even if it is real. This "stimulus variability" explanation is not a very positive conclusion either: lots of experts in the field sat around and tried to create a paradigm to elicit this finding, and failed. At best, it means that we as a field don't have good processes for finding stimuli that elicit particular effects. Still, the stimulus variability explanation is really different from saying that the original phenomenon is a false positive, and I think we need to keep both explanations on the table at the moment, as uncomfortable as that may be.

In sum, I'm really enthusiastic about MB4. It's a key success for team science in infancy research, and it's also a valuable datapoint for understanding the helper-hinderer phenomenon. It's just not the end of the story...

PS: I think everyone should give HUGE props to Kiley Hamlin for pursuing this project to the end with massive dedication and openness to the result, even though it calls into question some of her previous work. That is what I call true scientific bravery.

Monday, March 27, 2023

Domain-specific data repositories for better data sharing in psychology!

Data sharing is a critical part of ensuring a reproducible and robust research literature. It's also increasingly the law of the land, with new federal mandates taking effect in the US this year. How should psychologists and other behavioral scientists share their data? 

Repositories should clearly be FAIR – findable, accessible, interoperable, and reusable. But here's the thing: most data on a FAIR repository like the Open Science Framework (which is great, btw) will never be reused. They're findable and accessible, but they're not really interoperable or reusable. The problem is that most psychological data are measurements of some stuff in some experimental context. The measures we use are all over the place. We do not standardize our measures, let alone our manipulations. The metadata are comprehensible but not machine-readable. And there is no universal ontology that lets someone say "I want all the measurements of self-regulation in children that are posted on OSF."

What makes a dataset reusable really depends on the particular constructs it measures, which in turn depend on the subfield and community the data are being collected for. When I want to reuse data, I don't want data in general. I want data about a specific construct, from a specific instrument, with metadata particular to my use case. Such data should be stored in repositories specific to that measure, construct, or instrument. Let's call these Domain-Specific Data Repositories (DSDRs). DSDRs are a way to make sure data actually are interoperable and actually do get reused by the target community.
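To make this concrete, here's a toy sketch of what construct-level, machine-readable metadata for a DSDR entry could look like. The schema, field names, and example records are invented for illustration; they are not an existing standard.

```python
# Toy sketch of construct-level metadata for a DSDR entry. The schema and
# example records are invented for illustration, not an existing standard.
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    doi: str            # persistent identifier for the deposited data
    construct: str      # e.g., "self-regulation", from a shared vocabulary
    instrument: str     # the specific measure or task used
    population: str     # e.g., "children 4-6y"
    n_participants: int
    units: str          # what one row of the data represents

records = [
    DatasetRecord("10.xxxx/demo1", "self-regulation", "HTKS task",
                  "children 4-6y", 212, "trial-level accuracy"),
    DatasetRecord("10.xxxx/demo2", "vocabulary", "MacArthur-Bates CDI",
                  "children 1-3y", 1450, "per-word parent report"),
]

# The payoff of shared, construct-level fields: a query like "all the
# measurements of self-regulation in children" becomes a one-liner.
hits = [r for r in records if r.construct == "self-regulation"
        and r.population.startswith("children")]
print(hits)
```

The point isn't these particular fields; it's that agreeing on some shared, construct-level vocabulary within a community is what makes queries like this possible at all.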