tag:blogger.com,1999:blog-42972429174190892612024-03-16T11:50:11.999-07:00Babies Learning LanguageThoughts on language learning, child development, and fatherhood; experimental methods, reproducibility, and open science; theoretical musings on cognitive science more broadly. Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.comBlogger115125tag:blogger.com,1999:blog-4297242917419089261.post-34637504612583957792023-03-27T13:14:00.013-07:002023-04-03T10:14:47.877-07:00Domain-specific data repositories for better data sharing in psychology! <div>Data sharing is a critical part of ensuring a reproducible and robust research literature. It's also increasingly the law of the land, with <a href="https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html">new federal mandates</a> taking effect in the US this year. How should psychologists and other behavioral scientists share their data? </div><div><br /></div><div>Repositories should clearly be <a href="https://www.go-fair.org/fair-principles/">FAIR</a> - findable, accessible, interoperable, and reusable. But here's the thing - most data on a FAIR repository like the Open Science Framework (which is great, btw), will <i>never be reused. </i>It's findable and accessible, but it's not really interoperable or reusable. The problem is that most psychological data are measurements of some stuff in some experimental context. <a href="https://journals.sagepub.com/doi/full/10.1177/2515245920952393">The measures we use are all over the place</a>. We do not standardize our measures, let alone our manipulations. The metadata are comprehensible but not machine readable. And there is no universal ontology that lets someone say "I want all the measurements of self-regulation on children that are posted on OSF." </div><div><br /></div><div>What makes a dataset reusable really depends on the particular constructs that it measures, which in turn depends on the subfield and community those data are being collected for. When I want to reuse data, I don't want data <i>in general</i>. I want data <i>about</i> a specific construct, <i>from </i>a specific instrument, with metadata particular to my use field. Such should be stored in repositories specific to that measure, construct, or instrument. Let's call these <u><i>Domain Specific Data Repositories</i></u> (DSDRs). DSDRs are a way to make sure data actually are interoperable and actually do get reused by the target community.</div><span><a name='more'></a></span><div><br /></div><h4 style="text-align: left;">Put data in DSDRs</h4><div>Suppose I'm doing a project on executive function in early childhood. Wouldn't it be nice if I could download raw or aggregated data from the various tasks that people had used to measure executive function? Or suppose I'm now interested in complex sentence structure and psycholinguistics. Wouldn't it be nice to be able to download data from the hundreds of experiments on word-by-word reading time for sentences of different types? Data on both these questions exist, but they are spread out across repositories for individual papers, formatted differently in every case. Putting together more than one dataset is typically a nightmare of data harmonization and meta-data guesswork. </div><div><br /></div><div>Neuroimaging folks get this. You don't post fMRI images to Zenodo or OSF or another repository of this type. You post them to <a href="https://openneuro.org">OpenNeuro</a> - a domain-specific repository for neuroimaging. 
fMRI data have specific standards for metadata and particular affordances in terms of preprocessing, aggregation, and analysis. OpenNeuro is designed around these ideas. </div><div><br /></div><div>Similarly, the <a href="https://childes.talkbank.org">Child Language Data Exchange System</a> (CHILDES) has known this basic fact for years. They established a common schema for transcripts of parent-child conversations (the CHAT standard). Now everyone in the field of child language posts their data to CHILDES in this format, and so when you want to learn about kids' use of the word "and", you can search <i>every</i> major transcribed corpus of child language in a single archive. My group has done the same kind of thing with data around children's vocabulary, with <a href="http://wordbank.stanford.edu">Wordbank</a> archiving parent reports about child language from dozens of languages and tens of thousands of kids. </div><div><br /></div><div>To make high-value, reusable datasets, it is critical to aggregate the data around a common data standard that is specific to a particular instrument or construct, and that connects with the agenda of a particular research community. These tools can even help catalyze research communities to work together around a shared agenda. They can also increase data quality by putting into place domain-specific quality controls.</div><div> </div><h4 style="text-align: left;">We need more DSDRs</h4><div>The trouble is, making these domain-specific repositories is expensive and complicated. We've now made four: <a href="http://wordbank.stanford.edu">Wordbank</a>, <a href="http://childes-db.stanford.edu">childes-db</a>, <a href="http://peekbank.stanford.edu">Peekbank</a>, and <a href="http://metalab.stanford.edu">Metalab</a>. Each of them has their own web hosting framework (similar but different) as well as their own underlying database schema, visualization apps, and application programming interface (API) for downloading the data. Even though they are structurally similar, they are not the same, and each was made as a one-off. </div><div><br /></div><div>As a result, we now struggle under the burden of maintaining and updating these repositories, and it's not likely we can do too many more without abandoning some of them. Every time one breaks, I get lots of email. Every year I have to beg RStudio (now Posit) for free licenses to keep our visualization interface going. And it goes without saying that there is no funding for long-term maintenance of such repositories.</div><div><br /></div><div>But maybe we could automate and centralize the construction of such repositories and host them jointly in the cloud, rather than creating wholly separate resources each time. 
</div><div><br /></div><div>At the core of each repository is a database schema, like the schema for Peekbank:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigFrcZKirjxUoT6wuy-I2WKI5v1aO0O8mUYe2Q5ZPoNr-BzjFu3IZEpNwJ0QQnoJ_GB4PxSiNWNjUsb1Piqf6QSd75DzmAcbr-IHvCKwYnitLyBGgU_pOVF_AE2TStja8F16t2z2VX1udJ_cAhzvJ0vjMOfqq7hpkUyHc_l7CXRfPIotEBYmI_gYOh1g/s1117/schema_3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1012" data-original-width="1117" height="290" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigFrcZKirjxUoT6wuy-I2WKI5v1aO0O8mUYe2Q5ZPoNr-BzjFu3IZEpNwJ0QQnoJ_GB4PxSiNWNjUsb1Piqf6QSd75DzmAcbr-IHvCKwYnitLyBGgU_pOVF_AE2TStja8F16t2z2VX1udJ_cAhzvJ0vjMOfqq7hpkUyHc_l7CXRfPIotEBYmI_gYOh1g/s320/schema_3.png" width="320" /></a></div><br /><div>Designing this kind of schema requires a clear understanding of the ontology for the kind of data you want to archive – it's surprisingly tricky (the Peekbank one took us many meetings over several years!). But once you have such a schema, it is straightforward to create an API to get data out of a database with this schema. And with a good API, it is surprisingly easy to define visualizations of data in the schema. People are often surprised that the interactive visualizations in something like Wordbank are the easy part!</div><div><br /></div><div>The only pain point is importing new datasets into the schema – typically this work requires writing custom data-munging code for each dataset to define the relation between the incoming data format and the specific tables required in the schema. For Wordbank we even defined an intermediate abstract layer for defining the mapping between incoming data and our schema. </div><div><br /></div><div>In principle, all of this work could be wrapped in a sufficiently general framework to make it unnecessary to create a custom hosting solution. Each database could be an instance of a broader database type, or even inside a giant wrapper database. And each API could be generated automatically from the database schema. You could even imagine a world where these DSDRs were created automatically out of an app like AirTable. There's some serious design work to do to describe the scope of such a system, but it is certainly not out of the realm of possibility.</div><div><br /></div><h4 style="text-align: left;">Challenges</h4><div style="text-align: left;">We have some work to do to make DSDRs like Wordbank the norm. At a minimum, we need:</div><div style="text-align: left;"><ul style="text-align: left;"><li>Credit assignment: robust norms for giving contributors credit when their data are used. At the moment, Wordbank and CHILDES simply ask folks to cite the contributors' paper (e.g., <a href="http://wordbank.stanford.edu/contributors">http://wordbank.stanford.edu/contributors</a>) but in the long term, datasets should have DOIs that are downloaded with the data and associated to the paper DOI automagically.</li><li>Dataset use tracking: repositories also need DOIs and methods for tracking their use and impact beyond citations of papers about the repository, which are often out of date and which split impact across multiple products.</li><li>Effective data versioning solutions: we need easy tools for using historical snapshots of repositories so that analyses of DSDR data are reproducible. 
We have hand engineered this for some of our DSDRs, but we need to be able to roll out this functionality with limited extra effort. Right now some key repositories like CHILDES have no accessible version control, meaning analyses can break down the line and users will not know why. </li><li>Mechanisms for ensuring the longevity of DSDRs: we need to ensure that DSDRs don't just rely on single investigators for maintenance and updates, perhaps through partnerships with libraries and cloud providers.</li></ul></div><p style="text-align: left;">There's a lot to do.</p><h4 style="text-align: left;">Conclusion</h4><div>Lost in many discussions of data sharing is that data shared in individual packages fosters reproducibility but often not interoperability and reuse. <b>Reuse comes when data are organized around specific disciplinary constructs, frameworks, and measurements</b>. And reuse value grows further as the size and diversity of the datasets in a domain-specific repository increase. We need more of domain-specific data repositories to catalyze research communities, especially in smaller fields where no such data resource exists. To create these, we will need new technical tools for rapidly and sustainably spinning up new repositories. These tools should be a development priority.</div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-37767989093515030802023-03-27T09:37:00.004-07:002023-03-27T09:37:34.623-07:00Why do LLMs learn so much slower than humans? <div>[<a href="https://twitter.com/mcxfrank/status/1640379247373197313">repost from twitter</a>]</div><div><br /></div>How do we compare the scale of language learning input for large language models vs. humans? I've been trying to come to grips with recent progress in AI. Let me explain two illustrations I made to help.<br /><br />Recent progress in AI is truly astonishing, though somewhat hard to interpret. I don't want to reiterate recent discussion, but <a href="https://twitter.com/spiantado">@spiantado</a> has a good take in the first part of <a href="https://lingbuzz.net/lingbuzz/007180">lingbuzz.net/lingbuzz/007180</a>; l like this thoughtful piece by <a href="https://twitter.com/MelMitchell1">@MelMitchell1</a> as well: <a href="https://www.pnas.org/doi/10.1073/pnas.2300963120">https://www.pnas.org/doi/10.1073/pnas.2300963120</a>.<br /><br />Many caveats still apply. LLMs are far from perfect, and I am still struggling with their immediate and eventual impacts on science (see <a href="https://twitter.com/mcxfrank/status/1638589956238225408">prior thread</a>). My goal in the current thread is to think about them as cognitive artifacts instead. <br /><br />For cognitive scientists interested in the emergence of intelligent behavior, LLMs suggest that some wide range of interesting adaptive behaviors can emerge given enough scale. Obviously, there's huge debate over what counts as intelligent, and I'm not going to solve that here. <div><br />But: for my money, we start seeing *really* interesting behaviors at the scale of GPT3. Prompting for few shot tasks felt radically unexpected and new, and suggested task abstractions underlying conditional language generation. At what scale do you see this? </div><div><br />GPT-3 was trained on 500 billion tokens (= .75 words). So that gives us ~4e11 words. PaLM and Chinchilla are both trained on around 1e12 words. We don't know the corpus size for GP4-4 (!?!). How do these numbers compare with humans? 
</div><div><br />Let’s start with an upper bound. A convenient approximation is 1e6 words per month for an upper bound on spoken language to a kid (<a href="https://arxiv.org/pdf/1607.08723.pdf">arxiv.org/pdf/1607.08723…</a>, appendix A or <a href="https://www.pnas.org/doi/abs/10.1073/pnas.1419773112">pnas.org/doi/abs/10.107…</a>). That's 2e8 words for a 20 year old. How much could they read?<br /><br />Assume they start reading when they’re 10, and read a 1e5-word book/week. That’s an extra 5e6 million words per year. Double that to be safe and it still only gets us to 3e8 words over 10 years. <br />Now let's do a rough lower bound. Maybe 1e5 words per month for kids growing up in a low-SES environment with limited speech to children (<a href="https://onlinelibrary.wiley.com/doi/epdf/10.1111/desc.12724">onlinelibrary.wiley.com/doi/epdf/10.11…</a>). We don't get much of a literacy boost. So that gives us 5e6 by age 5 and 2e7 by age 20. </div><div><br /></div><div>That "lower bound" five year old can still reason about novel tasks based on verbal instructions - especially once they start kindergarten! </div><div><br /></div><div>The take-home here is that we are off by 4-5 orders of input magnitude in the emergence of adaptive behaviors.<br /></div><div><div style="text-align: center;"><a href="https://pbs.twimg.com/media/FsPMQppaAAANvQX.jpg"><img height="299" src="https://pbs.twimg.com/media/FsPMQppaAAANvQX.jpg" width="400" /></a></div><div><br /></div><div><div><br /></div><div>The big cognitive science question is - which factors account for that gap? I'll think about four broad ones. </div></div><div><br /></div>Factor 1: innate knowledge. Humans have SOME innate perceptual and/or conceptual foundation. The strongest version posits "core knowledge" of objects, agents, events, sets, etc. which serve to bootstrap further learning. People disagree about whether this is true. <br /><br /></div><div>Factor 2: multi-modal grounding. Human language input is (often) grounded in one or more perceptual modalities, especially for young children. This grounding connects language to rich information for world models that can be used for broader reasoning. <br /><br />Factor 3: active, social learning. Humans learn language in interactive social situations, typically curricularized to some degree by the adults around them. After a few years, they use conversation to elicit information relevant to them. <br /><br /></div><div>Factor 4: evaluation differences. We're expecting chatGPT to reason about/with all the internet's knowledge, and a five year old just understand a single novel theory of mind or causal reasoning task. Is comparison even possible? <br /><br /></div><div>S<span style="text-align: center;">o of course I don't know the answer! But here are a few scenarios for thinking this through. Scenario 1 is classic nativist dev psych: innate endowment plus input make the difference. You use core knowledge to bootstrap concepts from your experience. </span></div><div><div style="text-align: center;"><br /></div><div style="text-align: center;"><a href="https://pbs.twimg.com/media/FsPMRwVakAAcOgk.jpg"><img height="340" src="https://pbs.twimg.com/media/FsPMRwVakAAcOgk.jpg" width="400" /></a></div><div><br /></div>Scenario 2 is more like modern rational constructivism. Grounded experience plus a bunch of active and social learning allow kids to learn about the structure of the world even with limited innate knowledge. 
<br /><br /></div><div>I hear more about Scenario 3 in the AI community - once we ground these models in perceptual input, it's going to be easier for them to do common-sense reasoning with less data. And finally, of course, we could just be all wrong about the evaluation (Scenario 4). <br /><br /></div><div>As I said, I don't know the answer. But this set of questions is precisely why challenges like BabyLM are so important (<a href="https://babylm.github.io/">babylm.github.io</a>).</div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-43917739262834189122023-03-27T09:32:00.002-07:002023-03-27T09:32:21.002-07:00AI for psychology workflows hackathon - a report[<a href="https://twitter.com/mcxfrank/status/1638589956238225408">reposted from twitter</a>]<br /><br /> My lab held a hackathon yesterday to play with places where large language models could help us with our research in cognitive science. The mandate was, "how can these models help us do what we do, but better and faster."<br /><br />Some impressions:🧵 <br /><br /><div>Whatever their flaws, chat-based LLMs are astonishing. My kids and I used ChatGPT to write birthday poems for their grandma. I would have bet money against this being possible even ten years ago. <br /><br />But can they be used to improve research in cognitive science and psychology? <br /><br /></div><div>1. Using chat-based agents to retrieve factual knowledge is not effective. They are not trained for this and they do it poorly (the "hallucination problem"). Ask ChatGPT for a scientist bio, and the result will be similar but with random swaps of institutions, dates, facts, etc. <br /><br /></div><div>2. A new generation of retrieval-based agents are on their way but not here yet. These will have a true memory where they can look up individual articles, events, or entities rather than predicting general gestalts. Bing and Bard might be like this some day, but they aren't now. <br /><br /></div><div>3. Chat-based agents can accomplish pretty remarkable text formatting and analysis, which has applications in literature reading and data munging. E.g., they can pull out design characteristics from scientific papers, reformat numbers from tables, etc. Cool opportunities. These functions are critically dependent on long prompt windows. Despite GPT-4's notionally long prompt length, in practice we couldn't get more than 1.5k tokens consistently. That meant that pre-parsing inputs was critical, and this took too much manual work to be very useful. </div><div><br />4. A massive weakness for scientific use is that cutting-edge agents cannot easily be placed in a reproducible scientific pipeline. Pasting pasting text into a window is not a viable route for science. You can get API access but without random seeds, this is not enough. (We got a huge object lesson in this reproducibility issue yesterday when OpenAI declared that they are retiring Codex, a model that is the foundation of a large number of pieces of work on code generation in the past year. This shouldn't happen to our scientific workflows.) Of course we could download Alpaca or some other open model, set it up, and run it as part of a pipeline. But we are cognitive scientists, not LLM engineers. We don't want to do that just to make our data munging slightly easier! <br /><br /></div><div>5. Chat agents are not that helpful in breaking new ground. 
The problem is that, if you don't know the solution for a problem, then you can't tell whether the AI did it right, or even is going in the right direction! Instead, the primary use case seems to be helping people accomplish tasks they *already know how to do*, but to do them more effectively and faster. If you can check the answer, then the AI can produce a candidate answer to check. <br /><br /></div><div>6. It was very easy for us to come up with one-off use-cases that could be very helpful (e.g., help me debug this function, help me write this report or letter), and surprisingly hard to come up with cases that could benefit with creating automated workflows. At small scale, using chat AI to automate research tasks is trading one task (e.g., annotating data) for more menial and annoying ones (prompt engineering and data reformatting so that the AI can process it). This is ok for large problems, but not small and medium ones. <br /><br /></div><div>7. Confidence rating is a critical functionality that we couldn't automate reliably. We need AI to tell us when a particular output is low confidence so that it can be rechecked. <br /><br /></div><div>In sum: Chat AI is going to help us be faster at many tasks we already know how to do, and there are a few interesting scientific automation applications that we found. But for LLMs to change our research, we need better engineering around reliability and reproducibility. </div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-21364761737112966922023-02-16T16:37:00.007-08:002023-02-16T16:37:46.819-08:00Why do hybrid meetings suck? <p>I tried rendering this post in Quarto, which is not blogger-compatible, but I'm including the link here: <a href="http://rpubs.com/mcfrank/hybrid">rpubs.com/mcfrank/hybrid</a>.</p>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-91145383143398804002021-02-21T20:31:00.007-08:002021-02-22T11:11:08.531-08:00Methodological reforms, or, If we all want the same things, why can't we be friends?<p> <i>(tl;dr: "Ugh, can't we just get along?!" OR "aspirational reform meet actual policy?" OR "whither metascience?")</i></p><br />This post started out as a thread about the tribes of methodological reform in psychology, all of whom I respect and admire. Then it got too long, so it became a blogpost. <br /><br />As folks might know, I think methodological reform in psychology is critical (some of <a href="https://psyarxiv.com/27b43/">my views</a> have been formed by my work with the ManyBabies consortium). For the last ~2 years, I've been watching two loose groups of methodological reformers get mad at each other. It has made me very sad to see these conflicts because I like all of the folks involved. I've actually felt like I've had to take a twitter holiday several times because I can't stand to see some of my favorite folks on the platform yelling at each other. <div><br /></div><div>This post is my - perhaps misguided - attempt to express appreciation for everyone involved and try to spell out some common ground.</div><span><a name='more'></a></span><div><br /><h3>What do the centrists and the radicals think?</h3><br />One thread that catalyzed my thinking about this discussion was the "far left" and "center left" comparison that Charlie Ebersole proposed. Following that thread, I'll call these groups the centrists and the radicals. 
</div><div><br /><blockquote class="twitter-tweet"><p dir="ltr" lang="en">I'm definitely not the first to notice this, but it bears repeating: The gender imbalance between prominent "mainstream" open science folks and those critiquing it from the methodological "left" is striking and concerning. 1/3</p>— Charlie Ebersole (@CharlieEbersole) <a href="https://twitter.com/CharlieEbersole/status/1355231317001199617?ref_src=twsrc%5Etfw">January 29, 2021</a></blockquote></div><div><br />Centrist reforms are things like preregistration, transparency guidelines, and tweaks to hypothesis testing (e.g., p-value thresholds, equivalence testing, or Bayesian hypothesis testing). There's no consensus "platform" for reforms, but a <a href="https://psyarxiv.com/ksfvq/">recent review</a> summarizes the state of things quite well. Just to be clear, a number of authors of this article are collaborators and friends, and I think it's on the whole a really good article.</div><div><br /><blockquote class="twitter-tweet"><p dir="ltr" lang="en">10 years of replication and reform in psychology. What has been done and learned?<br /><br />Our latest paper prepared for the Annual Review summarizes the advances in conducting and understanding replication and the reform movement that has spawned around it.<a href="https://t.co/i5GQRPGzIa">https://t.co/i5GQRPGzIa</a> <br /><br />1/ <a href="https://t.co/yIYzUCaGE0">pic.twitter.com/yIYzUCaGE0</a></p>— Brian Nosek (@BrianNosek) <a href="https://twitter.com/BrianNosek/status/1359118772972507143?ref_src=twsrc%5Etfw">February 9, 2021</a></blockquote><div><br /></div>In contrast to the centrists, radicals start with the critical importance of theory building, often via computational models. On this view, no matter how well planned a test is, if it's not posed as part of a comparison of theories, you are playing 20 questions with nature (<a href="http://chil.rice.edu/tambo/teaching/psyc101GL/Newell%20%281973%29.pdf">as Newell said</a>), and you probably won't win. Here's a nice guide to some of the work in this tradition:</div><div> <blockquote class="twitter-tweet"><p dir="ltr" lang="en">I want to highlight some non-mainstream work on reproducibility, open science, replication crisis, meta-science by women. Reading and drawing from a diverse set of authors and ideas will help push this stream of work forward and help make science more open and inclusive.</p>— Berna Devezer (@zerdeve) <a href="https://twitter.com/zerdeve/status/1234906668036608000?ref_src=twsrc%5Etfw">March 3, 2020</a></blockquote><script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script><div><br /></div><div>In this debate, the rubber really hits the road in the discussion around preregistration. Preregistration is a critical part of centrist reforms (e.g., through registered reports) but is "redundant at best" in much of the more radical views (e.g., <a href="https://psyarxiv.com/wxn58">this really nice post by Danielle Navarro</a>).<br /><br /><h3>I'm a centrist and a radical</h3><br />Here's the thing. These views are not inconsistent! It's just that the implicit contexts of application are different. Centrists are trying to make <i>broad policy recommendations</i> for funders/journals/training programs; radicals are thinking about <i>ideal scientific structures</i>. Both viewpoints resonate with my personal experience. </div><div><br /></div><div>In my lab, I try to do science that conforms to the radical vision of ideal scientific structures! 
In much of my work, we do the kind of computational theory building that lets us make quantitative predictions in advance and test them using precise measurements. This kind paradigm obviates simple NHST p-values, though sometimes we include them anyway because reviewers. We do typically preregister this work though, to keep from fooling ourselves about our predictions. Here's an example:<br /><br /><blockquote class="twitter-tweet"><p dir="ltr" lang="en">Preregistration and iterative statistical modeling go hand in hand. [THREAD]<br /><br />I'll illustrate via a new preprint from my lab that I'm very excited about, "Polite speech emerges from competing social goals" (w/ <a href="https://twitter.com/EricaYoon4?ref_src=twsrc%5Etfw">@EricaYoon4</a>, <a href="https://twitter.com/mhtessler?ref_src=twsrc%5Etfw">@mhtessler</a>, and Noah Goodman): <a href="https://t.co/LvUf3Pecns">https://t.co/LvUf3Pecns</a> /1</p>— Michael C. Frank (@mcxfrank) <a href="https://twitter.com/mcxfrank/status/1064665177201631232?ref_src=twsrc%5Etfw">November 19, 2018</a></blockquote><script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script><br />On the other hand, I also teach experimental methods to psychology graduate students. In my teaching I'm much more of a centrist. In this context, I see lots of "garden variety" psych research on the topics that students are interested in. Much of it is not easily amenable to computational theory. (<a href="http://babieslearninglanguage.blogspot.com/2018/12/how-to-run-study-that-doesnt-replicate.html">Here's a sample of the perspective I've developed in that course</a>). <br /><br />From the radicals, there's lots of interest in computational theory building and some very nice guides/explainers (e.g. <a href="https://journals.sagepub.com/doi/full/10.1177/1745691620970585">this one by Olivia Guest and Andrea Martin</a>, EDIT: these authors are just trying to help people understand modeling and want to be clear that they feel there is a place for qualitative theory and don't subscribe to a "radical" position). The radical tradition is what I was trained in and what I do. I love this kind of work. But psych is a <i>VERY big place</i> (TM). It feels to me like hubris to say to a student who does educational mindsets work, or emotion regulation, or longitudinal development of racial identity – "don't even bother unless you have my kind of computational theory." Maybe that's not what they want as an outcome from their research, and maybe they are right and I am wrong!<br /><br />(As an aside: models and data go hand and hand, and it's not actually that clear to me that moving to computational theory is right in areas where there are no precise empirical measurements to explain. In 2013 I taught a fun class trying to make models of social behavior with Jamil Zaki and Noah Goodman. We made lots of models but had no reliable quantitative measurements to use to fit the models. So we had some pretty great computational theory – in my humble opinion – but we were still nowhere.)<br /><br />So based on these musings, in my experimental methods class, I make more minimal recommendations to the students. To evaluate the effect of an intervention, plan your sample size and preregister the statistical test. Don't p-hack. Go ahead and explore your data but don't pretend p-values from that exploration are a sound basis for strong conclusions. Try to make good plots of your raw data. 
Again, these sound pretty centrist, even though like I said, in my own lab I'm much more of a radical!<br /><br />The methodological practices that I recommend in class don't necessarily result in a robust body of theory. But at the same time, I have a strong conviction that they are a first step towards keeping people from tricking themselves while they stare at noise. Random promotion of noise to signal is rampant in the literature - we see it all the time when we try to replicate findings in class that are clearly the basis of post-hoc selection of significant p-values. So simply blocking this kind of noise promotion is an important first step. <br /><br /><h3>Contexts for everything</h3></div><div><br /></div><div>I'm arguing that one difference between centrists and radicals is what the context of the claim is. The centrist in me says: "it's really easy to tell NSF/NIH to add preregistration, sample size planning, and data sharing, to the merit review criteria (think <a href="http://clinicaltrials.gov">clinicaltrials.gov</a>)." In contrast, I don't think anyone would even know what you meant if you said: "all grants need to have sound computational theory."<br /><br />Danielle Navarro make the general case wonderfully in the piece I linked above: "advocating preregistration as a solution to p-hacking (or its Bayesian equivalent) is deeply misguided because we should never have been relying on these tools as a proxy for scientific inference." I basically agree with this point completely. <i>For my own research.</i><br /><br />But I'm <i>also</i> worried that applying this standard as a blanket policy intervention across all of psychology (plus the other behavioral sciences, to say nothing of the clinical sciences) would be a disaster for everyone involved. What would people do when they didn't have computational theory or adequate statistical models but got asked by funders and journals to provide such theory? My guess is that they'd make it up in a way that satisfied the policy hoop they'd been asked to jump through and then would continue p-hacking. <br /><br />Here are a few ideas about consensus metascience directions for both groups. Centrists should consider how they want to tweak policies to encourage cumulative science in the form of quantitative theory. How could we study the effects of quantitative theory on the robustness of empirical findings? I've got one idea: seems like <a href="http://babieslearninglanguage.blogspot.com/2015/09/descriptive-vs-optimal-bayesian-modeling.html">literatures that test quantitative theories presuppose precise and replicable measurements</a>; this is a testable correlational claim at least. I've also wondered about encouraging dose-response designs as a potential intervention on the standard 2x2 design that gets (over-)used in much of the psychology literature. </div><div><br /></div><div>On the other side, though, methodological radicals should take a look at the metascience policy intervention literature - where <a href="https://osf.io/preprints/metaarxiv/39cfb/">something actually gets changed in an official policy and then you measure the outcome</a>. Through my collaborations with Tom Hardwicke, I've become convinced that this kind of work can make us clearer about our desired endpoints as science policy-makers – what counts as success when we propose methodological reforms? </div><div><br /></div><div>One final comment. 
Another dynamic in this whole conversation is the failure – perceived and actual – of some centrist voices to engage constructively with the more radical critiques. As has been pointed out several times (as in the Ebersole tweet above), this lack of engagement may have to do with the gender distribution - more male voices in the center, more women on the radical side. These dynamics aren't good and this behavior is not OK. Leaders in the centrist parts of the field need to address the more radical critiques, especially those that come from folks who are deeply knowledgeable about the philosophical and statistical issues. The radical critiques of preregistration sometimes may get mistakenly written off as being part of a different genre of knee-jerk response to methodological reforms from less thoughtful corners of the field. This is sloppy. The radical work needs to be cited and discussed – and if, as I've suggested here, there's a response to the critiques based on pragmatics and policy issues, then that response needs to be articulated. </div><div><br /></div><h3>Conclusions</h3><div><br /></div><div>OK, in sum: Maybe this is part of being an official old person (TM) but, why can't we all just get along? Let's have radical ambitions for the future while taking well-scoped, pragmatic policy positions in the short term. </div><div><br /></div></div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-23148175875115873372021-02-08T17:38:00.005-08:002021-02-09T08:42:13.127-08:00Transparency and openness is an ethical duty, for individuals and institutions<i>(tl;dr: I wrote an opinion piece a couple of years ago - now rejected - on the connection between ethics and open science. Rather than letting it just get even staler than it was, here it is as a blog post.)</i><br /><br /><div>In the past few years, journals, societies, and funders have increasingly oriented themselves towards open science reforms, which are intended to improve reproducibility and replicability. Typically, transparency policies focus on open access to publications and the sharing of data, analytic code, and other research products. </div><div><br /></div><div>Many working scientists have a general sense that <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5688730/">transparency is a positive value</a>, but also have <a href="https://www.americanscientist.org/article/open-science-isnt-always-open-to-all-scientists">concerns about specific initiatives</a>. For example, sharing data often carries confidentiality risks that can only be mitigated via substantial additional effort. Further, many scientists worry about personal or career consequences from being “scooped” or having errors discovered. And transparency policies sometimes require resources that are not be available to researchers outside of rich institutions. </div><div><br /></div><div>I argue below that despite these worries, scientists have an ethical duty to be open. Further, where this duty is in conflict with scientists' other responsibilities, we need to lobby our institutions – universities, journals, and funders – to mitigate the costs and risks of openness.</div><span><a name='more'></a></span><div><br /></div><div><h4 style="text-align: left;">Scientists have an ethical duty to be open</h4>Openness is definitional to the scientific enterprise. 
The sociologist Robert Merton (1942) described a set of norms that science is assumed to follow: communism – that scientific knowledge belongs to the community; universalism – that the validity of scientific results is independent of the identity of the scientists; disinterestedness – that scientists and scientific institutions act for the benefit of the overall enterprise; and organized skepticism – that scientific findings must be critically evaluated prior to acceptance. The choice to be a scientist constitutes acceptance of these norms.<br /><br />For individual scientists to adhere to these norms, the products of research must be open. To contribute to the communal good, papers must be available so they can be read, evaluated, and extended. And to be subject to skeptical inquiry, experimental materials, research data, analytic code, and software must be all available so that analytic calculations can be verified and experiments can be reproduced. Otherwise, evaluators must accept arguments on the authority of the reporter rather than by virtue of the materials and data, an alternative that is inimical to the norm of universalism. For many scientists, the situation is neatly summarized by the motto of the Royal Society: “Nullius in verba,” <a href="https://science-sciencemag-org.stanford.idm.oclc.org/content/251/4990/142.2">often loosely translated as</a> “on no one’s word”.<br /><br />Beyond its centrality to science, openness also carries benefits, both to science and to scientists. Open access to the scientific literature <a href="https://www.bmj.com/content/323/7321/1103.short">increases the impact of publications</a>, which in turn increases the pace of discovery. Openly accessible data <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4973366/#bib105">increases the potential for citation and reuse</a>, and maximizes the chances that errors are found and corrected. These benefits accrue not just to the scientific ecosystem at large but also to individual scientists, who gain via citations, media impact, collaborations, and funding opportunities. <br /><br />Some responsibilities follow from these benefits. Because openness maximizes the impact of research and its products, researchers have a responsibility to their funders to pursue open practices so as to seek the maximal return on funders’ investments. And by the same logic, if research participants contribute their time to scientific projects, the researchers also owe it to these participants to maximize the impact of their contributions, <a href="https://www.ncbi.nlm.nih.gov/pubmed/23466937">as my colleague Russ Poldrack has argued</a>.<br /><br />For all of these reasons, individual scientists have a duty to be open – scientific institutions have a duty to promote transparency in the science they support and publish.</div><div><br /><h4 style="text-align: left;">The negatives of openness</h4>Scientists have many other ethical duties beyond openness, however. They have obligations to their collaborators and trainees. They have committed to funders to complete specific studies. And in biomedical and social science fields, they have duties to preserve the welfare of their research participants as well. Conflicts with these duties are often the source of researchers’ hesitance to embrace openness. <br /><br />Transparency policies also carry costs in terms of time and effort. For example, some routes to open access publication require authors to pay substantial publication costs (i.e., author processing charges). 
Organizing materials and data for sharing as well as providing support to dataset users can also be time-consuming, especially for larger datasets. <br /><br />Maintaining participant confidentiality is a major source of both cost and risk for biomedical and other human subjects research. Loss of confidentiality by research participants can have big negative consequences for health, employment, and well-being. While ensuring that tabular data does not contain identifying information is <a href="https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html">often relatively straightforward</a>, other types of data can be tricky and expensive to anonymize. For example, removing identifying information from video data requires considerable time and expertise. And certain types of dense or narrative data simply may not be de-identifiable due to aspects of the data or the participants’ identities. <br /><br />Transparency can even be a source of risk – actual or perceived – to researchers themselves. Effort spent pursuing open practices may not be seen as compatible with other career incentives. For example, learning technical tools to facilitate code and data sharing could take away from time to pursue new research. Disclosure of high value datasets prior to publication could in principle lead to opportunities for “scooping” – though it turns out that there are very few documented cases of pre-emption as a result of data sharing. Finally, open sharing of research products prior to and during peer review might carry greater risk for junior researchers and for researchers from disadvantaged groups, because of their greater vulnerability to critiques or negative attention.</div><div><br /><h4 style="text-align: left;">Individuals should consider openness as a default</h4>In the face of competing duties as well as potential negatives to openness, what should individual researchers do? First, because of the ethical duty to openness for every scientist, open practices should be a default in cases where risks and costs are limited. For example, the vast majority of journals allow authors to post accepted manuscripts in their untypset form to an open repository. This route to “green” open access is easy, cost free, and – because it comes only after articles are accepted for publication – confers essentially no risks of scooping. As a second example, the vast majority of analytic code can be posted as an explicit record of exactly how analyses were conducted, even if posting data is sometimes more fraught. These kinds of “incentive compatible” actions towards openness can bring researchers much of the way to a fully transparent workflow, and there is no excuse not to take them.<br /><br />For some researchers, however, there will be real negatives associated with one or more open practices. If they are not aware of the positive benefits of transparency and sharing for their work and the work of their trainees, they may consider open practices only as a necessary evil, rather than as opportunities to increase citations or build a reputation. But if they recognize the potential benefits of openness, researchers can ask whether there are steps that can be taken to realize some of those benefits while mitigating risks – for example, releasing only summary, tabular data rather than raw media data, or making use of a data sharing repository with robust access control.<br /><br />In some cases, researchers might decide not to share. 
One example of this kind of situation came up in my own work, when I was studying <a href="http://langcog.stanford.edu/papers_new/roy-2015-pnas.pdf">dense audio-video recordings of the private life of a single identified family</a>; these data are both sensitive and impossible to de-identify. The family decided not to share these data, and I support this decision, having seen how much the data would have compromised their family's privacy – though we did make tabular data available so that statistical results could be reproduced. A second more general case is archival data without consent for sharing where recontacting participants may be impossible or impractical. These cases are relatively rare, however; it is more common that sharing simply presents some potentially mitigable costs. It is precisely in these cases that institutions should step in.</div><div><br /><h4 style="text-align: left;">Institutions can mitigate the risks and costs of openness</h4>Given the ethical imperative towards openness, institutions like funders, journals, and societies need to use their role to promote open practices and to mitigate potential negatives. Scholarly societies have an important role to play in educating scientists about the benefits of openness and providing resources to steer their members towards best practices for sharing their publication and other research products. Similarly, journals can set good defaults, for example by requiring data and code sharing except in cases where a strong justification is given (equivalent to adopting the second highest level in the <a href="https://www.cos.io/initiatives/top-guidelines">Transparency and Openness Promotion</a> guidelines). I don't think the TOP guidelines are perfect, but I'm not sure why in this case we'd let the perfect be the enemy of the good.</div><div><br /></div><div>Departments and research institutes can also signal their interest in open practices in job advertisements and tenure/promotion guidelines. We did this the last time we had a search at Stanford Psych and it signaled our department's general interest in these practices, leading to some good conversations with candidates (and letting us notice explicitly if candidates weren't as interested as we were). In addition, by structuring graduate programs to provide training in tools and methods for data and code sharing, departments can educate grad students about producing reproducible and replicable research – this has been my hobby horse for quite a while (see <a href="http://langcog.stanford.edu/papers/FS-POPS2012.pdf">here</a> and <a href="https://psyarxiv.com/p73he/">here</a>). <br /><br />Institutional funders of research play the most important role, however. Most funders already signal an interest in openness through a required data management plan or similar document, and some (like the US NIH) mandate data sharing to the extent permissible given other regulatory constraints (e.g., institutional review, health or data privacy laws). These requirements, though laudable, don't really change the scientific incentives at play. Data sharing should not just be required: It should also be treated as part of the scientific merit of an application. Creating a sufficiently high value dataset should be itself meritorious enough to warrant funding. And on the opposite side of the calculus, funders should signal their willingness to support the effort required to mitigate data sharing costs. 
For example, this could take the form of extra budget supplements explicitly tied to sharing activities. <br /><br />More generally, funders and other institutional stakeholders need to act to change the incentive structure for individuals. For example, funding agencies could make it a priority to invest in creating technical tools and practice guidelines for human subject data anonymization. A small RFP for these could create huge value, making it much more straightforward to participate in data sharing. </div><div><br /></div><div><h4 style="text-align: left;">Conclusion</h4>Both advocates and critics of open practices often appear to be arguing about the merits of radical transparency, but this goal is often not achievable. Instead, individual researchers and institutions should proceed from both an understanding of the benefits of openness and an appreciation of the ethical duty to be open. These starting points lead naturally to a set of practices that are open by default, with exceptions in case of specific risks. </div><div><br /></div><div>When individual researchers can't mitigate the costs associated with openness, responsibility falls to institutional actors in the scientific ecosystem to help. We can all do our part in this by lobbying our journals scientific societies, institutions, and funders to support researchers in making the right decisions around transparency.<br /><span id="docs-internal-guid-a9fce7b4-7fff-f530-9546-002a82e12e8b"><br /></span></div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-1803920589231465772020-10-23T16:49:00.007-07:002020-10-23T16:50:32.779-07:00Against reference limitsMany academic conferences and journals have limits on the number of references you can cite. I want to argue here that <i>these limits make no sense and should be universally abolished</i>. <div><br /></div><div>To be honest, I kind of feel like I should be able to end this post here, since the idea seems so eminently sensible to me. But here's the positive case: If you are doing academic research of any type, you are not starting from scratch. It's critical to acknowledge antecedents and background so that readers can check assumptions. Some research has less antecedent work in its area, other research has more, and so a single limit for all articles doesn't make sense. More references allow readers to understand better where an article falls in the broader literature.</div><span><a name='more'></a></span><div><br /></div><div>Some objections and responses. </div><div><br /></div><div><b>Aren't there space limitations? </b>No, there aren't. Some journals still operate based on a set "page budget" that the publisher puts in place. This is silly as absolutely no one reads paper journals any more. If this weren't already clear before the pandemic, it's clear now. No one has sent an issue of <i>Cognition</i> or <i>Psych Science</i> to my house but life goes on. </div><div><br /></div><div><b>In my high profile, glossy journal, you should only cite important references and not try to be complete. </b>When you remove references from an article, typically you cut the three papers you might have cited to just one. That one is probably the original positive claim; it's more likely to be from a <a href="https://twitter.com/gershbrain/status/1319379132615168008">famous old guy</a> and it's less likely to be a newer finding, a meta-analysis, or a reference that provides additional context. 
This lack of context feeds the "rich get richer" cycle of citations and it hurts readers who should see multiple sources of evidence on an issue.</div><div><br /></div><div><b>My review journal is aimed at students and we don't want to overwhelm them with references. </b>I guess the argument is this: If you have 70 references and you cut them to 30 as a function of the journal limits, then students know what citation to look at. To me this seems crazy. First of all, no student is going to track down all 30 references; they are inevitably looking at a subset, probably the references for one claim. And for that one claim, they deserve the same context as a researcher does – don't just send them to the original paper without also giving them the critique, the meta-analysis, or the newer non-replication. If you want to curate, then have the bibliography be annotated (as, for example, <i>Nature Reviews Neuroscience</i> does). Let the author call out the important references, rather than removing dissent and diversity from the bibliography.</div><div><br /></div><div><b>It's only a conference paper/abstract, you don't need references and they count against the space limitations anyway.</b> Most computer science conferences now do not count references against page limits, and increasingly abstracts for developmental psych conferences do not either. Fundamentally, you are probably looking at conference papers or abstracts on the web – so the documents you look at should be able to have hyperlinks in them (and that's all references are, anyway, is hyperlinks to other papers). Let authors add a reference section! And while we're at it, we should have a technical solution (e.g., a regular expression) to count words outside of citations. Why dock people words for appropriate scholarly procedure?</div><div><br /></div><div><b>Unlimited references encourage (self-)citation packing. </b>It's true that if citations were unlimited, in principle you could pack the reference section with tons of irrelevant citations or, maybe more realistically, with self-citations. But first of all, most journals already have unlimited citations and no one does this (well, <a href="https://www.insidehighered.com/news/2018/04/30/prominent-psychologist-resigns-journal-editor-over-allegations-over-self-citation">almost no one</a>). Second, citation packing is something that can be dealt with by editors and reviewers. Finally, if someone is hell-bent on self-citation and you have a reference limit, they will use all of their references to cite themselves anyway. But if you give them unlimited references they might actually cite the relevant work in addition to their own. Self-citation is a real issue, but limiting references is the wrong policy tool to deal with this problem.</div><div><br /></div><div>OK, I hope I've convinced you. Let everyone cite to their heart's content. Don't limit references, and don't count them towards page and word limits in submissions. </div>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-12499236340881135172020-03-02T16:12:00.003-08:002020-03-03T09:21:19.640-08:00Advice on reviewing<i>(Several people I work with have recently asked me for reviewing advice, so I thought I'd share my thoughts more broadly.)</i><br />
<i><br /></i>
Peer review – organized scrutiny of scientific work prior to publication – is a critical part of our current scientific ecosystem. But I have heard many of the peer review horror studies out there and experienced some myself. The peer review ecosystem could be improved – better tracking and sharing of peer review, better credit assignment, more fair allocations of review requests, better online systems for editors and reviewers, to name a few.*<br />
<br />
Should we have peer review at all? In my view, peer review is primarily a filter that limits the amount of truly terrible work that appears in reputable journals (e.g., society publications, high-ranked international outlets). Don't get me wrong: plenty of incorrect, irreproducible, and un-replicable science still appears in print! But there are certain minimal standards that peer review enforces – published work typically ends up conforming to the standards of its field, even if those standards themselves could be improved. Without peer review, more of this terrible work would appear and there would be even more limited cues for distinguishing the good from the bad.** To paraphrase, it's the worst solution to the problem of quality control in science – except for all the others!<br />
<br />
So all in all I'm an advocate for peer review.<br />
<br />
<a name='more'></a><br />
But for an early career researcher (say a grad student or postdoc especially), getting involved poses some tradeoffs. On the one hand, there are several positives. Being a reviewer helps you:<br />
<ul>
<li>learn about other new work in the field by engaging with it deeply, </li>
<li>calibrate your judgment to that of the editor and other reviewers, and</li>
<li>get credit from editors (and occasionally authors and other readers, in the case of open review) for contributing.</li>
</ul>
<div>
But it also can be time-consuming, especially at first. How do you decide when to review and when not to review? Here's my advice. </div>
<div>
<blockquote class="tr_bq">
<i>On average, try to review </i><i>about 2.5x as many papers as you submit as first author</i><i>. Try to do those reviews at the places you publish and want to publish. Be efficient with your reviewing.</i></blockquote>
</div>
I'll explain each part here in a bit more depth.<br />
<br />
<b>1. On average, not right now.</b> As my wife is fond of saying, we have <a href="https://en.wikipedia.org/wiki/Seasons_of_Giving">seasons of giving</a>. You don't have to do everything at once! This means, first, you should try not to have more than a few reviews out at a time. Otherwise it gets very overwhelming. So try to space things out: don't feel like you have to review continuously. Take a break from time to time, especially if family or career circumstances mean you have a lot on your plate. I did a ton of one-off reviewing for several years, then did a bunch of editorial service, got burned out – <a href="http://babieslearninglanguage.blogspot.com/2017/06/confessions-of-associate-editor.html">related confessional blogpost here</a> – then took a breather, and now am back doing a mix of editing and reviewing.<br />
<br />
<b>2. Review at the population replacement rate.</b> Most papers have 2–3 reviewers. So if everyone reviews 2–3 papers for each first-authored paper they submit, then the supply of reviews coming into the system will match the demand created by new submissions. But again, this doesn't have to be all at once! If you haven't submitted anything yet from your PhD, doing a lot of reviewing is not usually a great idea. I tend to suggest focusing on your own work until then. This is also not a hard and fast rule and it's great to be generous with reviewing if you have the curiosity and capacity. If you're submitting one paper this year, I think it's fine – maybe even good – to review more than two or three papers. But I wouldn't necessarily review ten unless you really want to.<br />
<br />
<b>3. Review at places you (want to) publish.</b> Peer review is an important part of socialization into a scientific community. It's one way our communities develop norms as to statistical or methodological standards. A lot has been said about the ways these norms are occasionally negative (e.g., requiring HARKing – "hypothesizing after the results are known"). Plenty of this socialization is good, though. For example, my recent reviewers have required more breadth in the cited literature, more reproducible code, and additional studies, among many other steps that have made my and my collaborators' work better.*** By participating in specific communities' review, you learn what they want from their contributors. You also have a chance to show editors your thoughtfulness and judgment. (This isn't a big motivator but it's not nothing.) So choosing outlets carefully helps you give back to the scholarly community you want to be part of and it also helps you learn about how that community works.<br />
<br />
<b>4. Spend time on reviews, but not too much time. </b>My first review ever (as a grad student) was eight pages long. I included information on every typo in the paper. I'm sure there was useful feedback in there, but as an editor, these kinds of over-the-top reviews don't actually help that much. And as an author, they are a pain – they are either "writing for the author" or nitpicking specific wording decisions. Authors should get some autonomy in what they write, provided the underlying research is sound. The advice I received from my advisor (after he had a nice chuckle about the length of my review) was: summarize the paper in no more than a paragraph, provide a small handful of major points that are critical to your evaluation of the paper, and if you feel it's appropriate, make a recommendation.**** Then you can list a few minor points that are helpful to the authors but don't themselves make or break the paper.<br />
<br />
Writing a review like this takes time, but not too much time. I recommend reading the paper through soon after you get it, making a few notes, thinking it over, and then coming back and writing the review as you reread. That way you can form an opinion and then check it. It's hard to say how long this process <i>should</i> take – everyone is different, and the process gets way faster with experience. But if a normal-length paper is taking more than 3–5 hours to review, I think that's probably too much, unless you are really taking time to check a specific calculation or analysis.<br />
<br />
Finally, what do you do if a particular reviewing opportunity just isn't right? Don't be afraid to say no. Editors are people too, and they will totally understand if you tell them how many reviews you already have outstanding or share that you are on leave or otherwise occupied (finishing your thesis, for example). Editors are generally fine with a quick and helpful decline, especially when you name other people who you think are qualified.***** You can always say "happy to help next time!"<br />
<br />
----<br />
* I won't talk about blinding vs. not blinding here, though I did share some thoughts <a href="http://babieslearninglanguage.blogspot.com/2017/06/confessions-of-associate-editor.html">elsewhere</a>.<br />
** In some fields, there aren't huge incentives for publishing random nonsense. Theoretical physics comes to mind – you can upload random junk to arXiv but it's not a huge deal, in the sense that it's just more spam that needs to be filtered out. In contrast, in biomedicine or even in psychology, publication in a strong journal can lead to positive commercial consequences. So we need significant filtering to prevent unscrupulous researchers from taking advantage of this route.<br />
*** They also of course misunderstood simple points; got the stats wrong; asked me to cite their own work; and said trenchant stuff about my writing that made me feel bad for days. Criticism is always a mixed bag.<br />
**** Some people say that reviewers should assess but not recommend. But most journals make you choose your recommendation from a dropdown menu so I don't know what that really means. I think that if you have a clear recommendation, you should state it in the review and argue for it. E.g., "for this paper to be acceptable, the authors would need to do X, Y, and Z."<br />
***** It's especially helpful to decline by suggesting early-career experts: most editors think of the same prominent researchers for reviews in a particular domain and then have trouble generating a broader reviewer pool for areas they don't know as well.<br />
<div>
<br /></div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-46418151731552403132019-11-05T21:51:00.002-08:002019-11-05T21:52:07.357-08:00Letter of recommendation: Attack of the Psychometricians<i>(tl;dr: It's letter of recommendation season, and so I decided to write one to a paper that's really been influential in my recent thinking. Psychometrics, y'all.)</i><br />
<br />
To whom it may concern:<br />
<br />
I am writing to provide my strongest recommendation for the paper, "<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2779444/">Attack of the Psychometricians</a>" by Denny Borsboom (2006). Reading this paper oriented me to a rich tradition of psychometric modeling – but more than that, it changed my perspective on the relationship between psychological measurement and theory. (It also taught me to use the term "sumscore"* as an insult). I urge you to consider it for a position in your reading list, syllabus, or lab meeting.<br />
<br />
I first met AotP (or Attack!, as I like to call it) via a link on twitter. Not the most auspicious beginning, but from a quick skim on my phone, I could tell that this was a paper that needed further study.<br />
<br />
The paper presents and discusses what it calls the central insight of psychometrics: that "measurement does not consist of finding the right observed score to substitute for a theoretical attribute, but of devising a model structure to relate an observable to a theoretical attribute." In other words, the goal is to make models that link data to theoretical quantities of interest. What this means is that measurement is essentially continuous with theory construction. By creating and testing a good measurement model, you're creating and testing a key component of a good theory.<br />
<br />
<a name='more'></a><br />
Attack! has made me think about the origins of this situation. Here's my attempt at an origin story. In the olden times, all the psychologists went to the same conferences and worried about the same things. But then a split formed between different groups. Educational psychologists and psychometricians knew that different problems on tests had different measurement properties, and began exploring how to tell good items from bad, and how to estimate people's ability abstracted away from specific items. Cognitive psychologists, on the other hand, spurned this item-level variation and embraced the dogma of exchangeable experimental items. People did Lots Of Trials, all generated from the same basic template. The sumscore reigned supreme, and yielded important insight into Memory, Attention, and Reasoning (irrespective of what was being remembered, attended to, or reasoned about).<br />
<br />
Psychophysicists diverged from the cognitivist hierarchy. They always knew that they needed to infer a latent relationship (the psychometric curve). As they got better at doing this, they fit models that included parameters of the decision process – for example, a "lapse" parameter to capture inattention – as well as the quantities of interest. And because they typically fit these curves within individual subjects, these parameters were participant-level estimates. But the models that fit these curves were often specific to particular metric relationships and not appropriate for increasingly complicated domains.<br />
<br />
Now in modern cognitive science, we get work on sophisticated constructs – for example, in moral psychology or psycholinguistics – where experimenters break with the cognitivist dogma and use non-exchangeable items. Sometimes items are sentences or even whole vignettes. Yet for the most part these researchers have forgotten to model item variation (except occasionally using a random intercept for items in their linear mixed effects models). <a href="https://web.stanford.edu/~clark/1970s/Clark,%20H.H.%20_Language%20as%20fixed%20effect%20fallacy_%201973.pdf">Clark (1973)</a> scolded them about the problematic statistical inferences that could result from forgetting to model items and this guidance has reappeared in recent exhortations to <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881361/">Keep It Maximal</a>! But as far as I can tell, no one really talks about modeling items in more detail <i>in order to learn more about what is in people's heads</i>.<br />
<br />
Attack! has infested my brain. Now when I see someone use differentiated items in their task yet use the sumscore as their measure of the latent trait of interest, I think, "you're just leaving information on the table." <a href="https://psyarxiv.com/42rza/">I suddenly want to fit psychometric models to everything</a>. Because, in the end, what do you want as a psychologist? A better understanding of the latent space that we're trying to theorize about. I used to think that this was called Theory and it was distinct from Data Analysis. Thanks to Attack! I now know that measurement and theory are (or at least should be) contiguous with one another.**<br />
<br />
On a personal note, Attack! is a great read and will play well with your interest in sociological biases that shape the structure of scientific inquiry. You shouldn't pass this paper up. Do not hesitate to contact me with questions or concerns.<br />
<br />
Sincerely,<br />
<br />
Michael C. Frank<br />
Internet Commentator<br />
<br />
---<br />
* For those of you not in the know, the sumscore is just what we normal psychologists call "percent correct" – treating the sum of your correct answers on the test as your score, as opposed to inferring the latent trait (ability) from the performance on the observed variables.<br />
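(Here's a minimal simulated sketch of the distinction – my own toy example, nothing from the paper – comparing sumscores with ability estimates from a Rasch-style model fit with lme4's glmer:)<br />
<pre>library(lme4)
set.seed(7)
n_subj <- 200; n_item <- 20
theta <- rnorm(n_subj)  # latent ability
b <- rnorm(n_item)      # item difficulty
d <- expand.grid(subj = factor(1:n_subj), item = factor(1:n_item))
d$correct <- rbinom(nrow(d), 1, plogis(theta[d$subj] - b[d$item]))

sumscores <- tapply(d$correct, d$subj, mean)  # "percent correct"

# Rasch-style model: fixed item parameters, random subject abilities
m <- glmer(correct ~ -1 + item + (1 | subj), data = d, family = binomial)
abilities <- ranef(m)$subj[, 1]
cor(sumscores, abilities)  # similar ranking, but the model also
                           # estimates the item parameters</pre>
<br />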
** This contiguity idea is interestingly related to the Bayesian Data Analysis turn in the Bayesian cognitive modeling world, where we now think about linking functions that relate models to data directly. In fact, I think these are really the same idea when you get down to it. Here's a great paper that describes this viewpoint: <a href="https://psycnet.apa.org/record/2017-14287-001">Tauber, Navarro, Perfors, & Steyvers (2017)</a>.Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-58982439542830422512019-10-08T21:34:00.001-07:002019-10-09T10:40:16.291-07:00Confounds and covariates<i>(tl;dr: explanation of confounding and covariate adjustment)</i><br />
<br />
Every year, one of the trickiest concepts for me to teach in my experimental methods course is the difference between experimental confounds and covariates. Although this distinction seems simple, it's pretty deeply related to the definition of what an experiment is and why experiments lead to good causal inferences. It's also caught up in a number of methodological problems that come up again and again in my class. This post is my attempt to explain the distinction and how it relates to different problems and cultural practices in psychology.<br />
<br />
Throughout this post, I'll use a silly example. My first year of graduate school, I got distracted from my actual research by the hypothesis that listening to music with lyrics decreased my ability to write papers for my classes. I'll call this the "Bob Dylan" hypothesis, since I was listening to a lot of Dylan at the time. Let's represent this by the following causal diagram.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1-X9DCZ1lTemFTzINBD4GkDEPislXY9vrabAHZZNr8A9CNZ0wlc9oGEyXS5DEV0dzkwOWh-3rtbtEIi6HIjoHCJCFFuq3jR6cg08J86f_x4a6tfdI077spgmTJFrVo6y92NvQW_i8m0Fs/s1600/Screen+Shot+2019-10-07+at+8.50.57+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="704" data-original-width="448" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1-X9DCZ1lTemFTzINBD4GkDEPislXY9vrabAHZZNr8A9CNZ0wlc9oGEyXS5DEV0dzkwOWh-3rtbtEIi6HIjoHCJCFFuq3jR6cg08J86f_x4a6tfdI077spgmTJFrVo6y92NvQW_i8m0Fs/s200/Screen+Shot+2019-10-07+at+8.50.57+PM.png" width="126" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Our outcome is writing skill (Y) and our predictor is Dylan listening (X). The edge between them represents a hypothesized causal relationship. Dylan is hypothesized to affect writing skill, and not vice versa. (These kinds of diagrams are called <a href="https://en.wikipedia.org/wiki/Causal_graph">causal graphical models</a>*).<br />
<br />
<b>Observational Studies and Experiments</b><br />
<br />
Suppose we did an observational study where we measured each of these variables in a large population. Assume we came up with some way to sample people's writing, get a measure of whether they were or weren't listening to lyric-heavy music at the time, and assess the writing sample's quality. We might find that Y was correlated with X, but in a surprising direction: listening to Dylan would be related to <i>better</i> writing.<br />
<br />
Can we make a causal inference in this case? If so, we could get rich promoting a Dylan-based writing intervention. Unfortunately, we can't – correlation doesn't equal causation here, because there is (at least one) confounding third variable: age (Z). Age is positively related to both Dylan listening and writing skill in our population of interest. Older people tend to be good writers and also tend to be more into folk rockers; I'm not even going to put a question mark on this edge because I'm pretty sure this is true.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9qASfcND_jkm9-hP1I64ybQ1OYKnWw0iFfnqi6Do1TRQMqeM1p4HtqJnU9z3BsSjDkjlAPvPWZwIKQBOTjyhr4eqMEXYRwahH9wgmeSkoAgwMXA2akWdp-p5FS0V4ULBXJTbDgDpa7zJK/s1600/Screen+Shot+2019-10-07+at+8.51.03+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="970" data-original-width="704" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9qASfcND_jkm9-hP1I64ybQ1OYKnWw0iFfnqi6Do1TRQMqeM1p4HtqJnU9z3BsSjDkjlAPvPWZwIKQBOTjyhr4eqMEXYRwahH9wgmeSkoAgwMXA2akWdp-p5FS0V4ULBXJTbDgDpa7zJK/s200/Screen+Shot+2019-10-07+at+8.51.03+PM.png" width="145" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
But: the causal relationship of age to our other two variables means that variation in Z can induce a correlation between X and Y, even in the absence of a true causal link. We can say that age is a <b>confound</b> in estimating the Dylan-writing skill relationship: it's a variable that is correlated with both our predictor and our outcome variables.<br />
<br />
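We can see this in a quick simulation. Here's a minimal sketch in R (all numbers invented for illustration): Z (age) drives both X (Dylan listening) and Y (writing skill), X has no causal effect on Y, and yet X and Y end up correlated.<br />
<pre># age (Z) causes both dylan (X) and writing (Y);
# dylan has *no* causal effect on writing.
set.seed(42)
n <- 10000
age <- rnorm(n, mean = 40, sd = 10)             # Z
dylan <- rbinom(n, 1, plogis((age - 40) / 10))  # X: older folks listen more
writing <- 50 + 0.5 * age + rnorm(n, sd = 5)    # Y: driven by age alone

cor.test(writing, dylan)  # reliably positive, purely via the confound</pre>
<br />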
To get gold-standard evidence about causality, we need to do an experiment. (We won't discuss statistical techniques for inferring causality, which can be useful but don't give you gold standard evidence anyway; <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6558187/">review here</a>).<br />
<br />
Experiments are when we intervene on the world and measure the consequences. Here, this means forcing some people to listen to Dylan. In the language of graphical models, if <i>we </i>control the Dylan listening, that means that variable X is causally exogenous. (Exogenous means that it's not caused by anything else in the system). We "snipped" the causal link between age and Dylan listening.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyW_VW663pN55fyuAGk2CljhW3IwXKdbxVVFKlCfODwTuT29PSkSaCTBA9pqQJt62PyKZT3ocGJz-NJ2BvNHnlp_IbYeG9L6J4alhMdfBeGVVIIVM_KKU8B6sikphHtQ_ygL01bsEFLRfe/s1600/Screen+Shot+2019-10-07+at+8.51.14+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="964" data-original-width="736" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyW_VW663pN55fyuAGk2CljhW3IwXKdbxVVFKlCfODwTuT29PSkSaCTBA9pqQJt62PyKZT3ocGJz-NJ2BvNHnlp_IbYeG9L6J4alhMdfBeGVVIIVM_KKU8B6sikphHtQ_ygL01bsEFLRfe/s200/Screen+Shot+2019-10-07+at+8.51.14+PM.png" width="152" /></a></div>
<br />
So now we can "wiggle" the Dylan listening variable – change it experimentally – and see if we detect any changes in writing skill. We do this by randomly assigning individuals to listen to Dylan or not and then measuring writing during the assigned listening (or non-listening) period. This is a "between-subjects" design. We can use our randomized experiment to get a measure of the <a href="https://en.wikipedia.org/wiki/Average_treatment_effect">average treatment effect</a> of Dylan, the size of the causal effect of the intervention on the outcome. In this simple experiment, the ATE is estimated by the regression Y ~ X (for ease of exposition, I'm not going to discuss so-called mixed models, which model variation across subjects and/or experimental items). That's the elegant logic of randomized experiments: the difference between condition gives you the average effect.<br />
<br />
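Continuing the simulation sketch from above, randomization breaks the age–Dylan link, and the simple regression recovers the true effect (here I build in a true effect of -2):<br />
<pre># randomize dylan listening, snipping the age -> dylan edge
dylan_rct <- rbinom(n, 1, 0.5)  # coin-flip assignment
writing_rct <- 50 + 0.5 * age - 2 * dylan_rct + rnorm(n, sd = 5)

coef(lm(writing_rct ~ dylan_rct))["dylan_rct"]  # close to -2, unbiased</pre>
<br />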
<b>Confounds </b><br />
<br />
Let's consider an alternate experiment now. Suppose we did the same basic procedure, but now with a "within-subjects" design where participants do both the Dylan treatment and the control, in that order. This experiment is flawed, of course. If you observe a Dylan effect, you can't rule out the idea that participants got tired and wrote worse in the control condition because it always came second.<br />
<br />
Order (Dylan first vs. control first; notated X') is an <b>experimental confound</b>: a variable that is created in the course of the experiment that is both causally related to the predictor and potentially also related to the outcome. Here's how the causal model now looks:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6jG5pEYWe8VegGLlElpTPtkyc8Q3gArLBYHv8El7u7LCODmCc7Qhu4WSSVyCFJ1lRTX5v-gx7KkFyJ15tZ0bacKHqrfYBe0MR1IykcpOhtp93enWZ1u_2-EZaJEfsuihOxOe-S5KmWbKe/s1600/Screen+Shot+2019-10-07+at+9.43.35+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="796" data-original-width="808" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6jG5pEYWe8VegGLlElpTPtkyc8Q3gArLBYHv8El7u7LCODmCc7Qhu4WSSVyCFJ1lRTX5v-gx7KkFyJ15tZ0bacKHqrfYBe0MR1IykcpOhtp93enWZ1u_2-EZaJEfsuihOxOe-S5KmWbKe/s200/Screen+Shot+2019-10-07+at+9.43.35+PM.png" width="200" /></a></div>
<br />
<br />
We've reconstructed the same kind of confounding relationship we had with age, where we had a variable (X') that was correlated both with our predictor (X) and our outcome (Y)! So...<br />
<br />
<b>What should we do with our experimental confounds? </b><br />
<br />
<b>Option 1.</b> Randomize. Increasingly, this is my go-to method for dealing with any confound. Is the correct answer on my survey confounded with response side? Randomize what side the response shows up on! Is order confounded with condition? Randomize the order you present in! Randomization is much easier now that we program many of our experiments using software like Qualtrics or code them from scratch in JavaScript.<br />
<br />
The only time you really get in trouble with randomization is when you have a large number of options, a small number of participants, or some combination of the two. In this case, you can end up with unbalanced levels of the randomized factors (for example, ten answers on the right side and two on the left). Averaging across many experiments, this lack of balance will come out in the wash. But in a single experiment, it can really mess up your data – especially if your participants notice and start choosing one side more than the other <i>because</i> it's right more often. For that reason, when balance is critical, you want option 2.<br />
<br />
<b>Option 2.</b> Counterbalance. If you think a particular confound might have a significant effect on your measure, balancing it across participants and across trials is a very safe choice. That way, you are guaranteed to have no effect of the confound on your average effect. In a simple counterbalance of order for our Dylan experiment, we manipulate condition order between subjects. Some participants hear Dylan first and others hear Dylan second. Although technically we might call order a second "factor" in the experiment, in practice it's really just a nuisance variable, so we don't talk about it as a factor and we often don't analyze it (but see Option 3 below).<br />
<br />
In the causal language we have been using, counterbalancing allows us to snip out the causal dependency between order and Dylan. Now they are unconfounded (uncorrelated) with one another. We've "solved" a confound in our experimental design. Here's the picture:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidisBklxHw5tj4Hi6NYR4BECch8Cx-SYJe-LFhNYqFf1oboYEmHPPrVBFwcYYKwzOAHTsjSZarFVm6L43Khk3nDne-s7MWKyq6Fm3UaWGfOxfuUSYhISkxqa5gII6qTdGI5w0ZK5jkoL8W/s1600/Screen+Shot+2019-10-07+at+10.27.55+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><br class="Apple-interchange-newline" /><img border="0" data-original-height="812" data-original-width="790" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidisBklxHw5tj4Hi6NYR4BECch8Cx-SYJe-LFhNYqFf1oboYEmHPPrVBFwcYYKwzOAHTsjSZarFVm6L43Khk3nDne-s7MWKyq6Fm3UaWGfOxfuUSYhISkxqa5gII6qTdGI5w0ZK5jkoL8W/s200/Screen+Shot+2019-10-07+at+10.27.55+PM.png" width="194" /></a></div>
<br />
Counterbalancing doesn't always work, though. It gets trickier when you have too many levels on a variable (too many Dylan songs!) or multiple confounding variables. For example, if you have lots of different nuisance variables – say, condition order, what writing prompt you use for each order, which Dylan song you play – it may not be possible to do a fully-crossed counterbalance so that all combinations of these factors are seen by equal numbers of participants. In these kinds of cases, you may have to rely on partial counterbalancing schemes or <a href="http://compneurosci.com/wiki/images/9/98/Latin_square_Method.pdf">latin squares designs</a>, or you may have to fall back on randomization.<br />
<br />
<b>Option 3.</b> Do Options 1 and 2 and then model the variation. This option was never part of my training, but it's an interesting third option that I'm increasingly considering.** That is, we are often faced with the choice between A) a noisy between-participants design and B) a lower-noise within-participants design that nevertheless adds noise back in via some obvious order effect that you have to randomize or counterbalance. In a recent talk, Andrew Gelman suggested that we try to model these nuisance variables as covariates, to reduce noise. This seems like a pretty interesting suggestion, especially if the correlation between them and the outcome is substantial.***<br />
<br />
<b>Covariates</b><br />
<br />
Going back to our example, now we have two variables – age and order – that are no longer <i>confounded</i> with our primary relationship of interest (i.e., Dylan and writing). But they may still be related to our outcome measure. Here's what the picture looks like, repeated from above.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidisBklxHw5tj4Hi6NYR4BECch8Cx-SYJe-LFhNYqFf1oboYEmHPPrVBFwcYYKwzOAHTsjSZarFVm6L43Khk3nDne-s7MWKyq6Fm3UaWGfOxfuUSYhISkxqa5gII6qTdGI5w0ZK5jkoL8W/s1600/Screen+Shot+2019-10-07+at+10.27.55+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="812" data-original-width="790" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidisBklxHw5tj4Hi6NYR4BECch8Cx-SYJe-LFhNYqFf1oboYEmHPPrVBFwcYYKwzOAHTsjSZarFVm6L43Khk3nDne-s7MWKyq6Fm3UaWGfOxfuUSYhISkxqa5gII6qTdGI5w0ZK5jkoL8W/s200/Screen+Shot+2019-10-07+at+10.27.55+PM.png" width="194" /></a></div>
<br />
Even if they are not confounding our experimental manipulation, age and experimental condition order may still be correlated with our outcome measure, writing skill. How does this work? Well, the average treatment effect of Dylan on writing is still given by the regression Y ~ X. But we also know that there is some variance in Y that is due to X' and Z.<br />
<br />
That's because age and order are <b>covariates</b>: they may – by virtue of their potential causal links with the outcome variable – have some correlation with outcomes, even in a case where the predictor is experimentally manipulated. This should be intuitive for the external (age) covariate, but it's true for both: they may account for variance in Y over and above that controlled by the experimental manipulation of X.<br />
<br />
<b>What should we do about our covariates? </b><br />
<br />
<b>Option 1.</b> Nothing! We are totally safe in ignoring all of our covariates, regressing Y on X and treating the estimate as an unbiased estimate of the effect (the ATE). This is why randomization is awesome. We are guaranteed that, in the limit of many different experiments, even though people with different ages will be in the different Dylan conditions, this source of variation will be averaged out.<br />
<br />
The <b>first fallacy of covariates</b> is that, because you have a known covariate, you have to adjust for it. Not true. You can just ignore it and your estimate of the ATE is unbiased. This is the norm in cognitive psychology, for example: variation between individuals is treated as noise and averaged out. Of course, there are weaknesses in this strategy – you will not learn about the relationship of your treatment to those covariates! – but it is sound.<br />
<br />
<b>Option 2. </b>If you have a small handful of covariates that you believe are meaningfully related to the outcome, you can plan in advance to adjust for them in your regression. In our Dylan example, this would be a pre-registered plan to add Z as a predictor: Y ~ X + Z. If age (Z) is highly correlated with writing ability (Y), then this will give us a more precise estimate of the ATE, while remaining unbiased.<br />
<br />
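Here's a minimal simulation sketch of why adjustment helps (my own toy numbers, not the code linked below): both estimators are unbiased, but the adjusted one is less variable.<br />
<pre># simulate many small randomized experiments (true ATE = 1) and compare
# the spread of the unadjusted vs. covariate-adjusted estimates
set.seed(1)
estimates <- replicate(1000, {
  n <- 100
  z <- rnorm(n)                    # covariate, correlated with outcome
  x <- rbinom(n, 1, 0.5)           # randomized treatment
  y <- 1 * x + 0.8 * z + rnorm(n)
  c(unadj = coef(lm(y ~ x))["x"],
    adj = coef(lm(y ~ x + z))["x"])
})
apply(estimates, 1, mean)  # both are close to 1: unbiased either way
apply(estimates, 1, sd)    # the adjusted estimate is noticeably tighter</pre>
<br />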
When should we do this? Well, it turns out that you need a pretty strong correlation to make a big difference. There's some nice code to simulate the effects of covariate adjustment on precision in <a href="http://egap.org/methods-guides/10-things-know-about-covariate-adjustment">this useful blogpost on covariate adjustment</a>; <a href="https://gist.github.com/mcfrank/a165356fcff909fdc41fdf82c31fc277">I lightly adapted it</a>. Here's the result:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNvXC-s2FdHpe-CV11ka6-iAXRKCItZ4lpSBCSKc-I3WsxahDBkZ248_GkH7F83W3GGaUWq_Pr5IfYL_fwJGtD0my_e1h03gNcMyQ0sZufJgXphBQz3shOmwmKi9WSPtJllbRK4o1mKRuZ/s1600/Rplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="389" data-original-width="584" height="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNvXC-s2FdHpe-CV11ka6-iAXRKCItZ4lpSBCSKc-I3WsxahDBkZ248_GkH7F83W3GGaUWq_Pr5IfYL_fwJGtD0my_e1h03gNcMyQ0sZufJgXphBQz3shOmwmKi9WSPtJllbRK4o1mKRuZ/s400/Rplot.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Root mean squared error (RMSE; lower RMSE means greater precision, in other words) is plotted as a function of the sample size (N). Different colors show the increase in precision when you control for covariates with different levels of correlation with the outcome variable. For low levels of correlation with the covariate, you don't get much increase in precision (pink and red lines). Only when the correlation is .6 or above do we see noticeable increases in precision, and it only really makes a big difference with correlations in the range of .8.<br />
<br />
Considering these numbers in light of our Dylan study, I would bet that age and writing skill are not correlated at &gt; .8 (unless we're looking at ages from kindergarten to college!). I would guess that in an adult population this correlation would be much, much lower. So maybe it's not worth controlling for age in our analyses.<br />
<br />
And the same is probably true for order, our other covariate – although perhaps we do think that order has a strong correlation with our skill measure. For example, maybe our experiment is long and there are big fatigue effects. In that case, we would want to condition.<br />
<br />
So these are our options: if the covariate is known to be very strongly correlated with the outcome, we can condition. Otherwise we should probably not worry about it.<br />
<br />
<b>What <i>shouldn't</i> we do with our covariates?</b><br />
<b><br /></b>
<b>Don't c</b><b>ondition on lots and lots of covariates because you think they are theoretically important. </b>There are lots of things that people do with covariates that they shouldn't be doing. My personal hunch is that this is because a lot of researchers think that covariates (especially demographic ones like age, gender, socioeconomic status, race, ethnicity, etc.) are important. That's true: these are important variables. But that doesn't mean you need to control for them in every regression. This leads us to the second fallacy.<br />
<br />
The <b>second fallacy of covariates</b> is that, because you think covariates are in general meaningful, it is not harmful to control for them in your regression model. In fact, if you control for meaningless covariates in a standard regression model, you will on average reduce your ability to see differences in your treatment effect. Just by chance your noise covariates will "soak up" variation in the response, leaving less to be accounted for by the true treatment effect! Even if you strongly suspect something is a covariate, you should be careful before throwing it into your regression model.<br />
<br />
<b>Don't condition on covariates because your groups are unbalanced. </b>People often talk about "unhappy randomization": you randomize adults to the different Dylan groups, for example, but then it turns out the mean age is a bit different between groups. Then you do a <i>t</i>-test or some other statistical test and find out that you actually have a significant age difference. But this makes no sense: because you randomized, you <i>know</i> that the difference in ages occurred by chance, so why are you using a <i>t</i>-test to test <i>if</i> the variation is due to chance? In addition, if your covariate isn't highly correlated with the outcome, this difference won't matter (see above). Finally, if you adjust for this covariate because of such a statistical test, you can actually end up biasing estimates of the ATE across the literature. Here's a really useful blogpost from the <a href="https://blogs.worldbank.org/impactevaluations/should-we-require-balance-t-tests-baseline-observables-randomized-experiments">World Bank</a> that has more details on why you shouldn't follow this practice.<br />
<br />
<b>Don't condition on covariates post-hoc. </b>The previous example is a special case of a general practice that you shouldn't follow. Don't look at your data and then decide to control for covariates! Conditioning on covariates based on your data is an extremely common route for <i>p</i>-hacking; in fact, it's so common that it shows up in Simmons, Nelson, & Simonsohn's (2011) instant classic <a href="https://journals.sagepub.com/doi/10.1177/0956797611417632">False Positive Psychology</a> paper as one of the key ingredients of analytic flexibility. Data-dependent selection of covariates is a quick route to false positive findings that will be less likely to be replicable in independent samples.<br />
<b><br /></b>
<b><span class="il" style="background-color: white; font-family: inherit;">Don't condition</span><span style="background-color: white; font-family: inherit;"> on a post-</span><span class="il" style="background-color: white; font-family: inherit;">treatment</span></b><span style="background-color: white; font-family: inherit;"><b> variable. </b>As we discussed above, there are some reasons to condition on highly-correlated covariates in general. But there's an exception to this rule. There are some variables that are never OK to condition on – in particular, any variable that is collected after treatment. For example, we might think that another good covariate would be someone's enjoyment of Bob Dylan. So, after the writing measurements are done, we do a Dylan Appreciation Questionnaire (DAQ). The problem is, imagine that having a bad experience writing while listening to Dylan might actually change your DAQ score. So then people in the Dylan condition would have <i>lower</i> DAQ on average. If we control for DAQ in our regression (Y ~ X + DAQ), we then distort our estimate of the effects of Dylan. Because DAQ and X (Dylan condition) are correlated, DAQ will end up soaking up some variance that is actually due to condition. This is bad news. </span><span style="background-color: white; font-family: inherit;">Here's a nice </span><a data-saferedirecturl="https://www.google.com/url?q=http://www.dartmouth.edu/~nyhan/post-treatment-bias.pdf&source=gmail&ust=1570578022546000&usg=AFQjCNG45hySa0tLnhJAjhULDW_AEs9d8w" href="http://www.dartmouth.edu/~nyhan/post-treatment-bias.pdf" style="background-color: white; color: #1155cc; font-family: inherit;" target="_blank">paper</a> that explains this issue in more detail.<br />
<br />
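Here's a sketch of that bias in simulation (my own toy numbers): the DAQ score depends on the (Dylan-affected) writing experience, and conditioning on it distorts the estimate.<br />
<pre># daq is measured *after* treatment and is affected by the outcome
set.seed(123)
n <- 100000
dylan <- rbinom(n, 1, 0.5)
writing <- 50 - 2 * dylan + rnorm(n)  # true Dylan effect = -2
daq <- 0.5 * writing + rnorm(n)       # post-treatment variable

coef(lm(writing ~ dylan))["dylan"]        # close to -2: unbiased
coef(lm(writing ~ dylan + daq))["dylan"]  # close to -1.6: biased toward 0</pre>
<br />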
<b>Don't condition on a collider. </b>This issue is a little bit off-topic for the current post, since it's primarily an issue in observational designs, but here's a really good <a data-saferedirecturl="https://www.google.com/url?q=http://www.the100.ci/2017/03/14/that-one-weird-third-variable-problem-nobody-ever-mentions-conditioning-on-a-collider/&source=gmail&ust=1570578022546000&usg=AFQjCNEcBFxqRFGZ7jSV2vc_3O6EN-kVQg" href="http://www.the100.ci/2017/03/14/that-one-weird-third-variable-problem-nobody-ever-mentions-conditioning-on-a-collider/" style="background-color: white; color: #1155cc;" target="_blank">blogpost</a> about it.<br />
<br />
<b>Conclusions</b><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Covariates and confounds are some of the most basic concepts underlying experimental design and analysis in psychology, yet they are surprisingly complicated to explain. Often the issues seem clear until it comes time to do the data analysis, at which point different assumptions lead to different default analytic strategies. I'm especially concerned that these strategies vary by culture, for example with some psychologists always conditioning on confounders, and others never doing so. (</span><a href="http://www.the100.ci/2019/10/03/indirect-effect-ex-machina/" style="font-family: inherit;">We haven't even talked about mediation and moderation</a><span style="font-family: inherit;">!). Hopefully this post has been useful in using the vocabulary of causal models to explain some of these issues. </span><br />
<span style="font-family: inherit;"><br /></span>
---<br />
* <span style="font-family: inherit;">The definitive resource on causal graphical models is <a href="http://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf">Pearl (2009)</a>. It's not easy going, but it's very important stuff.</span><span style="background-color: white; font-family: inherit;"> Even just starting to read it will strengthen your methods/stats muscles.</span><br />
<span style="background-color: white; font-family: inherit;">** Importantly, it's a lot like adding random effects to your model – you model sources of structure in your data so that you can better estimate the particular effects of interest. </span><br />
<span style="background-color: white; font-family: inherit;">*** The advice not to model covariates that aren't very correlated with your outcome is very frequentist, with the idea being that you lose power when you condition on too many things. In contrast, Gelman & Hill (2006) give more Bayesian advice: if you think a variable matters to your outcome, keep it in the model. This advice is consistent with the idea of modeling experimental covariates, even if they don't have a big correlation with the outcome. In the Bayesian framework, including this extra information should (maybe only marginally) improve your precision but you aren't "spending degrees of freedom" in the same way. </span>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com1tag:blogger.com,1999:blog-4297242917419089261.post-37837315784065996782019-07-23T20:58:00.001-07:002019-07-23T21:22:52.803-07:00An ethical duty for open science?Let's do a thought experiment. Imagine that you are the editor of a top-flight scientific journal. You are approached by a famous researcher who has developed a novel molecule that is a cure for a common disease, at least in a particular model organism. She would like to publish in your journal. Here's the catch: her proposed paper describes the molecule and asserts its curative properties. You are a specialist in this field, and she will personally show you any evidence that you need to convince you that she is correct – including allowing you to administer this molecule to an animal under your control and allowing you to verify that the molecule is indeed the one that she claims it is. But she will not put any of these details in the paper, which will contain only the factual assertion.<br />
<br />
Here's the question: should you publish the paper?<br />
<br />
<a name='more'></a><br />
If you publish it quickly, you will ensure that the molecule is known quickly and hence that translational research in humans will commence as soon as possible. This step will likely save many lives. In addition, the article is likely to be well-cited (since, as we have stipulated, it is correct). So publication should be assured, right?<br />
<br />
On the other hand, perhaps you share some reservations about publication. This paper doesn't look like a traditional scientific paper: it provides no methods or data; it only asserts a conclusion. There is no way for a reader to reproduce the experiments that led to the assertion, since no experiments are even mentioned. That doesn't feel like science. Maybe it's worthy of being published in the newspaper. But not in a scientific journal.<br />
<br />
Further, you might be worried about the precedent set by this individual decision. Isn’t this person saying that the editor is the sole arbiter of the work they’ve done? How will this work out in the hands of other editors? Also, who gets to write such an article? You paid attention to this person because she was already famous – but you probably wouldn’t have taken time to verify the work of an unknown scientist so it could be published in this way.<br />
<br />
I posted a version of this thought experiment as a twitter poll, and – with 1,025 respondents – saw only 6% recommending publication:<br />
<blockquote class="twitter-tweet">
<div dir="ltr" lang="en">
Thought experiment. You're editor at a journal. A famous scientist submits an important medical discovery. She proves to you that she's right (stipulation: she's right), but her paper will contain *only* the assertion of the discovery, no methods/data. Should you publish?</div>
— Michael C. Frank (@mcxfrank) <a href="https://twitter.com/mcxfrank/status/1153415839363690496?ref_src=twsrc%5Etfw">July 22, 2019</a></blockquote>
<script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
Even though this was a self-selected group of respondents, that’s as strong an intuition as you’d pretty much ever find.<br />
<br />
Remember, if we were purely utilitarian in our treatment of scientific reporting standards, this would be an obvious case: we should publish the paper. Perhaps we could make an argument about the long term utility of the precedent, but that’s an analysis of <i>future</i> rule-making, not the logic of this particular case. Thus, the intuition that the paper shouldn’t be published stems from something <i>other</i> than the immediate utility of the situation.<br />
<br />
Perhaps, like me, you think that the essence of science is verifiability – so if others can’t check your work, you are not contributing to science.<br />
<br />
This thought experiment demonstrates that we feel that scientists have a duty to report their methods and data to the community as part of reporting their findings. What is the nature of this duty? It is not based on the utility of any individual instance of publication. Is it a conventional norm that we can violate? In other words, is it like wearing pajamas to work – we don't happen to do that around here, but if we did it would be OK?<br />
<br />
Let's try a further thought experiment (based on <a href="https://files.eric.ed.gov/fulltext/ED142299.pdf">Nucci & Turiel, 1978</a>). Imagine now that there were a journal where people just published assertions, and didn't have to report methods or results. (In this further thought experiment, we don't stipulate that the assertions are correct). Would this still be a <i>real</i> scientific journal? I think the judgment is pretty clear that it wouldn't be. It would be an opinion magazine <i>on the topic of science</i> but it wouldn't be science. So the intuition that we shouldn't publish the assertion paper is not an intuition about social conventions or norms.<br />
<br />
Instead, the duty has the force of a moral or ethical norm, something that is in force regardless of what some particular community's norms are. In other words, more like stealing and hurting people – wrong pretty much always – than like wearing pajamas to work. Consider the idea of "pseudoscience": this is a word that refers precisely to communities or people who say they are doing science but are actually violating the principles of science!<br />
<br />
This ethical norm emerges from concerns about benefits to a broader community (the scientific enterprise as a whole) rather than from concerns for the individual researcher. And it feels tied up in concerns about fairness or universality as well. A scientist getting to publish because she's famous and maybe more likely to be believed (or perhaps even more likely to be right) doesn't feel like a fair way for science to work. You might even say that the norm of reporting information for verification and assessment of scientific findings is a deontological norm: one that is designed so that it can appropriately or fairly be held by the entire community.<br />
<br />
No one has to write scientific papers, of course, but if they choose to, they have to report sufficient information for another researcher to verify their conclusions. An assertion just won’t do.<br />
<br />
I’ll just end with a question. Why does the ethical duty to provide verification information stop at the conventional reporting standards of a scientific paper, which – as many people have observed – are insufficient for fully reproducing the data analysis or independently replicating the data collection?<br />
<br />
<i>(Thanks to Tom Hardwicke for discussion.)</i><br />
<div>
<br /></div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-48733526806343856272019-05-06T08:53:00.000-07:002019-05-06T08:53:32.773-07:00It's the random effects, stupid!<i>(tl;dr: wonky post on statistical modeling)</i><br />
<br />
I fit linear mixed effects models (LMMs) for most of the experimental data I collect. My data are typically repeated observations nested within subjects, and often have crossed effects of items as well; this means I need to account for this nesting and crossing structure when estimating the effects of various experimental manipulations. For the last ten years or so, I've been fitting these models in <span style="font-family: "courier new" , "courier" , monospace;">lme4</span> in R, a popular package that allows quick specification of complex models.<br />
<br />
One question that comes up frequently regarding these models is what random effects structure to include. I typically follow the advice of <a href="https://www.sciencedirect.com/science/article/pii/S0749596X12001180">Barr et al. (2013)</a>, who recommend "maximal" models – models that include random effects for all the fixed effects that have repeated observations within a random grouping factor. So for example, if you have observations for both conditions for each subject, fit random condition effects by subject. This approach contrasts, however, with the <a href="https://arxiv.org/abs/1506.04967">"parsimonious" approach of Bates et al.</a>,* who argue that such models can be over-parameterized relative to variability in the data. The issue of choosing an approach is further complicated by the fact that, in practice, <span style="font-family: "courier new" , "courier" , monospace;">lme4</span> can almost never fit a completely maximal model and instead returns convergence warnings. So then you have to make a bunch of (perhaps ad-hoc) decisions about what to prune or how to tweak the optimizer.<br />
<br />
Last year, responding to this discussion, I posted a blogpost that became surprisingly popular, <a href="http://babieslearninglanguage.blogspot.com/2018/02/mixed-effects-models-is-it-time-to-go.html">arguing for the adoption of Bayesian mixed effects models</a>. My rationale was not mainly that Bayesian models are interpretively superior – which they are, IMO – but just that they allow us to fit the random effect structure that we want without doing all that pruning business. Since then, we've published a few papers (e.g. <a href="https://psyarxiv.com/8p67h/">this one</a>) using Bayesian LMMs (mostly without anyone even noticing or commenting).**<br />
<br />
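Concretely, the swap is tiny. Here's a sketch using the brms package (about which more below); the dataset d is a hypothetical placeholder, with variable names echoing the ManyBabies model that follows:<br />
<pre>library(lme4)
library(brms)

# frequentist fit: the maximal structure often fails to converge
m_freq <- lmer(log_lt ~ trial_type + (trial_type | subid), data = d)

# Bayesian fit: same formula syntax, default priors, sampled via Stan
m_bayes <- brm(log_lt ~ trial_type + (trial_type | subid), data = d)</pre>
<br />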
In the meantime, I was working on the ManyBabies project. We finally completed data collection on our first study, a 60+ lab consortium study of babies' preference for infant-directed speech! This is exciting and big news, and I will post more about it shortly. But in the course of data analysis, we had to grapple with this same set of LMM issues. In our pre-registration (which, for what it's worth, was written before I really had tried the Bayesian methods), we said we would try to fit a maximal LMM with the following structure. It doesn't really matter what all the predictors are, but <span style="font-family: "courier new" , "courier" , monospace;">trial_type</span> is the key experimental manipulation:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><b>M1)</b> log_lt ~ trial_type * method +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> trial_type * trial_num +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> age_mo * trial_num +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> trial_type * age_mo * nae +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (trial_type * trial_num | subid) +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (trial_type * age_mo | lab) + </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (method * age_mo * nae | item)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
Of course, we knew this model would probably not converge. So we preregistered a pruning procedure, which we followed during data analysis, leaving us with:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><b>M2)</b> log_lt ~ trial_type * method +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> trial_type * trial_num +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> age_mo * trial_num +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> trial_type * age_mo * nae +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (trial_type | subid) +</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (trial_type | lab) + </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> (1 | item)</span><br />
<br />
We fit that model and report it in the (under review) paper, and we interpret the <i>p</i>-values as real <i>p</i>-values (well, as real as <i>p</i>-values can be anyway), because we are doing exactly the confirmatory thing we said we'd do. But in the back of my mind, I was wondering if we shouldn't have fit the whole thing with Bayesian inference and gotten the random effect structure that we hoped for.***<br />
<br />
So I did that. Using the amazing <span style="font-family: "courier new" , "courier" , monospace;">brms</span> package, all you need to do is replace "lmer" with "brm" (to get a default prior model with default inference).**** Fitting the full LMM on my MacBook Pro takes about 4hrs/chain with completely default parameters, so 16 hrs total – though if you do it in parallel you can fit all four at once. I fit M1 (the maximal model, called "bayes"), M2 (the pruned model, "bayes_pruned"), and for comparison the frequentist (also pruned, called "freq") model. Then I plotted coefficients and CIs against one another for comparison. There are three plots, corresponding to the three pairwise comparisons (brms M1 vs. lme4 M2, brms M1 vs. brms M2, and brms M2 vs. lme4 M2). (So as not to muddy the interpretive waters for ManyBabies, I'm just showing the coefficients without labels here). Here are the results.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHhbk7Ozmd4RMmj_6tBd3o7UjrbM7IfvHmK-29zkaYuxI_64pwJmN9OP45ksdrViLLVWKZ0z81grFV47fi-tug_1cHgZTAQWvcmSwVxIffuxjxZ5-RkCGANjgZ55uy8jb-DJUdyXJwUPgx/s1600/bayes.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="507" data-original-width="702" height="459" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHhbk7Ozmd4RMmj_6tBd3o7UjrbM7IfvHmK-29zkaYuxI_64pwJmN9OP45ksdrViLLVWKZ0z81grFV47fi-tug_1cHgZTAQWvcmSwVxIffuxjxZ5-RkCGANjgZ55uy8jb-DJUdyXJwUPgx/s640/bayes.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
As you can see, to a first approximation, there are not huge differences in coefficient magnitudes, which is good. But, inspecting the top row of plots, you can see that the full Bayesian M1 does have two coefficients that are different from both the Bayesian M2 <i>and</i> the frequentist M2. In other words, the fitting method didn't matter with this big dataset – but the random effects structure did! Further, if you dig into the confidence intervals, they are again similar between fitting methods but different between random effects structures. Here's a pairs plot of the correlation between upper CI limits (note that .00 here means a correlation of 1.00!):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQn251ULhkK96qP4NQDfrFgWSFAqCX0RJROsh6mRrxgmYa4GjaHvlKARFGU106iW0-uLHxKxASju3w2L9aKNMAKj-UXB-q3aOmc0-dUHGUbrcyw8TfpZI-ncohbHbxfOG1JHXJAoSQuhNW/s1600/bayes_upper_ci.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="507" data-original-width="702" height="460" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQn251ULhkK96qP4NQDfrFgWSFAqCX0RJROsh6mRrxgmYa4GjaHvlKARFGU106iW0-uLHxKxASju3w2L9aKNMAKj-UXB-q3aOmc0-dUHGUbrcyw8TfpZI-ncohbHbxfOG1JHXJAoSQuhNW/s640/bayes_upper_ci.png" width="640" /></a></div>
<br />
Not huge differences, but they track with random effect structure again, not with the fitting method.<br />
<br />
In sum, in one important practical case, we see that fitting the maximal model structure (rather than the maximal <i>convergent</i> model structure) seems to make a difference to model fit and interpretation. This evidence to me supports the Bayesian approach that I recommended in my prior post. I don't know that M1 is the <i>best</i> model – I'm trusting the "keep it maximal" recommendation on that point. But to the extent that I should be able to fit all the models I want to try, then using brms (even if it's slower) seems important. So I'm going to keep using this fitting procedure in the immediate future.<br />
<br />
----<br />
* This approach seems very promising, but also a bit tricky to implement. I have to admit, I am a bit lazy and it is really helpful when software provides a solution for fitting that I can share with people in my lab as standard practice. A collaborator and I tried someone else's implementation of parsimonious models and it completely failed, and then we gave up. If someone wants to try it on this dataset I'd be happy to share!<br />
<br />
** An aside: after I posted, Doug Bates kindly engaged and encouraged me to adopt Julia, rather than R, for model fitting, if it was fitting that I wanted and not Bayesian inference. We did experiment a bit with this, and Mika Braginsky wrote the j<a href="https://github.com/mikabr/jglmm">glmm package</a> to use Julia for fitting. This experiment resulted in her <a href="https://psyarxiv.com/cg6ah/">in-press paper</a> using Julia for model fits, but also with us recognizing that 1) Julia is TONS faster than R for big mixed models, which is a win, but 2) Julia can't fit some of the baroque random effects structures that we occasionally use, and 3) installing Julia and getting everything working is very non-trivial, meaning that it's hard to recommend for folks just getting started.<br />
<br />
** Jake Westfall, back in 2016 when we were planning the study, said we should do this, and I basically told him that I thought that developmental psychologists wouldn't agree to it. But I think he was probably right.<br />
<br />
*** Code for this post is <a href="https://gist.github.com/mcfrank/f20b11aa84fdb35de9d6fcc2d589468a">on github</a>.<br />
<br />
<h4 style="text-align: left;">A (mostly) positive framing of open science reforms</h4>
I don't often get the chance to talk directly and openly to people who are skeptical of the methodological reforms that are being suggested in psychology. But recently I've been trying to persuade someone I really respect that these reforms are warranted. It's a challenge, but one of the things I've been trying to do is give a positive, personal framing to the issues. Here's a stab at that.
<br />
<br />
My hope is that a new graduate student in the fields I work on – language learning, social development, psycholinguistics, cognitive science more broadly – can pick up a journal and choose a seemingly strong study, implement it in my lab, and move forward with it as the basis for a new study. But unfortunately my experience is that this has not been the case much of the time, even in cases where it should be. I would like to change that, starting with my own work.<br />
<br />
Here's one example of this kind of failure: When I was a first-year assistant professor, a grad student and I tried to replicate a well-known study by one of my grad school advisors. We failed repeatedly – despite the fact that we ended up thinking the finding was real (eventually published as <a href="http://langcog.stanford.edu/papers_new/lewis-2016-jepg.pdf">Lewis & Frank, 2016</a>, JEP:G). The issue was likely that the original finding was an overestimate of the effect, because the original sample was very small. But converging on the truth was very difficult and required multiple iterations.<br />
<br />
<a name='more'></a><br />
This kind of thing happens to me quite a lot. I run a class in which first-year PhD students in my department try to replicate the published literature, often articles from Psych Science and other top journals. I've blogged about this course (e.g., <a href="http://babieslearninglanguage.blogspot.com/2018/12/how-to-run-study-that-doesnt-replicate.html">here</a>) and published on outcomes from it as well (<a href="https://psyarxiv.com/p73he/">Hawkins, Smith et al., 2018</a>, AMPPS). More than half of the time, these replication studies fail, roughly consistent with estimates from larger meta-science projects like RPP and the more recent (and higher-quality) ManyLabs 2 and Social Science Replication projects.<br />
<br />
The reasons for this failure are not always clear, and we don't always do the extensive followup work necessary to "debug" the experiment. But over time I have tried to identify a number of reasons for failures and use them as guides to the way I run my lab and provide methodological training for students. I also have advocated for journals and funders to adopt these reforms. Most are about transparency, and some are about good design practices. These reforms have been a win-win for my lab. They improve the clarity, impact, and validity of our work – mostly while speeding things up! Here they are.<br />
<div>
<br /></div>
<b>Share code and data</b>. Several studies, including ours (<a href="https://osf.io/preprints/metaarxiv/39cfb/">Hardwicke et al., 2018</a>, Royal Soc Open Science) show that MOST published journal articles contain some statistical errors, ranging from the trivial to the extreme. In reproducing the analytic calculations from a number of prominent papers (which would only be possible through data sharing), we have found major errors requiring correction in quite a few. Creating clear sharing pipelines leads to cleaner, easier-to-check papers.<br />
<br />
<b>Use a reproducible workflow</b>. Technical tools like git, RMarkdown, and Jupyter help students and other researchers report results whose provenance and relationship to the underlying data are known. These tools also speed up research dramatically, letting you share and reuse code much more effectively and auto-generate tables, graphs, and other elements of reports. They also decrease copy/paste errors in reporting! And for me as a PI, I love being able to "audit" the work that folks in my lab do, and to quickly and easily pull in figures, data, or other excerpts from github when I need to add them to a talk.<br />
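As a toy illustration (not my actual workflow), here's the kind of RMarkdown snippet that keeps reported statistics chained to the raw data – when the data change, the numbers in the text change with them:
<pre>
```{r}
# computed from the raw data at knit time ("data/exp1.csv" is a made-up path)
d <- read.csv("data/exp1.csv")
test <- t.test(rt ~ condition, data = d)
```

Participants were faster in the primed condition
(t = `r round(test$statistic, 2)`, p = `r round(test$p.value, 3)`).
</pre>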
<br />
<b>Preregister</b>. Everything in my lab is preregistered. All this means is that people in my lab need to write down what they are going to do (sample size, main analysis) before they do it. <a data-saferedirecturl="https://www.google.com/url?q=https://docs.google.com/document/d/1hQxvrrcVQUKkyjaotTPVaGeGD-6j9oLM90EDQCQ56AM/edit?usp%3Dsharing&source=gmail&ust=1554506197301000&usg=AFQjCNGe2BAK_NDD98Anz6xe-oeVbhq8_A" href="https://docs.google.com/document/d/1hQxvrrcVQUKkyjaotTPVaGeGD-6j9oLM90EDQCQ56AM/edit?usp=sharing" style="color: #1155cc;" target="_blank">Here's a sample</a>. If we have talked things through enough, writing the registration often takes 30 minutes; of course for more complex projects, more thought is needed (and it's a good thing to do that thinking ahead of time!). This process is not binding – we routinely violate our registration, and report our violation – and takes very little time. It just makes us transparently report what we knew <i>before</i> doing the study. As an added bonus, if you care about p < .05 results (I mostly don't), these are really only valid in the case of a preregistered hypothesis. There's what I think is a pretty good explanation of this perspective in our transparency guide from last year (<a data-saferedirecturl="https://www.google.com/url?q=https://www.collabra.org/article/10.1525/collabra.158/&source=gmail&ust=1554506197301000&usg=AFQjCNH-DUd1bHsQouKvLU6dPHyqTzCT6w" href="https://www.collabra.org/article/10.1525/collabra.158/" style="color: #1155cc;" target="_blank">Klein et al., 2018</a>, Collabra).<br />
<br />
<b>Follow best practices in experimental design</b>. That means thinking about reliability and validity, and using a psychometric perspective (e.g., including sampling multiple experimental items). It also means planning a sample size that is sufficient to get precise enough measures to make quantitative predictions. There is a huge body of knowledge about how to do good experiments from Rosenthal and Rosnow onward – but often we rely on lab lore and implicit learning.<br />
<br />
In sum, my worries about the literature have led me to a set of practices that – I think – have enhanced the research we do and made it more reproducible and replicable, while not slowing us down or making our workflow more onerous.<br />
<br />
<h4 style="text-align: left;">Nothing in childhood makes sense except in the light of continuous developmental change</h4>
I'm awestruck by the processes of development that operate over children's first five years. My daughter M is five and my newborn son J is just a bit more than a month old. J can't yet consistently hold his head up, and he makes mistakes even in bottle feeding – sometimes he continues to suck but forgets to swallow, so that milk pours out of his mouth until his clothes are soaked. I remember this kind of thing happening with M as a baby ... and yet voila, five years later, you have someone who is <a href="https://twitter.com/mcxfrank/status/1055857077707362305">writing text messages to grandma</a> and <a href="https://twitter.com/mcxfrank/status/1091120221803360261">illustrating new stories about Spiderman</a>. How could you possibly get from A to B (or in my case, from J to M)? The immensity of this transition is perhaps the single most important challenge for theories of child development.<br />
<br />
As a field, we have bounced back and forth between continuity and discontinuity theories to explain these changes. Continuity theories posit that infants' starting state is related to our end state, and that changes are gradual, not saltatory; discontinuity theories posit stage-like transitions. Behaviorist learning theory was fundamentally a continuity hypothesis – the same learning mechanisms (plus experience) underlie all of behavior, and change is gradual. In contrast, Piagetian stage theory was fundamentally about explaining behavioral discontinuities. As the pendulum swung, we got core knowledge theory, a continuity theory: innate foundations are "revised but not overthrown" (paraphrasing <a href="https://psycnet.apa.org/fulltext/1993-05134-001.html">Spelke et al. 1992</a>). Gopnik and Wellman's "<a href="http://alisongopnik.com/Papers_Alison/ChomskyFinal.pdf">Theory theory</a>" is a discontinuity theory: intuitive theories of domains like biology or causality are discovered like scientific theories. And so on.<br />
<br />
For what it's worth, my take on the "modern synthesis" in developmental psychology is that development is <a href="https://mitpress.mit.edu/books/beyond-modularity">domain-specific</a>. Domains of development – perception, language, social cognition, etc. – progress on their own timelines, determined by experience, maturation, and other constraining factors. And my best guess is that some domains develop continuously (especially motor and perceptual domains) while others, typically more "conceptual" ones, show more saltatory progress associated with stage changes. But – even though it would be really cool to be able to show this – I don't think we have the data to do so.<br />
<br />
<b>The problem is that we are not thinking about – or measuring – development appropriately.</b> As a result, what we end up with is a theoretical mush. We talk as though everything is discrete, but that's mostly a function of our measurement methods. Instead, everything is at rock bottom continuous, and the question is how steep the changes are.<br />
<br />
<b>We talk as though everything is discontinuous all the time. </b>The way we know how to describe development verbally is through what I call "milestone language." We discuss developmental transitions by (often helpful) age anchors, like "children say their first word around their first birthday," or "preschoolers pass the Sally-Ann task at around 3.5 years." When summarizing a study, we* assert that "by 7 months, babies can segment words from fluent speech," even if we know that this statement describes the fact that the mean performance of a group is significantly different from zero in a particular paradigm instantiating this ability, and even if we know that babies might show this behavior a month earlier if you tested enough of them! But it's a lot harder to say "early word production emerges gradually from 10–14 months (in most children)."<br />
<br />
Beyond practicalities, one reason we use milestone language is that our measurement methods are only set up to measure discontinuities. First, our methods have poor reliability: we typically don't learn very much about any one child, so we can't say conclusively whether they truly show some behavior or not. In addition, <a href="http://doi.org/10.1111/cdev.13079">most developmental studies are severely underpowered</a>, just like <a href="https://rdcu.be/bkajT">most studies in neuroscience and psychology in general</a>. So our estimates of a behavior, even for groups of children, are noisy. To get around this problem, we use null hypothesis significance tests – and when the result is p < .05, we declare that development has happened. But of course we will see discrete changes in development if we use a discrete statistical cutoff!<br />
<br />
And finally, we tend to stratify our samples into discrete age bins (which is a good way to get coverage), e.g. recruiting 3-month-olds, 5-month-olds, and 7-month-olds for a study. But then, we use these discrete samples as three separate analytic groups, ignoring the continuous developmental variation between them! This practice reduces statistical power substantially, <a href="https://www.jstor.org/stable/30038865">much like taking median splits on continuous variables</a> (taking a median split on average is like <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1458573/">throwing away a third of your sample</a>!). In sum, even in domains where development is continuous, our methods guarantee that we get binary outcomes. We don't try to estimate continuous functions, even when our data afford them.<br />
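To see the cost, here's a quick simulation sketch (all the numbers are invented): the same gradual developmental change, analyzed once with age as a continuous predictor and once with age cut into three discrete groups.
<pre>
# Power to detect continuous developmental change under two analyses.
set.seed(1)
sim_once <- function(n = 60) {
  age <- runif(n, 3, 7)                    # continuous ages
  y <- 0.3 * age + rnorm(n)                # gradual change plus noise
  p_cont <- summary(lm(y ~ age))$coefficients["age", 4]
  p_bin <- anova(lm(y ~ cut(age, 3)))[["Pr(>F)"]][1]
  c(continuous = p_cont < .05, binned = p_bin < .05)
}
rowMeans(replicate(2000, sim_once()))      # proportion of significant runs
</pre>
In simulations like this one, the binned analysis detects the (real, continuous) change noticeably less often than the continuous one.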
<div>
<br /></div>
<b>The truth is, when you scratch the surface in development, everything changes continuously.</b> Even the stuff that's not supposed to change still changes. I saw this in one of my very first studies, when I was a lab manager for <a href="https://www.psych.ucla.edu/faculty/page/scott.johnson">Scott Johnson</a> and we accidentally found ourselves <a href="http://langcog.stanford.edu/papers/FVJ-cognition.pdf">measuring 3- to 9-month-olds' face preferences</a>. Though I had learned from the literature that <a href="https://www.ncbi.nlm.nih.gov/pubmed/1786670">infants had an innate face bias</a>, I was surprised to find that the magnitude of face looking was changing dramatically across the range I was measuring. (Later we found that this change was <a href="http://langcog.stanford.edu/papers/FAJ-JECP2014.pdf">related to the development of other visual orienting skills</a>). Of course "it's not surprising" that some complex behavior goes up with development, says reviewer 3. But it is <i>important</i>, and the ways we talk about and analyze our data don't reflect the importance of quantifying continuous developmental change.<br />
<br />
One reason that it's not surprising to see developmental change is that everything that children do is at its heart a skill. Sucking and swallowing is a skill. Walking is a skill. Recognizing objects is a skill. Recognizing words is a skill too – <a href="https://doi.org/10.1016/j.cobeha.2018.04.001">so too is the rest of language</a>, at least according to some folks. Thinking about other people's thoughts is a skill. So that means that everything gets better with practice. It will – to a first approximation – follow a classic logistic curve like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkAX6hHKVBIjsoKGS1EymuynC92fIa-nwTFr_DE4hQ022F_UnEQeyXhutS6czwEXisy3CjahbRzVajFW_jSaHi08PVsdIC3DNRmTF3YI3hXBnyNvH1BgzG711XPUhsRKuxFDq4O48ZEBqS/s1600/download.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="961" data-original-width="1344" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkAX6hHKVBIjsoKGS1EymuynC92fIa-nwTFr_DE4hQ022F_UnEQeyXhutS6czwEXisy3CjahbRzVajFW_jSaHi08PVsdIC3DNRmTF3YI3hXBnyNvH1BgzG711XPUhsRKuxFDq4O48ZEBqS/s400/download.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Most skills get better with practice, and the ones described above are no exception. But developmental progress also happens in the absence of practice, due to physiological maturation – older children's brains process information faster and more accurately, even for skills that haven't been practiced. So samples of this behavior should look like these red lines:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLgBbkHjWnDUJ-Xe7JCUHDfuRY3-xs1RmmFo1Ys38DHOstk55GhkGdk63wVVyvz3VYqjBayBHwzIexvxXYQ8zuriqCwQqyAS01_wkvlvjVt8NP0LyJm21GIBhalRiiyV8bMKSDYr36GjGS/s1600/download-1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="961" data-original-width="1344" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLgBbkHjWnDUJ-Xe7JCUHDfuRY3-xs1RmmFo1Ys38DHOstk55GhkGdk63wVVyvz3VYqjBayBHwzIexvxXYQ8zuriqCwQqyAS01_wkvlvjVt8NP0LyJm21GIBhalRiiyV8bMKSDYr36GjGS/s400/download-1.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
But here's the problem. If you have a complex behavior, it's built of simple behaviors, which are themselves skills. To get the probability of success on one of those complex skills, you can – as a first approximation – multiply together the probabilities of success for each of the components (assuming they're independent). That process yields logistic curves that look like these (color indicating the number of components):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqNdobfBBNXxRhjND6eNXlhQTXTHUsbEoMlMGZgdYyl6uSr6pFkSovbXh4b891KYsg_QTIuQOQXacegxOuEsufTuQ0cR4HZ3UOrkTaBZVyKWw7VFEyzILOv12cd7c_EiBIifudP0vJL97y/s1600/download-2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="961" data-original-width="1344" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqNdobfBBNXxRhjND6eNXlhQTXTHUsbEoMlMGZgdYyl6uSr6pFkSovbXh4b891KYsg_QTIuQOQXacegxOuEsufTuQ0cR4HZ3UOrkTaBZVyKWw7VFEyzILOv12cd7c_EiBIifudP0vJL97y/s400/download-2.png" width="400" /></a></div>
<br />
And samples from a process with many components look even more discrete, because the logistic is steeper!<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXIg6di7UVEtzKuKYmF_ZMI-PDemm8hQeAEj1P83CFQJjZeTiPYgnjKTSkVBzMA5u5DjXod8i3U_1YEx9jy9m_oF3H09MsbkG6ewMhSWLuX6TKACC5xVvpA8qnF-YcjlA66G35rpASrwIH/s1600/download-3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="961" data-original-width="1344" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXIg6di7UVEtzKuKYmF_ZMI-PDemm8hQeAEj1P83CFQJjZeTiPYgnjKTSkVBzMA5u5DjXod8i3U_1YEx9jy9m_oF3H09MsbkG6ewMhSWLuX6TKACC5xVvpA8qnF-YcjlA66G35rpASrwIH/s400/download-3.png" width="400" /></a></div>
<br />
<br />
Given this kind of perspective, we should expect complex behaviors to emerge relatively suddenly, even if they are simply the product of a handful of continuously changing processes.<br />
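Here's a small sketch of the simulation behind these intuitions (my illustration, with arbitrary parameters – nothing here is fit to data):
<pre>
# Each component skill improves logistically with age; a complex behavior
# succeeds only if all of its (independent) components succeed.
logistic <- function(age, mid = 2.5, slope = 2) {
  1 / (1 + exp(-slope * (age - mid)))
}
ages <- seq(0, 5, by = 0.01)
plot(ages, logistic(ages), type = "l", ylim = c(0, 1),
     xlab = "age (years)", ylab = "p(success)")
for (k in 2:5) {
  lines(ages, logistic(ages)^k, col = k)   # k components: multiply
}
</pre>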
<br />
This means, from a theoretical standpoint, we need stronger baselines. Our typical baseline at the moment is the null hypothesis of no difference; but that's a terrible baseline! Instead, we need to be comparing to a null hypothesis of "developmental business as usual." To show discontinuity, we need to take into account the continuous changes that a particular behavior will inevitably be undergoing. And then, we need to argue that the <i>rate</i> of developmental change that a particular process is undergoing is faster than we should expect based on simple learning of that skill. Of course to make these kinds of inferences requires far more data about individuals than we usually gather.<br />
<div>
<br /></div>
<div>
In a conference paper that I'm still quite proud of, <a href="http://langcog.stanford.edu/papers_new/frank-2016-cogsci.pdf">we tried to create this sort of baseline for early word learning</a>. Arguably, early word learning is a domain where there likely aren't huge, discontinuous changes – instead kids gradually get faster and more accurate in learning new words until they are learning several new words per day. We used meta-analysis to estimate developmental increases in two component processes of novel word mapping: auditory word recognition and social cue following. Both of these got faster and more accurate over the first couple of years. When we put these increases together, we found that together they created really substantial changes in how much input would be needed to map a new word. (Of course what we haven't done in the three years since we wrote that paper is actually measure the parameters on the process of word mapping developmentally – maybe that's for a subsequent ManyBabies study...). Overall, this baseline suggests that even in the absence of discontinuity, continuous changes in many small processes can produce dramatic developmental differences.<br />
<br />
In sum: sometimes developmental psychologists don't take the process of developmental change seriously enough. To do better, we need to start analyzing change continuously; measuring with sufficient precision to estimate rates of change; and creating better continuous baselines before we make claims about discrete change or emergence. </div>
<br />
---<br />
* I definitely do this too!<br />
<br />
<h4 style="text-align: left;">How to run a study that doesn't replicate, experimental design edition</h4>
<i>(tl;dr: Design features of psychology studies to avoid if you want to run a good study!)</i><br />
<br />
Imagine reading about a psychology experiment in which participants are randomly assigned to one of two different short state inductions (say by writing a passage or unscrambling sentences), and then outcomes are measured via a question about an experimental vignette. The whole thing takes place in about 10 minutes and is administered through a survey, perhaps via <a href="https://www.qualtrics.com/">Qualtrics</a>.<br />
<br />
The argument of this post is that <u>this experiment has a low probability of replicating</u>, and we can make that judgment purely from the experimental methods – regardless of the construct being measured, the content of the state induction, or the judgment that is elicited. Here's why I think so.<br />
<div>
<br /></div>
Friday was the last day of my graduate class in experimental methods. The centerpiece of the course is a replication project in which each student collects data on a new instantiation of a published experiment. I love teaching this course and have <a href="http://babieslearninglanguage.blogspot.com/2015/03/estimating-preplication-in-practical.html">blogged before about outcomes from it</a>. I've also written several journal articles about student replication in this model (<a href="http://langcog.stanford.edu/papers/FS-POPS2012.pdf">Frank & Saxe, 2012</a>; <a href="https://osf.io/8t2x5/">Hawkins*, Smith*, et al., 2018</a>). In brief, I think this is a really fun way for students to learn about experimental design and data analysis, open science methods, and the importance of replication in psychology. Further, the projects in my course are generally pretty high quality: they are pre-registered confirmatory tests with decent statistical power, and both the paradigm and the data analysis go through multiple rounds of review by the TAs and me (and sometimes also get feedback from the original authors).<br />
<br />
Every year I rate each student project on its replication outcomes. The scale is from 0 to 1, with intermediate values indicating unclear results or partial patterns of replication (e.g., significant key test but different qualitative interpretation). The outcomes from the student projects this year were very disappointing. With 16/19 student projects finished, we have an average replication rate of .31. There were only 4 clear successes, 2 intermediate results, and 10 failures. Samples are small every year, but this rate was even lower than we saw in previous samples (2014–15: .57, N=38) and another one-year sample (2016: .55, N=11).<br />
<br />
What happened? Many of the original experiments followed part or all of the schema described above, with a state induction followed by a question about a vignette. <u>In other words, they were poorly designed.</u><br />
<br />
<a name='more'></a><br />
There's now a <a href="https://www.theatlantic.com/science/archive/2018/08/scientists-can-collectively-sense-which-psychology-studies-are-weak/568630/">strong meta-scientific literature</a> suggesting that prediction markets can accurately guess which studies will not replicate. Some of this effect is likely due to general plausibility of study results – the general correlation of prior and posterior probabilities of effects. There are also general statistical predictors of failures to replicate – small samples, small effect sizes, and p-values relatively close to the .05 boundary. Over the past 5-6 years, the community has received a real education about these issues. In my class, we try to spot effects with these sorts of issues and sometimes now ask students not to select projects with statistical red flags. Further, within the constraints of our class budget (which is limited), we try to recruit decent sample sizes.*<br />
<br />
This year, however, I think experimental design was the culprit for many of our failed replications. Further, I suspect that many of the prediction markets are picking up on problematic design features as well as the statistical issues mentioned above. Here are the experimental design features that appear – both in my experience and, in some cases, in the broader literature – to be related to replication failure. These "negative features" shape my defaults about how to design a study.<br />
<br />
<b>Single-question DVs. </b>Psychological measurements are noisy. If you have high noise, you will have low signal to detect the effect of even a strong manipulation. One way to reduce noise is to measure many times and combine those measurements. Papers that fail to take advantage of this strategy dramatically reduce their ability to find effects of their manipulation. Yet it is striking how many of the findings we look at have a single "key question" that is supposed to detect their manipulation. From an <a href="https://en.wikipedia.org/wiki/Item_response_theory">item response theory</a> perspective, even if you found the perfect item (optimal discrimination) for a particular population, that item is still likely to be suboptimal and yield under-informative estimates about other populations. This means that your design is unlikely to be replicable in a different context, just because your item isn't designed to measure people in that context.<br />
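The payoff of combining measurements is easy to see with the Spearman-Brown prophecy formula. Here's a sketch, assuming a made-up single-item reliability of .3:
<pre>
# Reliability of a k-item composite given the reliability of a single item.
spearman_brown <- function(k, r_item) {
  k * r_item / (1 + (k - 1) * r_item)
}
round(sapply(c(1, 4, 8, 16), spearman_brown, r_item = .3), 2)
# 0.30 0.63 0.77 0.87 -- one question leaves most of the variance as noise
</pre>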
<div>
<br /></div>
<div>
<b>Single-item manipulations. </b>The counterpart to single-question DVs is single-item manipulations, e.g. instantiations of a particular theoretical contrast in a particular experimental vignette or stimulus. Even if an effect induced via a particular item is replicable, it is likely not easily generalizable to a larger population of experimental items (as has been noted since <a href="https://web.stanford.edu/~clark/1970s/Clark,%20H.H.%20_Language%20as%20fixed%20effect%20fallacy_%201973.pdf">Clark, 1973</a>). But in addition, if you have only a single stimulus of interest, the chance of variation in response to this stimulus – due to sample differences including demographic variation or overall cohort change – is very high; this is exactly the same point as is made above about the DV, now made about the IV. Further, there is a substantial threat to internal validity if this stimulus is used by any other psychologists (as frequently happens with popular tasks – <a href="https://doi.org/10.1038/ncomms1442">e.g., the prisoner's dilemma</a>).</div>
<div>
<b><br /></b></div>
<b>Between-subjects designs. </b>Variation between people is a huge source of the total variation in psychological measurements. By subtracting out this variance, within-subjects designs dramatically decrease the variance in the measurement of some manipulation. As a result, between-subjects effects tend to replicate less (unless their original samples were really huge). This effect shows up in the original <a href="http://science.sciencemag.org/content/349/6251/aac4716">OSC 2015</a> replication sample, and it also shows up in our previous class sample. In our 16-project sample so far this year, the replication rate for the between-subjects experiments was .21 (2.5 successes out of 12), vs. .625 (2.5 out of 4) for the within-subjects ones.<br />
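Here's a toy simulation of why (all numbers invented): stable person-to-person differences cancel out of a within-subjects difference score, but stay in the error term of a between-subjects comparison.
<pre>
# Same effect, same n, measured between- vs. within-subjects.
set.seed(1)
n <- 30
effect <- 0.5
trait <- rnorm(n, sd = 2)    # large stable differences between people

# within-subjects: each person does both conditions, so the trait
# term subtracts out of the difference score
diffs <- (trait + effect + rnorm(n)) - (trait + rnorm(n))
t.test(diffs)

# between-subjects: different people per condition, so the trait
# variance stays in the noise
g1 <- rnorm(n, sd = 2) + effect + rnorm(n)
g2 <- rnorm(n, sd = 2) + rnorm(n)
t.test(g1, g2)
</pre>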
<br />
<b>Short state induction manipulations.</b> It's hard to change people's state in a significant way during a very short experiment, at least given the tools available to ethical psychologists. If you want to make someone feel powerful, or greedy, or afraid, or anxious, there's only so much you can do by showing them images on a computer screen, making them read words, or making them reflect on their experience by writing a short paragraph. And if you make even a moderate change to someone's state, they are extremely likely to reflect on this experience in the context of the experiment in some very substantial ways (see Task demands, below). It's hard, but probably not impossible, to do these kinds of manipulations right; there are likely manipulations of this type that can and do work.** But think about the counterfactual world where experimenters really could push people's feelings around quite flexibly and easily – we'd be constantly bent to our environment, pushed one way or the other by the precise stimuli we came into contact with, with the attendant policy implications (Hal Pashler and Andrew Gelman have both made this point previously in several different ways).<br />
<br />
<b>Task demands. </b>When I was an undergraduate, my girlfriend – now my wife – and I used to walk over to the business school and do experimental studies for fun (they paid better than psychology). After we were done, we'd walk out and compare notes on what the point of each study was, as well as what condition we thought we were in. MTurk workers are just the same – probably better, because many of them have done more studies. Participants will be thinking about what your study is about, and reacting based on some complex combination of that guess (correct or not) and their desired self-presentation and feelings about that goal. It is remarkable how many studies do not consider this issue. Two-stage studies like the one I described at the beginning are extremely vulnerable to this kind of reasoning: if your survey consists only of a state induction and a vignette, it is a guarantee that people will read the two together and then think about the connection. Hmm, I wonder what my feeling of powerlessness has to do with my reading about moral judgements? I wonder what reading a news article about the environment has to do with my judgements about future planning? This kind of design (especially without a good cover story) is a recipe for including participants' interpretive thinking in your pattern of results. Yet most of these paradigms do not even include strategies like a funnel debrief to detect such issues.***<br />
<br />
<b>No manipulation checks. </b>Manipulation checks are tricky in state-induction experiments. Because they often directly refer to the construct of interest ("how powerful do you feel?") they can increase task demands and explicit reasoning. They are also often single items themselves, and aren't necessarily psychometrically valid measures of the precise construct of interest. That said: without a manipulation check, if your experiment fails in the type of design we're considering, there is typically no signal for understanding what went wrong. In classic perception, memory, and learning experiments there are usually correct answers, allowing the experimenter to think about whether participants understood the task and were at floor or at ceiling in their performance. In contrast, in judgement studies of the type I'm writing about here, there is not typically any calibration of the measurement. In many experiments without manipulation checks, there is no signal (beyond a difference on the key DV) that allows experimenters (or readers) to verify that the participants understood the materials and were affected by the manipulation.<br />
<br />
<div>
A subtitle to this post could well be "revenge of the psychometricians." (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2779444/">They already attacked us once</a>). Many of the problematic practices I see come down to poor measurement: single items for the DV measure, single items for the IV manipulation, lack of within-subjects design. All of these are places where experimenters can reduce measurement variation in easy ways. It is not that experiments like the one I've described here are impossible to do right, or that they <i>never</i> replicate. (<a href="https://osf.io/wx7ck/">ManyLabs 1</a> and <a href="https://osf.io/8cd4r/">ManyLabs 2</a> each included both replicable and non-replicable examples of such experiments.) It's that there are so many lost opportunities to do better. </div>
<div>
<br /></div>
<div>
---</div>
<div>
* We probably don't have the power to detect small effects in the cases where the authors initially reported large ones, however.</div>
<div>
** Some good ones likely take advantage of apparent task demands to cause deeper reasoning about the state induction. </div>
<div>
*** Surprisingly I couldn't find a good description of this strategy online. In brief, ask successively more specific questions to try to elicit how much participants knew about the manipulation, e.g. "what did you think this experiment was about? what did you think about the other person in the experiment? did you notice anything odd about him? did you know he was a confederate?"<br />
<br />
[Correction: w/in subjects designs <i>decrease</i> variance, thanks Yoel Sanchez-Araujo]</div>
<h4 style="text-align: left;">Scale construction, continued</h4>
For psychometrics fans: I helped out with a post by Brent Roberts, "<a href="https://pigee.wordpress.com/2018/09/07/yes-or-no-2-0-are-likert-scales-always-preferable-to-dichotomous-rating-scales/">Yes or No 2.0: Are Likert scales always preferable to dichotomous rating scales?</a>" This post is a continuation of <a href="http://babieslearninglanguage.blogspot.com/2015/11/a-conversation-on-scale-construction.html">our earlier conversation on scale construction</a> and continues to examine the question of whether – and if so, when – it's appropriate to use a Likert scale vs. a dichotomous scale. Spoiler: in some circumstances it's totally safe, while in others it is a disaster!<br />
<br />
<h4 style="text-align: left;">Three (different) questions about development</h4>
<i>(tl;dr: Some questions I'm thinking about, inspired by the idea of studying the broad structure of child development through larger-scale datasets.)</i><br />
<br />
My daughter, M, started kindergarten this month. I began this blog when I was on paternity leave after she was born; watching her grow over these past five years has been an adventure, and a revolution for my understanding of development.* Perhaps the most astonishing feature of the experience is how continuous, incremental changes lead to what seem like qualitative revolutions. There is of course no moment in which she became the sort of person she is now: the kind of person who can tell a story about an adventure in which two imaginary characters encounter one another for the first time,** but some set of processes led us to this point. How do you uncover the psychological factors that contribute to this kind of growth and change?<br />
<br />
My lab does two kinds of research. In both my hope is to contribute to this kind of understanding by studying the development of cognition and language in early childhood. The first kind of work we do is to conduct series of experiments with adults and children, usually aimed at getting answers to questions about representation and mechanism in early language learning in social contexts. The second kind of work is a larger-scale type of resource-building, where we create datasets and accompanying tools like <a href="http://wordbank.stanford.edu/">Wordbank</a>, <a href="http://metalab.stanford.edu/">MetaLab</a>, and <a href="http://childes-db.stanford.edu/">childes-db</a>. The goal of this work is to make larger datasets accessible for analysis – as testbeds for reproducibility and theory-building.<br />
<br />
Each of these activities connects to the project of understanding development at the scale of an entire person's growth and change. In the case of small-scale language learning experiments, the inference strategy is pretty standard. We hypothesize the operation of some mechanism or the utility of some information source in a particular learning problem (say, <a href="http://langcog.stanford.edu/papers/FG-cogpsych2014.pdf">the utility of pragmatic inference in word learning</a>). Then we carry out a series of experiments that shows a proof of concept that children can use the hypothesized mechanism to learn something in a lab situation, along with control studies that rule out other possibilities. When done well, these studies can give you pretty good traction on individual learning mechanisms. But they can't tell you that these mechanisms are used by children consistently (or even at all) in their actual language learning.<br />
<br />
In contrast, when we work with large-scale datasets, we get a whole-child picture that isn't available in the small studies. In our Wordbank work, for example, we get a global picture of the child's vocabulary and linguistic abilities, for many children across many languages. The trouble is, it's very hard or even impossible to find answers to smaller-scale questions (say, about <a href="http://langcog.stanford.edu/papers_new/macdonald-2018-cogsci.pdf">information seeking from social partners</a>) in datasets that represent global snapshots of children's experience or outcomes. Both methods – the large-scale and the small-scale – are great. The problem is that the questions don't necessarily line up. Instead, larger datasets tend to direct you towards different questions. Here are three.<br />
<a name='more'></a><br />
<b>1. How do you connect small mechanisms to big changes?</b><br />
<br />
An individual child's vocabulary is made up of hundreds or thousands of individual words, each of which has its own natural history – how and when it was learned, what information was used, what inferences were made. For example, M figured out that "parchment" is a kind of paper because Harry Potter was always writing on it. But this is true for any other piece of knowledge (or for that matter, any other skill) as well – it has its own learning history that is contributed to in different ways and to different extents by particular processes and experiences. These individual contributions are typically the object of study for small-scale experimental studies, but in larger-scale observations we only see the result of these – the accreted strata of experience as fossilized by learning.<br />
<br />
The problem is that paleontology in this situation isn't straightforward. We don't have a good sense of what it would look like if words – or for that matter, any other kind of skill or knowledge – were learned exclusively via a particular route. The best work of this type that I know about is a slightly esoteric but cool line of computational investigations of word learning (<a href="https://www.ncbi.nlm.nih.gov/pubmed/21564227">example 1</a>, <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1551-6709.2009.01071.x">example 2</a>) that ask what vocabularies look like – in terms of their growth, composition, and learning times – under different assumptions about the mechanisms in operation.<br />
<br />
Relatively little work has tried to connect this kind of theorizing to empirical datasets, however. In <a href="https://psyarxiv.com/cg6ah/">one very recent preprint</a> we've tried to take a first step in this direction by asking about the effects of different predictors on the composition of children's early vocabulary (e.g., does word frequency in the input predict which words are learned earlier, or does conceptual concreteness predict better?). But lots of work is still needed to connect actual mechanistic proposals about in-the-moment learning mechanisms to larger-scale datasets that characterize what children's knowledge looks like.<br />
<br />
Even if you have proposals about learning mechanisms, how do you verify that they add up to the kind of child you see in the aggregate measures?<br />
<br />
<b>2. Does development mostly hang together or is it many different things?</b><br />
<br />
Piaget's developmental theorizing offered at least two things. The first is an account of how knowledge grows and changes – the relationship between assimilation and accommodation. This account feels very modern to me, as I wrote about a while back ("<a href="http://babieslearninglanguage.blogspot.com/2016/04/was-piaget-bayesian.html">Was Piaget a Bayesian?</a>"). The other part of the story was an elaborate theory about global, stage-based transitions in children's development. This second part, the stage theory – while on the whole still taught and tested in textbooks of developmental psychology – has fallen into disrepute in terms of its empirical validity. My favorite critique is <a href="http://internal.psychology.illinois.edu/infantlab/articles/gelman_baillargeon_1983.pdf">Gelman & Baillargeon (1983)</a>. But the particular stages posited by Piaget don't need to be right for us to consider the factor structure of development more broadly.<br />
<br />
Another way of looking at this. My grandmother (who worked as a research assistant at the Yale Child Studies Center in the 50s and 60s) apparently used to say that kids "either walk or talk," meaning that they would achieve one milestone or the other first. This is a multi-factorial view of development, in which language and locomotor development are two different capacities that are in fact anti-correlated.*** Actually it seems like walking and vocabulary growth are <a href="https://www.ncbi.nlm.nih.gov/pubmed/23750505">positively correlated</a>. This is a small case study, but it raises the question of how the different features of global developmental progress relate to one another.<br />
<br />
Intelligence is defined psychometrically via little <i>g</i>, the first factor in a factor analysis of many tests of cognitive ability. The empirical regularity is that <i>g</i> usually accounts for a substantial amount of variance across cognitive tasks – though that <a href="http://bactra.org/weblog/523.html">doesn't necessarily mean it's a unitary construct</a>. One analogous question you could ask is about development in early childhood. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3412562/">Early language hangs together astonishingly well</a>, but does early language relate to motor development, for example? There are some reviews that <a href="https://www.cambridge.org/core/journals/journal-of-child-language/article/developing-language-in-a-developing-body-the-relationship-between-motor-development-and-language-development/88F68BD4D8F3524F5FAD9387A29C0FE8">argue that it does</a>, but I'm not aware of a comprehensive analysis of children's trajectories through both that dissociates shared variation due to age.<br />
<br />
More generally, is there a little <i>d</i>, that – beyond age – explains global developmental advancement or delay? Statistically, there must be, but how much of the variance does it explain, and what capacities are most tightly related to one another?<br />
<br />
<b>3. What's variable and what's consistent?</b><br />
<br />
Finally, how universal are developmental trajectories, across children and across cultures? Imagine having some arbitrary estimate of locomotor development that assigned a number on some (hypothetically) reliable and valid scale. We could ask about the variance of this measure for a particular age group, but that would be largely meaningless without any units or comparison. But by comparing that variation to developmental variation, we can reason about how consistent individuals' development is. This variation is argued to be small for <a href="http://www.pnas.org/content/77/9/5572.short">stereoacuity of depth perception</a>, for example, while it is <a href="https://www.ncbi.nlm.nih.gov/pubmed/7845413">much larger for vocabulary</a>.<br />
<br />
Neither of these cases makes apples-to-apples comparisons, however. To be precise, units of variation would have to be defined in terms of the ratio of individual variance to developmental variance (as a function of either absolute age or percentage age). Using this approach, you could begin to ask: is variation across individuals larger for particular aspects of development than others? Or is variability itself standard across developmental phenomena?<br />
<br />
One further addition is the application of these ideas to cultural variance in skills. Once we have comparable units for a particular skill, we can ask about the relative variability across individuals vs. variability across cultures. What proportion of total variance is due to cultural variability vs. idiosyncrasies of individuals' development? This variance-partitioning approach is in some sense a statistical answer to old questions about universals and variation in language development (and in other domains).<br />
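Statistically, this is a variance-components question. Here's a sketch of how the analysis might look, assuming a hypothetical dataset with repeated measurements of some skill per child and children nested in cultures (all the names here are invented):
<pre>
# Partition variance in a skill measure into culture-level and child-level
# components, after removing developmental (age) variance.
library(lme4)
m <- lmer(score ~ age + (1 | culture) + (1 | child), data = d)
# (assumes child ids are unique across cultures)
vc <- as.data.frame(VarCorr(m))
# proportion of the remaining variance at each level
setNames(vc$vcov / sum(vc$vcov), vc$grp)
</pre>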
<br />
<b>Conclusions</b><br />
<br />
Bigger datasets shouldn't lead us to abandon our questions. Nor should they lead us to forget basic statistical facts – e.g., the problematic nature of correlational studies or inferences from convenience samples. But in pursuing the kinds of answers they can give, they sometimes lead back in interesting ways to prior theoretical developments; some of these feel almost forgotten in our current emphasis on small-scale, tightly controlled experiments.<br />
<br />
---<br />
* That's just on the professional side. Being a parent has changed me profoundly as a person – I hope for the better.<br />
** Harry Potter, of course, and Hiccup from How To Train Your Dragon.<br />
*** I don't know if she'd endorse this view more generally – she passed away before I was born, and this anecdote is related by my dad.<br />
<br />
<h4 style="text-align: left;">Where does logical language come from? The social bootstrapping hypothesis</h4>
<i>(Musings on the origins of logical language, inspired by work done in my lab by <a href="https://www.snhu.edu/student-experience/campus-experience/campus-academics/faculty/psychology">Ann Nordmeyer</a>, <a href="https://web.stanford.edu/~masoudj/">Masoud Jasbi</a>, and others.)</i><br />
<br />
For the last couple of years I've been part of a group of researchers who are interested in where logic comes from. While formal <a href="https://en.wikipedia.org/wiki/Boolean_algebra">boolean logic</a> is a human discovery*, all human languages appear to have methods for making logical statements. We can negate a statement ("No, I didn't eat your dessert while you were away"), quantify ("I ate all of the cookies"), and express conditionals ("if you finish early, you can join me outside."). While boolean logic doesn't offer a good description of these connectives, natural language still has some logical properties. How does this come about? Because I study word learning, I like to think about logic and logical language as a word learning problem. What is the initial meaning that "no" gets mapped to? What about "and", "or", or "if"?<br />
<br />
Perhaps logical connectives are learned just like other words. When we're talking about object words like "ball" or "dog," a common hypothesis is that children have object categories as the possible meanings of nouns. These object categories are given to the child by perception*** in some form or other. Then, kids hear their parents refer to individual objects ("look! a dog! [POINTS TO DOG]"). The point allows the determination of reference; the referent is identified as an instance of a category, and – modulo some <a href="http://psycnet.apa.org/record/2007-05396-002">generalization</a> and <a href="https://www.sciencedirect.com/science/article/pii/S0010027715300391">statistical inference</a> – the word is learned, more or less.****<br />
<br />
So how does this process work for logical language? There are plenty of linguistic complexities for the learner to deal with: Most logical words simply don't make sense on their own. You can't just turn to your friend and say "or" (at least not without a lot of extra context). So any inference that a child makes about the meaning of the word will have to involve disentangling that from the meaning of the sentence as a whole. But beyond that, what are the potential targets for the meaning of these words? There's nothing you can point to out in the world that is an "if," an "and," or even a "no."<br />
<br />
<a name='more'></a><br />
For many folks this boils down to a classic <a href="https://en.wikipedia.org/wiki/Poverty_of_the_stimulus">argument from the poverty of the stimulus</a>: there must be some innate logical concepts that underlie the ability to acquire logical language. Let's call this idea "logical nativism." These innate logical concepts need not look like boolean primitives, but they should at least form some kind of basis for inducing a more complex semantics and making lexical mappings. To the extent that you can find <a href="http://science.sciencemag.org/content/359/6381/1263">evidence for logical reasoning in infants before they can talk</a>*****, this would constitute evidence for the logical nativist perspective.<br />
<br />
Others would deny this kind of innate structure. There are lots of reasons to be skeptical of strong nativist claims, whether because you think logic isn't the kind of thing that brains represent innately or because you believe such structures could be learned from input (relatedly, here's my take on "<a href="http://babieslearninglanguage.blogspot.com/2016/07/minimal-nativism.html">minimal nativism</a>."). But if you make this sort of claim, then you are responsible for characterizing how children come to learn these words and use them correctly. Even if you skirt around <a href="https://plato.stanford.edu/entries/language-thought/#NatLOT">Fodor's problem</a> by assuming that children have access to a space of concepts expressive enough to discover these logical operators, you still might want to ask how they do so.<br />
<br />
One possible learning theory is that children build the logical operators directly (<a href="https://colala.bcs.rochester.edu/papers/piantadosi2016representation.pdf">perhaps through some kind of probabilistic induction</a>). But I want to sketch the beginnings of a different acquisition theory here. On this theory – let's call it the <i>social bootstrapping</i> hypothesis – children begin by mapping logical words to speech acts with specifically social functions like rejection, offer, or threat. They then gradually generalize the broader logical functions of these words by noticing similarities between social uses of the words and other more abstract uses.<br />
<br />
This post is a way of writing down my own speculations, and is not fully worked out. Probably someone has said something like this before – perhaps Liz Bates or Lois Bloom – I'm not sure, and that's why this is a blog post rather than a paper. That said, here are a couple of examples.<br />
<br />
<b>Negation</b><br />
<b><br /></b>
"No" is often <a href="http://langcog.stanford.edu/papers/SYF-cogsci2015.pdf">one of children's very first words</a>. (In some unpublished data, we even saw that this was especially true for second children – presumably they were saying to their sibling "don't DO that!") Consistent with this idea, early negation has been glossed as having the meaning "rejection" – something like "I don't want that" (lit review and up to date coding in <a href="http://mindmodeling.org/cogsci2018/papers/0416/0416.pdf">this recent paper</a> by Ann Nordmeyer and me). Some other early negations are used for nonexistence ("no cookies") which is a bit different, both syntactically – functioning as a determiner – and semantically. But it's been claimed that you see less early use of negation as what have been called "denials," where a proposition is being negated and the intended meaning is "it is not true that X."<br />
<br />
Ann's study suggests that it's true you don't see these early propositional denials as often, but she did find denials from some children – often during book reading, where parents would ask polar questions like "is that a dog [pointing to a bird]?" and children would say "no!" It seemed like, while these utterances were technically logical denials, they were more straightforwardly denying a name rather than a proposition. Further, they seemed like they made sense in those contexts and were being uttered by pretty young children.<br />
<br />
More broadly: I wonder if the relevant target for initial mapping of "no" is essentially the <i>social act</i> of rejection – the head shake when a new food is offered, meaning "don't put that in my mouth." Then once this initial mapping is made, from a very salient and present social impulse (parents rejecting kids' behavior <i>and</i> kids rejecting parents' behavior), this meaning can be generalized to other cases. In particular, the trajectory from "no! don't do that" to "no! don't (you) say that" to "no! don't (you) think that" to "no! not true!" doesn't feel too implausible to me. This would especially be an easy conflation to make under a pre-theory of mind, naïve-realist viewpoint in which what I think is what you think is what is true of the world. It would also explain why the early denials that Ann saw were possible – they're very transparently instances of "rejection of a name" even though they look like "denial of a proposition" on the earlier analysis.<br />
<br />
One much-discussed example of early negation is the utterance "no mummy do it" (<a href="https://www.researchgate.net/profile/Ken_Drozd/publication/14412544_Child_English_pre-sentential_negation_as_metalinguistic_exclamatory_sentence_negation/links/574ca30a08ae061b3301d1d8/Child-English-pre-sentential-negation-as-metalinguistic-exclamatory-sentence-negation.pdf">see Drozd, 1995</a>), which means something like "I don't want mummy to do it." Drozd then presents the utterance "no Nathaniel a king," (Nathaniel is the kid here, who's speaking) which alternatively means something like "I don't want you to say that Nathaniel's [I am] a king" or "Nathaniel's not a king." You see how there is a pretty small step from <i>rejecting</i> <i>an action</i> to <i>rejecting a proposition</i>.<br />
<br />
Related to this bootstrapping account is the <a href="https://web.stanford.edu/~cgpotts/papers/potts-salt20-negation.pdf">persistent negativity of negation</a> – in corpora, negative terms carry negative valence. To be fair, the account given in that paper notes that these effects may be pragmatic in nature. But the paper did lead me to a hypothesis related to my social bootstrapping idea, namely that negation is "learned early on with the association of 'unpleasant feelings'" (<a href="https://www.scribd.com/document/177136824/1-A-Natural-History-of-Negation-Laurence-R-Horn-pdf">from Bertrand Russell</a> originally). I think that's probably right, although I'm arguing that the negativity of negation is not a direct <i>affective</i> mapping; it's instead a mapping to the <i>social negativity</i> of rejection.<br />
<br />
<b>Disjunction</b><br />
<div>
<br /></div>
<div>
In contrast to "no," "or" is a bit of a mess in acquisition. <a href="http://langcog.stanford.edu/papers_new/jasbi-2018-cogsci.pdf">Children <i>say</i> "or" pretty early</a>, but who knows what they mean? One big issue is that they hear disjunctions that seem to mean logical OR ("[waiter:] you can order dinner or drinks" - true if one is true, the other is true, or both), but they also hear some that appear to be XOR ("[waiter:] you can have dessert or the check" - true if one is true, or the other is true, but NOT both). What could be the target for mapping for this word?<br />
<br />
Well, one part of the puzzle comes from <a href="http://langcog.stanford.edu/papers_new/jasbi-2018-cogsci.pdf">Masoud Jasbi's paper</a>, which is that these different uses have different prosody: the second one has a more distinctive rise/fall/rise pattern than the first. (Also, typically the disjuncts are logically inconsistent in XOR cases.) But there's a more general issue: how do you even think of OR and XOR as possible meanings?<br />
<br />
Again, my suggestion is that the initial target is a social meaning: <i>offer</i>. Under this story, "X or Y" as a construction initially means "offer." This probably comes up especially in the context of food offers. The exclusivity of this offer (can you take both, or only one?) is then a secondary concern that can be worked out from context. But again, you can see the progression from "would you like carrots or string cheese" -> offer(X,Y) to "is john home or at school" -> offer(john at home, john at school). The key step is, again, the move from <i>offering an action</i> to <i>offering a proposition</i>.<br />
<i><br /></i>
Furthermore, as Masoud's dissertation uncovers, there are a host of other meanings for "or" that don't fit well at all with the basic boolean OR vs. XOR idea. For example, "I'm a wine-lover, or oenophile" (definitional disjunction) doesn't fit. And we constantly correct ourselves using disjunction, e.g. "I think it's in the closet. [observes that's not the case] Or under the piano." These broader meanings feel like they might be different classes of social meanings that map onto the lexical item in specific pragmatic and prosodic frames.<br />
<i><br /></i>
<b>Implication (and Conclusion)</b></div>
<br />
Before I wrap up I just want to mention "if," where I think there is a possible story. <i>Threat </i>seems like a clear candidate as a target for mapping. "If you dump that out, you won't get any more" feels to me like a prototypical example of a child-directed utterance where the causal interpretation could eventually get generalized into <a href="https://plato.stanford.edu/entries/logic-conditionals/">whatever your semantics is for conditionals</a>. Note here that in this case again there's a reversal. The Gricean pragmatics that is assumed on conventional accounts to be <i>built out of </i>a logical semantics actually becomes on this account the place where acquisition starts! So rather than causality being an implicature from the conditional, it's actually the starting point for mapping and generalization. I don't have data on this, but I'd be interested in investigating...<br />
<br />
Hopefully, in this post, I've planted the idea that social meanings could be the roots of logical word learning. There are of course many obstacles to realizing this kind of account – first of all, specifying the relationship between the different semantic entities that can be acted on (from objects to actions to propositions). Further, it's not as clear how this would work for "and" or quantifiers like "some." But as I observe children's interactions and think about the way their pragmatic competence supports word learning, this is the sort of constructivist account that feels like the most plausible response to logical nativism.<br />
<br />
---<br />
* Or invention. I won't get all philosophy of math on this right now.<br />
** Of course, the logic of natural language is contaminated constantly with pragmatic inference – <a href="http://langcog.stanford.edu/papers_new/goodman-2016-tics.pdf">that's what I spend most of my time studying</a>.<br />
*** We'll ignore here both <a href="https://www.sciencedirect.com/science/article/pii/S001002858571016X">reciprocal effects of language on category formation</a> <i>and</i> <a href="https://www.nature.com/articles/nn1199_1019">the difficulty of object recognition</a>.<br />
**** By "more or less" here I mean this is actually a major topic of study for a whole subfield. So there is a lot to learn. But at a high level <a href="https://web.stanford.edu/~masoudj/">this kind of social learning view is not terrible</a>.<br />
***** I have some criticisms of the inferences from this paper, but the experimental designs are extremely clever.<br />
<br />
<i>(Thanks very much to Chris Potts for helpful comments). </i>Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com3tag:blogger.com,1999:blog-4297242917419089261.post-77964637038020544372018-06-18T10:56:00.001-07:002018-06-18T10:57:08.126-07:00What does it mean to get a degree in psychology these days? <i>(I was asked to give a speech yesterday at Stanford's Psychology commencement ceremony. Here is the text). </i><br />
<i><br /></i>
1. Chair, Colleagues, graduates of the class of 2018 – undergraduates and graduate students – family members, and friends. It’s a pleasure to be here today with all of you. Along with honoring our graduates, we especially honor all the wonderful speakers today for their accomplishments – MH for his excellence in research and teaching, Angela for her deep engagement with the department community. You could be forgiven for thinking that there was some special achievement that brought me here as well. In fact, by tradition, faculty take turns addressing the graduating class and it is my turn this year. It’s a real pleasure to have one last chance to address you.<br />
<br />
Two weeks ago, my daughter Madeline graduated from preschool. There was cake; photos were taken. They broke a piñata. It was a big deal! Several of her friends will be going to different schools, some moving away to other states or even other countries. This is one of the biggest changes she’s ever experienced. I’m already worried about what happens next. Parents, I can only imagine what you are going through today – but at least you know that your kids made it through the first day of kindergarten.<br />
<br />
Graduates - Your graduation from Stanford today is a really big deal. You also get to have cake and photos. If you’re very lucky, some special person has even bought you a piñata. But more importantly, just like for Madeline this is a time of transitions. You may be moving somewhere new. Even if you are staying here, friends will be further away than the next dorm or the next office. So do not hesitate to take a little extra time today to celebrate with the people you love and who love you.<br />
<br />
Congratulations.<br />
<br />
2. I want to take a little time now to think about what it means to get a degree in psychology from Stanford.<br />
<br />
When you sit next to someone on an airplane and tell them you are studying psychology, perhaps they ask you if you are reading their mind. Perhaps they wonder if you are studying Freudian analysis and have thoughts about their unconscious, or their relationship with their mother. Or maybe they are more up to date and wonder if you study psychological disorders as they manifest themselves in the clinic. But the truth is, knowing what you’ve done in your degrees here at Stanford, you probably haven’t done too much Freud. Or too much mind-reading. And although you may be interested in clinical work (and this is laudable), that’s not the core of what we teach here.<br />
<br />
Gaining a degree in psychology also means that you have gone to many classes in psychology and learned about many studies – from social influence to stereotype threat, from mental rotation to marshmallow tests. Although this body of knowledge is a lovely thing to have come into contact with (and I hope that you continue to deepen your knowledge), knowing this content is also not the core of what it means to receive your degree.<br />
<br />
What you have learned instead are tools – a specific kind of tool, namely tools for thought. These tools can be used to approach problems and construct solutions. This is what it means for psychology to be an academic discipline: a discipline denotes a particular mental toolbox. The university is the intellectual equivalent of a construction firm – different departments have the tools to solve different sorts of problems.<br />
<br />
3. Like nearly all ideas, “cognitive tools” seem obvious – after you are used to them. Let’s take one example, a foundational cognitive tool that we use every single day: numbers. Because we are so numerate, a lot of people have the idea that numbers are easy and straightforward. But they aren’t.<br />
<br />
Take the preschoolers in Madeline’s old classroom. Nearly all of them can count, at least to ten and maybe higher. But if you probe a bit more deeply, it all falls apart. If, at snack time, you ask someone to give you exactly four cheerios, she’s liable to hand you seven, or a whole handful. Even when a child knows that “one” means exactly 1, it takes quite a few months for them to figure out that “two” means exactly 2, and more months for 3. When they finally figure out how the whole system works, it enables so many new things! Madeline owes all of her dessert-negotiation prowess to her abilities with numbers. Seven gummi bears? No. How about six? This idea of exact comparison is a skill – even though it makes for tiresome after-dinner conversation.<br />
<br />
Numbers are an invented, culturally-transmitted tool. In graduate school I worked with an Amazonian indigenous group, the Pirahã, who have no words for numbers. They are bright, sophisticated people who love a good practical joke. Many Pirahã can shoot a fish with an arrow while standing in a canoe. Yet because their language does not have these particular words in it – words like “seven” – and because they do not go through that laborious period of practice that Madeline and other kids learning languages like English do – they can’t remember that it’s exactly seven gummi bears. To them, six or eight seems like the same amount. They simply don’t have the tool.<br />
<br />
4. So what are the tools of the psychologist?<br />
<br />
There’s one tool that qualifies as the hammer of psychology – the single tool you can use to frame an entire house. That’s the experiment. The fundamental insight of all of modern psychology is that the puzzles of the human mind can be understood as objects of scientific study if we can design appropriately controlled experiments. As complicated and unpredictable as people are (especially when they are integrated into complex cultural systems), we can still learn about their inner workings via experiments.<br />
<br />
This insight has spread far outside of psychology and far outside of the academy. Nowadays, Facebook runs a hundred experiments a day on you. Governments and political campaigns, startups and not-for-profits are all constantly experimenting to try to understand how to achieve their goals. There is a good chance that in the next few years of your professional life you will face a complicated human problem with an unknown solution. The psychologist’s approach will serve you well: formulate a hypothesis about how you should manipulate the world; then assess whether the manipulation has changed your measurement of interest. This strategy is shockingly effective.<br />
<br />
But the serious carpenter has other, more specialized tools in the toolkit – the plane, awl, rasp, drawknife, jigsaw, bevel. Let me mention two more.<br />
<br />
The first is the idea that our knowledge is not just a set of facts, but is organized into theories that help us understand the world. We call these theories intuitive theories – they are the explanatory frameworks that people carry with them to understand why things happen. What follows from this idea is that when you want to change people’s behavior, you can’t just tell them to change or tell them different facts. You need to change their theory. When I want Madeline to eat her vegetables, it turns out just telling her to “eat broccoli” doesn’t work very well – even if she does eat the broccoli, she won’t know what else to eat or why to eat it. And of course the well-known idea about fostering a growth mindset is precisely this kind of implicit theory: it’s a theory of whether ability is fixed or whether it can be improved with hard work.<br />
<br />
The second idea I want to share is that our judgment is systematically biased. It’s biased by our own beliefs. Our minds are wonderful, efficient systems that deal with uncertainty – we piece together a sentence even in a noisy restaurant using our expectations about what that person might be trying to say to us. In most cases, this is an amazing feature of our own cognition, letting us operate flexibly using limited data. But this reliance on our own beliefs also has negative consequences: it leads us to stereotype, and to engage in confirmation bias, looking for evidence that further supports our own beliefs. Understanding these sources of bias can help us avoid falling into this trap. A good grounding in psychology, in other words, helps us be more aware of our own limitations.<br />
<br />
I’d love to tell you about more ideas. Every woodworker loves to show off their workbench. And the wonderful thing about tools is that when you use them together you can create new tools, in the same way the carpenter can first build a jig that makes a difficult cut easier. I could go on, but hopefully I’ve piqued your curiosity – and you have lots more to do today.<br />
<br />
5. So. Make sure that you celebrate! Eat some cake, smash a piñata, and most of all, say your "thank you"s to the people who have supported you during your time here at Stanford. I speak for all of them when I say that we are very proud of you and cannot wait to see what you accomplish.<br />
<br />
As this weekend passes and you head off for other things, it is all but certain that you will find yourself in new situations facing challenges that you have not considered before. (Life would not be fun without them!) But I am confident that your tools will be sufficient for the job. Keep them sharp and they will serve you well.<br />
<br />
<br />
<div>
<br /></div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-46436308880553538472018-05-05T15:52:00.000-07:002018-05-05T15:52:49.533-07:00nosub: a command line tool for pushing web experiments to Amazon Mechanical Turk<div>
<i>(This post is co-written with <a href="http://zx.gd/academic/">Long Ouyang</a>, a former graduate student in our department, who is the developer of nosub, and <a href="http://stanford.edu/~bohn/">Manuel Bohn</a>, a postdoc in my lab who has created a minimal working example). </i></div>
<div>
<br /></div>
Although my lab focuses primarily on child development, our typical workflow is to refine experimental paradigms via working with adults. Because we treat adults as a convenience population, Amazon Mechanical Turk (AMT) is a critical part of this workflow. AMT allows us to pay an hourly wage to participants all over the US who complete short experimental tasks. (<a href="http://babieslearninglanguage.blogspot.com/2013/10/randomization-on-mechanical-turk_10.html">Some background</a> from an old post).<br />
<div>
<br /></div>
<div>
Our typical workflow for AMT tasks is to create custom websites that guide participants through a series of linguistic stimuli of one sort or another. For simple questionnaires we often use Qualtrics, a commercial survey product, but most tasks that require more customization are easy to set up as free-standing javascript/HTML sites. These sites then need to be pushed to AMT as "external HITs" (Human Intelligence Tasks) so that workers can find them, participate, and be compensated. </div>
<div>
<br /></div>
<div>
<a href="https://github.com/longouyang/nosub">nosub</a> is a simple tool for accomplishing this process, building on earlier tools used by my lab.* The idea is simple: you customize your HIT settings in a configuration file and type</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">nosub upload</span></div>
<div>
<br /></div>
<div>
to upload your experiment to AMT. Then you can type</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">nosub download</span></div>
<div>
<br /></div>
<div>
to fetch results. Two nice features of nosub from a psychologist's perspective are: 1. worker IDs are anonymized by default so you don't need to worry about privacy issues (but they are deterministically hashed so you can still flag repeat workers), and 2. nosub can post HITs in batches so that you don't get charged Amazon's surcharge for tasks with more than 9 hits. </div>
<div>
<br /></div>
<div>
All you need to get started is to install <a href="https://nodejs.org/">Node.js</a>; installation instructions for nosub are available in the <a href="https://github.com/longouyang/nosub">project repository</a>.<br />
<br />
Once you've run nosub, you can download your data in JSON format, which can easily be parsed into R. We've put together a <a href="https://github.com/manuelbohn/nosub_example">minimal working example</a> of an experiment that can be run using nosub and a <a href="https://github.com/manuelbohn/nosub_example/blob/master/nosub_example.Rmd">data analysis script in R</a> that reads in the data. </div>
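<div>
<br /></div>
<div>
As a minimal sketch of that last step, here is how the downloaded JSON can be read into R with the <span style="font-family: "courier new" , "courier" , monospace;">jsonlite</span> package. The file name is illustrative – see the working example linked above for the real analysis script:</div>
<div>
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">library(jsonlite)

# flatten = TRUE unnests the JSON structure into ordinary columns
results <- fromJSON("results.json", flatten = TRUE)
head(results)
</span></pre>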
<div>
<br /></div>
<div>
---</div>
<div>
* <a href="https://psiturk.org/">psiTurk</a> is another framework that provides a way of serving and tracking HITs. psiTurk is great and we have used it for heavier-weight applications where we need to track participants, but can be tricky to debug and is not always compatible with some of our light-weight web experiments.</div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-67070199225069663202018-02-26T23:18:00.002-08:002018-03-01T09:03:25.301-08:00Mixed effects models: Is it time to go Bayesian by default?<div>
<i>(tl;dr: Bayesian mixed effects modeling using <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://github.com/paul-buerkner/brms">brms</a></span> is really nifty.)</i></div>
<div>
<b><br /></b></div>
<div>
<b>Introduction: Teaching Statistical Inference?</b></div>
<div>
<br /></div>
How do you reason about the relationship between your data and your hypotheses? <a href="https://en.wikipedia.org/wiki/Bayesian_inference">Bayesian inference</a> provides a way to make normative inferences under uncertainty. As scientists – or even as rational agents more generally – we are interested in knowing the probability of some hypothesis given the data we observe. As a cognitive scientist I've long been interested in using Bayesian models to describe cognition, and that's what I did much of my graduate training in. These are custom models, sometimes fairly difficult to write down, and they are an area of active research. That's not what I'm talking about in this blogpost. Instead, I want to write about the basic practice of statistics in experimental data analysis.<br />
<div>
<br /></div>
<div>
Mostly when psychologists do and teach "stats," they're talking about frequentist statistical tests. Frequentist statistics are the standard kind people in psych have been using for the last 50+ years: t-tests, ANOVAs, regression models, etc. Anything that produces a p-value. P-values represent the probability of the data (or any more extreme) under the null hypothesis (typically "no difference between groups" or something like that). The problem is that <a href="http://psycnet.apa.org/fulltext/1995-12080-001.html">this is not what we really want to know as scientists</a>. We want the opposite: the probability of the hypothesis given the data, which is what Bayesian statistics allow you to compute. You can also compute the relative evidence for one hypothesis over another (the Bayes Factor). </div>
<div>
<br /></div>
<div>
<div>
Now, the best way to set psychology twitter on fire is to start a holy war about who's actually right about statistical practice, Bayesians or frequentists. There are lots of arguments here, and I see some merit on both sides. That said, there is lots of evidence that <a href="http://repository.cmu.edu/psychology/968/">much of our implicit statistical reasoning is Bayesian</a>. So I tend towards the Bayesian side on the balance <ducks head>. But despite this bias, I've avoided teaching Bayesian stats in my classes. I've felt like, even with their philosophical attractiveness, actually computing Bayesian stats had too many very severe challenges for students. For example, in previous years you might run into major difficulties inferring the parameters of a model that would be trivial under a frequentist approach. I just couldn't bring myself to teach a student a philosophical perspective that – while coherent – wouldn't provide them with an easy toolkit to make sense of their data. </div>
</div>
<div>
<br /></div>
<div>
The situation has changed in recent years, however. In particular, the <a href="http://bayesfactorpcl.r-forge.r-project.org/">BayesFactor R package</a> by Morey and colleagues makes it extremely simple to do basic inferential tasks using Bayesian statistics. This is a huge contribution! Together with <a href="https://jasp-stats.org/">JASP</a>, these tools make the Bayes Factor approach to hypothesis testing much more widely accessible. I'm really impressed by how well these tools work. </div>
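<div>
<br /></div>
<div>
To give a flavor of how simple this is, here's a sketch of a Bayesian two-sample comparison with the BayesFactor package, using simulated data and the package's default priors:</div>
<div>
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">library(BayesFactor)

# simulated data for two groups with a modest true difference
x <- rnorm(30, mean = 0.5)
y <- rnorm(30, mean = 0)

# Bayes Factor comparing the alternative (a group difference) to the null
ttestBF(x = x, y = y)
</span></pre>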
<div>
<br /></div>
<div>
All that said, my general approach to statistical inference tends to rely less on inference about a particular hypothesis and more on parameter estimation – following the spirit of folks like <a href="http://www.stat.columbia.edu/~gelman/arm/">Gelman & Hill (2007)</a> and <a href="http://journals.sagepub.com/doi/abs/10.1177/0956797613504966">Cumming (2014)</a>. The basic idea is to fit a model whose parameters describe substantive hypotheses about the generating sources of the dataset, and then to interpret these parameters based on their magnitude and the precision of the estimate. (If this sounds vague, don't worry – the last section of the post is an example). The key tool for this kind of estimation is not tests like the t-test or the chi-squared. Instead, it's typically some variant of regression, usually mixed effects models. </div>
<div>
<br /></div>
<div>
<div>
<b>Mixed-Effects Models</b></div>
</div>
<div>
<b><br /></b></div>
<div>
Especially in psycholinguistics where our experiments typically show many people many different stimuli, mixed effects models have rapidly become the <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2613284/">de facto standard for data analysis</a>. These models (also known as hierarchical linear models) let you estimate sources of random variation ("random effects") in the data across various grouping factors. For example, in a reaction time experiment some participants will be faster or slower (and so all data from those particular individuals will tend to be faster or slower in a correlated way). Similarly, some stimulus items will be faster or slower and so all the data from these groupings will vary. The <span style="font-family: "courier new" , "courier" , monospace;">lme4</span> package in R was a game-changer for using these models (in a frequentist paradigm) in that it allowed researchers to estimate such models for a full dataset with just a single command. For the past 8-10 years, nearly every paper I've published has had a linear or generalized linear mixed effects model in it. </div>
<div>
<br /></div>
<div>
Despite the simplicity of fitting them, the biggest problem with mixed effects models (from an educational point of view, especially) has been figuring out how to write consistent model specifications for random effects. Often there are many factors that vary randomly (subjects, items, etc.) and many other factors that are nested within those (e.g., each subject might respond differently to each condition). Thus, it is not trivial to figure out what model to fit, even if fitting the model is just a matter of writing a command. Even in a reaction-time experiment with just items and subjects as random variables, and one condition manipulation, you can write</div>
<div>
<br /></div>
<div style="text-align: center;">
<span style="font-family: inherit;">(1)</span><span style="font-family: "courier new" , "courier" , monospace;"> rt ~ condition + (1 | subject) + (1 | item)</span></div>
<div style="text-align: center;">
<br /></div>
<div>
for just random intercepts by subject and by item, or you can nest condition (fitting a random slope) for one or both:</div>
<div>
<br /></div>
<div style="text-align: center;">
<span style="font-family: inherit;">(2)</span><span style="font-family: "courier new" , "courier" , monospace;"> rt ~ condition + (condition | subject) + (condition | item)</span></div>
<div style="text-align: center;">
<br /></div>
<div>
and you can additionally fiddle with covariance between random effects for even more degrees of freedom!</div>
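<div>
<br /></div>
<div>
For concreteness, here is what fitting (2) looks like – a sketch, assuming a data frame <span style="font-family: "courier new" , "courier" , monospace;">d</span> with rt, condition, subject, and item columns – along with lme4's double-bar syntax for dropping those covariance parameters:</div>
<div>
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">library(lme4)

# model (2): random slopes and intercepts by subject and by item
m <- lmer(rt ~ condition + (condition | subject) + (condition | item),
          data = d)

# the double-bar syntax fits the same random effects but without
# the random-effect covariance parameters
m_nocov <- lmer(rt ~ condition + (condition || subject) + (condition || item),
                data = d)
</span></pre>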
<div>
<br /></div>
<div>
Luckily, a number of years ago, a powerful and clear simulation paper by <a href="https://www.sciencedirect.com/science/article/pii/S0749596X12001180">Barr et al. (2013)</a> came out. They argued that there was a simple solution to the specification issue: use the "maximal" random effects structure supported by the design of the experiment. This meant adding any random slopes that were actually supported by your design (e.g., if condition was a within-subject variable, you could fit condition by subject slopes). While this suggestion was <a href="https://www.sciencedirect.com/science/article/pii/S0749596X16302467">quite controversial</a>,* Barr et al.'s simulations were persuasive evidence that this suggestion led to conservative inferences. In addition, having a simple guideline to follow eliminated a lot of the worry about analytic flexibility in random effects structure. If you were "keeping it maximal" that meant that you weren't intentionally – or even inadvertently – messing with your model specification to get a particular result. </div>
<div>
<br /></div>
<div>
Unfortunately, a new problem reared its head in <span style="font-family: "courier new" , "courier" , monospace;">lme4</span>: convergence. With very high frequency, when you specify the maximal model, the approximate inference algorithms that search for the maximum likelihood solution for the model will simply not find a satisfactory solution. This outcome can happen even in cases where you have quite a lot of data – in part because the number of parameters being fit is extremely high. In the case above, not counting covariance parameters, we are fitting a slope and an intercept across participants, plus a slope and intercept for <i>every participant</i> and for <i>every item</i>. </div>
<div>
<br /></div>
<div>
To deal with this, people have developed various strategies. The first is to do some black magic to try and change the optimization parameters (e.g., following <a href="https://rstudio-pubs-static.s3.amazonaws.com/33653_57fc7b8e5d484c909b615d8633c01d51.html">these helpful tips</a>). Then you start to prune random effects away until your model is "less maximal" and you get convergence. But these practices mean you're back in flexible-model-adjustment land, and vulnerable to all kinds of charges of post-hoc model tinkering to get the result you want. We've had to specify lab best-practices about the <a href="https://osf.io/zqzsu/wiki/Standard%20Analytic%20Procedures/">order for pruning random effects</a> – kind of a guide to "tinkering until it works," which seems suboptimal. In sum, the models are great, but the methods for fitting them don't seem to work that well. </div>
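<div>
<br /></div>
<div>
For the record, the most common bit of black magic is a control argument on the model – switching the optimizer and raising its evaluation cap before resorting to pruning. A sketch, reusing the assumed data frame from above:</div>
<div>
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;"># try a different optimizer and a higher evaluation limit before
# pruning any random effects
m <- lmer(rt ~ condition + (condition | subject) + (condition | item),
          data = d,
          control = lmerControl(optimizer = "bobyqa",
                                optCtrl = list(maxfun = 2e5)))
</span></pre>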
<div>
<br /></div>
<div>
Enter Bayesian methods. For several years, it's been possible to fit Bayesian regression models using <a href="http://mc-stan.org/">Stan</a>, a powerful probabilistic programming language that interfaces with R. Stan, building on BUGS before it, has put Bayesian regression within reach for someone who knows how to write these models (and interpret the outputs). But in practice, when you could fit an <span style="font-family: "courier new" , "courier" , monospace;">lmer</span> in one line of code and five seconds, it seemed like a bit of a trial to hew the model by hand out of solid Stan code (which looks a little like <span style="font-family: "courier new" , "courier" , monospace;">C</span>: you have to declare your variable types, etc.). We have done it <a href="https://github.com/jasbi/cogsci2017">sometimes</a>, but typically only for models that you couldn't fit with <span style="font-family: "courier new" , "courier" , monospace;">lme4</span><span style="font-family: inherit;"> (e.g., an ordered logit model). So I still don't teach this set of methods, or advise that students use them by default. </span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<span style="font-family: inherit;"><b>brms?!? A worked example</b></span></div>
<div>
<span style="font-family: inherit;"><b><br /></b></span></div>
<div>
In the last couple of years, the package <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://github.com/paul-buerkner/brms">brms</a></span> has been in development. <span style="font-family: "courier new" , "courier" , monospace;">brms</span> is essentially a front-end to Stan, so that you can write R formulas just like with <span style="font-family: "courier new" , "courier" , monospace;">lme4</span> but fit them with Bayesian inference.** This is a game-changer: all of a sudden we can use the same syntax but fit the model we want to fit! Sure, it takes 2-3 minutes instead of 5 seconds, but the output is clear and interpretable, and we don't have all the specification issues described above. Let me demonstrate. </div>
<div>
<br /></div>
<div>
The dataset I'm working on is an unpublished set of data on kids' pragmatic inference abilities. It's similar to many that I work with. We show children of varying ages a set of images and ask them to choose the one that matches some description, then record whether they do so correctly. Typically some trials are control trials where all the child has to do is recognize that the image matches the word, while others are inference trials where they have to reason a little bit about the speaker's intentions to get the right answer. Here are the data from this particular experiment:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv7030Uw3qdzf94sFoiJAMa-2HMJCt6Epd4Q_NBPW0MugHKkessjY9KYJxag4YDn9XxHxdNwvEWqsMDsnCl0aH4TMgJVw-zC34i5pWelAp62X5smSIx14ghqO09aWsJNppvF2xJW5I9PQC/s1600/Rplot02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="362" data-original-width="555" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv7030Uw3qdzf94sFoiJAMa-2HMJCt6Epd4Q_NBPW0MugHKkessjY9KYJxag4YDn9XxHxdNwvEWqsMDsnCl0aH4TMgJVw-zC34i5pWelAp62X5smSIx14ghqO09aWsJNppvF2xJW5I9PQC/s400/Rplot02.png" width="400" /></a></div>
<div>
<br /></div>
<div>
I'm interested in quantifying the relationship between participant age and the probability of success in pragmatic inference trials (vs. control trials, for example). My model specification is:</div>
<div>
<br /></div>
<div>
<div style="text-align: center;">
<span style="font-family: inherit;">(3)</span><span style="font-family: "courier new" , "courier" , monospace;"> correct ~ condition * age + (condition | subject) + (condition | stimulus)</span></div>
</div>
<div>
<br /></div>
<div>
So I first fit this with <span style="font-family: "courier new" , "courier" , monospace;">lme4</span>. Predictably, the full desired model doesn't converge, but here are the fixed effect coefficients: </div>
<div>
<br /></div>
<div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">              beta stderr     z    p
intercept     0.50   0.19  2.65 0.01
condition     2.13   0.80  2.68 0.01
age           0.41   0.18  2.35 0.02
condition:age -0.22   0.36 -0.61 0.54
</span></pre>
</div>
<div>
Now let's prune the random effects until the convergence warning goes away. In the simplified version of the dataset that I'm using here I can keep stimulus and subject intercepts and still get convergence when there are no random slopes. But in the larger dataset, the model won't converge unless I do <i>just</i> the random intercept by subject:</div>
<div>
<br /></div>
<div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">              beta stderr     z    p
intercept     0.50   0.21  2.37 0.02
condition     1.76   0.33  5.35 0.00
age           0.41   0.18  2.34 0.02
condition:age -0.25   0.33 -0.77 0.44
</span></pre>
</div>
<div>
<br /></div>
<div>
Coefficient values are decently different (but the p-values are not changed dramatically in this example, to be fair). More importantly, a number of fairly trivial things affect whether the model converges. For example, I can get one random slope in if I set the other level of the condition variable to be the intercept, but it doesn't converge with either in this parameterization. And in the full dataset, the model wouldn't converge at all if I didn't center age. And then of course I haven't tweaked the optimizer or messed with the convergence settings for any of these variants. All of this means that there are a <i>lot</i> of decisions about these models that I don't have a principled way to make – and critically, they need to be made conditioned on the data, because I won't be able to tell whether a model will converge <i>a priori</i>!</div>
<div>
<br /></div>
<div>
So now I switched to the Bayesian version using <span style="font-family: "courier new" , "courier" , monospace;">brms</span>, just writing <span style="font-family: "courier new" , "courier" , monospace;">brm()</span> with the model specification I wanted (3). I had to do a few tweaks: upping the number of iterations (suggested by the warning messages from the output) and changing to a Bernoulli model rather than binomial (for efficiency, again suggested by the error message), but this was very straightforward otherwise. For simplicity I've adopted all the default prior choices, but I could have gone more informative.</div>
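<div>
<br /></div>
<div>
Concretely, the call was essentially the following – a sketch, with <span style="font-family: "courier new" , "courier" , monospace;">d</span> as the assumed data frame and an illustrative iteration count:</div>
<div>
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">library(brms)

# model (3) with a Bernoulli response and extra iterations,
# as the warning messages suggested
fit <- brm(correct ~ condition * age +
             (condition | subject) + (condition | stimulus),
           data = d, family = bernoulli(), iter = 4000)
summary(fit)
</span></pre>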
<div>
<br /></div>
<div>
Here's the summary output for the fixed effects:</div>
<div>
<br /></div>
<div>
<pre data-ordinal="1" style="line-height: 1.45; text-size-adjust: auto; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;"> estimate error l-95% CI u-95% CI
intercept 0.54 0.48 -0.50 1.69
condition 2.78 1.43 0.21 6.19
age 0.45 0.20 0.08 0.85
condition:age -0.14 0.45 -0.98 0.84
</span></pre>
</div>
<div>
<br /></div>
<div>
From this call, we get back coefficient estimates that are somewhat similar to the other models, along with 95% credible interval bounds. Notably, the condition effect is larger (probably corresponding to being able to estimate a more extreme value for the logit based on sparse data), and then the interaction term is smaller but has higher error. Overall, coefficients look more like the first non-convergent maximal model than the second converging one. </div>
<div>
<br /></div>
<div>
The big deal about this model is not that what comes out the other end of the procedure is radically different. It's that it's <i>not</i> different. I got to fit the model I wanted, with a maximal random effects structure, and the process was almost trivially easy. In addition, and as a bonus, the CIs that get spit out are actually credible intervals that we can reason about in a sensible way (as opposed to frequentist confidence intervals, <a href="https://link.springer.com/article/10.3758/s13423-015-0947-8">which are quite confusing</a> if you think about them deeply enough). </div>
<div>
<br /></div>
<div>
<b>Conclusion</b></div>
<div>
<b><br /></b></div>
<div>
Bayesian inference is a powerful and natural way of fitting statistical models to data. The trouble is that, up until recently, you could easily find yourself in a situation where there was a dead-obvious frequentist solution but off-the-shelf Bayesian tools wouldn't work or would generate substantial complexity. That's no longer the case. The existence of tools like BayesFactor and brms means that I'm going to suggest that people in my lab go Bayesian by default in their data analytic practice. </div>
<div>
<br /></div>
<div>
----<br />
<i>Thanks to Roger Levy for pointing out that model (3) above could include an </i>age | stimulus<i> slope to be truly maximal. I will follow this advice in the paper. </i><br />
<i><br /></i></div>
<div>
* Who would have thought that a paper about statistical models would be called "<a href="https://www.sciencedirect.com/science/article/pii/S0749596X16302467">the cave of shadows</a>"?</div>
<div>
** <a href="http://mc-stan.org/users/interfaces/rstanarm">Rstanarm</a> did this also, but it covered fewer model specifications and so wasn't as helpful. </div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com23tag:blogger.com,1999:blog-4297242917419089261.post-20628455182544306412018-01-16T10:43:00.000-08:002018-01-16T10:43:15.125-08:00MetaLab, an open resource for theoretical synthesis using meta-analysis, now updated<div class="separator" style="clear: both; text-align: center;">
</div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: start;">
<i>(This post is jointly written by the MetaLab team, with contributions from Christina Bergmann, Sho Tsuji, Alex Cristia, and me.)</i></div>
<div dir="ltr" style="line-height: 1.38; margin-left: 1em; margin-right: 1em; margin-top: 0pt; text-align: center;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-left: 1em; margin-right: 1em; margin-top: 0pt;">
</div>
<div style="text-align: center;">
<img height="168" src="https://lh5.googleusercontent.com/4BjFFXavXONUE3PYgN_Vs75bIWceASz7Nf65otmmxalwD_AlrWHwXS-BR8en0kWJJ3wQt8TwJQzYlEsV7VlhvBl1m36K4KLsBC5NzWsqQmPYYNUJrKoKHKeK7UKn5p20glDAKgdn" width="200" /></div>
<i></i><br />
<div style="text-align: center;">
<i><i>A typical “ages and stages” ordering. Meta-analysis helps us do better.</i></i></div>
<i>
</i>
<div dir="ltr" style="line-height: 1.38; margin-left: 1em; margin-right: 1em; margin-top: 0pt;">
</div>
<div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
Developmental psychologists often make statements of the form “babies do X at age Y.” But these “ages and stages” tidbits sometimes misrepresent a complex and messy research literature. In some cases, dozens of studies test children of different ages using different tasks and then declare success or failure based on a binary p < .05 criterion. Often only a handful of these studies – typically those published earliest or in the most prestigious journals – are used in reviews, textbooks, or summaries for the broader public. In medicine and other fields, it’s long been recognized that we can do better.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Meta-analysis (MA) is a toolkit of techniques for combining information across disparate studies into a single framework so that evidence can be synthesized objectively. The results of each study are transformed into a standardized effect size (like Cohen’s d) and are treated as a single data point for a meta-analysis. Each data point can be weighted to reflect a given study’s precision (which typically depends on sample size). These weighted data points are then combined into a meta-analytic regression to assess the evidential value of a given literature. Follow-up analyses can also look at moderators – factors influencing the overall effect – as well as issues like publication bias or p-hacking.* Developmentalists will often enter participant age as a moderator, since meta-analysis enables us to statistically assess how much effects for a specific ability increase as infants and children develop. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<img height="257" src="https://lh6.googleusercontent.com/OReQ4GgN-NwAQRkCNfvGJ6JFXeK8Yi-qJf3Iey21OC-CwnrHZHls7NTMzX6ZzJ2eB_kH5AplRHgzhJBFATvp0KhMoYRwku8bSamg6FqotJ6uJPMjhYu-lLGdIlDMPAC77kwie82u" width="400" /></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: center;">
<i>An example age-moderation relationship for studies of mutual exclusivity in early word learning.</i></div>
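<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
For readers who want to try this at home: the moderated meta-analytic regression described above is a short call to the <span style="font-family: "courier new" , "courier" , monospace;">metafor</span> R package. This is a sketch assuming a data frame with one row per study, an effect size yi, its sampling variance vi, and a mean_age moderator column (illustrative names, not MetaLab's exact schema):</div>
<div style="text-align: left;">
<br /></div>
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;">library(metafor)

# random-effects meta-analysis with mean participant age as a moderator;
# yi = standardized effect size, vi = its sampling variance
ma <- rma(yi, vi, mods = ~ mean_age, data = dat, method = "REML")
summary(ma)
</span></pre>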
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Meta-analyses can be immensely informative – yet they are rarely used by researchers. One reason may be that it takes a bit of training to carry them out or even understand them. Additionally, MAs go out of date as new studies are published. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
To facilitate developmental researchers’ access to up-to-date meta-analyses, we created <a href="http://metalab.stanford.edu/">MetaLab</a>. MetaLab is a website that compiles MAs of phenomena in developmental psychology. The site has grown over the last two years from just a small handful of MAs to 15 at present, with data from more than 16,000 infants. The data from each MA are stored in a standardized format, allowing them to be downloaded, browsed, and explored using interactive visualizations. Because all analyses are dynamic, curators or interested users can add new data as the literature expands.</div>
<div style="text-align: left;">
<a name='more'></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<img height="295" src="https://lh6.googleusercontent.com/lL8YwYLN4xwK-AowgE-MGeKrmoUDc-cg1x_t8LGBsC6_48RlN8NDwuPwFEPS32AX0zf4cLph4_BztQbNNLBsLkJ--6m7k-vzo_o38S1b5l39ROKMNBFGIygnA_6p-gfKQYoHhcPw" width="400" /></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<i>The main visualization app on MetaLab, showing a meta-analysis of infant-directed speech preference. The dataset of interest can be selected in the left upper corner to obtain standard meta-analytic visualizations. </i></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
We thought it was time for a refresh of the site this fall, so we are launching a new version today: MetaLab 2.0.** If you have visited MetaLab before, you will notice a lot of changes. First and foremost, we’ve generalized our approach so that it is not specific to language development but can be used to explore MAs on other topics, which we hope to incorporate as they become available. There are new tutorials, more documentation and explanatory materials (including a <a href="https://www.youtube.com/watch?v=Omnq13QZ-3c&list=PLu8FqtGdUsEJUqHmhEo2Kq-e7qJ07ocUi">youtube video series</a>), and a host of other changes to make the site more intuitive to use.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
What can you do with MetaLab? During hypothesis generation, MAs can be a good way to get a comprehensive overview of a literature; we have always noted full references and many MAs even contain unpublished reports that would be difficult to locate otherwise. For instance, about half of the studies in MetaLab’s Sound Symbolism MA are unpublished. Integrating both published and unpublished records, the <a href="https://osf.io/wshdy/">forthcoming MA</a> by Sho and colleagues suggests that there is overall evidence for early sensitivity to sound symbolism, though it’s weaker than what’s represented in the published literature.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
If you are designing a study, MetaLab can help you choose a sample size or stimulus type. For example, Christina's <a href="https://osf.io/wpgjm/">new paper on vowel discrimination</a> uses stimuli that were thought to be appropriately difficult to avoid ceiling or floor effects – which worked! This selection was based on <a href="http://pubman.mpdl.mpg.de/pubman/item/escidoc:1836135/component/escidoc:1945376/Tsuj_cristia_2014.pdf">Sho's vowel discrimination MA</a>, which contains both acoustic information and effect sizes. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Even if there is not an MA for your particular phenomenon of interest, you can still learn about the average effect size for related phenomena and methods. And once you’ve finished your study, you can tell us about it so we add it to the appropriate meta-analysis in MetaLab, putting your study in the map of similar studies, and helping other researchers find it.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
If you want to learn more, check out our papers (<a href="https://osf.io/uhv3d/?view_only=None">Bergmann et al., in press, </a><a href="https://psyarxiv.com/htsjm/">Lewis et al., preprint</a>). Bergmann et al. (in press) focuses on study power and method choice, providing instructions how to make a priori sample size decisions to conduct appropriately powered studies. Lewis et al. (preprint) assesses publication bias (spoiler: it's not as bad as we feared, at least in studies of early language) and shows how all MAs on language development together can help us build data-driven theories.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
It’s more important than ever to create a cumulative research literature in which our theories rest on the sum of the available evidence and our new work is designed and powered appropriately to make a contribution. MetaLab is designed to help accomplish both of these goals. If you would like to contribute a MA, add an analysis or a datapoint, or simply comment on the site functionality, please reach out to us or <a href="https://github.com/langcog/metalab2">add a github issue</a>. And when someone asks you, “when do babies do X?,” you can look for answers based not on a handful of infants but on hundreds or thousands of them. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
---</div>
<div style="text-align: left;">
* We’re aware of the issues of meta-analysis with respect to understanding literatures that are deeply scarred by publication bias and p-hacking (e.g., <a href="http://datacolada.org/59">datacolada</a>, <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2659409">Inzlicht et al.</a>). We go into this a bit in our papers on the topic, but basically we think that – although publication bias and QRPs are a problem in our fields as they are everywhere – the literature is not fundamentally corrupted in the same way it is <a href="http://psycnet.apa.org/record/2016-17976-001">in some subfields</a>. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
** With generous support from <a href="http://www.bitss.org/projects/metalab-paving-the-way-for-easy-to-use-dynamic-crowdsourced-meta-analyses/">a SSMART grant</a> from the <a href="http://www.bitss.org/">Berkeley Initiative for Transparency in the Social Sciences</a> (BITSS) and some coding help from <a href="https://deanattali.com/shiny/">AttaliTech</a>.</div>
</div>
<br />
<br />
<div>
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"><br /></span></div>
Michael Frankhttp://www.blogger.com/profile/00681533046507717821noreply@blogger.com0tag:blogger.com,1999:blog-4297242917419089261.post-36768911340292173282017-12-07T08:40:00.000-08:002017-12-07T08:40:15.702-08:00Open science is not inherently interesting. Do it anyway. <i>tl;dr: Open science practices themselves don't make a study interesting. They are essential prerequisites whose absence can undermine a study's value.</i><br />
<br />
There's a tension in discussions of open science, one that is also mirrored in my own research. What I really care about are the big questions of cognitive science: what makes people smart? how does language emerge? how do children develop? But in practice I spend quite a bit of my time doing meta-research on reproducibility and replicability. I often hear critics of open science – focusing on replication, but also other practices – objecting that open science advocates are making science more boring and decreasing the focus on theoretical progress (e.g., <a href="https://www.researchgate.net/profile/Edwin_Locke/publication/277087389_Theory_Building_Replication_and_Behavioral_Priming_Where_Do_We_Need_to_Go_From_Here/links/55d4b89708aef1574e975920.pdf">Locke</a>, <a href="http://journals.sagepub.com/doi/abs/10.1177/1745691613514450">Stroebe & Strack</a>). The thing is, I don't completely disagree. Open science is not inherently interesting.<br />
<br />
Sometimes someone will tell me about a study and start the description by saying that it's pre-registered, with open materials and data. My initial response is "ho hum." I don't really care if a study is preregistered – <i>unless </i>I care about the study itself and suspect p-hacking. Then the only thing that can rescue the study is preregistration. Otherwise, I don't care about the study any more; <a href="http://babieslearninglanguage.blogspot.com/2016/03/limited-support-for-app-based.html">I'm just frustrated by the wasted opportunity</a>.<br />
<br />
So here's the thing: Although being open can't make your study interesting, <i>the failure to pursue open science practices can undermine the value of a study.</i> This post is an attempt to justify this idea by giving an informal Bayesian analysis of what makes a study interesting and why transparency and openness is then the key to maximizing study value.<br />
<br />
<a name='more'></a><br />
<h4>
<b>What makes a scientific study interesting?</b> </h4>
I take a fundamentally Bayesian approach to scientific knowledge. If you haven't encountered Bayesian philosophy of science, <a href="http://www.strevens.org/research/simplexuality/Bayes.pdf">here's a nice introduction by Strevens</a>; I find this framework nicely fits my intuitions about scientific reasoning. The core assumption is that knowledge in a particular domain can be represented by a probability distribution over theoretical hypotheses, given the available evidence.* This distribution can be decomposed into the product of 1) the prior probability of each hypothesis and 2) the likelihood of the hypothesis given the available evidence. New evidence changes this posterior distribution, and the amount of change is quantified by <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">information gain</a>. Thus, an "interesting" study is simply one that leads to high information gain.<br />
<br />
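To make "information gain" concrete, here's a toy calculation in R, with two hypotheses, equal priors, and made-up likelihoods:<br />
<br />
<pre style="line-height: 1.45; white-space: pre-wrap;"><span style="font-family: "courier new" , "courier" , monospace;"># two equally probable hypotheses; the data are much more likely
# under H1 (the likelihood values are invented for illustration)
prior <- c(H1 = 0.5, H2 = 0.5)
likelihood <- c(H1 = 0.9, H2 = 0.2)
posterior <- prior * likelihood / sum(prior * likelihood)

# information gain: the KL divergence from prior to posterior, in bits
sum(posterior * log2(posterior / prior))
</span></pre>
<br />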
Some good intuitions fall out of this definition. First, consider a study that decisively selects between two competing hypotheses that are equally likely based on prior literature; this study leads to high information gain and is clearly quite "theoretically interesting." Next, consider a study that provides strong support for a particular hypothesis, but the hypothesis is already deeply established in the literature; it's much less informative and hence much less interesting. Would you spend time conducting a large, high-powered test of <a href="https://en.wikipedia.org/wiki/Weber%E2%80%93Fechner_law">Weber's law</a>? Probably not – it would probably show the same regularity as the hundreds or thousands of studies before it. Finally, consider a study that collects a large amount of detailed data, but the design doesn't distinguish between hypotheses. Despite the amount of data, the theoretical progress is minimal and hence the study is not interesting.**<br />
<br />
Under this definition, an interesting study can't just have the potential to compare between hypotheses, it must provide evidence that changes our beliefs about which one is more probable.*** Larger samples and more precise measurements typically result in greater amounts of evidence, and hence lead to more important ("more interesting") studies. In the special case where the literature is consistent with two distinct hypotheses, evidence can be quantified by the <a href="https://en.wikipedia.org/wiki/Bayes_factor">Bayes Factor</a>. The bigger the Bayes Factor, the more evidence a study provides in favor of one hypothesis compared with the other, and the greater the information gain.<br />
<br />
<h4>
<b>How does open science affect whether a study is interesting?</b></h4>
Transparency and openness in science includes the sharing of code, data, and experimental materials as well as the sharing of protocols and analytic intentions (e.g., through preregistration). Under the model described above, none of these practices add to the informational value of a study. Having the raw data available or knowing that the inferential statistics are appropriate due to preregistration can't make a study better – the data are still the data, and the evidence is still the evidence.<br />
<i><br /></i>
<i>If there is uncertainty about the correctness of a result, the informational value of the study is decreased. </i>Consider a study that in principle decides between two hypotheses, but imagine the skeptical reader has no access to the data and harbors some belief that there has been a major analytic error. The reader can quantify her uncertainty about the evidential value of the study by assigning probabilities to the two outcomes: either the study is right, or else it's not. Integrating across these two outcomes, the value of the study is of course lower than if she knows the study has no error. Or similarly, imagine that the reader believes that another, different statistical test was equally appropriate but that the authors selected the one they report post hoc (leading to an inflation of their risk of a false positive).**** Again, uncertainty about the presence of p-hacking decreases the evidential value of the study, and the decrease is proportional to the strength of the belief. <br />
<br />
<i>Open science practices decrease the belief in p-hacking or error, and thus preserve the evidential value of the study.</i> If the skeptical reader has the ability to repeat the data analysis ("computational reproducibility"), the possibility of error is decreased. If she has access to the preregistration, the possibility of p-hacking is similarly minimized. Both of these steps mean that the informational value of the study is maintained rather than decreased.<br />
<br />
One corollary of this formulation is that replication can "rescue" particularly interesting research designs. A finding can – by virtue of its design – have the potential to be theoretically important yet carry limited evidential value, whether because of a small sample, imprecise measurements, or worries about error or p-hacking. In this case, a replication can substantially alter the theoretical landscape by adding evidence to the picture (a point made by <a href="http://psycnet.apa.org/record/2014-38072-004">Klein et al.</a> in their commentary on the ManyLabs studies). Replication in general, then, can be interesting <i>or</i> uninteresting – depending on the strength of the evidence for the original finding and its theoretical relevance. The most interesting replications target findings whose designs allow for high information gain but for which the evidence is weak.<br />
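<br />
One way to see the arithmetic of "adding evidence to the picture": for two point hypotheses and independent datasets, Bayes Factors multiply. A toy illustration:<br />
<pre>
bf_original    <- 3    # weak evidence from a small original study
bf_replication <- 8    # evidence from an independent replication
bf_combined    <- bf_original * bf_replication   # 24 -- now fairly strong

post_odds <- 1 * bf_combined    # posterior odds, starting from even odds
post_odds / (1 + post_odds)     # P(H1 | both datasets) = 0.96
</pre>
An original study that left readers ambivalent, plus a replication that is individually only moderately diagnostic, can jointly be quite convincing.<br />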
<br />
<h4>
<b>Conclusions</b></h4>
Open science practices won't make your study interesting or important by themselves. The only way to have an interesting study is the traditional way: create a strong experimental design grounded in theory, and gather enough evidence to force scientists to update their beliefs. But what a shame if you have gone this route and the value of your study is then undermined! Transparency is the only way to ensure that readers assign the maximal possible evidential value to your work.<br />
<br />
---<br />
* As a first approximation, let's be subjectively Bayesian, so that the distribution is in the heads of individual scientists and represents their beliefs. Of course, no scientist is perfect, but we're thinking about an idealized rational scientist who weighs the evidence and has reasonably fair subjective priors.<br />
** Advocates for hypothesis-neutral data collection argue that later investigators can bring their own hypotheses to a dataset. In the framework I'm describing here, you could think about the dataset having some latent value that isn't realized until the investigator comes along and considers whether the data are consistent with their particular hypotheses. Big multivariate datasets can be very informative in this way, even if they are not collected with any particular analysis in mind. But investigators always have to be on their guard to ensure that particular analyses aren't undermined by the post-hoc nature of the investigation. <br />
*** Even though evidence in this sense is computed after data collection, that doesn't rule out the prospective analysis of whether a study will be interesting. For example, you can compute the <i>expected information gain</i> using optimal experimental design. <a href="https://psyarxiv.com/h457v">Here's a really nice recent preprint</a> by Coenen et al. on this idea.<br />
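A minimal toy version of that computation (invented likelihoods, not Coenen et al.'s model): average the information gain over the outcomes a design could produce, weighted by how probable each outcome is under the prior.<br />
<pre>
entropy <- function(p) -sum(p * log2(p))   # Shannon entropy, in bits

prior <- c(0.5, 0.5)
# P(outcome | hypothesis): rows = hypotheses, columns = outcomes
lik <- rbind(H1 = c(success = 0.9, failure = 0.1),
             H2 = c(success = 0.3, failure = 0.7))

p_outcome <- colSums(prior * lik)   # marginal probability of each outcome
eig <- 0
for (o in colnames(lik)) {
  posterior <- prior * lik[, o] / p_outcome[o]   # Bayes' rule per outcome
  eig <- eig + p_outcome[o] * (entropy(prior) - entropy(posterior))
}
eig   # expected bits gained from running the study: ~0.3
</pre>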
**** I know that this use of the p-hacking framework mixes my Bayesian apples in with some frequentist pears. But you can just as easily do post-hoc overfitting of Bayesian models (see, e.g., the <a href="http://datacolada.org/13">datacolada post</a> on this topic).<br />
<br />
<h4><b>Talk on reproducibility and meta-science</b> (November 10, 2017)</h4>
I just gave a talk at UCSD on reproducibility and meta-science issues. <a href="https://figshare.com/articles/UCSD_Psych_Colloquium_11_9_17/5592460">The slides are posted here</a>. I focused somewhat on developmental psychology, but a number of the studies and recommendations are more general. It was lots of fun to chat with students and faculty, and many of my conversations focused on practical steps that people can take to move their research practice towards a more open, reproducible, and replicable workflow. Here are a few pointers:<br />
<br />
<b>Preregistration</b>. Here's a blogpost from last year on <a href="http://babieslearninglanguage.blogspot.com/2016/07/preregister-everything.html">my lab's decision to preregister everything</a>. I also really like Nosek et al.'s <a href="https://osf.io/2dxu5/">Preregistration Revolution</a> paper. <a href="http://aspredicted.org/">AsPredicted.org</a> is a great gateway to simple preregistration (<a href="http://datacolada.org/44">guide</a>).<br />
<br />
<b>Reproducible research</b>. Here's a blogpost on <a href="http://babieslearninglanguage.blogspot.com/2015/11/preventing-statistical-reporting-errors.html">why I advocate for using RMarkdown to write papers</a>. The best package for doing this is <a href="https://github.com/crsh/papaja">papaja</a> (pronounced "papaya"). If you don't use RMarkdown but do know R, <a href="https://github.com/mcfrank/rmarkdown-workshop">here's a tutorial</a>.<br />
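<br />
For those who haven't seen the workflow, here's roughly the idea – a minimal, made-up snippet (hypothetical data and variable names) in which the statistics reported in the text are recomputed from the data every time the document is compiled:<br />
<pre>
---
title: "A minimal reproducible writeup"
output: html_document
---

```{r analysis}
# made-up data, just for illustration
d <- data.frame(condition = rep(c("a", "b"), each = 20),
                rt = c(rnorm(20, 500, 50), rnorm(20, 530, 50)))
tt <- t.test(rt ~ condition, data = d)
```

Reaction times differed across conditions,
t(`r round(tt$parameter, 1)`) = `r round(tt$statistic, 2)`,
p = `r round(tt$p.value, 3)`.
</pre>
The payoff is that the numbers in your text can never drift out of sync with your analysis; papaja layers APA-formatted manuscripts on top of this same basic pattern.<br />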
<br />
<b>Data sharing</b>. <a href="http://pages.ucsd.edu/~cmckenzie/Simonsohn2013PsychScience.pdf">Just post it</a>. The <a href="http://osf.io/">Open Science Framework</a> is an obvious choice for file sharing. Some <a href="https://cos.io/our-services/training-services/cos-training-tutorials/">nice video tutorials</a> make it easy to get started.<br />
<br />
<h4><b>Co-work, not homework</b> (November 5, 2017)</h4>
Coordination is one of the biggest challenges of academic collaboration. You have two or more busy collaborators working asynchronously on a project. Either the collaboration ping-pongs back and forth with quick responses but limited opportunity for deeper engagement, or else one person digs in and really makes conceptual progress but then has to wait an excruciating amount of time for collaborators to get engaged, understand the contribution, and respond. What's more, there are major inefficiencies in loading the project back into memory each time you begin again. ("What was it we were trying to do here?")<br />
<br />
The "homework" model in collaborative projects is sometimes necessary, but often inefficient. This default means that we meet to discuss and make decisions, then assign "homework" based on that discussion and make a meeting to review the work and make a further plan. The time increments of these meetings are usually 60 minutes, with the additional email overhead for scheduling. Given the amount of time I and the collaborators will actually spend on the homework the ratio of actual work time to meetings is sometimes not much better than 2:1 if there are many decisions to be made on a project – as in design, analytic, and writeup stages.* Of course if an individual has to do data collection or other time-consuming tasks between meetings, this model doesn't hold!<br />
<div>
<br /></div>
Increasingly, my solution is co-work. The idea is that collaborators schedule time to sit together and do the work – typically writing code or prose, occasionally making stimuli or other materials – either in person or online. This model means that when conceptual or presentational issues come up, we can discuss them on the spot rather than waiting to resolve them by email or in a subsequent meeting.** As a supervisor, I love this model because I get to see how the folks I work with approach a problem and what their typical workflow is. This observation helps me give process-level feedback as I learn how people organize their projects. I also often learn new coding tricks this way.***<br />
<br />
The products of co-work are often stronger than drafts that come out of independent work. When we program or write by ourselves, we sometimes let bad sentences (or copy-and-pasted code) slide by – I certainly do. In contrast, when I'm working together with someone, I'm more conscious of working carefully and writing clearly. And as a supervisor, I like that this model allows us to discuss the strengths and weaknesses of something we've jointly produced – rather than having me critique the student's independent work.<br />
<br />
Co-working isn't always appropriate. If the amount to be done is too great, or the workload is not distributed evenly between collaborators (whether because of seniority, time, or skill), then it's not the right choice. But for the conceptually challenging bits of projects – say, coding the key data analysis or writing the intro or general discussion – co-working can be both an efficient way to get something done and a great way to learn and think together.****<br />
<br />
---<br />
* I also find that for me, given other academic constraints, "homework" often means "comes out of family time" (evenings and weekends).<br />
** Sometimes we work on different parts of the project, but in the same place, so that if questions come up we can interrupt and discuss.<br />
*** Of course, I recognize that this model presumes that supervisors have the time to co-work with trainees; sometimes making this time can be a hard ask. But "can you show me how you'd approach that task?" is often a reasonable question to pose to a supervisor! And of course this model works just as well – maybe even better – for collaborations between two people at the same career stage.<br />
**** In some sense it's amazing that I'm writing a blogpost about academics sitting in one place and working together, but that's really the culture we've got – almost every work situation I've been in has involved meetings for decision-making and then independent "homework" for the collaborators or the trainee.<br />