Babies Learning Language: Cognitive Science


Tuesday, December 3, 2024

Four papers I'm sad never to have published

One of the saddest things in academic research is an abandoned project. You pour time, effort, and sometimes money into a piece of research, only to feel that it has not been released into the world to make an impact. Sometimes you don't finish an analysis or write a paper. But I would argue that the saddest situations are the projects that came closest to being published – these are "near misses."*

This sadness can also have practical consequences. If we abandon projects differentially because of their results – failing to report negative findings because of a belief that they would be uninteresting or hard to publish – then we get a bias in the published literature. We know this is true – but in this post I'm not going to focus on that. I'm thinking more about inadvertent near misses. The open science movement – and in particular the rise of preprints – has changed the field a lot in that these near misses are now visible again. So I'm writing this post in part to promote and discuss four projects that never saw journal publication but that I still love...

I'm a researcher but I'm also (maybe primarily) an advisor and mentor, and so this kind of thing happens all the time: a trainee comes into my lab, does a great project, writes a paper about it, and then moves on to a new position. Sometimes they stay in academia, sometimes they don't. Even if we submit the manuscript before they leave, however, it frequently happens that reviews come back after they are distracted by the next stage of their life. Unless I take over the writing process, things typically remain unpublished.

But the worst thing is when I abandon my own work because I'm too busy doing all that advising and teaching (and also getting grants to do the next shiny thing). Sadly this has happened many times over the past 15 years or so that I've been a faculty member. I simply didn't have the fortitude to get the paper through peer review and so it lingers as something interesting but unrevised – and perhaps fatally flawed (depending on whether you trust the reviewers). Here are my four biggest regrets.

1. A literature review on computational models of early language learning. This was the first chapter of my dissertation initially, and I revised it for a review journal, hoping to do something like Pinker's famous early review paper. It was reviewed by two people, one nativist and one empiricist. Both hated it, and I abandoned it in despair. I still like what I wrote, but it's very out of date now.

2. A huge dataset on children's free-viewing of naturalistic third-person dialogue and how it relates to their word learning. I loved this one. These experiments were my very first projects when I got to Stanford – we collected hundreds of kids' worth of eye-tracking data (with an eye-tracker bought with my very first grant) and we were able to show correlational relationships between free-viewing and word learning. We even saw a similar relationship in kids on the autism spectrum. This paper was rejected several times from good journals for reasonable reasons (too correlational, kids with ASD were not well characterized). But I think it has a lot of value. (The data are now in Peekbank, at least.)

(Graph showing big developmental differences in free viewing, specifically for a moment at which you had to follow an actor's gaze to see what they were talking about in the video).

3. A large set of experiments on reference games. Noah Goodman and I created the Rational Speech Act (RSA) model of pragmatic processing and this was a big part of my early research at Stanford. I spent a ton of time and money doing Mechanical Turk experiments to try to learn more about the nature of the model. This manuscript includes a lot of methodological work on paradigms for studying pragmatic inference online as well as some clever scenarios to probe the limits (there were 10 experiments overall!). Sadly I think I tried to make the manuscript more definitive than it should have been – by the time I finally submitted it, RSA already had many variants, and some of the formal work was not as strong as the empirical side. So reviewers who disliked RSA disliked it, and reviewers who liked RSA still thought it needed work.

4. A simplified formal model of teaching and learning. This one was an extension of the RSA model for teaching and learning scenarios, trying to get a handle on how teachers might change their messages based on the prior beliefs and/or knowledge of the learners. I was really proud of it, and it shapes my thinking about the dynamics of teaching to this day. Lawrence Liu started the project, but I did a ton more analysis several years later in hopes of making a full paper. Sadly, it was rejected once – reviewers thought, perhaps reasonably, that the policy implications were too big a stretch. By the time I submitted it to another journal, a bunch of other related formal work had appeared in the computer science literature. Reviewers the second time asked for more simulations, but I was out of time and the code had gotten quite stale because it depended on a very specific tech stack.

I hope someone gets a little pleasure or knowledge from these pieces. I loved working on all four of them!

----

* I just learned that there is a whole literature on the psychology of near misses, for example in gambling or with respect to emotions like relief and regret.

Tuesday, November 5, 2019

Letter of recommendation: Attack of the Psychometricians

(tl;dr: It's letter of recommendation season, and so I decided to write one to a paper that's really been influential in my recent thinking. Psychometrics, y'all.)

To whom it may concern:

I am writing to provide my strongest recommendation for the paper, "Attack of the Psychometricians" by Denny Borsboom (2006). Reading this paper oriented me to a rich tradition of psychometric modeling – but more than that, it changed my perspective on the relationship between psychological measurement and theory. (It also taught me to use the term "sumscore"* as an insult). I urge you to consider it for a position in your reading list, syllabus, or lab meeting.

I first met AotP (or Attack!, as I like to call it) via a link on twitter. Not the most auspicious beginning, but from a quick skim on my phone, I could tell that this was a paper that needed further study.

The paper presents and discusses what it calls the central insight of psychometrics: that "measurement does not consist of finding the right observed score to substitute for a theoretical attribute, but of devising a model structure to relate an observable to a theoretical attribute." In other words, the goal is to make models that link data to theoretical quantities of interest. What this means is that measurement is essentially continuous with theory construction. By creating and testing a good measurement model, you're creating and testing a key component of a good theory.

Nothing in childhood makes sense except in the light of continuous developmental change

I'm awestruck by the processes of development that operate over children's first five years. My daughter M is five and my newborn son J is just a bit more than a month old. J can't yet consistently hold his head up, and he makes mistakes even in bottle feeding – sometimes he continues to suck but forgets to swallow so that milk pours out of his mouth until his clothes are soaked. I remember this kind of thing happening with M as a baby ... and yet voila, five years later, you have someone who is writing text messages to grandma and illustrating new stories about Spiderman. How could you possibly get from A to B (or in my case, from J to M)? The immensity of this transition is perhaps the single most important challenge for theories of child development.

As a field, we have bounced back and forth between continuity and discontinuity theories to explain these changes. Continuity theories posit that infants' starting state is related to our end state, and that changes are gradual, not saltatory; discontinuity theories posit stage-like transitions. Behaviorist learning theory was fundamentally a continuity hypothesis – the same learning mechanisms (plus experience) underlie all of behavior, and change is gradual. In contrast, Piagetian stage theory was fundamentally about explaining behavioral discontinuities. As the pendulum swung, we got core knowledge theory, a continuity theory: innate foundations are "revised but not overthrown" (paraphrasing Spelke et al. 1992). Gopnik and Wellman's "Theory theory" is a discontinuity theory: intuitive theories of domains like biology or causality are discovered like scientific theories. And so on.

For what it's worth, my take on the "modern synthesis" in developmental psychology is that development is domain-specific. Domains of development – perception, language, social cognition, etc. – progress on their own timelines determined by experience, maturation, and other constraining factors. And my best guess is that some domains develop continuously (especially motor and perceptual domains) while others, typically more "conceptual" ones, show more saltatory progress associated with stage changes. But – even though it would be really cool to be able to show this – I don't think we have the data to do so.

The problem is that we are not thinking about – or measuring – development appropriately. As a result, what we end up with is a theoretical mush. We talk as though everything is discrete, but that's mostly a function of our measurement methods. Instead, everything is at rock bottom continuous, and the question is how steep the changes are.

We talk as though everything is discontinuous all the time. The way we know how to describe development verbally is through what I call "milestone language." We discuss developmental transitions by (often helpful) age anchors, like "children say their first word around their first birthday," or "preschoolers pass the Sally-Anne task at around 3.5 years." When summarizing a study, we* assert that "by 7 months, babies can segment words from fluent speech," even if we know that this statement describes the fact that the mean performance of a group is significantly different from zero in a particular paradigm instantiating this ability, and even if we know that babies might show this behavior a month earlier if you tested enough of them! But it's a lot harder to say "early word production emerges gradually from 10–14 months (in most children)."

Beyond practicalities, one reason we use milestone language is that our measurement methods are only set up to measure discontinuities. First, our methods have poor reliability: we typically don't learn very much about any one child, so we can't say conclusively whether they truly show some behavior or not. In addition, most developmental studies are severely underpowered, just like most studies in neuroscience and psychology in general. So our estimates of a behavior for groups of children are noisy. To get around this problem, we use null hypothesis significance tests – and when the result is p < .05, we declare that development has happened. But of course we will see discrete changes in development if we use a discrete statistical cutoff!

And finally, we tend to stratify our samples into discrete age bins (which is a good way to get coverage), e.g. recruiting 3-month-olds, 5-month-olds, and 7-month-olds for a study. But then, we use these discrete samples as three separate analytic groups, ignoring the continuous developmental variation between them! This practice reduces statistical power substantially, much like taking median splits on continuous variables (taking a median split on average is like throwing away a third of your sample!). In sum, even in domains where development is continuous, our methods guarantee that we get binary outcomes. We don't try to estimate continuous functions, even when our data afford them.
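To make the median-split comparison concrete, here is a minimal simulation sketch. It assumes a normally distributed predictor (think of it as standardized age) and a linear relationship with Gaussian noise; the function names, the seed, and the true correlation of 0.5 are illustrative choices, not values from any study. Under those assumptions, splitting at the median shrinks the correlation by a factor of roughly sqrt(2/π) ≈ 0.80, so the squared correlation keeps only about 64% of its value. That is one way to arrive at the "third of your sample" rule of thumb.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_median_split(n=20000, true_r=0.5, seed=0):
    """Return (correlation with continuous age, correlation with median-split age)."""
    rng = random.Random(seed)
    # Hypothetical standardized ages and a score that depends linearly on age.
    ages = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise_sd = math.sqrt(1.0 - true_r ** 2)
    scores = [true_r * age + rng.gauss(0.0, noise_sd) for age in ages]

    # Median split: recode each child as "younger" (0) or "older" (1).
    median_age = sorted(ages)[n // 2]
    older = [1.0 if age > median_age else 0.0 for age in ages]

    return pearson(ages, scores), pearson(older, scores)


r_continuous, r_split = simulate_median_split()
attenuation = r_split / r_continuous   # theory for a normal predictor: sqrt(2 / pi)
retained = attenuation ** 2            # share of explained variance that survives
print(f"continuous r = {r_continuous:.3f}, median-split r = {r_split:.3f}")
print(f"attenuation = {attenuation:.3f}, variance retained = {retained:.3f}")
```

Splitting into three age groups instead of two loses less, but the logic is the same: any binning discards the within-bin variation in age.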

The truth is, when you scratch the surface in development, everything changes continuously. Even the stuff that's not supposed to change still changes. I saw this in one of my very first studies, when I was a lab manager for Scott Johnson and we accidentally found ourselves measuring 3- to 9-month-olds' face preferences. Though I had learned from the literature that infants had an innate face bias, I was surprised to find that the magnitude of face looking was changing dramatically across the range I was measuring. (Later we found that this change was related to the development of other visual orienting skills). Of course "it's not surprising" that some complex behavior goes up with development, says reviewer 3. But it is important, and the ways we talk about and analyze our data don't reflect the importance of quantifying continuous developmental change.

One reason that it's not surprising to see developmental change is that everything that children do is at its heart a skill. Sucking and swallowing is a skill. Walking is a skill. Recognizing objects is a skill. Recognizing words is a skill too – so too is the rest of language, at least according to some folks. Thinking about other people's thoughts is a skill. So that means that everything gets better with practice. It will – to a first approximation – follow a classic logistic curve like this:

Most skills get better with practice, and the ones described above are no exception. But developmental progress also happens in the absence of practice of specific skills due to physiological maturation – older children's brains are faster and more accurate at processing information, even for skills that haven't been practiced. So samples from this behavior should look like these red lines:

But here's the problem. If you have a complex behavior, it's built of simple behaviors, which are themselves skills. To get the probability of success on one of those complex skills, you can – as a first approximation – multiply the independent probabilities of success in each of the components. That process yields logistic curves that look like these (color indicating the number of components):

And samples from a process with many components look even more discrete, because the logistic is steeper!

Given this kind of perspective, we should expect complex behaviors to emerge relatively suddenly, even if they are simply the product of a handful of continuously changing processes.
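The multiplication argument above can be sketched in a few lines of Python. This is a toy illustration, not the code behind the original graphs: it assumes every component skill follows the same logistic curve and that components succeed independently, and the names `logistic`, `complex_skill`, `max_slope`, and `age_at_half` are hypothetical.

```python
import math


def logistic(age, midpoint=0.0, rate=1.0):
    """Probability of success on a single component skill at a given age."""
    return 1.0 / (1.0 + math.exp(-rate * (age - midpoint)))


def complex_skill(age, n_components):
    """Success on a complex behavior: product of independent, identical components."""
    return logistic(age) ** n_components


def max_slope(n_components, lo=-10.0, hi=10.0, step=0.01):
    """Steepest rate of change of the composite curve (finite differences on a grid)."""
    best = 0.0
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps):
        age = lo + i * step
        rise = complex_skill(age + step, n_components) - complex_skill(age, n_components)
        best = max(best, rise / step)
    return best


def age_at_half(n_components):
    """Age (in arbitrary units) at which the composite curve crosses 50% success."""
    p = 0.5 ** (1.0 / n_components)   # each component must succeed this often
    return math.log(p / (1.0 - p))    # inverse of the standard logistic


for k in (1, 2, 5, 10):
    print(f"{k:2d} components: steepest slope = {max_slope(k):.3f}, "
          f"50% point at age {age_at_half(k):+.2f}")
```

In this simplified setup the composite curve shifts later and gets steeper as components are added. The steepening is bounded: the maximum slope works out to (k/(k+1))^(k+1) for k components, which rises from 0.25 toward 1/e. Components with different midpoints or rates would change the exact shape.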

This means, from a theoretical standpoint, we need stronger baselines. Our typical baseline at the moment is the null hypothesis of no difference; but that's a terrible baseline! Instead, we need to be comparing to a null hypothesis of "developmental business as usual." To show discontinuity, we need to take into account the continuous changes that a particular behavior will inevitably be undergoing. And then, we need to argue that the rate of developmental change that a particular process is undergoing is faster than we should expect based on simple learning of that skill. Of course to make these kinds of inferences requires far more data about individuals than we usually gather.

In a conference paper that I'm still quite proud of, we tried to create this sort of baseline for early word learning. Arguably, early word learning is a domain where there likely aren't huge, discontinuous changes – instead kids gradually get faster and more accurate in learning new words until they are learning several new words per day. We used meta-analysis to estimate developmental increases in two component processes of novel word mapping: auditory word recognition and social cue following. Both of these got faster and more accurate over the first couple of years. When we put these increases together, we found they together created really substantial changes in how much input would be needed for a new word mapping. (Of course what we haven't done in the three years since we wrote that paper is actually measure the parameters on the process of word mapping developmentally – maybe that's for a subsequent ManyBabies study...). Overall, this baseline suggests that even in the absence of discontinuity, continuous changes in many small processes can produce dramatic developmental differences.

In sum: sometimes developmental psychologists don't take the process of developmental change seriously enough. To do better, we need to start analyzing change continuously; measuring with sufficient precision to estimate rates of change; and creating better continuous baselines before we make claims about discrete change or emergence.

---
* I definitely do this too!

Thursday, August 30, 2018

Three (different) questions about development

(tl;dr: Some questions I'm thinking about, inspired by the idea of studying the broad structure of child development through larger-scale datasets.)

My daughter, M, started kindergarten this month. I began this blog when I was on paternity leave after she was born; watching her grow over the past five years has been an adventure and a revolution in my understanding of development.* Perhaps the most astonishing feature of the experience is how continuous, incremental changes lead to what seem like qualitative revolutions. There is of course no moment in which she became the sort of person she is now: the kind of person who can tell a story about an adventure in which two imaginary characters encounter one another for the first time,** but some set of processes led us to this point. How do you uncover the psychological factors that contribute to this kind of growth and change?

My lab does two kinds of research. In both my hope is to contribute to this kind of understanding by studying the development of cognition and language in early childhood. The first kind of work we do is to conduct series of experiments with adults and children, usually aimed at getting answers to questions about representation and mechanism in early language learning in social contexts. The second kind of work is a larger-scale type of resource-building, where we create datasets and accompanying tools like Wordbank, MetaLab, and childes-db. The goal of this work is to make larger datasets accessible for analysis – as testbeds for reproducibility and theory-building.

Each of these activities connects to the project of understanding development at the scale of an entire person's growth and change. In the case of small-scale language learning experiments, the inference strategy is pretty standard. We hypothesize the operation of some mechanism or the utility of some information source in a particular learning problem (say, the utility of pragmatic inference in word learning). Then we carry out a series of experiments that shows a proof of concept that children can use the hypothesized mechanism to learn something in a lab situation, along with control studies that rule out other possibilities. When done well, these studies can give you pretty good traction on individual learning mechanisms. But they can't tell you that these mechanisms are used by children consistently (or even at all) in their actual language learning.

In contrast, when we work with large-scale datasets, we get a whole-child picture that isn't available in the small studies. In our Wordbank work, for example, we get a global picture of the child's vocabulary and linguistic abilities, for many children across many languages. The trouble is, it's very hard or even impossible to find answers to smaller-scale questions (say, about information seeking from social partners) in datasets that represent global snapshots of children's experience or outcomes. Both methods – the large-scale and the small-scale – are great, but the questions they answer don't necessarily line up. Instead, larger datasets tend to direct you towards different questions. Here are three.

Where does logical language come from? The social bootstrapping hypothesis

(Musings on the origins of logical language, inspired by work done in my lab by Ann Nordmeyer, Masoud Jasbi, and others).

For the last couple of years I've been part of a group of researchers who are interested in where logic comes from. While formal, boolean logic is a human discovery*, all human languages appear to have methods for making logical statements. We can negate a statement ("No, I didn't eat your dessert while you were away"), quantify ("I ate all of the cookies"), and express conditionals ("if you finish early, you can join me outside.").** While boolean logic doesn't offer a good description of these connectives, natural language still has some logical properties. How does this come about? Because I study word learning, I like to think about logic and logical language as a word learning problem. What is the initial meaning that "no" gets mapped to? What about "and", "or", or "if"?

Perhaps logical connectives are learned just like other words. When we're talking about object words like "ball" or "dog," a common hypothesis is that children have object categories as the possible meanings of nouns. These object categories are given to the child by perception*** in some form or other. Then, kids hear their parents refer to individual objects ("look! a dog! [POINTS TO DOG]"). The point allows the determination of reference; the referent is identified as an instance of a category, and – modulo some generalization and statistical inference – the word is learned, more or less.****

So how does this process work for logical language? There are plenty of linguistic complexities for the learner to deal with: Most logical words simply don't make sense on their own. You can't just turn to your friend and say "or" (at least not without a lot of extra context). So any inference that a child makes about the meaning of the word will have to involve disentangling that from the meaning of the sentence as a whole. But beyond that, what are the potential targets for the meaning of these words? There's nothing you can point to out in the world that is an "if," an "and," or even a "no."

What does it mean to get a degree in psychology these days?

(I was asked to give a speech yesterday at Stanford's Psychology commencement ceremony. Here is the text).

1. Chair, Colleagues, graduates of the class of 2018 – undergraduates and graduate students – family members, and friends. It’s a pleasure to be here today with all of you. Along with honoring our graduates, we especially honor all the wonderful speakers today for their accomplishments – MH for his excellence in research and teaching, Angela for her deep engagement with the department community. You could be forgiven for thinking that there was some special achievement that brought me here as well. In fact, by tradition, faculty take turns addressing the graduating class and it is my turn this year. It’s a real pleasure to have one last chance to address you.

Two weeks ago, my daughter Madeline graduated from preschool. There was cake; photos were taken. They broke a piñata. It was a big deal! Several of her friends will be going to different schools, some moving away to other states or even other countries. This is one of the biggest changes she’s ever experienced. I’m already worried about what happens next. Parents, I can only imagine what you are going through today – but at least you know that your kids made it through the first day of kindergarten.

Graduates – your graduation from Stanford today is a really big deal. You also get to have cake and photos. If you’re very lucky, some special person has even bought you a piñata. But more importantly, just like for Madeline, this is a time of transitions. You may be moving somewhere new. Even if you are staying here, friends will be further away than the next dorm or the next office. So do not hesitate to take a little extra time today to celebrate with the people you love and who love you.

Congratulations.

2. I want to take a little time now to think about what it means to get a degree in psychology from Stanford.

When you sit next to someone on an airplane and tell them you are studying psychology, perhaps they ask you if you are reading their mind. Perhaps they wonder if you are studying Freudian analysis and have thoughts about their unconscious, or their relationship with their mother. Or maybe they are more up to date and wonder if you study psychological disorders as they manifest themselves in the clinic. But the truth is, knowing what you’ve done in your degrees here at Stanford, you probably haven’t done too much Freud. Or too much mind-reading. And although you may be interested in clinical work (and this is laudable), that’s not the core of what we teach here.

Gaining a degree in psychology also means that you have gone to many classes in psychology and learned about many studies – from social influence to stereotype threat, from mental rotation to marshmallow tests. Although this body of knowledge is a lovely thing to have come into contact with (and I hope that you continue to deepen your knowledge), knowing this content is also not the core of what it means to receive your degree.

What you have learned instead are tools; a specific kind of tool, namely tools for thought. These tools can be used to approach problems and construct solutions. This is what it means for psychology to be an academic discipline: a discipline denotes a particular mental toolbox. The university is the intellectual equivalent of a construction firm – different departments have the tools to solve different sorts of problems.

3. Like nearly all ideas, “cognitive tools” seem obvious – after you are used to them. Let’s take one example, a foundational cognitive tool that we use every single day: numbers. Because we are so numerate, a lot of people have the idea that numbers are easy and straightforward. But they aren’t.

Take the preschoolers in Madeline’s old classroom. Nearly all of them can count, at least to ten and maybe higher. But if you probe a bit more deeply, it all falls apart. If at snack time, you ask someone to give you exactly four cheerios, she’s liable to hand you seven, or a whole handful. Even when a child knows that “one” means exactly 1, it takes quite a few months for them to figure out that “two” means exactly 2, and more months for 3. When they finally figure out how the whole system works it enables so many new things! Madeline owes all of her dessert-negotiation prowess to her abilities with numbers. Seven gummi bears? No. How about six? This idea of exact comparison is a skill – even though it makes for tiresome after-dinner conversation.

Numbers are an invented, culturally-transmitted tool. In graduate school I worked with an Amazonian indigenous group, the Pirahã, who have no words for numbers. They are bright, sophisticated people who love a good practical joke. Many Pirahã can shoot a fish with an arrow while standing in a canoe. Yet because their language does not have these particular words in it – words like “seven” – and because they do not go through that laborious period of practice that Madeline and other kids learning languages like English do – they can’t remember that it’s exactly seven gummi bears. To them, six or eight seems like the same amount. They simply don’t have the tool.

4. So what are the tools of the psychologist?

There’s one tool that qualifies as the hammer of psychology – the single tool you can use to frame an entire house. That’s the experiment. The fundamental insight of all of modern psychology is that the puzzles of the human mind can be understood as objects of scientific study if we can design appropriately controlled experiments. As complicated and unpredictable as people are (especially when they are integrated into complex cultural systems), we can still learn about their inner workings via experiments.

This insight has spread far outside of psychology and far outside of the academy. Nowadays, Facebook runs a hundred experiments a day on you. Governments and political campaigns, startups and not-for-profits are all constantly experimenting to try to understand how to achieve their goals. There is a good chance that in the next few years of your professional life you will face a complicated human problem with an unknown solution. The psychologist’s approach will serve you well: formulate a hypothesis about how you should manipulate the world; then assess whether the manipulation has changed your measurement of interest. This strategy is shockingly effective.

But the serious carpenter has other, more specialized tools in the toolkit – the plane, awl, rasp, drawknife, jigsaw, bevel. Let me mention two more.

The first is the idea that our knowledge is not just a set of facts, but is organized into theories that help us understand the world. We call these theories intuitive theories – they are the explanatory frameworks that people carry with them to understand why things happen. What follows from this idea is that when you want to change people’s behavior, you can’t just tell them to change or tell them different facts. You need to change their theory. When I want Madeline to eat her vegetables, it turns out just telling her to “eat broccoli” doesn’t work very well – even if she does eat the broccoli, she won’t know what else to eat or why to eat it. And of course the well-known idea about fostering a growth mindset is precisely this kind of implicit theory: it’s a theory of whether ability is fixed or whether it can be improved with hard work.

The second idea I want to share is that our judgment is systematically biased. It’s biased by our own beliefs. Our minds are wonderful, efficient systems that deal with uncertainty – we piece together a sentence even in a noisy restaurant using our expectations about what that person might be trying to say to us. In most cases, this is an amazing feature of our own cognition, letting us operate flexibly using limited data. But this reliance on our own beliefs also has negative consequences: it leads us to stereotype, and to engage in confirmation bias, looking for evidence that further supports our own beliefs. Understanding these sources of bias can help us avoid falling into this trap. A good grounding in psychology, in other words, helps us be more aware of our own limitations.

I’d love to tell you about more ideas. Every woodworker loves to show off their workbench. And the wonderful thing about tools is that when you use them together you can create new tools, in the same way the carpenter can first make a jig to make it easier to make a difficult cut. I could go on, but hopefully I’ve piqued your curiosity – and you have lots more to do today.

5. So. Make sure that you celebrate! Eat some cake, smash a piñata, and most of all, say your "thank you"s to the people who have supported you during your time here at Stanford. I speak for all of them when I say that we are very proud of you and cannot wait to see what you accomplish.

As this weekend passes and you head off for other things, it is all but certain that you will find yourself in new situations facing challenges that you have not considered before. (Life would not be fun without them!). But I am confident that your tools will be sufficient to the job. Keep them sharp and they will serve you well.

Friday, October 6, 2017

Introducing childes-db: a flexible and reproducible interface to CHILDES

Note: childes-db is a project that is a collaboration between Alessandro Sanchez, Stephan Meylan, Mika Braginsky, Kyle MacDonald, Dan Yurovsky, and me; this blogpost was written jointly by the group.

For those of us who study child development – and especially language development – the Child Language Data Exchange System (CHILDES) is probably the single most important resource in the field. CHILDES is a corpus of transcripts of children, often talking with a parent or an experimenter, and it includes data from dozens of languages and hundreds of children. It’s a goldmine. CHILDES has also been around since way before the age of “big data”: it started with Brian MacWhinney and Catherine Snow photocopying transcripts (and then later running OCR to digitize them!). The field of language acquisition has been a leader in open data sharing largely thanks to Brian’s continued work on CHILDES.

Despite these strengths, using CHILDES can sometimes be challenging, especially for the most casual or most in-depth interactions. Simple analyses like estimating word frequencies can be done using CLAN – the major interface to the corpora – but these require more comfort with command-line interfaces and programming than can be expected in many classroom settings. On the other end of the spectrum, many of us who use CHILDES for in-depth computational studies like to read in the entire database, parse out many of the rich annotations, and get a set of flat text files. But doing this parsing correctly is complicated, and often small decisions in the data-processing pipeline can lead to different downstream results. Further, it can be very difficult to reconstruct a particular data prep in order to do a replication study. We've been frustrated several times when trying to reproduce others' modeling results on CHILDES, not knowing whether our implementation of their model was wrong or whether we were simply parsing the data differently.

To address these issues and generally promote the use of CHILDES in a broader set of research and education contexts, we’re introducing a project called childes-db. childes-db aims to provide both a visualization interface for common analyses and an application programming interface (API) for more in-depth investigation. For casual users, you can explore the data with Shiny apps, browser-based interactive graphs that supplement CHILDES’s online transcript browser. For more intensive users, you can get direct access to pre-parsed text data using our API: an R package called childesr, which allows users to subset the corpora and get processed text. The backend of all of this is a MySQL database that’s populated using a publicly-available – and hopefully definitive – CHILDES parser, to avoid some of the issues caused by different processing pipelines.

What's the relationship between language and thought? The Optimal Semantic Expressivity Hypothesis

(This post came directly out of a conversation with Alex Carstensen. I'm writing a synthesis of others' work, but the core hypotheses here are mostly not my own.)

What is the relationship between language and thought? Do we think in language? Do people who speak different languages think about the world differently? Since my first exposure to cognitive science in college, I've been fascinated with the relationship between language and thought. I recently wrote about my experiences teaching about this topic. Since then I've been thinking more about how to connect the Whorfian literature – which typically investigates whether cross-linguistic differences in grammar and vocabulary result in differences in cognition – with work in semantic typology, pragmatics, language evolution, and conceptual development.

Each of these fields investigates questions about language and thought in different ways. By mapping cross-linguistic variation, typologists provide insight into the range of possible representations of thought – for example, Berlin & Kay's classic study of color naming across languages. Research in pragmatics describes the relationship between our internal semantic organization and what we actually communicate to one another, a relationship that can in turn lead to language evolution (see e.g., Box 4 of a review I wrote with Noah Goodman). And work on children's conceptual development can reveal effects of language on the emergence of concepts (e.g., as in classic work by Bowerman & Choi on learning to describe motion events in Korean vs. English).

All of these literatures provide their own take on the issue of language and thought, and the issue is further complicated by the many different semantic domains under investigation. Language and thought research has taken color as a central case study for the past fifty years, and there is also an extensive tradition of research on spatial cognition and navigation. But there are also more recent investigations of object categorization, number, theory of mind, kinship terms, and a whole host of other domains. And different domains provide more or less support to different hypothesized relationships. Color categorization seems to suggest a simple model where it's faster to categorize different colors because the words help with encoding and memory. In contrast, exact number may require much more in the way of conceptual induction, where children bootstrap wholly new concepts.

The Optimal Semantic Expressivity Hypothesis. Recently, a synthesis has begun to emerge that cuts across a number of these fields. Lots of people have contributed to this synthesis, but I associate it most with work by Terry Regier and collaborators (including Alex!), Dedre Gentner, and to a certain extent the tradition of language evolution research from Kenny Smith and Simon Kirby (also with a great and under-cited paper by Baddeley and Attewell).* This synthesis posits that languages have evolved over historical time to provide relatively optimal, discrete representations of particular semantic domains like color, number, or kinship. Let's call this the optimal semantic expressivity (OSE) hypothesis.**

Language and thought: Shifting the axis of the Whorfian debate

A summary of the changing axes of the debate over effects of language on cognition.

This spring I've been teaching in Stanford's study abroad program in Santiago, Chile. It's been a wonderful experience to come back to a city where I was an exchange student, and to navigate the challenges of living in a different language again – this time with a family. My course here is called "Language and Thought" (syllabus), and it deals with the Whorfian question of the relationship between cognition and language. I proposed it because effects of language on thought are often high in the mind of people having to navigate life in a new language and culture, and my own interest in the topic came out of trying to learn to speak other languages.

The exact form of the question of language and thought is one part of the general controversy surrounding this topic. But in Whorf's own words, his question was

Are our own concepts of 'time,' 'space,' and 'matter' given in substantially the same form by experience to all men, or are they in part conditioned by the structure of particular languages? (Whorf, 1941)

This question has personal significance for me since I got my start in research working as an undergraduate RA for Lera Boroditsky on a project on cross-linguistic differences in color perception, and I later went on to study cross-linguistic differences in language for number as part of my PhD with Ted Gibson.

Minimal nativism

(After blogging a little less in the last few months, I'm trying out a new idea: I'm going to write a series of short posts about theoretical ideas I've been thinking about.)

Is human knowledge built using a set of perceptual primitives combined by the statistical structure of the environment, or does it instead rest on a foundation of pre-existing, universal concepts? The question of innateness is likely the oldest and most controversial in developmental psychology (think Plato vs. Aristotle, Locke vs. Descartes). In modern developmental work, this question so bifurcates the research literature that it can often feel like scientists are playing for different "teams," with incommensurable assumptions, goals, and even methods. But these divisions have a profoundly negative effect on our science. Throughout my research career, I've bounced back and forth between research groups and even institutions that are often seen as playing on different teams from one another (even if the principals involved personally hold much more nuanced positions). Yet it seems obvious that neither has sole claim to the truth. What does a middle position look like?

One possibility is a minimal nativist position. This term is developed in Noah Goodman and Tomer Ullman's work, showing up first in a very nice paper called Learning a Theory of Causality.* In that paper, they write:

... this [work] suggests a novel take on nativism—a minimal nativism—in which strong but domain-general inference and representational resources are aided by weaker, domain-specific perceptual input analyzers.

This statement comes in the context of the authors' proposal that infants' theory of causal reasoning – often considered a primary innate building block of cognition – could in principle be constructed by a probabilistic learner. But that learner would still need some starting point; in particular, here the authors' learner had access to 1) a logical language of thought and 2) some basic information about causal interventions, perhaps from the infant's innate knowledge about contact causality or the actions of social agents (these are the "input analyzers" in the quote above).

Was Piaget a Bayesian?

tl;dr: Analogies between Piaget's theory of development and formal elements in the Bayesian framework.

Intro

I'm co-teaching a course with Alison Gopnik at Berkeley this quarter. It's called "What Changes?" and the goal is to revisit some basic ideas about what drives developmental changes. Here's the syllabus, if you're interested. As part of the course, we read the first couple of chapters of Flavell's brilliant book, "The Developmental Psychology of Jean Piaget." I had come into contact with Piagetian theory before of course, but I've never spent that much time engaging with the core ideas. In fact, I don't actually teach Piaget in my intro to developmental psychology course. Although he's clearly part of the historical foundations of the discipline, to a first approximation, a lot of what he said turned out to be wrong.

In my own training and work, I've been inspired by probabilistic models of cognition and cognitive development. These models use the probability calculus to represent degrees of belief in different hypotheses, and have been influential in a wide range of domains from perception and decision-making to communication and social cognition.¹ But as I have gotten more interested in the measurement of developmental change (e.g., in Wordbank or MetaLab, two new projects I've been involved in recently), I've become a bit more frustrated with these probabilistic tools, since there hasn't been as much progress in using them to understand children's developmental change (in contrast to progress characterizing the nature of particular representations). Hence my desire to teach this course and understand what other theoretical frameworks had to contribute.

Despite the seeming distance between the modern Bayesian framework and Piaget, reading Flavell's synthesis I was surprised to see that many of the key Piagetian concepts actually had nice parallels in Bayesian theory. So this blogpost is my attempt to translate some of these key concepts in theory into a Bayesian vocabulary.² It owes a lot to our class discussion, which was really exciting. For me, the translation highlights significant areas of overlap between Piagetian and Bayesian thinking, as well as some nice places where the Bayesian theory could grow.

Limited support for an app-based intervention

tl;dr: I reanalyzed a recent field-trial of a math-learning app. The results differ by analytic strategy, suggesting the importance of preregistration.

Last year, Berkowitz et al. published a randomized controlled trial of a learning app. Children were randomly assigned to math and reading app groups; their learning outcomes on standardized math and reading tests were assessed after a period of app usage. A math anxiety measure was also collected for children’s parents. The authors wrote that:

The intervention, short numerical story problems delivered through an iPad app, significantly increased children’s math achievement across the school year compared to a reading (control) group, especially for children whose parents are habitually anxious about math.

I got excited about this finding because I have recently been trying to understand the potential of mobile and tablet apps for intervention at home, but when I dug into the data I found that not all views of the dataset supported the success of the intervention. That's important because this was a well-designed, well-conducted trial. But the basic randomization to condition did not produce differences in outcome, as you can see in the main figure of my reanalysis.

My extensive audit of the dataset is posted here, with code and their data here. (I really appreciate that the authors shared their raw data so that I could do this analysis – this is a huge step forward for the field!). Quoting from my report:

In my view, the Berkowitz et al. study does not show that the intervention as a whole was successful, because there was no main effect of the intervention on performance. Instead, it shows that – in some analyses – more use of the math app was related to greater growth in math performance, a dose-response relationship that is subject to significant endogeneity issues (because parents who use math apps more are potentially different from those who don’t). In addition, there is very limited evidence for a relationship of this growth to math anxiety. In sum, this is a well-designed study that nevertheless shows only tentative support for an app-based intervention.

Here's a link to my published comment (which came out today), and here's Berkowitz et al.'s very classy response. Their final line is:

We welcome debate about data analysis and hope that this discussion benefits the scientific community.

Explorations in hierarchical drift diffusion modeling

tl;dr: Adventures in using different platforms/methods to fit drift diffusion models to data.

The drift diffusion model (DDM) is increasingly a mainstay of research on decision-making, both in neuroscience and cognitive science. The classic DDM defines a pseudo random-walk decision process that describes a distribution on both accuracies and reaction times. This kind of joint distribution is really useful for capturing tasks where there could be speed-accuracy tradeoffs, and hence where classic univariate analyses are uninformative. Here's the classic DDM picture, this version from Vandekerckhove, Tuerlinckx, & Lee (2010), who have a nice tutorial on hierarchical DDMs:

We recently started using DDM to try and understand decision-making behavior in the kinds of complex inference tasks that my lab and I have been studying for the past couple of years. For example, in one recently-submitted paper, we use DDM to look at decision processes for inhibition, negation, and implicature, trying to understand the similarities and differences in these three tasks:

We had initially hypothesized that performance in the negation and implicature tasks (our target tasks) would correlate with inhibition performance. It didn't, and what's more the data seemed to show very different patterns across the three tasks. So we turned to DDM to understand a bit more of the decision process for each of these tasks.* Also, in a second submitted paper, we looked at decision-making during "scalar implicatures," the inference that "I ate some of the cookies" implies that I didn't eat all of them. In both of these cases, we wanted to know what was going on in these complex, failure-prone inferences.

An additional complexity was that we are interested in the development of these inferences in children. DDM has not been used much with children, usually because of the large number of trials that DDM seems to require. But we were inspired by a recent paper by Ratcliff (one of the important figures in DDMs), which used DDMs for data from elementary-aged children. And since we have been using iPad experiments to get RTs and accuracies for preschoolers, we thought we'd try and do these analyses with data from both kids and adults.

But... it turns out that it's not trivial to fit DDMs (especially the more interesting variants) to data, so I wanted to use this blogpost to document my process in exploring different ecosystems for DDM and hierarchical DDM.
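For readers new to the model, the random-walk decision process described at the top of this post can be sketched as a simulation. This is a minimal Python sketch, not one of the fitting methods explored here: it only generates data from a basic DDM rather than fitting one, and the parameter names and values (`drift`, `boundary`, `start`, `non_decision`, the step size `dt`) are illustrative assumptions, not estimates from our tasks.

```python
import random

def simulate_ddm_trial(drift, boundary, start=0.5, noise=1.0,
                       non_decision=0.3, dt=0.001, max_time=5.0, rng=random):
    """Simulate one trial of a basic drift diffusion process.

    Evidence starts at start * boundary and accumulates in small noisy
    steps until it crosses the upper bound (boundary) or the lower bound (0).
    Returns (choice, rt): choice is 1 for the upper bound, 0 for the lower
    bound, and (None, None) if no bound is reached within max_time.
    """
    x = start * boundary
    t = 0.0
    step_sd = noise * dt ** 0.5  # noise scales with the square root of the time step
    while t < max_time:
        x += drift * dt + rng.gauss(0.0, step_sd)
        t += dt
        if x >= boundary:
            return 1, t + non_decision
        if x <= 0.0:
            return 0, t + non_decision
    return None, None

# Hypothetical parameter values, chosen only for illustration.
rng = random.Random(1)
trials = [simulate_ddm_trial(drift=1.0, boundary=1.5, rng=rng) for _ in range(500)]
finished = [(c, rt) for c, rt in trials if c is not None]
accuracy = sum(c for c, _ in finished) / len(finished)
mean_rt = sum(rt for _, rt in finished) / len(finished)
print(f"accuracy = {accuracy:.2f}, mean RT = {mean_rt:.2f} s")
```

Each simulated trial yields both a choice and a reaction time, which is the joint distribution that makes the model useful when there are speed-accuracy tradeoffs. Raising `boundary` in this sketch makes responses slower but more accurate.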

A conversation about scale construction

(Note: this post is joint with Brent Roberts and Michael Kraus, and is cross-posted on their blogs - MK and BR).

MK: Twitter recently rolled out a polling feature that allows its users to ask and answer questions of each other. The poll feature allows polling with two possible response options (e.g., Is it Fall? Yes/No). Armed with snark and some basic training in psychometrics and scale construction, I thought it would be fun to pose the following as my first poll:

Said training suggests that, all things being equal, some people are more “Yes” or more “No” than others, so having response options that include more variety will capture more of the real variance in participant responses. To put that into an example, if I ask you whether you agree with the statement “I have high self-esteem,” a yes/no two-item response won’t capture all the true variance in people’s responses that might otherwise be captured by six items ranging from strongly disagree to strongly agree. MF/BR, is that how you would characterize your own understanding of psychometrics?

MF: Well, when I’m thinking about dependent variable selection, I tend to start from the idea that the more response options for the participant, the more bits of information are transferred. In a standard two-alternative forced-choice (2AFC) experiment with balanced probabilities, each response provides 1 bit of information. In contrast, a 4AFC provides 2 bits, an 8AFC provides 3, etc. So on this kind of reasoning, the more choices the better, as illustrated by this table from Rosenthal & Rosnow’s classic text:

For example, in one literature I am involved in, people are interested in the ability of adults and kids to associate words and objects in the presence of systematic ambiguity. In these experiments, you see several objects and hear several words, and over time the idea is that you build up some kind of links between objects and words that are consistently associated. In these experiments, initially people used 2 and 4AFC paradigms. But as the hypotheses about mechanism got more sophisticated, people shifted to using more stringent measures, like a 15AFC, which was argued to provide more information about the underlying representations.
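As a rough illustration of the bits-per-response reasoning above, here is a minimal Python sketch. It assumes every alternative is equally likely (the "balanced probabilities" case), so it gives an upper bound on what a single response can carry; the function name is just for illustration.

```python
import math

def bits_per_response(n_alternatives: int) -> float:
    """Maximum information (in bits) in one n-alternative forced-choice
    response, assuming all alternatives are equally likely."""
    if n_alternatives < 1:
        raise ValueError("need at least one alternative")
    return math.log2(n_alternatives)

for n in (2, 4, 8, 15):
    print(f"{n}AFC: {bits_per_response(n):.2f} bits")
```

So a 15AFC response can carry at most about 3.9 bits, but only if participants actually have that much signal to report.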

On the other hand, getting more information out of such a measure presumes that there is some underlying signal. In the example above, the presence of this information was relatively likely because participants had been trained on specific associations. In contrast, in the kinds of polls or judgment studies that you’re talking about, it’s more unknown whether participants have the kind of detailed representations that allow for fine-grained judgements. So if you’re asking for a judgment in general (like in #TwitterPolls or classic likert scales), how many alternatives should you use?

MK: Right, most or all of my work (and I imagine a large portion of survey research) involves subjective judgments where it isn’t known exactly how people are making their judgments and what they’d likely be basing those judgments on. So, to reiterate your own question: How many response alternatives should you use?

MF: Turns out there is some research on this question. There’s a very well-cited paper by Preston & Colman (2000), who ask about a service rating scale for restaurants. Not the most psychological example, but it’ll do. They present different participants with different numbers of response categories, ranging from 2 to 101. Here is their primary finding:

In a nutshell, the reliability is pretty good for two categories, but it gets somewhat better up to about 7-9 options, then goes down somewhat. In addition, scales with more than 7 options are rated as slower and harder to use. Now this doesn’t mean that all psychological constructs have enough resolution to support 7 or 9 different gradations, but at least simple ratings or preference judgements seem like they might.

MK: This is great stuff! But if I’m being completely honest here, I’d say the reliabilities for just two response categories, even though they aren’t as good as they are at 7-9 options, are good enough to use. BR, I’m guessing you agree with this because of your response to my Twitter Poll:

BR: Admittedly, I used to believe that when it came to response formats, more was always better. I mean, we know that dichotomizing continuous variables is bad, so how could it be that a dichotomous rating scale (e.g., yes/no) would be as good if not superior to a 5-point rating scale? Right?

Two things changed my perspective. The first was precipitated by being forced to teach psychometrics, which is minimally on the 5th level of Dante’s Hell teaching-wise. For some odd reason at some point I did a deep dive into the psychometrics of scale response formats and found, much to my surprise, a long and robust history going all the way back to the 1920s. I’ll give two examples. Like the Preston & Colman (2000) study that Michael cites, some old old literature had done the same thing (god forbid, replication!!!). Here’s a figure showing the test-retest reliability from Matell & Jacoby (1971), where they varied the response options from 2 to 19 on measures of values:

The picture is a little different from the internal consistencies shown in Preston & Colman (2000), but the message is similar. There is not a lot of difference between 2 and 19. What I really liked about the old school researchers is they cared as much about validity as they did reliability--here’s their figure showing simple concurrent validity of the scales:

The numbers bounce a bit because of the small samples in each group, but the obvious take away is that there is no linear relation between scale points and validity.

The second example is from Komorita & Graham (1965). These authors studied two scales, the evaluative dimension from the Semantic Differential and the Sociability scale from the California Psychological Inventory. The former is really homogeneous, the latter quite heterogeneous in terms of content. The authors administered 2 and 6 point response formats for both measures. Here is what they found vis a vis internal consistency reliability:

This set of findings is much more interesting. When the measure is homogeneous, the rating format does not matter. When it is heterogeneous, having 6 options leads to better internal consistency. The authors’ discussion is insightful and worth reading, but I’ll just quote them for brevity: “A more plausible explanation, therefore, is that some type of response set such as an “extreme response set” (Cronbach, 1946; 1950) may be operating to increase the reliability of heterogeneous scales. If the reliability of the response set component is greater than the reliability of the content component of the scale, the reliability of the scale will be increased by increasing the number of scale points.”

Thus, the old-school psychometricians argued that increasing the number of scale point options does not affect test-retest reliability or validity. It does marginally increase internal consistency, but most likely because of “systematic error” such as response sets (e.g., consistently using extreme options or not) that add some additional internal consistency to complex constructs.

One interpretation of our modern love of multi-option rating scales is that it leads to better internal consistencies which we all believe to be a good thing. Maybe it isn’t.

MK: I have three reactions to this: First, I’m sorry that you had to teach psychometrics. Second, it’s amazing to me that all this work on scale construction and optimal item amount isn’t more widely known. Third, how come, knowing all this as you do, this is the first time I have heard you favor two-item response options?

BR: You might think that I would have become quite the zealot for yes/no formats after coming across this literature, but you would be wrong. I continued pursuing my research efforts using 4 and 5 point rating scales ad nauseam. Old dogs and new tricks and all of that.

The second experience that has turned me toward using yes/no more often, if not by default, came as a result of working with non-WEIRD [WEIRD = Western, Educated, Industrialized, Rich, and Democratic] samples and being exposed to some of the newer, more sophisticated approaches to modeling response information in Item Response Theory. For a variety of reasons our research of late has been in samples not typically employed in most of psychology, like children, adolescents, and less literate populations than elite college students. In many of these samples, the standard 5-point likert rating of personality traits tends to blow up (psychometrically speaking). We’ve considered a number of options for simplifying the assessment to make it less problematic for these populations to rate themselves, one of which is to simplify the rating scale to yes/no.

It just so happens that we have been doing some IRT work on an assessment experiment we ran on-line where we randomly assigned people to fill out the NPI in one of three conditions--the traditional paired-comparison, 5-point likert ratings of all of the stems, and a yes/no rating of all of the NPI item stems (here’s one paper from that effort). I assumed that if we were going to turn to a yes/no format that we would need more items to net the same amount of information as a likert-style rating. So, I asked my colleague and collaborator, Eunike Wetzel, how many items you would need using a yes/no option to get the same amount of test information from a set of likert ratings of the NPI. IRT techniques allow you to estimate how much of the underlying construct a set of items captures via a test information function. What she reported back was surprising and fascinating. You get the same amount of information out of 10 yes/no ratings as you do out of 10 5-point likert scale ratings of the NPI.

So, Professor Kraus, this is the source of the pithy comeback to your tweet. It seems to me that there is no dramatic loss of information, reliability, or validity when using 2-point rating scales. If you consider the benefits gained--responses will be a little quicker, fewer response set problems, and the potential to be usable in a wider population, there may be many situations in which a yes/no is just fine. Conversely, we may want to be cautious about the gain in internal consistency reliability we find in highly verbal populations, like college students, because it may arise through response sets and have no relation to validity.

MK: I appreciate this really helpful response (and that you address me so formally). Using a yes/no format has some clear advantages, as it forces people to fall on one side of a scale or the other, is quicker to answer than questions that rely on 4-7 Likert items, and it sounds (from your work, BR) like it allows scales to hold up better for non-WEIRD populations. MF, what is your reaction to this work?

MF: This is totally fascinating. I definitely see the value of using yes/no in cases where you’re working with non-WEIRD populations. We are just in the middle of constructing an instrument dealing with values and attitudes about parenting and child development and the goal is to be able to survey broader populations than the university-town parents we often talk to. So I am certainly convinced that yes/no is a valuable option for that purpose and will do a pilot comparison shortly.

On the other hand, I do want to push back on the idea that there are never cases where you would want a more graded scale. My collaborators and I have done a bunch of work now using continuous dependent variables to get graded probabilistic judgments. Two examples of this work are Kao et al., (2014) – I’m not an author on that one but I really like it – and Frank & Goodman (2012). To take an example, in the second of those papers we showed people displays with a bunch of shapes (say a blue square, blue circle, and green square) and asked them, if someone used the word “blue,” which shape do you think they would be talking about?

In those cases, using sliders or “betting” measures (asking participants to assign dollar values between 0 and 100) really did seem to provide more information per judgement than other measures. I’ve also experimented with using binary dependent variables in these tasks, and my impression is that they both converge to the same mean, but that the confidence intervals on the binary DV are much larger. In other words, if we hypothesize in these cases that participants really are encoding some sort of continuous probability, then querying it in a continuous way should yield more information.
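One way to see why the binary confidence intervals would be wider: a yes/no response with mean p has variance p(1 - p), which is the largest variance any response confined to the range 0 to 1 can have at that mean. The Python sketch below compares approximate 95% interval half-widths under that logic. The values of `p`, `n`, and `slider_sd` are made-up numbers for illustration, not estimates from any of the studies mentioned here.

```python
import math

def ci_halfwidth(sd: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence-interval half-width for a sample mean."""
    return z * sd / math.sqrt(n)

p = 0.6          # hypothetical underlying judgment probability
n = 50           # hypothetical number of participants
slider_sd = 0.2  # hypothetical spread of slider responses around p

# Standard deviation of a yes/no response whose mean is p.
binary_sd = math.sqrt(p * (1 - p))

print(f"binary DV: +/- {ci_halfwidth(binary_sd, n):.3f}")
print(f"slider DV: +/- {ci_halfwidth(slider_sd, n):.3f}")
```

Under these assumptions both measures are centered on the same mean, but the binary measure needs several times as many participants to reach the same precision. If slider responses were themselves very noisy, the advantage would shrink.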

So Brent, I guess I’m asking you whether you think there is some wiggle room in the results we discussed above – for constructs and participants where scale calibration is a problem and psychological uncertainty is large, we’d want yes/no. But for constructs that are more cognitive in nature, tasks that are more well-specified, and populations that are more used to the experimental format, isn’t it still possible that there’s an information gain for using more fine-grained scales?

BR: Of course there is wiggle room. There are probably vast expanses of space where alternatives are more appropriate. My intention is not to create a new “rule of thumb” where we only use yes/no responses throughout. My intention was simply to point out that our confidence in certain rules of thumb is misplaced. In this case, the assumption that likert scales are always preferable is clearly wrong. On the other hand, there are great examples where a single, graded dimension is preferable--we just had a speaker discussing political orientation which was rated from conservative to moderate to liberal on a 9-point scale. This seems entirely appropriate. And, mind you, I have a nerdly fantasy of someday creating single-item personality Behaviorally Anchored Rating Scales (BARS). These are entirely cool rating scales where the items themselves become anchors on a single dimension. So instead of asking 20 questions about how clean your room is, I would anchor the rating points from “my room is messier than a suitcase packed by a spider monkey on crack” to “my room is so clean they make silicon memory chips there when I’m not in”. Then you could assess the Big Five or the facets of the Big Five with one item each. We can dream can’t we?

MF: Seems like a great dream to me. So - it sounds like if there’s one take-home from this discussion, it’s “don’t always default to the seven-point likert scale.” Sometimes such scales are appropriate and useful, but sometimes you want fewer – and maybe sometimes you’d even want more.

Friday, October 2, 2015

Can we improve math education with a 5000-year-old technology?

(This post is written jointly by my collaborator David Barner and me; we're posting it to both his new blog, MeaningSeeds, and to mine).

The first calculating machines invented by humans – stone tablets with grooves that contained counting stones or "calculi" – are no match for contemporary computers in terms of computational power. But they and their descendants, in the form of the modern Soroban abacus, may have an edge on modern techniques when it comes to mathematics education. In a study about to appear in Child Development, co-authored with George Alvarez, Jessica Sullivan, and Mahesh Srinivasan, we investigated a recent trend in math education that emanates from these first counting boards: The use of "mental abacus."

The abacus, which originates from Babylonian counting boards dating back to at least 2700 BC, has been used in a dozen different cultures in different forms for tallying, accounting, and basic arithmetic procedures like addition, subtraction, multiplication and division. And recently, it has made a comeback in classrooms around the world, as a supplement to K-12 elementary mathematics. The most popular form of abacus – the Japanese Soroban (pictured below) – features a collection of beads arranged into vertical columns, each of which represents a place value – ones, tens, hundreds, thousands, etc. At the bottom of each column are four "earthly" beads, each of which represents a multiple of 1. On top is one "heavenly" bead, which represents a multiple of 5. When beads are moved toward the dividing beam, they are "in play", such that each column can represent a value up to 9.
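To make the bead arithmetic concrete, here is a small Python sketch of the column encoding just described: each decimal digit becomes a count of heavenly beads (worth 5 each) and earthly beads (worth 1 each) in play. The function names are invented for this illustration, and the sketch only covers how a number is represented, not how sums are carried out on the device.

```python
def soroban_columns(number: int) -> list:
    """Return one (heavenly, earthly) pair of in-play bead counts per
    column, most significant digit first."""
    if number < 0:
        raise ValueError("this sketch handles non-negative integers only")
    columns = []
    for digit_char in str(number):
        # A digit d uses d // 5 heavenly beads and d % 5 earthly beads.
        heavenly, earthly = divmod(int(digit_char), 5)
        columns.append((heavenly, earthly))
    return columns

def column_value(heavenly: int, earthly: int) -> int:
    """Value shown by a single column."""
    return 5 * heavenly + earthly

print(soroban_columns(1729))  # [(0, 1), (1, 2), (0, 2), (1, 4)]
```

The largest value a column can show is one heavenly bead plus four earthly beads, which is `column_value(1, 4)`, or 9.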

When children learn mental abacus, they first are taught to represent numbers on the physical device, and then to add and subtract quantities by moving beads in and out of play. After some months of practice, they are then asked to do sums by simply imagining an abacus, rather than using the actual physical device. This mental version of the abacus has clear – and sometimes profound – computational benefits for some expert users. Highly trained users – called "masters" by those in the abacus world – can instantly encode and recall long strings of numbers, can add two-digit numbers as fast as they can be called out in sequence, and can compute square roots – and even cube roots – almost instantaneously, even for large numbers. Most startling of all, these techniques can be practiced while simultaneously talking, and can be mastered by children as young as 10 years of age with record-breaking results (see also here, here, and here). If you haven’t ever seen this phenomenon, take a look at the YouTube video below. It is truly remarkable stuff.

In our study we asked whether this technique can be mastered to good effect by ordinary school children, in big, busy, modern classrooms. We conducted the research in Vadodara, a medium-sized industrial town on the west coast of India, where abacus has recently become a popular supplement to standard math training in both after-school and standard K-12 settings. At the charitable school we visited, abacus training was already underway and was being taught to hundreds of children starting in Grade 2, in classrooms of 70 children per group. To see whether it was having a positive effect, we enrolled a new, previously untrained, cohort of roughly 200 Grade 2 kids and randomly assigned them to receive either abacus training from expert teachers or extra hours of standard math training, in addition to their regular math curriculum.

Even in these relatively large classrooms of children from low-income families, mental abacus technique edged out standard math. Though effects were modest in this group, they were reliable across multiple measures of math ability. Also, children attained the best mastery of mental abacus if they began the study with strong spatial working memory abilities (to get a sense of how we measured spatial working memory take a look at this video).

Why did abacus have this positive effect? One possibility is that learning a different way of representing numbers helped kids make generalizations about how numbers work. For example, the abacus – like other math manipulatives – provides a concrete representation of place value – i.e., the idea that the same digit can represent a different quantity depending on its position (e.g., the first and second 3 in “33” represent 30 and 3 respectively). This better representation might have helped kids understand the conceptual basis of arithmetic. Another possibility is that the edge was chiefly due to the highly procedural nature of mental abacus training. Operations are initially learned as sequences of hand movements, rather than as linguistic rules, and according to users can be performed almost automatically, without reflection. Finally, it's possible that it's this unique mix of conceptual concreteness and procedural efficacy that gives the abacus its edge. Children may not have to learn procedures and then separately learn how these operations relate to objects and sets in the world: Abacus may allow both to be learned at the same time, a welcome tonic to the ongoing math wars.

Right now it's uncertain why mental abacus helps kids, and whether the effects we've found will last beyond early elementary school. Also, the technique has yet to be rigorously tested on US shores, where it's currently being adopted by public schools in at least two states. This is the focus of a new study, currently underway, which will test whether this ancient calculation technique should be left in museums, or instead be widely adopted to boost math achievement in the 21st century.

Wednesday, September 30, 2015

Descriptive vs. optimal Bayesian modeling

In the past fifteen years, Bayesian models have fast become one of the most important tools in cognitive science. They have been used to create quantitative models of psychological data across a wide variety of domains, from perception and motor learning all the way to categorization and communication. But these models have also had their critics, and one of the recurring critiques of the models has been their entanglement with claims that the mind is rational or optimal. How can optimal models of mind be right when we also have so much evidence for the sub-optimality of human cognition?*

An exciting new manuscript by Tauber, Navarro, Perfors, and Steyvers makes a provocative claim: you can give up on the optimal foundations of Bayesian modeling and still make use of the framework as an explicit toolkit for describing cognition.** I really like this idea. For the last several years, I've been arguing for decoupling optimality from the Bayesian project. I even wrote a paper called "throwing out the Bayesian baby with the optimal bathwater" (which was about Bayesian models of baby data, clever right?).

In this post, I want to highlight two things about the TNPS paper, which I generally really liked and enjoyed reading. First, it contains an innovative fusion of Bayesian cognitive modeling and Bayesian data analysis. BDA has been a growing and largely independent strand of the literature; fusing BDA with cognitive models makes a lot of really rich new theoretical development possible. Second, it contains two direct replications that succeed spectacularly, and it does so without making any fuss whatsoever – this is, in my view, what observers of the "replication crisis" should be aspiring to.

1. Bayesian cognitive modeling meets Bayesian data analysis.

The meat of the TNPS paper revolves around three case studies in which they use the toolkit of Bayesian data analysis to fit cognitive models to rich experimental datasets. In each case they argue that taking an optimal perspective – in which the structure of the model is argued to be normative relative to some specified task – is overly restrictive. Instead, they specify a more flexible set of models with more parameters. Some settings of these parameters may be "suboptimal" for many tasks but have a better chance of fitting the human data. And the fitted parameters of these models then can reveal aspects of how human learners treat the data – for example, how heavily they weight new observations or what sampling assumptions they make.

This fusion of Bayesian cognitive modeling and Bayesian data analysis is really exciting to me because it allows the underlying theory to be much more responsive to the data. I've been doing less cognitive modeling in recent years in part because my experience was that my models weren't as responsive as I liked to the data that I and others collected. I often came to a point where I would have to do something awful to my elegant and simple cognitive model in order to make it fit the human data.

One example of this awfulness comes from a paper I wrote on word segmentation. We found that an optimal model from the computational linguistics literature did a really good job fitting human data - if you assumed that it observed data equivalent to something between a tenth and a hundredth of the data the humans observed. I chalked this problem up to "memory limitations" but didn't have much more to say about it. In fact, nearly all my work on statistical learning has included some kind of memory limitation parameter, more or less – a knob that I'd twiddle to make the model look like the data.***

In their first case study, TNPS estimate the posterior distribution of this "data discounting" parameter as part of their descriptive Bayesian analysis. That may not seem like a big advance from the outside, but in fact it opens the door to putting into place much more psychologically-inspired memory models as part of the analytic framework. (Dan Yurovsky and I played with something a bit like this in a recent paper on cross-situational word learning – where we estimated a power-law memory decay on top of an ideal observer word learning model – but without the clear theoretical grounding that TNPS provide.) I would love to see this kind of work really try to understand what this sort of data discounting means, and how it integrates with our broader understanding of memory.
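To give a flavor of what estimating a discounting parameter looks like, here is a toy sketch. It is not the TNPS model: the Beta-Bernoulli learner, the uniform prior over the discount, the Gaussian response noise, and all of the names and numbers are assumptions made up for illustration. The learner treats each observation as only a fraction `discount` of an observation, and we recover a grid-approximate posterior over that fraction from the learner's judgments.

```python
import math


def discounted_estimate(successes, trials, discount, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a Beta-Bernoulli learner whose counts are scaled by `discount`."""
    return (prior_a + discount * successes) / (prior_a + prior_b + discount * trials)


def discount_posterior(data, judgments, grid, noise_sd=0.05):
    """Grid-approximate posterior over the discount.

    data: list of (successes, trials) shown to the participant.
    judgments: the participant's probability estimates, one per data item.
    Assumes a uniform prior over `grid` and Gaussian response noise (both hypothetical).
    """
    log_post = []
    for discount in grid:
        log_lik = 0.0
        for (successes, trials), judgment in zip(data, judgments):
            predicted = discounted_estimate(successes, trials, discount)
            log_lik += -0.5 * ((judgment - predicted) / noise_sd) ** 2
        log_post.append(log_lik)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


data = [(9, 10), (18, 20), (45, 50)]
# Simulated (not real) judgments from a learner who uses a tenth of the data.
judgments = [discounted_estimate(k, n, 0.1) for k, n in data]
grid = [i / 100 for i in range(1, 101)]
posterior = discount_posterior(data, judgments, grid)
best = grid[posterior.index(max(posterior))]
print(best)  # 0.1
```

The point of the descriptive approach is that the discount is no longer a knob set by hand: it gets a posterior distribution like any other parameter, and that distribution can itself be the object of theorizing.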

2. The role of replication.

Something that flies completely under the radar in this paper is how closely TNPS replicate the previous empirical findings reported. Their Figure 1 tells a great story:

Panel (a) shows the original data and model fits from Griffiths & Tenenbaum (2007), and panel (b) shows their own data and replicated fits. This is awesome. Sure, the model doesn't perfectly fit the data - and that's TNPS's eventual point (along with a related point about individual variation). But clearly GT measured a true effect, and they measured it with high precision.

The same thing was true of Griffiths & Tenenbaum (2006) – the second case study in TNPS. GT2006 was a study about estimating conditional distributions for different processes, e.g. given that you've lived X years, how likely is it that you live Y. At the risk of belaboring the point, I'll show you three datasets on this question. First from GT2006, second from TNPS, and third a new, unreported dataset from my replication class a couple of years ago.**** The conditions (panels) are plotted in different orders in each plot, but if you take the time to trace one, say lifespans or poems, you will see just how closely these three datasets replicate one another. Not just the shape of the curve but also the precise numerical values:

This result is the ideal outcome to strive for in our responses to the reproducibility crisis. Quantitative theory requires precise measurement - you just can't get anywhere fitting a model to a small number of noisily estimated conditions. So you have to strive to get precise measures – and this leads to a virtuous cycle. Your critics can disagree with your model precisely because they have a wealth of data to fit their more complex models to (that's exactly TNPS's move here).
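For readers who haven't seen the GT2006 task, the optimal model being tested can be sketched in a few lines. As I understand their setup, the observed duration t is treated as a uniform sample from the total span, so the likelihood is 1/t_total, and the prediction is the posterior median. The Gaussian lifespan prior below (mean 75, standard deviation 16) and the particular ages are illustrative assumptions, not the values fitted in any of the three datasets.

```python
import math


def predict_total(t, prior, grid):
    """Posterior median of t_total given that t has elapsed so far.

    Likelihood: p(t | t_total) = 1 / t_total for t_total > t, and 0 otherwise.
    `prior` is an unnormalized density over t_total, evaluated on `grid`.
    """
    weights = [prior(total) / total if total > t else 0.0 for total in grid]
    norm = sum(weights)
    if norm == 0.0:
        raise ValueError("no prior mass above the observed value")
    cumulative = 0.0
    for total, weight in zip(grid, weights):
        cumulative += weight / norm
        if cumulative >= 0.5:
            return total
    return grid[-1]


def gaussian_prior(mean, sd):
    """Unnormalized Gaussian density, used here as a hypothetical lifespan prior."""
    return lambda x: math.exp(-0.5 * ((x - mean) / sd) ** 2)


lifespan_prior = gaussian_prior(75, 16)
ages = range(1, 151)
for current_age in (18, 39, 61, 83, 96):
    print(current_age, predict_total(current_age, lifespan_prior, ages))
```

Swapping in a different prior (say, a power law for quantities like movie grosses) changes the shape of the prediction curve, which is why precise data across many conditions is so informative about the model.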

I think it's no coincidence that quite a few of the first big-data Mechanical Turk studies I saw were done by computational cognitive scientists. Not only were they technically oriented and happy to port their experiments to the web, they also were motivated by a severe need for more measurement precision. And that kind of precision leads to exactly the kind of reproducibility we're all striving for.

---
* Think Tversky & Kahneman, but there are many many issues with this argument...
** Many thanks to Josh Tenenbaum for telling me about the paper; thanks also to the authors for posting the manuscript.
*** I'm not saying the models were in general overfit to the data – just that they needed some parameter that wasn't directly derived from the optimal task analysis.
**** Replication conducted by Calvin Wang.
