Thursday, February 21, 2019

Nothing in childhood makes sense except in the light of continuous developmental change

I'm awestruck by the processes of development that operate over children's first five years. My daughter M is five and my newborn son J is just a bit more than a month old. J can't yet consistently hold his head up, and he makes mistakes even in bottle feeding – sometimes he continues to suck but forgets to swallow, so that milk pours out of his mouth until his clothes are soaked. I remember this kind of thing happening with M as a baby ... and yet voila, five years later, you have someone who is writing text messages to grandma and illustrating new stories about Spiderman. How could you possibly get from A to B (or in my case, from J to M)? The immensity of this transition is perhaps the single most important challenge for theories of child development.

As a field, we have bounced back and forth between continuity and discontinuity theories to explain these changes. Continuity theories posit that infants' starting state is continuous with the adult end state, and that changes are gradual, not saltatory; discontinuity theories posit stage-like transitions. Behaviorist learning theory was fundamentally a continuity hypothesis – the same learning mechanisms (plus experience) underlie all of behavior, and change is gradual. In contrast, Piagetian stage theory was fundamentally about explaining behavioral discontinuities. As the pendulum swung, we got core knowledge theory, a continuity theory: innate foundations are "revised but not overthrown" (paraphrasing Spelke et al. 1992). Gopnik and Wellman's "Theory theory" is a discontinuity theory: intuitive theories of domains like biology or causality are discovered like scientific theories. And so on.

For what it's worth, my take on the "modern synthesis" in developmental psychology is that development is domain-specific. Domains of development – perception, language, social cognition, etc. – progress on their own timelines, determined by experience, maturation, and other constraining factors. And my best guess is that some domains develop continuously (especially motor and perceptual domains) while others, typically more "conceptual" ones, show more saltatory progress associated with stage changes. But – even though it would be really cool to be able to show this – I don't think we have the data to do so.

The problem is that we are not thinking about – or measuring – development appropriately. As a result, what we end up with is a theoretical mush. We talk as though everything is discrete, but that's mostly a function of our measurement methods. Instead, everything is at rock bottom continuous, and the question is how steep the changes are.

We talk as though everything is discontinuous all the time. The way we know how to describe development verbally is through what I call "milestone language." We discuss developmental transitions via (often helpful) age anchors, like "children say their first word around their first birthday," or "preschoolers pass the Sally-Ann task at around 3.5 years." When summarizing a study, we* assert that "by 7 months, babies can segment words from fluent speech," even if we know that this statement describes the fact that the mean performance of a group is significantly different from zero in a particular paradigm instantiating this ability, and even if we know that babies might show this behavior a month earlier if you tested enough of them! But it's a lot harder to say "early word production emerges gradually from 10–14 months (in most children)."

Beyond practicalities, one reason we use milestone language is that our measurement methods are only set up to measure discontinuities. First, our methods have poor reliability: we typically don't learn very much about any one child, so we can't say conclusively whether that child truly shows some behavior or not. In addition, most developmental studies are severely underpowered, just like most studies in neuroscience and psychology in general, so our estimates of a behavior – even for groups of children – are noisy. To get around this problem, we use null hypothesis significance tests – and when the result is p < .05, we declare that development has happened. But of course we will see discrete changes in development if we use a discrete statistical cutoff!

And finally, we tend to stratify our samples into discrete age bins (which is a good way to get coverage), e.g. recruiting 3-month-olds, 5-month-olds, and 7-month-olds for a study. But then we treat these discrete samples as three separate analytic groups, ignoring the continuous developmental variation between them! This practice reduces statistical power substantially, much like median splits on continuous variables (a median split, on average, is like throwing away a third of your sample!). In sum, even in domains where development is continuous, our methods guarantee that we get binary outcomes. We don't try to estimate continuous functions, even when our data afford them.
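To make this concrete, here's a toy simulation in R (all numbers invented for illustration): a binary behavior improves continuously with age, and we analyze the same data once with age as a continuous predictor and once after splitting the sample into bins.

# Toy simulation (all parameters invented): a binary behavior improves
# continuously with age; compare a continuous analysis to a binned one.
set.seed(1)
n <- 60
age <- runif(n, 3, 7)                                # continuous ages
success <- rbinom(n, 1, plogis(age, location = 5))   # p(success) rises smoothly

# Continuous analysis: logistic regression with age as a predictor.
summary(glm(success ~ age, family = binomial))

# Binned analysis: median split into "younger" vs. "older" groups.
group <- factor(age > median(age), labels = c("younger", "older"))
t.test(success ~ group)

Run this a few times with different seeds and sample sizes: the continuous model generally recovers the developmental trend with far fewer children than the binned comparison does.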

The truth is, when you scratch the surface in development, everything changes continuously. Even the stuff that's not supposed to change still changes. I saw this in one of my very first studies, when I was a lab manager for Scott Johnson and we accidentally found ourselves measuring 3- to 9-month-olds' face preferences. Though I had learned from the literature that infants had an innate face bias, I was surprised to find that the magnitude of face looking was changing dramatically across the age range I was measuring. (Later we found that this change was related to the development of other visual orienting skills.) Of course "it's not surprising" that some complex behavior goes up with development, says reviewer 3. But it is important, and the ways we talk about and analyze our data don't reflect the importance of quantifying continuous developmental change.

One reason that it's not surprising to see developmental change is that everything children do is at its heart a skill. Sucking and swallowing is a skill. Walking is a skill. Recognizing objects is a skill. Recognizing words is a skill too – as is the rest of language, at least according to some folks. Thinking about other people's thoughts is a skill. So that means that everything gets better with practice. It will – to a first approximation – follow a classic logistic curve like this:
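(The figure is easy enough to reconstruct in code. Here's a minimal R sketch – the midpoint and slope of the curve are completely arbitrary:)

# A classic logistic skill-learning curve: probability of success as a
# function of practice (units and parameters arbitrary).
practice <- seq(0, 10, by = 0.01)
p_success <- plogis(practice, location = 5, scale = 1)
plot(practice, p_success, type = "l", ylim = c(0, 1),
     xlab = "practice (arbitrary units)", ylab = "probability of success")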


Most skills get better with practice, and the ones described above are no exception. But developmental progress also happens in the absence of practice, due to physiological maturation – older children's brains are faster and more accurate at processing information, even for skills that haven't been practiced. So samples of this behavior should look like these red lines:
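(Again a minimal sketch, building on the plot above – five simulated children, each observed in ten binomial trials per session, with all numbers made up:)

# Noisy samples from the underlying curve: five simulated children, each
# tested with 10 trials at each practice level (adds red lines to the
# plot from the previous sketch).
set.seed(42)
sessions <- seq(0, 10, by = 0.5)
for (child in 1:5) {
  observed <- rbinom(length(sessions), size = 10,
                     prob = plogis(sessions, location = 5)) / 10
  lines(sessions, observed, col = "red")
}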



But here's the problem. If you have a complex behavior, it's built of simple behaviors, which are themselves skills. To get the probability of success on the complex behavior, you can – as a first approximation – multiply the independent probabilities of success of each of the components. That process yields logistic curves that look like these (color indicating the number of components):
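(In code, under the simplifying assumption that every component follows the same arbitrary logistic as above:)

# Success on a complex behavior as the product of independent component
# skills, each following the same (invented) logistic curve.
practice <- seq(0, 10, by = 0.01)
n_components <- c(1, 2, 4, 8)
plot(NULL, xlim = c(0, 10), ylim = c(0, 1),
     xlab = "practice (arbitrary units)", ylab = "probability of success")
for (i in seq_along(n_components)) {
  lines(practice, plogis(practice, location = 5)^n_components[i], col = i)
}
legend("topleft", legend = n_components, col = seq_along(n_components),
       lty = 1, title = "number of components")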


And samples from a process with many components look even more discrete, because the logistic is steeper!
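(Sampling from the steepest of those curves in the same way as before:)

# Binomial samples from the 8-component curve: the behavior looks almost
# step-like, "emerging" suddenly despite purely continuous components.
set.seed(42)
sessions <- seq(0, 10, by = 0.5)
p_complex <- plogis(sessions, location = 5)^8
observed <- rbinom(length(sessions), size = 10, prob = p_complex) / 10
plot(sessions, observed, type = "b", ylim = c(0, 1),
     xlab = "practice (arbitrary units)", ylab = "proportion success")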


Given this kind of perspective, we should expect complex behaviors to emerge relatively suddenly, even if they are simply the product of a handful of continuously changing processes.

This means that, from a theoretical standpoint, we need stronger baselines. Our typical baseline at the moment is the null hypothesis of no difference; but that's a terrible baseline! Instead, we need to be comparing to a null hypothesis of "developmental business as usual." To show discontinuity, we need to take into account the continuous changes that a particular behavior will inevitably be undergoing, and then argue that the rate of change we observe is faster than we should expect based on simple learning of that skill. Of course, making these kinds of inferences requires far more data about individuals than we usually gather.
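One way you might operationalize this kind of baseline – strictly a sketch of mine, not an established method – is as a model comparison: fit a smooth "business as usual" growth model, then ask whether adding a discontinuity actually improves the fit.

# Sketch of a "developmental business as usual" test, assuming a data
# frame d with continuous age and binary success (e.g., reusing the toy
# simulation above); the breakpoint at age 4.5 is purely illustrative.
d <- data.frame(age = age, success = success)
smooth_fit <- glm(success ~ age, family = binomial, data = d)
jump_fit <- glm(success ~ age + I(age > 4.5), family = binomial, data = d)
anova(smooth_fit, jump_fit, test = "LRT")   # is the discontinuity needed?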

In a conference paper that I'm still quite proud of, we tried to create this sort of baseline for early word learning. Arguably, early word learning is a domain where there likely aren't huge, discontinuous changes – instead, kids gradually get faster and more accurate at learning new words until they are learning several new words per day. We used meta-analysis to estimate developmental increases in two component processes of novel word mapping: auditory word recognition and social cue following. Both of these got faster and more accurate over the first couple of years. When we put these increases together, we found that they jointly produced really substantial changes in how much input would be needed for a new word mapping. (Of course, what we haven't done in the three years since we wrote that paper is actually measure the parameters of the word-mapping process developmentally – maybe that's for a subsequent ManyBabies study...) Overall, this baseline suggests that even in the absence of discontinuity, continuous changes in many small processes can produce dramatic developmental differences.

In sum: sometimes developmental psychologists don't take the process of developmental change seriously enough. To do better, we need to start analyzing change continuously; measuring with sufficient precision to estimate rates of change; and creating better continuous baselines before we make claims about discrete change or emergence. 

---
* I definitely do this too!

Sunday, December 9, 2018

How to run a study that doesn't replicate, experimental design edition

(tl;dr: Design features of psychology studies to avoid if you want to run a good study!)

Imagine reading about a psychology experiment in which participants are randomly assigned to one of two different short state inductions (say by writing a passage or unscrambling sentences), and then outcomes are measured via a question about an experimental vignette. The whole thing takes place in about 10 minutes and is administered through a survey, perhaps via Qualtrics.

The argument of this post is that this experiment has a low probability of replicating, and we can make that judgment purely from the experimental methods – regardless of the construct being measured, the content of the state induction, or the judgment that is elicited. Here's why I think so.

Friday was the last day of my graduate class in experimental methods. The centerpiece of the course is a replication project in which each student collects data on a new instantiation of a published experiment. I love teaching this course and have blogged before about outcomes from it. I've also written several journal articles about student replication in this model (Frank & Saxe, 2012; Hawkins*, Smith*, et al., 2018). In brief, I think this is a really fun way for students to learn about experimental design and data analysis, open science methods, and the importance of replication in psychology. Further, the projects in my course are generally pretty high quality: they are pre-registered confirmatory tests with decent statistical power, and both the paradigm and the data analysis go through multiple rounds of review by the TAs and me (and sometimes also get feedback from the original authors).

Every year I rate each student project on its replication outcomes. The scale is from 0 to 1, with intermediate values indicating unclear results or partial patterns of replication (e.g., a significant key test but a different qualitative interpretation). The outcomes from the student projects this year were very disappointing. With 16/19 student projects finished, we have an average replication rate of .31: only 4 clear successes, 2 intermediate results, and 10 failures (4 + 2 × 0.5 = 5 points across 16 projects, or .31). Samples are small every year, but this rate was even lower than the rates we saw in previous samples (2014–15: .57, N=38; 2016: .55, N=11).

What happened? Many of the original experiments followed part or all of the schema described above, with a state induction followed by a question about a vignette. In other words, they were poorly designed.

Friday, September 7, 2018

Scale construction, continued

For psychometrics fans: I helped out with a post by Brent Roberts, "Yes or No 2.0: Are Likert scales always preferable to dichotomous rating scales?" This post is a continuation of our earlier conversation on scale construction and continues to examine the question of whether – and if so, when – it's appropriate to use a Likert scale vs. a dichotomous scale. Spoiler: in some circumstances it's totally safe, while in others it's a disaster!

Thursday, August 30, 2018

Three (different) questions about development

(tl;dr: Some questions I'm thinking about, inspired by the idea of studying the broad structure of child development through larger-scale datasets.)

My daughter, M, started kindergarten this month. I began this blog when I was on paternity leave after she was born; watching her grow over the past five years has been an adventure and a revolution in my understanding of development.* Perhaps the most astonishing feature of the experience is how continuous, incremental changes lead to what seem like qualitative revolutions. There is of course no moment at which she became the sort of person she is now – the kind of person who can tell a story about an adventure in which two imaginary characters encounter one another for the first time** – but some set of processes led us to this point. How do you uncover the psychological factors that contribute to this kind of growth and change?

My lab does two kinds of research; in both, my hope is to contribute to this kind of understanding by studying the development of cognition and language in early childhood. The first kind of work is to conduct series of experiments with adults and children, usually aimed at answering questions about representation and mechanism in early language learning in social contexts. The second kind of work is a larger-scale type of resource-building, in which we create datasets and accompanying tools like Wordbank, MetaLab, and childes-db. The goal of this work is to make larger datasets accessible for analysis – as testbeds for reproducibility and theory-building.

Each of these activities connects to the project of understanding development at the scale of an entire person's growth and change. In the case of small-scale language learning experiments, the inference strategy is pretty standard. We hypothesize the operation of some mechanism or the utility of some information source in a particular learning problem (say, the utility of pragmatic inference in word learning). Then we carry out a series of experiments that shows a proof of concept that children can use the hypothesized mechanism to learn something in a lab situation, along with control studies that rule out other possibilities. When done well, these studies can give you pretty good traction on individual learning mechanisms. But they can't tell you that these mechanisms are used by children consistently (or even at all) in their actual language learning.

In contrast, when we work with large-scale datasets, we get a whole-child picture that isn't available in the small studies. In our Wordbank work, for example, we get a global picture of the child's vocabulary and linguistic abilities, for many children across many languages. The trouble is, it's very hard or even impossible to find answers to smaller-scale questions (say, about information seeking from social partners) in datasets that represent global snapshots of children's experience or outcomes. Both methods – the large-scale and the small-scale – are great; the problem is that the questions they answer don't necessarily line up. Instead, larger datasets tend to direct you towards different questions. Here are three.

Friday, August 10, 2018

Where does logical language come from? The social bootstrapping hypothesis

(Musings on the origins of logical language, inspired by work done in my lab by Ann Nordmeyer, Masoud Jasbi, and others).

For the last couple of years I've been part of a group of researchers who are interested in where logic comes from. While formal Boolean logic is a human discovery*, all human languages appear to have methods for making logical statements. We can negate a statement ("No, I didn't eat your dessert while you were away"), quantify ("I ate all of the cookies"), and express conditionals ("if you finish early, you can join me outside").** While Boolean logic doesn't offer a good description of these connectives, natural language still has some logical properties. How does this come about? Because I study word learning, I like to think about logic and logical language as a word learning problem. What is the initial meaning that "no" gets mapped to? What about "and," "or," or "if"?

Perhaps logical connectives are learned just like other words. When we're talking about object words like "ball" or "dog," a common hypothesis is that children have object categories as the possible meanings of nouns. These object categories are given to the child by perception*** in some form or other. Then, kids hear their parents refer to individual objects ("look! a dog! [POINTS TO DOG]"). The point allows the determination of reference; the referent is identified as an instance of a category, and – modulo some generalization and statistical inference – the word is learned, more or less.****

So how does this process work for logical language? There are plenty of linguistic complexities for the learner to deal with: most logical words simply don't make sense on their own. You can't just turn to your friend and say "or" (at least not without a lot of extra context). So any inference that a child makes about the meaning of the word will have to involve disentangling the word's contribution from the meaning of the sentence as a whole. But beyond that, what are the potential targets for the meaning of these words? There's nothing you can point to out in the world that is an "if," an "and," or even a "no."

Monday, June 18, 2018

What does it mean to get a degree in psychology these days?

(I was asked to give a speech yesterday at Stanford's Psychology commencement ceremony. Here is the text). 

1. Chair, colleagues, graduates of the class of 2018 – undergraduates and graduate students – family members, and friends. It’s a pleasure to be here today with all of you. Along with honoring our graduates, we especially honor all the wonderful speakers today for their accomplishments – MH for his excellence in research and teaching, Angela for her deep engagement with the department community. You could be forgiven for thinking that there was some special achievement that brought me here as well. In fact, by tradition, faculty take turns addressing the graduating class, and it is my turn this year. It’s a real pleasure to have one last chance to address you.

Two weeks ago, my daughter Madeline graduated from preschool. There was cake; photos were taken. They broke a piñata. It was a big deal! Several of her friends will be going to different schools, some moving away to other states or even other countries. This is one of the biggest changes she’s ever experienced. I’m already worried about what happens next. Parents, I can only imagine what you are going through today – but at least you know that your kids made it through the first day of kindergarten.

Graduates – your graduation from Stanford today is a really big deal. You also get to have cake and photos. If you’re very lucky, some special person has even bought you a piñata. But more importantly, just like for Madeline, this is a time of transitions. You may be moving somewhere new. Even if you are staying here, friends will be further away than the next dorm or the next office. So do not hesitate to take a little extra time today to celebrate with the people you love and who love you.

Congratulations.

2. I want to take a little time now to think about what it means to get a degree in psychology from Stanford.

When you sit next to someone on an airplane and tell them you are studying psychology, perhaps they ask you if you are reading their mind. Perhaps they wonder if you are studying Freudian analysis and have thoughts about their unconscious, or their relationship with their mother. Or maybe they are more up to date and wonder if you study psychological disorders as they manifest themselves in the clinic. But the truth is, knowing what you’ve done in your degrees here at Stanford, you probably haven’t done too much Freud. Or too much mind-reading. And although you may be interested in clinical work (and this is laudable), that’s not the core of what we teach here.

Gaining a degree in psychology also means that you have gone to many classes in psychology and learned about many studies – from social influence to stereotype threat, from mental rotation to marshmallow tests. Although this body of knowledge is a lovely thing to have come into contact with (and I hope that you continue to deepen your knowledge), knowing this content is also not the core of what it means to receive your degree.

What you have learned instead are tools – a specific kind of tool, namely tools for thought. These tools can be used to approach problems and construct solutions. This is what it means for psychology to be an academic discipline: a discipline denotes a particular mental toolbox. The university is the intellectual equivalent of a construction firm – different departments have the tools to solve different sorts of problems.

3. Like nearly all ideas, “cognitive tools” seem obvious – after you are used to them. Let’s take one example, a foundational cognitive tool that we use every single day: numbers. Because we are so numerate, a lot of people have the idea that numbers are easy and straightforward. But they aren’t.

Take the preschoolers in Madeline’s old classroom. Nearly all of them can count, at least to ten and maybe higher. But if you probe a bit more deeply, it all falls apart. If, at snack time, you ask someone to give you exactly four cheerios, she’s liable to hand you seven, or a whole handful. Even when a child knows that “one” means exactly 1, it takes quite a few months for them to figure out that “two” means exactly 2, and more months for 3. When they finally figure out how the whole system works, it enables so many new things! Madeline owes all of her dessert-negotiation prowess to her abilities with numbers. Seven gummi bears? No. How about six? This idea of exact comparison is a skill – even though it makes for tiresome after-dinner conversation.

Numbers are an invented, culturally transmitted tool. In graduate school I worked with an Amazonian indigenous group, the Pirahã, who have no words for numbers. They are bright, sophisticated people who love a good practical joke. Many Pirahã can shoot a fish with an arrow while standing in a canoe. Yet because their language does not have these particular words in it – words like “seven” – and because they do not go through the laborious period of practice that Madeline and other kids learning languages like English do, they can’t remember that it’s exactly seven gummi bears. To them, six or eight seems like the same amount. They simply don’t have the tool.

4. So what are the tools of the psychologist?

There’s one tool that qualifies as the hammer of psychology – the single tool you can use to frame an entire house. That’s the experiment. The fundamental insight of all of modern psychology is that the puzzles of the human mind can be understood as objects of scientific study if we can design appropriately controlled experiments. As complicated and unpredictable as people are (especially when they are integrated into complex cultural systems), we can still learn about their inner workings via experiments.

This insight has spread far outside of psychology and far outside of the academy. Nowadays, Facebook runs a hundred experiments a day on you. Governments and political campaigns, startups and not-for-profits are all constantly experimenting to try to understand how to achieve their goals. There is a good chance that in the next few years of your professional life you will face a complicated human problem with an unknown solution. The psychologist’s approach will serve you well: formulate a hypothesis about how you should manipulate the world; then assess whether the manipulation has changed your measurement of interest. This strategy is shockingly effective.

But the serious carpenter has other, more specialized tools in the toolkit – the plane, awl, rasp, drawknife, jigsaw, bevel. Let me mention two more.

The first is the idea that our knowledge is not just a set of facts, but is organized into theories that help us understand the world. We call these theories intuitive theories – they are the explanatory frameworks that people carry with them to understand why things happen. What follows from this idea is that when you want to change people’s behavior, you can’t just tell them to change or tell them different facts. You need to change their theory. When I want Madeline to eat her vegetables, it turns out just telling her to “eat broccoli” doesn’t work very well – even if she does eat the broccoli, she won’t know what else to eat or why to eat it. And of course the well-known idea about fostering a growth mindset is precisely this kind of implicit theory: it’s a theory of whether ability is fixed or whether it can be improved with hard work.

The second idea I want to share is that our judgment is systematically biased. It’s biased by our own beliefs. Our minds are wonderful, efficient systems for dealing with uncertainty – we piece together a sentence even in a noisy restaurant using our expectations about what that person might be trying to say to us. In most cases, this is an amazing feature of our cognition, letting us operate flexibly using limited data. But this reliance on our own beliefs also has negative consequences: it leads us to stereotype and to engage in confirmation bias, looking for evidence that further supports our own beliefs. Understanding these sources of bias can help us avoid falling into this trap. A good grounding in psychology, in other words, helps us be more aware of our own limitations.

I’d love to tell you about more ideas – every woodworker loves to show off their workbench. And the wonderful thing about tools is that when you use them together you can create new tools, in the same way a carpenter can first build a jig that makes a difficult cut easier. I could go on, but hopefully I’ve piqued your curiosity – and you have lots more to do today.

5. So. Make sure that you celebrate! Eat some cake, smash a piñata, and most of all, say your "thank you"s to the people who have supported you during your time here at Stanford. I speak for all of them when I say that we are very proud of you and cannot wait to see what you accomplish.

As this weekend passes and you head off for other things, it is all but certain that you will find yourself in new situations, facing challenges that you have not considered before. (Life would not be fun without them!) But I am confident that your tools will be equal to the job. Keep them sharp and they will serve you well.



Saturday, May 5, 2018

nosub: a command line tool for pushing web experiments to Amazon Mechanical Turk

(This post is co-written with Long Ouyang, a former graduate student in our department, who is the developer of nosub, and Manuel Bohn, a postdoc in my lab who has created a minimal working example). 

Although my lab focuses primarily on child development, our typical workflow is to refine experimental paradigms via working with adults. Because we treat adults as a convenience population, Amazon Mechanical Turk (AMT) is a critical part of this workflow. AMT allows us to pay an hourly wage to participants all over the US who complete short experimental tasks. (Some background from an old post).

Our typical workflow for AMT tasks is to create custom websites that guide participants through a series of linguistic stimuli of one sort or another. For simple questionnaires we often use Qualtrics, a commercial survey product, but most tasks that require more customization are easy to set up as free-standing JavaScript/HTML sites. These sites then need to be pushed to AMT as "external HITs" (Human Intelligence Tasks) so that workers can find them, participate, and be compensated.

nosub is a simple tool for accomplishing this process, building on earlier tools used by my lab.* The idea is straightforward: you customize your HIT settings in a configuration file and type

nosub upload

to upload your experiment to AMT. Then you can type

nosub download

to fetch results. Two nice features of nosub from a psychologist's perspective are: 1. worker IDs are anonymized by default, so you don't need to worry about privacy issues (but they are deterministically hashed, so you can still flag repeat workers); and 2. nosub can post HITs in batches, so that you don't get charged Amazon's surcharge for HITs with more than 9 assignments.

All you need to get started is to install Node.js; installation instructions for nosub are available in the project repository.

Once you've run nosub, you can download your data in JSON format, which can easily be parsed into R. We've put together a minimal working example of an experiment that can be run using nosub and a data analysis script in R that reads in the data.  
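(If you want a starting point for the parsing step, something like this works – the file name here is just a placeholder for whatever nosub download produced:)

# Minimal sketch: read nosub's JSON output into R for analysis.
library(jsonlite)
raw <- fromJSON("results.json", simplifyDataFrame = TRUE)
str(raw)   # inspect the structure before flattening into a data frame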

---
* psiTurk is another framework that provides a way of serving and tracking HITs. psiTurk is great and we have used it for heavier-weight applications where we need to track participants, but it can be tricky to debug and is not always compatible with some of our lighter-weight web experiments.