Tuesday, March 24, 2015

Estimating p(replication) in a practical setting

tl;dr - an estimate of the proportion of recent psychology findings that can be reproduced by an early-stage graduate student and some thoughts about the consequences of that estimate
----

I just finished reading the final projects for Psych 254, my graduate lab course. The centerpiece of the course is that each student chooses a published experiment and conducts a replication online. My view is replication projects are a great way to learn the nuts and bolts of doing research. Rebecca Saxe (who developed this model) and I wrote a paper about this idea a couple of years ago.

The goal of the course is to teach skills related to experimental data collection and analysis. Nevertheless, as a result of this exercise, we get real live data on (mostly) well-conducted projects by smart, hard-working, and motivated students. Some of these projects have been contributed to the Open Science Framework's Reproducibility Project. One has even been published, albeit with lots of additional work, as a stand-alone contribution (Philips, Ong, et al., in press). An additional benefit of this framework – and perhaps a point of contrast with other replication efforts – is that students choose projects that they want to build on in their own work.

I've been keeping a running tally of replications and non-replications from the course (see e.g., this previous blog post). I revisited these results and simplified my coding scheme a bit in order to consolidate them with this year's data. This post is my report of those findings. I'll first describe the methods for this informal study, then the results and some broader musings.

Before I get into the details, let me state up front that I do not think that these findings constitute an estimate of the reproducibility of psychological science (that's the goal of the Reproducibility Project, which has a random sample and perhaps a bit more external scrutiny of replication attempts). But I do think they provide an estimate of how likely it is that a sophisticated and motivated graduate student can reproduce a project within the scope of a course (and using online techniques). And that estimate is a very useful number as well – it tells us how much of our literature is possible for trainees to reproduce and to build on.

Methods

The initial sample was 40 total studies (N=19 and N=21, for 2013 and 2015, respectively). All target articles were published in the last 15 years, typically in major journals, e.g., Cognition, JPSP, Psych Sci, PNAS, Science. Pretty much all the replicators were psychology graduate students at Stanford, though there were two master's students and two undergrads; this variable was not a moderator of success. All of the studies were run on Amazon Mechanical Turk using either Qualtrics or custom JavaScript. All confirmatory analyses were pre-reviewed by me and the TAs via the template used by the Reproducibility Project.

Studies varied in their power, but all were powered to at least the sample of the original paper, and most were powered either to around 80% power according to post-hoc power analysis, or to 2.5x the original sample. See below for more discussion on this. The TAs and I pilot tested people's paradigms, and we would have excluded any study whose failure we knew to be due to experimenter error (e.g. bad programming), but we didn't catch any mistakes after piloting, even though some likely exist. We excluded two studies for obvious failures to achieve the planned power (e.g., one study that attempted to compare Asian Americans and Caucasian Americans was only able to recruit 6 Asian Americans); the final sample after these exclusions was 38 studies.

In consultation with the TAs, I coded replication success on a 0 – 1 scale. Zero was interpreted as no evidence of the key effect from the original paper; one was a solid, significant finding in the same direction and of approximately the same magnitude (see below). In between were many variously interpretable patterns of data, including .75 (roughly, reproducing most of the key statistical tests) and .25 (some hints of the same pattern, albeit much attenuated or substantially different). 

Because of numbers, I ended up splitting the studies into Cognitive (N=20) and Social (N=18) subfields. I initially tried to use a finer categorization but the numbers for most sub-fields were simply too small. So in practice, the Social subfield included some cross-cultural work, some affective work, lots of priming (more on this later), and a grab-bag of other effects in fields like social perception. The Cognitive work included perception, cognition, language, and some decision-making. When papers were about social cognition (as several were), I judged on sociological factors. If a target article was in JPSP it went in the Social pile; if it was by people who studied cognitive development, it went in the Cognition pile. 

Results

The mean replication code was .57 across all studies. Here is the general histogram of replication success codes across all data:


Overall, the modal outcome was a full replication, but the next most common was a complete failure. For more information, you can see the breakdown across subfields and years:



There was a slight increase in replications from 2013 to 2015 that was not significant in an unpaired t-test (t(36) = -1.08, p = .29). In contrast, the difference between Social and Cognitive findings was significant (t(36) = -2.32, p = .03).

What happened with the social psych findings? The sample is too sparse to tell, really, but I can speculate. One major trend is that we tried 6 findings that I would classify as "social priming" – many involving a word-unscrambling task. The mean reproducibility rating for these studies was .21, with none receiving a code higher than .5. I don't know what to make of that, but minimally I wouldn't recommend doing social priming on mechanical turk. In addition, given my general belief in the relative similarity of turk and other populations (motivation 1, motivation 2), I personally would be hesitant to take on one of these paradigms.

One other trend stood out to me. As someone trained in psycholinguistics, I am used to creating a set of experimental "items" – e.g., different sentences that all contain the phenomenon you're interested in. Clark (1973, the "Language-As-Fixed-Effect Fallacy") makes a strong argument that this kind of design – along with the appropriate statistical testing – is critical for ensuring the validity of our experiments. The same issue has been noted for social psychology (Wells & Windschitl, 1999).

Despite these arguments, seven of our studies had exactly one stimulus item per condition. Usually, a study of this type tests the idea that being exposed to some sort of idea or evidence leads to a difference in attitude, and the manipulation is that participants read a passage (different in each condition) that invokes this idea. Needless to say, there is a problem here with validity; but in a post-hoc analysis, we also found that such studies were less likely to be reproducible. Only one in this "single stimulus" group was coded above .25 for replication (t(9.74) = -2.60, p = .03, not assuming equal variances). This result is speculative due to small numbers and the fact that it's post-hoc; but it's still striking. Maybe we're seeing the lack of reliability coming as a result of different populations responding differently to those individual stimulus items.
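For readers wondering where the fractional degrees of freedom in a test like this come from: when you don't assume equal variances, the degrees of freedom are estimated from the data (the Welch–Satterthwaite correction). Here is a minimal sketch of that calculation in Python. The replication codes in it are made-up placeholder values, not the course data, and the function name is just for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    # Squared standard error contributed by each group
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # The df estimate is usually fractional when variances or ns differ
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# HYPOTHETICAL replication codes on the 0-1 scale (illustration only)
single_item = [0.0, 0.25, 0.0, 0.5]
multi_item = [1.0, 0.75, 1.0, 0.5, 1.0, 0.25]

t_stat, df = welch_t(single_item, multi_item)
print(f"t({df:.2f}) = {t_stat:.2f}")
```

The resulting degrees of freedom always fall between the smaller group's n - 1 and the pooled n1 + n2 - 2, which is why a comparison with a small group ends up with so few.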

The other findings that we failed to reproduce were a real grab-bag. They include both complicated reaction-time paradigms and very simple surveys. Sometimes we had ideas for issues with the paradigms (perhaps, things that the original authors had solved by clear instructions or un-reported aspects of the paradigm). Sometimes we were completely mystified. It would be very interesting to find out which of these we would be eventually able to understand; but that's a tremendous amount of work – we did it in exactly one case and it took almost a dozen studies.

Broader Musings

This course makes all of us think hard about some knotty issues in replication. First, how do you decide how many participants to include in a replication in order to ensure sufficient statistical power? One solution is to assume that the target paper has a good estimate of the target effect size, then use that effect size to do power analysis. The problem (as many folks have pointed out) is that post-hoc power is problematic. In addition, with complex ANOVAs or linear models, we almost never have the data to perform power analyses correctly.

Nevertheless, we don't have much else in the way of tools, with one exception. Based on some clever statistical analysis, Uri Simonsohn's "small telescopes" piece advocates simply running 2.5x the sample. This is a nice idea and generally seems conservative. But when the initial study already appears overpowered, this approach is almost certainly overkill – and it's very rough on the limited course budget. In practice, this year we did a mix of post-hoc power, 2.5x, and budget-limited samples. There were few cases, however, where we worried about the power of the replications: The paradigms that didn't replicate tended to be the short paradigms that we were able to power most effectively given our budget. Even so, deciding on sample sizes is typically one of the trickiest parts of the project planning process.
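To make the tradeoff between the two rules concrete, here is a rough sketch comparing them. It assumes a simple two-group design and uses a normal approximation rather than an exact t-based power calculation; the original study's effect size and sample size are hypothetical numbers chosen to look "overpowered," and the function names are mine.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample comparison.

    Normal approximation; an exact t-based calculation gives slightly
    larger values.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

def n_small_telescopes(original_n, multiplier=2.5):
    """The 2.5x rule: scale up the original per-group sample."""
    return ceil(multiplier * original_n)

# HYPOTHETICAL original study: d = 0.8 with 100 participants per group
original_d, original_n = 0.8, 100
power_based = n_per_group_for_power(original_d)
telescopes = n_small_telescopes(original_n)
print(power_based, telescopes)  # 25 vs. 250 per group
```

In this made-up case the 2.5x rule asks for ten times as many participants as the power analysis does, which is the kind of gap that hurts on a course budget. Of course, the power-based number is only as good as the original effect size estimate.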

A second tricky issue is deciding when a particular effect replicated. Following the Reproducibility Project format, students in the course replicate key statistical tests, and so if all of these are reliable and of roughly the same magnitude, it's easy to say that the replication was positive. But there are nevertheless many edge cases where the pattern of magnitudes is theoretically different, or similar in size but nevertheless nonsignificant. The logic of "small telescopes" is very compelling for simple t-tests, but it's often quite hard to extend to the complex, theoretically-motivated interactions that we sometimes see in sophisticated studies – for example, we sometimes don't even know the effect size! As a result, I can't guarantee that the replication codes we used above are what the original author would assign – perhaps they'd say "oh yes, that pattern of findings is consistent with the spirit of the original piece" even if the results were quite different. But this kind of uncertainty is always going to be an issue – there's really no way to judge a replication except with respect to the theory the original finding was meant to support.

Conclusions

I love teaching this course. It's wonderful to see students put so much energy into so many exciting, new projects. Students don't choose replications to do "take downs" or to "bully or intimidate."* They choose projects they want to learn about and build on in their own research. As a consequence, it's very sad when they fail to reproduce these findings! It's sad because it creates major uncertainty around work that they like and admire. And it's also sad because this is essentially lost effort for them; they can't build on the work they did in creating a framework for running their study and analyzing the data. In contrast, their classmates – whose paradigms "worked" – are able to profit directly from the experience by using their new data and experimental tools to create cool new extensions.

When I read debates about whether replications are valuable, whether they should be published, whether they are fair or unfair, ad nauseam, I'm frustrated by the lack of consideration for this most fundamental use-case for replication. Replication is most typically about the adoption of a paradigm for future research. If our scientific work can't be built on by the people doing the science – trainees – then we've really lost track of our basic purpose.

----
Major thanks to Long Ouyang and Desmond Ong, the TAs for the course!

* Added scare quotes here 3/25 to indicate that I don't think anyone really does experimental psychology to bully or intimidate, even if it sometimes feels that way! 

Monday, March 23, 2015

Team up or slow down!

(This post is a draft of a talk I gave at SRCD last week, in a round-table discussion organized by Melanie Soderstrom on the topic of standardizing infancy methods.)

We were asked to consider what the big issues are in standardizing infancy methods. I think there's one big issue: statistical power. Infancy experiments are dramatically underpowered. So any given experiment we run tells us relatively little unless it has a lot of babies in it. This means that if we want good data, we need either to team up or else to slow down.

1. Statistical power.

The power of a test is the probability that the test will reject the null (at p < .05), given the true effect size and the sample size. A general standard is that you want 80% power. So the smaller the effect, the larger the sample you need to have to detect that effect. We're talking about standardized effect sizes here (Cohen's d), so if d = 1, then the two groups are a standard deviation apart. That's a really big effect.

A couple of facts to get you calibrated. The traditional sample size in infancy research is 16. Let's assume a within-subjects t-test. Then 16 infants gets you 80% power to detect an effect of around d = .75. That's a big effect, by most standards. But the important question is, how big is your average effect in infancy research?
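If you want to play with these numbers yourself, here is a rough sketch of the calculation. It uses a normal approximation to the within-subjects t-test, so it overstates power a little for samples this small; an exact calculation would use the noncentral t distribution. The function name is just for illustration.

```python
from math import sqrt
from statistics import NormalDist

def paired_power(d, n, alpha=0.05):
    """Approximate power of a two-sided within-subjects t-test.

    Normal approximation: the test statistic is treated as
    Normal(d * sqrt(n), 1) instead of a noncentral t.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)
    # Probability of landing in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for d in (1.0, 0.75, 0.5, 0.25):
    print(f"d = {d:.2f}, n = 16: power ~ {paired_power(d, 16):.2f}")
```

With 16 infants, this approximation puts power in the neighborhood of 80% at d = .75 and has it fall off quickly as the effect shrinks.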

2. Facts on power in infancy research.

Luckily, Sho Tsuji (another presenter in the roundtable), Christina Bergmann, and Alex Cristia have been getting together these lovely databases of infant speech perception work. And my student Molly Lewis is presenting a meta-analysis of the "mutual exclusivity phenomenon" in the posters on Saturday morning (edit: link). There's some good news and some bad news. 
  • Mutual exclusivity (ME) is really robust, at least with older kids. The effect size d is around 1, meaning you really can see this effect reliably with 16 kids (as you might have suspected). But if you go younger, you need substantially larger samples, and groups of 36 or more are warranted.
  • Phoneme recognition is also pretty good. Traditional methods like high-amplitude sucking and conditioned head turn yield effect sizes over 1. On the other hand, head-turn preference is only around d = .5. So again, you need more like 36 infants per group to have decent power. 
  • Word segmentation, on the other hand, not so good. The median effect size is just above .25. So your power in those studies is actually pretty close to zero. 
Again thanks to Sho, Christina, and Alex for putting this stuff on the internet so I could play with it. I can't stress enough how important that is.
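To turn effect sizes like these into rough sample sizes, you can invert the power calculation. The sketch below again assumes a within-subjects t-test and a normal approximation, so the numbers come out a little lower than an exact t-based calculation would give; treat them as ballpark figures. The function name is my own.

```python
from math import ceil
from statistics import NormalDist

def paired_n_for_power(d, power=0.80, alpha=0.05):
    """Approximate n for a two-sided within-subjects t-test.

    Normal approximation; exact t-based values are a few infants higher.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / d) ** 2)

for d in (1.0, 0.5, 0.25):
    print(f"d = {d:.2f}: n ~ {paired_n_for_power(d)}")
```

Halving the effect size roughly quadruples the sample you need, which is why an effect around .25 puts you well over a hundred infants.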

3. What does this mean?

First, if you do underpowered experiments, you get junk. The best case is that you recognize that, but more often than not, what we do is over-interpret some kind of spurious finding that we got with one age group or stimulus but not another, or with boys but not girls. Theory can provide some guide – so if you didn't expect a result, be very skeptical unless you have high power!

Second, all of this is about power to reject the null. So that means the situation is much worse when you want to see an age by stimulus interaction, or a difference between groups in the magnitude of an effect (say, whether bilinguals show less mutual exclusivity than monolinguals). The power on these interaction tests will be very low, and you are likely to need many dozens or hundreds of children to be able to test this kind of hypothesis accurately. Let's say for the sake of argument that we're looking at a difference of d = .5 – that is, ME is .5 SDs stronger for monolinguals than bilinguals. That's a whopping difference. We need 200 kids to have 80% power on that interaction. For effects smaller than d = .4, don't even bother testing interactions because you won't have the power to detect them. That's just the harsh statistical calculus.

4. So where do we go from here?

There are a bunch of options, and no one option is the only way forward – in fact, these are very complementary options:

You can slow down and test more kids. This is a great option - we've started working with children's museums to try and collect much larger samples, and we've found these samples give us much more interesting and clear patterns of data. We can really see developmental change quantitatively with 200+ kids.

You can team up. Multiple labs working together can test the same number of kids but put the results together and get the power they need - with the added bonus of greater generalizability. You'll get papers with more authors, but that's the norm in many other sciences at this point. And if you team up, you will need to standardize – that means planning the stimuli out and sharing them across labs.

You can meta-analyze. That means people need to share their data (or at least publish enough about it so that we can do all the analyses we need). I don’t know about Sho et al., but our experience has been that reporting standards are still very bad for developmental papers. We need information about effect size and variability.

5. Conclusions

It isn't just infancy research that is having trouble with this issue. Social psychologists are struggling with the lack of reproducibility in their area; neuroimaging is in a similar place, trying to figure out what to do because more than 12 fMRI scans is too expensive but 12 subjects doesn't give you any power. We are together figuring out that this is a tough position to be in. But pursuing open methods, collaboration, and high-powered studies will certainly help.

Wednesday, February 18, 2015

Stories from the mind of a toddler

Right around the holidays, M started doing something new and charming. Although she had been speaking individual words all throughout the fall, she began to use single word utterances and gestures together to tell stories about events that had happened to her. Even young infants have long-lasting memories that seem tied to individual events, but there's something different about being able to share your memories using language. For us, it meant the first opportunity to have multi-turn conversations with M. She can now share something she's thinking about and we can reminisce about it together.

The first narrative I saw was very simple. Perhaps it was less clearly evocative of a particular event than a class of events, but still different from what came before. M would say "meow" (or her equivalent, which has a lengthened second syllable only, so /aw/) and then make the gesture/sign for come here, palm upright with fingers pulling towards her. This was how she recalled our nightly walks, which tended to go to a street near our house where a friendly cat could occasionally be seen. We'd say "Are you thinking about the kitty cats? ["yah!" replies M] Did you see a kitty last night? [M - /aw/] What did you say to it? [M makes "come here" gesture] That's right - come here, kitty kitty."

So you can see how this little narrative becomes a linguistic routine, something that reinforces the memory it refers to. But this example is also a little weak; our scaffolding of the memory was probably critical to making it more than just generalized, undirected longing for cats (something that M has quite a bit of).

The second example – the one that really convinced me – came a few weeks later when we took M to a local farm. She had been reading animal books with us for many months, so we thought she'd enjoy the experience. We were right. She was totally transfixed by watching a cow be milked and fed. After we got home, she wanted to talk about the cow a lot. Every fifteen minutes or so, out of nowhere she would moo, but we were confused by what she was trying to express. After mooing, she'd say "uh oh" and point to her tongue.

It took us several repetitions to figure out what M was telling us. Here's the story (in all its toddler glory): when the cow was eating, a lot of its feed fell out of its mouth (hence "uh oh"). But then once we figured out the story, the gesture of food coming out of the mouth got conventionalized into something like the reverse of an "eat, eat" gesture – fingers pressed together, pulling from the mouth. And then we had to do the whole routine over and over again. M: "moo" - parent: "are you thinking about the cow?" M - "uh oh!" and then [falling food gesture]. Ad nauseam. Two months later, she will still tell this story.

When M was very tiny and I had some free nap times, I read a fair amount about baby sign – we never made a decision not to teach signs to M, but life seemed too short. But the most interesting thing I learned was that there was a whole literature on children's spontaneous signs (one paper by the baby sign folks; another by an independent group). The baby sign folks in particular had documented idiosyncratic signs that got built up in structured conversations between parents and kids.

This is exactly the kind of thing M has been doing – she now has a repertoire of such signs that she uses in her narratives, including the "come here," the "falling food" gesture, as well as signs for rain, traffic, and getting splashed in the face by water. Several of them seem quite bound to particular stories, but she has also generalized. For example, her "traffic" sign is pointing and then rapidly swinging her arm back and forth. She spontaneously used this to tell a story about a horse that she had seen, who had both done some messy eating and also run around: "neigh" - [food falling sign] - [running sign]. So these signs do seem at least somewhat flexible and word-like.

Children famously suffer from "childhood amnesia" – the relative inability to recall specific episodic memories from childhood. There are many explanations for this puzzling phenomenon, but one I'm especially fond of (without too much support) is that language plays a role. Episodic memories are not always stored in language, but language provides a medium for encoding and rehearsal. Of course, if language is changing dramatically from the time of encoding to retrieval, that could cause problems for retrieval – hence possible "amnesia."

So I think M's narratives might be a first attempt to express and re-encode episodic memories. The degree to which she retains them will be variable. The relative distinctiveness of the memories will play a role, but the way she encodes them may also matter. The use of signs – which will probably vanish from her vocabulary once she can produce the appropriate words – may help in telling the stories now. But maybe – as the signs fade – she'll also forget the stories more easily?

Monday, February 9, 2015

Could conference submission be preregistration?

If we care about the answer to a particular question, preregistration – registering hypotheses and analyses ahead of time so that they are not data-dependent – is an important strategy for improving the strength of the evidence from studies bearing on that question. Of course, preregistration has some pros and cons. In my mind, the most notable of these is that prereg is more appropriate for large, expensive, confirmatory studies than small, cheap, exploratory studies that are easily replicated (see my post about this topic).

In brief: My worry about pre-registered journal papers is that they can be very expensive in terms of research effort. If no one really cares about a hypothesis, then it's not a big deal not to publish on it. But if you preregister your crazy, speculative claim, then you may be stuck writing a paper telling everyone something they already expected: that your crazy idea, which would have been cool if true, is actually false. And writing papers is hard work: it takes a long time, and has severe opportunity costs. You could be doing new research during the time you are writing a careful, clear, and comprehensive paper on a thing that no one cares about because it wasn't likely to be true and indeed isn't.*

Nevertheless, there's no denying that it's good to be able to see an unbiased sample of experimental hypotheses. So here's a thought.

Something I always tell students NOT to do is to submit to conferences before they are done collecting data. This practice means that you have to impose your own biases on your preliminary data, and it can put you in an awkward position if you write a strongly hypothesis-driven abstract about data that don't end up supporting your spin on them.

But what if we exploited this issue? We could create a track at conferences where you would submit an abstract on what you were going to do but hadn't yet done – essentially a prereg track. Then we'd have a particular poster session for seeing the results. All we'd need to do is to make sure that the conference abstracts themselves were indexed appropriately, and perhaps require an updated, post data-collection addendum. The upsides would be A) a chance for folks (especially undergrad and early grad students) to get an opportunity to present their work, and B) a low-cost preregistration mechanism.

The Cognitive Science Society already has a track for lightly-reviewed "member abstracts" – essentially posters on work that isn't done enough to merit a six-page paper. Why not "pre data-collection abstracts" too?

---
* Let me emphasize here that I'm not talking about hypotheses where a null result is important and informative, e.g. as in intervention work, or tests of theoretically-central claims. I'm talking about the kind of exploratory work – trying to play around with novel theoretical ideas – that characterizes a lot of research in cognitive science.

Tuesday, January 6, 2015

On "training" your children

tl;dr: Analysis of rhetoric from a parenting column. Why, brain, why?

A recent piece in the Washington Post's parenting advice column got me a bit bothered. The question was about how to get a two-year-old to sleep in her own bed if she doesn't want to, and the advice given, essentially, was don't. The columnist, Meghan Leahy (a parenting coach), advocates letting the child get in bed with the parents, which is after all what the child wants in the first place.

Although we sleep trained with M, I don't think that the column's advice is necessarily wrong; co-sleeping is a reasonable solution, if it works for the child and the parents. Co-sleeping is the norm in many cultures, and the research suggesting that co-sleeping is hazardous is typically focused on kids who are much younger and at risk for SIDS for other reasons. Of course, probably the reason the parents are asking is because they don't like being elbowed in the face by a toddler. Regardless, what bothered me in the column wasn't the advice, per se.

Instead, it was the rhetorical discussion of why the parents shouldn't sleep train. Here's the argument:
Children are born to attach to a caregiver. They are reliant on that caregiver for years and years — far longer than the young of almost any species on Earth. (Just ask your neighbors about that basement apartment occupied by their 20-somethings.) Without a responsible caregiver, they wouldn’t last a day, let alone a lifetime. Our children need us, and their brains are wired to make sure they stay close to us. 
So, when a 2-year-old has faced separation all day when she goes to day care and then experiences separation again at bedtime, her young brain goes into panic mode. And that young brain is built to take her to the parent, over and over and over.

And so when the parent places a gate at the door, her brain lights up with fear and panic, and it is experienced as a physical problem. Vomiting, breathing problems: This is a systemwide panic meltdown. It is too much for her to process “Why is Mom leaving me?!” and her body starts to compensate for what her brain cannot handle.
Note that this whole passage doesn't have any evidence in it at all, nor any talk about the behavioral history – what the child has experienced prior to the current situation – or even consequences of the child's actions. It also doesn't include any talk about the parents' quality of life. Instead, what replaces these is a set of explanations and expansions of the questioner's original statement that her child cried and threw up when she was left alone in her room at night.

Most of these explanations aren't necessarily wrong – in current form they're really too vague to be judged on the basis of scientific evidence – but they do a lot of negative rhetorical work by invoking -isms that keep people from thinking sensibly about parenting.

Nativism. Nativism in cognitive development is an interesting and important theoretical position, and it has its place. But all of this talk of being "wired" for X or "born to" Y here is just a way of stopping argument about whether a particular behavior is something we want to encourage. Many people would argue that we are born to discriminate against members of other groups, but we should still teach our children values of tolerance and openness. Maybe kids are "born to sleep next to their caregivers" but if that means the caregivers can't get a good night of rest, then perhaps teaching (even "training") can be a reasonable option to consider.

Dualism. When her "body starts to compensate for what her brain cannot handle" this situation sounds pretty dire. But what does that mean? Brain and body (actually, mind and body) are connected. Psychological stress can lead to health problems, etc. etc. So what do we get from separating mind and body here other than an implication that sleep-training related separation anxiety can lead to bad health outcomes, a claim that is not supported by any evidence?

Brain-o-centrism. Why is it the toddler's brain that goes into overload? And why is it her brain lighting up with fear? Again, this seems more like a rhetorical strategy than any kind of evidence. Empty brain statements like these serve as proxies for explanation without any explanatory content.

Anti-daycare-ism. Finally, why does day care have anything to do with this? This little thrown-off clause ("has faced separation all day when she goes to day care and then") really annoyed me – anti-daycare bias is deeply discriminatory against working parents (like M's mom and me). I know of no evidence that sleep training or sleep problems more generally interact with whether a child is in daycare. There is an interesting and complex body of research on the behavioral consequences of day care, but that's not what is being referenced here. So I find the offhand implication here that day care separation anxiety can cause sleep disturbance to be deeply problematic.

Why am I picking on this one column, whose advice I don't even necessarily dispute? All throughout parenthood I've been confronted with cases where people have very strong opinions about parenting that seem as though they are grounded in the research that I supposedly pursue for a living (trying to understand children's cognitive development). And yet when I look into them more deeply, they make no sense at all. This column, posted on facebook by an acquaintance, was the last straw.

That's it for now. I'm off to pick up M at daycare.

Friday, December 19, 2014

Why can't toddlers play with one another? An alternative account of parallel play

Whenever I go to daycare, or interact with other parents of toddlers, I hear about how M and other kids her age – 17 months now – are engaged in parallel play. The basic idea is that, even though young toddlers like to be near other kids their age, they don't play together: they engage in the same sorts of activities in close proximity, but without any sort of reciprocal interaction. I'll argue here that this label is at best a descriptive convenience – it doesn't reflect any inability to engage in reciprocal play – and masks an interesting developmental story.

The idea of parallel play dates back to Parten (1932), who noted the prevalence of this kind of behavior in young preschoolers. For fun, here's the key figure from her study:
The data are pretty clear – and the graph surprisingly modern! In fact, you can see this sort of thing happening in any daycare classroom, and even more so for 1- to 2-year-olds than the preschoolers in Parten's study. But the question is what to make of this descriptive observation (Parten herself doesn't give much of any interpretation, at least in that paper).

So we turn to the internet. Of course, whattoexpect.com has an interpretation of why parallel play occurs:
[Parallel play is] par for the developmental course for babies and toddlers. Why? Because a child this age is still busy figuring out so much about the world and doesn't yet realize that people his own size are indeed people (who might actually be fun to do stuff with). He's too young to make friends, but companionable side-by-side play is a good start.
You hear this echoed across many other sources of information for parents, including the teachers at M's daycare. These sorts of stage labels are endemic in developmental psych of the popular variety, and they often imply that there is a cognitive change that accompanies the behavioral stage shift. I think this developmental story is deeply wrong.

Over the last 15 to 20 years, a large body of evidence has accumulated that suggests that young children have very robust expectations for the social world by their second year. Babies can build social expectations for almost anything – even for eyeless blobs – so they definitely should have such expectations for other toddlers. Other work suggests that very small cues like reaching, looking, and movement towards a target can effectively cue inferences about an agent's goals and desires. So toddlers almost certainly understand that their peers have goals and desires, perhaps even desires that differ from the toddler's own. In addition, toddlers have no trouble engaging in reciprocal interactions with older children and adults (e.g., giving games, simple games of catch).

In fact, in a recent paper by Cortes Barragan and Dweck, having adults engage in parallel play – rather than reciprocal play – with toddlers made the toddlers less likely to help that adult achieve a goal later on. So that's a nice piece of evidence for two things. First, parallel play is far from being the only way that toddlers can interact. Second, they actually think it's negative in some way when an adult doesn't play with them reciprocally, so they are forming strong expectations both about and from the type of play they engage in with different partners.

Why do toddlers exhibit so little reciprocal play, then? I think what's going wrong is that the appropriate social cognitive abilities are very much present in kids of this age, but they are hard to exercise, and critically, social computations are slow. Reciprocal interaction with a peer requires fast online recognition of goals and action planning with respect to those goals. You need to know what your play partner wants you to do, and you need to figure that out before she loses interest and gets distracted. That's pretty easy for adults to do; they create structured play opportunities for toddlers all the time. (For example, last night I set up a tea party for M and helped her serve tea to a wide variety of different stuffed animals.)

But when you get two toddlers together, they strike out so often that it might be adaptive to avoid trying to engage! In a recent episode I watched, M saw that another little girl Y wanted a toy car. But by the time she figured out that Y wanted the car, Y had already moved on to other things. The result was that M walked up to Y at a totally inappropriate time and thrust a car in her face for seemingly no reason. Nice idea, but poor execution. Maybe if you are a toddler, you learn not to try out this kind of gambit until you're more confident you will succeed.

This explanation – that parallel play is an adaptive consequence of toddlers' poor speed of processing – is a product of something that I've been exploring a lot on this blog: that babies and toddlers are surprisingly knowledgeable about the world, but their ability to use this knowledge is sharply limited. The limitation here is that social computations are very slow, so that by the time the computations are done, their output is less likely to be relevant. In other words, "parallel play" as a description is correct, but the shift to a more reciprocal style of play may not have anything to do with a cognitive shift. Instead it may emerge from more gradual changes in children's speed of social processing.

Cortes Barragan R, & Dweck CS (2014). Rethinking natural altruism: Simple reciprocal interactions trigger children's benevolence. Proceedings of the National Academy of Sciences of the United States of America, 111 (48), 17071-4 PMID: 25404334

Monday, November 24, 2014

The piecemeal emergence of language

It's been a while since I last wrote about M. She's now 16 months, and it's remarkable to see the trajectory of her early language. On the one hand, she still produces relatively few distinct words that I can recognize; on the other, her vocabulary in comprehension is quite large and she clearly understands a number of different speech acts (declaratives, imperatives, questions) and their corresponding constructions.

Some observations on production:

  • She still doesn't say "mama." She does say "mamamamamama" to express need, a pattern that Clark (1973) noted is common. She definitely knows what "mama" means, and even does funny things like pointing to me and saying "dada" then pointing to her mother and opening her mouth.
  • I have nevertheless heard her make un-cued productions of "scissors," "bulldozer," and "motorcycle" (though not with great reliability). Motorcycle translated to something like "dodo SY-ku" – a kind of indistinct prosodic foot and then a second heavily stressed foot. Her production vocabulary is extremely idiosyncratic compared with her comprehension, precisely the pattern identified by Mayor & Plunkett (2014) in a very cool recent paper. 
  • "BA ba" (repeated over and over again) seems to mean "let's sing a song" – or especially, let's watch inane internet children's song videos. We don't do this last all that often, but it has made an outsize impression on her, perhaps because she's seen so little TV in her short life. This is also the first time that she's taken to repeating a single word / label over and over again, so as to emphasize the point. 
And on comprehension:
  • Our life got vastly better when M learned how to say "yes" to yes/no questions. For about a month now, we've been able to say things like "would you like to go outside?" and she will reply "da!" (she is Russian, apparently). "Da" has very recently morphed into "yah" but it's very clearly a strong affirmative. M will occasionally turn her head away and wrinkle her nose if she doesn't like the suggestion. This response feels a lot like a generalization of her "I don't want to eat that bite" face.
  • Other types of questions have been slower. Maybe unsurprisingly, "or" is still not a success – she either stays silent or responds to the second option, even if she knows how to produce a word for one or both options. "Where" questions have been emerging in the last week or so. This morning, M was very clear in directing me when I asked her "where should we go?" "What's this" is uneven – occasionally I'll get a "ba" or "da" (ball/dog) type production. And "what do you want" has only gotten a successful production once or twice (bottle, I think). 
  • M understands and responds to simple imperatives just fine: "take the cup to baby" gets a positive response, though her accuracy on less plausible sentences is low.
  • Explanations seem to hold a lot of water with her. I don't think she understands the explanation at all, but if we need to give something to someone, or leave something behind that she's holding, we ask her and then explain. For example, telling her why we can't bring her favorite highlighter pen in the car with us seems to convince her to put it down. What's going through her mind here? Maybe just our seriousness about the idea – something like "wow, they used a lot of words, they must really mean it."
  • She is remarkably good at negation (at least when she wants to be). A few days ago we were headed out the door to the playground, and M tried to drag a big stroller blanket out the door. I said "We're not going to bring our blanket outside." She headed back over to the stroller, and dropped the blanket. Of course, then she headed back towards the door, turned back, and grabbed a smaller blanket. There was a lot of contextual support to this sequence, but understanding my sentence still took some substantial sophistication. The negation "we're not" is embedded in the sentence, and wasn't supported by too much in the way of prosody. This success was very striking to me, given the failures of much older toddlers to understand more decontextualized negations in some research that Ann Nordmeyer and I have been doing.
Overall, I am still struck by how hard production is for M, compared with comprehension. A new word, say "playground," might start as something resembling "PAI-go" but merge back into "BA-ba" by the end of a few repetitions. M has never been a big babbler, and so I suspect that she is slow to produce language because the skills of production are simply not as well-practiced. There are some kids who babble up a storm, and I imagine all of the motor routines are much easier for them. In contrast, M just doesn't have the sounds of language in her mouth yet.