Friday, December 19, 2014

Why can't toddlers play with one another? An alternative account of parallel play

Whenever I go to daycare, or interact with other parents of toddlers,  I hear about how M and other kids her age – 17 months now – are engaged in parallel play. The basic idea is that, even though young toddlers like to be near other kids their age, they don't play together: they engage in the same sorts of activities in close proximity, but without any sort of reciprocal interaction. I'll argue here that this label is at best a descriptive convenience – it doesn't reflect any inability to engage in reciprocal play – and masks an interesting developmental story.

The idea of parallel play dates back to Parten (1932), who noted the prevalence of this kind of behavior in young preschoolers. For fun, here's the key figure from her study:
The data are pretty clear – and the graph surprisingly modern! In fact, you can see this sort of thing happening in any daycare classroom, and even more so for 1 - 2 year-olds than the preschoolers in Parten's study. But the question is what to make of this descriptive observation (Parten herself doesn't give much of any interpretation, at least in that paper).

So we turn to the internet. Of course, it has an interpretation of why parallel play occurs:
[Parallel play is] par for the developmental course for babies and toddlers. Why? Because a child this age is still busy figuring out so much about the world and doesn't yet realize that people his own size are indeed people (who might actually be fun to do stuff with). He's too young to make friends, but companionable side-by-side play is a good start.
You hear this echoed across many other sources of information for parents, including the teachers at M's daycare. These sorts of stage labels are endemic in developmental psych of the popular variety, and they often imply that there is a cognitive change that accompanies the behavioral stage shift. I think this developmental story is deeply wrong.

Over the last 15 - 20 years, a large body of evidence has accumulated that suggests that young children have very robust expectations for the social world by their second year. Babies can build social expectations for almost anything – even for eyeless blobs – so they definitely should have such expectations for other toddlers. Other work suggests that very small cues like reaching, looking, and movement towards a target can effectively cue inferences about an agent's goals and desires. So toddlers almost certainly understand that their peers have goals and desires, perhaps desires that even differ from the toddler's own. In addition, toddlers have no trouble engaging in reciprocal interactions with older children and adults (e.g., giving games, simple games of catch).

In fact, in a recent paper by Cortes Barragan and Dweck, having adults engage in parallel play – rather than reciprocal play – with toddlers made the toddlers less likely to help that adult achieve a goal later on. So that's a nice piece of evidence for two things. First, parallel play is far from being the only way that toddlers can interact. Second, they actually think it's negative in some way when an adult doesn't play with them reciprocally, so they are forming strong expectations both about and from the type of play they engage in with different partners.

Why do toddlers exhibit so little reciprocal play, then? I think what's going wrong is that the appropriate social cognitive abilities are very much present in kids of this age, but they are hard to exercise, and critically, social computations are slow. Reciprocal interaction with a peer requires fast online recognition of goals and action planning with respect to those goals. You need to know what your play partner wants you to do, and you need to figure that out before she loses interest and gets distracted. That's pretty easy for adults to do; they create structured play opportunities for toddlers all the time. (For example, last night I set up a tea party for M and helped her serve tea to a wide variety of different stuffed animals.)

But when you get two toddlers together, they strike out so often that it might be adaptive to avoid trying to engage! In a recent episode I watched, M saw that another little girl Y wanted a toy car. But by the time she figured out that Y wanted the car, Y had already moved on to other things. The result was that M walked up to Y at a totally inappropriate time and thrust a car in her face for seemingly no reason. Nice idea, but poor execution. Maybe if you are a toddler, you learn not to try out this kind of gambit until you're more confident you will succeed.

This explanation – that parallel play is an adaptive consequence of toddlers' poor speed of processing – is a product of something that I've been exploring a lot on this blog: that babies and toddlers are surprisingly knowledgeable about the world, but their ability to use this knowledge is sharply limited. The limitation here is that social computations are very slow, so that by the time the computations are done, their output is less likely to be relevant. In other words, "parallel play" as a description is correct, but the shift to a more reciprocal style of play may not have anything to do with a cognitive shift. Instead it may emerge from more gradual changes in children's speed of social processing.

Cortes Barragan R, & Dweck CS (2014). Rethinking natural altruism: Simple reciprocal interactions trigger children's benevolence. Proceedings of the National Academy of Sciences of the United States of America, 111 (48), 17071-4 PMID: 25404334

Monday, November 24, 2014

The piecemeal emergence of language

It's been a while since I last wrote about M. She's now 16 months, and it's remarkable to see the trajectory of her early language. On the one hand, she still produces relatively few distinct words that I can recognize; on the other, her vocabulary in comprehension is quite large and she clearly understands a number of different speech acts (declaratives, imperatives, questions) and their corresponding constructions.

Some observations on production:

  • She still doesn't say "mama." She does say "mamamamamama" to express need, a pattern that Clark (1973) noted is common. She definitely knows what "mama" means, and even does funny things like pointing to me and saying "dada" then pointing to her mother and opening her mouth.
  • I have nevertheless heard her make un-cued productions of "scissors," "bulldozer," and "motorcycle" (though not with great reliability). Motorcycle translated to something like "dodo SY-ku" – a kind of indistinct prosodic foot and then a second heavily stressed foot. Her production vocabulary is extremely idiosyncratic compared with her comprehension, precisely the pattern identified by Mayor & Plunkett (2014) in a very cool recent paper. 
  • "BA ba" (repeated over and over again) seems to mean "let's sing a song" – or especially, let's watch inane internet children's song videos. We don't do this last all that often, but it has made an outsize impression on her, perhaps because she's seen so little TV in her short life. This is also the first time that she's taken to repeating a single word / label over and over again, so as to emphasize the point. 
And on comprehension:
  • Our life got vastly better when M learned how to say "yes" to yes/no questions. For about a month now, we've been able to say things like "would you like to go outside?" and she will reply "da!" (she is Russian, apparently). "Da" has very recently morphed into "yah" but it's very clearly a strong affirmative. M will occasionally turn her head away and wrinkle her nose if she doesn't like the suggestion. This response feels a lot like a generalization of her "I don't want to eat that bite" face.
  • Other types of questions have been slower. Maybe unsurprisingly, "or" is still not a success – she either stays silent or responds to the second option, even if she knows how to produce a word for one or both options. "Where" questions have been emerging in the last week or so. This morning, M was very clear in directing me when I asked her "where should we go?" "What's this" is uneven – occasionally I'll get a "ba" or "da" (ball/dog) type production. And "what do you want" has only gotten a successful production once or twice (bottle, I think). 
  • M understands and responds to simple imperatives just fine: "take the cup to baby" gets a positive response, though her accuracy on less plausible sentences is low.
  • Explanations seem to hold a lot of water with her. I don't think she understands the explanation at all, but if we need to give something to someone, or leave something behind that she's holding, we ask her and then explain. For example, telling her why we can't bring her favorite highlighter pen in the car with us seems to convince her to put it down. What's going through her mind here? Maybe just our seriousness about the idea – something like "wow, they used a lot of words, they must really mean it."
  • She is remarkably good at negation (at least when she wants to be). A few days ago we were headed out the door to the playground, and M tried to drag a big stroller blanket out the door. I said "We're not going to bring our blanket outside." She headed back over to the stroller, and dropped the blanket. Of course, then she headed back towards the door, turned back, and grabbed a smaller blanket. There was a lot of contextual support to this sequence, but understanding my sentence still took some substantial sophistication. The negation "we're not" is embedded in the sentence, and wasn't supported by too much in the way of prosody. This success was very striking to me, given the failures of much older toddlers to understand more decontextualized negations in some research that Ann Nordmeyer and I have been doing.
Overall, I am still struck by how hard production is for M, compared with comprehension. A new word, say "playground," might start as something resembling "PAI-go" but merge back into "BA-ba" by the end of a few repetitions. M has never been a big babbler, and so I suspect that she is slow to produce language because the skills of production are simply not as well-practiced. There are some kids who babble up a storm, and I imagine all of the motor routines are much easier for them. In contrast, M just doesn't have the sounds of language in her mouth yet.

Wednesday, November 19, 2014

Musings on the "file drawer" effect

tl;dr: Even if you love science, you don't have to publish every experiment you conduct.

I was talking with a collaborator a few days ago and discussing which of a series of experiments we should include in our writeup. In the course of this conversation, he expressed uncertainty about whether we were courting ethical violation by choosing to exclude from a potential publication a set of generally consistent but somewhat more poorly executed studies. Publication bias is a major problem in the social sciences (and elsewhere).* Could we be contributing to the so-called "file drawer problem," in which meta-analytic estimates of effects are inflated by the failure to publish negative findings?

I'm pretty sure the answer is "no."

Some time during my first year of graduate school, I had run a few studies that produced positive findings (e.g., statistically significant differences between groups).  I went to my advisor and started saying all kinds of big things about how I would publish them and they'd be in this paper and that paper and the other; probably it came off as quite grandiose. After listening for a while, he said, "we don't publish every study we run."

His point was that a publishable study – or set of studies – is not one that produces a "significant" result. A publishable study is one that advances our knowledge, whether the result is statistically significant or not. If a study is uninteresting, it may not be worth publishing. Of course, the devil is in the details of what "worth publishing" means, so I've been thinking about how you might assess this. Here's my proposal:
It is unethical to avoid publishing a result if a knowledgeable and adversarial reviewer could make a reasonable case that your publication decision was due to a theoretical commitment to one outcome over another. 
I'll walk through both sides of this proposal below. If you have feedback, or counterexamples, I'd be eager to hear them. 

When it's fine not to publish. First, not everyone has an obligation to publish scientific research. For example, I've supervised some undergraduate honors theses that were quite good, but the students weren't interested in a career in science. I regret that they didn't do the work to write up their data for publication, but I don't think they were being unethical, at least from the perspective of publication bias (if they had discovered a lifesaving drug, the analysis might be different).

Second, publication has a cost. The cost is mostly in terms of time, but time is translatable directly into money (whether from salary or from research opportunity cost). Under the current publication system, publishing a peer-reviewed paper is extremely slow. In addition to the authors' writing time, a paper takes hours of time from editors and reviewers, and much thought and effort in responding to reviews. A discussion of the merits of peer review is a topic for another post (spoiler: I'm in favor of it).** But even the most radical alternatives – think generalized arXiv – do not eliminate the cost of writing a clear, useful manuscript. 

So on a cost-benefit analysis, there is a lot of work that shouldn't be written up. For example, cases of experimenter error are pretty clear cut. If I screw up my stimuli and Group A's treatment was contaminated with items that Group B should have seen, then what do we learn? The generalizable knowledge from that kind of experiment is pretty thin. It seems uncontroversial that results of this sort aren't worth publishing.

What about correct but boring experiments? What if I show that the Stroop effect is unaffected by font choice – or perhaps I show a tiny, statistically significant but not meaningful, effect of serif fonts on the Stroop effect?*** For either of these experiments, I imagine I could find someone to publish them. In principle, if they were well-executed, PLoS ONE would be a viable venue, since they do not referee for impact. But I am not sure why anyone would be particularly interested, and I don't think it'd be unethical not to publish them.

When it's NOT fine not to publish. First, when a finding is "null" – meaning, not statistically significant despite your expectation that it would be. Someone who held an alternative position (e.g. that the finding would not be predicted to yield a significant result) could say that you were biasing the literature due to your theoretical commitment. This is probably the most common case of publication bias.
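To make the inflation concrete, here is a minimal simulation sketch, not anything from the studies discussed above. It assumes a small true effect (0.2 standard deviations), 20 participants per group, and a "file drawer" rule in which only significant positive results get published; the function names, parameter values, and the normal-approximation cutoff of 1.96 are all illustrative assumptions.

```python
import random
import statistics


def simulate_study(true_d, n, rng):
    """Simulate one two-group study; return (observed difference, significant?)."""
    treatment = [rng.gauss(true_d, 1.0) for _ in range(n)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    diff = statistics.mean(treatment) - statistics.mean(control)
    # Standard error of the difference between two independent group means
    se = ((statistics.variance(treatment) + statistics.variance(control)) / n) ** 0.5
    # Assumption: approximate two-sided test at alpha = .05 using a normal cutoff
    significant = abs(diff / se) > 1.96
    return diff, significant


def file_drawer_demo(true_d=0.2, n=20, n_studies=2000, seed=1):
    """Compare the mean effect across all studies with the mean of 'published' ones."""
    rng = random.Random(seed)
    results = [simulate_study(true_d, n, rng) for _ in range(n_studies)]
    all_effects = [d for d, _ in results]
    # Hypothetical publication rule: only significant, positive findings leave the drawer
    published = [d for d, sig in results if sig and d > 0]
    return statistics.mean(all_effects), statistics.mean(published), len(published)


mean_all, mean_published, n_published = file_drawer_demo()
print(f"mean effect, all studies:      {mean_all:.2f}")
print(f"mean effect, published only:   {mean_published:.2f}")
print(f"studies published: {n_published} of 2000")
```

Under these assumptions the average across all studies sits near the true effect, while the average of the "published" subset comes out well above it, which is the bias a meta-analysis of the published literature would inherit.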

Second, if your finding is inconsistent with a particular theory, this fact also should not be used in the decision about publication. Obviously, an adversarial critic could argue – rightly – that you suppressed the finding, which in turn leads to an exaggeration in the degree of published evidence for your preferred theory.

Third, when a finding (finding #1) is contradictory to another finding (finding #2) that you do intend to publish. Here, just imagine that your reviewer knew about #1 as well. Could you justify on independent, a priori grounds that you should not publish #1, independent of the theory? In my experience, the only time that is possible is if #1 is clearly a flawed experiment and does not have any evidential value for the question you're interested in.****

Conclusions. Publication bias is a significant issue, and we need to use a variety of tools to combat it. Funnel plots are a useful tool, and some new work by Simonsohn et al. uses p-curve analysis. But the solution is certainly not to assume that researchers should publish all their experiments – that solution might be as bad as the problem, in terms of the cost for scientific productivity. Instead, to determine if they are suppressing evidence due to their own biases, researchers should consider applying an ethical test like the one I proposed above.

(The footnotes here got a little out of control).

* A recent, high impact study used TESS (Time-Sharing Experiments in the Social Sciences, a resource for doing pre-peer reviewed experiments with large, representative samples) to estimate publication bias in the social sciences. I like this study a lot, but I am not sure how general the bias estimates are, because TESS is a special case. TESS is a limited resource, and experiments submitted to TESS undergo substantial additional scrutiny due to TESS's pre-data collection review. They are relatively more well-vetted for potential theoretical impact, and substantially less likely to have basic errors, compared with a one-off study using a convenience sample. I suspect – based on no data except my own experience – that relatively more data is left unpublished than the TESS study's estimate, but also that relatively less of it should be published.

** You could always say, hey, we should just put all our data online. We actually do something sort of like that. But you can't just go there and easily find out whether we conducted an experiment on your theoretical topic of choice. Reporting experiments is not just about putting the data out there – you need description, links to relevant literature, etc.

*** Actually, someone has done Stroop for fonts, though that's a different and slightly more interesting experiment.

**** Here's a trickier one. If a finding is consistent with a theory, could this consistency be grounds to avoid publishing it? A Popperian falsificationist scientist should never publish data that are simply consistent with a particular theory, because those data have no value. But basically no one operates in this way – we all routinely make predictions from theory and are excited when they are satisfied.  For a Bayesian scientist of this type, data consistent with a theory are important. But some data may be consistent with many theories and hence provide little evidential value. Other data may be consistent with a theory, but that theory is already so well-supported, so the experiments make little change in our overall degree of belief – consider the case of experiments supportive of Newton's laws, or of further Stroop replications. These cases also potentially work under the adversarial reviewer test, but only if we include the cost-benefit analysis above, and the logic is dicier. A reviewer could accuse you of bias against the Stroop effect, but you might respond that you just didn't think the incremental evidence was worth the effort. Nevertheless, this balance seems less straightforward. Reflecting this complexity, perhaps the failure to publish confirmatory evidence actually does matter. In a talk I heard last spring, John Ioannidis made the point that there are basically no medical interventions out there with d (standardized effect size) > 3 or so (I forget the exact number). I think this is actually a case of publication bias against confirmation of obvious effects. For example, I can't find a clinical trial of the rabies vaccine anywhere after Pasteur – because the mortality rate without the vaccine is apparently around 99%, and with the vaccine most people survive. The effect size there is just enormous – so big that you should just treat people! So actually the literature does have systematic bias against really big effects.

Monday, November 10, 2014

Comments on "reproducibility in developmental science"

A new article by Duncan et al. in the journal Developmental Psychology highlights best practices for reproducibility in developmental research. From the abstract:
Replications and robustness checks are key elements of the scientific method and a staple in many disciplines. However, leading journals in developmental psychology rarely include explicit replications of prior research conducted by different investigators, and few require authors to establish in their articles or online appendices that their key results are robust across estimation methods, data sets, and demographic subgroups. This article makes the case for prioritizing both explicit replications and, especially, within-study robustness checks in developmental psychology. 
I'm very interested in this topic in general and think that the broader message is on target. Nevertheless, I was surprised by the specific emphasis in this article on what they call "robustness checking" practices. In particular, all three of the robustness practices they describe – multiple estimation techniques, multiple datasets, and subgroup analyses – seem to be most useful for non-experimental studies that involve large correlational datasets (e.g. from nationally representative studies).

Multiple estimation techniques refers to the use of several different statistical models (e.g. standard regression, propensity matching, instrumental variable regression) to estimate the same effect. This is not a bad practice, but it is much more important when there are many different ways of controlling for confounders (e.g. in a large observational dataset). In a two-condition experiment, the menu of options is more limited. Similarly, subgroup estimation – estimating models on smaller populations within the main sample – is typically only possible with a large, multi-site dataset. And the use of multiple datasets presupposes that there are many datasets that bear on the question of interest, something that is not usually true when you are making experimental tests of a new theoretical question.

So all this means that the primary empirical claim of the article – that developmental psych is behind other disciplines (like applied economics) in these practices – is a bit unfair. Here's the key table from the article:

The main point we're supposed to take away from this table is that the econ articles are doing many more robustness checks than the developmental psych articles. But I'd bet that most of the developmental psych journals are filled with novel empirical studies that don't afford comparison with large, pre-existing datasets; subgroup analyses; or use of multiple estimation techniques. And I'm not sure that's a bad thing – at very least, causal inference is far more straightforward in randomized experiments than large-scale observational studies.

I think I have the same goals as the authors: making developmental (and other) research more reproducible. But I would start with a different set of recommendations to the developmental psych community. Here are three simple ones:
  • Larger samples. It is still common in the literature on infancy and early childhood to have extremely small sample sizes. N=16 is still the accepted standard in infancy research, believe it or not. Given the evidence that looking time is a quantitative variable (e.g. here and here), we need to start measuring it with precision. Infants are expensive, but not as expensive as false positives. And preschoolers are cheap, so there's really no excuse for tiny cell sizes.
  • Internal replication. There are many papers – again especially in infant research but also in work with older children – where the primary effect is demonstrated in Study 1 and then the rest of the reported findings are negative controls. A good practice for these studies is to pair each control with a de novo replication. This facilitates statistical comparison (e.g., equating for small aspects of population or testing setup that may change between studies) and also ensures robustness of the effect. 
  • Developmental comparison. This recommendation probably should go without saying. For developmental research – that is, work that tries to understand mechanisms of growth and change – it's critical to provide developmental comparisons and not just sample a single convenient age group. Developmental comparison groups also provide an important opportunity for internal replication. If 3-year-olds are above chance on your task and 4- and 5-year-olds aren't, then perhaps you've discovered an amazing phenomenon; but it's also possible you have a false positive. Our baseline hypotheses about development provide useful constraints on the pattern of results we expect, meaning that developmental comparison groups can provide both new data and a useful sanity check.
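As a rough illustration of the sample-size point in the first recommendation, the sketch below computes the half-width of an approximate 95% confidence interval on a mean as a function of N. It assumes a normal approximation and expresses the result in standard-deviation units; the function name and the values of N are illustrative, not taken from any study.

```python
import math


def ci_half_width(sd, n, z=1.96):
    """Half-width of an approximate 95% CI on a mean (normal approximation)."""
    return z * sd / math.sqrt(n)


# Precision at a few hypothetical sample sizes, in units of one standard deviation
for n in (16, 32, 64):
    print(f"N={n}: +/- {ci_half_width(sd=1.0, n=n):.3f} SD")
```

With N=16 the interval spans roughly half a standard deviation on either side of the mean, and it takes four times as many participants to cut that in half, since precision improves only with the square root of N.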
Perhaps this all just reflects my different orientation towards the field than Duncan et al.; but a quick flip through a recent issue of Child Development suggests that the modal article is not a large observational study but a much smaller-scale set of experiments. The recommendations Duncan et al. make are certainly reasonable, but we need to supplement them with guidelines for experimental research as well.

Duncan GJ, Engel M, Claessens A, & Dowsett CJ (2014). Replication and robustness in developmental research. Developmental Psychology, 50 (11), 2417-25 PMID: 25243330

(HT: Dan Yurovsky)

Monday, November 3, 2014

Is mind-reading automatic? A replication and investigation

tl;dr: We have a new paper casting doubt on previous work on automatic theory of mind. Musings on open science and replication.

Do we automatically represent other people's perspective and beliefs, just by seeing them? Or is mind-reading effortful and slow? An influential paper by Kovács, Téglás, and Endress (2010; I'll call them KTE) argued for the automaticity of mind-reading based on an ingenious and complex paradigm.  The participant watched an event – a ball moving back and forth, sometimes going behind a screen – at the same time as another agent (a Smurf) watched part of the same event but sometimes missed critical details. So, for example, the Smurf might leave and not see the ball go behind the screen.

When participants were tested on whether the ball was really behind the screen, they appeared to be faster when their beliefs lined up with the Smurf's. This experiment – and a followup with infants – gave apparently strong evidence for automaticity. Even though the Smurf was supposed to be completely "task-irrelevant" (see below), participants apparently couldn’t help "seeing the world through the Smurf’s eyes." They were slower to detect the ball, even when they themselves expected the ball, if the Smurf didn’t expect it to be there. (If this short description doesn't make everything clear, then take a look at our paper or KTE's original writeup. I found the paradigm quite hard to follow the first time I saw it.)

I was surprised and intrigued when I first read KTE's paper. I don't study theory of mind, but a lot of my research on language understanding intersects with this domain and I follow it closely. So a couple of years later, I added this finding to the list of projects to replicate for my graduate methods class (my class is based on the idea – originally from Rebecca Saxe – that students learning experimental methods should reproduce a published finding). Desmond Ong, a grad student in my department, chose the project. I found out later that Rebecca had also added this paper to her project list.

One major obstacle to the project, though, was that KTE had repeatedly declined to share their materials – in direct conflict with the Science editorial policy, which requires this kind of sharing. I knew that Jonathan Phillips (Yale) and Andrew Surtees (Birmingham) had worked together to create an alternative stimulus set, so Desmond got connected with them and they generously shared their videos. Rebecca's group created their own Smurf videos from scratch. (Later in the project, we contacted KTE again and even asked the Science editors to intervene. After the intervention, KTE promised to get us the materials but never did. As a result, we still haven't run our experiments with their precise stimuli, something that is very disappointing from the perspective of really making sure we understand their findings, though I would stress that because of the congruence between the two new stimulus sets in our paper, we think the conclusions are likely to be robust across low-level variations.)

After we got Jonathan's stimulus set, Desmond created an MTurk version of the KTE experiment and collected data in a high-power replication, which reproduced all of their key statistical tests. We were very pleased, and got together with Jonathan to plan followup experiments. Our hope was to use this very cool paradigm to test all kinds of subtleties about belief representation, like how detailed the participants' encoding was and whether it respected perceptual access. But then we started taking a closer look at the data we had collected and noticed that the pattern of findings didn't quite make sense. Here is that graph – we scraped the values from KTE's figures and replotted their SEM as 95% CIs:

The obvious thing is the difference in intercept between the two studies, but we actually don't have a lot to say about that – RT calculations depend on when you start the clock, and we don't know when KTE started the clock in their study. We also did our study on the internet, and though you can get reliable RTs online, they may be a bit slower for all kinds of unimportant reasons.
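For readers unfamiliar with the SEM-to-CI rescaling mentioned above, here is a minimal sketch of the arithmetic under a normal approximation. The numbers are hypothetical and are not values from KTE's figures or from the replication; the function name is also just illustrative. For small samples a t-based multiplier would be somewhat larger than 1.96.

```python
def sem_to_ci95(mean, sem, z=1.96):
    """Return (lower, upper) bounds of an approximate 95% CI from a mean and its SEM."""
    half_width = z * sem
    return mean - half_width, mean + half_width


# Hypothetical reaction time (ms) and standard error, for illustration only
lower, upper = sem_to_ci95(mean=350.0, sem=10.0)
print(f"95% CI: [{lower:.1f}, {upper:.1f}]")
```

The practical upshot is that error bars drawn as 95% CIs are about twice as wide as the same data drawn with SEM bars, so two plots can look very different in precision until they are put on the same footing.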

We also saw some worrisome qualitative differences between the two datasets, however. KTE's data look like people are slower when the Smurf thinks the ball is absent AND when they themselves think the ball is absent too. In contrast, we see a crossover interaction – people are slow when they and the Smurf think the thing is absent, but they are also slow when they and the Smurf think the thing is present. That makes no sense on KTE's account – that should be the fastest condition. What's more, we can't be certain that KTE wouldn't have seen that result, because their overall effects were so much smaller and their relative precision given those small effects seemed lower.

I won't go through all the ways that we tried to make this crossover interaction go away. Suffice it to say, it was a highly replicable finding, across labs and across all kinds of conditions that shouldn't have produced it. Here's Figure 3 from the paper:

Somewhere during this process, we joined forces with Rebecca, and found that they saw the crossover as well (panels marked "1c: Lab" and "2b: Lab 2AFC"). So Desmond and Jonathan then led the effort to figure out the source of the crossover.

Here's what they found. The KTE paradigm includes an "attention check" in which participants have to respond that they saw the Smurf leave the room. But the timing of this attention check is not the same across belief conditions – in other words, it's confounded with condition. And the timing of the attention check actually looks a lot like the crossover we observed: In the two conditions where we observed the slowest RTs, the attention check is closer in time to the actual decision that participants have to make.

There's an old literature showing that making two reaction time decisions right after one another makes the second one slower. We think this is exactly what's going on in KTE's paper, and we believe our experiments demonstrate it pretty clearly. When we don't have an attention check, we don't see the crossover; when we do have the check, but it doesn't have a person in it at all (just a lightbulb), we still see the crossover; and when we control the timing of the check, we eliminate the crossover again.

Across studies, we were able to produce a double dissociation between belief condition and the attention check. To my mind, this dissociation provides strong evidence that the attention check – and not the agent's belief – is responsible for the findings that KTE observed. In fact, KTE and collaborators actually also see a flat RT pattern in a recent article that didn't use the attention check (check their SI for the behavioral results)! So their results are congruent with our own – this also partially mitigates our concern about the stimulus issue. In sum, we don't think the KTE paradigm provides any evidence on the presence of automatic theory of mind.

Thoughts on this process. 

1. Several people who are critical of the replication movement more generally (e.g. Jason Mitchell, Tim Wilson) have suggested that we pursue "positive replications," where we identify moderator variables that control the effects of interest. That's what we did here. We "debugged" the experiment – figured out exactly what went wrong and led to the observed result. Of course, the attention check wasn't a theoretically-interesting moderator, but I think we did exactly what Mitchell and Wilson are talking about.

But I don't think "positive replication" is a sustainable strategy more generally. KTE's original N=24 experiment took us 11 experiments and almost 900 participants to "replicate positively," though we knew much sooner that it wouldn't be the paradigm we'd use for future investigations (what we might have learned from the first direct replication, given that the RTs didn't conform to the theoretical predictions). 

The burden in science can't fall this hard on the replicators all the time. Our work here was even a success by my standards, in the sense that we eventually figured out what was going on! There are other experiments I've worked on, both replications and original studies, where I've never figured out what the problem was, even though I knew there was a problem. So we need to acknowledge that replication can establish – at least – that some findings are not robust enough to build on, or do not reflect the claimed process, without ignoring the data until the replicators figure out exactly what is going on.

2. The replication movement has focused too much for my taste on binary effect size estimation or hypothesis testing, rather than model- or theory-based analysis. There's been lots of talk about replication as a project of figuring out if the original statistical test is significant, or if the effect size is comparable. That's not what this project was about – I should stress that KTE's original paradigm did prove replicable. All of the original statistical tests were statistically significant on basically every replication we did. The problem was that the overall pattern of data still wasn't consistent with the proposed theory. And that's really what the science was about.

This sequence of experiments was actually a bit of a reductio ad absurdum with respect to null-hypothesis statistical testing more generally. Our paper includes 11 separate samples. Despite having planned to have >80% power for all of the tests we did, the sheer scope means that a good number of them would not come out statistically significant, just by chance. So we were in a bit of a quandary – we had predictions that "weren't satisfied" in individual experiments, but we'd strongly expect that to be the case just by chance! (Assuming the tests are independent, the probability of 11 statistical tests, each with 80% power, all coming out significant is .8^11 ≈ .09, which is less than .1.)
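To make that arithmetic concrete, here is a minimal sketch in Python. It assumes the tests are independent and that each has exactly 80% power; both are simplifications, and the function names are mine, not anything from the paper's code.

```python
def prob_all_significant(n_tests: int, power: float) -> float:
    """Probability that every one of n independent tests is significant."""
    return power ** n_tests


def expected_misses(n_tests: int, power: float) -> float:
    """Expected number of tests that fail to reach significance."""
    return n_tests * (1.0 - power)


if __name__ == "__main__":
    # 11 samples, each planned at 80% power (the figures given in the text).
    print(round(prob_all_significant(11, 0.8), 4))
    print(round(expected_misses(11, 0.8), 1))
```

With these inputs the chance of a clean sweep is under 10%, and on average a couple of the 11 tests would be expected to miss even if every effect were real.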

So rather than looking at whether all the p-values were independently below .05, we decided to aggregate the RT effect size on the key effect using meta-analysis. This analysis allowed us to see which account best predicted the RT differences across experiments and conditions. We aggregated the RT coefficients for the crossover interaction (panel A below) and the RT differences for the key contrast in the automatic theory of mind hypothesis (panel B below).  You can see the result here:

The attention check hypothesis clearly differentiates between conditions where we see a big crossover effect and conditions where we don't. In contrast, the automatic theory of mind hypothesis doesn't really differentiate the experiments, and the meta-analytic effect estimate goes in the wrong direction. So the combined evidence across our studies supports the attention check being the source of the RT effect.
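For readers who haven't done this kind of aggregation, here is a minimal sketch of one standard way to pool effects across experiments: fixed-effect, inverse-variance weighting. This is an assumption for illustration; the post doesn't say which meta-analytic model we used, and the numbers in the usage example are made up rather than taken from our data.

```python
from math import sqrt


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted average of per-experiment effect estimates.

    Returns the pooled estimate and its standard error.
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    # More precise experiments (smaller SE) get larger weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled, pooled_se


if __name__ == "__main__":
    # Hypothetical RT differences (ms) and standard errors, NOT real data.
    toy_estimates = [12.0, -5.0, 3.0]
    toy_std_errors = [10.0, 8.0, 6.0]
    estimate, se = pool_fixed_effect(toy_estimates, toy_std_errors)
    print(f"pooled = {estimate:.2f} ms, 95% CI half-width = {1.96 * se:.2f} ms")
```

The point of pooling is that a single estimate with a confidence interval replaces a tally of which individual p-values happened to fall below .05.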

Although this analysis isn't full Bayesian model comparison, it's a reasonable method for doing something similar in spirit – comparing the explanatory value of two different hypotheses. Overall, this experience has shifted me much more strongly to more model-driven analyses for large study sets, since individual statistical tests are guaranteed to fail in the limit – and that limit is much closer than I expected.

3. Where does this leave theory of mind, and the infant literature in particular? As we are at pains to say in our paper, we don't know. KTE's infant experiments didn't have the particular problem we describe here, and so they may still reflect automatic belief encoding. On the other hand, the experience of investigating this paradigm has made me much more sensitive to the issues that come up when you try to create complex psychological state manipulations while holding constant participants' low-level perceptual experience. It's hard! Cecilia Heyes has recently written several papers making this same kind of point (here and here). So I am uncertain about this question more generally.

That's a low note to end on, but this experience has taught me a tremendous amount about care in task design, the flaws of the general null-hypothesis statistical approach as we get to "slightly larger data" (not "big data" even!) in psychology, and replication more broadly. All of our data, materials, and code – as well as a link to our actual experiments, in case you want to try them out – are available on github. We'd welcome any feedback; I believe KTE are currently writing a response as well, and we look forward to seeing their views on the issue.

(Minor edit: I got Cecilia Heyes' name wrong, thanks to Dale Barr for the correction).

Sunday, October 26, 2014

Response to Marcus & Davis (2013)

tl;dr: We wrote a response to a critique of our work. Some more musings about overfitting.

Last year, Gary Marcus and Ernie Davis (M&D) wrote a piece in Psychological Science that was quite critical of probabilistic models of higher-level cognition. They asked whether such models are selectively fit to particular tasks to the exclusion of counter-evidence ("task selection") and whether the models are hand-tweaked to fit those particular tasks ("model selection"). On the basis of these critiques, they questioned whether probabilistic models are a robust framework for describing human cognition.

It's never fun to have your work criticized, but sometimes there is a lot that can be learned from these discussions. For example, in a previous exchange with Ansgar Endress (critique here, my response here, his response here), I got a chance to think through my attitude towards the notion of rationality or optimality. Similarly, in Tom Griffiths and colleagues' response to another critique, they have some nice discussion of this issue.

In that spirit, a group of Bayesians whose work was mentioned in the critique have recently written a response letter that will be published in the same journal as M&D's critique (after M&D get a chance to reply). Our response is very short, but hopefully it captures our attitude towards probabilistic models as being a relevant and robust method – not the only one, but one that has shown a lot of recent promise – for describing higher-level cognition. Here I want to discuss one thing that got compressed in the response, though.

One of the pieces of work M&D critiqued was the model of pragmatic reasoning that Noah Goodman and I published a couple of years ago (I'll call that the FG2012 model). Our one-page paper reported only a single study with a very restricted stimulus set, but there is actually a substantial amount of recent work on this topic that suggests such models do a good job at describing human reasoning about language in context; I posted a bibliography of such work a little while ago.

M&D criticized a particular modeling decision that we took in FG2012 – the use of a Luce choice rule to approximate human judgments. They pointed out that other choices (that could have been justified a priori) nevertheless would have fit the data much worse. Summarizing their critique, they wrote:
"Individual researchers are free to tinker, but the collective enterprise suffers if choices across domains and tasks are unprincipled and inconsistent. Models that have been fit only to one particular set of data have little value if their assumptions cannot be verified independently; in that case, the entire framework risks becoming an exercise in squeezing round pegs into square holes." (p. 2357)
I agree with this general sentiment, and have tried in much of my work to compare models from different traditions across multiple datasets and experiments using the same fitting procedures and evaluation standards (examples here, here, and here). But I don't think the accusation is fair in the case of FG2012. I'll go through a couple of specific interpretations of the critique that don't hold in our case, and then argue that in fact the "model selection" argument is really just a repetition of the (supposedly separate) "task selection" argument.
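For readers who haven't run into it, the Luce choice rule at issue simply says that the probability of choosing an alternative is its strength divided by the summed strength of all alternatives. Here is a minimal sketch in Python; it is not the FG2012 code, and the function name and example strengths are made up for illustration.

```python
def luce_choice(strengths):
    """Map non-negative option strengths to choice probabilities.

    Luce's rule: P(option i) = strength_i / sum of all strengths.
    """
    if any(s < 0 for s in strengths):
        raise ValueError("strengths must be non-negative")
    total = sum(strengths)
    if total == 0:
        raise ValueError("at least one strength must be positive")
    return [s / total for s in strengths]


if __name__ == "__main__":
    # Hypothetical strengths for a three-alternative forced choice.
    print(luce_choice([1.0, 1.0, 2.0]))
```

Note that there is nothing to tune here: the rule has no free numerical parameter, which matters for the overfitting discussion that follows.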

Problem: numerical overfitting -> Solution: cross-validation. A very specific critique that M&D could be leveling at us is that we are overfitting. The critique would then be that we tuned our model post-hoc to fit our particular dataset. In a standard case of overfitting, say for a classification problem, the remedy is to evaluate the model on held-out data that wasn't used for parameter tuning. If the model performs as well on the out-of-sample generalization, then it's not overfit. Our (extremely simple) model was clearly not overfit in this sense, however: It had no numerical parameters that were fit to the data.
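As an illustration of the held-out logic, here is a minimal k-fold cross-validation sketch in Python. The one-parameter linear model and the toy data are assumptions chosen to keep the example short; they have nothing to do with our model, which had no fitted parameters to validate in the first place.

```python
def fit_slope(xs, ys):
    """Least-squares slope for the no-intercept model y = slope * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def mean_squared_error(slope, xs, ys):
    """Average squared prediction error of y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_error(xs, ys, k):
    """Fit on k-1 folds, score on the held-out fold, and average the scores."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        training = [i for i in range(n) if i % k != fold]
        # The parameter is tuned only on the training items.
        slope = fit_slope([xs[i] for i in training], [ys[i] for i in training])
        fold_errors.append(
            mean_squared_error(slope, [xs[i] for i in held_out],
                               [ys[i] for i in held_out])
        )
    return sum(fold_errors) / k


if __name__ == "__main__":
    # Toy data, NOT from any experiment.
    toy_xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    toy_ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
    print(k_fold_error(toy_xs, toy_ys, k=3))
```

If the held-out error is about as low as the error on the training items, the fitted parameter is not just tracking noise in one particular sample.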

Problem: post-hoc model tweaking -> Solution 1: pre-registration. Another name for overfitting – when it concerns the researcher's choice of analytic model – is p-hacking. This is closer to what M&D say: Maybe we changed details of the model after seeing the data, in order to achieve a good fit. But that's not true in this case. As datacolada says, when someone accuses you of p-hacking, the right response is to say "I decided in advance." In this case, we did decide in advance – the Luce choice rule was used in our 2009 CogSci proceedings paper with a predecessor model and a large, independent dataset.*

Problem: post-hoc model tweaking -> Solution 2: direct replication. A second response to questions about post-hoc model choice is direct replication. Both we and at least one other group that I know of have done direct replications of this study – it was a very simple MTurk survey, so it is quite easy to rerun with essentially no modifications (if you are interested, the original materials are here**). The data look extremely similar.*** So again, our model really wasn't tweaked to the particulars of the dataset we collected on our task.

What is the critique, then? I suspect that M&D are annoyed about the fact that FG2012 proposed a model of pragmatic reasoning and tested it on only one particular task (which it fit well). We didn't show that our model generalized to other pragmatic reasoning tasks, or other social cognition tasks more broadly. So the real issue is about the specificity of the model for this experiment vs. the broader empirical coverage it offers.

In their response, M&D claim to offer two different critiques: "model selection" (that's the one we've been discussing) and "task selection" (the claim that Bayesian modelers choose to describe the subset of phenomena that their models fit, but omit other evidence in the discussion). In light of the discussion above, I don't see these as two different points at all. "Model selection," while implying all sorts of bad things like overfitting and p-hacking, in this case is actually a charge that we need to use our models to address more different tasks. And if the worst thing you can say about a model is, "it's so good on those data, you should apply it to more stuff," then you're in pretty good shape.

* Confession: I've actually used Luce choice in basically every single cognitive model I've ever worked on. At least every one that required linkage between a probability distribution and an N-alternative forced-choice task.

** Contact me if you want to do this and I can explain some idiosyncrasy in the notation we used.

*** I'm not sure how much I can share about the other project (it was done by a student at another institution whose current contact info I don't have) but the result was extremely close to ours. Certainly there were no differences that could possibly inspire us to reject our choice rule.

(Thanks to Noah Goodman, Gary Marcus, and Ernie Davis for reading and commenting on a draft of this post).

Friday, October 17, 2014

Semantic bleaching in early language

M is now 15 months, and her receptive vocabulary is quite large. She knows her vehicles, body parts, and a remarkable number of animals (giraffe, hyena, etc.).  Not coincidentally, we spend a large amount of our time together reading books about vehicles, body parts, and animals – at her initiation. Her productive language is also proceeding apace. As one friend astutely observed, she has a dozen words, nearly all of them "ba."*

I've noticed something very interesting over the last three months. Here's one example. When M first discovered the word "da," she used it for several days, with extreme enthusiasm, in a way that seemed identical to the word "dog." The form-function mapping was consistent and distinctive: it would be used for dogs and pictures of dogs, but nothing else. But then over the course of subsequent days, it felt like this word got "bleached" of meaning – it went from being "dog" to being "wow, cool!"

The same thing happened with her first extended experience with cats, at a visit to my parents' apartment. She started producing something that sounded like "tih" or "dih" – very clearly in response to the cats. But this vocalization then gradually became a noise of excitement that was seemingly applied to things that were completely different. Perhaps not coincidentally, our visit was over and we didn't have any cat-oriented books with us, so she couldn't use the word referentially. Now that we're back in the land of cat books, the word is back to having a "real" meaning.

This looks to me to be a lot like the phenomenon of "semantic bleaching," where words' meanings get gradually broadened to incorporate other meanings (like the loss of the "annus" – year – part of the meaning of anniversary). This kind of bleaching typically happens over a much longer timescale as part of grammaticalization, the process by which content words can become particles or affixes (e.g. as in the content verb "go" becoming a particle you can use to describe things in the future like "going to X"). But maybe it is happening very quickly due to the intense communicative pressure on M's extremely small lexicon?

The idea here would be that M only has a very small handful of words. If they don't fit the current situation, she adapts them. But mostly the communicative context she's adapting to is, "hey, look at that!" or "let me grab that!" So the process of bleaching words out from "dog" to something more like "cool!" could actually be a very adaptive strategy.

I've checked the research literature and haven't found anything like this described. Any readers know more?

* It's really more like ball ("ba"), balloon ("ba", but possibly just a ball), bus ("ba"), perhaps baby ("ba") and bottle ("ba") as well, dog ("da"),  yes ("dah"), cat/kittie ("dih"), truck ("duh"), daddy ("dada"), yum ("mum-mum"), more ("muh"), hi ("ha-e"), and bye ("ba-e"). Some of these are speculative, but I think this is a pretty good estimate.

Friday, September 19, 2014

Probabilistic pragmatics bibliography

Pragmatics is the study of human communication in context. A tremendous amount of experimental and theoretical work has been done on pragmatics since Grice's seminal statement of the cooperative principle. In recent years, a number of people have been working on a new set of formal models of pragmatics, using probabilistic methods and approaches from game theory to quantify human pragmatic reasoning. 

This post is an incomplete bibliography of some of the recent work following this approach. My goal in compiling this bibliography is primarily personal: I want to keep track of this growing literature and the different branches it's taken. I've primarily included research that is either formal/computational in nature, or based directly on formal models. Please let me know in the comments or by email if you have work that you would like added here.
Probabilistic Models and Experimental Tests
One flaw in this literature is that right now there's no one good paper to look at for an intro. The first paper on this list is (IMO) a good introduction, but it's only a page long, so if you want details you have to look elsewhere. 
Game Theoretic Approaches
This section is a very incomplete list of some of the great work on this topic in the game theory tradition. Note that Michael Franke is someone different from me.
Extensions to Other Phenomena
Many of these models have been applied primarily to reference resolution but many other linguistic phenomena seem amenable to the probabilistic pragmatics approach.
Connections to Language Acquisition
Connections with Pedagogy and Teaching
There are many interesting and as-yet-unexplored connections between pragmatics and teaching. 

Wednesday, September 10, 2014

Sharing research using RMarkdown

(An example of using R Markdown to do chunk-based analysis, from this tutorial.)

This last year has brought some very positive changes in the way my lab works with and shares data. As I've mentioned in previous posts (here and here), we have adopted the version control tool git and the site github for collaborating and sharing data both within the lab and outside it. I'm very pleased to report that nearly all of our publications for 2014 have code and data openly shared through github links.

In the course of using this ecosystem, however, I've come to think that it's still not perfect for collaboration. In particular, in order to view analysis results from a collaborator or student, I need to clone the repository and run all of their analyses, regenerating their figures and working out what they were intending in their code. For simple projects, this isn't so bad. But for anything that requires a modicum of data analysis, it really doesn't work very well. For example, I shouldn't have to rerun all the data munging for an eye-tracking project on my local machine just to see the resulting graphs.

For that reason, we've started using R Markdown for writing our analyses and sharing them with collaborators. R Markdown is a method for writing chunks of code interspersed with simple formatted text. Plots, tables, etc. are inserted inline. This combo then can be rendered to HTML, PDF, or even Word formats. Here's a nice tutorial – the source of the sample image above. The basics are so simple, it should only take about 5 minutes to get started. And all of this can be done from within RStudio, which is a substantially better IDE than the basic Mac R interface.*

Using R Markdown, we write our analyses in a (relatively) comprehensible way, explaining and delineating sections as necessary. We then can compile these to HTML and share them using RPubs, a service that is currently integrated with the R Markdown functionality in RStudio.** That way we can just send links to one another (and we can republish and update with new analyses as needed).

Overall, this workflow means that we have full version control over all of our analyses (via git), but also have a friendly way to share with time-crunched or less tech-savvy collaborators. And critically, the overhead to swap to this way of working has been almost nonexistent. Several of our students in the CSLI undergraduate summer internship program this summer completed projects where all their data analysis was done this way. No ecosystem is perfect, but this one is a nice balance between reproducibility and openness on the one hand and ease of use on the other.

* I can't help mentioning that it would be nice if the internal plotting window was a quartz window that could save vector PDFs. The quartz() workaround is very ugly when you are working in OS X full-screen mode.

** Right now, all RPubs documents are shared publicly, but that's not such a big deal if you're used to working in a primarily public ecosystem using github anyway.

Thursday, August 28, 2014

More on the nature of first words

About two weeks ago, M – now 13 months old – started using "dada" to refer to me. She has been producing "da" and "dada" as part of her babble for quite a while, but this was touching and new. It's a wonderful moment when your daughter first calls to you using language, not just a wordless cry.

Of course, congruent with what happened with "brown bear," I haven't heard much "dada" in about a week. She still seems to understand it (and likely did before producing it), but the production really seems to come and go with these first words. Now she's big into balls and appears to produce the sequence "BA(l)" pretty consistently while pointing to them. (I'm writing "BA(l)" because there's a hint of a liquid at the end, in contrast to the punctate "ba" that she uses for dogs and birds that we see at the park).

I want to comment on something neat that happened, though. In the very first day of M's "dada" production, we saw two really interesting novel uses of the word, both supporting my previous discussion about the flexibility of early language.

The first use was during a game we often play with M where she unpacks and repacks all the cards in my wallet. A couple of years ago, I lost my credit cards several times, and the bank started putting my photo on my card. (I think they do this for folks who are at high risk for identity theft). During the wallet-unpacking game, M took one of the cards, pointed to the photo of me (a small, blurry, old photo at that), and said "dada."

Kids do understand and recognize photos and other depictions early in life. My favorite piece of evidence for children's picture understanding is a beautiful old study by Hochberg & Brooks (1962). They found that their own child, after being deprived of access to drawings and photos until the age of 19 months, nevertheless had very good recognition of objects he knew from both kinds of images, the very first time he saw them.* M's generalization of "dada" to my photo thus might not be completely surprising, but it certainly supports the idea that the word was never dependent on me actually being there.

The second example, reported by my wife, is even more striking. When I had stepped out of the house for a moment, M pointed to the bedroom door where I had been and said "dada" – as though she was searching for me. This kind of displacement – use of language to describe something that is absent – is argued to be a critical design feature of language in a really nice, under-appreciated article by Hockett (1960). Some interesting experiments suggest that even toddlers can use language to learn about unseen events, but I don't know about systematic studies of the use of early words to express displaced meanings. M's use of "dada" to refer to my absence (or perhaps to question whether I was present but unseen) suggests that she already is able – in principle – to use language in this way.

More broadly, in watching these first steps into language I am stunned by the disconnect between comprehension and production. Production is difficult and laborious: M accomplishes something like "brown bear" or "dada" but then quickly forgets or loses interest in what she has learned.** But the core understanding of how language works seems much more mature than I ever would have imagined. For M, the place where she shows the most ability is in understanding language as a signal of future action. When we say "diaper time" or "would you like something to eat?" she apparently takes these as signals to initiate the routine, and toddles over to the changing pad or dinner table. But when we're in the middle of the routine, saying "diaper" doesn't inspire her to point to her diaper.

Again and again I am left with the impression of a mind that quickly apprehends the basic framework assumptions of the physical and social world, even as carrying out the simplest actions using that knowledge remains extraordinarily difficult.


* Needless to say, this was an epic study to conduct. Drily, H&B write in their paper that “the constant vigilance and improvisation required of the parents proved to be a considerable chore from the start—further research of this kind should not be undertaken lightly.”

** On a behaviorist account of early language, M would never forget "dada" – I was so overjoyed that I probably offered more positive reinforcement than she could even appreciate.

(minor updates and typo fixes 7/29)

Friday, August 15, 2014

Exploring first words across children

(This post is joint with Rose Schneider, lab manager in the Language and Cognition Lab.)

For toddlers, the ability to communicate using words is an incredibly important part of learning to make their way in the world. A friend's mother tells a story that probably resonates with a lot of parents. After getting more and more frustrated trying to figure out why her son was insistently pointing at the pantry, she almost cried when he looked straight at her and said, “cookie!” She was so grateful for the clear communication that she gave him as many cookies as he wanted.

We're interested in early word learning as a way to look into the emergence of language more broadly. What does it take to learn a word? And why is there so much variability in the emergence of children's language, given that nearly all kids end up with typical language skills later in childhood?

One way into these questions is to ask about the content of children's first words. Several studies have looked at early vocabulary (e.g. this nice one that compares across cultures), but – to our knowledge – there is not a lot of systematic data on children's absolute first word.* The first word is both a milestone for parents and caregivers and also an interesting window into the things that very young children want to (and are able to) talk about.

To take a look at this issue, we partnered with Children’s Discovery Museum of San Jose to do a retrospective survey of children's first word. We're very pleased that they were interested in supporting this kind of developmental research and were willing to send our survey out to their members! In the survey, we were especially interested in content words, rather than names for people, so for this study, we omitted "mama" and "dada" and their equivalents. (There are lots of reasons why parents might want these particular words to get produced – and to spot them in babble even when they aren't being used meaningfully).

We put together a very short online questionnaire and asked about the child's first word, the situation it occurred in, the age of the child, the age of the utterance, and the child's current age and gender. The survey generated around 500 responses, and we preprocessed the data by translating words into English (when we had a translation available) and categorizing the words by the MacArthur-Bates Communicative Development Inventory (CDI) classification, a common way to group children's vocabulary into basic categories. We did our data analysis in R using ggplot2, reshape2, and plyr (specifically ddply).

Here's the graphic we produced for CDM:

We were struck by a couple of features of the data, and the goal of this post is to talk a bit more about these, as well as some other things that didn't fit in the graphic.

First, the distribution of words seemed pretty reasonable, with short, common words for objects ("ball," "car"), animals ("dog," "duck" – presumably from bathtime), and social routines ("hi"). The gender difference between "ball" and "hi" was also striking, reflecting some gender stereotypes – and some data – about girls' greater social orientation in infancy. Of course, we can't say anything about the source of such differences from these data!

Another interesting feature of the data was the age distribution we observed. On parent report forms like the CDI, parents often report that their children understand many words even in infancy, with the 75th percentile being reported to know 50 words at 8 months. While there is some empirical evidence for word knowledge before the first birthday, this 50 word number has always been surprising, and no one really knows how much wishful thinking it includes. The production numbers for the CDI are much lower, but still have a median value above zero for 10-month-olds. So is this overestimation? Probably depends on your standards. M, Mike's daughter, had something "word-like" at 10 months, but is only now producing "hi" as a 12-month-old (typical girl).

One possible confound in this reporting would be parents misremembering the age at which their child first produced a word, perhaps reporting systematically younger or older ages (or even ages rounded more towards the first birthday) as the first word recedes into the past. We didn't find evidence of this, however. The distribution of reported age of first word was the same regardless of how old the child was at the time of reporting:

Now on to some substantive analyses that didn't make it into the graphic. Grouping our responses into broad categories is a good way to explore what classes of objects, actions, etc., were the referents of first words. While many of the words we observed in parents’ responses were on the CDI, we had to classify some others ad-hoc, and still others we were unable to classify (we ended up excluding about 50 for a total sample of 454, 42% female). Here's a graph of the proportions in each category:
So no individual animal name dominated, but overall they were most frequent, followed by "games and routines" (including social routines like "hi" and "bye") and toys. People were next, followed by animal sounds.
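For the curious, the tallying step behind a graph like this is simple. Here is a minimal sketch in Python (our actual analysis was done in R). The word-to-category lookup is a tiny hypothetical stand-in for the CDI classification, and the example responses are made up, not drawn from the survey.

```python
from collections import Counter

# Hypothetical lookup, standing in for the full CDI category assignment.
WORD_TO_CATEGORY = {
    "dog": "animals",
    "duck": "animals",
    "ball": "toys",
    "car": "vehicles",
    "hi": "games and routines",
    "bye": "games and routines",
}


def category_proportions(first_words, word_to_category):
    """Proportion of responses falling in each category, most frequent first."""
    if not first_words:
        return {}
    counts = Counter(
        word_to_category.get(word.strip().lower(), "unclassified")
        for word in first_words
    )
    total = sum(counts.values())
    return {category: n / total for category, n in counts.most_common()}


if __name__ == "__main__":
    # Made-up responses for illustration only.
    toy_responses = ["Dog", "ball", "hi", "duck", "car", "dog"]
    for category, proportion in category_proportions(
        toy_responses, WORD_TO_CATEGORY
    ).items():
        print(f"{category}: {proportion:.2f}")
```

In this sketch, words missing from the lookup are kept under an "unclassified" label so they stay visible; in our analysis, responses we couldn't classify were excluded instead.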

There are some interesting ways to break this down further. Note that girls generally are a few months ahead, language-wise, so analyses of age and gender are a bit confounded. Here's our age distribution broken down by gender:
As expected, we see girls a bit over-represented in the youngest age bin and boys a little bit over-represented in the oldest bin.

That said, here are the splits by age:
and gender:
Overall, younger kids are similar to older kids, but are producing more names for people. Older kids were producing slightly more vehicle names and sounds, but this may be because the older kids skew more male (see gender graph, where vehicles are almost exclusively the province of male babies). The only big gender trends were 1) a preference for toys and action words for the males and 2) a generally broader spread across different categories. This second trend could be a function of boys' tendency to have more idiosyncratic interests (in childhood at least, perhaps beyond).

Overall, these data give us a new way to look at early vocabulary, not at the shape of semantic networks within a single child, but at the variability of first words across a large population. We invite you to look at the data if you are interested! 

Thanks very much to Jenni Martin at CDM for her support of our research!

* What does that even mean? Is a word a word if no one understands or recognizes it? That seems pretty philosophically deep, but hard to assess empirically. We'll go with the first word that someone else, usually a parent, recognized as being part of communicating (or trying to communicate). 

Monday, August 11, 2014

Getting latex to work at journals

I got good news about a manuscript being accepted today, but I was reminded again how painful it can be to get journals to accept LaTeX source. Sometimes I wonder if I waste as much time wrangling journals as I save in writing by using TeX and BibTeX.

I routinely have manuscripts bounced back by editorial assistants who ask "Why can't I edit your PDF? Can you send me the Word source?" Hopefully policies like Elsevier's Your Paper Your Way and the PNAS equivalent, "express submission," will promote a new norm for first submissions.

For my own memory as much as anyone else's, here are some tips for getting Elsevier's maddening EES to accept TeX source:

  • Uploading an archive with all files has never worked for me, so I skip this step
  • Use elsarticle.cls and the included model5-names.bst for formatting and APA-style references
  • Upload both .bib AND .bbl files as supplementary information (this was the tricky one!) – why would this be necessary?
  • Upload all figures separately; PDF format has worked for me though EPS is requested.
Also, if you have uploaded a version of a file and then want to replace it, be careful to rename it. EES keeps the oldest version of a file so it will not update if you upload a newer version (totally idiotic).
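
For reference, here is a minimal sketch of a manuscript skeleton consistent with the tips above. The file names (refs.bib, figure1.pdf), the citation key, and the placeholder title and author are hypothetical, and the class options are just one plausible choice, not something EES requires. Running bibtex on a file like this is what produces the .bbl file that gets uploaded alongside the .bib.

```latex
% Sketch only: refs.bib, figure1.pdf, and key2014 are hypothetical names.
\documentclass[preprint,12pt,authoryear]{elsarticle}
\usepackage{graphicx}

\journal{Journal Name}

\begin{document}

\begin{frontmatter}
\title{Manuscript title}
\author{Author Name}
\address{Affiliation}
\begin{abstract}
Abstract text.
\end{abstract}
\end{frontmatter}

\section{Introduction}
Some text with an author-year citation \citep{key2014}.

% Each figure lives in its own file so it can be uploaded separately.
\begin{figure}
\includegraphics[width=\textwidth]{figure1.pdf}
\caption{Figure caption.}
\end{figure}

% model5-names.bst ships with elsarticle and gives APA-style references.
\bibliographystyle{model5-names}
\bibliography{refs}

\end{document}
```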

Wednesday, July 30, 2014

I would have shocked myself

(One of my favorite xkcd cartoons. According to recent research, maybe we're all scientists? At least, we keep pulling the lever over and over again...)

Do people hate being alone with their thoughts so much that they will shock themselves to avoid thinking? In a series of studies, Wilson et al. (2014) asked people to spend time quietly thinking without distractions and concluded that people generally found the experience aversive. Others have reinterpreted this conclusion, but the general upshot is that people at least didn't find it particularly pleasurable. Several manipulations (e.g. planning the thing you were going to think about) also didn't make the experience much better. Overall this is a very interesting paper, and I really appreciated the emphasis on a behavior – mind wandering – that has received much more attention in the neuroscience literature than in psychology.

The part of the paper that got the most attention, however, was a study that measured whether people would give themselves an electric shock while they were supposed to be quietly thinking. The authors shocked the participants once and then checked to make sure that the participants found it sufficiently aversive to say they would pay money to avoid having it done to them again. But then when the participants were left to think by themselves, around two thirds of men and a quarter of women went and shocked themselves again, often several times. The authors interpret this finding as follows:
What is striking is that simply being alone with their own thoughts... was apparently so aversive that it drove many participants to self-administer an electric shock that they had earlier said they would pay to avoid.
Something feels wrong about this interpretation as it stands. Here are my two conflicting intuitions. First, I actually like being alone with my own thoughts. I sometimes consciously create time to space out and think about a particular topic, or about nothing at all. Second, I am absolutely certain I would have shocked myself.

I would have shocked myself at least once, but possibly five or more times. I might even have been the guy who shocked himself 190 times and had to get excluded from the study. Even though I would have said I'd pay money to avoid having someone else shock me, I definitely would have done it to myself. Why? I don't really know.

There are many sensations that I would pay money not to have someone do to me: stick a paperclip under my fingernail, pluck out hairs from my beard, bite my nail to the quick. Yet I will sometimes do these things to myself, even though they are painful and I will regret it afterwards. The exploration of small, moderately painful stimuli is something that I do on a regular basis. (Other people do these things too.) I am not sure why I do them, but I don't think it's because I hate being bored so much that I would rather be in pain.

Boredom and pain are not zero sum, in other words. Pain can drive boredom away, but the two can coexist as well. I don't do these things when I'm engaged in something else like reading the internet on my phone. But I do actually do them on a regular basis when I'm listening to a talk or thinking about a complicated paper.

I don't know why I cause myself minor pain sometimes. But it feels like there are at least two component reasons. One is some kind of automatic exploration (they happen when my mind is otherwise occupied, as the examples above show). But I also do these sorts of things in part because I want to see how they feel. Kind of like ripping off a hangnail or playing with a sharp knife. There's some novelty seeking involved, but doing them again and again isn't quite about novelty seeking; we've all had a hangnail or pulled out a hair. Perhaps it's about the exact sensation and the predictions we make – will it feel better or worse? Can I predict exactly what it will be like?

What I'm arguing is that these things are mysterious on any view of humans as rational agents. The Wilson paper doesn't sufficiently acknowledge this mystery, instead choosing to treat people as purely rational: they said they would pay to avoid X, but then they do X anyway, so it must be because X is better than Y. But there isn't a direct, utility-theoretic tradeoff between mind-wandering and electric shock. Consider if Wilson et al. had played Enya to participants and found they shocked themselves (which I bet I would have). Would they then conclude that Enya is so bad that people shock themselves to get away from her?

Wilson TD, Reinhard DA, Westgate EC, Gilbert DT, Ellerbeck N, Hahn C, Brown CL, & Shaked A (2014). Social psychology. Just think: the challenges of the disengaged mind. Science, 345, 75-77. PMID: 24994650

Tuesday, July 29, 2014

Are first words bound to specific contexts?

("Internet high five." From

It's so fun to watch the emergence of language. M just had her first birthday, and – though we still haven't seen much in the way of production beyond "brown bear" (see previous post) and maybe "yum" – she's starting to show some exciting signs of knowing some words. It's endlessly fascinating to gather evidence about her comprehension, but I'm continuously amazed at how tenuous my evidence is for any given word.*

In particular, I've been wondering for the past week or so whether M knows the meaning of the word/phrase "high five." She loves the swings at the playground, and really enjoys playing games while swinging. One day we started doing hand slaps (accompanied by me saying "high five"). After a couple of times playing this game, when I said "high five," she would raise her hands, even without the extra cue of me raising my hands. Word knowledge, right?

It turns out that one persistent question about first words is how contextually-bound they are: whether their meanings are general across contexts, or whether they apply only in specific cases. Some of this is a remnant of older, behaviorist analyses of early language – word A is a conditioned response to situation B – which don't seem to account for the data. Most people who study child language agree that early nouns like "dog" can be generalized across situations quite handily – in fact, overgeneralization is relatively common. But you still see references to "context-specific" language in textbooks and materials for parents (example, example). My goal here is to propose an alternative – rational – account of why much early language looks context specific, even though it's not.

I can see why ideas about context-specific language stick around. When I investigated M's "high five" knowledge further, I was disappointed. Although I could get her to give me a high five on the swings, I simply couldn't elicit the gesture in response to my words when we came home to the house. This looked to me a lot like "high five" was bound to the context of the swing set.

But here's another possibility, in two parts. Part one: Language comprehension for a one-year-old is hard. A well-known set of experiments by Stager & Werker (1997) suggests that even relatively small attentional demands can disrupt the encoding of speech. In their experiments, 14-month-olds (and even 8-month-olds) could distinguish the sounds "bih" and "dih." But children of the same age had trouble learning to pair these sounds consistently with different pictures, even though they could do it just fine with more dissimilar words (e.g. "lif" and "neem").

Part two: When you have a hard comprehension task, context can make it easier. Contextual predictability effects have been very well studied in word recognition (example), with the caveat that context is typically defined as being the sentence in which a sound occurs. The basic idea is very Bayesian: a context creates a higher prior probability of a particular sound, which helps in identifying that sound from noisy perceptual input.
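
To make the Bayesian intuition concrete, here is a rough sketch in my own notation with made-up numbers (none of this comes from the studies mentioned above). Let w be a candidate word, s the noisy speech signal, and c the context, and assume the signal depends on the context only through the word. There are just two candidates: w_1 is "high five" and w_2 stands in for anything else.

```latex
% Sketch only: the symbols w, s, c and every number below are illustrative assumptions.
\[
  P(w \mid s, c) \;=\;
  \frac{P(s \mid w)\, P(w \mid c)}{\sum_{w'} P(s \mid w')\, P(w' \mid c)}
\]

% On the swings, noisy signal: P(s|w_1) = 0.6, P(s|w_2) = 0.4, prior P(w_1|c) = 0.5
\[
  P(w_1 \mid s, c_{\mathrm{swings}})
  = \frac{0.6 \times 0.5}{0.6 \times 0.5 + 0.4 \times 0.5} = 0.60
\]

% At home, same noisy signal, lower prior: P(w_1|c) = 0.1
\[
  P(w_1 \mid s, c_{\mathrm{home}})
  = \frac{0.6 \times 0.1}{0.6 \times 0.1 + 0.4 \times 0.9} \approx 0.14
\]

% At home, clearer signal s': P(s'|w_1) = 0.95, P(s'|w_2) = 0.05
\[
  P(w_1 \mid s', c_{\mathrm{home}})
  = \frac{0.95 \times 0.1}{0.95 \times 0.1 + 0.05 \times 0.9} \approx 0.68
\]
```

On these toy numbers, the same noisy signal is enough to identify the word in a context where it is expected but not in one where it is rare, and a clearer signal can make up for the weaker prior.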

So perhaps contextual-boundedness effects in early child language have exactly the same source. When M recognizes "high five," it could be that she is getting a boost from its use in a familiar context, even if she could – in principle – recognize it in another context, given a sufficiently clear and unambiguous signal. Inspired by this idea, I tried asking her again at the house the other day. I said, "M! M! Can you give me a HIGH... FIVE?" in my best child-directed speech. She grinned and reached her hand up for the win. Of course, while I was figuring out my theory, perhaps she was generalizing...

* There are so many reasons why any uncontrolled individual test of comprehension doesn't provide good evidence for her knowledge.** For example, if I'm trying to figure out whether she knows the word "cat," I can't use a book where we have previously pointed to a cat photo, since she tends to come back to parts of the book we've attended to. On the other hand, if I find two new objects (a cat and a ball), typically one will be more exciting than the other. In some of our recent eye-tracking work, we've been finding that salience of this kind has an outsize effect on word recognition (echoing much earlier findings), and the best work on very early word knowledge explicitly measures and subtracts this salience bias.

** That's why experiments, I guess...