Sunday, October 26, 2014

Response to Marcus & Davis (2013)

tl;dr: We wrote a response to a critique of our work. Some more musings about overfitting.

Last year, Gary Marcus and Ernie Davis (M&D) wrote a piece in Psychological Science that was quite critical of probabilistic models of higher-level cognition. They asked whether such models are selectively fit to particular tasks to the exclusion of counter-evidence ("task selection") and whether the models are hand-tweaked to fit those particular tasks ("model selection"). On the basis of these critiques, they questioned whether probabilistic models are a robust framework for describing human cognition.

It's never fun to have your work criticized, but sometimes there is a lot that can be learned from these discussions. For example, in a previous exchange with Ansgar Endress (critique here, my response here, his response here), I got a chance to think through my attitude towards the notion of rationality or optimality. Similarly, Tom Griffiths and colleagues' response to another critique has some nice discussion of this issue.

In that spirit, a group of Bayesians whose work was mentioned in the critique have recently written a response letter that will be published in the same journal as M&D's critique (after M&D get a chance to reply). Our response is very short, but hopefully it captures our attitude towards probabilistic models as being a relevant and robust method – not the only one, but one that has shown a lot of recent promise – for describing higher-level cognition. Here I want to discuss one thing that got compressed in the response, though.

One of the pieces of work M&D critiqued was the model of pragmatic reasoning that Noah Goodman and I published a couple of years ago (I'll call that the FG2012 model). Our one-page paper reported only a single study with a very restricted stimulus set, but there is actually a substantial amount of recent work on this topic that suggests such models do a good job at describing human reasoning about language in context; I posted a bibliography of such work a little while ago.

M&D criticized a particular modeling decision that we took in FG2012 – the use of a Luce choice rule to approximate human judgments. They pointed out that other choices (which could also have been justified a priori) would have fit the data much worse. Summarizing their critique, they wrote:
"Individual researchers are free to tinker, but the collective enterprise suffers if choices across domains and tasks are unprincipled and inconsistent. Models that have been fit only to one particular set of data have little value if their assumptions cannot be verified independently; in that case, the entire framework risks becoming an exercise in squeezing round pegs into square holes." (p. 2357)
I agree with this general sentiment, and have tried in much of my work to compare models from different traditions across multiple datasets and experiments using the same fitting procedures and evaluation standards (examples here, here, and here). But I don't think the accusation is fair in the case of FG2012. I'll go through a couple of specific interpretations of the critique that don't hold in our case, and then argue that in fact the "model selection" argument is really just a repetition of the (supposedly separate) "task selection" argument.

Problem: numerical overfitting -> Solution: cross-validation. A very specific critique that M&D could be leveling at us is that we are overfitting – that is, that we tuned our model post hoc to fit our particular dataset. In a standard case of overfitting, say for a classification problem, the remedy is to evaluate the model on held-out data that wasn't used for parameter tuning. If the model performs just as well on the held-out data, then it's not overfit. Our (extremely simple) model was clearly not overfit in this sense, however: it had no numerical parameters that were fit to the data.
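As an aside, it's easy to see why the basic Luce choice rule brings no fitted parameters along with it. Here's a minimal sketch in R; the referent names and scores are made up for illustration, and this is not our actual analysis code:

# Luce choice rule: choice probabilities are proportional to response strengths.
# When the strengths are a model's probabilities over referents, the basic rule
# has no free numerical parameters to tune.
luce_choice <- function(strengths) {
  strengths / sum(strengths)
}

# toy example: hypothetical model scores for three referents
model_scores <- c(blue_square = 2, blue_circle = 1, green_square = 1)
luce_choice(model_scores)
# predicted choice proportions: .50, .25, .25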

Problem: post-hoc model tweaking -> Solution 1: pre-registration. Another name for overfitting – when it concerns the researcher's choice of analytic model – is p-hacking. This is closer to what M&D say: maybe we changed details of the model after seeing the data, in order to achieve a good fit. But that's not true in this case. As datacolada says, when someone accuses you of p-hacking, the right response is to say "I decided in advance." In this case, we did decide in advance – the Luce choice rule was used in our 2009 CogSci proceedings paper with a predecessor model and a large, independent dataset.*

Problem: post-hoc model tweaking -> Solution 2: direct replication. A second response to questions about post-hoc model choice is direct replication. Both we and at least one other group that I know of have done direct replications of this study – it was a very simple MTurk survey, so it is quite easy to rerun with essentially no modifications (if you are interested, the original materials are here**). The data look extremely similar.*** So again, our model really wasn't tweaked to the particulars of the dataset we collected on our task.

What is the critique, then? I suspect that M&D are annoyed about the fact that FG2012 proposed a model of pragmatic reasoning and tested it on only one particular task (which it fit well). We didn't show that our model generalized to other pragmatic reasoning tasks, or other social cognition tasks more broadly. So the real issue is about the specificity of the model for this experiment vs. the broader empirical coverage it offers.

In their response, M&D claim to offer two different critiques: "model selection" (the one we've been discussing) and "task selection" (the claim that Bayesian modelers choose to describe the subset of phenomena that their models fit, but omit other evidence in the discussion). In light of the discussion above, I don't see these as two different points at all. "Model selection," while implying all sorts of bad things like overfitting and p-hacking, in this case is really a charge that we need to apply our models to a broader range of tasks. And if the worst thing you can say about a model is, "it's so good on those data, you should apply it to more stuff," then you're in pretty good shape.

---
* Confession: I've actually used Luce choice in basically every single cognitive model I've ever worked on – at least every one that required a linkage between a probability distribution and an N-alternative forced-choice task.

** Contact me if you want to do this and I can explain some idiosyncrasy in the notation we used.

*** I'm not sure how much I can share about the other project (it was done by a student at another institution whose current contact info I don't have) but the result was extremely close to ours. Certainly there were no differences that could possibly inspire us to reject our choice rule.

(Thanks to Noah Goodman, Gary Marcus, and Ernie Davis for reading and commenting on a draft of this post.)

Friday, October 17, 2014

Semantic bleaching in early language

M is now 15 months, and her receptive vocabulary is quite large. She knows her vehicles, body parts, and a remarkable number of animals (giraffe, hyena, etc.).  Not coincidentally, we spend a large amount of our time together reading books about vehicles, body parts, and animals – at her initiation. Her productive language is also proceeding apace. As one friend astutely observed, she has a dozen words, nearly all of them "ba."*

I've noticed something very interesting over the last three months. Here's one example. When M first discovered the word "da," she used it for several days, with extreme enthusiasm, in a way that seemed identical to the word "dog." The form-function mapping was consistent and distinctive: it would be used for dogs and pictures of dogs, but nothing else. But then over the course of subsequent days, it felt like this word got "bleached" of meaning – it went from being "dog" to being "wow, cool!"

The same thing happened with her first extended experience with cats, at a visit to my parents' apartment. She started producing something that sounded like "tih" or "dih" – very clearly in response to the cats. But this vocalization then gradually became a noise of excitement that was seemingly applied to things that were completely different. Perhaps not coincidentally, our visit was over and we didn't have any cat-oriented books with us, so she couldn't use the word referentially. Now that we're back in the land of cat books, the word is back to having a "real" meaning.

This looks to me a lot like the phenomenon of "semantic bleaching," in which words' meanings gradually weaken and broaden (like the loss of the "annus" – year – part of the meaning of "anniversary"). This kind of bleaching typically happens over a much longer timescale as part of grammaticalization, the process by which content words become particles or affixes (e.g., the content verb "go" becoming part of the future construction "going to X"). But maybe it is happening very quickly here due to the intense communicative pressure on M's extremely small lexicon?

The idea here would be that M only has a very small handful of words. If they don't fit the current situation, she adapts them. But mostly the communicative context she's adapting to is, "hey, look at that!" or "let me grab that!" So the process of bleaching words out from "dog" to something more like "cool!" could actually be a very adaptive strategy.

I've checked the research literature and haven't found anything like this described. Any readers know more?

---
* It's really more like ball ("ba"), balloon ("ba", but possibly just a ball), bus ("ba"), perhaps baby ("ba") and bottle ("ba") as well, dog ("da"),  yes ("dah"), cat/kittie ("dih"), truck ("duh"), daddy ("dada"), yum ("mum-mum"), more ("muh"), hi ("ha-e"), and bye ("ba-e"). Some of these are speculative, but I think this is a pretty good estimate.

Friday, September 19, 2014

Probabilistic pragmatics bibliography

Pragmatics is the study of human communication in context. A tremendous amount of experimental and theoretical work has been done on pragmatics since Grice's seminal statement of the cooperative principle. In recent years, a number of people have been working on a new set of formal models of pragmatics, using probabilistic methods and approaches from game theory to quantify human pragmatic reasoning. 

This post is an incomplete bibliography of some of the recent work following this approach. My goal in compiling this bibliography is primarily personal: I want to keep track of this growing literature and the different branches it's taken. I've primarily included research that is either formal/computational in nature, or based directly on formal models. Please let me know in the comments or by email if you have work that you would like added here.
Probabilistic Models and Experimental Tests
One flaw in this literature is that right now there's no single good introductory paper. The first paper on this list is (IMO) a good introduction, but it's only a page long, so if you want details you have to look elsewhere.
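In the meantime, here is a minimal sketch in R of the recursive reasoning that most of these models share, using a made-up two-word, three-object reference game. Real models differ in their lexica, priors, speaker utilities, and linking functions, so treat this as an illustration of the general recipe rather than of any particular published model:

# toy lexicon: rows = words, columns = objects; 1 means the word is true of the object
lexicon <- rbind(blue   = c(blue_square = 1, blue_circle = 1, green_square = 0),
                 square = c(blue_square = 1, blue_circle = 0, green_square = 1))
prior <- c(blue_square = 1/3, blue_circle = 1/3, green_square = 1/3)

normalize <- function(x) x / sum(x)

# literal listener: condition the prior on the literal meaning of the word
literal_listener <- t(apply(lexicon, 1, function(truth) normalize(truth * prior)))

# speaker: choose words in proportion to how well they pick out the intended object
speaker <- apply(literal_listener, 2, normalize)  # columns are intended objects

# pragmatic listener: Bayesian inference about the object, given the speaker model
pragmatic_listener <- t(apply(speaker, 1, function(s) normalize(s * prior)))

round(pragmatic_listener, 2)
# hearing "blue", the pragmatic listener favors the circle (.67 vs. .33),
# because a speaker who meant the square could have said "square" instead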
Game Theoretic Approaches
This section is a very incomplete list of some of the great work on this topic in the game-theoretic tradition. (Note: Michael Franke is a different person from me.)
Extensions to Other Phenomena
These models have been applied primarily to reference resolution, but many other linguistic phenomena seem amenable to the probabilistic pragmatics approach.
Connections to Language Acquisition
Connections with Pedagogy and Teaching
There are many interesting and as-yet-unexplored connections between pragmatics and teaching. 

Wednesday, September 10, 2014

Sharing research using RMarkdown

(An example of using R Markdown to do chunk-based analysis, from this tutorial.)

This last year has brought some very positive changes in the way my lab works with and shares data. As I've mentioned in previous posts (here and here), we have adopted the version control tool git and the site github for collaborating and sharing data both within the lab and outside it. I'm very pleased to report that nearly all of our publications for 2014 have code and data openly shared through github links.

In the course of using this ecosystem, however, I've come to think that it's still not perfect for collaboration. In particular, in order to view analysis results from a collaborator or student, I need to clone the repository and run all of their analyses, regenerating their figures and working out what they intended in their code. For simple projects, this isn't so bad. But for anything that requires more than a modicum of data analysis, it really doesn't work very well. For example, I shouldn't have to rerun all the data munging for an eye-tracking project on my local machine just to see the resulting graphs.

For that reason, we've started using R Markdown for writing our analyses and sharing them with collaborators. R Markdown is a format for writing chunks of code interspersed with simple formatted text; plots, tables, etc. are inserted inline. The result can then be rendered to HTML, PDF, or even Word. Here's a nice tutorial – the source of the sample image above. The basics are so simple that it should only take about five minutes to get started. And all of this can be done from within RStudio, which is a substantially better IDE than the basic Mac R interface.*
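To give a flavor of the format, here's a toy sketch of an R Markdown file; the data are simulated and the variable names invented, just to show the chunk syntax:

---
title: "Pilot analysis (toy example)"
output: html_document
---

Looking time by condition for a simulated pilot dataset.

```{r}
library(ggplot2)
d <- data.frame(condition = rep(c("label", "no label"), each = 20),
                looking = c(rnorm(20, mean = 5), rnorm(20, mean = 4)))
aggregate(looking ~ condition, data = d, FUN = mean)
qplot(condition, looking, data = d, geom = "boxplot")
```

Knitting this file (for example, with the Knit button in RStudio) produces an HTML page with the text, the condition means, and the boxplot all inline.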

Using R Markdown, we write our analyses in a (relatively) comprehensible way, explaining and delineating sections as necessary. We can then compile these to HTML and share them using RPubs, a service that is currently integrated with the R Markdown functionality in RStudio.** That way we can just send links to one another (and we can republish and update with new analyses as needed).

Overall, this workflow means that we have full version control over all of our analyses (via git), but also have a friendly way to share with time-crunched or less tech-savvy collaborators. And critically, the overhead to swap to this way of working has been almost nonexistent. Several of our students in the CSLI undergraduate summer internship program this summer completed projects where all their data analysis was done this way. No ecosystem is perfect, but this one is a nice balance between reproducibility and openness on the one hand and ease of use on the other.

----
* I can't help mentioning that it would be nice if the internal plotting window was a quartz window that could save vector PDFs. The quartz() workaround is very ugly when you are working in OS X full-screen mode.

** Right now, all RPubs documents are shared publicly, but that's not such a big deal if you're used to working in a primarily public ecosystem using github anyway.

Thursday, August 28, 2014

More on the nature of first words

About two weeks ago, M – now 13 months old – started using "dada" to refer to me. She has been producing "da" and "dada" as part of her babble for quite a while, but this was touching and new. It's a wonderful moment when your daughter first calls to you using language, not just a wordless cry.

Of course, congruent with what happened with "brown bear," I haven't heard much "dada" in about a week. She still seems to understand it (and likely did before producing it), but the production really seems to come and go with these first words. Now she's big into balls and appears to produce the sequence "BA(l)" pretty consistently while pointing to them. (I'm writing "BA(l)" because there's a hint of a liquid at the end, in contrast to the punctate "ba" that she uses for dogs and birds that we see at the park).

I want to comment on something neat that happened, though. In the very first day of M's "dada" production, we saw two really interesting novel uses of the word, both supporting my previous discussion about the flexibility of early language.

The first use was during a game we often play with M where she unpacks and repacks all the cards in my wallet. A couple of years ago, I lost my credit cards several times, and the bank started putting my photo on my card. (I think they do this for folks who are at high risk for identity theft). During the wallet-unpacking game, M took one of the cards, pointed to the photo of me (a small, blurry, old photo at that), and said "dada."

Kids do understand and recognize photos and other depictions early in life. My favorite piece of evidence for children's picture understanding is a beautiful old study by Hochberg & Brooks (1962). They found that their own child, after being deprived of access to drawings and photos until the age of 19 months, nevertheless showed very good recognition of objects he knew, from both kinds of images, the very first time he saw them.* M's generalization of "dada" to my photo thus might not be completely surprising, but it certainly supports the idea that the word was never dependent on me actually being there.

The second example, reported by my wife, is even more striking.  When I had stepped out of the house for a moment, M pointed to the bedroom door where I had been and said "dada" – as though she was searching for me. This kind of displacement – use of language to describe something that is absent  – is argued to be a critical design feature of language in a really nice, under-appreciated article by Hockett (1960). Some interesting experiments suggest that even toddlers can use language to learn about unseen events, but I don't know about systematic studies of the use of early words to express displaced meanings. M's use of "dada" to refer to my absence (or perhaps to question whether I was present but unseen) suggests that she already is able – in principle – to use language in this way.

More broadly, in watching these first steps into language I am stunned by the disconnect between comprehension and production. Production is difficult and laborious: M accomplishes something like "brown bear" or "dada" but then quickly forgets or loses interest in what she has learned.** But the core understanding of how language works seems much more mature than I ever would have imagined. The place where M shows the most ability is in understanding language as a signal of future action. When we say "diaper time" or "would you like something to eat?" she apparently takes these as signals to initiate the routine, and toddles over to the changing pad or dinner table. But when we're in the middle of the routine, saying "diaper" doesn't inspire her to point to her diaper.

Again and again I am left with the impression of a mind that quickly apprehends the basic framework assumptions of the physical and social world, even as carrying out the simplest actions using that knowledge remains extraordinarily difficult.

----

* Needless to say, this was an epic study to conduct. Drily, H&B write in their paper that “the constant vigilance and improvisation required of the parents proved to be a considerable chore from the start—further research of this kind should not be undertaken lightly.”

** On a behaviorist account of early language, M would never forget "dada" – I was so overjoyed that I probably offered more positive reinforcement than she could even appreciate.

(minor updates and typo fixes 7/29)

Friday, August 15, 2014

Exploring first words across children

(This post is joint with Rose Schneider, lab manager in the Language and Cognition Lab.)

For toddlers, the ability to communicate using words is an incredibly important part of learning to make their way in the world. A friend's mother tells a story that probably resonates with a lot of parents. After getting more and more frustrated trying to figure out why her son was insistently pointing at the pantry, she almost cried when he looked straight at her and said, “cookie!” She was so grateful for the clear communication that she gave him as many cookies as he wanted.

We're interested in early word learning as a way to look into the emergence of language more broadly. What does it take to learn a word? And why is there so much variability in the emergence of children's language, given that nearly all kids end up with typical language skills later in childhood?

One way into these questions is to ask about the content of children's first words. Several studies have looked at early vocabulary (e.g. this nice one that compares across cultures), but – to our knowledge – there is not a lot of systematic data on children's absolute first word.* The first word is both a milestone for parents and caregivers and also an interesting window into the things that very young children want to (and are able to) talk about.

To take a look at this issue, we partnered with Children’s Discovery Museum of San Jose to do a retrospective survey of children's first word. We're very pleased that they were interested in supporting this kind of developmental research and were willing to send our survey out to their members! In the survey, we were especially interested in content words, rather than names for people, so for this study, we omitted "mama" and "dada" and their equivalents. (There are lots of reasons why parents might want these particular words to get produced – and to spot them in babble even when they aren't being used meaningfully).

We put together a very short online questionnaire and asked about the child's first word, the situation it occurred in, the child's age when the word was produced, and the child's current age and gender. The survey generated around 500 responses, and we preprocessed the data by translating words into English (when we had a translation available) and categorizing the words by the MacArthur-Bates Communicative Development Inventory (CDI) classification, a common way to group children's vocabulary into basic categories. We did our data analysis in R using ggplot2, reshape2, and plyr (ddply).
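For the curious, here is a minimal sketch of the kind of aggregation we did. The column names and the tiny dataset below are invented stand-ins for the real survey responses:

library(plyr)
library(ggplot2)

# toy stand-in for the survey data: one row per child's first word
d <- data.frame(word = c("ball", "dog", "hi", "duck", "car", "kitty"),
                category = c("toys", "animals", "games and routines",
                             "animals", "vehicles", "animals"))

# count words in each CDI-style category and convert counts to proportions
counts <- ddply(d, .(category), summarise, n = length(word))
counts$proportion <- counts$n / sum(counts$n)

# bar plot of category proportions, ordered from most to least frequent
ggplot(counts, aes(x = reorder(category, -proportion), y = proportion)) +
  geom_bar(stat = "identity") +
  xlab("category") + ylab("proportion of first words")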

Here's the graphic we produced for CDM:


We were struck by a couple of features of the data, and the goal of this post is to talk a bit more about these, as well as some of the other things that didn't fit in the graphic.

First, the distribution of words seemed pretty reasonable, with short, common words for objects ("ball," "car"), animals ("dog," "duck" – presumably from bathtime), and social routines ("hi"). The gender difference between "ball" and "hi" was also striking, reflecting some gender stereotypes – and some data – about girls' greater social orientation in infancy. Of course, we can't say anything about the source of such differences from these data!

Another interesting feature of the data was the age distribution we observed. On parent report forms like the CDI, parents often report that their children understand many words even in infancy; children at the 75th percentile are reported to understand 50 words at 8 months. While there is some empirical evidence for word knowledge before the first birthday, this 50-word number has always been surprising, and no one really knows how much wishful thinking it includes. The production numbers for the CDI are much lower, but still have a median value above zero for 10-month-olds. So is this overestimation? That probably depends on your standards. M, Mike's daughter, had something "word-like" at 10 months, but is only now producing "hi" as a 12-month-old (typical girl).

One possible confound in this reporting would be parents misremembering the age at which their child first produced a word, perhaps reporting systematically younger or older ages (or even ages rounded more towards the first birthday) as the first word recedes into the past. We didn't find evidence of this, however. The distribution of reported age of first word was the same regardless of how old the child was at the time of reporting:

Now on to some substantive analyses that didn't make it into the graphic. Grouping our responses into broad categories is a good way to explore what classes of objects, actions, etc., were the referents of first words. While many of the words we observed in parents’ responses were on the CDI, we had to classify some others ad-hoc, and still others we were unable to classify (we ended up excluding about 50 for a total sample of 454, 42% female). Here's a graph of the proportions in each category:
So no individual animal name dominated, but animal names were the most frequent category overall, followed by "games and routines" (including social routines like "hi" and "bye") and toys. People were next, followed by animal sounds.

There are some interesting ways to break this down further. Note that girls generally are a few months ahead, language-wise, so analyses of age and gender are a bit confounded. Here's our age distribution broken down by gender:
As expected, we see girls a bit over-represented in the youngest age bin and boys a little bit over-represented in the oldest bin.

That said, here are the splits by age:
and gender:
Overall, younger kids are similar to older kids, but are producing more names for people. Older kids were producing slightly more vehicle names and sounds, but this may be because the older kids skew more male (see the gender graph, where vehicles are almost exclusively the province of male babies). The only big gender trends were 1) a preference for toys and action words among the males and 2) a generally broader spread across different categories. This second trend could be a function of boys' tendency to have more idiosyncratic interests (in childhood at least, perhaps beyond).

Overall, these data give us a new way to look at early vocabulary, not at the shape of semantic networks within a single child, but at the variability of first words across a large population. We invite you to look at the data if you are interested! 

---
Thanks very much to Jenni Martin at CDM for her support of our research!

* What does that even mean? Is a word a word if no one understands or recognizes it? That seems pretty philosophically deep, but hard to assess empirically. We'll go with the first word that someone else, usually a parent, recognized as being part of communicating (or trying to communicate). 


Monday, August 11, 2014

Getting LaTeX to work at journals

I got good news about a manuscript being accepted today, but I was reminded again how painful it can be to get journals to accept LaTeX source. Sometimes I wonder if I waste as much time in wrangling journals as I save in writing by using tex and bibtex.

I routinely have manuscripts bounced back by editorial assistants who ask "Why can't I edit your PDF? Can you send me the word source?" Hopefully policies like Elsevier's Your Paper Your Way and the PNAS equivalent, "express submission," will promote a new norm for first submissions.

For my own memory as much as anyone else's, here are some tips for getting Elsevier's maddening EES to accept tex source:

  • Uploading an archive with all the files has never worked for me, so I skip this step.
  • Use elsarticle.cls and the included model5-names.bst for formatting and APA-style references (see the preamble sketch at the end of this post).
  • Upload both .bib AND .bbl files as supplementary information (this was the tricky one!) – why would this be necessary?
  • Upload all figures separately; PDF format has worked for me though EPS is requested.
Also, if you have uploaded a version of a file and then want to replace it, be careful to rename it. EES keeps the oldest version of a file so it will not update if you upload a newer version (totally idiotic).
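Finally, to make the elsarticle bullet above concrete, here is roughly the preamble skeleton that has worked for me. Treat it as a sketch rather than a recipe ("mypaper" is just a placeholder name); check the elsarticle documentation and your journal's guide for authors before relying on it:

\documentclass[review,authoryear]{elsarticle}
\usepackage{graphicx}

% author-year (APA-style) references using the .bst shipped with elsarticle
\bibliographystyle{model5-names}

\begin{document}
% front matter, sections, etc. go here

\bibliography{mypaper}  % upload mypaper.bib AND the generated mypaper.bbl to EES
\end{document}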