Sunday, October 26, 2014

Response to Marcus & Davis (2013)

tl;dr: We wrote a response to a critique of our work. Some more musings about overfitting.

Last year, Gary Marcus and Ernie Davis (M&D) wrote a piece in Psychological Science that was quite critical of probabilistic models of higher-level cognition. They asked whether modelers selectively apply such models to tasks the models fit, to the exclusion of counter-evidence ("task selection"), and whether the models themselves are hand-tweaked to fit those particular tasks ("model selection"). On the basis of these critiques, they questioned whether probabilistic models are a robust framework for describing human cognition.

It's never fun to have your work criticized, but sometimes there is a lot that can be learned from these discussions. For example, in a previous exchange with Ansgar Endress (critique here, my response here, his response here), I got a chance to think through my attitude towards the notion of rationality or optimality. Similarly, Tom Griffiths and colleagues' response to another critique includes some nice discussion of this issue.

In that spirit, a group of Bayesians whose work was mentioned in the critique have recently written a response letter that will be published in the same journal as M&D's critique (after M&D get a chance to reply). Our response is very short, but hopefully it captures our attitude towards probabilistic models as a relevant and robust method – not the only one, but one that has shown a lot of recent promise – for describing higher-level cognition. Here I want to discuss one thing that got compressed in the response, though.

One of the pieces of work M&D critiqued was the model of pragmatic reasoning that Noah Goodman and I published a couple of years ago (I'll call that the FG2012 model). Our one-page paper reported only a single study with a very restricted stimulus set, but there is actually a substantial amount of recent work on this topic that suggests such models do a good job at describing human reasoning about language in context; I posted a bibliography of such work a little while ago.

M&D criticized a particular modeling decision that we took in FG2012 – the use of a Luce choice rule to approximate human judgments. They pointed out that other choices that could have been justified a priori would nevertheless have fit the data much worse. Summarizing their critique, they wrote:
"Individual researchers are free to tinker, but the collective enterprise suffers if choices across domains and tasks are unprincipled and inconsistent. Models that have been fit only to one particular set of data have little value if their assumptions cannot be verified independently; in that case, the entire framework risks becoming an exercise in squeezing round pegs into square holes." (p. 2357)
I agree with this general sentiment, and have tried in much of my work to compare models from different traditions across multiple datasets and experiments using the same fitting procedures and evaluation standards (examples here, here, and here). But I don't think the accusation is fair in the case of FG2012. I'll go through a couple of specific interpretations of the critique that don't hold in our case, and then argue that in fact the "model selection" argument is really just a repetition of the (supposedly separate) "task selection" argument.
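
Before going through those interpretations, it may help to make the modeling decision at issue concrete. Below is a minimal sketch – in Python, purely for illustration, and not the code or notation of the original model – of the Luce choice rule as a linking function from model-derived response strengths to an N-alternative forced choice, alongside one alternative, a-priori-defensible linking function (a hard maximum). The numbers are made up.

```python
import numpy as np

def luce_choice(strengths):
    """Luce choice rule: the probability of choosing alternative i is its
    response strength divided by the summed strength of all alternatives."""
    strengths = np.asarray(strengths, dtype=float)
    return strengths / strengths.sum()

def hard_max(strengths):
    """An alternative linking function: put all of the probability mass
    on the single highest-strength alternative."""
    strengths = np.asarray(strengths, dtype=float)
    probs = np.zeros_like(strengths)
    probs[np.argmax(strengths)] = 1.0
    return probs

# Hypothetical model-derived strengths for a 3-alternative forced choice:
strengths = [2.0, 1.0, 1.0]
print(luce_choice(strengths))  # [0.5, 0.25, 0.25] -- graded predictions
print(hard_max(strengths))     # [1.0, 0.0, 0.0]  -- all-or-none predictions
```

The graded rule and the all-or-none rule obviously make very different predictions; M&D's point is that alternatives of the latter sort would have fit the data much worse.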

Problem: numerical overfitting -> Solution: cross-validation. A very specific critique that M&D could be leveling at us is that we are overfitting. The critique would then be that we tuned our model post-hoc to fit our particular dataset. In a standard case of overfitting, say for a classification problem, the remedy is to evaluate the model on held-out data that wasn't used for parameter tuning. If the model performs comparably on the held-out data, then it's not overfit. Our (extremely simple) model was clearly not overfit in this sense, however: It had no numerical parameters that were fit to the data.
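
For concreteness, here is a generic sketch of that remedy – held-out evaluation for a classifier, using scikit-learn and synthetic data purely as an illustration. It has nothing to do with the FG2012 model, which had no fitted numerical parameters to tune in the first place.

```python
# Generic held-out evaluation: fit on one split of the data, evaluate on another.
# Repeating this over multiple splits gives k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
# A large gap between train and test accuracy is the signature of overfitting.
```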

Problem: post-hoc model tweaking -> Solution 1: pre-registration. Another name for overfitting – when it concerns the researcher's choice of analytic model – is p-hacking. This is closer to what M&D say: Maybe we changed details of the model after seeing the data, in order to achieve a good fit. But that's not true in this case. As datacolada says, when someone accuses you of p-hacking, the right response is to say "I decided in advance." In this case, we did decide in advance – the Luce choice rule was used in our 2009 CogSci proceedings paper with a predecessor model and a large, independent dataset.*

Problem: post-hoc model tweaking -> Solution 2: direct replication. A second response to questions about post-hoc model choice is direct replication. Both we and at least one other group that I know of have done direct replications of this study – it was a very simple MTurk survey, so it is quite easy to rerun with essentially no modifications (if you are interested, the original materials are here**). The data look extremely similar.*** So again, our model really wasn't tweaked to the particulars of the dataset we collected on our task.

What is the critique, then? I suspect that M&D are annoyed about the fact that FG2012 proposed a model of pragmatic reasoning and tested it on only one particular task (which it fit well). We didn't show that our model generalized to other pragmatic reasoning tasks, or other social cognition tasks more broadly. So the real issue is about the specificity of the model for this experiment vs. the broader empirical coverage it offers.

In their response, M&D claim to offer two different critiques: "model selection" (that's the one we've been discussing) and "task selection" (the claim that Bayesian modelers choose to describe the subset of phenomena that their models fit, but omit other evidence in the discussion). In light of the discussion above, I don't see these as two different points at all. "Model selection," while implying all sorts of bad things like overfitting and p-hacking, in this case is actually a charge that we need to use our models to address a wider range of tasks. And if the worst thing you can say about a model is, "it's so good on those data, you should apply it to more stuff," then you're in pretty good shape.

---
* Confession: I've actually used Luce choice in basically every single cognitive model I've ever worked on. At least every one that required linkage between a probability distribution and an N-alternative forced-choice task.

** Contact me if you want to do this and I can explain some idiosyncrasy in the notation we used.

*** I'm not sure how much I can share about the other project (it was done by a student at another institution whose current contact info I don't have) but the result was extremely close to ours. Certainly there were no differences that could possibly inspire us to reject our choice rule.

(Thanks to Noah Goodman, Gary Marcus, and Ernie Davis for reading and commenting on a draft of this post).

Friday, October 17, 2014

Semantic bleaching in early language

M is now 15 months, and her receptive vocabulary is quite large. She knows her vehicles, body parts, and a remarkable number of animals (giraffe, hyena, etc.). Not coincidentally, we spend a large amount of our time together reading books about vehicles, body parts, and animals – at her initiation. Her productive language is also proceeding apace. As one friend astutely observed, she has a dozen words, nearly all of them "ba."*

I've noticed something very interesting over the last three months. Here's one example. When M first discovered the word "da," she used it for several days, with extreme enthusiasm, in a way that seemed essentially identical to the word "dog." The form-function mapping was consistent and distinctive: it would be used for dogs and pictures of dogs, but nothing else. But then over the course of subsequent days, it felt like this word got "bleached" of meaning – it went from being "dog" to being "wow, cool!"

The same thing happened with her first extended experience with cats, at a visit to my parents' apartment. She started producing something that sounded like "tih" or "dih" – very clearly in response to the cats. But this vocalization then gradually became a noise of excitement that was seemingly applied to things that were completely different. Perhaps not coincidentally, our visit was over and we didn't have any cat-oriented books with us, so she couldn't use the word referentially. Now that we're back in the land of cat books, the word is back to having a "real" meaning.

This looks to me a lot like the phenomenon of "semantic bleaching," where a word's meaning gradually gets weakened or generalized (like the loss of the "annus" – year – part of the meaning of anniversary). This kind of bleaching typically happens over a much longer timescale as part of grammaticalization, the process by which content words can become particles or affixes (e.g., the content verb "go" becoming a particle you can use to describe things in the future, as in "going to X"). But maybe it is happening very quickly due to the intense communicative pressure on M's extremely small lexicon?

The idea here would be that M only has a very small handful of words. If they don't fit the current situation, she adapts them. But mostly the communicative context she's adapting to is, "hey, look at that!" or "let me grab that!" So the process of bleaching words out from "dog" to something more like "cool!" could actually be a very adaptive strategy.

I've checked the research literature and haven't found anything like this described. Any readers know more?

---
* It's really more like ball ("ba"), balloon ("ba", but possibly just a ball), bus ("ba"), perhaps baby ("ba") and bottle ("ba") as well, dog ("da"), yes ("dah"), cat/kittie ("dih"), truck ("duh"), daddy ("dada"), yum ("mum-mum"), more ("muh"), hi ("ha-e"), and bye ("ba-e"). Some of these are speculative, but I think this is a pretty good estimate.