Sunday, October 26, 2014

Response to Marcus & Davis (2013)

tl;dr: We wrote a response to a critique of our work. Some more musings about overfitting.

Last year, Gary Marcus and Ernie Davis (M&D) wrote a piece in Psychological Science that was quite critical of probabilistic models of higher-level cognition. They asked whether such models are selectively fit to particular tasks to the exclusion of counter-evidence ("task selection") and whether the models are hand-tweaked to fit those particular tasks ("model selection"). On the basis of these critiques, they questioned whether probabilistic models are a robust framework for describing human cognition.

It's never fun to have your work criticized, but sometimes there is a lot that can be learned from these discussions. For example, in a previous exchange with Ansgar Endress (critique here, my response here, his response here), I got a chance to think through my attitude towards the notion of rationality or optimality. Similarly, Tom Griffiths and colleagues' response to another critique has some nice discussion of this issue.

In that spirit, a group of Bayesians whose work was mentioned in the critique have recently written a response letter that will be published in the same journal as M&D's critique (after M&D get a chance to reply). Our response is very short, but hopefully it captures our attitude towards probabilistic models as being a relevant and robust method – not the only one, but one that has shown a lot of recent promise – for describing higher-level cognition. Here I want to discuss one thing that got compressed in the response, though.

One of the pieces of work M&D critiqued was the model of pragmatic reasoning that Noah Goodman and I published a couple of years ago (I'll call that the FG2012 model). Our one-page paper reported only a single study with a very restricted stimulus set, but there is actually a substantial amount of recent work on this topic that suggests such models do a good job at describing human reasoning about language in context; I posted a bibliography of such work a little while ago.

M&D criticized a particular modeling decision that we took in FG2012 – the use of a Luce choice rule to approximate human judgments. They pointed out that other choices (that could have been justified a priori) nevertheless would have fit the data much worse. Summarizing their critique, they wrote:
"Individual researchers are free to tinker, but the collective enterprise suffers if choices across domains and tasks are unprincipled and inconsistent. Models that have been fit only to one particular set of data have little value if their assumptions cannot be verified independently; in that case, the entire framework risks becoming an exercise in squeezing round pegs into square holes." (p. 2357)
I agree with this general sentiment, and have tried in much of my work to compare models from different traditions across multiple datasets and experiments using the same fitting procedures and evaluation standards (examples here, here, and here). But I don't think the accusation is fair in the case of FG2012. I'll go through a couple of specific interpretations of the critique that don't hold in our case, and then argue that in fact the "model selection" argument is really just a repetition of the (supposedly separate) "task selection" argument.
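For readers who haven't met the Luce choice rule mentioned above, here is a minimal sketch of it in Python. It assumes the model hands us a non-negative "strength" for each response alternative; the function names and the numbers are made up for illustration and are not taken from FG2012. The deterministic `argmax_choice` is included only as a contrast case, and I'm not claiming it is the specific alternative M&D had in mind.

```python
def luce_choice(strengths):
    """Luce choice rule: P(i) = v_i / sum_j v_j for non-negative strengths."""
    if any(v < 0 for v in strengths):
        raise ValueError("strengths must be non-negative")
    total = sum(strengths)
    if total <= 0:
        raise ValueError("strengths must sum to a positive value")
    return [v / total for v in strengths]


def argmax_choice(strengths):
    """Contrast case: all probability on the largest strength (ties split evenly)."""
    best = max(strengths)
    winners = [v == best for v in strengths]
    return [1.0 / sum(winners) if w else 0.0 for w in winners]


# Hypothetical model strengths for three alternatives (not data from FG2012).
strengths = [2, 1, 1]
print(luce_choice(strengths))    # choice probabilities proportional to strength
print(argmax_choice(strengths))  # deterministic choice of the strongest option
```

The point of the contrast is just that the same model output can be linked to very different predicted response proportions depending on the choice rule, which is why the choice of rule matters for model fit.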

Problem: numerical overfitting -> Solution: cross-validation. A very specific critique that M&D could be leveling at us is that we are overfitting. The critique would then be that we tuned our model post-hoc to fit our particular dataset. In a standard case of overfitting, say for a classification problem, the remedy is to evaluate the model on held-out data that wasn't used for parameter tuning. If the model performs as well on the out-of-sample generalization, then it's not overfit. Our (extremely simple) model was clearly not overfit in this sense, however: It had no numerical parameters that were fit to the data.
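To make the held-out logic concrete, here is a toy sketch of what the check would look like for a model that did have a fitted parameter. Everything in it is an assumption for illustration: the exponent `alpha` is a hypothetical free parameter (FG2012 had none), the items are made-up numbers rather than real data, and grid search on squared error is just one simple fitting choice.

```python
def luce_power(strengths, alpha):
    """Luce choice with a hypothetical exponent: P(i) = v_i**alpha / sum_j v_j**alpha."""
    powered = [v ** alpha for v in strengths]
    total = sum(powered)
    return [p / total for p in powered]


def mean_squared_error(items, alpha):
    """Average squared gap between predicted and observed choice proportions."""
    errors = [
        (pred - obs) ** 2
        for strengths, observed in items
        for pred, obs in zip(luce_power(strengths, alpha), observed)
    ]
    return sum(errors) / len(errors)


def fit_alpha(train_items, grid):
    """Pick the exponent on the grid that minimizes error on the training items only."""
    return min(grid, key=lambda a: mean_squared_error(train_items, a))


# Made-up toy items: (model strengths, observed choice proportions).
train = [([0.6, 0.3, 0.1], [0.62, 0.29, 0.09]), ([0.5, 0.5], [0.48, 0.52])]
held_out = [([0.7, 0.2, 0.1], [0.68, 0.22, 0.10])]

grid = [0.5, 1.0, 2.0, 4.0]
alpha = fit_alpha(train, grid)  # tuned without ever seeing held_out
print("fitted alpha:", alpha)
print("train error:", mean_squared_error(train, alpha))
print("held-out error:", mean_squared_error(held_out, alpha))
```

If the held-out error is comparable to the training error, the fitted parameter is generalizing rather than memorizing. With zero fitted parameters, as in our case, there is nothing to tune and so nothing to overfit in this numerical sense.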

Problem: post-hoc model tweaking -> Solution 1: pre-registration. Another name for overfitting – when it concerns the researcher's choice of analytic model – is p-hacking. This is closer to what M&D say: Maybe we changed details of the model after seeing the data, in order to achieve a good fit. But that's not true in this case. As datacolada says, when someone accuses you of p-hacking, the right response is to say "I decided in advance." In this case, we did decide in advance – the Luce choice rule was used in our 2009 CogSci proceedings paper with a predecessor model and a large, independent dataset.*

Problem: post-hoc model tweaking -> Solution 2: direct replication. A second response to questions about post-hoc model choice is direct replication. Both we and at least one other group that I know of have done direct replications of this study – it was a very simple MTurk survey, so it is quite easy to rerun with essentially no modifications (if you are interested, the original materials are here**). The data look extremely similar.*** So again, our model really wasn't tweaked to the particulars of the dataset we collected on our task.

What is the critique, then? I suspect that M&D are annoyed about the fact that FG2012 proposed a model of pragmatic reasoning and tested it on only one particular task (which it fit well). We didn't show that our model generalized to other pragmatic reasoning tasks, or other social cognition tasks more broadly. So the real issue is about the specificity of the model for this experiment vs. the broader empirical coverage it offers.

In their response, M&D claim to offer two different critiques: "model selection" (that's the one we've been discussing) and "task selection" (the claim that Bayesian modelers choose to describe the subset of phenomena that their models fit, but omit other evidence in the discussion). In light of the discussion above, I don't see these as two different points at all. "Model selection," while implying all sorts of bad things like overfitting and p-hacking, in this case is actually a charge that we need to use our models to address a wider range of tasks. And if the worst thing you can say about a model is, "it's so good on those data, you should apply it to more stuff," then you're in pretty good shape.

* Confession: I've actually used Luce choice in basically every single cognitive model I've ever worked on. At least every one that required linkage between a probability distribution and an N-alternative forced-choice task.

** Contact me if you want to do this and I can explain some idiosyncrasy in the notation we used.

*** I'm not sure how much I can share about the other project (it was done by a student at another institution whose current contact info I don't have) but the result was extremely close to ours. Certainly there were no differences that could possibly inspire us to reject our choice rule.

(Thanks to Noah Goodman, Gary Marcus, and Ernie Davis for reading and commenting on a draft of this post.)

1 comment:

  1. Well argued. This post ably defends one aspect of Mike Frank and Noah Goodman's work that we had criticized. However, it does not address our criticism of the larger literature, which we feel remains valid. (We will elaborate in a forthcoming response in Psychological Science.)
    Gary Marcus & Ernie Davis