(This post is written in part as a reference for a seminar I'm currently teaching with Jamil Zaki and Noah Goodman, Models of Social Behavior).

The goal of research in cognitive science is to produce explicit computational theories that describe the workings of the human mind. To this end, a major part of the research enterprise consists of making formal artifacts – computational models. These are artifacts that take as their inputs some stimuli, usually in coded form, and produce as their outputs some measures that are interpretable with respect to human behavior.

In this post I'll discuss a visualization-based strategy for assessing the fit of models to data, based on moving between plots at different levels of abstraction. I often refer to this informally as "putting a model through its paces."

I take as my starting point the idea that a model is successful if it is both parsimonious itself and if it provides a parsimonious description of a body of data. To unpack this, a bit more, the basic idea is that any model is created as a formal theory of some set of empirical facts. If you know the theory, you can predict the facts – and so you don't need to remember them because they can be re-derived. A model can fail because it's too complicated – it predicts the facts, but at the cost of having so many moving parts that it is itself hard to remember. Or or it can fail because it is consistent with all patterns of data – and hence doesn't compress the particular pattern of data much at all. (I've discussed this this "minimum description length" view of cognitive modeling in more depth both in this post and in my academic writing here and here. Note that it's consistent with a wide range of modeling formalisms from neural networks to probabilistic models).

So how do we assess models within this framework, and how do we compare them to data? Although the minimum description length framework provides a guiding intuition, it's not that easy to say exactly how parsimonious a particular model is; there are actually fundamental mathematical difficulties with computing parsimony. But there are nevertheless many methods for statistical comparison between models, and these can be very useful – especially when you are using models that are posed in coherent and equivalent vocabularies. Here are great slides from a tutorial that Mark Pitt and Jay Myung gave at CogSci a couple of years ago on model comparison.

My focus in this post is a bit more informal, however. What I want to do is to discuss a set of plots for model assessment that can be used together to gain understanding about the relationship of a model or set of models to data:

- A characteristic model plot, one that lets you see details of the model's internal state or scoring so that you can understand what it has learned or why it has produced a particular result.
- A plot of model results across conditions or experiments, in precisely the same format as the experimental data are typically plotted.
- A scatter plot of model vs. data for comparing across experiments and across models.
- A plot of model fit statistics as parameters are varied.

In each of these, I've used examples from the probabilistic models literature, taken mostly from my work and the work of my collaborators. This choice is purely because I know these examples well, not because of anything special about these examples. The broader approach is due to Andrew Gelman's philosophy of exploratory model checking (on display, e.g. in this chapter).

### 1. Characteristic Model Plot

The first plot I typically make when I am working on a model has the goal of understanding

*why*the model produces a particular result when it is given a certain pattern of input data. This plot is often highly idiosyncratic to the particular representation or model framework that I am using – but it gives insight into the guts of the model. It typically doesn't include the empirical data.
Here is one example, from Frank, Goodman, & Tenenbaum (2009). We were trying to understand how our model of word learning performed on a task that's sometimes called "mutual exclusivity" that had been used in the language acquisition literature. The task is simple: you show a child a known object (e.g. a BIRD) and a novel object (a DAX), and you say "can you show me the dax?" Children in this task reliably pick the novel object.

Our model made this same inference, but we wanted to understand why. So we chose four different hypotheses that the model could have about what the word "dax" meant (represented at the top of each of the four panels) and computed the model's scores for each of these. In our model, the posterior score of a hypothesis about word meanings was the product of the prior probability of the lexicon, the probability of a corpus of input data, and the probability assigned in this specific experimental situation. So we plotted each of those numbers on a relative probability scale such that we could easily compare them. The result was an interesting insight: The major reason why the model preferred lexicon B (the one consistent with the data) was that it placed higher probability on previous utterances in which the word "dax" hadn't been heard before even though the BIRD object had been seen. It was this unseen data that made it odd to think that "dax" actually did mean BIRD after all (lexicons C and D).

Another example of this kind of plot comes from a follow-up to this paper (Lewis & Frank, 2013), again looking at the "mutual exclusivity" phenomenon. In this case, we plotted the relative probabilities assigned to different lexicons in a simple bar graph (with shading indicating different priors), but we used a graphical representation of the lexicon as the axis label. The generalization that emerged from this plot is that the lexicon where each word is correctly mapped one-to-one (the middle lexicon with one gray bar and one green bar), almost regardless of the prior that is used.

In general, visualizations of this class will be very project- and model-specific, because they will depend on the relevant aspects of the model that you find most informative in your explanations. Nevertheless, they form a crucial tool for diagnosing why the model produced a particular result for a given input configuration.

### 2. Data-space plot

This style of plot is very important for evaluating the correspondence of a model to human data. It's the first thing I typically do after getting a model working. I try to produce predictions about what the experimental data should look like, on the same axes (or at least in the same format) as the original plots of the experimental data.

Here is one example, from a paper on word segmentation (Frank et al., 2010). We tracked human performance in statistical word segmentation tasks while we varied parameters of the language the participants were segmenting in three different experiments. We then examined the fit of a range of models to these data. The human data here are abstracted away from any variability and are just solid curves repeated across plots; the model predictions make clear that while there is some variability in performance on the first two experiments, it is Experiment 3 where all models fail:

Here's a second example, this one from Baker, Saxe, & Tenenbaum (2009). They had participants judge the goals of a cartoon agent at various different "checkpoints" along a path. For example, in panel (a), condition 1, the agent wended its way around a wall and then headed for the corner. At each numbered point, participants made a judgment about whether the agent was headed towards goal A, B, or C. In panel (b), experimental data show graded ratings for each goal, and in panel (c) you can compare model-derived ratings:

The general point here is that visualization is about comparison (a point I identify with William Cleveland but don't have a good citation for). This plot makes it easy to compare the gestalt pattern of model vs. experimental data in a format where you can readily identify the particular conditions under which deviations occur – sometimes even across models.

### 3. Cross-dataset, cross-model plot

These plots are more obvious and more conventional. You plot model predictions on the horizontal and human behavior on the vertical, resulting in a scatter plot where greater coherence indicates greater correlation between model and data. Of course, the result can be quite deceptive because it obscures the model's degrees of freedom. But this sort of plot is a simple and powerful tool for quickly assessing fit. Here's one nice one:

This example comes from Orban et al. (2008), a paper on visual statistical learning. They plot log probability ratio of test items by proportion correct in human experiments. Each datapoint is a separate condition, and the key relates each datapoint back to the experimental data, so that you can look up which points don't fit as well. When possible it's nice to plot the the actual labels on the axes so that you don't have to use a key, but sometimes this strategy can get overly messy. A minor note: I really like to plot confidence intervals (rather than standard error) here so that we can better assess by eye whether variability is due to measurement noise or a truly incorrect prediction by the model.

This sort of plot can also be very good at comparing across models, because you can see at a glance both which model fits better and which points are negatively affecting fit. Here's another version, from Sanjana & Tenenbaum (2002). The data aren't labeled here, but I like both the matrix of small plots (the grid is different models on the columns, different experimental conditions on the rows) and the prominent reporting of the correlation coefficients in each plot:

One interesting thing that you can see by presenting the data this way is that there are sets of conditions that don't vary in their model predictions (e.g. the bunches of dots on the right side of the center panel). As an analyst I would be interested to know whether these conditions are truly distinct experimentally and getting lumped inappropriately by this model. My next move would probably be to make plots #1 and #2 with just these conditions included. More generally, I find it extremely important to have the ability to move flexibly back and forth between more abstract plots like this one and plots that are more grounded in the data.

### 4. Parameter sensitivity plots

Even the simplest cognitive models typically have some parameters that can be set with respect to the data (free parameters). And all of the plots described above produce an output given some settings of the model's free parameters, allowing you to assess fit to data. But, as a number of influential critiques point out, a good fit is not necessarily what we should be looking for in a cognitive model. I take these critiques to be largely targeted at excessive flexibility in models – which can lead to overfitting.

So an important diagnostic for any model is how its fit to data varies in different parameter regimes. Showing this kind of variability can be tricky, however. Sometimes parameters are cognitively interesting, but in other circumstances they are not – yet it is important to explore them fully.

These plots typically plot either model performance (in the case of a small number of conditions) or a summary statistic capturing goodness of fit or performance (in the case of more data) as a function of one or more free parameters. The goodness of fit statistic is usually a measure like mean squared error (deviation from human data) or simple Pearson correlation with the data, but can be any number of other summary statistics as well. In models with only one or two free parameters, visualizing the model space is not too difficult, but as the model space balloons, visualization can be difficult or impossible.

Here is one example, from a paper I wrote on modeling infants' ability to learn simple rules:

It's a combination of a #4 plot (left) and a #2 plot, middle/right. On the left, I show model predictions across five conditions, as alpha (a noise parameter) was varied. The salient generalization from this plot is that the ordering of conditions stays the same, even though the absolute values change. The filled markers call out the parameter value that is then plotted in the middle as a bar graph for easy comparison with the right-hand panel. As you can see, the fit is not perfect, but the relative ordering of conditions is similar.

Here is another example, one that I'm not proud of from a visualization perspective (especially as it uses the somewhat unintuitive matlab jet colormap):

This comes from our technical report on the word learning model described above. The model had three free parameters, alpha, gamma, and kappa. This plot shows a heatmap of f-score (a measure of the model's word learning performance) as a function of all three of those continuous parameters, with the maxima for each plot labeled. Although this isn't a great publication-ready graphic, it was very useful for me as a diagnostic of the parameter-sensitivity of the model – and led to us using a parameter-reduction technique to try to avoid this problem.

Parameter plots may not always make for the best viewing, but they can be an extremely important tool for understanding how your model's performance varies in even a high-dimensional parameter space.

### Conclusions

I've tried to argue here for an exploratory visualization approach to model-checking for cognitive models especially. The approach is predicated on having plots at multiple levels of abstraction, from diagnostic plots that let you understand why a particular datapoint was predicted in a certain way, all the way up to plots that let you consider the stability of summary statistics throughout parameter space. It is not always trivial to code up all of these visualizations, let alone to create an ecosystem in which you can move flexibly between them. Nevertheless, it can be extremely useful in both debugging and in gaining scientific understanding.

## No comments:

## Post a Comment