Wednesday, September 30, 2015

Descriptive vs. optimal Bayesian modeling

In the past fifteen years, Bayesian models have fast become one of the most important tools in cognitive science. They have been used to create quantitative models of psychological data across a wide variety of domains, from perception and motor learning all the way to categorization and communication. But these models have also had their critics, and one of the recurring critiques of the models has been their entanglement with claims that the mind is rational or optimal. How can optimal models of mind be right when we also have so much evidence for the sub-optimality of human cognition?*

An exciting new manuscript by Tauber, Navarro, Perfors, and Steyvers makes a provocative claim: you can give up on the optimal foundations of Bayesian modeling and still make use of the framework as an explicit toolkit for describing cognition.** I really like this idea. For the last several years, I've been arguing for decoupling optimality from the Bayesian project. I even wrote a paper called "throwing out the Bayesian baby with the optimal bathwater" (which was about Bayesian models of baby data, clever right?).

In this post, I want to highlight two things about the TNPS paper, which I really enjoyed reading. First, it contains an innovative fusion of Bayesian cognitive modeling and Bayesian data analysis (BDA). BDA has been a growing and largely independent strand of the literature; fusing it with cognitive models makes a lot of rich new theoretical development possible. Second, the paper contains two direct replications that succeed spectacularly, and it makes no fuss about them whatsoever – this is, in my view, what observers of the "replication crisis" should be aspiring to.

1. Bayesian cognitive modeling meets Bayesian data analysis.

The meat of the TNPS paper revolves around three case studies in which they use the toolkit of Bayesian data analysis to fit cognitive models to rich experimental datasets. In each case they argue that taking an optimal perspective – in which the structure of the model is argued to be normative relative to some specified task – is overly restrictive. Instead, they specify a more flexible set of models with more parameters. Some settings of these parameters may be "suboptimal" for many tasks but have a better chance of fitting the human data. The fitted parameters of these models can then reveal aspects of how human learners treat the data – for example, how heavily they weight new observations or what sampling assumptions they make.
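To make the recipe concrete, here is a minimal sketch of the general pattern – my own toy construction, not TNPS's code or models. The "cognitive model" is a simple Bayesian learner judging a coin's bias, with one free psychological parameter (the strength `a` of its Beta prior); the BDA layer (here just a grid approximation with an assumed Gaussian response-noise model) infers a posterior over that parameter from behavioral judgments instead of fixing it by a normative argument. The judgment numbers are invented purely for illustration.

```python
# Toy sketch of "descriptive Bayes": infer a psychological parameter of a
# Bayesian learner from behavioral data, rather than fixing it a priori.
import numpy as np
from scipy import stats

def model_prediction(a, heads, flips):
    """Learner's posterior mean that the coin lands heads, under a Beta(a, a) prior."""
    return (heads + a) / (flips + 2 * a)

# Hypothetical behavioral data: evidence shown to participants and mean judgments.
heads = np.array([1, 3, 5, 8])
flips = np.array([2, 4, 6, 10])
human = np.array([0.55, 0.62, 0.70, 0.74])   # made-up judgment means

# Grid over the psychological parameter, with a flat prior on the grid.
a_grid = np.linspace(0.1, 20, 500)

# Likelihood of the judgments under assumed Gaussian response noise (sd = 0.05).
log_lik = np.array([
    stats.norm.logpdf(human, loc=model_prediction(a, heads, flips), scale=0.05).sum()
    for a in a_grid
])

post = np.exp(log_lik - log_lik.max())   # flat prior, so posterior is proportional to likelihood
post /= post.sum()

print("posterior mean of prior-strength a:", np.sum(a_grid * post))
```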

This fusion of Bayesian cognitive modeling and Bayesian data analysis is really exciting to me because it allows the underlying theory to be much more responsive to the data. I've been doing less cognitive modeling in recent years, in part because my experience was that my models weren't as responsive to the data that I and others collected as I would have liked. I often came to a point where I would have to do something awful to my elegant and simple cognitive model in order to make it fit the human data.

One example of this awfulness comes from a paper I wrote on word segmentation. We found that an optimal model from the computational linguistics literature did a really good job fitting human data – if you assumed that it observed the equivalent of somewhere between a tenth and a hundredth of the data the humans observed. I chalked this problem up to "memory limitations" but didn't have much more to say about it. In fact, nearly all my work on statistical learning has included some kind of memory limitation parameter, more or less – a knob that I'd twiddle to make the model look like the data.***

In their first case study, TNPS estimate the posterior distribution of this "data discounting" parameter as part of their descriptive Bayesian analysis. That may not seem like a big advance from the outside, but in fact it opens the door to putting in place much more psychologically inspired memory models as part of the analytic framework. (Dan Yurovsky and I played with something a bit like this in a recent paper on cross-situational word learning – where we estimated a power-law memory decay on top of an ideal observer word learning model – but without the clear theoretical grounding that TNPS provide.) I would love to see this kind of work really try to understand what this sort of data discounting means, and how it integrates with our broader understanding of memory.
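In miniature, the move looks something like this – again a toy sketch in the spirit of TNPS's first case study rather than their actual model, with invented numbers throughout. Rather than hand-tuning the memory-limitation knob, we treat the discount parameter `delta` (the fraction of the data the learner effectively uses) as unknown and compute its posterior from the human judgments.

```python
# Toy sketch: estimate a "data discounting" parameter instead of hand-tuning it.
import numpy as np
from scipy import stats

def discounted_prediction(delta, heads, flips, a=1.0):
    """Posterior mean for heads under a Beta(a, a) prior, with the observed
    counts down-weighted by delta (effective data = delta * counts)."""
    return (delta * heads + a) / (delta * flips + 2 * a)

# Made-up evidence and mean human judgments showing conservative updating.
heads = np.array([8, 16, 40])
flips = np.array([10, 20, 50])
human = np.array([0.62, 0.66, 0.71])

# Grid posterior over the discount parameter, assuming Gaussian response noise.
delta_grid = np.linspace(0.01, 1.0, 500)
log_lik = np.array([
    stats.norm.logpdf(human, loc=discounted_prediction(d, heads, flips), scale=0.05).sum()
    for d in delta_grid
])
post = np.exp(log_lik - log_lik.max())
post /= post.sum()

mean_delta = np.sum(delta_grid * post)
print(f"posterior mean discount: {mean_delta:.2f} "
      f"(the learner acts as if it saw ~{mean_delta:.0%} of the data)")
```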

2. The role of replication.

Something that flies completely under the radar in this paper is how closely TNPS replicate the previously reported empirical findings. Their Figure 1 tells a great story:

[Figure 1 from TNPS]

Panel (a) shows the original data and model fits from Griffiths & Tenenbaum (2007), and panel (b) shows their own data and replicated fits. This is awesome. Sure, the model doesn't perfectly fit the data – and that's TNPS's eventual point (along with a related point about individual variation). But clearly GT measured a true effect, and they measured it with high precision.

The same thing was true of Griffiths & Tenenbaum (2006) – the second case study in TNPS. GT2006 was a study about estimating conditional distributions for different everyday processes, e.g., given that someone has lived X years so far, how likely is it that they will live Y years in total. At the risk of belaboring the point, I'll show you three datasets on this question: the first from GT2006, the second from TNPS, and the third a new, unreported dataset from my replication class a couple of years ago.**** The conditions (panels) are plotted in different orders in each plot, but if you take the time to trace one – say, lifespans or poems – you will see just how closely these three datasets replicate one another: not just the shape of the curves but also the precise numerical values:

[Three figures: prediction data from GT2006, from TNPS, and from the class replication]

This kind of close replication is the ideal outcome to strive for in our responses to the reproducibility crisis. Quantitative theory requires precise measurement – you just can't get anywhere fitting a model to a small number of noisily estimated conditions. So you have to get precise measures – and this leads to a virtuous cycle. Your critics can disagree with your model precisely because they have a wealth of data to fit their more complex models to (that's exactly TNPS's move here).
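As an aside for readers who haven't seen GT2006: the model behind those prediction curves is simple enough to sketch in a few lines. Roughly (in my paraphrase), it assumes the current duration t is a uniform draw from the total duration t_total, so that p(t_total | t) is proportional to p(t_total) / t_total for t_total >= t, and it reports the posterior median as its prediction. The Gaussian "lifespan" prior below is made up for illustration; GT2006 used empirical distributions.

```python
# Rough sketch of a GT2006-style prediction model, with a made-up prior.
import numpy as np

def predict_total(t_now, support, prior):
    """Posterior-median prediction of t_total, given current duration t_now,
    assuming t_now is a uniform sample from (0, t_total)."""
    post = np.where(support >= t_now, prior / support, 0.0)  # p(t_total | t) ~ p(t_total) / t_total
    post /= post.sum()
    return support[np.searchsorted(np.cumsum(post), 0.5)]

# A made-up, roughly Gaussian "lifespan" prior in place of the empirical one.
ages = np.arange(1, 121)
prior = np.exp(-0.5 * ((ages - 78) / 15) ** 2)
prior /= prior.sum()

for t in [20, 40, 60, 80, 95]:
    print(f"lived {t} years -> predicted total lifespan ~ {predict_total(t, ages, prior)}")
```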

I think it's no coincidence that quite a few of the first big-data, Mechanical Turk studies I saw were done by computational cognitive scientists. Not only were they technically oriented and happy to port their experiments to the web, they were also motivated by a pressing need for greater measurement precision. And that kind of precision leads to exactly the kind of reproducibility we're all striving for.

---
* Think Tversky & Kahneman, but there are many, many issues with this argument...
** Many thanks to Josh Tenenbaum for telling me about the paper; thanks also to the authors for posting the manuscript.
*** I'm not saying the models were in general overfit to the data – just that they needed some parameter that wasn't directly derived from the optimal task analysis.
**** Replication conducted by Calvin Wang.

2 comments:

  1. Very cool stuff, Mike. Thank you for posting this.

    I am a little bit unclear about all this optimality business – it may be my own naivety about the history of the literature, and about what I've heard J. B. Tenenbaum describe as "philosophical baggage" and related things. I thought Bayesian models (descriptive, optimal, or otherwise) were always "optimal" w.r.t. a prior and a likelihood. That is, Bayes' Rule gives you the optimal way to combine these two sources of information. This may be a very weak optimality claim (maybe one that evolutionary psychologists wouldn't get inspired by), but it seems that it is always present with a Bayesian model. What then is characteristic of the "optimal models" as described by TNPS? The argument seems to rest on what is going into the prior and likelihood.

    I find it easier to think about the priors, so I'll start there. Take Case Study 2: the optimal/descriptive distinction of TNPS seems to rest on the question "what are the priors?", with the possible answers being (1) environmental (optimal) or (2) non-environmental (non-optimal). They find that (2) is mostly the case, but (1) isn't terrible. The distinction between optimal and non-optimal thus seems to rest on "are the priors optimal?", not "is the reasoning optimal?". I don't yet find this distinction between optimal and non-optimal priors compelling. Do we have criteria to tell whether priors are optimal? In psychology, it seems that priors perfectly aligned with environmental statistics are conceivably not optimal. For example, is it optimal to include infant mortality in your beliefs about lifespan, or might you give infant mortality special status, reserving those beliefs their own distribution? This gloss on the optimality question seems removed from the empirical landscape and more appropriate for philosophical quarters.*

    The claim that "everything is optimal (relative to some prior and likelihood)" is perhaps a little more nuanced in the case of modifying the likelihood. I really like their Case Study 1 approach of discounting evidence / modifying the likelihood (and your gloss relating it to memory is also very interesting). They show that psychokinesis information effectively requires more evidence to produce the same updating as the genetics information. But given that the updating is discounted in that way, the incorporation of that discounted evidence with the prior is still optimal in the sense of how these information sources are combined.**

    In the end, I think back to the work on subjective randomness, where there's a very clear case that people are not optimal with respect to veridical statistical reasoning (and its corresponding generative process, e.g., flipping a coin) but seem to be optimal with respect to some lay theory of how the data could have been generated and what the experimenter's question is really asking (random vs. non-random generative process).

    I think "descriptive Bayes," as TNPS put it, is methodologically superior and a more tractable way of doing science. I still think there is optimality in there – perhaps a weaker optimality than the one implicated in the early Bayesian literature.

    MHT

    *Also on priors: TNPS say the optimality question doesn't apply to the hypothetical priors of future lifespans, but I think there is still an optimality question: given beliefs about future lifespans, and the likelihood function specified, are the inferences optimal?

    ** What is really cool about the TNPS approach is that it brings the "discounted updating" phenomenon to light, which raises the question of "why?" It's quite conceivable that the likelihood function is different as the result of a different lay theory about the information sources (e.g., psychokineticians are more likely to be fraudulent in reporting their results than geneticists).

    Replies
    1. MH, thanks for the comments.

      My take is that "optimal" here refers to "optimal with respect to some natural task," as in some versions of Marr's Computational Theory level of analysis, or as in rational analysis. The sense of optimality you're talking about is "optimal inference with respect to the model definition." Confusion between these two is a source of much stress and conflict, IMO.

      I see TNPS as saying, let's give up on that first sense of optimal, since (as you point out) arguments that a particular prior is exactly right with respect to some environmental task can be both pretty flimsy and unnecessarily constraining of the data analyst.
