Monday, May 26, 2014

Another replication of Schnall, Benton, & Harvey (2008)


Simone Schnall, in her recent blogpost, notes that she has received many requests for materials and data to investigate her work on cleanliness priming. One of those requests came from Fiona Lee, a student in my replication-based graduate research methods course (info and syllabus here). Fiona did a project conducting a replication of Study 1 from Schnall, Benton, & Harvey (2008) using Amazon Mechanical Turk. We'd like to say at the outset that we really appreciate Simone Schnall's willingness to share her stimuli and her responsiveness to our messages.

After we realized the relevance of this replication attempt to the recent discussion, Fiona decided to make her data, report, and results public. Our results are described below, followed by some thoughts on the take-home messages in terms of both the science and the tone of the discussion.

Here is a link to Fiona's replication report (in the style of the Reproducibility Project Reports). Here are her (anonymized and cleaned-up) data and my analysis code. Here's her key figure:


In a sample of 96 adults (90 after planned exclusions), Fiona found no effect of priming condition (cleanliness vs. neutral) on participants' moral judgment severity in her planned analysis. In exploratory analyses, she found an unpredicted age effect on moral judgments, perhaps due to the fact that the age spread of the Mechanical Turk population was greater than that of the original study. When age was broken into quartiles, she saw some support for the hypothesis that the youngest participants showed the predicted priming effect, suggesting that age might have been a moderator of the effect. Here's the key figure for the age analysis:



I have confirmed Fiona's general analyses and I agree with her conclusions, though I would add the qualification that the statistical support for the age-moderation hypothesis is quite limited. The main effect of age on moral judgment is very reliable, but the interaction with priming condition is not.
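
For concreteness, an exploratory analysis along these lines might look roughly like this in R. This is only a sketch – the file name and the column names (age, condition, rating) are placeholders, not the ones in Fiona's dataset or in my analysis code:

d <- read.csv("schnall_replication_data.csv")  # placeholder file name

# continuous-age model: main effect of age plus its interaction with condition
summary(lm(rating ~ condition * age, data = d))

# exploratory quartile split on age
d$age_quartile <- cut(d$age,
                      breaks = quantile(d$age, probs = seq(0, 1, 0.25)),
                      include.lowest = TRUE)
summary(lm(rating ~ condition * age_quartile, data = d))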

Some further thoughts on the relevance of these data to the Johnson et al. study. First, the age issue doesn't apply in that case, since there was no age differentiation in the Johnson et al. sample. Second, in Fiona's replication, the moral judgments were reasonably internally consistent (Cronbach's alpha = .71), so we didn't see any problem with averaging them. Finally, a ceiling effect was one of the concerns regarding the failed replication of Johnson et al. In Fiona's dataset, after planned exclusions (failure to pass the attention check or correctly guessing the hypothesis of the study), the percentage of extreme responses was about 24% for the entire dataset and about 27% for the neutral condition, which was very close to the percentage of extreme responses in the neutral condition of the original study (28%). Some exploratory histograms are embedded in my analysis code, if others are interested in seeing another dataset that uses Schnall's stimuli.
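
These checks are also only a few lines of R. Again this is a sketch with placeholder item names, continuing with the hypothetical data frame d from above; alpha() here is the Cronbach's alpha function from the psych package:

library(psych)

# internal consistency of the individual moral judgment items
judgment_items <- c("item1", "item2", "item3", "item4", "item5", "item6")  # placeholders
alpha(d[, judgment_items])

# proportion of responses at the scale endpoints
# (assuming, for illustration, a 0-9 wrongness scale)
prop_extreme <- function(x) mean(x %in% c(0, 9), na.rm = TRUE)
prop_extreme(unlist(d[, judgment_items]))
prop_extreme(unlist(d[d$condition == "neutral", judgment_items]))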

The fact that Fiona ran her study on Mechanical Turk is an important difference between her experiment and previous work. For someone interested in pursuing a related line of research, experiments like this one are trivially easy to run on Mechanical Turk, but there are real questions about whether priming research in particular can be replicated online. Although Turk replications of cognitive psychology tasks work exceedingly well, Turk could be a poor platform for social priming research specifically. Fiona also suggests that unscrambling tasks in particular may work differently online, since participants type rather than write longhand. A very useful goal for future research would be to replicate other priming experiments on Turk to determine the efficacy of the platform for research of this type.

So what's the upshot? These data provide further evidence that, to the extent cleanliness primes have an effect on moral judgment, there may be a number of moderators of this effect – age and/or online administration being possible candidates. Hence, our experience confirms that the finding in Experiment 1 of Schnall, Benton, & Harvey (2008) is not trivial to reproduce (though see two successes on PsychFiledrawer). Further research with larger samples, a number of administration methods, and – critically in my view – a wider range of well-normed judgment problems may be required.

Nevertheless, both Fiona and I feel that the tone of the discussion surrounding this issue has been far too negative, and we apologize for any lapses of tone in our own correspondence or commentary. (I will be writing a separate post on the issue of tone and replication shortly.) Regardless of the eventual determination regarding cleanliness priming, we appreciate Schnall's willingness to engage with the community to understand the issue further.

Saturday, April 26, 2014

Data analysis, one step deeper

tl;dr: Using git+R is good for reproducible research. If you already knew that, then you won't learn a lot here.

I just read Dorothy Bishop's new post, Data analysis: Ten tips I wish I'd known sooner. I really like her clear, commonsense recommendations and agree with just about all of them. But in the last couple of years, I've become convinced that even for the less technically-minded student (let alone for the sophisticated researcher), the spirit of many of her tips can be implemented using open tools like git and R. As a side benefit, many of the recommendations are a natural part of the workflow in these tools, rather than requiring extra effort.

My goal in this post is to explain how this ecosystem works, and why it (more or less) does the right thing. I also want to show you why I think it's a pretty good tradeoff between learning time and value added. Of course, you can get even more reproducible by managing your project on the Open Science Framework (and using its git support) and by using Sweave and LaTeX to typeset documents that embed exactly the code you've written. These are all great things. But for many people, starting out with such a complex, interlocking set of tools can be quite daunting. I'd like to think the git+R setup that we use strikes a good balance.

Bishop's recommendations implicitly address several major failure modes for data analysis:
  1. Not being able to figure out what set of steps you did, 
  2. Not being able to figure out what those steps actually accomplished, and
  3. Not being able to reconstruct the dataset you applied them to.
These are problems in the reproducibility of your analysis, and as such, they pose major risks to the basic science you're trying to do. The recommendations that address these issues are very sensible: keep track of what you did (recs 8 and 9), label and catalogue your variables in a semantically transparent way (recs 2 and 4), and archive and back up your data (recs 5 and 6). Here's how I accomplish this in git+R.

Writing analyses in R as a keystone of reproducible analysis

Bishop's recommendations focus on the importance of keeping a log of analyses. This is the classic best-practices approach in bench science: keep a lab notebook! Although I don't think you can go wrong with this approach, it has a couple of negatives. First, it requires a lot of discipline. If you get excited and start doing analyses, you have to stop yourself and remember to document them fully. Second, keeping a paper lab notebook means going back and forth between computer and notebook all the time (and having to work on analyses only when you have a notebook with you). On the other hand, using an electronic notebook can mean you run into major formatting difficulties in including code, data, images, etc.

These problems have been solved very nicely by IPython, an interactive notebook that allows the inclusion of data, code, images, and text in a single flexible format. I suspect that once this approach is truly mature and can be used across languages, interactive notebooks are what we all should be using. But I am not yet convinced that we should be writing Python code to analyze our data – and I definitely don't think we should start students out this way. Python is a general-purpose language (and a much better one than R), but the idioms of data analysis are not yet as codified or as accessible in it, even though they are improving rapidly.

In the meantime, I think the easiest way for students to learn to do reproducible data analysis is to write well-commented R scripts. These scripts can simply be executed to produce the desired analyses. (There is of course scripting functionality in SPSS as well, but the combination of clicking and scripting can be devastating to reproducibility: the script gives the impression of reproducibility while potentially depending on extra ad-hoc clicks that are not documented.)

The reasons why I think R is a better data analysis language for students to learn than Python are largely due to Hadley Wickham, who has done more than anyone else to rationalize R analysis. In particular, a good, easy-to-read analysis will typically have only a few steps: read in the data, aggregate the data across some units (often taking means across conditions and subjects), plot the aggregated data, and apply a statistical model to characterize patterns seen in the plots. In the R ecosystem, each of these steps can be executed in only one or at most a few lines of code.

Here's an example from a repository I've been working on with Ali Horowitz, a graduate student in my lab. This is an experiment on children's use of discourse information to learn the meanings of words. Children across ages choose which toy (side) they think a word refers to, in conditions with and without discourse continuity information. The key analysis script does most of its work in four chunks:

#### 0. load packages
library(ggplot2)  # qplot
library(lme4)     # glmer

#### 1. read in data
d <- read.csv("data/all_data.csv") 

#### 2. aggregate for each subject and then across subjects
mss <- aggregate(side ~ subid + agegroup + corr.side + condition, 
                 data = d, mean)
ms <- aggregate(side ~ agegroup + corr.side + condition, 
                data = mss, mean)

#### 3. plot
qplot(agegroup, side, colour = corr.side, 
      facets = .~condition,  
      group = corr.side, 
      geom = "line", 
      data = ms)

#### 4. mixed-effects logistic regression
## ("kids" is the cleaned data frame and "age" the continuous age variable,
##  both produced by cleaning steps omitted from this excerpt)
lm.all <- glmer(side ~ condition * corr.side * age + 
                (corr.side | subid), 
                data = kids, family = "binomial")

This is simplified somewhat – I've left out the confidence intervals and a few pieces of data cleaning – but the overall schema is one that reappears over and over again in my analyses. Because this idiom for expressing data analysis is so terse (but still so flexible), I find it extremely easy to debug. In addition, if the columns of your original datasheet are semantically transparent (e.g. agegroup, condition, etc.), your expressions are very easy to read and interpret. (R's factor data structure helps with this too, by keeping track of different categorical variables in your data). Overall, there is very little going on that is not stated in the key formula expressions in the calls to aggregate, qplot, and glmer; this in turn means that good naming practices make it easy to interpret the code in terms of the underlying design of the experiment you ran. It's much easier to debug this kind of code than your typical matlab data analysis script, where rows and columns are often referred to numerically (e.g. plot(d(:,2), d(:,3)) rather than qplot(condition, correct, data=d)). 

Often the data you collect are not in the proper form to facilitate this kind of analysis workflow. In that case, my choice is to create another script, called something like "preprocessing.R", that uses tools like reshape2 to move from, e.g., a Mechanical Turk output file to a tidy data format (a long-form tabular dataset). That way I have a two-step workflow, but I am saving both the original data and the intermediate datafile, and I can easily check each by eye in a text editor or Excel for conversion/reformatting errors. 
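
For what it's worth, such a preprocessing script is usually quite short. Here is a sketch of the reshape2 pattern, with invented file and column names rather than the actual Turk output format:

library(reshape2)

#### preprocessing.R: from wide Turk output to a tidy long-form table
raw <- read.csv("data/turk_output.csv")  # placeholder file name

# melt the per-trial columns into long format: one row per participant per trial
tidy <- melt(raw,
             id.vars = c("workerid", "condition"),
             variable.name = "trial",
             value.name = "response")

write.csv(tidy, "data/tidy_data.csv", row.names = FALSE)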

Overall, the key thing about using R for the full analysis is that – especially when the analysis is version controlled, as described below – you have a full record of the steps that you took to get to a particular conclusion. In addition, with the general workflow described above, the steps in the analysis are described in a semantically transparent way (modulo understanding the particular conventions of, say, ggplot2, which can take some time). Both of these dramatically improve reproducibility by making debugging, rerunning, and updating this codebase easier. 

Archiving analyses on git

When I am ready to analyze the data from an experiment (or sometimes even before), I have started setting up a git repository on github.com. It took me a little while to get the hang of this practice, but now I am convinced that it is overall a huge time-saver. (A good tutorial is available here). The repository for an experimental project is initialized with the original datafile(s) that I collect, e.g. the eye-tracking records, behavioral testing software outputs, or logfiles, suitably anonymized. These source datafiles should remain unchanged throughout the lifetime of the analysis – confirmed by their git history.

I work on a local copy of that repository and push updates back to it so that I always have the analysis backed up. (I've begun doing all my analyses in the open on GitHub, but academic users can get free private repositories if working in public makes you uncomfortable.) This strategy helps me keep track of the original data files, intermediate processed and aggregated data, and the analysis code, all in one place. So at its most basic it's a very good backup.

But managing data analysis through git has a couple of other virtues, too:
  • The primary benefits of version control. This is the obvious stuff for anyone who has worked with git or subversion before, but for me as a new user, this was amazing! Good committing practices – checking in versions of your code regularly – mean that you never have to have more than one version of a file. For example, if you're working on a file called "analysis.R," you don't have to have "analysis 4-21-14 doesn't quite work.R" and "analysis 4-22-14 final.R." Instead, "analysis.R" can reflect in its git history many different iterations that you can browse through whenever you want. You can even use branches or tags to keep track of multiple different conflicting approaches in the same file. 
  • Transparency within collaborations. Your collaborators can look at what you are doing while the analysis is ongoing, and they can even make changes and poke around without upsetting the applecart or creating totally incommensurable analysis drafts. This transparency can dramatically reduce sharing overhead and crosstalk between collaborators in a large project. It also means that it is way easier for authors to audit the analysis on a paper prior to submission – something that I think should probably be mandatory for complex analyses. 
  • Ease of sharing analyses during the publication and review process. When you're done – or even while analysis is still ongoing – you can share the repository with outsiders or link to it in your publications. Then, you can post updates to it if you have corrections or extensions, and new viewers will automatically see these rather than having to track you down. This means sharing your data and analysis is always as simple as sharing a link – no need to hunt down a lot of extra dependencies and clean things up after the fact (something that I suspect is a real reason why many data sharing requests go unanswered).
The open git analysis approach is not perfect for all datasets – the examples that come to mind are confidential data that cannot easily be anonymized (e.g. transcripts with lots of identifying information) and neuroimaging, where the data are too large to push back and forth to external repositories all the time. But for a wide range of projects, this can be a major win.

Conclusion

It takes an initial, upfront investment to master both git and R. Neither is as easy as using pivot tables in Excel. But the payoff is dramatic, both in terms of productivity and in terms of reproducibility. There are further steps you can take if you are really committed to documenting every step of your work, but I think this is a reasonable starting point, even for honors students or beginning graduate students. For any project longer than a quick one-off, I am convinced that the investment is well worthwhile.

Of course, I don't mean to imply that you can't do bad or irreproducible research using this ecosystem – it's very easy to do both. But I do believe that it nudges you towards substantially better practices than tools like Excel and SPSS. And sometimes a nudge in the right direction can go a long way towards promoting the desired behavior.

Saturday, April 19, 2014

Slides for Stanford Autism Update

I'm giving a talk at the Stanford Autism Update today. My slides can be found here: http://bit.ly/1nzhM7V. This is my first time using figshare – certainly seems easy.

Tuesday, April 1, 2014

Assessing cognitive models: visualization strategies

(This post is written in part as a reference for a seminar I'm currently teaching with Jamil Zaki and Noah Goodman, Models of Social Behavior).

The goal of research in cognitive science is to produce explicit computational theories that describe the workings of the human mind. To this end, a major part of the research enterprise consists of making formal artifacts – computational models. These are artifacts that take as their inputs some stimuli, usually in coded form, and produce as their outputs some measures that are interpretable with respect to human behavior. 

In this post I'll discuss a visualization-based strategy for assessing the fit of models to data, based on moving between plots at different levels of abstraction. I often refer to this informally as "putting a model through its paces." 

I take as my starting point the idea that a model is successful if it is both parsimonious itself and if it provides a parsimonious description of a body of data. To unpack this a bit more, the basic idea is that any model is created as a formal theory of some set of empirical facts. If you know the theory, you can predict the facts – and so you don't need to remember them, because they can be re-derived. A model can fail because it's too complicated – it predicts the facts, but at the cost of having so many moving parts that it is itself hard to remember. Or it can fail because it is consistent with all patterns of data – and hence doesn't compress the particular pattern of data much at all. (I've discussed this "minimum description length" view of cognitive modeling in more depth both in this post and in my academic writing here and here. Note that it's consistent with a wide range of modeling formalisms, from neural networks to probabilistic models).

So how do we assess models within this framework, and how do we compare them to data? Although the minimum description length framework provides a guiding intuition, it's not that easy to say exactly how parsimonious a particular model is; there are actually fundamental mathematical difficulties with computing parsimony. But there are nevertheless many methods for statistical comparison between models, and these can be very useful – especially when you are using models that are posed in coherent and equivalent vocabularies. Here are great slides from a tutorial that Mark Pitt and Jay Myung gave at CogSci a couple of years ago on model comparison.

My focus in this post is a bit more informal, however. What I want to do is to discuss a set of plots for model assessment that can be used together to gain understanding about the relationship of a model or set of models to data:

  1. A characteristic model plot, one that lets you see details of the model's internal state or scoring so that you can understand what it has learned or why it has produced a particular result. 
  2. A plot of model results across conditions or experiments, in precisely the same format as the experimental data are typically plotted. 
  3. A scatter plot of model vs. data for comparing across experiments and across models.
  4. A plot of model fit statistics as parameters are varied. 

In each of these, I've used examples from the probabilistic models literature, taken mostly from my work and the work of my collaborators. This choice is purely because I know these examples well, not because of anything special about these examples. The broader approach is due to Andrew Gelman's philosophy of exploratory model checking (on display, e.g. in this chapter).

1. Characteristic Model Plot

The first plot I typically make when I am working on a model has the goal of understanding why the model produces a particular result when it is given a certain pattern of input data. This plot is often highly idiosyncratic to the particular representation or model framework that I am using – but it gives insight into the guts of the model. It typically doesn't include the empirical data.

Here is one example, from Frank, Goodman, & Tenenbaum (2009). We were trying to understand how our model of word learning performed on a task that's sometimes called "mutual exclusivity" that had been used in the language acquisition literature. The task is simple: you show a child a known object (e.g. a BIRD) and a novel object (a DAX), and you say "can you show me the dax?" Children in this task reliably pick the novel object. 


Our model made this same inference, but we wanted to understand why. So we chose four different hypotheses that the model could have about what the word "dax" meant (represented at the top of each of the four panels) and computed the model's scores for each of these. In our model, the posterior score of a hypothesis about word meanings was the product of the prior probability of the lexicon, the probability of a corpus of input data, and the probability assigned in this specific experimental situation. So we plotted each of those numbers on a relative probability scale such that we could easily compare them. The result was an interesting insight: The major reason why the model preferred lexicon B (the one consistent with the data) was that it placed higher probability on previous utterances in which the word "dax" hadn't been heard before even though the BIRD object had been seen. It was this unseen data that made it odd to think that "dax" actually did mean BIRD after all (lexicons C and D).

Another example of this kind of plot comes from a follow-up to this paper (Lewis & Frank, 2013), again looking at the "mutual exclusivity" phenomenon. In this case, we plotted the relative probabilities assigned to different lexicons in a simple bar graph (with shading indicating different priors), but we used a graphical representation of the lexicon as the axis label. The generalization that emerged from this plot is that the lexicon where each word is correctly mapped one-to-one (the middle lexicon, with one gray bar and one green bar) is preferred, almost regardless of the prior that is used.


In general, visualizations of this class will be very project- and model-specific, because they will depend on the relevant aspects of the model that you find most informative in your explanations. Nevertheless, they form a crucial tool for diagnosing why the model produced a particular result for a given input configuration.
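
These plots don't follow a single recipe, but in ggplot2 a quick version is often just a faceted bar plot of the model's internal scores. Here is a minimal sketch, with entirely made-up hypotheses and numbers standing in for a real model's scoring:

library(ggplot2)

# invented decomposition of each hypothesis's score into components
scores <- data.frame(
  hypothesis = rep(c("A", "B", "C", "D"), each = 3),
  component  = rep(c("prior", "corpus", "situation"), times = 4),
  rel_prob   = c(.30, .20, .25,  .35, .60, .55,  .20, .10, .15,  .15, .10, .05))

# one panel per hypothesis, one bar per score component
ggplot(scores, aes(x = component, y = rel_prob)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ hypothesis, nrow = 1)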

2. Data-space plot

This style of plot is very important for evaluating the correspondence of a model to human data. It's the first thing I typically do after getting a model working. I try to produce predictions about what the experimental data should look like, on the same axes (or at least in the same format) as the original plots of the experimental data. 

Here is one example, from a paper on word segmentation (Frank et al., 2010). We tracked human performance in statistical word segmentation tasks while we varied parameters of the language the participants were segmenting in three different experiments. We then examined the fit of a range of models to these data. The human data here are abstracted away from any variability and are just solid curves repeated across plots; the model predictions make clear that while there is some variability in performance on the first two experiments, it is Experiment 3 where all models fail:


Here's a second example, this one from Baker, Saxe, & Tenenbaum (2009). They had participants judge the goals of a cartoon agent at various different "checkpoints" along a path. For example, in panel (a), condition 1, the agent wended its way around a wall and then headed for the corner. At each numbered point, participants made a judgment about whether the agent was headed towards goal A, B, or C. In panel (b), experimental data show graded ratings for each goal, and in panel (c) you can compare model-derived ratings:


The general point here is that visualization is about comparison (a point I identify with William Cleveland but don't have a good citation for). This plot makes it easy to compare the gestalt pattern of model vs. experimental data in a format where you can readily identify the particular conditions under which deviations occur – sometimes even across models.
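
In practice, this kind of plot often amounts to stacking the model predictions and the human means into a single data frame and faceting (or overlaying) by source. A sketch with invented numbers:

library(ggplot2)

df <- data.frame(
  condition = rep(c("exp1", "exp2", "exp3"), times = 2),
  source    = rep(c("human", "model"), each = 3),
  correct   = c(.85, .72, .64,   # invented human means
                .82, .75, .51))  # invented model predictions

# same axes and format for model and data, so deviations are easy to spot
ggplot(df, aes(x = condition, y = correct, group = source)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ source)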

3. Cross-dataset, cross-model plot

These plots are more obvious and more conventional. You plot model predictions on the horizontal axis and human behavior on the vertical, resulting in a scatter plot in which tighter clustering around the diagonal indicates a closer correspondence between model and data. Of course, the result can be quite deceptive because it obscures the model's degrees of freedom. But this sort of plot is a simple and powerful tool for quickly assessing fit. Here's one nice one:


This example comes from Orban et al. (2008), a paper on visual statistical learning. They plot the log probability ratio of test items against proportion correct in the human experiments. Each datapoint is a separate condition, and the key relates each datapoint back to the experimental data, so that you can look up which points don't fit as well. When possible it's nice to plot the actual labels on the axes so that you don't have to use a key, but sometimes this strategy can get overly messy. A minor note: I really like to plot confidence intervals (rather than standard error) here so that we can better assess by eye whether variability is due to measurement noise or to a truly incorrect prediction by the model.

This sort of plot can also be very good at comparing across models, because you can see at a glance both which model fits better and which points are negatively affecting fit. Here's another version, from Sanjana & Tenenbaum (2002). The data aren't labeled here, but I like both the matrix of small plots (the grid is different models on the columns, different experimental conditions on the rows) and the prominent reporting of the correlation coefficients in each plot:


One interesting thing that you can see by presenting the data this way is that there are sets of conditions that don't vary in their model predictions (e.g. the bunches of dots on the right side of the center panel). As an analyst I would be interested to know whether these conditions are truly distinct experimentally and getting lumped inappropriately by this model. My next move would probably be to make plots #1 and #2 with just these conditions included. More generally, I find it extremely important to have the ability to move flexibly back and forth between more abstract plots like this one and plots that are more grounded in the data.
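
A generic version of this scatter plot is easy to build in ggplot2. Here is a sketch with invented numbers, including confidence intervals on the human means, direct point labels, the y = x reference line, and the correlation coefficient:

library(ggplot2)

fits <- data.frame(
  model_pred = c(.20, .40, .55, .70, .90),   # invented model predictions
  human_mean = c(.25, .35, .60, .65, .85),   # invented human means
  ci_lower   = c(.18, .28, .52, .58, .79),
  ci_upper   = c(.32, .42, .68, .72, .91),
  label      = c("c1", "c2", "c3", "c4", "c5"))

ggplot(fits, aes(x = model_pred, y = human_mean)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +   # perfect fit
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper)) +
  geom_text(aes(label = label), hjust = -0.4) +                  # label points directly
  annotate("text", x = .3, y = .85,
           label = paste("r =", round(cor(fits$model_pred, fits$human_mean), 2)))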

4. Parameter sensitivity plots

Even the simplest cognitive models typically have some parameters that can be set with respect to the data (free parameters). And all of the plots described above produce an output given some settings of the model's free parameters, allowing you to assess fit to data. But, as a number of influential critiques point out, a good fit is not necessarily what we should be looking for in a cognitive model. I take these critiques to be largely targeted at excessive flexibility in models – which can lead to overfitting. 

So an important diagnostic for any model is how its fit to data varies in different parameter regimes. Showing this kind of variability can be tricky, however. Sometimes parameters are cognitively interesting, but in other circumstances they are not – yet it is important to explore them fully. 

These plots typically show either model performance (in the case of a small number of conditions) or a summary statistic capturing goodness of fit or performance (in the case of more data) as a function of one or more free parameters. The goodness-of-fit statistic is usually a measure like mean squared error (deviation from human data) or a simple Pearson correlation with the data, but it can be any number of other summary statistics as well. In models with only one or two free parameters, visualizing the model space is not too difficult, but as the model space balloons, visualization can become difficult or impossible.

Here is one example, from a paper I wrote on modeling infants' ability to learn simple rules:


It's a combination of a #4 plot (left) and a #2 plot (middle and right). On the left, I show model predictions across five conditions as alpha (a noise parameter) was varied. The salient generalization from this plot is that the ordering of conditions stays the same, even though the absolute values change. The filled markers call out the parameter value that is then plotted in the middle as a bar graph for easy comparison with the right-hand panel. As you can see, the fit is not perfect, but the relative ordering of conditions is similar. 

Here is another example, one that I'm not proud of from a visualization perspective (especially as it uses the somewhat unintuitive matlab jet colormap):


This comes from our technical report on the word learning model described above. The model had three free parameters: alpha, gamma, and kappa. This plot shows a heatmap of f-score (a measure of the model's word learning performance) as a function of all three of those continuous parameters, with the maxima for each panel labeled. Although this isn't a great publication-ready graphic, it was very useful to me as a diagnostic of the parameter sensitivity of the model – and it led us to use a parameter-reduction technique to try to avoid this problem. 

Parameter plots may not always make for the best viewing, but they can be an extremely important tool for understanding how your model's performance varies in even a high-dimensional parameter space. 
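
For a model with two free parameters, a simple version of this plot is just a heatmap of the fit statistic over a parameter grid. Here is a sketch in which a made-up function stands in for actually running the model at each parameter setting:

library(ggplot2)

# grid over two hypothetical parameters
grid <- expand.grid(alpha = seq(0.1, 1, by = 0.1),
                    gamma = seq(0.1, 1, by = 0.1))

# stand-in for "run the model here and compute its correlation with the data"
grid$fit <- with(grid, exp(-((alpha - 0.4)^2 + (gamma - 0.7)^2) / 0.1))

ggplot(grid, aes(x = alpha, y = gamma, fill = fit)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue")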

Conclusions

I've tried to argue here for an exploratory visualization approach to model-checking for cognitive models especially. The approach is predicated on having plots at multiple levels of abstraction, from diagnostic plots that let you understand why a particular datapoint was predicted in a certain way, all the way up to plots that let you consider the stability of summary statistics throughout parameter space. It is not always trivial to code up all of these visualizations, let alone to create an ecosystem in which you can move flexibly between them. Nevertheless, it can be extremely useful in both debugging and in gaining scientific understanding. 

Thursday, March 27, 2014

Making messes brings babies closer to people

(Via astrorhysy).

M loves increasing entropy in the world – making messes. She is attracted to order: stuff in a basket, books on a shelf, or a pile of freshly folded clothing. She crawls over as fast as her little limbs can go, and begins sowing the seeds of disorder. I'm not the only one who has made this observation. But a study by Newman et al. (2010) suggests that maybe one reason that babies are so attracted to order is that they see it as related to people, since people tend to be the primary sources of order in the world. So maybe M's drive to explore orderly things is related to her deep interest in understanding and sharing attention with other people.

In the Newman et al. study, the researchers used the violation of expectation method to test whether 7- and 12-month-old infants had a sense that order is something that is caused by people, rather than inanimate objects. Babies in the study saw videos of an animated ball roll towards a set of blocks that was covered by a screen. When the screen went down, it was revealed that the ball had either sorted or un-sorted the blocks.

In one condition, the ball had some cues that – according to other research in this tradition – should cause babies to think it is an animate agent (a person, more or less): it had eyes, and it seemed to move by itself in a way that indicated it was self-propelled. In another condition, it was just a ball and it rolled across the screen without stopping.

In the animate condition, 12-month-old infants didn't seem to have a strong expectation about what the ball would do and looked equally at both outcomes. In contrast, in the inanimate condition, they looked longer (indicating a violation of expectation) when the ball made the disorderly set of blocks more orderly. The seven-month-olds didn't show any systematic looking differences. A second experiment showed a conceptual replication of this finding using the contrast of a claw and a hand – again infants seemed to expect the claw to be more likely to create disorder than order.

So perhaps M's – and other babies' – interest in order stems from a general interest in people and the patterns they leave in their environment. Maybe when she sees a bookshelf full of books, just ripe for throwing on the ground, she thinks to herself, "I wonder who did that?"


Newman, G. E., Keil, F. C., Kuhlmeier, V. A., & Wynn, K. (2010). Early understandings of the link between agents and order. Proceedings of the National Academy of Sciences, 107(40), 17140-17145. PMID: 20855603

Sunday, March 23, 2014

How does sleep training work?


(A baby uninterested in sleep.)

About two months ago, M began waking up even more often in the middle of the night and it became hard to get a decent night's sleep. (Things had never been great, but they got noticeably worse). We decided to sleep train. Sleep training for infants is a very emotionally-charged topic for parents. Along with its strong advocates, you can find people arguing that sleep training is child abuse.

M's mom and I found a meta-analysis suggesting the efficacy of a variety of sleep-training techniques and a strong recent study that found no major negative effects. In addition, many of the critiques seemed like misreadings of the literature on infant attachment. So we decided to go for it. Based on a survey of friends, blogs, and our own conscience, we selected the "graduated extinction" method – also known as "Ferberization" – where you let the baby cry for gradually longer intervals before providing comfort. (I found Ferber's book to be the clearest and best-written of the baby sleep guides, as well).

The process was fairly traumatic for us. But it was comforting to see that M showed no evidence of caring in the morning – she still seemed to like us just fine. And after a few nights, she went to bed without crying; after a few more she slept through the night without waking. This amazing result has continued (with a few interruptions) for about a month and a half. It's an exaggeration to say that either of us is well-rested, but the situation is markedly improved.

Now on the other side of this process, it's easy for me to say that – despite many stories of cases where sleep training doesn't work well for individual kids – the evidence for population-level efficacy seems pretty clear. What I found astonishing, though, is how little discussion there is of why it works. In addition, the explanation that is on offer – Ferber's own – doesn't make much sense. It uses the language of behaviorism, but on a closer examination, it isn't consistent with actual behaviorism at all.

Ferber gives the following preamble to his instructions:
The goal of this approach is to help your child learn a new and more appropriate set of associations with falling asleep so that when he wakes in the middle of the night he will find himself still in the same conditions that were present at bedtime, conditions that he already is used to falling asleep under. But, to do this, you must first identify the pattern of associations that is currently interfering with his sleep (and yours) and which he must unlearn.
This seems straightforward: the child has associations with falling asleep – being rocked, maybe having a bottle, perhaps lying upright on your shoulder – that prevent him or her from falling asleep in the crib. You break these associations and the child learns to "self-soothe" and put him- or herself to sleep.

This analysis sounds reasonable. But fundamentally, its reasonableness is post hoc, like a lot of other psychological explanations. The technique works, so we accept the explanation. But there are many similar explanations that we would accept if they were linked to a technique that worked. And if the technique didn't work, we'd easily discard the explanation.

A big part of the prima facie reasonableness of the Ferber explanation is that it's couched in language we know: the associative language of classical (Pavlovian) conditioning. In classical conditioning,  a tone (conditioned stimulus: CS) is paired with a shock (unconditioned stimulus: US) to induce a fear response (conditioned response: CR). Eventually, the tone produces the fear response without the shock. (Somewhat) similarly, cuddling is associated with sleep. Eventually, cuddling becomes prerequisite for sleep. To allow for sleep without cuddling, this association must be "extinguished."

But there are a number of places where this analysis breaks down:

  • What's the US (the shock) in this case? Is it drowsiness? Presumably the baby still needs to be drowsy to fall asleep after cuddling – though the cuddling might help the process along. So the US isn't ever out of the loop here.
  • Sleep isn't a stimulus-triggered aversive behavior like fear-based freezing in rats. Drowsiness is a feeling that is generated by the baby herself, more like hunger or thirst. I'm definitely not an expert on classical conditioning, but a quick reading of the literature suggests that it may not be so easy to condition appetites.
  • Conditioning of a response to some cue doesn't mean that the response cannot be triggered by another CS (though there may be some weakening), and it especially doesn't mean that the US is no longer powerful. For example, a rat conditioned to freeze in response to a tone will still freeze in response to a shock. So why, on this analysis, can't babies sleep without "sleep associations"? Presumably they still get drowsy.
  • Finally, nothing anywhere in the conditioning analysis predicts crying as an outcome of not having the cues for sleep. I guess the idea is that the baby wants to sleep, but doesn't have the cues to allow sleep, and then is frustrated? But that extension of the explanation definitely isn't congruent with a conditioning analysis. Instead it reflects some kind of second-order response: frustration about the mismatch between the baby's needs and what's happening (I want to sleep but I can't). 
So despite the superficial theoretical trappings of behaviorism, it doesn't seem like a standard conditioning analysis predicts many of the salient parts of the phenomenon. 

Perhaps I haven't lined up the pieces right here. Perhaps there is actually a classical conditioning analysis that makes sense, but my still-somewhat-sleep-deprived mind hasn't been able to work it out. If so, I'd love to hear it. In the absence of that kind of analysis, though, the overall success of the Ferber method is even more magical. 

Wednesday, February 12, 2014

"Psychological plausibility" considered harmful

"goto" statement (programiz.com)
goto is considered harmful in programming languages.


The fundamental enterprise of cognitive science is to create a theory of what minds are. A major component of this enterprise is the creation of models – explicit theoretical descriptions that capture aspects of mind, usually by reference to the correspondence between their properties and some set of data. These models can be posed in a wide variety of frameworks or formalisms, from symbolic architectures to neural networks and probabilistic models.

Superficially, there are many arguments one can make against a particular model of mind. You can say that it doesn't fit the data, that it's overfit, that there are many possible alternative models, that it predicts absurd consequences, that it has a hack to capture some phenomenon, that it has too many free parameters, and so forth. But nearly all of these superficially different arguments boil down to well-posed statistical criticisms.

Consider a theory to be a compression of a large mass of data into a more parsimonious form, as in the "minimum description length" framework. For a given set of data, the total description length combines the length of the theory (including its parameters) with some metric of how much the data deviate from the theory's predictions. Under this kind of setup, the critiques above boil down to the following two critiques:
  1. There's a theory that compresses the data more, either (a) by having fewer free parameters, or (b) by being overall more parsimonious. 
  2.  If we add more data, your theory won't compress the new data well at all, where the new data are either (a) other pre-existing experiments that weren't considered in the initial analysis, or (b) genuinely new data ("model predictions"). Concerns about overfitting and generalization fall squarely into this bucket. 
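Put schematically (this is my own gloss on the setup, not notation from any particular paper), the quantity being traded off in these two critiques is something like

    DL(theory, data) = L(theory) + L(data | theory)

where the first term measures the complexity of the theory and its parameters, and the second measures how badly the data deviate from the theory's predictions. Critique (1) says a rival theory achieves a smaller first term for a comparable second term; critique (2) says the second term blows up once new data are considered.
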
Of course, we don't have a single description language for all theories, and so it's often hard to compare across different frameworks or formalisms. But within a formalism, it's typically pretty easy to say "this change to a model increased or decreased fit and/or parsimony." In linear regression, AIC and BIC are metrics for doing this sort of model comparison and selection. In the general Bayesian or statistical framework, the tradeoff between parsimony and fit to data is a natural consequence of the paradigm and has been formalized to good effect.

In this context, I want to call out one kind of critique as distinct from this set: the critique that a model is not "psychologically plausible." In my view, any way you read this kind of critique, it's harmful and should be replaced with other language. Here are some possible interpretations:

1. "Model X is psychologically implausible" means "model X is inconsistent with other data." This is perhaps the most common argument from plausibility. For example, "your model assumes that people can remember everything they hear."  Often this is an instance of argument (2a) above, only with an appeal to uncited, often non-quantitative data, so it is impossible to argue against. If there is an argument on the basis of memory/computation limits, citing the data makes it possible to have a discussion about possible modifications to model architecture (and the rationale for doing so). And often it becomes clearer what is at stake, as in the case of e.g. asking a model of word segmentation to account for data about explicit memory (discussion here and here) when the phenomenon itself may rely on implicit mechanisms.

2. "Model X is psychologically implausible" means "model X doesn't fit with framework Y." Different computational frameworks have radically different limitations, e.g. parallel frameworks make some computations easy while symbolic architectures make others easy. Consider Marr & Poggio's 1976 paper on stereo disparity, which shows that a computation that could be intractable using one model of "plausible" resources actually turns out to be very doable with a localist neural net.* We don't know  what the brain can compute. Limiting your interpretation of one model by reference to some other model (which is in turn necessarily wrong) creates circularity. Perhaps these arguments are best thought of as "poverty of the imagination" arguments.

3. "Model X is psychologically implausible" means "model X is posed at a higher/lower level of abstraction than other models I have in mind." To me, this is a standard question about the level of abstraction at which a model is posed – is it at the level of what neurons are doing, what psychological processes are involved, or the structure of the computation necessary in a particular environment. (This is the standard set of distinctions proposed by Marr, but there may even be other useful ones to make). As I've recently argued, from a scientific perspective I think it's pretty clear we want descriptions of mind at every level of abstraction. Perhaps some of these arguments are in fact requests to be clearer about levels of description (or even rejections of the levels of description framework).

In other words, arguments from psychological plausibility are harmful. Some possible interpretations of such arguments are reasonable – that a model should account for more data or be integrated with other frameworks. In these cases, the argument should be stated directly in a way that allows a response via modification of the model or the data that are considered. But other interpretations of plausibility arguments are circular claims or confusions about level of analysis. Either way, such arguments lump together a number of different possibilities without providing the clarity necessary to move the discussion forward.

---
Thanks to Ed Vul, Steve Piantadosi, Tim Brady, and Noah Goodman for helpful discussion, and * = HT Vikash Mansinghka. (Small edits 2/12/14.)