*tl;dr: Adventures in using different platforms/methods to fit drift diffusion models to data.*

The drift diffusion model (DDM) is increasingly a mainstay of research on decision-making, both in neuroscience and cognitive science. The classic DDM defines a pseudo random-walk decision process that describes a distribution on both accuracies and reaction times. This kind of joint distribution is really useful for capturing tasks where there could be speed-accuracy tradeoffs, and hence where classic univariate analyses are uninformative. Here's the classic DDM picture, this version from Vandekerckhove, Tuerlinckx, & Lee (2010), who have a nice tutorial on hierarchical DDMs:

We recently started using DDM to try and understand decision-making behavior in the kinds of complex inference tasks that my lab and I have been studying for the past couple of years. For example, in one recently-submitted paper, we use DDM to look at decision processes for inhibition, negation, and implicature, trying to understand the similarities and differences in these three tasks:

We had initially hypothesized that performance in the negation and implicature tasks (our target tasks) would correlate with inhibition performance. It didn't, and what's more the data seemed to show very different patterns across the three tasks. So we turned to DDM to understand a bit more of the decision process for each of these tasks.* Also, in a second submitted paper, we looked at decision-making during "scalar implicatures," the inference that "I ate

*some*of the cookies" implies that I didn't eat

*all*of them. In both of these cases, we wanted to know what was going on in these complex, failure-prone inferences.

An additional complexity was that we are interested in the development of these inferences in children. DDM has not been used much with children, usually because of the large number of trials that DDM seems to require. But we were inspired by a recent paper by Ratcliff (one of the important figures in DDMs), which used DDMs for data from elementary-aged children. And since we have been using iPad experiments to get RTs and accuracies for preschoolers, we thought we'd try and do these analyses with data from both kids and adults.

But... it turns out that it's not trivial to fit DDMs (especially the more interesting variants) to data, so I wanted to use this blogpost to document my process in exploring different ecosystems for DDM and hierarchical DDM.

#### RWiener

In the two papers linked above, we used the RWiener package to fit DDM parameters independently for each subject and condition, via maximum likelihood. It's very nice to have a package like RWiener to do the work of assessing the diffusion model for you, but all the same, it is showing its age. It's a very "base R" package – meaning that it doesn't easily play nice with a lot of the new dplyr ecosystem that I find so helpful in the rest of my R coding.For example, to fit models, you have to feed each participant's data (as a data.frame, not data_tbl) into an optimizer with the variables named just right (e.g., the accuracy data needs to be called "resp" and needs to be a character vector with labels "upper" and "lower" for the boundary of the DDM that is being hit). Then the optimizer returns nameless parameter values. Here's one hacky/annoying way to do this process within dplyr:

params <- d %>% group_by(subid) %>% do(data.frame(value = optim(c(1, .1, .1, 1), wiener_deviance, dat = data.frame(select(., q, resp)), method = "Nelder-Mead")$par, param = c("a","t","b","v")))But in addition to these minor annoyances, the more major issue is that we were trying to fit DDMs with a very small amount of data. Although each child in the inhibition/negation/implicature paper gave us 180 trials (and that's a ton for a preschooler!), it's still much less than previous DDM studies have suggested is necessary. Our approach is essentially a "no pooling" approach – we estimate parameters completely independently and compute inferential statistics across participants. This should be pretty conservative, and indeed we do get some consistency in coefficients across participants, with decent samples (N=48 adults, N=66 kids).

Even though this approach seemed like a reasonable starting point, we still thought that it would be better to use hierarchical models. Hierarchical models can be immensely helpful in cases where you have limited data because they shrink individual participants' parameter estimates towards the mean. When individuals are very variable this can really help constrain parameter estimates (Vandekerckhove, Tuerlinckx, & Lee, 2010, is all about this issue). In addition, we thought that estimating developmental trends within the model might be more accurate than estimating parameters independently and then fitting regression models to those parameters outside the model. Finally, parameters in DDMs can be highly correlated, and using maximum likelihood (as in RWiener) can mask these correlations.

#### HDDM in Stan

My first attempt at writing down an HDDM used Stan. Stan is a new-ish probabilistic programming language with an easy-to-use description language, generally good documentation, and generally good performance. Folks in my lab and I have been using it recently for a variety of applications (e.g. this model), and have generally found it to be a great tool. When I was first starting out with probabilistic programming, we wrote all of the samplers for our models by hand. This practice was very labor-intensive, and extremely error-prone. You basically never knew when your sampler was wrong (there's a paper by Geweke that suggests this even happens to very experienced statisticians). But now that you can write a model specification and have the inference computed generically, at least for some (pretty big) class of models. Unfortunately, I ran into a bunch of trouble getting various HDDM variants to work in Stan.First, while there is a wiener diffusion process implemented in Stan, the documentation is a bit limited. I found some useful tricks in this thread, but once I got the models up and running, they were really pretty slow. I suspect some of the key slowness comes because of the statement of the models, e.g. you have to loop over observations because this parameterization requires an "if" to distinguish correct and incorrect responses:

for (n in 1:N) { if (resp[n] == 1) { y[n] ~ wiener(bound_sep, nondecision, bias, delta[sub[n]]); } else { y[n] ~ wiener(bound_sep, nondecision, 1-bias, -delta[sub[n]]); } }

(This code is also tricky in that in cases of error, it treats the upper bound like the lower bound by flipping both the bias and the drift rate). In addition, I had trouble with a bunch of convergence issues (though I did discover ggmcmc, a very nice package for plotting Stan and JAGS samples, in the process).

I also found that there is some ambiguity in the literature in how to parameterize HDDMs. For example, Vandekerkhove et al. recommend including a trial-level variability parameter. But in the supplemental information for the article, they note that the trial-level parameter can lead to unstable inference, and provide a different distribution primitive available in BUGS/JAGS (wiener.eta) that allows parameterization of the DDM with a trial-level standard deviation rather than actually fitting trial-level parameters. So I wasn't sure how to handle this issue in Stan – just omit the trial-level parameter? Or actually fit these parameters per participant, which slowed things down even further?

The other thing about HDDM in Stan is that Stan is actually a terrible use-case for RMarkdown. My typical method for writing RMarkdown documents is that I write code chunk by chunk and execute it in the standard REPL. Then, when I'm done and want things to look pretty, I re-knit and publish to RPubs or github. The problem is that the knitted version and the REPL version don't share state with one another. So when I am running a Stan model that takes say 20 minutes to run, all the computation I do in exploring it is lost (unless I save an image of it and reload it, which is a hack that kind of defeats the purpose of the whole project). In sum, I suspect all of these issues are solvable, but I sunk quite a bit of time into HDDM in Stan and wasn't able to get much in the way of satisfying results.

#### HDDM in Python

Wiecki, Sofer, & Frank (2013, WTF) present a python package using PyMC that fits a wide variety of HDDMs and gets generally good reviews from lots of folks. (Why didn't I start here?). Maybe the most awkward thing about using this version of HDDM is that the last author is Michael J. Frank, a very nice guy at Brown who is not me. This HDDM was quite easy to get up and running: I installed Anaconda, then used it to install the hddm package.This package is substantially faster than my HDDMs in Stan. I am not sure what the exact factor is, as I'm not using the same parameterization, but certainly for smaller datasets it seems quite snappy. Also, slowdowns seem to come when fitting large numbers of parameters, not large amounts of data, which I suspect differs from the Stan version because of the "if" statement in the current implementation.

In addition, using Jupyter notebooks for Python HDDM works much better than the RMarkdown framework for Stan, because the computation for each model gets cached in the actual document framework. (The static HTML is shareable, also, which means that there doesn't seem to be a downside). Lots of people like Jupyter for this kind of compute-intensive work, and I can see why.

I did find the WTF paper a bit unclear on the precise form of the models they fit, however.** They give a little less detail then would be wanted about how trial-level variability parameters are used (e.g., how does

*sv*influence the diffusion process?). But the models are quite flexible, and adding stimulus-level conditions is extremely easy. Regression-style models are a bit more difficult, but seem within reach.

Using this approach, my individual parameter estimates match quite nicely to those provided by the independent RWiener analysis. Here's a diffusion plot from our paper for the implicature task:

These are just drift (v) and boundary (a) separation. You can see that the posterior means for boundary separation line up nicely with the maximum likelihood (~1.5 and ~2 respectively). Drift rates (slopes) also match pretty nicely (~1.8 and ~2.8 in the ML analysis). This analysis (with either method) highlights an interesting finding: implicature shows slower drift and smaller boundary separation, suggesting more variability in the computation, even though the RT means are identical. More generally, I'm hoping that these techniques give us some greater leverage on questions about the decision processes underlying some more complex linguistic inferences.

---

* Note, our first hypothesis tests and explorations were conducted on an independent sample from the one we report in the paper. In that second sample, the hypotheses and use of DDM were a priori, not post hoc.

** Their Figure 2 has subscripts flipped with the text of the paper, so

*i*is for subjects in the text but trials in the figure.

Awesome! Thanks for sharing these explorations. Could the Stan vs. PyMC speed difference be due to NUTS vs Metropolis?

ReplyDeleteYes, definitely could be sampler, or sampler implementation, or the model structure. Very tough to tell because I wasn't able to match model forms exactly.

ReplyDeleteNice write-up! I just wanted to add that the Stan implementation isn't my favorite (as I pointed out in that Google group, the odd omission of the binary outcome as a parameter is a problem for several reasons). You can probably speed it up by splitting the data set into a "corrects" and "errors" vector before feeding it to Stan, and then dispense with the if-block.

ReplyDeleteThat said, until the Stan implementation gets looked at properly, the JAGS implementation is the most stable/useful around.

There's a lot to be said about the trial-to-trial variability parameters, but I tend to avoid them as much as I can. They're not only numerically tricky, they're hard to identify, and if you implement them as we did in one example (with unique nodes for each trial) you risk building a near-unfalsifiable model (which can still be useful for measurement, mind you...).

Thanks very much, Joachim. In retrospect I probably should have started with JAGS - but having switched to Stan I am now more used to its syntax etc.

ReplyDeleteThanks also for all the work you have done on these software implementations, I know the research community appreciates it!

Agreed about the trial-level variability issue, also. I suspect this will be a moot point in our dataset anyway, as I have just read Ratcliff and Childer's HDDM simulations that suggest this variability is undetectable with our number of trials (http://star.psy.ohio-state.edu/coglab/People/roger/pdf/lownfinaldec14.pdf).