Monday, February 26, 2018

Mixed effects models: Is it time to go Bayesian by default?

(tl;dr: Bayesian mixed effects modeling using brms is really nifty.)

Introduction: Teaching Statistical Inference?

How do you reason about the relationship between your data and your hypotheses? Bayesian inference provides a way to make normative inferences under uncertainty. As scientists – or even as rational agents more generally – we are interested in knowing the probability of some hypothesis given the data we observe. As a cognitive scientist I've long been interested in using Bayesian models to describe cognition, and that's what I did much of my graduate training in. These are custom models, sometimes fairly difficult to write down, and they are an area of active research. That's not what I'm talking about in this blogpost. Instead, I want to write about the basic practice of statistics in experimental data analysis.

Mostly when psychologists do and teach "stats," they're talking about frequentist statistical tests. Frequentist statistics are the standard kind people in psych have been using for the last 50+ years: t-tests, ANOVAs, regression models, etc. Anything that produces a p-value. P-values represent the probability of the data (or any more extreme) under the null hypothesis (typically "no difference between groups" or something like that). The problem is that this is not what we really want to know as scientists. We want the opposite: the probability of the hypothesis given the data, which is what  Bayesian statistics allow you to compute. You can also compute the relative evidence for one hypothesis over another (the Bayes Factor).  

Now, the best way to set psychology twitter on fire is to start a holy war about who's actually right about statistical practice, Bayesians or frequentists. There are lots of arguments here, and I see some merit on both sides. That said, there is lots of evidence that much of our implicit statistical reasoning is Bayesian. So I tend towards the Bayesian side on the balance <ducks head>. But despite this bias, I've avoided teaching Bayesian stats in my classes. I've felt like, even with their philosophical attractiveness, actually computing Bayesian stats had too many very severe challenges for students. For example, in previous years you might run into major difficulties inferring the parameters of a model that would be trivial under a frequentist approach. I just couldn't bring myself to teach a student a philosophical perspective that – while coherent – wouldn't provide them with an easy toolkit to make sense of their data.  

The situation has changed in recent years, however. In particular, the BayesFactor R package by Morey and colleagues makes it extremely simple to do basic inferential tasks using Bayesian statistics. This is a huge contribution! Together with JASP, these tools make the Bayes Factor approach to hypothesis testing much more widely accessible. I'm really impressed by how well these tools work. 
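
To give a flavor: a Bayes factor analogue of an independent-samples t-test is a one-liner (a sketch with made-up data; the variable names here are mine, not from any particular study):

library(BayesFactor)

# made-up scores for two hypothetical groups
group1 <- rnorm(30, mean = 0.5)
group2 <- rnorm(30, mean = 0.0)

# Bayes factor comparing a difference-in-means model against the null
ttestBF(x = group1, y = group2)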

All that said, my general approach to statistical inference tends to rely less on inference about a particular hypothesis and more on parameter estimation – following the spirit of folks like Gelman & Hill (2007) and Cumming (2014). The basic idea is to fit a model whose parameters describe substantive hypotheses about the generating sources of the dataset, and then to interpret these parameters based on their magnitude and the precision of the estimate. (If this sounds vague, don't worry – the last section of the post is an example). The key tool for this kind of estimation is not tests like the t-test or the chi-squared. Instead, it's typically some variant of regression, usually mixed effects models. 

Mixed-Effects Models

Especially in psycholinguistics where our experiments typically show many people many different stimuli, mixed effects models have rapidly become the de facto standard for data analysis. These models (also known as hierarchical linear models) let you estimate sources of random variation ("random effects") in the data across various grouping factors. For example, in a reaction time experiment some participants will be faster or slower (and so all data from those particular individuals will tend to be faster or slower in a correlated way). Similarly, some stimulus items will be faster or slower and so all the data from these groupings will vary. The lme4 package in R was a game-changer for using these models (in a frequentist paradigm) in that it allowed researchers to estimate such models for a full dataset with just a single command. For the past 8-10 years, nearly every paper I've published has had a linear or generalized linear mixed effects model in it. 

Despite the simplicity of fitting them, the biggest problem with mixed effects models (especially from an educational point of view) has been figuring out how to write consistent specifications for the random effects. Often there are many factors that vary randomly (subjects, items, etc.) and other factors nested within those (e.g., each subject might respond differently to each condition). Thus, it is not trivial to figure out which model to fit, even if fitting it is just a matter of writing a command. Even in a reaction-time experiment with just subjects and items as random grouping factors and a single condition manipulation, you can write

(1) rt ~ condition + (1 | subject) + (1 | item)

for just random intercepts by subject and by item, or you can nest condition (fitting a random slope) for one or both:

(2) rt ~ condition + (condition | subject) + (condition | item)

and you can additionally fiddle with covariance between random effects for even more degrees of freedom!
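
In lme4, either specification is a single call. Here is a sketch of fitting the more maximal model (2), assuming a data frame d with columns rt, condition, subject, and item:

library(lme4)

# random intercepts and condition slopes by subject and by item
fit_max <- lmer(rt ~ condition + (condition | subject) + (condition | item),
                data = d)
summary(fit_max)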

Luckily, a number of years ago, a powerful and clear simulation paper by Barr et al. (2013) came out. They argued that there was a simple solution to the specification issue: use the "maximal" random effects structure supported by the design of the experiment. This meant adding any random slopes that were actually licensed by your design (e.g., if condition was a within-subject variable, you would fit condition-by-subject slopes). While this suggestion was quite controversial,* Barr et al.'s simulations were persuasive evidence that following it led to conservative inferences. In addition, having a simple guideline to follow eliminated a lot of the worry about analytic flexibility in random effects structure. If you were "keeping it maximal," you weren't intentionally – or even inadvertently – messing with your model specification to get a particular result. 

Unfortunately, a new problem reared its head in lme4: convergence. With very high frequency, when you specify the maximal model, the approximate inference algorithms that search for the maximum likelihood solution will simply not find a satisfactory one. This can happen even when you have quite a lot of data – in part because the number of parameters being fit is very large. In the maximal model (2) above, not counting covariance parameters, we are fitting a slope and an intercept across participants, plus a slope and an intercept for every participant and for every item.

To deal with this, people have developed various strategies. The first is to do some black magic to try to change the optimization parameters (e.g., following these helpful tips). If that fails, you start to prune random effects away until your model is "less maximal" and you get convergence. But these practices mean you're back in flexible-model-adjustment land, and vulnerable to all kinds of charges of post-hoc model tinkering to get the result you want. We've had to specify lab best practices about the order in which to prune random effects – essentially a guide to "tinkering until it works," which seems suboptimal. In sum, the models are great, but the methods for fitting them don't seem to work that well. 
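
To give a concrete sense of what the tinkering involves, here is a rough sketch (the control settings are one commonly suggested tweak, not a guaranteed fix, and d is again a placeholder data frame):

# first, try a different optimizer and a higher iteration limit
fit_max <- lmer(rt ~ condition + (condition | subject) + (condition | item),
                data = d,
                control = lmerControl(optimizer = "bobyqa",
                                      optCtrl = list(maxfun = 2e5)))

# if that still fails, prune toward "less maximal" structures, e.g.
#   rt ~ condition + (condition | subject) + (1 | item)
#   rt ~ condition + (1 | subject) + (1 | item)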

Enter Bayesian methods. For several years, it's been possible to fit Bayesian regression models using Stan, a powerful probabilistic programming language that interfaces with R. Stan, building on BUGS before it, has put Bayesian regression within reach for someone who knows how to write these models (and interpret the outputs). But in practice, when you could fit an lmer in one line of code and five seconds, it seemed like a bit of a trial to hew the model by hand out of solid Stan code (which looks a little like C: you have to declare your variable types, etc.). We have done it sometimes, but typically only for models that you couldn't fit with lme4 (e.g., an ordered logit model). So I still don't teach this set of methods, or advise that students use them by default. 

brms?!? A worked example

In the last couple of years, the package brms has been in development. brms is essentially a front-end to Stan, so you can write R formulas just as with lme4 but fit them with Bayesian inference.** This is a game-changer: all of a sudden we can use the same syntax but fit the model we actually want to fit! Sure, it takes 2-3 minutes instead of 5 seconds, but the output is clear and interpretable, and we don't have any of the specification issues described above. Let me demonstrate. 

The dataset I'm working on is an unpublished set of data on kids' pragmatic inference abilities. It's similar to many that I work with. We show children of varying ages a set of images and ask them to choose the one that matches some description, then record if they do so correctly. Typically some trials are control trials where all the child has to do is recognize that the image matches the word, while others are inference trials where they have to reason a little bit about the speaker's intentions to get the right answer. Here are the data from this particular experiment:


I'm interested in quantifying the relationship between participant age and the probability of success in pragmatic inference trials (vs. control trials, for example). My model specification is:

(3) correct ~ condition * age + (condition | subject) + (condition | stimulus)
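
In lme4 terms, this is a logistic mixed model – here's a sketch of the call (d stands in for the real data frame, with correct coded 0/1):

library(lme4)

fit_lme4 <- glmer(correct ~ condition * age +
                    (condition | subject) + (condition | stimulus),
                  family = binomial, data = d)
summary(fit_lme4)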

So I first fit this with lme4. Predictably, the full desired model doesn't converge, but here are the fixed effect coefficients: 

                 beta  stderr      z     p
intercept        0.50    0.19   2.65  0.01
condition        2.13    0.80   2.68  0.01
age              0.41    0.18   2.35  0.02
condition:age   -0.22    0.36  -0.61  0.54

Now let's prune the random effects until the convergence warning goes away. In the simplified version of the dataset that I'm using here, I can keep stimulus and subject intercepts and still get convergence when there are no random slopes. But in the larger dataset, the model won't converge unless I keep just the random intercept by subject:

                 beta  stderr      z     p
intercept        0.50    0.21   2.37  0.02
condition        1.76    0.33   5.35  0.00
age              0.41    0.18   2.34  0.02
condition:age   -0.25    0.33  -0.77  0.44

Coefficient values are decently different (though, to be fair, the p-values don't change dramatically in this example). More importantly, a number of fairly trivial choices matter for whether the model converges. For example, I can keep one random slope if I make the other level of the condition variable the reference (intercept) level, but in the current parameterization the model doesn't converge with either slope. And in the full dataset, the model wouldn't converge at all unless I centered age. On top of that, I haven't even tweaked the optimizer or the convergence settings for any of these variants. All of this means that there are a lot of decisions about these models that I don't have a principled way to make – and critically, they have to be made conditioned on the data, because I can't tell a priori whether a model will converge!
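
For concreteness, those ad hoc adjustments look roughly like the following sketch (with hypothetical variable and level names):

# center age so that 0 is meaningful (and to help convergence)
d$age_c <- d$age - mean(d$age, na.rm = TRUE)

# flip which level of condition serves as the reference category
d$condition <- relevel(factor(d$condition), ref = "control")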

So now I switched to the Bayesian version using brms, just writing brm() with the model specification I wanted (3). I had to make a few tweaks: upping the number of iterations (suggested by the warning messages in the output) and changing to a Bernoulli model rather than binomial (for efficiency, again suggested by the messages), but otherwise this was very straightforward. For simplicity I've adopted all the default prior choices, but I could have gone more informative.
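
Concretely, the call looks roughly like this (a sketch: the data frame d, the variable names, and the exact iteration count are placeholders, and the priors are the brms defaults):

library(brms)

fit_brm <- brm(correct ~ condition * age +
                 (condition | subject) + (condition | stimulus),
               family = bernoulli(),   # per-trial Bernoulli rather than aggregated binomial
               data = d,
               iter = 4000)            # upped from the default after a warning
summary(fit_brm)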

Here's the summary output for the fixed effects:

                 estimate  error  l-95% CI  u-95% CI
intercept            0.54   0.48     -0.50      1.69
condition            2.78   1.43      0.21      6.19
age                  0.45   0.20      0.08      0.85
condition:age       -0.14   0.45     -0.98      0.84

From this call, we get back coefficient estimates that are broadly similar to the other models, along with 95% credible interval bounds. Notably, the condition effect is larger (probably reflecting the model's ability to estimate a more extreme value on the logit scale from sparse data), and the interaction term is smaller but has higher error. Overall, the coefficients look more like the first, non-convergent maximal model than the second, converging one. 
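
As an aside, you can also pull these numbers out programmatically rather than reading them off summary() – a sketch using brms accessors I believe are available:

fixef(fit_brm)               # posterior means, errors, and 95% credible intervals
posterior_samples(fit_brm)   # full posterior draws, for custom summaries or plots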

The big deal about this model is not that what comes out the other end of the procedure is radically different. It's that it's not different. I got to fit the model I wanted, with a maximal random effects structure, and the process was almost trivially easy. In addition, and as a bonus, the CIs that get spit out are actually credible intervals that we can reason about in a sensible way (as opposed to frequentist confidence intervals, which are quite confusing if you think about them deeply enough). 

Conclusion

Bayesian inference is a powerful and natural way of fitting statistical models to data. The trouble is that, up until recently, you could easily find yourself in a situation where there was a dead-obvious frequentist solution but off-the-shelf Bayesian tools wouldn't work or would generate substantial complexity. That's no longer the case. The existence of tools like BayesFactor and brms means that I'm going to suggest that people in my lab go Bayesian by default in their data analytic practice. 

----
Thanks to Roger Levy for pointing out that model (3) above could include an age | stimulus slope to be truly maximal. I will follow this advice in the paper. 

* Who would have thought that a paper about statistical models would be called "the cave of shadows"?
** Rstanarm did this also, but it covered fewer model specifications and so wasn't as helpful. 

23 comments:

  1. There are many issues of statistical modeling in this post that I would debate but I would mention first that if lme4 users are concerned about convergence they should try the MixedModels package for Julia. Combined with the RData or RCall packages, it is possible to take data from R and fit the model in Julia fairly easily.

    1. Not only that, if the *only* reason that you are switching to brms or rstanarm is that you want to fit a maximal model no matter what, Julia will fit the models much faster. Dave Kleinschmidt can probably help psychologists get started with that, I think that he's active on the Julia MixedModels github repo.

    2. I hope the post above suggests that I am certainly not *only* switching because I want to fit a maximal model. Rather, I am switching because I stuck with frequentist LMEMs because I was thinking "well at least they're easy to fit." Now that advantage is gone, I'm very happy to go Bayesian on philosophical grounds!

      Yes, sorry, I was actually thinking of Barr's comment on twitter when I wrote this, which suggests that's the only reason:

      https://twitter.com/dalejbarr/status/969573499349159941

      But it's twitter, I can't hold it against Dale if he didn't mean that either.

  2. One minor point of terminology: a mixed-effects model with random effects for subject and for item is not a hierarchical linear model. "Hierarchical" means that the grouping factors for the random effects are nested. That is, they form a hierarchy.

    One of the basic design objectives of lme4 was to be able to fit models with crossed (each subject is exposed to each item) or partially crossed (each student is taught over time by one or more different teachers) random effects.

    1. Thanks very much for the comments Doug.

      Regarding fitting models in Julia, I will try this out - that seems like a very interesting option. As I mentioned in the post, I'm mostly concerned here with what to teach as a default for students doing very straightforward experiments. I'll have to play around but I worry that moving data into an entirely new language might not be an ideal solution (knowing that there are other benefits to Julia that caused you and others to switch).

      Regarding HLM vs. LMEM - yes, of course, you are right. I was trying to simplify and of course went too far in this direction. In part because lme4 has - thanks to your work - made these distinctions unnecessary for most users, I don't tend to emphasize the differences between nested and crossed effects in the way I would if choices on this front required users to switch software packages!

    2. I was checking in the documentation for the MixedModels package

      http://dmbates.github.io/MixedModels.jl/latest/

      if I had an example of fitting a model in both lme4 and MixedModels. I don't at present.

      If you or another researcher could suggest a data set and model to use for demonstration I would be happy to write up a description of how to fit such models in Julia.

  3. The point that is lost in discussions of maximal models is the relationship between interactions and main effects. It is often difficult to make sense of a significance test for a main effect in the presence of a non-negligible interaction. In a model of the form

    rt ~ 1 + c + (1|subj) + (1|item)

    testing whether c is significant is a test of the condition taking into account variation between subjects and between items.

    On the other hand fitting the model

    rt ~ 1 + c + (1 + c|subj) + (1 + c|item)

    assumes that subjects and items have an effect on the overall level of the response and on the change with condition. What does the main effect for c mean in such a case? You are testing "given that we know that c has an effect that varies from subject to subject and that varies from item to item, does c have an effect"? The answer is obviously "yes". You have assumed that it has an effect.

    This is why statisticians tend to structure tests from higher order to lower order. If you conclude that you have a non-negligible interaction then you don't test for the main effect.

    1. As a psychologist, the thing I care about in many of my studies is estimating the effect of c that is general across a population of participants. So assuming an effect of c by subject, drawn from a normal distribution, seems to me to be quite reasonable. Then I want the person-independent effect of c in the fixed effect. In part the reasonableness of the random slope is because I don't typically have the precision to measure each subject's c effect very well, and so will pick up some apparent variation through measurement error even if the true effect of c is zero. But in general for any psychological manipulation, it's pretty safe to assume some variation...

    2. Doug, when psychologists say "c has an effect" they mean that the p-value is less than 0.05, or the absolute t is greater than 2. I think that by "c has an effect" you mean "including it as a fixed effect in the model"?

    3. I can't tell if you're including me in "when psychologists say." I care *what the effect is* - which to me means, what is my best estimate of the magnitude. I'm not really that interested in |t|>2 or whatever.

      I also was convinced years ago by Andrew Gelman's arguments regarding the importance of including theoretically-meaningful but non-significant predictors in models. If you know the true generative structure of the data, you should model it even if you don't necessarily have the power to make strong decision-theoretic inferences about the importance of every single predictor in model fit.

    4. To be honest, I wasn't sure whether you are among the group of psychologists I am referring to. I'm relieved to hear you are not! But, if you do care what the magnitude of the effect is and not just that it is significant, then if power is low, it simply doesn't matter what model one fits, maximal, minimal, whatever. This is because any significant effect under a low power study is *guaranteed* to be an overestimate (I mean 100% probability of being 2-7 times larger than the true effect). Why would anyone care to publish that in a top journal as big news? Yet, that is exactly what Cognition and JML routinely publish. I elaborate on this point about the great need to focus on whether the effect is accurately estimated, and the importance of paying attention to the imprecision of the estimate, in this paper: https://psyarxiv.com/hbqcw. I did fit maximal models throughout :)

      Given that most published studies in psycholx (I don't know about child language acquisition) are heavily underpowered (see Appendix B of http://www.ling.uni-potsdam.de/~vasishth/pdfs/JaegerEngelmannVasishthJML2017.pdf for interference studies), I don't know why anyone even cares about whether the model is maximal or not. Whatever we are finding out from the data is misleading us either by giving us an exaggerated effect size, or the wrong sign, or both? I wish the authors of the Keep it Maximal paper had sent a high impact message like Keep it Powerful. Right now, people are continuing to run underpowered studies and publishing them in Cognition and JML, and worrying about whether their model is maximal or not when lack of power is the real problem. Meehl and Cohen came and went, and had no impact on psychology.

      I know you probably know all this, I am just engaging in a start-of-weekend rant.

    5. Rant away, Shravan! This is my cause too.

      My lab has taken a number of steps to try and work on these issues. First, we preregister the key analyses for essentially every study we do. This is critical for frequentist stats, but I believe it's important for avoiding this inflation issue regardless of estimation method. Second, we have been actively working through projects like metalab (http://metalab.stanford.edu) and manybabies (http://manybabies.stanford.edu) to compare the state of the literature to unbiased, large-scale estimates of key effects.

    6. yeah, that metalab thing is just simply amazing. why isn't this the norm now in all fields? i want to do something similar with interference studies, but need to find some time to create such an online tool.

    7. Thanks - let me know if you would like to collaborate on this. In principle, it would not be too difficult to import and visualize data of a different type. The difficulty is really in getting the meta-analytic data! Each of the MAs for development is essentially a full, labor-intensive paper being written by one of the collaborators...

    8. That would be great (collaboration)! We have two completed meta-analyses, and we are working on a third one. Why not visit us in Potsdam sometime this year, and we can talk and plan this out? If this becomes more widespread in psycholx, people will adopt the standards you folks have developed.

  4. I would like to add two different thoughts to this discussion:

    1. If the goal is to fit a model that is essentially equivalent to a model that could be estimated with lme4::lmer or lme4::glmer, there is IMHO not much to be gained by using brms over rstanarm. Rather, rstanarm seems preferable. For those cases that simply require Bayesian estimation (i.e., regularization via prior distributions on top of the regularization provided by the hierarchical-structure), rstanarm::stan_lmer or rstanarm::stan_glmer implements exactly the same likelihood (minus the prior) as their lme4 equivalents. And in comparison to brms, rstanarm is faster because of two reasons. First, rstanarm does not require compilation and second, the likelihood is implemented in a somewhat more optimal way (I am sure about the first of these reasons and kind of sure about the second one).

    2. One problem that is not yet fully solved in a Bayesian framework is hypothesis testing (there are good arguments for Bayes factors, but these are not easily available with appropriate priors for such models). In cases in which categorical covariates (i.e., factors) have only two levels, as in the current example, this is not a big problem. One can simply inspect the posterior to learn about the difference between the factor levels. Of course, appropriate orthogonal contrasts (e.g., contr.sum) need to be used if the categorical covariates are in an interaction (in the same way that 0 should be meaningful if age is a continuous covariate).
    However, when a factor has more than two levels simply inspecting the posterior often becomes difficult. Especially if a variable is in an interaction. The reason for this is that categorical covariates with k levels are transformed into k-1 parameters. For example, a factor with three levels will be transformed into two parameters. If the intercept is the grand mean then each parameter contains information pertaining to more than one factor level. So simply inspecting the posterior of those parameter values does not provide a lot of easy to interpret information.

    1. Hi Henrik,

      Good points regarding rstanarm - I will go back and do some comparisons!

      Agreed about hypothesis testing. There is Wagenmakers' method for doing Bayes factors "without tears." I haven't experimented with this method. Do you have a view? Personally, the research that I do is more about measurement and less about hypothesis testing, so I often care about measuring the effect with some precision and not about testing whether it is "there."

    2. The bridgesampling R package developed by Quentin Gronau, myself, and E.J. Wagenmakers (https://cran.r-project.org/package=bridgesampling) indeed allows one to calculate Bayes factors for all models fitted in Stan 'without tears'. It also works directly with models fitted with brms and rstanarm. However, IMHO these packages currently do not allow one to specify priors that are appropriate for Bayes factor based model selection. So whereas I agree that we have solved the technical problem of how to get Bayes factors, the conceptual problem of specifying adequate priors still remains. But I hope there will be progress on this front this year.

      Furthermore, even when only interested in estimation or measurement, there exists a somewhat subtle issue related to categorical covariates with more than two factor levels in a Bayesian setting. For most coding schemes, the priors do not have the same effect on all factor levels. For example, for contr.sum the prior is more diffuse for the last compared to all other levels. For contr.helmert the prior becomes gradually more diffuse for each subsequent factor level with the exception of the first two. Thus, to make sure that the results are not biased and do not depend on which factor level is 'first' or 'last', a different type of coding scheme must be used. One such orthonormal coding scheme was developed by Rouder, Morey, Speckman, and Province (2012, JMP) and, AFAIK, is the one used internally by the BayesFactor package.

    3. Henrik, sorry I didn't realize your contributions on bridgesampling (writing fast)! This is very helpful detail on the technical issues as well. I will have to read more on this, thank you!

    4. This comment has been removed by the author.

    5. When you remove compilation time, brms will be faster than rstanarm on almost any multilevel model, because the Stan code can be hand-tailored to the input of the user. For any non-trivial multilevel model, estimation will take a few minutes, and at that time scale brms will usually already be faster even when compilation time is included. Why do people continue to think the likelihood is implemented in a more optimal way in rstanarm?
