*(tl;dr: wonky post on statistical modeling)*

I fit linear mixed effects models (LMMs) for most of the experimental data I collect. My data are typically repeated observations nested within subjects, and often have crossed effects of items as well; this means I need to account for this nesting and crossing structure when estimating the effects of various experimental manipulations. For the last ten years or so, I've been fitting these models in lme4 in R, a popular package that allows quick specification of complex models.

One question that comes up frequently regarding these models is what random effect structure to include. I typically follow the advice of Barr et al. (2013), who recommend "maximal" models – models that include random slopes for every fixed effect that has repeated observations within a given random grouping factor. So for example, if you have observations for both conditions for each subject, fit random condition effects by subject. This approach contrasts, however, with the "parsimonious" approach of Bates et al.,* who argue that such models can be over-parameterized relative to variability in the data. The issue of choosing an approach is further complicated by the fact that, in practice, lme4 can almost never fit a completely maximal model and instead returns convergence warnings. So then you have to make a bunch of (perhaps ad-hoc) decisions about what to prune or how to tweak the optimizer.

Last year, responding to this discussion, I posted a blogpost that became surprisingly popular, arguing for the adoption of Bayesian mixed effects models. My rationale was not mainly that Bayesian models are interpretively superior – which they are, IMO – but just that they allow us to fit the random effect structure that we want without doing all that pruning business. Since then, we've published a few papers (e.g. this one) using Bayesian LMMs (mostly without anyone even noticing or commenting).**

In the meantime, I was working on the ManyBabies project. We finally completed data collection on our first study, a 60+ lab consortium study of babies' preference for infant-directed speech! This is exciting and big news, and I will post more about it shortly. But in the course of data analysis, we had to grapple with this same set of LMM issues. In our pre-registration (which, for what it's worth, was written before I really had tried the Bayesian methods), we said we would try to fit a maximal LMM with the following structure. It doesn't really matter what all the predictors are, but trial_type is the key experimental manipulation:

**M1)** log_lt ~ trial_type * method +
trial_type * trial_num +
age_mo * trial_num +
trial_type * age_mo * nae +
(trial_type * trial_num | subid) +
(trial_type * age_mo | lab) +
(method * age_mo * nae | item)

Of course, we knew this model would probably not converge. So we preregistered a pruning procedure, which we followed during data analysis, leaving us with:

**M2)** log_lt ~ trial_type * method +
trial_type * trial_num +
age_mo * trial_num +
trial_type * age_mo * nae +
(trial_type | subid) +
(trial_type | lab) +
(1 | item)

We fit that model and report it in the (under review) paper, and we interpret the *p*-values as real *p*-values (well, as real as *p*-values can be anyway), because we are doing exactly the confirmatory thing we said we'd do. But in the back of my mind, I was wondering if we shouldn't have fit the whole thing with Bayesian inference and gotten the random effect structure that we hoped for.***

So I did that. Using the amazing brms package, all you need to do is replace "lmer" with "brm" (to get a default prior model with default inference).**** Fitting the full LMM on my MacBook Pro takes about 4 hrs/chain with completely default parameters, so 16 hrs total for four chains run one after another – though if you run them in parallel you can fit all four at once. I fit M1 (the maximal model, called "bayes"), M2 (the pruned model, "bayes_pruned"), and for comparison the frequentist model in lme4 (also pruned, i.e. M2, called "freq"). Then I plotted coefficients and CIs against one another for comparison. There are three plots, corresponding to the three pairwise comparisons (brms M1 vs. lme4 M2, brms M1 vs. brms M2, and brms M2 vs. lme4 M2). (So as not to muddy the interpretive waters for ManyBabies, I'm just showing the coefficients without labels here.) Here are the results.
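Before the plots, here is a minimal sketch of what the lmer-to-brm swap looks like, with M1 and M2 written as R formula objects. This is not the actual analysis code (that is on github): the data frame `d` and the wrapper function names are placeholders, and `cores = 4` is just one way to run the four default chains in parallel.

```r
# Sketch only. `d` is a hypothetical trial-level data frame with the columns
# named in M1/M2; the wrapper names are made up for this illustration.

# M1: the preregistered maximal model.
m1_formula <- log_lt ~ trial_type * method +
  trial_type * trial_num +
  age_mo * trial_num +
  trial_type * age_mo * nae +
  (trial_type * trial_num | subid) +
  (trial_type * age_mo | lab) +
  (method * age_mo * nae | item)

# M2: the pruned model (same fixed effects, simpler random effects).
m2_formula <- log_lt ~ trial_type * method +
  trial_type * trial_num +
  age_mo * trial_num +
  trial_type * age_mo * nae +
  (trial_type | subid) +
  (trial_type | lab) +
  (1 | item)

# "freq": pruned model fit with lme4.
fit_freq <- function(d) {
  lme4::lmer(m2_formula, data = d)
}

# "bayes_pruned": same formula, lmer swapped for brm, default priors.
fit_bayes_pruned <- function(d) {
  brms::brm(m2_formula, data = d, chains = 4, cores = 4)
}

# "bayes": the maximal model, which is the slow one.
fit_bayes <- function(d) {
  brms::brm(m1_formula, data = d, chains = 4, cores = 4)
}
```

Nothing is fit until one of the wrappers is called with real data, so the block can be sourced without lme4 or brms installed.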

*and* the frequentist M2. In other words, the fitting method didn't matter with this big dataset – but the random effects structure did! Further, if you dig into the confidence intervals, they are again similar between fitting methods but different between random effects structures. Here's a pairs plot of the correlation between upper CI limits (note that .00 here means a correlation of 1.00!):

Not huge differences, but they track with random effect structure again, not with the fitting method.
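For anyone who wants to run this kind of comparison on their own fits, here is a rough sketch of how the coefficient and CI correlations could be computed. The helper names are hypothetical, and it assumes the two fits use the same term names apart from the intercept (lme4 labels it "(Intercept)", brms labels it "Intercept"). Wald intervals are used for the lme4 fit purely to keep the sketch fast.

```r
# Sketch only: helper names are made up for this illustration.

# Fixed effects and 95% Wald intervals from an lme4 fit.
coef_table_lmer <- function(fit) {
  est <- lme4::fixef(fit)
  ci <- confint(fit, parm = "beta_", method = "Wald")
  data.frame(
    term = sub("^\\(Intercept\\)$", "Intercept", names(est)),
    estimate = unname(est),
    lower = unname(ci[, 1]),
    upper = unname(ci[, 2]),
    row.names = NULL
  )
}

# Posterior means and 95% credible intervals from a brms fit.
coef_table_brms <- function(fit) {
  fe <- brms::fixef(fit)  # columns: Estimate, Est.Error, Q2.5, Q97.5
  data.frame(
    term = rownames(fe),
    estimate = unname(fe[, "Estimate"]),
    lower = unname(fe[, "Q2.5"]),
    upper = unname(fe[, "Q97.5"]),
    row.names = NULL
  )
}

# Correlate estimates and interval limits across two coefficient tables,
# matching rows by term name.
compare_coefs <- function(a, b) {
  m <- merge(a, b, by = "term", suffixes = c("_a", "_b"))
  c(
    estimate = cor(m$estimate_a, m$estimate_b),
    lower = cor(m$lower_a, m$lower_b),
    upper = cor(m$upper_a, m$upper_b)
  )
}
```

Usage would be something like `compare_coefs(coef_table_brms(bayes), coef_table_lmer(freq))`, repeated for each of the three model pairs.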

In sum, in one important practical case, we see that fitting the maximal model structure (rather than the maximal *convergent* model structure) seems to make a difference to model fit and interpretation. This evidence to me supports the Bayesian approach that I recommended in my prior post. I don't know that M1 is the *best* model – I'm trusting the "keep it maximal" recommendation on that point. But to the extent that I should be able to fit all the models I want to try, then using brms (even if it's slower) seems important. So I'm going to keep using this fitting procedure in the immediate future.

----

* This approach seems very promising, but also a bit tricky to implement. I have to admit, I am a bit lazy and it is really helpful when software provides a solution for fitting that I can share with people in my lab as standard practice. A collaborator and I tried someone else's implementation of parsimonious models and it completely failed, and then we gave up. If someone wants to try it on this dataset I'd be happy to share!

** An aside: after I posted, Doug Bates kindly engaged and encouraged me to adopt Julia, rather than R, for model fitting, if it was fitting that I wanted and not Bayesian inference. We did experiment a bit with this, and Mika Braginsky wrote the jglmm package to use Julia for fitting. This experiment resulted in her in-press paper using Julia for model fits, but also with us recognizing that 1) Julia is TONS faster than R for big mixed models, which is a win, but 2) Julia can't fit some of the baroque random effects structures that we occasionally use, and 3) installing Julia and getting everything working is very non-trivial, meaning that it's hard to recommend for folks just getting started.

*** Jake Westfall, back in 2016 when we were planning the study, said we should do this, and I basically told him that I thought that developmental psychologists wouldn't agree to it. But I think he was probably right.

**** Code for this post is on github.

You asked me to not hold back so here goes.

There are several important things here that have been glossed over and have the potential to lead to further misuse of statistical methods.

Briefly:

- There is little to be gained by fitting a model to make a discovery claim unless you have some idea about the power and Type I, M, S error properties of your design. One must begin with power and Type I error and establish that one can in principle get accurate estimates. It has taken me forever to understand this point. If power is likely to be low, any effects you find (as in "significant" effects or Bayes factors or whatever) are guaranteed to be overestimates that will not in general replicate (Gelman and Carlin, 2014, and a recent JML paper I wrote with Gelman demonstrating the point). I feel that it's extremely damaging to make claims like "fit a maximal model" or "don't fit a maximal model" without making any qualifying statements about the capabilities of your design to---in principle---make discovery claims.

- This post is really disturbing to me for another reason: it implicitly encourages the standard McDonalds way of statistical thinking we practice, that you can drop into the stats shop and quickly leave with a complete analysis (of course it takes hours to fit a Bayesian model, but one can do it overnight while sleeping, so it still feels like a fast-food event to me). I know for sure that you personally would never work in this way (although you do say that you tend to be lazy---I know that you at least know the consequences of that, but IMO it sends the wrong message). Any newcomer reading this post is going to take away the message that one can just load a data-set and run a maximal model in brms and fertig. We should discourage this kind of magical thinking. (I have done this too, as I said I was slow to understand this stuff.) Betancourt has written extensively about this, and we tried to translate his ideas to Cognitive Science to demonstrate the point: https://arxiv.org/abs/1904.12765. The sheer pointlessness of the analyses I have done in the past myself and that I see in many papers is just depressing. We should actively discourage the idea that a quick load-and-fire approach to data fitting can get us anywhere. This is how I was taught to do analyses 17 years ago, and I am still pissed off about that.

- In the brms model, you will pay a price down the road (eventually) for using the default priors they provide. Even the author of brms warns against this. One should insist on explicitly stating the priors as that is part of the model---leaving them implicit like you did is asking for trouble. E.g., doing a Bayes factor analysis using default priors is in general the road to insanity. Also, I always do a sensitivity analysis; even if it is not in the final paper, it's in the supplementary materials. People say that priors don't matter when you have enough data. True, but for all the abstract variance components we use, priors become more and more important. I have never been in a situation where I could say I have enough power to recover all the variance component estimates accurately, and I don't even work with baby data, I can collect as much data as I like because I have the resources for it.
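As a sketch of what stating priors explicitly, and then checking sensitivity to them, could look like in brms: the distributions and scales below are placeholders chosen for illustration, not recommendations and not the priors of any model discussed here. The helper names and the data frame `d` are likewise hypothetical.

```r
# Sketch only: all prior choices here are illustrative placeholders.

# Describe the priors as plain strings so they can be reported alongside
# the model (e.g. in supplementary materials).
prior_spec <- function(b_scale = 1, sd_scale = 1, lkj_eta = 2) {
  list(
    b = sprintf("normal(0, %g)", b_scale),    # fixed-effect slopes
    sd = sprintf("normal(0, %g)", sd_scale),  # random-effect SDs
    cor = sprintf("lkj(%g)", lkj_eta)         # random-effect correlations
  )
}

# Turn a spec into brms prior objects.
as_brms_prior <- function(spec) {
  c(
    brms::set_prior(spec$b, class = "b"),
    brms::set_prior(spec$sd, class = "sd"),
    brms::set_prior(spec$cor, class = "cor")
  )
}

# Inspect the defaults brms would otherwise use for this model.
show_default_priors <- function(formula, d) {
  brms::get_prior(formula, data = d)
}

# Refit the same model under several slope-prior scales; comparing the
# resulting estimates is one simple form of sensitivity analysis.
fit_sensitivity <- function(formula, d, b_scales = c(0.5, 1, 2)) {
  lapply(b_scales, function(s) {
    brms::brm(formula, data = d,
              prior = as_brms_prior(prior_spec(b_scale = s)),
              chains = 4, cores = 4)
  })
}
```

Each refit costs as much as the original fit, so for a model that takes hours per chain this kind of check is an overnight job per prior setting.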

So, the reason I don't like this kind of argument (use brms because it allows you to fit maximal linear mixed models) is that it doesn't come with any qualifications and caveats. It encourages business as usual. No linguist doing a syntactic analysis would use automated software to come up with a syntactic derivation, but that is exactly what we are implicitly teaching students to do, except it's in statistical analysis.

I would say that asking whether to fit a maximal model or not is asking the wrong question. If you can manage to run a high-powered study, it really is not going to matter. If you are not running a high powered study, your problems lie elsewhere, not in the maximality of the model.
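The power point can be illustrated with a small base-R simulation in the spirit of Gelman and Carlin (2014). This is a sketch under simplifying assumptions (a normally distributed estimate with known standard error), and the effect sizes and standard errors are made-up numbers, not values from any study mentioned here.

```r
# Sketch only: simulate repeated studies of one effect and look at what the
# "significant" ones report. Inputs are hypothetical.
type_m_sim <- function(true_effect, se, alpha = 0.05, n_sims = 1e5, seed = 1) {
  set.seed(seed)
  z_crit <- qnorm(1 - alpha / 2)
  # One estimate per simulated study.
  estimate <- rnorm(n_sims, mean = true_effect, sd = se)
  significant <- abs(estimate) > z_crit * se
  list(
    # Share of studies reaching significance.
    power = mean(significant),
    # Type S: share of significant estimates with the wrong sign.
    type_s = mean(sign(estimate[significant]) != sign(true_effect)),
    # Type M: average overestimation factor among significant estimates.
    exaggeration = mean(abs(estimate[significant])) / abs(true_effect)
  )
}

# Low-powered design: true effect is half a standard error.
low <- type_m_sim(true_effect = 0.05, se = 0.10)

# High-powered design: true effect is five standard errors.
high <- type_m_sim(true_effect = 0.50, se = 0.10)

print(low)
print(high)
```

In the low-powered case a significant estimate must exceed 1.96 × 0.10 = 0.196 in absolute value, so it overstates the true effect of 0.05 by a factor of nearly four at minimum. In the high-powered case almost every study is significant and the exaggeration ratio sits close to 1.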

There. My rant is done.

Hi Shravan, thanks for these points - I generally agree with all three of them and appreciate the caution and the warning. Quick responses to each:

1. Yes, no question that if you have a deeply underpowered study, you are in hot water. That said, I think I might slightly disagree about the nature of the random effects issue. I typically try to run developmental studies that are *adequately* powered for my effects of interest, but are not over-powered. In that scenario the model specification does often matter quite a lot.

2. Regarding McDonalds analyses. Yes, absolutely - people should be thoughtful about their analysis! But - not all fast food is created equal. Chipotle might be healthier than McDonalds even though both are fast. I think we want to create *good defaults* and then be thoughtful in our use of them and our deviations. Having bad defaults doesn't encourage people to be thoughtful.

3. Priors, yes. In principle we should know more about our default priors, especially for internal parameters. In practice though I do want to have good defaults (I should understand why they are good of course).

So in sum, I think if you have a truly high powered study, maybe it doesn't matter what the specification is or what the priors are. But that's not the world we live in. So we do need to choose good priors AND good (often maximal) specifications for the inferences we want to make.

Finally, I think the issue of *removing analytic flexibility* here is an important one. If we are going to make binary inferences from models, e.g. X has non-zero weight in the model, then we need to have good default workflows. Otherwise we have a lot of flexibility to try and justify our way into biases in the model that move X further from zero!

Hi Mike, in case you haven't seen this, here's a paper and a poster making a complementary point to yours (from 2017, but I only discovered them now!)

https://arxiv.org/ftp/arxiv/papers/1701/1701.04858.pdf

http://publish.illinois.edu/quantitativelinguistics/files/2017/01/LSA2017.Mixed-Models-are-Sometimes-Terrible-final.pdf

One point I find compelling from personal experience is that more data isn't necessarily going to solve convergence and model selection problems in lme4. I have had convergence failures for all but random intercept models with large datasets (thousands of participants). But what really finally pushed me over the edge are repeated instances of wild swings in estimates and p-values for fixed effects given models with different random effect structures *in cases where model comparisons showed no significant differences in fit*. At that point, it's either figure out brms or succumb to Lovecraftian madness, no matter whether you eat fast food or enjoy fine statistical dining like Shravan.