Tuesday, July 23, 2019

An ethical duty for open science?

Let's do a thought experiment. Imagine that you are the editor of a top-flight scientific journal. You are approached by a famous researcher who has developed a novel molecule that is a cure for a common disease, at least in a particular model organism. She would like to publish in your journal. Here's the catch: her proposed paper describes the molecule and asserts its curative properties. You are a specialist in this field, and she will personally show you any evidence that you need to convince you that she is correct – including allowing you to administer this molecule to an animal under your control and allowing you to verify that the molecule is indeed the one that she claims it is. But she will not put any of these details in the paper, which will contain only the factual assertion.

Here's the question: should you publish the paper?

Monday, May 6, 2019

It's the random effects, stupid!

(tl;dr: wonky post on statistical modeling)

I fit linear mixed effects models (LMMs) for most of the experimental data I collect. My data are typically repeated observations nested within subjects, and often have crossed effects of items as well; this means I need to account for this nesting and crossing structure when estimating the effects of various experimental manipulations. For the last ten years or so, I've been fitting these models in lme4 in R, a popular package that allows quick specification of complex models.

One question that comes up frequently regarding these models is what random effect structure to include? I typically follow the advice of Barr et al. (2013), who recommend "maximal" models – models that nest all the fixed effects within a random factor that have repeated observations for that random grouping factor. So for example, if you have observations for both conditions for each subject, fit random condition effects by subject. This approach contrasts, however, with the "parsimonious" approach of Bates et al.,* who argue that such models can be over-parameterized relative to variability in the data. The issue of choosing an approach is further complicated by the fact that, in practice, lme4 can almost never fit a completely maximal model and instead returns convergence warnings. So then you have to make a bunch of (perhaps ad-hoc) decisions about what to prune or how to tweak the optimizer.

Last year, responding to this discussion, I posted a blogpost that became surprisingly popular, arguing for the adoption of Bayesian mixed effects models. My rationale was not mainly that Bayesian models are interpretively superior – which they are, IMO – but just that they allow us to fit the random effect structure that we want without doing all that pruning business. Since then, we've published a few papers (e.g. this one) using Bayesian LMMs (mostly without anyone even noticing or commenting).**

In the mean time, I was working on the ManyBabies project. We finally completed data collection on our first study, a 60+ lab consortium study of babies' preference for infant-directed speech! This is exciting and big news, and I will post more about it shortly. But in the course of data analysis, we had to grapple with this same set of LMM issues. In our pre-registration (which, for what it's worth, was written before I really had tried the Bayesian  methods), we said we would try to fit a maximal LMM with the following structure. It doesn't really matter what all the predictors are, but trial_type is the key experimental manipulation:

M1) log_lt ~ trial_type * method +
             trial_type * trial_num +
             age_mo * trial_num +
             trial_type * age_mo * nae +
             (trial_type * trial_num | subid) +
             (trial_type * age_mo | lab) + 
             (method * age_mo * nae | item)

Of course, we knew this model would probably not converge. So we preregistered a pruning procedure, which we followed during data analysis, leaving us with:

M2) log_lt ~ trial_type * method +
             trial_type * trial_num +
             age_mo * trial_num +
             trial_type * age_mo * nae +
             (trial_type | subid) +
             (trial_type | lab) + 
             (1 | item)

We fit that model and report it in the (under review) paper, and we interpret the p-values as real p-values (well, as real as p-values can be anyway), because we are doing exactly the confirmatory thing we said we'd do. But in the back of my mind, I was wondering if we shouldn't have fit the whole thing with Bayesian inference and gotten the random effect structure that we hoped for.***

So I did that. Using the amazing brms package, all you need to do is replace "lmer" with "brm" (to get a default prior model with default inference).**** Fitting the full LMM on my MacBook Pro takes about 4hrs/chain with completely default parameters, so 16 hrs total – though if you do it in parallel you can fit all four at once. I fit M1 (the maximal model, called "bayes"), M2 (the pruned model, "bayes_pruned"), and for comparison the frequentist (also pruned, called "freq") model. Then I plotted coefficients and CIs against one another for comparison. There are three plots, corresponding to the three pairwise comparisons (brms M1 vs. lme4 M2, brms M1 vs. brms M2, and brms M2 vs. lme4 M2). (So as not to muddy the interpretive waters for ManyBabies, I'm just showing the coefficients without labels here). Here are the results.

As you can see, to a first approximation, there are not huge differences in coefficient magnitudes, which is good. But, inspecting the top row of plots, you can see that the full Bayesian M1 does have two coefficients that are different from both the Bayesian M2 and the frequentist M2. In other words, the fitting method didn't matter with this big dataset – but the random effects structure did! Further, if you dig into the confidence intervals, they are again similar between fitting methods but different between random effects structures.  Here's a pairs plot of the correlation between upper CI limits (note that .00 here means a correlation of 1.00!):

Not huge differences, but they track with random effect structure again, not with the fitting method.

In sum, in one important practical case, we see that fitting the maximal model structure (rather than the maximal convergent model structure) seems to make a difference to model fit and interpretation. This evidence to me supports the Bayesian approach that I recommended in my prior post. I don't know that M1 is the best model – I'm trusting the "keep it maximal" recommendation on that point. But to the extent that I should be able to fit all the models I want to try, then using brms (even if it's slower) seems important. So I'm going to keep using this fitting procedure in the immediate future.

* This approach seems very promising, but also a bit tricky to implement. I have to admit, I am a bit lazy and it is really helpful when software provides a solution for fitting that I can share with people in my lab as standard practice. A collaborator and I tried someone else's implementation of parsimonious models and it completely failed, and then we gave up. If someone wants to try it on this dataset I'd be happy to share!

* An aside: after I posted, Doug Bates kindly engaged and encouraged me to adopt Julia, rather than R, for model fitting, if it was fitting that I wanted and not Bayesian inference. We did experiment a bit with this, and Mika Braginsky wrote the jglmm package to use Julia for fitting. This experiment resulted in her in-press paper using Julia for model fits, but also with us recognizing that 1) Julia is TONS faster than R for big mixed models, which is a win, but 2) Julia can't fit some of the baroque random effects structures that we occasionally use, and 3) installing Julia and getting everything working is very non-trivial, meaning that it's hard to recommend for folks just getting started.

** Jake Westfall, back in 2016 when we were planning the study, said we should do this, and I basically told him that I thought that developmental psychologists wouldn't agree to it. But I think he was probably right.

*** Code for this post is on github.

Monday, April 8, 2019

A (mostly) positive framing of open science reforms

I don't often get the chance to talk directly and openly to people who are skeptical of the methodological reforms that are being suggested in psychology. But recently I've been trying to persuade someone I really respect that these reforms are warranted. It's a challenge, but one of the things I've been trying to do is give a positive, personal framing to the issues. Here's a stab at that. 

My hope is that a new graduate student in the fields I work on – language learning, social development, psycholinguistics, cognitive science more broadly – can pick up a journal and choose a seemingly strong study, implement it in my lab, and move forward with it as the basis for a new study. But unfortunately my experience is that this has not been the case much of the time, even in cases where it should be. I would like to change that, starting with my own work.

Here's one example of this kind of failure: As a first-year assistant professor, a grad student and I tried to replicate one of my grad school advisors' well-known studies. We failed repeatedly – despite the fact that we ended up thinking the finding was real (eventually published as Lewis & Frank, 2016, JEP:G). The issue was likely that the original finding was an overestimate of the effect, because the original sample was very small. But converging on the truth was very difficult and required multiple iterations.

Thursday, February 21, 2019

Nothing in childhood makes sense except in the light of continuous developmental change

I'm awestruck by the processes of development that operate over children's first five years. My daughter M is five and my newborn son J is just a bit more than a month old. J can't yet consistently hold his head up, and he makes mistakes even in bottle feeding – sometimes he continues to suck but forgets to swallow so that milk pours out of his mouth until his clothes are soaked. I remember this kind of thing happening with M as a baby ... and yet voila, five years later, you have someone who is writing text messages to grandma and illustrating new stories about Spiderman. How you could possibly get from A to B (or in my case, from J to M)? The immensity of this transition is perhaps the single most important challenge for theories of child development.

As a field, we have bounced back and forth between continuity and discontinuity theories to explain these changes. Continuity theories posit that infants' starting state is related to our end state, and that changes are gradual, not saltatory; discontinuity theories posit stage-like transitions. Behaviorist learning theory was fundamentally a continuity hypothesis – the same learning mechanisms (plus experience) underly all of behavior, and change is gradual. In contrast, Piagetian stage theory was fundamentally about explaining behavioral discontinuities. As the pendulum swung, we get core knowledge theory, a continuity theory: innate foundations are "revised but not overthrown" (paraphrasing Spelke et al. 1992). Gopnik and Wellman's "Theory theory" is a discontinuity theory: intuitive theories of domains like biology or causality are discovered like scientific theories. And so on.

For what it's worth, my take on the "modern synthesis" in developmental psychology is that development is domain-specific. Domain of development – perception, language, social cognition, etc. – progress on their own timelines determined by experience, maturation, and other constraining factors. And my best guess is that some domains develop continuously (especially motor and perceptual domains) while others, typically more "conceptual" ones, show more saltatory progress associated with stage changes. But – even though it would be really cool to be able to show this – I don't think we have the data to do so.

The problem is that we are not thinking about – or measuring – development appropriately. As a result, what we end up with is a theoretical mush. We talk as though everything is discrete, but that's mostly a function of our measurement methods. Instead, everything is at rock bottom continuous, and the question is how steep the changes are.

We talk as though everything is discontinuous all the time. The way we know how to describe development verbally is through what I call "milestone language." We discuss developmental transitions by (often helpful) age anchors, like "children say their first word around their first birthday," or "preschoolers pass the Sally-Ann task at around 3.5 years." When summarizing a study, we* assert that "by 7 months, babies can segment words from fluent speech," even if we know that this statement describes the fact that the mean performance of a group is significantly different than zero in a particular paradigm instantiating this ability, and even if we know that babies might show this behavior a month earlier if you tested enough of them! But it's a lot harder to say "early word production emerges gradually from 10 - 14 months (in most children)."

Beyond practicalities, one reason we use milestone language is because our measurement methods are only set up to measure discontinuities. First, our methods have poor reliability: we typically don't learn very much about any one child, so we can't say conclusively whether they truly show some behavior or not. In addition, most developmental studies are severely underpowered, just like most studies in neuroscience and psychology in general. So the precision of our estimates of a behavior for groups of children are noisy. To get around this problem, we use null hypothesis significance tests – and when the result is p < .05, we declare that development has happened. But of course we will see discrete changes in development if we use a discrete statistical cutoff!

And finally, we tend to stratify our samples into discrete age bins (which is a good way to get coverage), e.g. recruiting 3-month-olds, 5-month-olds, and 7-month-olds for a study. But then, we use these discrete samples as three separate analytic groups, ignoring the continuous developmental variation between them! This practice reduces statistical power substantially, much like taking median splits on continuous variables (taking a median split on average is like throwing away a third of your sample!). In sum, even in domains where development is continuous, our methods guarantee that we get binary outcomes. We don't try to estimate continuous functions, even when our data afford them.

The truth is, when you scratch the surface in development, everything changes continuously. Even the stuff that's not supposed to change still changes. I saw this in one of my very first studies, when I was a lab manager for Scott Johnson and we accidentally found ourselves measuring 3-9 month-olds' face preferences. Though I had learned from the literature that infants had an innate face bias, I was surprised to find that magnitude of face looking was changing dramatically across the range I was measuring. (Later we found that this change was related to the development of other visual orientating skills). Of course "it's not surprising" that some complex behavior goes up with development, says reviewer 3. But it is important, and the ways we talk about and analyze our data don't reflect the importance of quantifying continuous developmental change.

One reason that it's not surprising to see developmental change is that everything that children do is at its heart a skill. Sucking and swallowing is a skill. Walking is a skill. Recognizing objects is a skill. Recognizing words is a skill too - so too is the rest of language, at least according to some folks. Thinking about other people's thoughts is a skill. So that means that everything gets better with practice. It will – to a first approximation – follow a classic logistic curve like this:

Most skills get better with practice, and the ones described above are no exception. But developmental progress also happens in the absence of practice of specific skills due to physiological maturation – older children's brains are faster and more accurate at processing information, even for skills that haven't been practiced. So samples from this behavior should look like these red lines:

But here's the problem. If you have a complex behavior, it's built of simple behaviors, which are themselves skills. To get the probability of success on one of those complex skills, you can – as a first approximation – multiply the independent probabilities of success in each of the components. That process yields logistic curves that look like these (color indicating the number of components):

And samples from a process with many components look even more discrete, because the logistic is steeper!

Given this kind of perspective, we should expect complex behaviors to emerge relatively suddenly, even if they are simply the product of a handful of continuously changing processes.

This means, from a theoretical standpoint, we need stronger baselines. Our typical baseline at the moment is the null hypothesis of no difference; but that's a terrible baseline! Instead, we need to be comparing to a null hypothesis of "developmental business as usual." To show discontinuity, we need to take into account the continuous changes that a particular behavior will inevitably be undergoing. And then, we need to argue that the rate of developmental change that a particular process is undergoing is faster than we should expect based on simple learning of that skill. Of course to make these kinds of inferences requires far more data about individuals than we usually gather.

In a conference paper that I'm still quite proud of, we tried to create this sort of baseline for early word learning. Arguably, early word learning is a domain where there likely aren't huge, discontinuous changes – instead kids gradually get faster and more accurate in learning new words until they are learning several new words per day. We used meta-analysis to estimate developmental increases in two component processes of novel word mapping: auditory word recognition and social cue following. Both of these got faster and more accurate over the first couple of years. When we put these increases together, we found they together created really substantial changes in how much input would be needed for a new word mapping. (Of course what we haven't done in the three years since we wrote that paper is actually measure the parameters on the process of word mapping developmentally – maybe that's for a subsequent ManyBabies study...). Overall, this baseline suggests that even in the absence of discontinuity, continuous changes in many small processes can produce dramatic developmental differences.

In sum: sometimes developmental psychologists don't take the process of developmental change seriously enough. To do better, we need to start analyzing change continuously; measuring with sufficient precision to estimate rates of change; and creating better continuous baselines before we make claims about discrete change or emergence. 

* I definitely do this too!

Sunday, December 9, 2018

How to run a study that doesn't replicate, experimental design edition

(tl;dr: Design features of psychology studies to avoid if you want to run a good study!)

Imagine reading about a psychology experiment in which participants are randomly assigned to one of two different short state inductions (say by writing a passage or unscrambling sentences), and then outcomes are measured via a question about an experimental vignette. The whole thing takes place in about 10 minutes and is administered through a survey, perhaps via Qualtrics.

The argument of this post is that this experiment has a low probability of replicating, and we can make that judgment purely from the experimental methods – regardless of the construct being measured, the content of the state induction, or the judgment that is elicited. Here's why I think so.

Friday was the last day of my graduate class in experimental methods. The centerpiece of the course is a replication project in which each student collects data on a new instantiation of a published experiment. I love teaching this course and have blogged before about outcomes from it. I've also written several journal articles about student replication in this model (Frank & Saxe, 2012Hawkins*, Smith*, et al., 2018). In brief, I think this is a really fun way for student to learn about experimental design and data analysis, open science methods, and the importance of replication in psychology. Further, the projects in my course are generally pretty high quality: they are pre-registered confirmatory tests with decent statistical power, and both the paradigm and the data analysis go through multiple rounds of review by the TAs and me (and sometimes also get feedback from the original authors).

Every year I rate each student project on its replication outcomes. The scale is from 0 to 1, with intermediate values indicating unclear results or partial patterns of replication (e.g., significant key test but different qualitative interpretation). The outcomes from the student projects this year were very disappointing. With 16/19 student projects finished, we have an average replication rate of .31. There were only 4 clear successes, 2 intermediate results, and 10 failure. Samples are small every year, but this rate was even lower than we saw in previous samples (2014-15: .57, N=38) and another one-year sample (2016: .55, N=11).

What happened? Many of the original experiments followed part or all of the schema described above, with a state induction followed by a question about a vignette. In other words, they were poorly designed.

Friday, September 7, 2018

Scale construction, continued

For psychometrics fans: I helped out with a post by Brent Roberts, "Yes or No 2.0: Are Likert scales always preferable to dichotomous rating scales?" This post is a continuation of our earlier conversation on scale construction and continues to examine the question of if – and if so, when – it's appropriate to use a Likert scale vs. a dichotomous scale. Spoiler: in some circumstances it's totally safe, while in others it is a disaster!

Thursday, August 30, 2018

Three (different) questions about development

(tl;dr: Some questions I'm thinking about, inspired by the idea of studying the broad structure of child development through larger-scale datasets.)

My daughter, M, started kindergarten this month. I began this blog when I was on paternity leave after she was born; the past five years have been an adventure and revolution for my understanding of development to watch her grow.* Perhaps the most astonishing feature of the experience is how continuous, incremental changes lead to what seem like qualitative revolutions. There is of course no moment in which she became the sort of person she is now: the kind of person who can tell a story about an adventure in which two imaginary characters encounter one another for the first time,** but some set of processes led us to this point. How do you uncover the psychological factors that contribute to this kind of growth and change?

My lab does two kinds of research. In both my hope is to contribute to this kind of understanding by studying the development of cognition and language in early childhood. The first kind of work we do is to conduct series of experiments with adults and children, usually aimed at getting answers to questions about representation and mechanism in early language learning in social contexts. The second kind of work is a larger-scale type of resource-building, where we create datasets and accompanying tools like Wordbank, MetaLab, and childes-db. The goal of this work is to make  larger datasets accessible for analysis – as testbeds for reproducibility and theory-building.

Each of these activities connects to the project of understanding development at the scale of an entire person's growth and change. In the case of small-scale language learning experiments, the inference strategy is pretty standard. We hypothesize the operation of some mechanism or the utility of some information source in a particular learning problem (say, the utility of pragmatic inference in word learning). Then we carry out a series of experiments that shows a proof of concept that children can use the hypothesized mechanism to learn something in a lab situation, along with control studies that rule out other possibilities. When done well, these studies can give you pretty good traction on individual learning mechanisms. But they can't tell you that these mechanisms are used by children consistently (or even at all) in their actual language learning.

In contrast, when we work with large-scale datasets, we get a whole-child picture that isn't available in the small studies. In our Wordbank work, for example, we get a global picture of the child's vocabulary and linguistic abilities, for many children across many languages. The trouble is, it's very hard or even impossible to find answers to smaller-scale questions (say, about information seeking from social partners) in datasets that represent global snapshots of children's experience or outcomes. Both methods – the large-scale and the small-scale – are great. The trouble is, the questions don't necessarily line up. Instead, larger datasets tend to direct you towards different questions. Here are three.