Showing posts with label Methods.

Monday, March 27, 2023

Domain-specific data repositories for better data sharing in psychology!

Data sharing is a critical part of ensuring a reproducible and robust research literature. It's also increasingly the law of the land, with new federal mandates taking effect in the US this year. How should psychologists and other behavioral scientists share their data? 

Repositories should clearly be FAIR - findable, accessible, interoperable, and reusable. But here's the thing - most data on a FAIR repository like the Open Science Framework (which is great, btw) will never be reused. They're findable and accessible, but they're not really interoperable or reusable. The problem is that most psychological data are measurements of some stuff in some experimental context. The measures we use are all over the place. We do not standardize our measures, let alone our manipulations. The metadata are comprehensible but not machine readable. And there is no universal ontology that lets someone say "I want all the measurements of self-regulation in children that are posted on OSF." 

What makes a dataset reusable really depends on the particular constructs it measures, which in turn depends on the subfield and community those data are being collected for. When I want to reuse data, I don't want data in general. I want data about a specific construct, from a specific instrument, with metadata particular to my intended use. Such data should be stored in repositories specific to that measure, construct, or instrument. Let's call these Domain-Specific Data Repositories (DSDRs). DSDRs are a way to make sure data actually are interoperable and actually do get reused by the target community.

Sunday, February 21, 2021

Methodological reforms, or, If we all want the same things, why can't we be friends?

(tl;dr: "Ugh, can't we just get along?!" OR "aspirational reform, meet actual policy?" OR "whither metascience?")


This post started out as a thread about the tribes of methodological reform in psychology, all of whom I respect and admire. Then it got too long, so it became a blogpost. 

As folks might know, I think methodological reform in psychology is critical (some of my views have been formed by my work with the ManyBabies consortium). For the last ~2 years, I've been watching two loose groups of methodological reformers get mad at each other. It has made me very sad to see these conflicts because I like all of the folks involved. I've actually felt like I've had to take a twitter holiday several times because I can't stand to see some of my favorite folks on the platform yelling at each other. 

This post is my - perhaps misguided - attempt to express appreciation for everyone involved and try to spell out some common ground.

Tuesday, November 5, 2019

Letter of recommendation: Attack of the Psychometricians

(tl;dr: It's letter of recommendation season, and so I decided to write one to a paper that's really been influential in my recent thinking. Psychometrics, y'all.)

To whom it may concern:

I am writing to provide my strongest recommendation for the paper, "Attack of the Psychometricians" by Denny Borsboom (2006). Reading this paper oriented me to a rich tradition of psychometric modeling – but more than that, it changed my perspective on the relationship between psychological measurement and theory. (It also taught me to use the term "sumscore"* as an insult). I urge you to consider it for a position in your reading list, syllabus, or lab meeting.

I first met AotP (or Attack!, as I like to call it) via a link on twitter. Not the most auspicious beginning, but from a quick skim on my phone, I could tell that this was a paper that needed further study.

The paper presents and discusses what it calls the central insight of psychometrics: that "measurement does not consist of finding the right observed score to substitute for a theoretical attribute, but of devising a model structure to relate an observable to a theoretical attribute." In other words, the goal is to make models that link data to theoretical quantities of interest. What this means is that measurement is essentially continuous with theory construction. By creating and testing a good measurement model, you're creating and testing a key component of a good theory.
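To make that concrete, here's a minimal sketch – not from the paper – of what such a measurement model looks like in practice: a one-factor confirmatory factor analysis in R with lavaan, in which a latent attribute is explicitly linked to its observed indicators. The data frame and item names are hypothetical.

# Minimal sketch: a one-factor measurement model in lavaan.
# 'df' and the items q1-q4 are hypothetical placeholders.
library(lavaan)

# The latent attribute ("selfreg") is linked to the observed items by an
# explicit model structure, rather than by summing the items into a sumscore.
model <- '
  selfreg =~ q1 + q2 + q3 + q4
'

fit <- cfa(model, data = df)
summary(fit, fit.measures = TRUE, standardized = TRUE)

Testing whether this model fits is itself a theoretical claim about how the observable items relate to the attribute.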

Monday, April 8, 2019

A (mostly) positive framing of open science reforms

I don't often get the chance to talk directly and openly to people who are skeptical of the methodological reforms that are being suggested in psychology. But recently I've been trying to persuade someone I really respect that these reforms are warranted. It's a challenge, but one of the things I've been trying to do is give a positive, personal framing to the issues. Here's a stab at that. 

My hope is that a new graduate student in the fields I work on – language learning, social development, psycholinguistics, cognitive science more broadly – can pick up a journal and choose a seemingly strong study, implement it in my lab, and move forward with it as the basis for a new study. But unfortunately my experience is that this has not been the case much of the time, even in cases where it should be. I would like to change that, starting with my own work.

Here's one example of this kind of failure: As a first-year assistant professor, a grad student and I tried to replicate one of my grad school advisors' well-known studies. We failed repeatedly – despite the fact that we ended up thinking the finding was real (eventually published as Lewis & Frank, 2016, JEP:G). The issue was likely that the original finding was an overestimate of the effect, because the original sample was very small. But converging on the truth was very difficult and required multiple iterations.

Sunday, December 9, 2018

How to run a study that doesn't replicate, experimental design edition

(tl;dr: Design features of psychology studies to avoid if you want to run a good study!)

Imagine reading about a psychology experiment in which participants are randomly assigned to one of two different short state inductions (say by writing a passage or unscrambling sentences), and then outcomes are measured via a question about an experimental vignette. The whole thing takes place in about 10 minutes and is administered through a survey, perhaps via Qualtrics.

The argument of this post is that this experiment has a low probability of replicating, and we can make that judgment purely from the experimental methods – regardless of the construct being measured, the content of the state induction, or the judgment that is elicited. Here's why I think so.

Friday was the last day of my graduate class in experimental methods. The centerpiece of the course is a replication project in which each student collects data on a new instantiation of a published experiment. I love teaching this course and have blogged before about outcomes from it. I've also written several journal articles about student replication in this model (Frank & Saxe, 2012; Hawkins*, Smith*, et al., 2018). In brief, I think this is a really fun way for students to learn about experimental design and data analysis, open science methods, and the importance of replication in psychology. Further, the projects in my course are generally pretty high quality: they are pre-registered confirmatory tests with decent statistical power, and both the paradigm and the data analysis go through multiple rounds of review by the TAs and me (and sometimes also get feedback from the original authors).

Every year I rate each student project on its replication outcomes. The scale is from 0 to 1, with intermediate values indicating unclear results or partial patterns of replication (e.g., a significant key test but a different qualitative interpretation). The outcomes from the student projects this year were very disappointing. With 16/19 student projects finished, we have an average replication rate of .31. There were only 4 clear successes, 2 intermediate results, and 10 failures. Samples are small every year, but this rate was even lower than we saw in previous samples (2014-15: .57, N=38) and another one-year sample (2016: .55, N=11).

What happened? Many of the original experiments followed part or all of the schema described above, with a state induction followed by a question about a vignette. In other words, they were poorly designed.

Friday, September 7, 2018

Scale construction, continued

For psychometrics fans: I helped out with a post by Brent Roberts, "Yes or No 2.0: Are Likert scales always preferable to dichotomous rating scales?" This post is a continuation of our earlier conversation on scale construction and continues to examine the question of whether – and if so, when – it's appropriate to use a Likert scale vs. a dichotomous scale. Spoiler: in some circumstances it's totally safe, while in others it is a disaster!

Saturday, May 5, 2018

nosub: a command line tool for pushing web experiments to Amazon Mechanical Turk

(This post is co-written with Long Ouyang, a former graduate student in our department, who is the developer of nosub, and Manuel Bohn, a postdoc in my lab who has created a minimal working example). 

Although my lab focuses primarily on child development, our typical workflow is to refine experimental paradigms via working with adults. Because we treat adults as a convenience population, Amazon Mechanical Turk (AMT) is a critical part of this workflow. AMT allows us to pay an hourly wage to participants all over the US who complete short experimental tasks. (Some background from an old post).

Our typical workflow for AMT tasks is to create custom websites that guide participants through a series of linguistic stimuli of one sort or another. For simple questionnaires we often use Qualtrics, a commercial survey product, but most tasks that require more customization are easy to set up as free-standing javascript/HTML sites. These sites then need to be pushed to AMT as "external HITs" (Human Intelligence Tasks) so that workers can find them, participate, and be compensated. 

nosub is a simple tool for accomplishing this process, building on earlier tools used by my lab.* The idea is simple: you customize your HIT settings in a configuration file and type

nosub upload

to upload your experiment to AMT. Then you can type

nosub download

to fetch results. Two nice features of nosub from a psychologist's perspective are: 1. worker IDs are anonymized by default so you don't need to worry about privacy issues (but they are deterministically hashed, so you can still flag repeat workers), and 2. nosub can post HITs in batches so that you don't get charged Amazon's surcharge for HITs with more than 9 assignments. 

All you need to get started is to install Node.js; installation instructions for nosub are available in the project repository.

Once you've run nosub, you can download your data in JSON format, which can easily be parsed into R. We've put together a minimal working example of an experiment that can be run using nosub and a data analysis script in R that reads in the data.  
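For a flavor of the parsing step, here's a hedged sketch of reading the downloaded JSON into R; the file name and the structure of the results are assumptions, and the linked repository has the real version.

# Sketch: read nosub's JSON output into R (file name and structure assumed).
library(jsonlite)

raw <- fromJSON("results.json", simplifyDataFrame = TRUE)
str(raw, max.level = 2)   # inspect the structure before flattening

# If the results simplify to a data frame, standard tools take over from here.
if (is.data.frame(raw)) head(raw)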

---
* psiTurk is another framework that provides a way of serving and tracking HITs. psiTurk is great, and we have used it for heavier-weight applications where we need to track participants, but it can be tricky to debug and is not always compatible with some of our lighter-weight web experiments.

Monday, February 26, 2018

Mixed effects models: Is it time to go Bayesian by default?

(tl;dr: Bayesian mixed effects modeling using brms is really nifty.)

Introduction: Teaching Statistical Inference?

How do you reason about the relationship between your data and your hypotheses? Bayesian inference provides a way to make normative inferences under uncertainty. As scientists – or even as rational agents more generally – we are interested in knowing the probability of some hypothesis given the data we observe. As a cognitive scientist I've long been interested in using Bayesian models to describe cognition, and that's what I did much of my graduate training in. These are custom models, sometimes fairly difficult to write down, and they are an area of active research. That's not what I'm talking about in this blogpost. Instead, I want to write about the basic practice of statistics in experimental data analysis.

Mostly when psychologists do and teach "stats," they're talking about frequentist statistical tests. Frequentist statistics are the standard kind people in psych have been using for the last 50+ years: t-tests, ANOVAs, regression models, etc. Anything that produces a p-value. P-values represent the probability of the data (or data more extreme) under the null hypothesis (typically "no difference between groups" or something like that). The problem is that this is not what we really want to know as scientists. We want the opposite: the probability of the hypothesis given the data, which is what Bayesian statistics allow you to compute. You can also compute the relative evidence for one hypothesis over another (the Bayes Factor).  

Now, the best way to set psychology twitter on fire is to start a holy war about who's actually right about statistical practice, Bayesians or frequentists. There are lots of arguments here, and I see some merit on both sides. That said, there is lots of evidence that much of our implicit statistical reasoning is Bayesian. So I tend towards the Bayesian side on the balance <ducks head>. But despite this bias, I've avoided teaching Bayesian stats in my classes. I've felt like, even with their philosophical attractiveness, actually computing Bayesian stats had too many very severe challenges for students. For example, in previous years you might run into major difficulties inferring the parameters of a model that would be trivial under a frequentist approach. I just couldn't bring myself to teach a student a philosophical perspective that – while coherent – wouldn't provide them with an easy toolkit to make sense of their data.  

The situation has changed in recent years, however. In particular, the BayesFactor R package by Morey and colleagues makes it extremely simple to do basic inferential tasks using Bayesian statistics. This is a huge contribution! Together with JASP, these tools make the Bayes Factor approach to hypothesis testing much more widely accessible. I'm really impressed by how well these tools work. 
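As a flavor of how simple this has become, a Bayes Factor t-test with the BayesFactor package looks roughly like the sketch below; the data frame and variable names are made up.

# Sketch: a Bayes Factor t-test (data frame 'd' with rt and group is hypothetical).
library(BayesFactor)

bf <- ttestBF(formula = rt ~ group, data = d)
bf   # evidence for a group difference relative to the null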

All that said, my general approach to statistical inference tends to rely less on inference about a particular hypothesis and more on parameter estimation – following the spirit of folks like Gelman & Hill (2007) and Cumming (2014). The basic idea is to fit a model whose parameters describe substantive hypotheses about the generating sources of the dataset, and then to interpret these parameters based on their magnitude and the precision of the estimate. (If this sounds vague, don't worry – the last section of the post is an example). The key tool for this kind of estimation is not tests like the t-test or the chi-squared. Instead, it's typically some variant of regression, usually mixed effects models. 

Mixed-Effects Models

Especially in psycholinguistics where our experiments typically show many people many different stimuli, mixed effects models have rapidly become the de facto standard for data analysis. These models (also known as hierarchical linear models) let you estimate sources of random variation ("random effects") in the data across various grouping factors. For example, in a reaction time experiment some participants will be faster or slower (and so all data from those particular individuals will tend to be faster or slower in a correlated way). Similarly, some stimulus items will be faster or slower and so all the data from these groupings will vary. The lme4 package in R was a game-changer for using these models (in a frequentist paradigm) in that it allowed researchers to estimate such models for a full dataset with just a single command. For the past 8-10 years, nearly every paper I've published has had a linear or generalized linear mixed effects model in it. 

Despite the simplicity of fitting them, the biggest problem with mixed effects models (especially from an educational point of view) has been figuring out how to write consistent model specifications for the random effects. Often there are many factors that vary randomly (subjects, items, etc.) and many other factors that are nested within those (e.g., each subject might respond differently to each condition). Thus, it is not trivial to figure out what model to fit, even if fitting the model is just a matter of writing a command. Even in a reaction-time experiment with just items and subjects as random variables, and one condition manipulation, you can write

(1) rt ~ condition + (1 | subject) + (1 | item)

for just random intercepts by subject and by item, or you can nest condition (fitting a random slope) for one or both:

(2) rt ~ condition + (condition | subject) + (condition | item)

and you can additionally fiddle with covariance between random effects for even more degrees of freedom!
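In lme4 code, each of these specifications is a single call; here's a sketch with a hypothetical data frame d.

# Sketch: fitting specifications (1) and (2) with lme4 ('d' is hypothetical).
library(lme4)

# (1) random intercepts only, by subject and by item
m1 <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = d)

# (2) random intercepts plus by-subject and by-item condition slopes,
#     with intercept-slope correlations estimated
m2 <- lmer(rt ~ condition + (condition | subject) + (condition | item), data = d)

# a middle ground: the double-bar syntax drops the correlations
# (for factor predictors this requires numeric/dummy coding to work as expected)
m3 <- lmer(rt ~ condition + (condition || subject) + (condition || item), data = d)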

Luckily, a number of years ago, a powerful and clear simulation paper by Barr et al. (2013) came out. They argued that there was a simple solution to the specification issue: use the "maximal" random effects structure supported by the design of the experiment. This meant adding any random slopes that were actually supported by your design (e.g., if condition was a within-subject variable, you could fit condition by subject slopes). While this suggestion was quite controversial,* Barr et al.'s simulations were persuasive evidence that this suggestion led to conservative inferences. In addition, having a simple guideline to follow eliminated a lot of the worry about analytic flexibility in random effects structure. If you were "keeping it maximal" that meant that you weren't intentionally – or even inadvertently – messing with your model specification to get a particular result. 

Unfortunately, a new problem reared its head in lme4: convergence. Very frequently, when you specify the maximal model, the approximate inference algorithms that search for the maximum likelihood solution will simply not find a satisfactory one. This can happen even when you have quite a lot of data – in part because the number of parameters being fit is extremely high. In the case above, not counting covariance parameters, we are fitting a slope and an intercept across participants, plus a slope and an intercept for every participant and for every item.

To deal with this, people have developed various strategies. The first is to do some black magic to try and change the optimization parameters (e.g., following these helpful tips). Then you start to prune random effects away until your model is "less maximal" and you get convergence. But these practices mean you're back in flexible-model-adjustment land, and vulnerable to all kinds of charges of post-hoc model tinkering to get the result you want. We've had to specify lab best-practices about the order for pruning random effects – kind of a guide to "tinkering until it works," which seems suboptimal. In sum, the models are great, but the methods for fitting them don't seem to work that well. 
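For concreteness, that tinkering usually looks something like the sketch below; the settings are illustrative, not a recommendation.

# Sketch of the usual convergence tinkering in lme4 (settings illustrative).
library(lme4)

# step 1: swap the optimizer and raise the iteration limit
m_max <- lmer(rt ~ condition + (condition | subject) + (condition | item),
              data = d,
              control = lmerControl(optimizer = "bobyqa",
                                    optCtrl = list(maxfun = 2e5)))

# step 2: if it still fails, prune - drop correlations first, then slopes
m_nocorr <- lmer(rt ~ condition + (condition || subject) + (condition || item), data = d)
m_int    <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = d)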

Enter Bayesian methods. For several years, it's been possible to fit Bayesian regression models using Stan, a powerful probabilistic programming language that interfaces with R. Stan, building on BUGS before it, has put Bayesian regression within reach for someone who knows how to write these models (and interpret the outputs). But in practice, when you could fit an lmer in one line of code and five seconds, it seemed like a bit of a trial to hew the model by hand out of solid Stan code (which looks a little like C: you have to declare your variable types, etc.). We have done it sometimes, but typically only for models that you couldn't fit with lme4 (e.g., an ordered logit model). So I still don't teach this set of methods, or advise that students use them by default. 

brms?!? A worked example

In the last couple of years, the package brms has been in development. brms is essentially a front-end to Stan, so that you can write R formulas just like with lme4 but fit them with Bayesian inference.** This is a game-changer: all of a sudden we can use the same syntax but fit the model we want to fit! Sure, it takes 2-3 minutes instead of 5 seconds, but the output is clear and interpretable, and we don't have all the specification issues described above. Let me demonstrate. 

The dataset I'm working on is an unpublished set of data on kids' pragmatic inference abilities. It's similar to many that I work with. We show children of varying ages a set of images and ask them to choose the one that matches some description, then record if they do so correctly. Typically some trials are control trials where all the child has to do is recognize that the image matches the word, while others are inference trials where they have to reason a little bit about the speaker's intentions to get the right answer. Here are the data from this particular experiment:


I'm interested in quantifying the relationship between participant age and the probability of success in pragmatic inference trials (vs. control trials, for example). My model specification is:

(3) correct ~ condition * age + (condition | subject) + (condition | stimulus)

So I first fit this with lme4. Predictably, the full desired model doesn't converge, but here are the fixed effect coefficients: 

                  beta   stderr      z      p
intercept         0.50     0.19   2.65   0.01
condition         2.13     0.80   2.68   0.01
age               0.41     0.18   2.35   0.02
condition:age    -0.22     0.36  -0.61   0.54

Now let's prune the random effects until the convergence warning goes away. In the simplified version of the dataset that I'm using here, I can keep stimulus and subject intercepts and still get convergence when there are no random slopes. But in the larger dataset, the model won't converge unless I keep just the random intercept by subject:

                  beta   stderr      z      p
intercept         0.50     0.21   2.37   0.02
condition         1.76     0.33   5.35   0.00
age               0.41     0.18   2.34   0.02
condition:age    -0.25     0.33  -0.77   0.44

Coefficient values are decently different (but the p-values are not changed dramatically in this example, to be fair). More importantly, a number of fairly trivial things matter to whether the model converges. For example, I can get one random slope in if I set the other level of the condition variable to be the intercept, but it doesn't converge with either in this parameterization. And in the full dataset, the model wouldn't converge at all if I didn't center age. And then of course I haven't tweaked the optimizer or messed with the convergence settings for any of these variants. All of this means that there are a lot of decisions about these models that I don't have a principled way to make – and critically, they need to be made conditioned on the data, because I won't be able to tell whether a model will converge a priori!

So now I switched to the Bayesian version using brms, just writing brm() with the model specification I wanted (3). I had to do a few tweaks: upping the number of iterations (suggested by the warning messages in the output) and changing to a Bernoulli model rather than binomial (for efficiency, again suggested by the error message), but otherwise this was very straightforward. For simplicity I've adopted all the default prior choices, but I could have gone more informative.
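Concretely, the call looked roughly like the sketch below; the data frame name and the sampler settings are illustrative rather than the exact values used.

# Sketch of the brms fit for specification (3).
library(brms)

fit <- brm(correct ~ condition * age + (condition | subject) + (condition | stimulus),
           data = d,
           family = bernoulli(),  # one row per trial with a 0/1 outcome
           iter = 4000,           # upped from the default after a warning
           chains = 4, cores = 4)

summary(fit)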

Here's the summary output for the fixed effects:

                 estimate  error  l-95% CI  u-95% CI
intercept            0.54   0.48     -0.50      1.69
condition            2.78   1.43      0.21      6.19
age                  0.45   0.20      0.08      0.85
condition:age       -0.14   0.45     -0.98      0.84

From this call, we get back coefficient estimates that are somewhat similar to the other models, along with 95% credible interval bounds. Notably, the condition effect is larger (probably corresponding to being able to estimate a more extremal value for the logit based on sparse data), and then the interaction term is smaller but has higher error. Overall, coefficients look more like the first non-convergent maximal model than the second converging one. 

The big deal about this model is not that what comes out the other end of the procedure is radically different. It's that it's not different. I got to fit the model I wanted, with a maximal random effects structure, and the process was almost trivially easy. In addition, and as a bonus, the CIs that get spit out are actually credible intervals that we can reason about in a sensible way (as opposed to frequentist confidence intervals, which are quite confusing if you think about them deeply enough). 

Conclusion

Bayesian inference is a powerful and natural way of fitting statistical models to data. The trouble is that, up until recently, you could easily find yourself in a situation where there was a dead-obvious frequentist solution but off-the-shelf Bayesian tools wouldn't work or would generate substantial complexity. That's no longer the case. The existence of tools like BayesFactor and brms means that I'm going to suggest that people in my lab go Bayesian by default in their data analytic practice. 

----
Thanks to Roger Levy for pointing out that model (3) above could include an age | stimulus slope to be truly maximal. I will follow this advice in the paper. 

* Who would have thought that a paper about statistical models would be called "the cave of shadows"?
** Rstanarm did this also, but it covered fewer model specifications and so wasn't as helpful. 

Tuesday, January 16, 2018

MetaLab, an open resource for theoretical synthesis using meta-analysis, now updated

(This post is jointly written by the MetaLab team, with contributions from Christina Bergmann, Sho Tsuji, Alex Cristia, and me.)


A typical “ages and stages” ordering. Meta-analysis helps us do better.

Developmental psychologists often make statements of the form “babies do X at age Y.” But these “ages and stages” tidbits sometimes misrepresent a complex and messy research literature. In some cases, dozens of studies test children of different ages using different tasks and then declare success or failure based on a binary p < .05 criterion. Often only a handful of these studies – typically those published earliest or in the most prestigious journals – are used in reviews, textbooks, or summaries for the broader public. In medicine and other fields, it’s long been recognized that we can do better.

Meta-analysis (MA) is a toolkit of techniques for combining information across disparate studies into a single framework so that evidence can be synthesized objectively. The results of each study are transformed into a standardized effect size (like Cohen’s d) and are treated as a single data point for a meta-analysis. Each data point can be weighted to reflect a given study’s precision (which typically depends on sample size). These weighted data points are then combined into a meta-analytic regression to assess the evidential value of a given literature. Follow-up analyses can also look at moderators – factors influencing the overall effect – as well as issues like publication bias or p-hacking.* Developmentalists will often enter participant age as a moderator, since meta-analysis enables us to statistically assess how much effects for a specific ability increase as infants and children develop. 
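In R, this whole pipeline amounts to a few lines with a package like metafor; here's a hedged sketch in which the effect-size and moderator column names are made up.

# Sketch: a random-effects meta-analysis with an age moderator using metafor.
# The column names (d_calc, d_var_calc, mean_age) are hypothetical.
library(metafor)

res <- rma(yi = d_calc,       # standardized effect size for each study
           vi = d_var_calc,   # its sampling variance (the basis for weighting)
           mods = ~ mean_age, # moderator: mean participant age
           data = ma_data,
           method = "REML")
summary(res)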


An example age-moderation relationship for studies of mutual exclusivity in early word learning.

Meta-analyses can be immensely informative – yet they are rarely used by researchers. One reason may be that it takes a bit of training to carry them out or even to understand them. Additionally, MAs go out of date as new studies are published. 

To facilitate developmental researchers’ access to up-to-date meta-analyses, we created MetaLab. MetaLab is a website that compiles MAs of phenomena in developmental psychology. The site has grown over the last two years from just a small handful of MAs to 15 at present, with data from more than 16,000 infants. The data from each MA are stored in a standardized format, allowing them to be downloaded, browsed, and explored using interactive visualizations. Because all analyses are dynamic, curators or interested users can add new data as the literature expands.

Thursday, December 7, 2017

Open science is not inherently interesting. Do it anyway.

tl;dr: Open science practices themselves don't make a study interesting. They are essential prerequisites whose absence can undermine a study's value.

There's a tension in discussions of open science, one that is also mirrored in my own research. What I really care about are the big questions of cognitive science: what makes people smart? how does language emerge? how do children develop? But in practice I spend quite a bit of my time doing meta-research on reproducibility and replicability. I often hear critics of open science – focusing on replication, but also other practices – objecting that open science advocates are making science more boring and decreasing the focus on theoretical progress (e.g., Locke, Stroebe & Strack).  The thing is, I don't completely disagree. Open science is not inherently interesting.

Sometimes someone will tell me about a study and start the description by saying that it's pre-registered, with open materials and data. My initial response is "ho hum." I don't really care if a study is preregistered – unless I care about the study itself and suspect p-hacking. Then the only thing that can rescue the study is preregistration. Otherwise, I don't care about the study any more; I'm just frustrated by the wasted opportunity.

So here's the thing: Although being open can't make your study interesting, the failure to pursue open science practices can undermine its value. This post is an attempt to justify this idea by giving an informal Bayesian analysis of what makes a study interesting and why transparency and openness are then the key to maximizing a study's value.

Friday, November 10, 2017

Talk on reproducibility and meta-science

I just gave a talk at UCSD on reproducibility and meta-science issues. The slides are posted here.  I focused somewhat on developmental psychology, but a number of the studies and recommendations are more general. It was lots of fun to chat with students and faculty, and many of my conversations focused on practical steps that people can take to move their research practice towards a more open, reproducible, and replicable workflow. Here are a few pointers:

Preregistration. Here's a blogpost from last year on my lab's decision to preregister everything. I also really like Nosek et al.'s Preregistration Revolution paper. AsPredicted.org is a great gateway to simple preregistration (guide).

Reproducible research. Here's a blogpost on why I advocate for using RMarkdown to write papers. The best package for doing this is papaja (pronounced "papaya"). If you don't use RMarkdown but do know R, here's a tutorial.

Data sharing. Just post it. The Open Science Framework is an obvious choice for file sharing. Some nice video tutorials provide an easy way to get started.

Sunday, November 5, 2017

Co-work, not homework

Coordination is one of the biggest challenges of academic collaborations. You have two or more busy collaborators working asynchronously on a project. Either the collaboration ping-pongs back and forth with quick responses but limited opportunity for deeper engagement, or else one person digs in and really makes conceptual progress but then has to wait an excruciating amount of time for collaborators to get engaged, understand the contribution, and respond themselves. What's more, there are major inefficiencies caused by having to load the project back into memory each time you begin again. ("What was it we were trying to do here?")

The "homework" model in collaborative projects is sometimes necessary, but often inefficient. This default means that we meet to discuss and make decisions, then assign "homework" based on that discussion and make a meeting to review the work and make a further plan. The time increments of these meetings are usually 60 minutes, with the additional email overhead for scheduling. Given the amount of time I and the collaborators will actually spend on the homework the ratio of actual work time to meetings is sometimes not much better than 2:1 if there are many decisions to be made on a project – as in design, analytic, and writeup stages.* Of course if an individual has to do data collection or other time-consuming tasks between meetings, this model doesn't hold!

Increasingly, my solution is co-work. The idea is that collaborators schedule time to sit together and do the work – typically writing code or prose, occasionally making stimuli or other materials – either in person or online. This model means that when conceptual or presentational issues come up we can chat about them as they arise, rather than waiting to resolve them by email or in a subsequent meeting.** As a supervisor, I love this model because I get to see how the folks I work with are approaching a problem and what their typical workflow is. This observation can help me give process-level feedback as I learn how people organize their projects. I also often learn new coding tricks this way.***

Friday, October 6, 2017

Introducing childes-db: a flexible and reproducible interface to CHILDES

Note: childes-db is a project that is a collaboration between Alessandro Sanchez, Stephan Meylan, Mika Braginsky, Kyle MacDonald, Dan Yurovsky, and me; this blogpost was written jointly by the group.

For those of us who study child development – and especially language development – the Child Language Data Exchange System (CHILDES) is probably the single most important resource in the field. CHILDES is a corpus of transcripts of children, often talking with a parent or an experimenter, and it includes data from dozens of languages and hundreds of children. It’s a goldmine. CHILDES has also been around since way before the age of “big data”: it started with Brian MacWhinney and Catherine Snow photocopying transcripts (and then later running OCR to digitize them!). The field of language acquisition has been a leader in open data sharing largely thanks to Brian’s continued work on CHILDES.

Despite these strengths, using CHILDES can sometimes be challenging, especially at the most casual and the most in-depth ends of the user spectrum. Simple analyses like estimating word frequencies can be done using CLAN – the major interface to the corpora – but these require more comfort with command-line interfaces and programming than can be expected in many classroom settings. On the other end of the spectrum, many of us who use CHILDES for in-depth computational studies like to read in the entire database, parse out many of the rich annotations, and get a set of flat text files. But doing this parsing correctly is complicated, and small decisions in the data-processing pipeline can often lead to different downstream results. Further, it can be very difficult to reconstruct a particular data prep in order to do a replication study. We've been frustrated several times when trying to reproduce others' modeling results on CHILDES, not knowing whether our implementation of their model was wrong or whether we were simply parsing the data differently.

To address these issues and generally promote the use of CHILDES in a broader set of research and education contexts, we’re introducing a project called childes-db. childes-db aims to provide both a visualization interface for common analyses and an application programming interface (API) for more in-depth investigation. For casual users, you can explore the data with Shiny apps, browser-based interactive graphs that supplement CHILDES’s online transcript browser. For more intensive users, you can get direct access to pre-parsed text data using our API: an R package called childesr, which allows users to subset the corpora and get processed text. The backend of all of this is a MySQL database that’s populated using a publicly-available – and hopefully definitive – CHILDES parser, to avoid some of the issues caused by different processing pipelines.
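Roughly, a childesr session looks like the sketch below; treat the function and argument names here as illustrative, and see the package documentation for the authoritative interface.

# Sketch of a childesr session (function/argument names illustrative).
library(childesr)

transcripts <- get_transcripts()            # index of available transcripts
utts <- get_utterances(corpus = "Brown")    # pre-parsed utterances from one corpus
head(utts)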

Thursday, June 1, 2017

Confessions of an Associate Editor

For the last year and a half I've been an Associate Editor at the journal Cognition. I joined up because Cognition is the journal closest to my core interests; I've published nine papers there, more than in any other outlet by a long shot. Cognition has been important historically, and continues to publish recent influential papers as well. I was also excited about a new initiative by Steve Sloman (the EIC) to require authors to post raw data. Finally, I joined knowing that Cognition is currently an Elsevier journal. I – perhaps naively – hoped that like Glossa, Cognition could leave Elsevier (which has a very bad reputation, to say the least) and go open access. I'm stepping down as an AE in the fall because of family constraints and other commitments, and so I wanted to take the opportunity to reflect on the experience and some lessons I've learned.

Be kind to your local editor. Editing is hard work done by ordinary academics, and it's work they do over and above all the typical commitments of non-editor academics. I was (and am) slow as an editor, and I feel very guilty about it. The guilt over not moving faster has been the hardest aspect of the job; often when I am doing some other work, I will be distracted by my slipping editorial responsibilities.1 Yet if I keep on top of them I feel that I'm neglecting my lab or my teaching. As a result, I have major empathy now for other editors – and massive respect for the faster ones. Also, whenever someone complains about slow editors on twitter, my first thought is "cut them some slack!"

Make data open (and share code too, while you're at it)! I was excited by Sloman's initiative for data openness when I first read about it. I'm still excited about it: It's the right thing to do. Data sharing is a prerequisite for ensuring the reproducibility of results in papers, and it enables reuse of data for folks doing meta-analysis, computational modeling, and other forms of synthetic theoretical work. It's also very useful for replication – students in my graduate class do replications of published papers and often learn a tremendous amount about the paradigm and analyses of the original experiment by looking at posted data when they are available. But sharing data is not enough. Tom Hardwicke, a postdoc in my lab and in the METRICS center at Stanford, is currently doing a study of the computational reproducibility of results published in Cognition – data are still coming in, but our first impression is that it's often difficult to reproduce the findings in a good number of papers based on the raw data and the written description of the analyses. Cognition and other journals can do much more to facilitate posting of analytic code.

Open access is harder than it looks. I care deeply about open access – as both an ethical priority and a personal convenience. And the journal publishing model is broken. At the same time, my experiences have convinced me that it is no small thing to switch a major journal to a truly OA model. I could spend an entire blogpost on this issue alone (and maybe I will later), but the key issue here is money: where it comes from and where it goes. Running Cognition is a costly affair in its current form. There is an EIC, two senior AEs, and nine other AEs. All receive small but not insignificant stipends. There is also a part-time editorial assistant, and an editorial software platform. I don't know most of these costs, but my guess is that replicating this system as is – without any of the legal, marketing, and other infrastructure – would be minimally $150,000 USD/year (probably closer to 200k or more, depending on software).

Thursday, January 26, 2017

Paper submission checklist


It's getting to be CogSci submission time, and this year I am thinking more about trying to set uniform standards for submission. Following my previous post on onboarding, here's a pre-submission checklist that I'm encouraging folks in my lab to follow. Note that, as described in that post, all our papers are written in RStudio using R Markdown, so the paper should be a single document that compiles all analyses and figures into a single PDF. This process helps deal with much of the error-checking of results that used to be the bulk of my presubmission checking.

Paper writing*

  • Is the first paragraph engaging and clear to an outsider who doesn't know this subfield?
  • Are multiple alternative hypotheses stated clearly in the introduction and linked to supporting prior literature?
  • Does the paragraph before the first model/experiment clearly lay out the plan of the paper?
  • Does the abstract describe the main contribution of the paper in terms that are accessible to a broad audience?
  • Does the first paragraph of the general discussion clearly describe the contributions of the paper to someone who hasn't read the results in detail? 
  • Is there a statement of limitations on the work (even a few lines) in the general discussion?

Tuesday, January 3, 2017

Onboarding

Reading twitter this morning I saw a nice tweet by Page Piccinini, on the topic of organizing project folders:
This is exactly what I do and ask my students to do, and I said so. I got the following thoughtful reply from my old friend Adam Abeles:
He's exactly right. I need some kind of onboarding guide. Since I'm going to have some new folks joining my lab soon, no time like the present. Here's a brief checklist for what to expect from a new project.

Friday, November 4, 2016

Don't bar barplots, but use them cautiously

Should we outlaw the most common visualization in psychology? The hashtag #barbarplots has been introduced as part of a systematic campaign to promote a ban on bar graphs. The argument is simple: barplots mask the distributional form of the data, and all sorts of other visualization forms exist that are more flexible and precise, including boxplots, violin plots, and scatter plots. All of these show the distributional characteristics of a dataset more effectively than a bar plot.
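For reference, the alternatives are easy to produce; here's a hedged ggplot2 sketch with a hypothetical data frame d containing condition and score columns.

# Sketch: showing the distribution instead of (or on top of) a bar of means.
library(ggplot2)

ggplot(d, aes(x = condition, y = score)) +
  geom_violin() +                                      # distributional shape
  geom_jitter(width = 0.1, alpha = 0.3) +              # raw data points
  stat_summary(fun = mean, geom = "point", size = 3)   # condition means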

Every time the issue gets discussed on twitter, I get a little bit rant-y; this post is my attempt to explain why. It's not because I fundamentally disagree with the argument. Barplots do mask important distributional facts about datasets. But there's more we have to take into account.

Friday, July 22, 2016

Preregister everything

Which methodological reforms will be most useful for increasing reproducibility and replicability? I've gone back and forth on this blog about a number of possible reforms to our methodological practices, and I've been particularly ambivalent in the past about preregistration, the process of registering methodological and analytic decisions prior to data collection. In a post from about three years ago, I worried that preregistration was too time-consuming for small-scale studies, even if it was appropriate for large-scale studies. And last year, I worried whether preregistration validates the practice of running (and publishing) one-offs, rather than running cumulative study sets. I think these worries were overblown, and resulted from my lack of understanding of the process.

Instead, I want to argue here that we should be preregistering every experiment we do. The cost is extremely low and the benefits – both to the research process and to the credibility of our results – are substantial. Starting in the past few months, my lab has begun to preregister every study we run. You should too.

The key insights for me were:
  1. Different preregistrations can have different levels of detail. For some studies, you write down "we're going to run 24 participants in each condition, and exclude them if they don't finish." For others you specify the full analytic model and the plots you want to make. But there is no study for which you know nothing ahead of time. 
  2. You can save a ton of time by having default analytic practices that don't need to be registered every time. For us these live on our lab wiki (which is private but I've put a copy here).  
  3. It helps me get confirmation on what's ready to run. If it's registered, then I know that we're ready to collect data. I especially like the interface on AsPredicted, which asks coauthors to sign off before the registration goes through. (This also incidentally makes some authorship assumptions explicit.) 

Thursday, March 10, 2016

Limited support for an app-based intervention

tl;dr: I reanalyzed a recent field-trial of a math-learning app. The results differ by analytic strategy, suggesting the importance of preregistration.

Last year, Berkowitz et al. published a randomized controlled trial of a learning app. Children were randomly assigned to math and reading app groups; their learning outcomes on standardized math and reading tests were assessed after a period of app usage. A math anxiety measure was also collected from children’s parents. The authors wrote that:

The intervention, short numerical story problems delivered through an iPad app, significantly increased children’s math achievement across the school year compared to a reading (control) group, especially for children whose parents are habitually anxious about math.
I got excited about this finding because I have recently been trying to understand the potential of mobile and tablet apps for intervention at home, but when I dug into the data I found that not all views of the dataset supported the success of the intervention. That's important because this was a well-designed, well-conducted trial. But the basic randomization to condition did not produce differences in outcome, as you can see in the main figure of my reanalysis.



My extensive audit of the dataset is posted here, with code and their data here. (I really appreciate that the authors shared their raw data so that I could do this analysis – this is a huge step forward for the field!). Quoting from my report: 
In my view, the Berkowitz et al. study does not show that the intervention as a whole was successful, because there was no main effect of the intervention on performance. Instead, it shows that – in some analyses – more use of the math app was related to greater growth in math performance, a dose-response relationship that is subject to significant endogeneity issues (because parents who use math apps more are potentially different from those who don’t). In addition, there is very limited evidence for a relationship of this growth to math anxiety. In sum, this is a well-designed study that nevertheless shows only tentative support for an app-based intervention.
Here's a link to my published comment (which came out today), and here's Berkowitz et al.'s very classy response. Their final line is:
We welcome debate about data analysis and hope that this discussion benefits the scientific community.

Thursday, February 18, 2016

Explorations in hierarchical drift diffusion modeling

tl;dr: Adventures in using different platforms/methods to fit drift diffusion models to data. 

The drift diffusion model (DDM) is increasingly a mainstay of research on decision-making, both in neuroscience and cognitive science. The classic DDM defines a noisy evidence-accumulation process – essentially a random walk with drift – that describes a joint distribution over both accuracies and reaction times. This kind of joint distribution is really useful for capturing tasks where there could be speed-accuracy tradeoffs, and hence where classic univariate analyses are uninformative. Here's the classic DDM picture, this version from Vandekerckhove, Tuerlinckx, & Lee (2010), who have a nice tutorial on hierarchical DDMs:
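To make the generative story concrete, here's a minimal base-R simulation sketch (parameter values arbitrary) showing how a single diffusion process yields both a choice and an RT.

# Minimal sketch: simulate one drift-diffusion trial (arbitrary parameter values).
# Evidence starts between two boundaries and drifts noisily until it hits one;
# the boundary reached gives the choice and the elapsed time gives the RT.
simulate_ddm_trial <- function(drift = 0.3, boundary = 1, start = 0,
                               noise = 1, dt = 0.001, ndt = 0.3) {
  evidence <- start
  t <- 0
  while (abs(evidence) < boundary) {
    evidence <- evidence + drift * dt + noise * sqrt(dt) * rnorm(1)
    t <- t + dt
  }
  list(choice = if (evidence > 0) "upper" else "lower",
       rt = t + ndt)  # add non-decision time
}

set.seed(1)
trials <- replicate(20, simulate_ddm_trial(), simplify = FALSE)
do.call(rbind, lapply(trials, as.data.frame))  # accuracies and RTs covary with the parameters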


We recently started using DDM to try and understand decision-making behavior in the kinds of complex inference tasks that my lab and I have been studying for the past couple of years. For example, in one recently-submitted paper, we use DDM to look at decision processes for inhibition, negation, and implicature, trying to understand the similarities and differences in these three tasks:


We had initially hypothesized that performance in the negation and implicature tasks (our target tasks) would correlate with inhibition performance. It didn't, and what's more the data seemed to show very different patterns across the three tasks. So we turned to DDM to understand a bit more of the decision process for each of these tasks.* Also, in a second submitted paper, we looked at decision-making during "scalar implicatures," the inference that "I ate some of the cookies" implies that I didn't eat all of them. In both of these cases, we wanted to know what was going on in these complex, failure-prone inferences.

An additional complexity was that we are interested in the development of these inferences in children. DDM has not been used much with children, usually because of the large number of trials that DDM seems to require. But we were inspired by a recent paper by Ratcliff (one of the important figures in DDMs), which used DDMs for data from elementary-aged children. And since we have been using iPad experiments to get RTs and accuracies for preschoolers, we thought we'd try and do these analyses with data from both kids and adults.

But... it turns out that it's not trivial to fit DDMs (especially the more interesting variants) to data, so I wanted to use this blogpost to document my process in exploring different ecosystems for DDM and hierarchical DDM.