Sunday, December 9, 2018

How to run a study that doesn't replicate, experimental design edition

(tl;dr: Design features of psychology studies to avoid if you want to run a good study!)

Imagine reading about a psychology experiment in which participants are randomly assigned to one of two different short state inductions (say by writing a passage or unscrambling sentences), and then outcomes are measured via a question about an experimental vignette. The whole thing takes place in about 10 minutes and is administered through a survey, perhaps via Qualtrics.

The argument of this post is that this experiment has a low probability of replicating, and we can make that judgment purely from the experimental methods – regardless of the construct being measured, the content of the state induction, or the judgment that is elicited. Here's why I think so.

Friday was the last day of my graduate class in experimental methods. The centerpiece of the course is a replication project in which each student collects data on a new instantiation of a published experiment. I love teaching this course and have blogged before about outcomes from it. I've also written several journal articles about student replication in this model (Frank & Saxe, 2012; Hawkins*, Smith*, et al., 2018). In brief, I think this is a really fun way for students to learn about experimental design and data analysis, open science methods, and the importance of replication in psychology. Further, the projects in my course are generally pretty high quality: they are pre-registered confirmatory tests with decent statistical power, and both the paradigm and the data analysis go through multiple rounds of review by the TAs and me (and sometimes also get feedback from the original authors).

Every year I rate each student project on its replication outcomes. The scale is from 0 to 1, with intermediate values indicating unclear results or partial patterns of replication (e.g., significant key test but different qualitative interpretation). The outcomes from the student projects this year were very disappointing. With 16/19 student projects finished, we have an average replication rate of .31. There were only 4 clear successes, 2 intermediate results, and 10 failures. Samples are small every year, but this rate was even lower than we saw in previous samples (2014-15: .57, N=38) and another one-year sample (2016: .55, N=11).
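As a sanity check on the arithmetic, here is the scoring scheme in a few lines of Python (the 1 / 0.5 / 0 weights for clear success / intermediate / failure are my reading of the scale described above):

```python
# Replication scoring: 1 = clear success, 0.5 = intermediate, 0 = failure
# (the 0.5 weight for intermediate results is an assumption on my part)
scores = [1.0] * 4 + [0.5] * 2 + [0.0] * 10  # the 16 finished projects

rate = sum(scores) / len(scores)
print(round(rate, 2))  # → 0.31
```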

What happened? Many of the original experiments followed part or all of the schema described above, with a state induction followed by a question about a vignette. In other words, they were poorly designed.

There's now a strong meta-scientific literature suggesting that prediction markets can accurately guess which studies will not replicate. Some of this effect is likely due to general plausibility of study results – the general correlation of prior and posterior probabilities of effects. There are also general statistical predictors of failures to replicate – small samples, small effect sizes, and p-values relatively close to the .05 boundary. Over the past 5-6 years, the community has received a real education about these issues. In my class, we try to spot these statistical red flags and sometimes now ask students not to select projects that show them. Further, within the constraints of our limited class budget, we try to recruit decent sample sizes.*

This year in my class, I think experimental design was the culprit for many of our failed replications, however. Further, I suspect that many of the prediction markets are picking up on problematic design features as well as the statistical issues mentioned above. Here are the experimental design features that appear – both in my experience and, in some cases, in the broader literature – related to replication success. These "negative features" shape my defaults about how to design a study.

Single-question DVs. Psychological measurements are noisy. If you have high noise, you will have low signal to detect the effect of even a strong manipulation. One way to reduce noise is to measure many times and combine those measurements. Papers that fail to take advantage of this strategy dramatically reduce their ability to find effects of their manipulation. Yet it is striking how many of the findings we look at nevertheless have a single "key question" that is supposed to detect their manipulation. From an item response theory perspective, even if you found the perfect item (optimal discrimination) for a particular population, that item is still likely to be suboptimal and yield under-informative estimates about other populations. This means that your design is unlikely to be replicable in a different context, just because your item isn't designed to measure people in that context.
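The noise-reduction argument is easy to see in a toy simulation (all numbers below are illustrative assumptions, not estimates from real data): averaging k noisy items shrinks the standard error of the measurement by roughly √k, so a ten-item scale gives a much more stable estimate than a single key question.

```python
import random
import statistics

random.seed(1)

def observed_effect(n_items, true_effect=0.3, noise_sd=1.0):
    """Average of n_items noisy measurements of the same construct."""
    return statistics.mean(true_effect + random.gauss(0, noise_sd)
                           for _ in range(n_items))

# Spread of the estimate across 2000 simulated participants
one_item = statistics.stdev(observed_effect(1) for _ in range(2000))
ten_items = statistics.stdev(observed_effect(10) for _ in range(2000))
print(one_item, ten_items)  # the ten-item estimate is roughly 1/sqrt(10) as noisy
```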

Single-item manipulations. The counterpart to single question DVs is single-item manipulations, e.g. instantiations of a particular theoretical contrast in a particular experimental vignette or stimulus. Even if an effect induced via a particular item is replicable, it is likely not easily generalizable to a larger population of experimental items (as has been noted since Clark, 1972). But in addition, if you have only a single stimulus of interest, the chance of variation in response to this stimulus – due to sample differences including demographic variation or overall cohort change – is very high; this is exactly the same point as is made above about the DV, now made about the IV. Further, there is a substantial threat to internal validity if this stimulus is used by any other psychologists (as frequently happens with popular tasks - e.g., the prisoner's dilemma).

Between-subjects designs. Variation between people is a huge source of the total variation in psychological measurements. By subtracting out this variance, within-subjects designs dramatically decrease the variance in the measurement of some manipulation. As a result, between-subjects effects tend to replicate less (unless their original samples were really huge). This effect shows up in the original OSC 2015 replication sample, and it also shows up in our previous class sample. In our 16 project sample so far this year, the replication rate for the between-subjects experiments was .21 (2.5 successes out of 12) vs. .625 for the within-subjects experiments (2.5 successes out of 4).
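A minimal simulation of this point, with made-up variance components: when person-to-person variation dwarfs trial noise, a within-subjects difference score cancels each person's baseline and yields a far less noisy estimate of the same effect.

```python
import random
import statistics

random.seed(2)

EFFECT, N = 0.4, 50  # true effect and per-condition sample size (assumed)

def person():
    return random.gauss(0, 2.0)  # stable individual baseline; large person-to-person SD

def measure(baseline, effect):
    return baseline + effect + random.gauss(0, 0.5)  # small trial-level noise

def between_estimate():
    # Different people in each condition: baselines do not cancel
    treat = [measure(person(), EFFECT) for _ in range(N)]
    ctrl = [measure(person(), 0.0) for _ in range(N)]
    return statistics.mean(treat) - statistics.mean(ctrl)

def within_estimate():
    # Same person in both conditions: baseline subtracts out
    diffs = []
    for _ in range(N):
        b = person()
        diffs.append(measure(b, EFFECT) - measure(b, 0.0))
    return statistics.mean(diffs)

between_sd = statistics.stdev(between_estimate() for _ in range(500))
within_sd = statistics.stdev(within_estimate() for _ in range(500))
print(between_sd, within_sd)  # the between-subjects estimate is far noisier
```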

Short state induction manipulations. It's hard to change people's state in a significant way during a very short experiment, at least given the tools available to ethical psychologists. If you want to make someone feel powerful, or greedy, or afraid, or anxious, there's only so much you can do by showing them images on a computer screen, making them read words, or making them reflect on their experience by writing a short paragraph. And if you make even a moderate change to someone's state, they are extremely likely to reflect on this experience in the context of the experiment in some very substantial ways (see Task demands, below). It's hard, but probably not impossible, to do these kinds of manipulations right; there are likely manipulations of this type that can and do work.** But think about the counterfactual world where experimenters really could push people's feelings around quite flexibly and easily – we'd be constantly bent to our environment, pushed one way or the other by the precise stimuli we came into contact with, with the attendant policy implications (Hal Pashler and Andrew Gelman have both made this point previously in several different ways).

Task demands. When I was an undergraduate, my girlfriend – now my wife – and I used to walk over to the business school and do experimental studies for fun (they paid better than psychology). After we were done, we'd walk out and compare notes on what the point of each study was, as well as what condition we thought we were in. MTurk workers are just the same – probably better because many of them have done more studies. Participants will be thinking about what your study is about, and reacting based on some complex combination of that guess (correct or not) and their desired self-presentation and feelings about that goal. It is remarkable how many studies do not consider this issue. Two-stage studies like the one I described at the beginning are extremely vulnerable to this kind of reasoning: if your survey consists only of a state induction and a vignette, it is a guarantee that people will read the two together and then think about the connection. Hmm, I wonder what my feeling of powerlessness has to do with my reading about moral judgements? I wonder what reading a news article about the environment has to do with my judgements about future planning? This kind of design (especially without a good cover story) is a recipe to include participants' interpretive thinking in your pattern of results. Yet most of these paradigms do not even include strategies like a funnel debrief to detect such issues.***

No manipulation checks. Manipulation checks are tricky in state-induction experiments. Because they often directly refer to the construct of interest ("how powerful do you feel?") they can increase task demands and explicit reasoning. They also often are only single items themselves and aren't necessarily psychometrically valid measures of the precise construct of interest. That said: without a manipulation check, if your experiment fails in the type of design we're considering, there is typically no signal for understanding what went wrong. In classic perception, memory, and learning experiments there are usually correct answers, allowing the experimenter to think about whether participants understood the task and were at floor or at ceiling in their performance. In contrast, in judgement studies of the type I'm writing about here, there is not typically any calibration of the measurement. In many experiments without manipulation checks, there is no signal (beyond a difference on the key DV) that allows experimenters (or readers) to verify that the participants understood the materials and were affected by the manipulation.

A subtitle to this post could well be "revenge of the psychometricians." (They already attacked us once). Many of the problematic practices I see come down to poor measurement: single items for the DV measure, single items for the IV manipulation, lack of within-subjects design. All of these are places where experimenters can reduce measurement variation in easy ways. It is not that experiments like the one I've described here are impossible to do right, or that they never replicate. (ManyLabs 1 and ManyLabs 2 each include both replicable and non-replicable examples of such experiments.) It's that there are so many lost opportunities to do better.

* We probably don't have the power to detect small effects in the cases where the authors initially reported large ones, however.
** Some good ones likely take advantage of apparent task demands to cause deeper reasoning about the state induction. 
*** Surprisingly, I couldn't find a good description of this strategy online. In brief, ask successively more specific questions to try to elicit how much participants knew about the manipulation, e.g. "what did you think this experiment was about? what did you think about the other person in the experiment? did you notice anything odd about him? did you know he was a confederate?"

[Correction: w/in subjects designs decrease variance, thanks Yoel Sanchez-Araujo]
