Monday, May 26, 2014

Another replication of Schnall, Benton, & Harvey (2008)


Simone Schnall, in her recent blog post, notes that she has received many requests for materials and data to investigate her work on cleanliness priming. One of those requests came from Fiona Lee, a student in my replication-based graduate research methods course (info and syllabus here). For her course project, Fiona replicated Study 1 from Schnall, Benton, & Harvey (2008) using Amazon Mechanical Turk. We'd like to say at the outset that we really appreciate Simone Schnall's willingness to share her stimuli and her responsiveness to our messages.

After we realized the relevance of this replication attempt to the recent discussion, Fiona decided to make her data, report, and results public. Our results are described below, followed by some thoughts on the take-home messages in terms of both the science and the tone of the discussion.

Here is a link to Fiona's replication report (in the style of the Reproducibility Project Reports). Here are her (anonymized and cleaned-up) data and my analysis code. Here's her key figure:


In a sample of 96 adults (90 after planned exclusions), Fiona found no effect of priming condition (cleanliness vs. neutral) on participants' moral judgment severity in her planned analysis. In exploratory analyses, she found an unpredicted age effect on moral judgments, perhaps because the age spread of the Mechanical Turk population was greater than that of the original study. When age was broken into quartiles, she saw some support for the hypothesis that the youngest participants showed the predicted priming effect, suggesting that age might moderate the effect. Here's the key figure for the age analysis:



I have confirmed Fiona's general analyses and I agree with her conclusions, though I would qualify that I feel the statistical support for the age-moderation hypothesis is quite limited. The main effect of age on moral judgment is very reliable, but the interaction with priming condition is not.

Here are some further thoughts on the relevance of these data to the Johnson et al. study. First, the age issue doesn't apply in that case, since there was no age differentiation in the Johnson et al. sample. Second, in Fiona's replication, the moral judgments were relatively coherent with one another (alpha=.71), so we didn't see any problem with averaging them. Finally, a ceiling effect was one of the concerns raised about the failed replication by Johnson et al. In Fiona's dataset, after planned exclusions (failure to pass the attention check or correctly guessing the hypothesis of the study), the percentage of extreme responses was about 24% for the entire dataset and about 27% for the neutral condition, which was very close to the percentage of extreme responses in the neutral condition of the original study (28%). Some exploratory histograms are embedded in my analysis code, if others are interested in seeing another dataset that uses Schnall's stimuli.
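For readers who want to run the same two checks on their own data, here is a minimal sketch in Python. It is not the analysis code linked above. The function names, the toy ratings, and the scale maximum of 9 are all assumptions made for illustration, as is the definition of an "extreme" response as a rating at the top of the scale. The first function computes Cronbach's alpha across the judgment items; the second computes the share of ratings at the scale maximum.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-participant rating lists.

    Each inner list holds one rating per item, in the same item order.
    """
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # transpose: one tuple per item
    item_var_sum = sum(pvariance(item) for item in items)
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def extreme_share(responses, scale_max):
    """Fraction of individual ratings that sit at the top of the scale.

    Treating only the maximum rating as "extreme" is an assumption here.
    """
    flat = [rating for row in responses for rating in row]
    return sum(rating == scale_max for rating in flat) / len(flat)


# Hypothetical toy data: 4 participants x 3 judgment items.
toy = [
    [9, 8, 9],
    [5, 4, 6],
    [7, 7, 8],
    [2, 3, 2],
]

print(round(cronbach_alpha(toy), 2))
print(round(extreme_share(toy, scale_max=9), 2))
```

To compare conditions, the same `extreme_share` call can be applied separately to the rows from the neutral and cleanliness conditions.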

The fact that Fiona ran her study on Mechanical Turk is an important difference between her experiment and previous work. For someone interested in pursuing a related line of research, experiments like this one are trivially easy to run on Mechanical Turk, but there are real questions about whether priming research in particular can be replicated online. Although Turk replications of cognitive psychology tasks work exceedingly well, Turk could be a poor platform for social priming research specifically. Fiona also suggests that unscrambling tasks in particular may be different online, as participants type rather than write longhand. A very useful goal for future research would be to replicate other priming experiments on Turk to determine how well the platform works for research of this type.

So what's the upshot? These data provide further evidence that, to the extent cleanliness primes have an effect on moral judgment, there may be a number of moderators of this effect – age and/or online administration being possible candidates. Hence, our experience confirms that the finding in Experiment 1 of Schnall, Benton, & Harvey (2008) is not trivial to reproduce (though see two successes on PsychFileDrawer). Further research with larger samples, a number of administration methods, and – critically in my view – a wider range of well-normed judgment problems may be required.

Nevertheless, both Fiona and I feel that the tone of the discussion surrounding this issue has been far too negative, and we apologize for any issues of tone in our own contact or discussion of this issue. (I will be writing a separate post on the issue of tone and replication shortly.) Regardless of the eventual determination regarding cleanliness priming, we appreciate Schnall's willingness to engage with the community to understand the issue further.