Monday, June 30, 2014

Revisions to APA

Following a recent Twitter thread, I've been musing on the APA publication and citation formats. Here are a few modest suggestions for modifying the APA standards.
  • Place figures and captions in the text. Back when we used typewriters and early word processors, it was not possible for people to put the figures in text. That's no longer the case. Flipping back and forth between figures, captions, and text is cognitively challenging for reviewers and serves no current typesetting purpose.
  • Get rid of addresses for publishers. There is no clear format for these – do you put City, State, Country? Or City, State? Or City, Country? Does it depend on whether it's a US city or not? It's essentially impossible to find out what city you should put for most major publishers anyway, since they are all currently international conglomerates. 
  • Do away with issue numbers. The current idea is that you are supposed to know how the page numbers are assigned in a volume to decide whether to include the issue number. That is absolutely crazy. 
  • Require DOIs. This recommendation is currently not enforced in any systematic way because it's a bit of a pain. But requiring DOIs would deal with many of the minor, edge-case ambiguities caused by removing addresses and issue numbers, as well as making many bibliometric analyses far easier.
  • Come up with a better standard for conference proceedings. For those of us who cite papers at Cognitive Science or Computer Science conferences like CogSci, NIPS, or ACL, it can be a total pain to invent or track down volume, page, editors, etc. for digital proceedings that don't really have this information.
---
Update in response to questions: I have had manuscripts returned without review for failing to put figures and captions at the end of the manuscript, from two APA journals (JEP:G and JEP:LMC); other journals, by contrast, explicitly ask you to put figures inline.

Taxonomies for teaching and learning

(This post is joint with Pat Shafto and Noah Goodman; Pat is the first author and you can also find it on his blog).

Earlier this year, we read an article in the journal Behavioral and Brain Sciences (BBS) by Kline, entitled How to learn about teaching: An evolutionary framework for the study of teaching behavior in humans and other animals. The article starts from the premise that there are major debates about what constitutes teaching behavior, both in human communities and in ethological analyses of non-human behavior. Kline then proposes a functionalist definition of teaching – teaching is "behavior that evolved to facilitate learning in others" – and outlines a taxonomy of teaching behaviors that includes:
  • Teaching by social tolerance, where you let someone watch you do something;
  • Teaching by opportunity provisioning, where you create opportunities for someone to try something;
  • Teaching by stimulus or local enhancement, where you highlight the relevant aspects of a problem for a learner;
  • Teaching by evaluative feedback, which is what it sounds like; and
  • Direct active teaching, which is teaching someone verbally or by demonstration.
We were interested in this taxonomy because it intersects with work we've done on understanding teaching in the context of the inferences that learners make. In addition, Kline didn't seem to have considered things from the perspective of the learner, which is what we thought made the most sense in our work – since adaptive benefits of teaching typically accrue to the learner, not the teacher. We wrote a proposal to comment on the Kline piece (BBS solicits such proposals), but our commentary was rejected. So we're posting it here.

In brief, we argue that evolutionary benefits of teaching are driven by the benefits to learners. Thus, an evolutionary taxonomy should derive from the inferential affordances that teaching allows for learners: what aspects of the input they can learn from, what they can learn, and hence what the consequences are for their overall fitness. In our work, we have outlined a taxonomy of social learning that distinguishes three levels of learning based on the inferences that can be made in different teaching situations:
  • Learning from physical evidence, where the learner cannot make any inference stronger than that a particular action causes a particular result;
  • Learning from rational action, where the learner can make the inference that a particular action is the best one to take in order to obtain a particular result, modulo constraints on the actor's action; and
  • Learning from communicative action, where the learner can infer that a teacher chose this example because it is the best or maximally useful example for them to learn from.
Critically, our work provides a formal account of these inferences such that their assumptions and predictions are explicit (Shafto, Goodman, & Frank, 2012). Our taxonomy distinguishes between cases where an individual chooses information intended to facilitate another's learning (social goal-directed action in our terminology, or "direct active teaching" in Kline's taxonomy) and cases where an individual engages in merely goal-directed behavior and is observed by a learner ("teaching by social tolerance" in Kline's taxonomy). We've found that this framework lines up nicely with a number of results on teaching and learning in developmental psychology, including "rational imitation" findings and Pat's work on the "double-edged sword of pedagogy," as well as other empirical data (Bonawitz et al., 2011; Goodman et al., 2009; Shafto et al., 2014).

The remaining distinctions proposed by Kline – teaching by opportunity provisioning, teaching by stimulus enhancement, and teaching by evaluative feedback – also fit neatly into our framework for social learning. Opportunity provisioning is a case of non-social learning, where the possibilities have been reduced to facilitate learning. Stimulus enhancement is a form of social goal-directed learning where the informant chooses information to facilitate learning (much as in direct active teaching). Teaching by evaluative feedback is a classic form of non-social learning known as supervised learning.

On our account, these distinctions correspond to qualitatively different inferential opportunities for the learner. As such, the different cases have different evolutionary implications – through qualitative differences in knowledge propagation within groups over time. If the goal is to understand teaching as an adaptation, we argue that it is critical to analyze social learning situations in terms of the differential learning opportunities they provide. Any taxonomy of teaching or social learning must distinguish among these possibilities by focusing on the inferential implications for learners, not just on the circumstances under which teaching occurs.

References

Bonawitz, E. B., Shafto, P., Gweon, H., Goodman, N. D., Spelke, E. & Schulz, L. (2011). The double-edged sword of pedagogy: Instruction affects spontaneous exploration and discovery. Cognition, 120, 322-330.

Goodman, N. D., Baker, C. L., & Tenenbaum, J. B. (2009). Cause and intent: Social reasoning in causal learning. Proceedings of the Thirty-First Annual Conference of the Cognitive Science Society.

Shafto, P., Goodman, N. D., & Griffiths, T. L. (2014). Rational reasoning in pedagogical contexts. Cognitive Psychology.

Shafto, P., Goodman, N. D. & Frank, M. C. (2012). Learning from others: The consequences of psychological reasoning for human learning. Perspectives on Psychological Science, 7, 341-351.


Kline, M. (2014). How to learn about teaching: An evolutionary framework for the study of teaching behavior in humans and other animals. Behavioral and Brain Sciences, 1-70. DOI: 10.1017/S0140525X14000090

First word?

(The somewhat cubist Brown Bear that may or may not have been the referent of M's first word.) 

M appears to have had a first word. As someone who studies word learning, I suppose I should have been prepared. But as excited as I was, I still found the whole thing very surprising. Let me explain.

M has been babbling ba, da, and ma for quite a while now. But at about 10.5 months, she started producing a very characteristic sequence: "BAba." This sequence showed up in the presence of "Brown Bear, Brown Bear, What Do You See?", a board book illustrated by Eric Carle that she loves and reads often both at daycare and at home.

I was initially skeptical that this form was really a word, but three things convinced me:

  1. Consistency of form. The intonation was descending, the stress was on the first syllable, and there was a hint of rounding ("B(r)Ab(r)a"). It felt very word-y.
  2. Consistency of context. We heard this again and again when we would bring the book to her. 
  3. Low frequency in other contexts. We pretty much only heard it when "Brown Bear" was present, with the exception of one or two potential false alarms when another book was present.
Even sillier, M stopped using it around 3 weeks later. Now we think she's got "mama" and "dada" roughly associated with us, but we haven't heard "BAba" in a while, even with repeated prompting. 

This whole trajectory highlights a feature of development that I find fascinating: its non-linearity. M's growth – physically, motorically, and cognitively – proceeds in fits and starts, rather than continuously. We see new developments, and then a period of consolidation. We may even see what appears to be regression.

It's easy to read about non-linearities in development. But observing one myself made me think again about the importance of microgenetic studies, where you sample a single child's development in depth across a particular transition point for the behavior you're interested in. As readers of the developmental literature, we forget this kind of thing sometimes; as parents, we are the original microgeneticists. 



Monday, June 2, 2014

Shifting our cultural understanding of replication

tl;dr - I agree that replication is important – very important! But the way to encourage it as a practice is to change values, not to shame or bully.

Introduction. 

The publication of the new special issue on replications in Social Psychology has prompted a huge volume of internet discussion. For one example, see the blogpost by Simone Schnall, one of the original authors of a paper in the replication issue – much of the drama has played out in the comment thread (and now she has posted a second response). My favorite recent additions to this conversation have focused on how to move the field forward. For example, Betsy Levy Paluck has written a very nice piece on the issue that also happens to recapitulate several points about data sharing that I strongly believe in. Jim Coan also has a good post on the dangers of negativity.

All of this discussion has made me consider two questions: First, what is the appropriate attitude towards replication? And second, can we make systematic cultural changes in psychology to encourage replication without the kind of negative feelings that have accompanied the recent discussion? Here are my thoughts:
  1. The methodological points made by proponents of replication are correct. Something is broken.
  2. Replication is a major part of the answer; calls for direct replication may even understate its importance if we focus on cumulative, model-driven science.
  3. Replication must not depend on personal communications with authors. 
  4. Scolding, shaming, and "bullying" will not create the cultural shifts we want. Instead, I favor technical and social solutions. 
I'll expand on each of these below.

1. Something is broken in psychology right now.

The Social Psychology special issue and the Reproducibility Project (which I'm somewhat involved in) both suggest that there may be systematic issues in our methodological, statistical, and reporting standards. My own personal experiences confirm this generalization. I teach a course based on students conducting replications. A paper describing this approach is here, and my syllabus – along with lots of other replication education materials – is here.

In class last year, we conducted a series of replications of an ad-hoc set of findings that the students and I were interested in. Our reproducibility rate was shockingly low. I coded our findings on a scale from 0 to 1, with 1 denoting full replication (a reliable significance test on the main hypothesis of interest) and .5 denoting partial replication (a trend towards significance, or a replication of the general pattern but without a predicted interaction or with an unexpected moderator). We reproduced 8.5 / 19 results (50%), with a somewhat higher probability of replication for more "cognitive" effects (~75%, N=6) and a somewhat lower probability for more "social" effects (~30%, N=11). Alongside the obvious possible explanation for our failures – that some of the findings we tried to reproduce were spurious to begin with – there are many other very plausible explanations. Among others: We conducted our replications on the web, there may have been unknown moderators, we may have made methodological errors, and we could have been underpowered (though we tried for 80% power relative to the reported effects).*
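
For concreteness, here is a minimal sketch in R of how this kind of scoring can be tallied; the scores and effect-type labels below are invented for illustration rather than being the actual course data.

# invented example: one row per attempted replication, coded 0 / 0.5 / 1
scores <- data.frame(
  type  = c("cognitive", "cognitive", "cognitive", "social", "social", "social"),
  score = c(1, 1, 0.5, 0, 0.5, 0)
)

sum(scores$score) / nrow(scores)                   # overall replication rate
aggregate(score ~ type, data = scores, FUN = mean) # rate by effect type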

To state the obvious: Our numbers don't provide an estimate of the probability that these findings were incorrect. But they do estimate the probability that a tenacious graduate student could reproduce the finding effectively for a project, given the published record and an email – sometimes unanswered – to the original authors. Although our use of Mechanical Turk as a platform was for convenience, preliminary results from the RP suggest that my course's estimate of reproducibility isn't that far off base.

When "prep" – the probability of replication – is so low (the true prep, not the other one), we need to fix something. If we don't, we run a large risk that students who want to build on previous work will end up wasting tremendous amounts of time and resources trying to reproduce findings that – even if they are real – are nevertheless so heavily moderated or so finicky that they will not form a solid basis for new work.

2. Replication is more important than even the replicators emphasize.  

Much of the so-called "replication debate" has been about the whether, how, who, and when of doing direct replications of binary hypothesis tests. These hypothesis tests are used in paradigms from papers that support particular claims (e.g. cleanliness is related to moral judgment, or flags prime conservative attitudes). This NHST approach – even combined with a meta-analytic effect-size estimation approach, as in the Many Labs project – understates the importance of replication. That's because these effects typically aren't used as measurements supporting a quantitative theory.

Our goal as scientists (psychological, cognitive, or otherwise) should be to construct theories that make concrete, quantitative predictions. While verbal theories are useful up to a point, formal theories are a more reliable method for creating clear predictions; these formal theories are often – but don't have to be – instantiated in computational models. Some more discussion of this viewpoint, which I call "models as formal theories," here and here. If our theories are going to make quantitative predictions about the relationship between measurements, we need to be able to validate and calibrate our measurements. This validation and calibration is where replication is critical.

Validation. In the discussion to date (largely surrounding controversial findings in social psychology), it has been assumed that we should replicate simply to test the reliability of previous findings. But that's not why every student methods class performs the Stroop task. They are not checking to see that it still works. They are testing their own reliability – validating their measurements.

Similarly, when I first set up my eye-tracker, I set out to replicate the developmental speedup in word processing shown by Anne Fernald and her colleagues (reviewed here). I didn't register this replication, and I didn't run it by her in advance. I wasn't trying to prove her wrong; as with students doing Stroop as a class exercise, I was trying to validate my equipment and methods. I believed so strongly in Fernald's finding that I figured that if I failed to replicate it, then I was doing something wrong in my own methods. Replication isn't always adversarial. This kind of bread and butter replication is – or should be – much more common.

Calibration. If we want to make quantitative predictions about the performance of a new group of participants in tasks derived from previous work, we need to calibrate our measurements to those of other scientists. Consistent and reliable effects may nevertheless be scaled differently due to differences in participant populations. For example, studies of reading time among college undergraduates at selective institutions may end up finding overall faster reading than studies conducted among a sample with a broader educational background.

As one line of my work, I've studied artificial language learning in adults as a case study of language learning mechanisms that could have implications for the study of language learning in children. I've tried to provide computational models of these sorts of learning phenomena (e.g. here, here, and here). Fitting these models to data has been a big challenge because typical experiments only have a few data points – and given the overall scaling differences in learning described above, a model needs to have 1-2 extra parameters (minimally an intercept but possibly also a slope) to integrate across experiment sets from different labs and populations.
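
To make the calibration idea concrete, here is a rough sketch – not the actual models from those papers – of how population-specific intercepts and slopes can be estimated when relating a model's predictions to observed learning. All variable names and values below are hypothetical.

# hypothetical data: a model's predicted learning score per condition ("pred"),
# the observed mean accuracy, and the population each mean came from
d <- data.frame(
  population = rep(c("turk", "lab"), each = 4),
  pred       = rep(c(0.2, 0.4, 0.6, 0.8), 2),
  accuracy   = c(0.55, 0.62, 0.70, 0.78, 0.60, 0.72, 0.83, 0.95)
)

# a population-specific intercept and slope absorb overall scaling differences,
# while the shared predictor carries the condition-level structure
calib <- lm(accuracy ~ pred * population, data = d)
summary(calib)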

As a result, I ended up doing a lot of replication studies of artificial language experiments so that I could vary parameters of interest and get quantitatively-comparable measures. I believed all of the original findings would replicate – and indeed they did, often precisely as specified. If you are at all curious about this literature, I replicated (all with adults): Saffran et al. (1996a; 1996b); Aslin et al. (1998); Marcus et al. (1999); Gomez (2002); Endress, Scholl, & Mehler (2005); and Yu & Smith (2007). All of these were highly robust. In addition, in all cases where there were previous adult data, I found differences in the absolute level of learning from the prior report (as you might expect, considering I was comparing participants on Mechanical Turk or at MIT with whatever population the original researchers had used). I wasn't surprised or worried by these differences. Instead, I just wanted to get calibrated – find out what the baseline measurements were for my particular participant population.

In other words, even – or maybe even especially – when you assume that the binary claims of a paper are correct, replication plays a role by helping to validate empirical measurements and calibrate those measurements against prior data.

3. Replications can't depend on contacting the original authors for details. 

As Andrew Wilson argues in his nice post on the topic, we need to have the kind of standards that allow reproducibility – as best as we can – without direct contact with the original authors. Of course, no researcher will always know perfectly what factors matter to their finding, especially in complex social scenarios. But how are we ever supposed to get anything done if we can't just read the scientific literature and come up with new hypotheses and test them? Should we have to contact the authors for every finding we're interested in, to find out whether the authors knew about important environmental moderators that they didn't report? In a world where replication is commonplace and unexceptional – where it is the typical starting point for new work rather than an act of unprovoked aggression – the extra load caused by these constant requests would be overwhelming, especially for authors with popular paradigms.

There's a different solution. Authors could make all of their materials (at least the digital ones) directly and publicly accessible as part of publication. Psycholinguists have been attaching their experimental stimulus items as an appendix to their papers for years – no reason not to do this more ubiquitously. For most studies, posting code and materials will be enough. In fact, for most of my studies – which are now all run online – we can just link to the actual HTML/javascript paradigm so that interested parties can try it out. If researchers believe that their effects are due to very specific environmental factors, then they can go the extra mile to take photos or videos of the experimental circumstances. Sharing materials and data (whether using the Open Science Framework, github, or other tools) is free and takes almost no time. Used properly, these tools can even improve the reliability of your own work along with its reproducibility by others.

I don't mean to suggest that people shouldn't contact original authors, merely that they shouldn't be obliged to. Original authors are – by definition – experts in a phenomenon, and can be very helpful in revealing the importance of particular factors, providing links to followup literature both published and unpublished, and generally putting work in context. But a requirement to contact authors prior to performing a replication emphasizes two negative values: the possibility for perceived aggressiveness in the act of replication, and the incompleteness of methods reporting. I strongly advocate for the reverse default. We should be humbled and flattered when others build on our work by assuming that it is a strong foundation, and they should assume our report is complete and correct. Neither of these assumptions will always be true, but good defaults breed good cultures.

4. The way to shift to a culture of replication is not by shaming the authors of papers that don't replicate. 

No one likes it when people are mean to one another. There's been some considerable discussion of tone on the SPSP blog and on twitter, and I think this is largely to the good. It's important to be professional in our discussion or else we alienate many within the field and hurt our reputation outside it. But there's a larger reason why shaming and bullying shouldn't be our MO: they won't bring about the cultural changes we need. For that we need two ingredients. First, technical tools that decrease the barriers to replication; and second, role models who do cool research that moves the field forward by focusing on solid measurement and quantitative detail, not flashy demonstrations. 

Technical tools. One of the things I have liked about the – otherwise somewhat acrimonious – discussion of Schnall et al.'s work is the use of the web to post data, discuss alternative theories, and iterate in real time on an important issue (three links here, here, and here, with a meta-analysis here). If nothing else comes of this debate, I hope it convinces its participants that posting data for reanalysis is a good thing.

More generally, my feeling is that there is a technical (and occasionally generational) gap at work in some of this discussion. On the data side, there is a sense that if all we do are t-tests on two sets of measurements from 12 people, then no big deal, no one needs to see your analysis. But datasets and analyses are getting larger and more sophisticated. People who code a lot accept that everyone makes errors. In order to fight error, we need to have open review of code and analysis. We also need to have reuse of code across projects. If we publish models or novel analyses, we need to give people the tools to reproduce them. We need sharing and collaborating, open-source style – enabled by tools like github and OSF. Accepting these ideas about data and analyses means that replication on the data side should be trivial: a matter of downloading and rerunning a script. 
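
As a sketch of what that looks like in practice – the repository URL and file names below are placeholders, not a real project – rerunning an analysis can be as simple as:

# fetch the posted data and analysis script, then rerun the analysis;
# the URLs below are hypothetical placeholders
dir.create("data", showWarnings = FALSE)
base <- "https://raw.githubusercontent.com/example-lab/example-study/master/"
download.file(paste0(base, "data/all_data.csv"), destfile = "data/all_data.csv")
download.file(paste0(base, "analysis.R"), destfile = "analysis.R")
source("analysis.R")  # reproduce the reported analyses end to end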

On the experimental side, reproducibility should be facilitated by a combination of web-based experiments and code-sharing. There will always be costly and slow methods – think fMRI or habituating infants – but standard-issue social and cognitive psychology is relatively cheap and fast. With the advent of Mechanical Turk and other online study venues, often an experiment is just a web page, perhaps served by a tool like PsiTurk. And I think it should go without saying: if your experiment is a webpage, then I would like to see the webpage if I am reading your paper. That way, if I want to reproduce your findings, I should be able to make a good start by simply directing people – online or in the lab – to look at your webpage, measuring their responses, and rerunning your analysis code. Under this model, if I have $50 and a bit of curiosity about your findings, I can run a replication. No big deal.**

We aren't there yet. And perhaps we will never be for more involved social psychological interventions (though see PERTS, the Project for Education Research that Scales, for a great example of such interventions in a web context). But we are getting closer and closer. The more open we are with experimental materials, code, and data, the easier replication and reanalysis will be, the less replication will look like a last-resort, adversarial move, and the more collecting new data will become part of a general ecosystem of scientific sharing and reuse.

Role models. These tools will only catch on if people think they are cool. For example, Betsy Levy Paluck's research on high-schoolers suggests something that we probably all know intuitively. We all want to be like the cool people, so the best way to change a culture is by having the cool kids endorse your value of choice. In other words, students and early-career psychologists will flock to new approaches if they see awesome science that's enabled by these methods. I think of this as a new kind of bling: Instead of being wowed by the counterintuitiveness or unlikeliness of a study's conclusions, can we instead praise how thoroughly it nailed the question? Or the breadth and scope of its approach?

Conclusions. 

For what it's worth, some of the rush to publish high-profile tests of surprising hypotheses has to be due to pressures related to hiring and promotion in psychology. Here I'll again follow Betsy Levy Paluck and Brian Nosek in reiterating that, in the search committees I've sat on, the discussion over and over turns to how thorough, careful, and deep a candidate's work is – not how many publications they have. Our students have occasionally been shocked to see that a candidate with a huge, stellar CV doesn't get a job offer, and have asked, "What more does someone need to do in order to get hired?" My answer (and of course this is only my personal opinion, not the opinion of anyone else in the department): Engage deeply with an interesting question and do work on that question that furthers the field by being precise, thorough, and clearly thought out. People who do this may pay a penalty in terms of CV length – but they are often the ones who get the job in the end.

I've argued here that something really is broken in psychology. It's not just that some of our findings don't (easily) replicate, it's also that we don't think of replication as core to the enterprise of making reliable and valid measurements to support quantitative theorizing. In order to move away from this problematic situation, we are going to need technical tools to support easier replication, reproduction of analyses, and sharing more generally. We will also need the role models to make it cool to follow these new scientific standards.


---
Thanks very much to Noah Goodman and Lera Boroditsky for quick and helpful feedback on a previous draft. (Minor typos fixed 6/2 afternoon).

* I recently posted about one of the projects from that course, an unsuccessful attempt to replicate Schnall, Benton, & Harvey (2008)'s cleanliness priming effect. As I wrote in that posting, there are many reasons why we might not have reproduced the original finding – including differences between online and lab administration. Simone Schnall wrote in her response that "If somebody had asked me whether it makes sense to induce cleanliness in an online study, I would have said 'no,' and they could have saved themselves some time and money." It's entirely possible that cleanliness priming specifically is hard to achieve online. That would surprise me given the large number of successes in online experimentation more broadly (including the IAT and many subtle cognitive phenomena, among other things – I also just posted data from a different social priming finding that does replicate nicely online). In addition, while the effectiveness of a prime should depend on the baseline availability of the concept being primed, I don't see why this extra noise would completely eliminate Schnall et al.'s effect in the two large online replications that have been conducted so far.

** There's been a lot of talk about why web-based experiments are "just as good" as in-lab experiments. In this sense, they're actually quite a bit better! The fact that a webpage is so easily shown to others around the world as an exemplar of your paradigm means that you can communicate and replicate much more easily.

Monday, May 26, 2014

Another replication of Schnall, Benton, & Harvey (2008)


Simone Schnall, in her recent blogpost, notes that she has received many requests for materials and data to investigate her work on cleanliness priming. One of those requests came from Fiona Lee, a student in my replication-based graduate research methods course (info and syllabus here). Fiona did a project conducting a replication of Study 1 from Schnall, Benton, & Harvey (2008) using Amazon Mechanical Turk. We'd like to say at the outset that we really appreciate Simone Schnall's willingness to share her stimuli and her responsiveness to our messages.

After we realized the relevance of this replication attempt to the recent discussion, Fiona decided to make her data, report, and results public. Our results are described below, followed by some thoughts on the take-home messages in terms of both the science and the tone of the discussion.

Here is a link to Fiona's replication report (in the style of the Reproducibility Project Reports). Here are her (anonymized and cleaned-up) data and my analysis code. Here's her key figure:


In a sample of 96 adults (90 after planned exclusions), Fiona found no effect of priming condition (cleanliness vs. neutral) on participants' moral judgment severity in her planned analysis. In exploratory analyses, she found an unpredicted age effect on moral judgments, perhaps due to the fact that the age spread of the Mechanical Turk population was greater than that of the original study. When age was broken into quartiles, she saw some support for the hypothesis that the youngest participants showed the predicted priming effect, suggesting that age might have been a moderator of the effect. Here's the key figure for the age analysis:



I have confirmed Fiona's general analyses and I agree with her conclusions, though I would qualify that I feel the statistical support for the age-moderation hypothesis is quite limited. The main effect of age on moral judgment is very reliable, but the interaction with priming condition is not.
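
For readers who just want the shape of these analyses without opening the full report, here is a rough sketch in R. The simulated data frame stands in for the real dataset (one row per participant, with mean moral judgment severity, priming condition, and age); none of these numbers are Fiona's actual data.

set.seed(1)
d <- data.frame(
  judgment  = runif(90, 0, 9),                              # simulated severity ratings
  condition = factor(rep(c("clean", "neutral"), each = 45)),
  age       = sample(18:65, 90, replace = TRUE)
)

# planned analysis: effect of priming condition on judgment severity
t.test(judgment ~ condition, data = d)

# exploratory analysis: age and its interaction with condition
summary(lm(judgment ~ condition * age, data = d))

# age quartiles, as in the exploratory figure
d$age.quartile <- cut(d$age, quantile(d$age), include.lowest = TRUE)
aggregate(judgment ~ condition + age.quartile, data = d, mean)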

Some further thoughts on the relevance of these data to the Johnson et al. study. First, the age issue doesn't apply in that case, since there was no age differentiation in the Johnson et al. sample. Second, in Fiona's replication, the moral judgments were relatively coherent with one another (alpha=.71), so we didn't see any problem with averaging them. Finally, a ceiling effect was one of the concerns raised regarding Johnson et al.'s failed replication. In Fiona's dataset, after planned exclusions (failure to pass the attention check or correctly guessing the hypothesis of the study), the percentage of extreme responses was about 24% for the entire dataset, and about 27% for the neutral condition, which was very close to the percentage of extreme responses in the neutral condition of the original study (28%). Some exploratory histograms are embedded in my analysis code, if others are interested in seeing another dataset that uses Schnall's stimuli.

The fact that Fiona ran her study on Mechanical Turk is an important difference between her experiment and previous work. For someone interested in pursuing a related line of research, experiments like this one are trivially easy to run on Mechanical Turk, but there are real questions about whether priming research in particular can be replicated online. Although Turk replications of cognitive psychology tasks work exceedingly well, Turk could be a poor platform for social priming research specifically. Fiona suggests also that unscrambling tasks in particular may be different online, as participants type rather than writing longhand. A very useful goal for future research would be the replication of other priming experiments on Turk to determine the efficacy of the platform for research of this type.

So what's the upshot? These data provide further evidence that, to the extent cleanliness primes have an effect on moral judgment, there may be a number of moderators of this effect – age and/or online administration being possible candidates. Hence, our experience confirms that the finding in Experiment 1 of Schnall, Benton, & Harvey (2008) is not trivial to reproduce (though see two successes on PsychFiledrawer). Further research with larger samples, a number of administration methods, and – critically in my view – a wider range of well-normed judgment problems may be required.

Nevertheless, both Fiona and I feel that the tone of the discussion surrounding this issue has been far too negative, and we apologize for any lapses of tone in our own correspondence or commentary. (I will be writing a separate post on the issue of tone and replication shortly.) Regardless of the eventual determination regarding cleanliness priming, we appreciate Schnall's willingness to engage with the community to understand the issue further.

Saturday, April 26, 2014

Data analysis, one step deeper

tl;dr: Using git+R is good for reproducible research. If you already knew that, then you won't learn a lot here.

I just read Dorothy Bishop's new post, Data analysis: Ten tips I wish I'd known sooner. I really like her clear, commonsense recommendations and agree with just about all of them. But in the last couple of years, I've become convinced that even for the less technically-minded student (let alone for the sophisticated researcher), the spirit of many of her tips can be implemented using open tools like git and R. As a side benefit, many of the recommendations are a natural part of the workflow in these tools, rather than requiring extra effort.

My goal in this post is to explain how this ecosystem works, and why it (more or less) does the right thing. I also want to show you why I think it's a pretty good tradeoff between learning time and value added. Of course, you can go even further toward reproducibility by managing your project on the Open Science Framework (and using their git support) and by using Sweave and LaTeX to typeset your manuscript from exactly the code you've written. These are all great things. But for many people, starting out with such a complex, interlocking set of tools can be quite daunting. I'd like to think the git+R setup that we use strikes a good balance.

Bishop's recommendations implicitly address several major failure modes for data analysis:
  1. Not being able to figure out what set of steps you did, 
  2. Not being able to figure out what those steps actually accomplished, and
  3. Not being able to reconstruct the dataset you did them to.
These are problems in the reproducibility of your analysis, and as such, pose major risks to basic science you're trying to do. The recommendations that address these issues are very sensible: keep track of what you did (recs 8 and 9), label and catalogue your variables in a semantically transparent way (recs 2 and 4), archive and back up your data (recs 5 and 6). Here's how I accomplish this in git+R.

Writing analyses in R as a keystone of reproducible analysis

Bishop's recommendations focus on the importance of keeping a log of analyses. This is the classic best-practices approach in bench science: keep a lab notebook! Although I don't think you can go wrong with this approach, it has a couple of negatives. First, it requires a lot of discipline. If you get excited and start doing analyses, you have to stop yourself and remember to document them fully. Second, keeping a paper lab notebook means going back and forth between computer and notebook all the time (and having to work on analyses only when you have a notebook with you). On the other hand, using an electronic notebook can mean you run into major formatting difficulties in including code, data, images, etc.

These problems have been solved very nicely by IPython, an interactive notebook that allows the inclusion of data, code, images, and text in a single flexible format. I suspect that once this approach is truly mature and can be used across languages, interactive notebooks are what we all should be using. But I am not yet convinced that we should be writing Python code to analyze our data – and I definitely don't think we should start students out this way. Python is a general-purpose language (and a much better one than R), but the idioms of data analysis are not yet as codified or as accessible in it, even though they are improving rapidly.

In the mean time, I think the easiest way for students to learn to do reproducible data analysis is to write well-commented R scripts. These scripts can simply be executed to produce the desired analyses. (There is of course scripting functionality in SPSS as well, but the combination of clicking and scripting can be devastating to reproducibility: the script gives the impression of reproducibility while potentially depending on some extra ad-hoc clicks that are not documented).

The reasons why I think R is a better data analysis language for students to learn than Python are largely due to Hadley Wickham, who has done more than anyone else to rationalize R analysis. In particular, a good, easy-to-read analysis will typically only have a few steps: read in the data, aggregate the data across some units (often taking means across conditions and subjects), plot this aggregated data, and apply a statistical model to characterize patterns seen in the plots. In the R ecosystem, each of these can be executed in only one or at most a few lines of code.

Here's an example from a repository I've been working on with Ali Horowitz, a graduate student in my lab. This is an experiment on children's use of discourse information to learn the meanings of words. Children across ages choose which toy (side) they think a word refers to, in conditions with and without discourse continuity information. The key analysis script does most of its work in four chunks:

library(ggplot2)  # for qplot
library(lme4)     # for glmer

#### 1. read in data
d <- read.csv("data/all_data.csv")

#### 2. aggregate for each subject and then across subjects
mss <- aggregate(side ~ subid + agegroup + corr.side + condition, 
                 data = d, mean)
ms <- aggregate(side ~ agegroup + corr.side + condition, 
                data = mss, mean)

#### 3. plot
qplot(agegroup, side, colour = corr.side, 
      facets = .~condition,  
      group = corr.side, 
      geom = "line", 
      data = ms)

#### 4. linear mixed-effects model
#### ("kids" comes from the data cleaning steps omitted here)
lm.all <- glmer(side ~ condition * corr.side * age + 
                (corr.side | subid), 
                data = kids, family = "binomial")

This is simplified somewhat – I've left out the confidence intervals and a few pieces of data cleaning – but the overall schema is one that reappears over and over again in my analyses. Because this idiom for expressing data analysis is so terse (but still so flexible), I find it extremely easy to debug. In addition, if the columns of your original datasheet are semantically transparent (e.g. agegroup, condition, etc.), your expressions are very easy to read and interpret. (R's factor data structure helps with this too, by keeping track of different categorical variables in your data). Overall, there is very little going on that is not stated in the key formula expressions in the calls to aggregate, qplot, and glmer; this in turn means that good naming practices make it easy to interpret the code in terms of the underlying design of the experiment you ran. It's much easier to debug this kind of code than your typical matlab data analysis script, where rows and columns are often referred to numerically (e.g. plot(d(:,2), d(:,3)) rather than qplot(condition, correct, data=d)). 

Often the data you collect are not in the proper form to facilitate this kind of analysis workflow. In that case, my choice is to create another script, called something like "preprocessing.R", that uses tools like reshape2 to move from, e.g., a Mechanical Turk output file to a tidy data format (a long-form tabular dataset). That way I have a two-step workflow, but I am saving both the original data and the intermediate datafile, and can easily check each by eye in a text editor or Excel for conversion/reformatting errors.
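
Here is a sketch of that kind of preprocessing step using reshape2; the column names (WorkerId, condition, trial1 through trial3) are invented stand-ins for a typical wide-format Turk output file.

library(reshape2)

# hypothetical wide file: one row per worker, one column per trial
raw <- read.csv("data/turk_output.csv")

# melt to long format: one row per worker x trial
tidy <- melt(raw,
             id.vars = c("WorkerId", "condition"),
             measure.vars = c("trial1", "trial2", "trial3"),
             variable.name = "trial",
             value.name = "response")

# save the intermediate long-form datafile for the analysis script
write.csv(tidy, "data/tidy_data.csv", row.names = FALSE)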

Overall, the key thing about using R for the full analysis is that – especially when the analysis is version controlled, as described below – you have a full record of the steps that you took to get to a particular conclusion. In addition, with the general workflow described above, the steps in the analysis are described in a semantically transparent way (modulo understanding the particular conventions of, say, ggplot2, which can take some time). Both of these dramatically improve reproducibility by making debugging, rerunning, and updating this codebase easier. 

Archiving analyses on git

When I am ready to analyze the data from an experiment (or sometimes even before), I have gotten into the habit of setting up a git repository on github.com. It took me a little while to get the hang of this practice, but now I am convinced that it is overall a huge time-saver. (A good tutorial is available here). The repository for an experimental project is initialized with the original datafile(s) that I collect, e.g. the eye-tracking records, behavioral testing software outputs, or logfiles, suitably anonymized. These source datafiles should remain unchanged throughout the lifetime of the analysis – confirmed by their git history.

I work on a local copy of that repository and push updates back to it so that I always have the analysis backed up. (I've begun doing all my analysis in the clear on github, but academic users can get free private repositories if that makes you uncomfortable.) This strategy helps me keep track of the original data files, intermediate processed and aggregated data, and the analysis code, all in one place. So at its most basic, it's a very good backup.

But managing data analysis through git has a couple of other virtues, too:
  • The primary benefits of version control. This is the obvious stuff for anyone who has worked with git or subversion before, but for me as a new user, this was amazing! Good committing practices – checking in versions of your code regularly – mean that you never have to have more than one version of a file. For example, if you're working on a file called "analysis.R," you don't have to have "analysis 4-21-14 doesn't quite work.R" and "analysis 4-22-14 final.R." Instead, "analysis.R" can reflect in its git history many different iterations that you can browse through whenever you want. You can even use branches or tags to keep track of multiple different conflicting approaches in the same file. 
  • Transparency within collaborations. Your collaborators can look at what you are doing while the analysis is ongoing, and they can even make changes and poke around without upsetting the applecart or creating totally incommensurable analysis drafts. This transparency can dramatically reduce sharing overhead and crosstalk between collaborators in a large project. It also means that it is way easier for authors to audit the analysis on a paper prior to submission – something that I think should probably be mandatory for complex analyses. 
  • Ease of sharing analyses during the publication and review process. When you're done – or even while analysis is still ongoing – you can share the repository with outsiders or link to it in your publications. Then, you can post updates to it if you have corrections or extensions, and new viewers will automatically see these rather than having to track you down. This means sharing your data and analysis is always as simple as sharing a link – no need to hunt down a lot of extra dependencies and clean things up after the fact (something that I suspect is a real reason why many data sharing requests go unanswered).
The open git analysis approach is not perfect for all datasets – the examples that come to mind are confidential data that cannot easily be anonymized (e.g. transcripts with lots of identifying information) and neuroimaging, where the data are too large to push back and forth to external repositories all the time. But for a wide range of projects, this can be a major win.

Conclusion

It takes an upfront investment to master both git and R. Neither is as easy as using pivot tables in Excel. But the payoff is dramatic, both in terms of productivity and in terms of reproducibility. There are further steps you can take if you are really committed to documenting every step of your work, but I think this is a reasonable starting point, even for honors students or beginning graduate students. For any project longer than a quick one-off, I am convinced that the investment is well worthwhile.

Of course, I don't mean to imply that you can't do bad or irreproducible research using this ecosystem – it's very easy to do both. But I do believe that it nudges you towards substantially better practices than tools like Excel and SPSS. And sometimes a nudge in the right direction can go a long way towards promoting the desired behavior.

Saturday, April 19, 2014

Slides for Stanford Autism Update

I'm giving a talk at the Stanford Autism Update today. My slides can be found here: http://bit.ly/1nzhM7V. This is my first time using figshare – certainly seems easy.