I just read Dorothy Bishop's new post, Data analysis: Ten tips I wish I'd known sooner. I really like her clear, commonsense recommendations and agree with just about all of them. But in the last couple of years, I've become convinced that even for the less technically-minded student (let alone for the sophisticated researcher), the spirit of many of her tips can be implemented using open tools like git and R. As a side benefit, many of the recommendations are a natural part of the workflow in these tools, rather than requiring extra effort.
My goal in this post is to explain how this ecosystem works, and why it (more or less) does the right thing. I also want to show you why I think it's a pretty good tradeoff between learning time and value added. Of course, you can get even more reproducible, managing your project on the Open Science Framework (and using their git support), and use sweave and LaTeX to typeset exactly the same code you've written. These are all great things. But for many people, starting out with such a complex, interlocking set of tools can be quite daunting. I'd like to think the git+R setup that we use strikes a good balance.
Bishop's recommendations implicitly address several major failure modes for data analysis:
- Not being able to figure out what set of steps you did,
- Not being able to figure out what those steps actually accomplished, and
- Not being able to reconstruct the dataset you did them to.
Writing analyses in R as a keystone of reproducible analysis
Bishop's recommendations focus on the importance of keeping a log of analyses. This is the classic best-practices approach in bench science: keep a lab notebook! Although I don't think you can go wrong with this approach, it has a couple of negatives. First, it requires a lot of discipline. If you get excited and start doing analyses, you have to stop yourself and remember to document them fully. Second, keeping a paper lab notebook means going back and forth between computer and notebook all the time (and having to work on analyses only when you have a notebook with you). On the other hand, using an electronic notebook can mean you run into major formatting difficulties in including code, data, images, etc.These problems have been solved very nicely by iPython, an interactive notebook that allows the inclusion of data, code, images, and text in a single flexible format. I suspect that once this approach is truly mature and can be used across languages, interactive notebooks are what we all should be using. But I am not yet convinced that we should be writing python code to analyze our data yet – and I definitely don't think we should start students out this way. Python is a general-purpose language (and a much better one than R) but the idioms of data analysis are not yet as codified or as accessible in it, even though they are improving rapidly.
In the mean time, I think the easiest way for students to learn to do reproducible data analysis is to write well-commented R scripts. These scripts can simply be executed to produce the desired analyses. (There is of course scripting functionality in SPSS as well, but the combination of clicking and scripting can be devastating to reproducibility: the script gives the impression of reproducibility while potentially depending on some extra ad-hoc clicks that are not documented).
The reasons why I think R is a better data analysis language for students to learn than python are largely due to Hadley Wickham, who has done more than anyone else to rationalize R analysis. In particular, a good, easy-to-read analysis will typically only have a few steps: read in the data, aggregate the data across some units (often taking means across conditions and subjects), plot this aggregated data, and apply a statistical model to characterize patterns seen in the plots. In the R ecosystem, each of these can be executed in only one or at most a few lines of code.
Here's an example from a repository I've been working on with Ali Horowitz, a graduate student in my lab. This is an experiment on children's use of discourse information to learn the meanings of words. Children across ages choose which toy (side) they think a word refers to, in conditions with and without discourse continuity information. The key analysis script does most of its work in four chunks:
#### 1. read in data
d <- read.csv("data/all_data.csv")
#### 2. aggregate for each subject and then across subjects
mss <- aggregate(side ~ subid + agegroup + corr.side + condition,
data = d, mean)
ms <- aggregate(side ~ agegroup + corr.side + condition,
data = mss, mean)
#### 3. plot
qplot(agegroup, side, colour = corr.side,
facets = .~condition,
group = corr.side,
geom = "line",
data = ms)
#### 4. linear mixed-effects model
lm.all <- glmer(side ~ condition * corr.side * age +
(corr.side | subid),
data = kids, family = "binomial")
This is simplified somewhat – I've left out the confidence intervals and a few pieces of data cleaning – but the overall schema is one that reappears over and over again in my analyses. Because this idiom for expressing data analysis is so terse (but still so flexible), I find it extremely easy to debug. In addition, if the columns of your original datasheet are semantically transparent (e.g. agegroup, condition, etc.), your expressions are very easy to read and interpret. (R's factor data structure helps with this too, by keeping track of different categorical variables in your data). Overall, there is very little going on that is not stated in the key formula expressions in the calls to aggregate, qplot, and glmer; this in turn means that good naming practices make it easy to interpret the code in terms of the underlying design of the experiment you ran. It's much easier to debug this kind of code than your typical matlab data analysis script, where rows and columns are often referred to numerically (e.g. plot(d(:,2), d(:,3)) rather than qplot(condition, correct, data=d)).
This is simplified somewhat – I've left out the confidence intervals and a few pieces of data cleaning – but the overall schema is one that reappears over and over again in my analyses. Because this idiom for expressing data analysis is so terse (but still so flexible), I find it extremely easy to debug. In addition, if the columns of your original datasheet are semantically transparent (e.g. agegroup, condition, etc.), your expressions are very easy to read and interpret. (R's factor data structure helps with this too, by keeping track of different categorical variables in your data). Overall, there is very little going on that is not stated in the key formula expressions in the calls to aggregate, qplot, and glmer; this in turn means that good naming practices make it easy to interpret the code in terms of the underlying design of the experiment you ran. It's much easier to debug this kind of code than your typical matlab data analysis script, where rows and columns are often referred to numerically (e.g. plot(d(:,2), d(:,3)) rather than qplot(condition, correct, data=d)).
Often the data you collect are not in the proper form to facilitate this kind of analysis workflow. In that case, my choice is to create another script, called something like "preprocessing.R" that uses tools like reshape2 to move from e.g. a mechanical turk output file to a tidy data format (a long-form tabular dataset). That way I have a two-step workflow, but I am saving both the original data and the intermediate datafile, and can easily check each by eye in a text editor or Excel for conversion/reformatting errors.
Overall, the key thing about using R for the full analysis is that – especially when the analysis is version controlled, as described below – you have a full record of the steps that you took to get to a particular conclusion. In addition, with the general workflow described above, the steps in the analysis are described in a semantically transparent way (modulo understanding the particular conventions of, say, ggplot2, which can take some time). Both of these dramatically improve reproducibility by making debugging, rerunning, and updating this codebase easier.
Archiving analyses on git
When I am ready to analyze the data from an experiment (or sometimes even before), I have started setting up a git repository on github.com. It took me a little while to get the hang of this practice, but now I am convinced that it is overall a huge time-saver. (A good tutorial is available here). The repository for an experimental project is initialized with the original datafile(s) that I collect, e.g. the eye-tracking records, behavioral testing software outputs, or logfiles, suitably anonymized. These source datafiles should remain unchanged throughout the lifetime of the analysis – confirmed by their git history.I work on a local copy of that repository and push updates back to it so that I always have the analysis backed up. (I've begun doing all my analysis in the clear on github, but for academic users you can get free private repositories if that makes you uncomfortable). This strategy helps me keep track of the original data files, intermediate processed and aggregated data, and the analysis code, all in one place. So at its most basic it's a very good backup.
But managing data analysis through git has a couple of other virtues, too:
- The primary benefits of version control. This is the obvious stuff for anyone who has worked with git or subversion before, but for me as a new user, this was amazing! Good committing practices – checking in versions of your code regularly – mean that you never have to have more than one version of a file. For example, if you're working on a file called "analysis.R," you don't have to have "analysis 4-21-14 doesn't quite work.R" and "analysis 4-22-14 final.R." Instead, "analysis.R" can reflect in its git history many different iterations that you can browse through whenever you want. You can even use branches or tags to keep track of multiple different conflicting approaches in the same file.
- Transparency within collaborations. Your collaborators can look at what you are doing while the analysis is ongoing, and they can even make changes and poke around without upsetting the applecart or creating totally incommensurable analysis drafts. This transparency can dramatically reduce sharing overhead and crosstalk between collaborators in a large project. It also means that it is way easier for authors to audit the analysis on a paper prior to submission – something that I think should probably be mandatory for complex analyses.
- Ease of sharing analyses during the publication and review process. When you're done – or even while analysis is still ongoing – you can share the repository with outsiders or link to it in your publications. Then, you can post updates to it if you have corrections or extensions, and new viewers will automatically see these rather than having to track you down. This means sharing your data and analysis is always as simple as sharing a link – no need to hunt down a lot of extra dependencies and clean things up after the fact (something that I suspect is a real reason why many data sharing requests go unanswered).
Conclusion
It takes an initial, upfront investment to master both git and R. Neither is as easy as using pivot tables in Excel. But the payoff is dramatic, both in terms of productivity and in terms of reproducibility. There are steps further that you can take if you are really committed to documenting every step of your work, but I think this is a reasonable starting point, even for honors students or beginning graduate students. For any project longer than a quick one-off, I am convinced that the investment is well worth while.Of course, I don't mean to imply that you can't do bad or irreproducible research using this ecosystem – it's very easy to do both. But I do believe that it nudges you towards substantially better practices than tools like Excel and SPSS. And sometimes a nudge in the right direction can go a long way towards promoting the desired behavior.