How often are there statistical reporting errors in published research? Using a new automated method for scraping APA-formatted statistics out of PDFs, Nuijten et al. (2015) found that over 10% of p-values were inconsistent with the reported details of the statistical test, and 1.6% were what they called "grossly" inconsistent, i.e., the discrepancy between the reported p-value and the one implied by the test statistic meant that one indicated statistical significance and the other did not (another summary here). Here are two key figures, first for the proportion of inconsistent results by article and then for the proportion of articles with at least one inconsistency:
These graphs are upsetting news. Around half of articles had at least one error by this analysis, which is not what you want from your scientific literature.* Daniel Lakens has a nice post suggesting that three errors account for many of the problems: incorrect use of < instead of =, use of one-sided tests without clear reporting as such, and errors in rounding and reporting.
Speaking for myself, I'm sure that some of my articles have errors of this type, almost certainly from copying and pasting results from an analysis window into a manuscript (say Matlab in the old days or R now).** The copy-paste thing is incredibly annoying. I hate this kind of slow, error-prone, non-automatable process.
So what are we supposed to do? Of course, we can and should check our numbers, and maybe run statcheck (the R package Nuijten et al. created) on our own work as well. But there is a much better technical solution: write the statistics into the manuscript as one executable package that automatically generates the figures, tables, and statistical results. In my opinion, doing this used to be almost as much of a pain as the cutting and pasting (and I say this as someone who writes academic papers in LaTeX!). But the tools for writing text and code together have now gotten so good that I think there's no excuse not to.
In particular, the integration of the knitr package with RStudio and RPubs means that it is essentially trivial to create a well-formatted document that includes text, code, and data inside it. I've posted a minimal working example to RPubs; you can see the source code here. Critically, this functionality allows you to do things like this:
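As a sketch of what that inline code looks like (the data frame `d` and the t-test object `t_out` here are hypothetical placeholders, not names from the posted example):

```markdown
The mean looking time was `r round(mean(d$looking_time), 2)` s, and the
difference between conditions was reliable
(*t*(`r t_out$parameter`) = `r round(t_out$statistic, 2)`,
*p* = `r format.pval(t_out$p.value, digits = 3)`).
```

When the document is knit, each inline `r` expression is replaced by its computed value, so the numbers in the text always come directly from the analysis,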
which eliminates the cut and paste step.*** And even more importantly, you can get out fully-formatted results tables:
You can even use bibtex for references (shown in the full example). Kyle MacDonald, Dan Yurovsky, and I recently wrote a paper together on the role of social cues in cross-situational word learning (the manuscript is under review at the moment). Kyle did the entire thing in RMarkdown using this workflow (repository here), and then did journal formatting using a knitr style that he bundled into his own R package.
The RStudio knitr integration makes it really easy to get started with this workflow (here's a good initial guide), and it's quite interactive to re-knit and see the output. Debugging can occasionally be a bit tricky, but you can easily switch back to the REPL to work through more complex code blocks. Perhaps the strongest evidence of how easy it is to work this way: more and more I've found myself turning to this workflow as the starting point of data analysis, rather than as a separate packaging step at the end of a project.
We often think of the principles of open science as being in tension with the researcher's own incentive to work quickly. This is a case where I think there is no tension at all: a better, easier, and faster workflow leads both to a lower risk of errors and to more transparency.
* There are some potential issues in the automated extraction procedure that Nuijten et al. used. In particular, it has a very inflexible schema for reporting: if authors included an effect size, formatted their statistical results in a single parenthetical, or used any other common formatting alternative, the package would not extract the statistic (in practice, it captures around 68% of tests). This kind of thing would be easy to improve on using modern machine-reading packages (e.g., I'm thinking of DeepDive's extractors). But they also report a validation study in the Appendix that looks pretty good, so I'm not hugely worried about this aspect of the work.
** Actually, I doubt the statcheck package that Nuijten et al. used would find many of my stats at all, though: at this point, I do relatively few t-tests, chi-squareds, or ANOVAs. Instead I prefer to use regression or other models to try and describe the set of quantitative trends across an entire dataset – more like the approach that Andrew Gelman has advocated for.
*** You can of course still make coding errors here. But that was true before. You just don't have to copy and paste the output of your error into a separate window.
Nuijten MB, Hartgerink CH, van Assen MA, Epskamp S, & Wicherts JM (2015). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. PMID: 26497820