Every time the issue gets discussed on Twitter, I get a little bit rant-y; this post is my attempt to explain why. It's not because I fundamentally disagree with the argument: barplots do mask important distributional facts about datasets. But there's more we have to take into account.
Here's my basic argument:
Hey #barbarplots folks: I agree with you that plotting variability is important, but the world of data is big! /1— Michael C. Frank (@mcxfrank) August 10, 2016
Sometimes you need a summary stat because you have lots of observations, sometimes because there are many conditions to compare. /2— Michael C. Frank (@mcxfrank) August 10, 2016
Bars are overused when neither of those apply; they shouldn't always be the default. Lines and points are usually cleaner. /3— Michael C. Frank (@mcxfrank) August 10, 2016
ANOVA also overused and used inappropriately - would be silly to ban it. Same for bars. Move the defaults and #barwithcaution. /end— Michael C. Frank (@mcxfrank) August 10, 2016
When I originally posted that rant, I was in transit and didn't get a chance to illustrate my point, so there was a lot of back-and-forth about what good use cases for bars would be.* The basic one that comes to mind for me is analyzing datasets where there are many discrete independent variables (e.g., conditions, experiments) and not many observations per participant. This structure describes many experiments I've worked on.
I put together an example visualization, based on Experiment 4 of this paper. All code and data here, in the experiment 4 analysis script. Here's the plot we put in the paper:
I chose a barplot because there were a lot of planned age groups and conditions and it seemed like an easy way to represent that discrete structure in the data, along with summary means and 95% CIs. I like to visualize by-subject distributions (I was actually a bit fetishistic about it in my early papers), but the data I was plotting here had only four observations per child. As a result, simple jitter plots look crazy:
And box plots are useless.
Violin plots are useless too.
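To make the problem concrete, here is a minimal sketch in Python. It is not the analysis script from the paper: the cell names and counts are made-up toy values, and the 95% CI is a simple normal approximation, which may differ from the interval used in the paper. It shows that with four trials per child, each child's proportion correct can only take one of five values, so point, box, and violin plots have very little to display, while a mean and CI per cell are still easy to compute.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical toy data: number of correct trials (out of 4) for each child,
# keyed by (age group, condition). All names and values are invented.
TRIALS_PER_CHILD = 4
correct_counts = {
    ("age_3", "cond_A"): [1, 2, 2, 3, 1, 2],
    ("age_3", "cond_B"): [2, 3, 4, 3, 2, 4],
    ("age_4", "cond_A"): [2, 2, 3, 1, 2, 2],
    ("age_4", "cond_B"): [3, 4, 4, 3, 4, 4],
}


def summarize(counts, trials=TRIALS_PER_CHILD, z=1.96):
    """Return (mean, ci_low, ci_high, distinct_levels) for one cell.

    The interval is mean +/- z * standard error (normal approximation).
    """
    props = [c / trials for c in counts]      # by-child proportion correct
    m = mean(props)
    se = stdev(props) / sqrt(len(props))      # standard error of the mean
    return m, m - z * se, m + z * se, sorted(set(props))


summary = {cell: summarize(counts) for cell, counts in correct_counts.items()}

for (age, cond), (m, lo, hi, levels) in summary.items():
    print(f"{age} {cond}: mean={m:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}], "
          f"distinct by-child values={levels}")
```

The `levels` entry in each summary is the point: every by-child value falls on 0, 0.25, 0.5, 0.75, or 1, so a jitter plot of these data is mostly jitter.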
The best alternative I saw was this one, but it still looks too sparse to me:
After I posted these to Twitter along with the data, TJ Mahr rose to the challenge of doing better:
@annemscheel @mcxfrank hmmm didn't realize only 4 trials per cell. not much y variability to be revealed by points. pic.twitter.com/bLaRixZV4q— tj mahr (@tjmahr) November 4, 2016
I like this representation, and with some tweaking it might be a nice alternative to put in a paper like the one I wrote.
But here's my point: these visualizations are good for different things. The barplot is simple and easy to read, and it compresses well. (Heer & Bostock, 2010, make this point as well.) Consider what happens when we shrink the plotting space for these (using my version of TJ's so I can hold image size constant):
Or even tinier:
My sense is that the barplot holds up to compression much better, at least modulo the font size. In addition, I would never show the jitterdodge masterwork to a popular audience (or even really to a class). It's just got too much going on.
My broader point: banning particular data analyses or visualizations just doesn't seem like the right answer. Particular visualizations can be right for certain contexts, for certain audiences, and for certain data types. The world of data is broad. We can change the defaults, but we shouldn't ban something that has important uses.
* Everyone in the discussion agrees that bars are fine for visualizing single discrete values, e.g., the counts in a histogram.