When you draw a histogram, an important question is “how many bars should I draw?”. This should inspire an indignant response. You didn’t become a programmer to answer questions, did you? No. The whole point of programming is to let your computer do your thinking for you, giving you more time to watch videos of fluffy kittens.
Fortunately, R contains three functions to automate the answer, namely nclass.Sturges, nclass.scott and nclass.FD. (FD is short for Freedman-Diaconis; watch out for the fact that scott isn’t capitalised.)
The differences depend upon the length and spread of the data. For longer vectors, Scott and Freedman-Diaconis tend to give bigger answers.
short_normal <- rnorm(1e2)
nclass.Sturges(short_normal)  # 8
nclass.scott(short_normal)    # 8
nclass.FD(short_normal)       # 12

long_normal <- rnorm(1e5)
nclass.Sturges(long_normal)   # 18
nclass.scott(long_normal)     # 111
nclass.FD(long_normal)        # 144
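If you want to see where those numbers come from, here is a rough by-hand version of the three rules. The names sturges_by_hand, scott_by_hand and fd_by_hand are made up for this sketch, and recent versions of R round the data inside nclass.FD before taking the IQR, so treat this as an approximation of the built-ins rather than a copy of them.

```r
# Sturges only looks at the length of the vector.
sturges_by_hand <- function(x) {
  ceiling(log2(length(x)) + 1)
}

# Scott: bin width scales with the standard deviation and shrinks as n^(-1/3).
scott_by_hand <- function(x) {
  h <- 3.5 * sd(x) * length(x)^(-1/3)
  ceiling(diff(range(x)) / h)
}

# Freedman-Diaconis: same idea, but the spread is measured by the IQR.
fd_by_hand <- function(x) {
  h <- 2 * IQR(x) * length(x)^(-1/3)
  ceiling(diff(range(x)) / h)
}

# A deterministic vector, so the comparison does not depend on a random seed.
x <- 1:100
by_hand  <- c(sturges_by_hand(x), scott_by_hand(x), fd_by_hand(x))
built_in <- c(nclass.Sturges(x), nclass.scott(x), nclass.FD(x))
rbind(by_hand, built_in)
```

Sturges grows with log2 of the length and ignores the spread entirely, whereas the other two divide the range by a bin width that shrinks like n^(-1/3). That is why they pull ahead on long vectors.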
For strongly skewed data, you are best off applying some sort of transformation before you draw a histogram, but for the record, Freedman-Diaconis again gives bigger answers for highly skewed (and thus wider) vectors.
short_lognormal <- rlnorm(1e2)
nclass.Sturges(short_lognormal)  # 8
nclass.scott(short_lognormal)    # 9
nclass.FD(short_lognormal)       # 20

long_lognormal <- rlnorm(1e5)
nclass.Sturges(long_lognormal)   # 18
nclass.scott(long_lognormal)     # 443
nclass.FD(long_lognormal)        # 1134
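To illustrate the point about transforming first, here is a small sketch that compares the three answers before and after taking logs. The seed, the sample size and the variable names are arbitrary choices for the example, and a log transform is only one of the transformations you might reach for.

```r
set.seed(1)               # arbitrary seed, only so the example is repeatable
skewed <- rlnorm(1e4)     # strongly right-skewed sample
logged <- log(skewed)     # logging a lognormal sample gives a normal one

# Collect the three suggestions for a vector in one named result.
suggest_bins <- function(x) {
  c(Sturges = nclass.Sturges(x), Scott = nclass.scott(x), FD = nclass.FD(x))
}

bins_raw    <- suggest_bins(skewed)
bins_logged <- suggest_bins(logged)
rbind(raw = bins_raw, logged = bins_logged)
```

Sturges doesn't change, since the length is the same, while Scott and Freedman-Diaconis should both calm down considerably once the long tail has been pulled in.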
My feeling is that since each of the three algorithms is rather dumb, it is safest to calculate all three, then pick the middle one.
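# Combine the three suggestions with fun; the default picks the middle one.
nclass.all <- function(x, fun = median) {
  fun(c(
    nclass.Sturges(x),
    nclass.scott(x),
    nclass.FD(x)
  ))
}

# Assumed: the log of the built-in islands dataset.
log_islands <- log(islands)
hist(log_islands, breaks = nclass.all(log_islands))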
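The fun argument means you aren't stuck with the median. As a quick sketch (the seed and the test vector are arbitrary), you could ask for the most cautious or the most generous of the three answers instead:

```r
# Same definition as in the post, repeated so this snippet runs on its own.
nclass.all <- function(x, fun = median) {
  fun(c(nclass.Sturges(x), nclass.scott(x), nclass.FD(x)))
}

set.seed(42)              # arbitrary seed, only so the example is repeatable
x <- rlnorm(1e3)

fewest <- nclass.all(x, min)   # smallest of the three suggestions
middle <- nclass.all(x)        # default: the median
most   <- nclass.all(x, max)   # largest of the three suggestions
c(fewest = fewest, middle = middle, most = most)
```

Any function that reduces a numeric vector to a single number will do here, though if you pass something like mean you may want to round the result yourself before using it as a bin count.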
I also wrote a MATLAB implementation of this a couple of years ago.
It is worth noting that ggplot2 doesn’t accept a number-of-bins argument to geom_histogram, because:
In practice, you will need to use multiple bin widths to
discover all the signal in the data, and having bins with
meaningful widths (rather than some arbitrary fraction of the
range of the data) is more interpretable.
That’s fine if you are interactively exploring the data, but if you want a purely automated solution, then you need to make up a number of bins.
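library(ggplot2)
# movies shipped with ggplot2 when this was written;
# newer releases moved it to the ggplot2movies package.

# Turn a number of bins into a bin width by dividing up the range of the data.
calc_bin_width <- function(x, ...) {
  rangex <- range(x, na.rm = TRUE)
  (rangex[2] - rangex[1]) / nclass.all(x, ...)
}

# The width is calculated on the log10 scale, to match scale_x_log10().
p <- ggplot(movies, aes(x = votes)) +
  geom_histogram(binwidth = calc_bin_width(log10(movies$votes))) +
  scale_x_log10()
p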