Is it just me, or have I become old and cranky?
I’ve been using R and mostly enjoying it since 2006. Lately I’ve been having some misgivings about the direction R as a community is headed. Some of these misgivings no doubt stem from reluctance to learn new ways of doing things after investing so much time mastering the old ways, but underneath my old-man crankiness I believe there are real and important challenges facing the R community. R has grown considerably since I started using it a decade ago, and this has mostly been a good thing as new packages implement new and better functionality. Recently, popular contributed packages have been replacing core R functionality with new approaches, leading to a fragmentation of the user base and increasing cognitive load by requiring analysts to choose a package (or set of packages) before they even write the first line of code.
R is “a large, coherent, integrated collection of intermediate tools for data analysis”
The quote is taken from the official R manual An Introduction to R. The “coherent” and “integrated” claims have always been questionable in some areas, but this situation is getting far worse as contributed packages re-invent basic R features and replace classic R idioms with new ones. Let me show you what I’m talking about.
Is there a coherent integrated “group by” in R?
Suppose I want to aggregate a data set by calculating group sizes, means, and standard deviations. There are many ways to do it in base R, and none are quite satisfactory.
First up is the tapply function. It works well for simple aggregation, but it won’t help us here because it can only handle a single x variable, and it produces confusing return values if FUN returns a vector of length greater than one. That is to say, it works fine for calculating a single summary statistic by group
tapply(mtcars$wt, mtcars[c("am", "cyl")], FUN = mean)

   cyl
am        4       6        8
  0 2.93500 3.38875 4.104083
  1 2.04225 2.75500 3.370000
but anything more complicated quickly becomes a mess
tapply(mtcars$wt, mtcars[c("am", "cyl")],
       FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x)))

   cyl
am  4         6         8
  0 Numeric,3 Numeric,3 Numeric,3
  1 Numeric,3 Numeric,3 Numeric,3
Next we might consider the ave function, but a quick look at the documentation suggests it won’t be any better than tapply.
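To see why, here is a minimal sketch (my own example, not from the documentation): ave returns a vector as long as its input, with each group’s summary repeated, which is handy for adding a group-mean column but useless for a one-row-per-group summary. The output should look something like this, given the group means computed above.

ave(mtcars$wt, mtcars$am, mtcars$cyl, FUN = mean)[1:6]
[1] 2.755000 2.755000 2.042250 3.388750 4.104083 3.388750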
OK, let’s try something else. Maybe aggregate will help us.
aggregate(cbind(mpg, hp, wt) ~ am + cyl, data = mtcars,
          FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x)))

  am cyl      mpg.n   mpg.mean    mpg.sd     hp.n   hp.mean    hp.sd       wt.n   wt.mean     wt.sd
1  0   4  3.0000000 22.9000000 1.4525839  3.00000  84.66667 19.65536  3.0000000 2.9350000 0.4075230
2  1   4  8.0000000 28.0750000 4.4838599  8.00000  81.87500 22.65542  8.0000000 2.0422500 0.4093485
3  0   6  4.0000000 19.1250000 1.6317169  4.00000 115.25000  9.17878  4.0000000 3.3887500 0.1162164
4  1   6  3.0000000 20.5666667 0.7505553  3.00000 131.66667 37.52777  3.0000000 2.7550000 0.1281601
5  0   8 12.0000000 15.0500000 2.7743959 12.00000 194.16667 33.35984 12.0000000 4.1040833 0.7683069
6  1   8  2.0000000 15.4000000 0.5656854  2.00000 299.50000 50.20458  2.0000000 3.3700000 0.2828427
aggregate appears to be closer to what we want than tapply, but it still either requires post-processing to remove the redundant group size columns, or multiple calls to produce the desired result, e.g.,
merge(aggregate(cbind(n = mpg) ~ am + cyl, data = mtcars, FUN = length),
      aggregate(cbind(mpg, hp, wt) ~ am + cyl, data = mtcars,
                FUN = function(x) c(mean = mean(x), sd = sd(x))))

  am cyl  n   mpg.mean    mpg.sd   hp.mean    hp.sd   wt.mean     wt.sd
1  0   4  3 22.9000000 1.4525839  84.66667 19.65536 2.9350000 0.4075230
2  0   6  4 19.1250000 1.6317169 115.25000  9.17878 3.3887500 0.1162164
3  0   8 12 15.0500000 2.7743959 194.16667 33.35984 4.1040833 0.7683069
4  1   4  8 28.0750000 4.4838599  81.87500 22.65542 2.0422500 0.4093485
5  1   6  3 20.5666667 0.7505553 131.66667 37.52777 2.7550000 0.1281601
6  1   8  2 15.4000000 0.5656854 299.50000 50.20458 3.3700000 0.2828427
OK, tapply and ave were busts, and aggregate is close but not quite what we want. How about by?
by(mtcars, mtcars[c("am", "cyl")],
   FUN = function(x) {
     with(x, c(n = nrow(x),
               mpg_mean = mean(mpg), mpg_sd = sd(mpg),
               hp_mean = mean(hp), hp_sd = sd(hp),
               wt_mean = mean(wt), wt_sd = sd(wt)))
   }, simplify = FALSE)

am: 0
cyl: 4
         n   mpg_mean     mpg_sd    hp_mean      hp_sd    wt_mean      wt_sd
  3.000000  22.900000   1.452584  84.666667  19.655364   2.935000   0.407523
-----------------------------------------------------------------------------
am: 1
cyl: 4
         n   mpg_mean     mpg_sd    hp_mean      hp_sd    wt_mean      wt_sd
 8.0000000 28.0750000  4.4838599 81.8750000 22.6554156  2.0422500  0.4093485
-----------------------------------------------------------------------------
am: 0
cyl: 6
         n   mpg_mean     mpg_sd    hp_mean      hp_sd    wt_mean      wt_sd
 4.0000000 19.1250000  1.6317169 115.2500000  9.1787799  3.3887500  0.1162164
-----------------------------------------------------------------------------
am: 1
cyl: 6
         n   mpg_mean     mpg_sd    hp_mean      hp_sd    wt_mean      wt_sd
 3.0000000 20.5666667  0.7505553 131.6666667 37.5277675  2.7550000  0.1281601
-----------------------------------------------------------------------------
am: 0
cyl: 8
         n   mpg_mean     mpg_sd    hp_mean      hp_sd    wt_mean      wt_sd
12.0000000 15.0500000  2.7743959 194.1666667 33.3598379  4.1040833  0.7683069
-----------------------------------------------------------------------------
am: 1
cyl: 8
         n   mpg_mean     mpg_sd    hp_mean      hp_sd    wt_mean      wt_sd
 2.0000000 15.4000000  0.5656854 299.5000000 50.2045815  3.3700000  0.2828427
Well, maybe that’s a little better, but it’s not really any less verbose than the aggregate-and-merge strategy, and the result isn’t a very friendly data structure. Maybe we should just roll our own.
do.call(rbind, lapply(split(mtcars, mtcars[c("am", "cyl")]), function(x) {
  with(x, data.frame(am = unique(am), cyl = unique(cyl), n = nrow(x),
                     mpg_mean = mean(mpg), mpg_sd = sd(mpg),
                     hp_mean = mean(hp), hp_sd = sd(hp),
                     wt_mean = mean(wt), wt_sd = sd(wt)))
}))

    am cyl  n mpg_mean    mpg_sd   hp_mean    hp_sd wt_mean     wt_sd
0.4  0   4  3 22.90000 1.4525839  84.66667 19.65536 2.935000 0.4075230
1.4  1   4  8 28.07500 4.4838599  81.87500 22.65542 2.042250 0.4093485
0.6  0   6  4 19.12500 1.6317169 115.25000  9.17878 3.388750 0.1162164
1.6  1   6  3 20.56667 0.7505553 131.66667 37.52777 2.755000 0.1281601
0.8  0   8 12 15.05000 2.7743959 194.16667 33.35984 4.104083 0.7683069
1.8  1   8  2 15.40000 0.5656854 299.50000 50.20458 3.370000 0.2828427
By now we’ve tried four different approaches, but nothing seems to make the calculation particularly natural or convenient. Is this really a “coherent and integrated” collection of functions? It feels more like a haphazard collection of overlapping functions that can be abused in different ways. So here are some questions.
- Given that aggregate appears to be more flexible than tapply and ave, do we really need the latter two?
- Can aggregate be generalized so that we can apply functions to data.frames instead of to the columns of those data.frames? (See the sketch below.)
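As a proof of concept for that second question, here is a hypothetical wrapper (aggregate_df is a name I made up; it is not part of base R) that passes each group’s whole data.frame to FUN instead of individual columns:

aggregate_df <- function(data, by, FUN) {
  # Split into one data.frame per group, apply FUN to each whole
  # data.frame, and stack the results alongside the grouping columns.
  groups <- split(data, data[by], drop = TRUE)
  do.call(rbind, lapply(groups, function(g) {
    cbind(unique(g[by]), as.data.frame(as.list(FUN(g))))
  }))
}

aggregate_df(mtcars, by = c("am", "cyl"),
             FUN = function(g) c(n = nrow(g),
                                 mpg_mean = mean(g$mpg), mpg_sd = sd(g$mpg)))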
Can we do better?
Of course we can do better. Many an R programmer has gazed out over the rubble of tapply, ave, by, and aggregate and mused, “surely I can bring order and harmony to this jumble. Follow me and we will create a ‘group by’ operation to end all SQL jealousy in the kingdom of R.” And what comes of this musing? Let us look with wonder upon the bubbling, exuberant creativity of the R community.
doBy::summaryBy
Very similar to aggregate, with the same limitations.
doBy::summaryBy(mpg + hp + wt ~ am + cyl, data = mtcars,
                FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x)))

  am cyl mpg.n mpg.mean    mpg.sd hp.n   hp.mean    hp.sd wt.n  wt.mean     wt.sd
1  0   4     3 22.90000 1.4525839    3  84.66667 19.65536    3 2.935000 0.4075230
2  0   6     4 19.12500 1.6317169    4 115.25000  9.17878    4 3.388750 0.1162164
3  0   8    12 15.05000 2.7743959   12 194.16667 33.35984   12 4.104083 0.7683069
4  1   4     8 28.07500 4.4838599    8  81.87500 22.65542    8 2.042250 0.4093485
5  1   6     3 20.56667 0.7505553    3 131.66667 37.52777    3 2.755000 0.1281601
6  1   8     2 15.40000 0.5656854    2 299.50000 50.20458    2 3.370000 0.2828427
Hmisc::summary.formula
Similar to aggregate, with a large number of confusing options. This one automatically computes N for each group, so it actually works for our example.
Hmisc::summary.formula(cbind(mpg, hp, wt) ~ am + cyl, data = mtcars,
                       fun = function(x) {
                         apply(x, 2, FUN = function(y) {
                           c(mean = mean(y), sd = sd(y))
                         })
                       })

cbind(mpg, hp, wt)    N=32

+-------+---+--+--------+--------+---------+--------+--------+---------+
|       |   |N |mpg mean|mpg sd  |hp mean  |hp sd   |wt mean |wt sd    |
+-------+---+--+--------+--------+---------+--------+--------+---------+
|am     |No |19|17.14737|3.833966|160.26316|53.90820|3.768895|0.7774001|
|       |Yes|13|24.39231|6.166504|126.84615|84.06232|2.411000|0.6169816|
+-------+---+--+--------+--------+---------+--------+--------+---------+
|cyl    |4  |11|26.66364|4.509828| 82.63636|20.93453|2.285727|0.5695637|
|       |6  | 7|19.74286|1.453567|122.28571|24.26049|3.117143|0.3563455|
|       |8  |14|15.10000|2.560048|209.21429|50.97689|3.999214|0.7594047|
+-------+---+--+--------+--------+---------+--------+--------+---------+
|Overall|   |32|20.09062|6.026948|146.68750|68.56287|3.217250|0.9784574|
+-------+---+--+--------+--------+---------+--------+--------+---------+
dplyr::summarize
This one is very popular, and for good reason. It works well.
dplyr::summarize(dplyr::group_by(mtcars, am, cyl),
                 n = length(mpg),
                 mean_mpg = mean(mpg), sd_mpg = sd(mpg),
                 mean_hp = mean(hp), sd_hp = sd(hp),
                 mean_wt = mean(wt), sd_wt = sd(wt))

Source: local data frame [6 x 9]
Groups: am [?]

     am   cyl     n mean_mpg    sd_mpg   mean_hp    sd_hp  mean_wt     sd_wt
  <dbl> <dbl> <int>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1     0     4     3 22.90000 1.4525839  84.66667 19.65536 2.935000 0.4075230
2     0     6     4 19.12500 1.6317169 115.25000  9.17878 3.388750 0.1162164
3     0     8    12 15.05000 2.7743959 194.16667 33.35984 4.104083 0.7683069
4     1     4     8 28.07500 4.4838599  81.87500 22.65542 2.042250 0.4093485
5     1     6     3 20.56667 0.7505553 131.66667 37.52777 2.755000 0.1281601
6     1     8     2 15.40000 0.5656854 299.50000 50.20458 3.370000 0.2828427
dplyr::do
If you have a large number of columns to summarize, you might not want to type them all out. In that case you can use do.
library(dplyr)

do(group_by(mtcars, am, cyl),
   as.data.frame(c(list(n = nrow(.)),
                   as.list(sapply(.[c("mpg", "wt", "hp")],
                                  function(x) c(mean = mean(x)))),
                   as.list(sapply(.[c("mpg", "wt", "hp")],
                                  function(x) c(sd = sd(x)))))))

Source: local data frame [6 x 9]
Groups: am, cyl [6]

     am   cyl     n mpg.mean  wt.mean   hp.mean    mpg.sd     wt.sd    hp.sd
  <dbl> <dbl> <int>    <dbl>    <dbl>     <dbl>     <dbl>     <dbl>    <dbl>
1     0     4     3 22.90000 2.935000  84.66667 1.4525839 0.4075230 19.65536
2     0     6     4 19.12500 3.388750 115.25000 1.6317169 0.1162164  9.17878
3     0     8    12 15.05000 4.104083 194.16667 2.7743959 0.7683069 33.35984
4     1     4     8 28.07500 2.042250  81.87500 4.4838599 0.4093485 22.65542
5     1     6     3 20.56667 2.755000 131.66667 0.7505553 0.1281601 37.52777
6     1     8     2 15.40000 3.370000 299.50000 0.5656854 0.2828427 50.20458
tables::tabular
This one focuses on creating LaTeX and HTML tables. It implements its own SAS-inspired mini-language that is, in my opinion, very confusing, though possibly worth learning if you frequently need to create complex publication-ready tables.
library(tables)  # needed for Factor() inside the formula

tables::tabular((Factor(am)) * (Factor(cyl)) ~ (n = 1) + (mpg + wt + hp) * (mean + sd),
                data = mtcars)

            mpg          wt            hp
 am cyl  n  mean  sd     mean  sd     mean   sd
 0  4    3  22.90 1.4526 2.935 0.4075  84.67 19.655
    6    4  19.12 1.6317 3.389 0.1162 115.25  9.179
    8   12  15.05 2.7744 4.104 0.7683 194.17 33.360
 1  4    8  28.07 4.4839 2.042 0.4093  81.88 22.655
    6    3  20.57 0.7506 2.755 0.1282 131.67 37.528
    8    2  15.40 0.5657 3.370 0.2828 299.50 50.205
data.table::`[`
The data.table package implements an alternative to the venerable data.frame class in R and provides sophisticated manipulation via an indexing-like interface.
library(data.table)

as.data.table(mtcars)[, list(n = .N,
                             mpg_mean = mean(mpg), mpg_sd = sd(mpg),
                             wt_mean = mean(wt), wt_sd = sd(wt),
                             hp_mean = mean(hp), hp_sd = sd(hp)),
                      by = c("am", "cyl")]
Are we done yet? Well, I’m going to stop, but we could go on. There are at least 9 ways to skin this particular cat in R.
How do I _ in R?
So there are lots of ways to calculate statistics by some grouping variable(s) in R. Why can’t you be happy that you have so many excellent choices?
I can’t be happy about it because it makes my life more difficult. First, I need to identify my options. Then I need to evaluate them and learn the particulars of my chosen package. This all takes effort that I would rather spend on other things. Now, if this problem were limited to the domain of calculating statistics by group, I wouldn’t be writing this post. But this issue crops up almost everywhere in R.
How do I read text data?
I have a .csv file I want to read into R. Should I use
- read.csv
- readr::read_csv
- data.table::fread
- rio::import
- hypoparsr::parse_file
- csvread::csvread
or something else?
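And the choice matters even at this basic level: different readers hand back different classes. A quick sketch (my example; "data.csv" is a stand-in file name, and the readr class attributes are approximate):

class(read.csv("data.csv"))           # "data.frame"
class(readr::read_csv("data.csv"))    # roughly "tbl_df" "tbl" "data.frame"
class(data.table::fread("data.csv"))  # "data.table" "data.frame"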
How do I fit a linear regression model?
I want to fit a simple linear regression model. Should I use
- lm
- rms::ols
- Zelig::zlm
- glm2::glm2
or something else?
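For the simple case, at least, the base R incumbent is a one-liner (my example, using mtcars as a stand-in data set):

fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$coefficients  # estimates, standard errors, t and p values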
How do I make a table from model coefficients?
I’ve fit a model and would like to put the results in a nice table. Should I use
- xtable::xtable
- rockchalk::outreg
- apsrtable::apsrtable
- htmlTable::htmlTable
- etable::tabular.ade
- knitr::kable
- texreg::texreg
- stargazer::stargazer
- ascii::ascii
- estout::esttab
or some other thing?
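To pick just one option from that list, here is a minimal sketch using knitr::kable (my example, not an endorsement), which renders a coefficient matrix as a plain-text/markdown table:

fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model
knitr::kable(summary(fit)$coefficients, digits = 3)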
Summary
There is an overwhelming number of choices for doing just about anything in R.
What is R really?
OK, so there is some duplication among R functions and packages, and people need to choose. There are both good and bad consequences of this, but the upshot is that it is no longer clear what R is.
R is not data.frames
Most people who use R use it for statistical analysis and graphics. The basic data structure in most popular statistical packages is a rectangular structure with variables in the columns and observations in the rows. In R this structure is called a data.frame, and learning how to operate on and with data.frames is a basic skill that any R user must have.
The previous sentence may have been true at one time, but it no longer is. There are now at least three popular rectangular data structures: data.frame, data.table, and tibble. It is now possible to carry out sophisticated data manipulation and analysis in R without ever learning fundamental data.frame methods such as `[.data.frame`. And these classes’ methods work differently.
library(data.table)
library(tibble)

head(mtcars)[, 1]
[1] 21.0 21.0 22.8 21.4 18.7 18.1

as.data.table(head(mtcars))[, 1]
[1] 1

as_tibble(head(mtcars))[, 1]
# A tibble: 6 × 1
    mpg
  <dbl>
1  21.0
2  21.0
3  22.8
4  21.4
5  18.7
6  18.1
R is not a language that uses parenthetical argument lists, like C
One of the first things I used to teach people about R is that function calls have the form functionName(arg1, arg2, ..., argn), and that even when all arguments are optional and you want to accept the defaults you need the () after the function name. This is no longer true. Many people now write strange-looking R expressions like
library(magrittr)
mtcars %>% head
R has always had flexible syntax, but with developments like this you can write R that looks nothing like what we old-timers expect R code to be.
R is not coherent
There is no zen-like “one obvious way to do it” in R.
R is being built phoenix-like from its own ashes
There is good news and there is bad news. The good news is that new, more coherent and integrated zones are being carved out of the R landscape. For example, the tidyverse brings greater simplicity and consistency to many common operations in R. The bad news is that you can’t escape the cold hard XKCD reality that producing a “better” way of doing things means there is now one more way of doing that thing.
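For instance, the tidyverse version of the group-by exercise from earlier reads almost like the English description of the task (a sketch, assuming dplyr is installed):

library(dplyr)

mtcars %>%
  group_by(am, cyl) %>%
  summarize(n = n(), mpg_mean = mean(mpg), mpg_sd = sd(mpg))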
And now, the thrilling conclusion
Now we’ve finally reached the part of the blog where I tell you how everyone is doing it wrong and how, if you would just listen to me, we could solve all our problems. The truth is I don’t have great answers or solutions for these issues. The best I can do is offer some general thoughts.
It could be worse
This post probably sounds critical of R, but don’t get me wrong: I’m a huge fan of R. Every time I venture into Python, JavaScript, Scala, or even Julia, it makes me appreciate R even more. R is easy and useful, and having too much choice is certainly better than having too little.
We can do better
We can collaborate more, and we can update documentation to recommend current best-of-breed packages.