R SQL-ish aggregation
I came to R from SQL. I’ve heard that packages such as reshape and plyr are quite beautiful, but to me they are somewhat non-intuitive. I’m always looking for something like:
SELECT mean(sodium) as mean, variance(sodium) as var FROM texas GROUP BY weight;
I struggled with this a bit. First, it’s tempting to use tapply(), except that’s more of a one-column-in, one-column-out sort of thing. There’s by(), which is nice and, with a short wrapper, can be OK.
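For concreteness, the tapply() route would look something like this, assuming a data frame texas with numeric sodium and weight columns as in the SQL above (my illustration, not from the original post):

tapply(texas$sodium, texas$weight, mean)  # one grouped summary per call
tapply(texas$sodium, texas$weight, var)   # a second summary needs a second call

Getting both summaries side by side in one result table, or grouping by an arbitrary expression, takes extra plumbing.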
Anyway, here’s what I did. First, the pretty stuff: R, while generally considered a practical language, is, to me, more of a pretty language. A great way to make things pretty in R is to abuse dynamic binding.
Say I have a table like this:
> Table
  Alice Bob       Cathy Doug
1     1   1  0.08309254    a
2     2   2 -0.76015321    a
3     3   3  0.30887905    a
4     4   4  0.74900962    a
5     1   5  0.69447607    a
6     2   6  0.32485197    a
I’d like to call select() like this:
select(Table, Alice,
       a = mean(Cathy * Doug),
       b = sum(Cathy - Doug) / sum(Doug))
where ‘Alice’ gives the column to group by, and a= and b= give the columns to create.
So this could look like
select = function(Table, Key, Add.Key=T, ...) {
  Columns = as.list(substitute(list(...)))[-1L]
  S.Key = substitute(Key)
  Key.Str = deparse(S.Key)
  Group.By = eval(S.Key, envir=Table)
  ...
}
First, the as.list(substitute(list(...)))[-1L] trick gets the unevaluated expressions for the output columns. Then substitute(Key) gets the unevaluated key, and deparse(S.Key) gives it a name (it would be better to let the caller specify the name, but I haven’t needed that yet, so I didn’t implement it).
Group.By = eval(S.Key, envir=Table)
evaluates the grouping column, so now we have it as a vector, with (hopefully) the same number of rows as the table.
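To make the mechanics concrete, here is a minimal, self-contained sketch of that capture-and-evaluate pattern; the function name demo_capture and the toy data frame are mine, not part of the original code:

demo_capture = function(Table, Key, ...) {
  Columns = as.list(substitute(list(...)))[-1L]  # unevaluated expressions, names preserved
  S.Key = substitute(Key)                        # the grouping expression, unevaluated
  list(key.name   = deparse(S.Key),
       key.values = eval(S.Key, envir = Table),  # evaluated against the table's columns
       columns    = Columns)
}

demo_capture(data.frame(x = 1:4, g = c("a", "b", "a", "b")), g, m = mean(x))
# key.name is "g", key.values is the g column, and columns holds the quoted call mean(x) under the name m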
Now the next step turned out to suffer some unusual efficiency problems. The obvious approach is to use by() to split Table into many subtables and evaluate the output columns on each of them. This works, but when the number of output rows (i.e. groups) is very large, it is extraordinarily slow.
An implementation using by() would look like this
select1 = function(Table, Key, Add.Key=T, ...) {
  Columns = as.list(substitute(list(...)))[-1L]  # unevaluated output-column expressions
  S.Key = substitute(Key)
  Key.Str = deparse(S.Key)
  Group.By = eval(S.Key, envir=Table)            # the grouping column as a vector
  Table[[Key.Str]] = Group.By
  Group.Gen = function(Group) {
    # Evaluate each output expression within one subtable
    List.Out = sapply(Columns, function(Col) {
      eval(Col, envir=Group)
    })
    if (Add.Key) {
      List.Key = list()
      List.Key[[Key.Str]] = unique(Group[[Key.Str]])
      List.Out = c(List.Key, List.Out)
    }
    as.data.frame(List.Out)
  }
  Group.List = by(Table, Group.By, Group.Gen, simplify=F)
  names(Group.List) = c()
  Result = do.call(rbind, Group.List)
  Result
}
- Split the table into subtables with by().
- Evaluate the columns in each subtable.
- rbind() the outputs together.
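As a hypothetical usage example (mine, not the post’s), a call against the small Table above would look like

select1(Table, Alice, m = mean(Cathy), n = length(Bob))

and return one row per distinct Alice value, with columns Alice, m, and n.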
The following implementation is about 10 times faster: instead of using by(), it deals with the vectors directly and doesn’t build a data.frame until the end.
select2 = function(Table, Key, Add.Key=T, ...) {
  Columns = as.list(substitute(list(...)))[-1L]  # unevaluated output-column expressions
  S.Key = substitute(Key)
  Key.Str = deparse(S.Key)
  Group.By = eval(S.Key, envir=Table)
  Indices = tapply(1:nrow(Table), Group.By, c)   # list of row indices, one element per group
  Col.List = as.list(Table)
  Col.Out.List = lapply(Columns, function(Column) {
    sapply(Indices, function(Group) {
      Subset = lapply(Col.List, function(Col) Col[Group])  # subset each column by plain indexing
      eval(Column, envir=Subset)
    })
  })
  if (Add.Key) {
    Key.List = list()
    Key.List[[Key.Str]] = sapply(Indices, function(Group) {
      unique(Group.By[Group])
    })
    Col.Out.List = c(Key.List, Col.Out.List)
  }
  Table.Out = do.call(data.frame, Col.Out.List)
  Table.Out
}
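The heart of select2() is the tapply(1:nrow(Table), Group.By, c) line, which builds one vector of row indices per group; each output expression is then evaluated against plain indexed subsets of the columns. A toy illustration of just that indexing step (the values are mine, not from the post):

g = c("a", "b", "a", "b")
tapply(seq_along(g), g, c)
# group "a" gets row indices 1 and 3, group "b" gets 2 and 4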
With a large table
Table = data.frame(
  Alice = rep(c(1, 2, 3, 4), 1000),
  Bob = rep(1:2000, 2),
  Cathy = rnorm(4000),
  Doug = rep('a', 4000)
)
The difference becomes apparent
> system.time(select1(Table, Bob, m=mean(Cathy)))
   user  system elapsed
  3.000   0.010   3.026
> system.time(select2(Table, Bob, m=mean(Cathy)))
   user  system elapsed
  0.290   0.000   0.286
Unfortunately, when the number of output rows is very, very large, even select2() becomes too slow (and much slower than tapply(); plyr is even slower). The best solutions I have found are:
- If there’s only one input column and one output column, use tapply().
- Use sqldf to run the query in SQLite (a sketch follows below).
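Here is a rough sketch of the sqldf route, assuming the sqldf package is installed; avg() is SQLite’s spelling of the aggregate, and the data frame is copied to the name DF because TABLE is a reserved word in SQL:

library(sqldf)
DF = Table  # avoid using the SQL keyword TABLE as an identifier
sqldf("SELECT Bob, avg(Cathy) AS m FROM DF GROUP BY Bob")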