Aggregate – A Powerful Tool for Data Frame in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This post gives a short review of the aggregate function as used for data.frames and presents some interesting uses: from the trivial but handy to the most complicated problems I have solved with aggregate.
Aggregate (data.frame): Technical Overview
Aggregate
is a function in base R which can, as the name suggests, aggregate the inputted data.frame
d.f by applying a function specified by the FUN
parameter to each column of sub-data.frames defined by the by
input parameter.
The by
parameter has to be a list
. However, since data.frame
‘s are handled as (named) lists of columns, one or more columns of a data.frame
can also be passed as the by
parameter. Interestingly, if these columns are of the same data.frame
as the one inputted as x
, those columns are not passed on to the FUN
function.
The function to apply has to be able to accept a vector
(since it will be called with parts of a column of a data.frame
as input).
The sub-data.frames defined by the by
input parameter can be thought of as logical indexing:
d.f <- data.frame(rating = c("AAA", "A", "A", "AAA", "BB", "BB", "AAA", "A")) i <- 1 by <- d.f$rating sub.data.frame <- d.f[by == unique(by)[i], ]
and do this for every i
between 1 and length(unique(by))
. Note that the by
variable doesn’t have to agree with one (or more) column of the data.frame
but could be anything. Hence, one can reproduce the aggregate
functionality by a for
cycle running the cycle variable over the unique values of the variable passed as by
and an sapply
applying the function passed as FUN
to each column of the data.frame
sub.data.frame. Such a workaround however would be very difficult to document, as it would be unclear what (and why) this code is actually doing.
Aggregate
always returns a data.frame
as a result. This data.frame
will contain the (now unique) values from the input parameter by
as the first column and then columns containing the results of the call to the function in the FUN
parameter applied to the parts of the columns of the inputted data.frame
. It is interesting to note that if the function FUN
returns multiple values, the class of the columns of the result data.frame
will be list
or something a list
can be cast to (see the last example below).
It is important to note that the function call is applied to nameless vectors rather than named columns of a data.frame
and hence referring to the names of the data.frame
will not work, nor will column references such as s.d.f[,1]
.
Basic Examples
The most basic uses of aggregate involve base functions such as mean
and sd
. It is indeed one of the most common uses of aggregate
to compare the mean or other properties of sample groups.
Recently I reproduced calculations from an Excel sheet. Most formulae were subtotals and grand totals. The Excel sheet was not very comfortably organized for this purpose: sums over rows, columns and totals of those sums were used. In R, I have changed the data to a star schema representation (when all metadata are represented row-wise and every value gets its own row) using reshape2
package and melt
then used aggregate
along different variables to get the different totals. The less variables you use in by
the more aggregated the end-result: the grand total along a dimension is simply using that dimension as “by”, while subtotals can be achieved using multiple variables as by
. The FUN
in this case was of course sum
.
One handy use of aggregate and a base function is getting the number of appearances of the various values:
values <- data.frame(value = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c")) nr.of.appearances <- aggregate(x = values, by = list(unique.values = values$value), FUN = length)
My favourite use of aggregate with a base function is getting the last day of each month in a series of dates. To do so, one can use the following code (assuming your dates are stored in a “YYYY-MM-DD” format as strings or as Date
):
dates <- data.frame(date = as.Date("2001-01-01", format = "%Y-%m-%d") + 0:729) dates date 1 2001-01-01 2 2001-01-02 3 2001-01-03 4 2001-01-04 ..... last.day <- aggregate(x = dates["date"], by = list(month = substr(dates$date, 1, 7)), FUN = max) last.day month date 1 2001-01 2001-01-31 2 2001-02 2001-02-28 3 2001-03 2001-03-31 4 2001-04 2001-04-30 .....
This came in very handy when working with banking information where the last day of the month depended on banking holidays as well as weekends.
Advanced Uses
More advanced uses of aggregate
depend on writing your own function
, e.g. anonymous functions passed on as the FUN
parameter. To do so, one can use the syntax
# do not run the syntax aggregate(x = d.f, by = by.list, FUN = function(s.d.f){y <- s.d.f; return(y)}
The possible uses range from calling complex portfolio risk metrics for the homogeneous risk groups of a portfolio via fitting a distribution to categories of samples to anything you can image, really.
Here is an example with a “complex” portfolio risk metric (exposure to different counterparties in different asset classes):
assets <- data.frame(asset.class = c("equity", "equity","equity", "option","option","option", "bond", "bond"), rating = c("AAA", "A", "A", "AAA", "BB", "BB", "AAA", "A"), counterparty.a = c(runif(3), rnorm(5)), counterparty.b = c(runif(3), rnorm(5)), counterparty.c = c(runif(3), rnorm(5))) assets asset.class rating counterparty.a counterparty.b counterparty.c 1 equity AAA 0.9026004 0.6029417 0.8629453 2 equity A 0.8834034 0.5809589 0.4654721 3 equity A 0.1007586 0.9368537 0.3090811 4 option AAA -1.0508915 0.7171532 0.2224984 .....
Here is the use of aggregate()
function.
exposures <- aggregate(x = assets[c("counterparty.a", "counterparty.b", "counterparty.c")], by = assets[c("asset.class", "rating")], FUN = function(market.values){ sum(pmax(market.values, 0)) }) exposures asset.class rating counterparty.a counterparty.b counterparty.c 1 bond A 1.0038714 0.6382029 2.2822936 2 equity A 0.9841620 1.5178126 0.7745532 3 bond AAA 0.0000000 0.0000000 0.0000000 4 equity AAA 0.9026004 0.6029417 0.8629453 .....
Next up: fitting a Gaussian distribution to observations by categories:
library(MASS) categories <- data.frame(category = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c")) observations <- data.frame(observation = c(rnorm(5, mean = 3, sd = 0.2), rnorm(5, mean = -2, sd = 0.4), rnorm(4, mean = 0, sd = 1)))
Below we use the aggregate()
function to find the mean and standard deviation by categories.
distr.estimate <- aggregate(x = observations, by = categories, FUN = function(observations){ fitdistr(observations, densfun = "normal")$estimate }) distr.estimate category observation.mean observation.sd 1 a 3.0606926 0.1779962 2 b -2.1446040 0.1658481 3 c -0.1881841 0.5613013
This last example showcases several interesting properties. First, the data.frame
to aggregate
and the list of by
variables don’t have to be the same. While this is implied in other places of the post, this is an explicit example of such a setup. Secondly, the function passed as FUN
is not only an anonymous function, it is curried from a function with more than one input parameter. A function of a single input variable observations
has been created from the two-input variable function fitdistr
: fixing one of the input variables by setting densfun = "normal"
. Thirdly, rather than returning the full return value of the fitdistr
function, the return value is restricted to the element estimate
from the return value. And last but not least, the return value of the anonymous function passed to FUN
consists of two variables and not only one. Interestingly, aggregate casts the return value from list
to a matrix
and names the elements for us. However, these names can’t be used to reference the columns of the matrix. You can however reference them as follows:
distr.estimate$observation[1,][["mean"]] [1] 3.016988
Closing Words
I hope that you have found the above useful. Now that you are more familiar with aggregate
, it is time for the truth: everything above and much more can be done with data.table
, and with a much faster performance. However, data.table
has a complex syntax and one really has to understand how things work under the hood, while aggregate
is simple and insightful. Until you are comfortable with both the logic of aggregation and the syntax of data.table
, it is a worthy investment to first write the code using aggregate
and then optimize it by rewriting it using data.table
.
For those of you who are interested, a dedicated post is coming where the above is redone with data.table
, along with some additional use cases specific to data.table
.
If you have any question leave a comment below.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.