Site icon R-bloggers

Don’t use stats::aggregate()

[This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When working with an analysis system (such as R) there are usually good reasons to prefer using functions from the “base” system over using functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().

Read on for our example.

For our example we create a data frame. The issue is: I am working in the Pacific time zone on Saturday October 31st 2015, and I have some time data that I want to work with that is in an Asian time zone.

print(date())
## [1] "Sat Oct 31 08:14:38 2015"
d <- data.frame(group='x',
 time=as.POSIXct(strptime('2006/10/01 09:00:00',
   format='%Y/%m/%d %H:%M:%S',
   tz="Etc/GMT+8"),tz="Etc/GMT+8"))  # I'd like to say UTC+8 or CST
print(d)
##   group                time
## 1     x 2006-10-01 09:00:00
print(d$time)
## [1] "2006-10-01 09:00:00 GMT+8"
str(d$time)
##  POSIXct[1:1], format: "2006-10-01 09:00:00"
print(unclass(d$time))
## [1] 1159722000
## attr(,"tzone")
## [1] "Etc/GMT+8"

Suppose I try to aggregate the data to find the earliest time for each group. I have a problem, aggregate loses the timezone and gives a bad answer.

d2 <- aggregate(time~group,data=d,FUN=min)
print(d2)
##   group                time
## 1     x 2006-10-01 10:00:00
print(d2$time)
## [1] "2006-10-01 10:00:00 PDT"

This is bad. Our time has lost its time zone and changed from 09:00:00 to 10:00:00. This violates John M. Chambers’ “Prime Directive” that:

computations can be understood and trusted.

Software for Data Analysis, John M. Chambers, Springer 2008, page 3.

The issue is the POSIXct time time is essentially a numeric array carrying around its timezone as an attribute. Most base R code has problems if there are extra attributes on a numeric array. So R-stat code tends to have a habit of dropping attributes when it can. it is odd that the class() is kept (which itself an attribute style structure) and the timezone is lost, but R is full of hand-specified corner cases.

dplyr gets the right answer.

library('dplyr')
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
by_group = group_by(d,group)
d3 <- summarize(by_group,min(time))
print(d3)
## Source: local data frame [1 x 2]
## 
##   group           min(time)
## 1     x 2006-10-01 09:00:00
print(d3[[2]])
## [1] "2006-10-01 09:00:00 GMT+8"

And plyr also works.

library('plyr')
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
d4 <- ddply(d,.(group),summarize,time=min(time))
print(d4)
##   group                time
## 1     x 2006-10-01 09:00:00
print(d4$time)
## [1] "2006-10-01 09:00:00 GMT+8"

To leave a comment for the author, please follow the link and comment on their blog: Win-Vector Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.