Aggregation by Group in R
[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
> df <- read.csv('credit_count.csv') > > # METHOD 1: USING AGGREGAGE() > summ1 <- aggregate(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], mean) > print(summ1) SELFEMPL OWNRENT INCOME BAD 1 0 0 2133.314 0.08470957 2 1 0 2742.247 0.06896552 3 0 1 2881.201 0.06293210 4 1 1 3487.910 0.05316973 > > # METHOD 2: USING BY() > temp2 <- by(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], colMeans) > summ2 <- cbind(expand.grid(dimnames(temp2)), do.call(rbind, temp2)) > print(summ2) SELFEMPL OWNRENT INCOME BAD 1 0 0 2133.314 0.08470957 2 1 0 2742.247 0.06896552 3 0 1 2881.201 0.06293210 4 1 1 3487.910 0.05316973 > > # METHOD 3: USING SQLDF() > library(sqldf) Loading required package: DBI Loading required package: gsubfn Loading required package: proto Loading required namespace: tcltk Loading Tcl/Tk interface ... done Loading required package: chron Loading required package: RSQLite Loading required package: RSQLite.extfuns > summ3 <- sqldf("select SELFEMPL, OWNRENT, avg(INCOME) as INCOME, avg(BAD) from df + group by SELFEMPL, OWNRENT") Loading required package: tcltk > print(summ3) SELFEMPL OWNRENT INCOME avg(BAD) 1 0 0 2133.314 0.08470957 2 0 1 2881.201 0.06293210 3 1 0 2742.247 0.06896552 4 1 1 3487.910 0.05316973 > > # METHOD 4: USING SQL.SELECT() > source("http://sqlselect.googlecode.com/svn/trunk/sql.select.R") Creating a generic function for ‘as.data.frame’ from package ‘base’ in the global environment > summ4 <- sql.select("select SELFEMPL, OWNRENT, `mean(INCOME)` as INCOME, `mean(BAD)` as BAD + from df group by SELFEMPL, OWNRENT") > print(summ4) SELFEMPL OWNRENT INCOME BAD 1 0 0 2133.314 0.08470957 2 0 1 2881.201 0.06293210 3 1 1 3487.910 0.05316973 4 1 0 2742.247 0.06896552
Efficiency Comparison among 4 Methods above
> test1 <- function(n){ + for (i in 1:n){ + summ1 <- aggregate(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], mean) + } + } > system.time(test1(10)) user system elapsed 0.404 0.036 0.513 > > test2 <- function(n){ + for (i in 1:n){ + temp2 <- by(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], colMeans) + summ2 <- cbind(expand.grid(dimnames(temp2)), do.call(rbind, temp2)) + } + } > system.time(test2(10)) user system elapsed 0.244 0.020 0.309 > > test3 <- function(n){ + for (i in 1:n){ + summ3 <- sqldf("select SELFEMPL, OWNRENT, avg(INCOME) as INCOME, avg(BAD) from df + group by SELFEMPL, OWNRENT") + } + } > system.time(test3(10)) user system elapsed 0.956 0.112 1.178 > > test4 <- function(n){ + for (i in 1:n){ + summ4 <- sql.select("select SELFEMPL, OWNRENT, `mean(INCOME)` as INCOME, `mean(BAD)` as BAD + from df group by SELFEMPL, OWNRENT") + } + } > system.time(test4(10)) user system elapsed 0.432 0.112 0.601
To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.