Aggregation by Group in R
[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
> df <- read.csv('credit_count.csv')
>
> # METHOD 1: USING AGGREGAGE()
> summ1 <- aggregate(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], mean)
> print(summ1)
SELFEMPL OWNRENT INCOME BAD
1 0 0 2133.314 0.08470957
2 1 0 2742.247 0.06896552
3 0 1 2881.201 0.06293210
4 1 1 3487.910 0.05316973
>
> # METHOD 2: USING BY()
> temp2 <- by(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], colMeans)
> summ2 <- cbind(expand.grid(dimnames(temp2)), do.call(rbind, temp2))
> print(summ2)
SELFEMPL OWNRENT INCOME BAD
1 0 0 2133.314 0.08470957
2 1 0 2742.247 0.06896552
3 0 1 2881.201 0.06293210
4 1 1 3487.910 0.05316973
>
> # METHOD 3: USING SQLDF()
> library(sqldf)
Loading required package: DBI
Loading required package: gsubfn
Loading required package: proto
Loading required namespace: tcltk
Loading Tcl/Tk interface ... done
Loading required package: chron
Loading required package: RSQLite
Loading required package: RSQLite.extfuns
> summ3 <- sqldf("select SELFEMPL, OWNRENT, avg(INCOME) as INCOME, avg(BAD) from df
+ group by SELFEMPL, OWNRENT")
Loading required package: tcltk
> print(summ3)
SELFEMPL OWNRENT INCOME avg(BAD)
1 0 0 2133.314 0.08470957
2 0 1 2881.201 0.06293210
3 1 0 2742.247 0.06896552
4 1 1 3487.910 0.05316973
>
> # METHOD 4: USING SQL.SELECT()
> source("http://sqlselect.googlecode.com/svn/trunk/sql.select.R")
Creating a generic function for ‘as.data.frame’ from package ‘base’ in the global environment
> summ4 <- sql.select("select SELFEMPL, OWNRENT, `mean(INCOME)` as INCOME, `mean(BAD)` as BAD
+ from df group by SELFEMPL, OWNRENT")
> print(summ4)
SELFEMPL OWNRENT INCOME BAD
1 0 0 2133.314 0.08470957
2 0 1 2881.201 0.06293210
3 1 1 3487.910 0.05316973
4 1 0 2742.247 0.06896552
Efficiency Comparison among 4 Methods above
> test1 <- function(n){
+ for (i in 1:n){
+ summ1 <- aggregate(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], mean)
+ }
+ }
> system.time(test1(10))
user system elapsed
0.404 0.036 0.513
>
> test2 <- function(n){
+ for (i in 1:n){
+ temp2 <- by(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], colMeans)
+ summ2 <- cbind(expand.grid(dimnames(temp2)), do.call(rbind, temp2))
+ }
+ }
> system.time(test2(10))
user system elapsed
0.244 0.020 0.309
>
> test3 <- function(n){
+ for (i in 1:n){
+ summ3 <- sqldf("select SELFEMPL, OWNRENT, avg(INCOME) as INCOME, avg(BAD) from df
+ group by SELFEMPL, OWNRENT")
+ }
+ }
> system.time(test3(10))
user system elapsed
0.956 0.112 1.178
>
> test4 <- function(n){
+ for (i in 1:n){
+ summ4 <- sql.select("select SELFEMPL, OWNRENT, `mean(INCOME)` as INCOME, `mean(BAD)` as BAD
+ from df group by SELFEMPL, OWNRENT")
+ }
+ }
> system.time(test4(10))
user system elapsed
0.432 0.112 0.601
To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.