data.table or data.frame?
[This article was first published on - R, and kindly contributed to R-bloggers.]
I spent a portion of today trying to convince a colleague that there are times when the data.table
package is faster than traditional methods in R. It took a few of the tests below to prove the point.
Generate a data.frame of characters and numbers to aggregate.
```r
df <- data.frame(letters = as.character(sample(letters[1:10], 1e+08, replace = TRUE)),
                 numbers = sample(1:100, 1e+08, replace = TRUE))
head(df)
##   letters numbers
## 1       f      69
## 2       j      65
## 3       h      29
## 4       c      69
## 5       j      12
## 6       e      65
```
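The 1e+08-row frame above takes a while to build and isn't seeded. A smaller, seeded sketch of the same construction is handy for experimenting before running the full benchmark (the seed and the 1e5 row count are my choices, not the author's):

```r
set.seed(42)  # assumed seed, for reproducibility only
n <- 1e5      # far smaller than the 1e+08 rows used in the benchmark
df_small <- data.frame(
  letters = sample(letters[1:10], n, replace = TRUE),
  numbers = sample(1:100, n, replace = TRUE)
)
head(df_small)
```

The same aggregation code below runs unchanged on df_small; only the timings shrink.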
Aggregate using the base R function aggregate.
```r
start <- proc.time()
aggregate(numbers ~ letters, data = df, FUN = sum)
##    letters   numbers
## 1        a 504884636
## 2        b 504587923
## 3        c 505357057
## 4        d 505106809
## 5        e 504788174
## 6        f 505219078
## 7        g 504796095
## 8        h 504693166
## 9        i 505079861
## 10       j 505044118
aggregate_time <- proc.time() - start
aggregate_time
##    user  system elapsed
##  120.13   30.51  261.79
```
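aggregate's formula interface is convenient but carries overhead at this scale. For reference, base R's tapply computes the same grouped sums directly; a small self-contained example on toy data (not the benchmark frame):

```r
# two groups, two values each
df2 <- data.frame(letters = rep(c("a", "b"), each = 2),
                  numbers = c(1, 2, 3, 4))
# sum "numbers" within each level of "letters"
sums <- tapply(df2$numbers, df2$letters, sum)
sums
# a = 3, b = 7
```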
Aggregate using ddply from the plyr package.
```r
require("plyr")
## Loading required package: plyr
start <- proc.time()
ddply(df, .(letters), summarize, sums = sum(numbers))
##    letters      sums
## 1        a 504884636
## 2        b 504587923
## 3        c 505357057
## 4        d 505106809
## 5        e 504788174
## 6        f 505219078
## 7        g 504796095
## 8        h 504693166
## 9        i 505079861
## 10       j 505044118
ddply_time <- proc.time() - start
ddply_time
##   user  system elapsed
##  22.04   27.38  192.99
```
Aggregate using the data.table package.
```r
require("data.table")
## Loading required package: data.table
start <- proc.time()
dt <- data.table(df, key = "letters")
dt[, list(sums = sum(numbers)), by = c("letters")]
##     letters      sums
##  1:       a 504884636
##  2:       b 504587923
##  3:       c 505357057
##  4:       d 505106809
##  5:       e 504788174
##  6:       f 505219078
##  7:       g 504796095
##  8:       h 504693166
##  9:       i 505079861
## 10:       j 505044118
dt_time <- proc.time() - start
dt_time
##   user  system elapsed
##  7.102   7.017  55.957
```
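In current data.table releases the list()/c() spelling above is usually shortened to .() with an unquoted by; both forms give the same result. A self-contained sketch on toy data (requires the data.table package; the toy values are mine):

```r
library(data.table)
dt2 <- data.table(letters = rep(c("a", "b"), each = 3),
                  numbers = c(1, 2, 3, 10, 20, 30))
# .() is data.table's alias for list(); by can name the column unquoted
res <- dt2[, .(sums = sum(numbers)), by = letters]
res
# a: 6, b: 60
```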
Comparison of the system times.
```r
# how many times slower is aggregate
aggregate_time[2]/ddply_time[2]
## sys.self
##    1.114
aggregate_time[2]/dt_time[2]
## sys.self
##    4.347
# how many times slower is ddply
ddply_time[2]/aggregate_time[2]
## sys.self
##   0.8975
ddply_time[2]/dt_time[2]
## sys.self
##    3.902
# how many times slower is data.table
dt_time[2]/aggregate_time[2]
## sys.self
##     0.23
dt_time[2]/ddply_time[2]
## sys.self
##   0.2563
```
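The manual start/proc.time() bookkeeping used throughout can also be written with system.time(), which evaluates an expression and returns the same user/system/elapsed triple; a minimal sketch on a small workload:

```r
# system.time() evaluates the block and reports CPU and wall-clock time
timing <- system.time({
  x <- sample(1:100, 1e6, replace = TRUE)
  total <- sum(x)
})
timing["elapsed"]  # wall-clock seconds for the block above
```

Indexing timing[2] (or timing["sys.self"]) gives the system-time component that the ratios above compare.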
Based on 100 million observations, with the time to convert to a data.table included in the elapsed time:
- ddply requires ~0.8975x the system time of aggregate (slightly less)
- aggregate requires ~4.347x more system time than data.table
- ddply requires ~3.902x more system time than data.table
Conclusion - data.table for the win.