
data.table or data.frame?


I spent a portion of today trying to convince a colleague that there are times when the data.table package is faster than traditional methods in R. It took a few of the tests below to prove the point.
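
One caveat before the tests: a single proc.time() measurement captures one run, so it is sensitive to caching and whatever else the machine is doing. A quick sketch with the microbenchmark package (not used in the timings below) repeats each expression and reports the spread; a small sample keeps the repetitions fast:

  library(microbenchmark)
  # toy-sized data so ten repetitions stay quick
  small <- data.frame(letters = sample(letters[1:10], 1e+05, replace = TRUE),
      numbers = sample(1:100, 1e+05, replace = TRUE))
  microbenchmark(aggregate = aggregate(numbers ~ letters, data = small, FUN = sum),
      times = 10)

The one-shot timings below are still informative at this scale, since each run takes minutes rather than microseconds.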

Generate a data.frame of characters and numbers for the aggregation tests.

  df <- data.frame(letters = as.character(sample(letters[1:10], 1e+08, replace = TRUE)), 
      numbers = sample(1:100, 1e+08, replace = TRUE))
  head(df)

  ##   letters numbers
  ## 1       f      69
  ## 2       j      65
  ## 3       h      29
  ## 4       c      69
  ## 5       j      12
  ## 6       e      65
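
Since sample() is random, the exact sums below will vary from run to run. Setting a seed before the sampling above makes them reproducible (a detail worth adding; any fixed seed works):

  # run before the sample() calls above for repeatable draws
  set.seed(1)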

Aggregate using the base R function aggregate.

  start <- proc.time()
  aggregate(numbers ~ letters, data = df, FUN = sum)

  ##    letters   numbers
  ## 1        a 504884636
  ## 2        b 504587923
  ## 3        c 505357057
  ## 4        d 505106809
  ## 5        e 504788174
  ## 6        f 505219078
  ## 7        g 504796095
  ## 8        h 504693166
  ## 9        i 505079861
  ## 10       j 505044118

  aggregate_time <- proc.time() - start
  aggregate_time

  ##    user  system elapsed 
  ##  120.13   30.51  261.79
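
As an aside, aggregate is not the only base R option: tapply and rowsum compute the same grouped sums, usually faster, at the cost of returning a named vector or matrix instead of a data.frame. A sketch, not part of the timed comparison:

  # per-letter sums as a named vector and a one-column matrix
  tapply(df$numbers, df$letters, sum)
  rowsum(df$numbers, df$letters)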

Aggregate using ddply from the package plyr.

  require("plyr")

  ## Loading required package: plyr

  start <- proc.time()
  ddply(df, .(letters), summarize, sums = sum(numbers))

  ##    letters      sums
  ## 1        a 504884636
  ## 2        b 504587923
  ## 3        c 505357057
  ## 4        d 505106809
  ## 5        e 504788174
  ## 6        f 505219078
  ## 7        g 504796095
  ## 8        h 504693166
  ## 9        i 505079861
  ## 10       j 505044118

  ddply_time <- proc.time() - start
  ddply_time

  ##    user  system elapsed 
  ##   22.04   27.38  192.99
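
plyr's successor, dplyr, expresses the same aggregation and is generally much faster than ddply. A sketch for anyone who wants to extend the comparison (dplyr is not timed here):

  library(dplyr)
  df %>%
      group_by(letters) %>%
      summarise(sums = sum(numbers))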

Aggregate using the data.table package.

  require("data.table")

  ## Loading required package: data.table

  start <- proc.time()
  dt <- data.table(df, key = "letters")
  dt[, list(sums = sum(numbers)), by = c("letters")]

  ##     letters      sums
  ##  1:       a 504884636
  ##  2:       b 504587923
  ##  3:       c 505357057
  ##  4:       d 505106809
  ##  5:       e 504788174
  ##  6:       f 505219078
  ##  7:       g 504796095
  ##  8:       h 504693166
  ##  9:       i 505079861
  ## 10:       j 505044118

  dt_time <- proc.time() - start
  dt_time

  ##    user  system elapsed 
  ##   7.102   7.017  55.957
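
Note that data.table(df, key = "letters") copies all 1e+08 rows. setDT() converts the data.frame by reference instead, which should trim the conversion overhead included in dt_time. A variant worth trying (assuming a data.table version that provides setDT):

  # convert in place rather than copying, then aggregate as before
  setDT(df, key = "letters")
  df[, list(sums = sum(numbers)), by = letters]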

Comparison of the system times.

  # how many times slower is aggregate
  aggregate_time[2]/ddply_time[2]

  ## sys.self 
  ##    1.114

  aggregate_time[2]/dt_time[2]

  ## sys.self 
  ##    4.347

  
  # how many times slower is ddply
  ddply_time[2]/aggregate_time[2]

  ## sys.self 
  ##   0.8975

  ddply_time[2]/dt_time[2]

  ## sys.self 
  ##    3.902

  
  # how many times slower is data.table
  dt_time[2]/aggregate_time[2]

  ## sys.self 
  ##     0.23

  dt_time[2]/ddply_time[2]

  ## sys.self 
  ##   0.2563
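
These ratios use element [2] of proc.time(), i.e. system time. Wall-clock comparisons use the "elapsed" element instead, and tell the same story here: 261.79 / 55.957 is roughly a 4.7x win for data.table over aggregate.

  # wall-clock ratios, indexed by name rather than position
  aggregate_time["elapsed"] / dt_time["elapsed"]
  ddply_time["elapsed"] / dt_time["elapsed"]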

Based on 100 million observations (1e+08), with the time to convert to a data.table included in data.table's timing:

  1. ddply requires ~0.8975x the system time of aggregate (slightly less, not more)
  2. aggregate requires ~4.347x more system time than data.table
  3. ddply requires ~3.902x more system time than data.table

Conclusion – data.table for the win.
