Benchmarking Data Tables

[This article was first published on R on The Data Sandbox, and kindly contributed to R-bloggers.]

When I started learning R, I heard vague tales of the use of Data Tables. Really just whispers, of something to consider in the future after I’d become more proficient. Well, now is the time to learn what, if anything, I’ve been missing out on.

Introduction

Data Tables are a potential replacement for the common data frame. They fill the same role but promise improved performance. I would like to compare the speed of Data Frames, Data Tables, and Tibbles. I will use the microbenchmark package to perform the actual benchmarking.

library(tidyverse)
library(data.table)
library(microbenchmark)
library(farff)

For the benchmark, I will use the ‘credit-g’ dataset, which can be found on the OpenML website. I’m pretty sure the last OpenML dataset I used was a CSV file, but they seem to have moved to an ARFF format. I will need the farff package to load the data.

df <- farff::readARFF('dataset_31_credit-g.arff')
dt <- as.data.table(df) # a copy; setDT(df) would convert df itself by reference
ti <- tibble(df)

Syntax

The syntax for Data Tables is a little different:

DT[i,j,by]

In this manner, a data table is subset by i, to calculate j, grouped by by. Along with the special syntax, there are some helper functions that simplify common operations.
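As a minimal illustration of the three parts, here is a small toy table (not the credit data):

```r
library(data.table)

# A toy table standing in for the credit-g data
toy <- data.table(class = c("good", "bad", "good", "bad"),
                  age   = c(30, 45, 50, 25))

# i subsets the rows, j computes an expression, by groups the computation
toy[age > 28, .(avg = mean(age)), by = class]
```

Here i keeps the rows with age over 28, j computes the mean age, and by splits that computation by class.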

.()

The ‘.()’ function is an alias for ‘list()’. Lists are used to select columns and to build expressions in the j and by arguments.
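For example, these two calls are equivalent (again on a toy table, not the credit data):

```r
library(data.table)

toy <- data.table(a = 1:3, b = c("x", "y", "z"))

# .() is just shorthand for list() inside the brackets;
# both calls select columns a and b
identical(toy[, .(a, b)], toy[, list(a, b)])
```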

Grouped Aggregate

Aggregating data in Data Tables is simple using the j and by parameters. Multiple functions, or even multiple groupings, can be passed with the ‘.()’ function. For this comparison, we will look at the performance of finding the average age of the credit holders, grouped by class (the credit rating).

group <- microbenchmark(
  Data_Frame = df %>%
    group_by(class) %>%
    summarise(avg = mean(age)),
  Data_Table = dt[, .(avg = mean(age)), by = class],
  Tibble = ti %>%
    group_by(class) %>%
    summarise(avg = mean(age))
)
print(group)
## Unit: microseconds
##       expr      min        lq      mean    median        uq     max neval
## Data_Frame 8950.201  9905.301 11250.274 10430.800 11002.151 29392.6   100
## Data_Table  942.301  1277.851  1599.809  1454.551  1589.851 12496.6   100
##     Tibble 9288.001 10204.701 11759.341 10727.901 11805.851 45636.2   100

Taking counts

Another feature of interest is the ‘.N’ symbol, which returns the count of rows. The test counts the number of people with more than 5000 in credit and younger than 35.

counts <- microbenchmark(
  Data_Frame = df %>%
    filter(credit_amount > 5000, age < 35) %>%
    nrow(),
  Data_Table = dt[credit_amount > 5000 & age < 35, .N],
  Tibble = ti %>%
    filter(credit_amount > 5000, age < 35) %>%
    nrow()
)
print(counts)
## Unit: microseconds
##       expr       min        lq      mean    median        uq       max neval
## Data_Frame 13320.601 14479.852 16289.165 15315.751 15982.151 28997.401   100
## Data_Table   393.901   554.151   698.718   736.151   780.401  1698.802   100
##     Tibble 14732.101 15575.950 17111.604 16231.801 16768.300 43220.901   100

Creating new columns

Data Tables also provide a very concise syntax for creating a new column with ‘:=’. I compare this to the tidyverse mutate function. Using base R to create a column is still the fastest method, taking about half the time of the Data Table method.

new <- microbenchmark(
  Data_Frame = df %>% mutate(property = paste(property_magnitude, housing)),
  Data_Table = dt[, property := paste(property_magnitude, housing)],
  Tibble = ti %>% mutate(property = paste(property_magnitude, housing))
)
print(new)
## Unit: microseconds
##       expr      min       lq     mean   median       uq       max neval
## Data_Frame 3135.701 3492.601 4049.416 3883.201 4487.702  7376.201   100
## Data_Table  702.702  885.051 1080.750 1023.301 1157.851  3705.001   100
##     Tibble 3921.101 4548.501 5573.877 4943.201 5622.801 17441.400   100
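The base R assignment mentioned above might look like this (a sketch on a toy stand-in for the credit data, using the same column names):

```r
library(data.table)

# Toy stand-in for the credit data
df2 <- data.frame(property_magnitude = c("real estate", "car"),
                  housing            = c("own", "rent"))
dt2 <- as.data.table(df2)

# Base R: assign a new column directly
df2$property <- paste(df2$property_magnitude, df2$housing)

# data.table: add the column by reference with :=
dt2[, property := paste(property_magnitude, housing)]
```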

Chaining Data Tables

Another point of exploration is that Data Table operations can be chained together to build more complicated transformations.

dt[credit_amount > 1000, .(age = mean(age)), by = .(purpose, class)][
  class == "good" & age < mean(age)]
##                purpose class      age
## 1:            radio/tv  good 35.44865
## 2: furniture/equipment  good 33.21930
## 3:            used car  good 36.91860
## 4:            business  good 34.50000
## 5:  domestic appliance  good 35.50000
## 6:          retraining  good 34.00000

I don’t think this is the most useful feature, as you can already express some very complicated transformations in a single call. Chaining also makes the code harder to read.

Conclusions

It is clear that there are significant performance improvements when using Data Tables instead of Data Frames (an average reduction in runtime of about 86%). There is no significant difference between Data Frames and Tibbles. The syntax for Data Tables is also fairly simple and straightforward, yet extremely powerful.

So, to answer the most important question: should you switch from Data Frames to Data Tables? Probably. They offer a significant performance gain, and their structure is very flexible.

