collapse might rock your world
This post was prompted by the NHS-R Community, specifically a question posted in the help-with-r channel on Slack.
Someone asked about obtaining the unique values in a 60M row / 6 column dataset.
Naturally, data.table came up, and a few of us made suggestions for improvement, but we didn’t seem to be hitting on a magic solution.
Without being able to get my hands on the actual data, I decided to have a play around at home.
I don't know what the real-life data structure was, but this is the simple 4 column dataset, named testdata, that I produced initially:
str(testdata)
# Classes 'data.table' and 'data.frame': 60000000 obs. of 4 variables:
#  $ RANDOM_STRING: chr  "aaaaa" "aaaaa" "aaaaa" "aaaaa" ...
#  $ START_DATE   : Date, format: "2015-01-10" "2015-01-11" "2015-01-13" "2015-01-24" ...
#  $ randint      : int  1 4 4 5 5 3 8 9 4 5 ...
#  $ EXPIRY_DATE  : Date, format: "2015-11-02" "2015-10-29" "2015-10-27" "2015-09-11" ...
#  - attr(*, ".internal.selfref")=<externalptr>
#  - attr(*, "sorted")= chr [1:4] "RANDOM_STRING" "START_DATE" "EXPIRY_DATE" "randint"
This was created and saved to an RDS file, and converted to a keyed data.table as part of the process.
haskey(testdata)
# [1] TRUE
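The post doesn't show the generation code, but a comparable keyed dataset could be built along these lines (a sketch; the specific string pool, date ranges and integer range are my guesses, chosen only to roughly match the str() output above):

```r
library(data.table)

set.seed(42)
n <- 6e7  # 60M rows

# Small pool of random 5-letter strings, recycled across rows
string_pool <- replicate(26, paste(sample(letters, 5, replace = TRUE), collapse = ""))

testdata <- data.table(
  RANDOM_STRING = sample(string_pool, n, replace = TRUE),
  START_DATE    = as.Date("2015-01-01") + sample(0:364, n, replace = TRUE),
  randint       = sample(1:10, n, replace = TRUE),
  EXPIRY_DATE   = as.Date("2015-06-01") + sample(0:364, n, replace = TRUE)
)

# Key the table (this sorts it and sets the "sorted" attribute), then save
setkey(testdata, RANDOM_STRING, START_DATE, EXPIRY_DATE, randint)
saveRDS(testdata, "testdata.rds")
```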
First up, calling data.table::unique
system.time(uniqdat <- unique(testdata))
Timings to return 5993540 unique observations:
##############################
#    user  system elapsed
#   20.01   10.52   37.42
##############################
Does it make any difference if I don’t assign to a variable ?
system.time(unique(testdata))
##############################
#    user  system elapsed
#   17.88    8.50   25.47
##############################
Looks like it does.
Now, in the real-life example, the analyst was seeing severe timing issues, on the order of hours rather than seconds. While researching alternative solutions, I chanced upon the collapse package.
It's pretty amazing. After a couple of false starts because I hadn't read the documentation, I successfully called funique: you have to specify the columns you want to find unique values for.
system.time(uniqdat2 <- funique(testdata, cols = 1:4))
##############################
#    user  system elapsed
#    4.33    1.74    8.23
##############################
Testing without assignment:
system.time(funique(testdata, cols = 1:4))
##############################
#    user  system elapsed
#    4.66    2.60    8.00
##############################
Pretty cool - a huge reduction in time.
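A speed gain like this is only useful if both approaches return the same rows, so a sanity check is worthwhile. This sketch (not from the original post) assumes the uniqdat and uniqdat2 objects created above; data.table::fsetequal compares the contents of two data.tables while ignoring row order:

```r
library(data.table)

# Both methods should find the same number of unique rows...
nrow(uniqdat) == nrow(uniqdat2)

# ...and the same rows, regardless of ordering
fsetequal(uniqdat, as.data.table(uniqdat2))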
From the introductory vignette:
"collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:
To facilitate complex data transformation, exploration and computing tasks in R.
To help make R code fast, flexible, parsimonious and programmer friendly."
I look forward to getting more familiar with this package.
Finally, out of interest, I tried dplyr timings. I had already seen that dplyr often outperforms data.table for finding unique / distinct values:
# dplyr for thoroughness
library(dplyr)
system.time(uniqdat3 <- distinct(testdata))
##############################
#    user  system elapsed
#   14.87    2.86   20.82
##############################
Not as fast as collapse, but faster than data.table.
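For a more systematic comparison than one-off system.time() calls, the bench package could be used (a suggestion of mine, not something from the original exercise). bench::mark runs each expression repeatedly and also reports memory allocations, which matters on a RAM-constrained machine:

```r
library(bench)

# Sketch: compare all three approaches on the same data.
# check = FALSE because the methods agree on content but may
# return rows in a different order, which bench would flag.
bench::mark(
  data.table = unique(testdata),
  collapse   = collapse::funique(testdata, cols = 1:4),
  dplyr      = dplyr::distinct(testdata),
  check = FALSE
)
```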
I also tried without assigning to a variable, but each time I got 'out of memory' errors.
My personal laptop is a bit underpowered for this dataset (only 8GB RAM) and it couldn’t handle it.
So now I am in the process of upgrading my laptop (it's easier to swap out the SSD and RAM than to buy a new one), which will give me double the memory, at which point I might try this again.
I'm impressed with collapse; it's a really nice package (just look at the detail in the logo, you can tell this is the real deal), and I can't wait to become more conversant with its functionality.
Post publishing update
I upgraded my RAM (now 16GB) and repeated the exercise:
Timings are as follows:
#  user  system elapsed
# 19.31    1.46   14.82   data.table (with assignment)
# 18.38    1.64   13.72   data.table (no assignment)
#  4.70    0.59    5.42   collapse (with assignment)
#  4.36    0.50    4.92   collapse (no assignment)
# 13.52    0.33   14.18   dplyr (with assignment)
# 12.65    0.53   13.45   dplyr (no assignment)