[This article was first published on Daniel Marcelino's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have several tables that I would like to load as a sole data frame. Derived functions from read. table () have a lot of convenient features, but it seems like there is a lot of steps in the implementation that would slow things down. The gain in performance of reading 29 CSV files (about 2.2 GB) shows quite different picture. While the parallelization process does bring some improvement considering the ‘user time’, i.e. the CPU time charged for the process execution at the machine level, the ‘elapsed time’, i.e. the ‘real’ elapsed time since the process was started doesn’t show much difference. Let’s go through it.
Data
Function
Results
Without Paralelize
With Paralelize
It’s important to realise that every bit of optimisation matters. But it would not help us if the outcome data frames were different, don’t you agree? Luckily all 174 million records do match.
Related
To leave a comment for the author, please follow the link and comment on their blog: Daniel Marcelino's Blog.