
Handling Large Datasets in R

[This article was first published on Quantitative Finance Collector, and kindly contributed to R-bloggers.]
Handling large datasets in R, especially CSV data, was briefly discussed before in Excellent free CSV splitter and Handling Large CSV Files in R. My file at that time was around 2GB, with 30 million rows and 8 columns. Recently I started to collect and analyze US corporate bond tick data from 2002 to 2010, and the CSV file I got is 6.18GB with 40 million rows, even after removing biased records as described in Biases in TRACE Corporate Bond Data.

How to proceed efficiently? Below is an excellent presentation on handling large datasets in R by Ryan Rosario at http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/. A short summary of the presentation:
1. R has several packages for big-data support. The presentation covers bigmemory and ff, as well as parallelism via Hadoop and MapReduce;
2. the data used in the presentation is an 11GB comma-separated file with 120 million rows and 29 columns;
3. for datasets around 10GB in size, bigmemory and ff handle themselves well (see the sketch after this list);
4. for larger datasets, use Hadoop.
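To make the summary concrete, here is a minimal sketch of the two file-backed approaches. This is not code from the presentation; the file names, column type, and chunk sizes are placeholder assumptions:

library(bigmemory)
# Map the CSV onto a file-backed big.matrix: the data stay on disk rather than
# in RAM. Note a big.matrix holds a single numeric type, so factor/character
# columns would need to be recoded first.
x <- read.big.matrix("yourdata.csv", header = TRUE, type = "double",
                     backingfile = "yourdata.bin",
                     descriptorfile = "yourdata.desc")

library(ff)
# ff reads the CSV in chunks into an ffdf, which supports mixed column types.
ffd <- read.csv.ffdf(file = "yourdata.csv", header = TRUE,
                     first.rows = 100000, next.rows = 100000)
dim(ffd)  # rows and columns, without loading the data into memory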

[Slides: Taking R to the Limit (High Performance Computing in R), Part 2 — Large Datasets, LA R Users' Group 8/17/10, by Ryan Rosario.]


By the way, determining the number of rows of a very big file is tricky: you don't want to load the whole dataset and call dim(), which can easily run out of memory. One way is to count the rows in chunks with readLines(), for example:
data <- gzfile("yourdata.csv.gz", open = "r")  # gzfile() reads gzip files; use unz() for zip archives
MaxRows <- 50000    # chunk size: rows read per iteration
TotalRows <- 0
# Read the file in chunks, adding up the number of lines actually returned;
# readLines() returns fewer than MaxRows lines once the end of file is reached.
while ((LinesRead <- length(readLines(data, MaxRows))) > 0) {
    TotalRows <- TotalRows + LinesRead
}
close(data)
TotalRows           # total line count (subtract 1 if there is a header row)
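Reading in fixed-size chunks keeps peak memory bounded by MaxRows lines at a time, regardless of how large the file is; the loop simply runs until readLines() returns an empty vector at end of file.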

Tags: data, csv
Read the full post at Handling Large Datasets in R.

To leave a comment for the author, please follow the link and comment on their blog: Quantitative Finance Collector.
