Before heading home for the holidays, I had a large data set (1.6 GB, over 1.25 million rows) of text and integer columns ripped out of the company (Kwelia) database and dumped into a .csv file, since I was going to be offline a lot over the break. I tried opening the csv file in the usual way:
all <- read.csv("file.csv")
However, it never finished, even after letting it run all night. I also tried reading it into a SQLite database first and pulling it back out of that, but the file was so messy it kept coming back with errors. I finally got it read into R by using the ff package and the following code:
library("ff") x<- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, first.rows=10000, next.rows=50000, colClasses=NA)
Because the file was so messy, I set colClasses=NA so the read wouldn't try to assign each column a class based on the first 10,000 rows. After reading those first 10,000 rows, the script reads the rest in chunks of 50,000 so it doesn't completely overload the RAM on my laptop. I also turned on VERBOSE, because it would drive me nuts not to be able to follow the progress.
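Once the read finishes, x is an ffdf object rather than an ordinary data frame, so the data stays memory-mapped on disk instead of sitting in RAM. A quick sketch of how you might sanity-check the result (assuming the read above succeeded; the object names are just illustrative):

dim(x)                    # rows and columns that were actually read
class(x)                  # "ffdf" -- the data lives on disk, not in memory
first_rows <- x[1:1000, ] # subsetting an ffdf returns an ordinary data frame
str(first_rows)           # check what classes the columns ended up with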
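For what it's worth, the SQLite route that errored out for me can be done in a couple of ways; one common one is read.csv.sql from the sqldf package, which stages the csv in a temporary SQLite database and returns the query result as a data frame. This is only a rough sketch of that approach, not necessarily the exact call I used:

library(sqldf)
all <- read.csv.sql("file.csv", sql = "select * from file", header = TRUE)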