Site icon R-bloggers

New packages for reading data into R — fast

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hadley Wickham and the RStudio team have created some new packages for R, which will be very useful for anyone who needs to read data into R (that is, everyone). The readr package provides functions for reading text data into R, and the readxl package provides functions for reading Excel spreadsheet data into R. Both are much faster than the functions you're probably using now.

The readr package provides several functions for reading tabular text data into R. This is a task normally accomplished with the read.table family of functions in R, and readr provides a number of replacement functions that provide additional functionality and are much faster.

First, there's read_table which provides a near-replacement for read.table. Here's a comparison of using both functions on a file with 4 million rows of data (which I created by stacking copies of this file):

dat <- read_table("biggerfile.txt",
  col_names=c("DAY","MONTH","YEAR","TEMP"))

dat2 <- read.table("biggerfile.txt",
  col.names=c("DAY","MONTH","YEAR","TEMP"))

The commands look quite similar, but while read.table took just over 30 seconds to complete, readr's read_table accomplished the same task in less than a second. The trick is that read_table treats the data as a fixed-format file, and uses C++ to process the data quickly. (One small caveat is that read.table supports arbitrary amounts of whitespace between columns, while read_table requires the columns be lined up exactly. In practice, this isn't much of a restriction.)

Base R has a function for reading fixed-width data too, and here readr really shines:

dat <- read_fwf("biggerfile.txt",
  fwf_widths(c(3,15,16,12),
    col_names=c("DAY","MONTH","YEAR","TEMP")))

dat2 <- read.fwf("biggerfile.txt", c(3,15,16,12),
  col.names=c("DAY","MONTH","YEAR","TEMP"))

While readr's read_fwf again accomplished the task in about a second, the standard read.fwf took over 3 minutes — almost 200 times as long.

Other functions in the package include read_csv (and a European-friendly variant read_csv2) for comma-separated data, read_tsv for tab-separated data, and read_lines for line-by-line file extraction (great for complicated post-processing). The package also makes it much easier to read columns of dates in various formats, and sensibly always handles text data as strings (no more strings.as.factors=FALSE). 

For data in Excel format, there's also the new readxl package. This package provides function to read Excel worksheets in both .xls and .xlsx formats. I haven't benchmarked the read_excel function myself, but like the readr functions it's based on a C++ library so should be quite snappy. And best of all, it has no external dependencies, so you can use it to read Excel data on just about any platform — there's no requirement that Excel itself be installed.

The readr package is on CRAN now, and readxl can be installed from GiHub. If you try them yourself, let us know how it goes in the comments.

RStudio blog: readr 0.1.0

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.