Effortlessly Read Rectangular Data: R Package `readit` 1.0.0 Released on CRAN

Ryan Price

4 years ago

[This article was first published on Another Blog About R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Another R package designed out of frustration, `readit` is now available. What follows is the README that you can find on GitHub; version 1.0.0 of readit is now on CRAN. Please feel free to submit requests, bug reports, etc.!

readit() may be the only data-read function you ever need. By wrapping other popular reader packages, like readr, readxl, haven, and jsonlite, readit provides one self-titled function to read almost anything that isn’t formatted like hot garbage. If you have faith that the underlying data is of modest quality, and don’t care how it’s delimited or what its file extension suggests, then readit is for you.

This package was inspired by a handover at work; I took over as Maintainer for a package that dealt with a lot of disparate file extensions, and quickly became frustrated with trying to keep track of which filename was delimited in what way. “Why can’t I just… ***f@!#ing read it?!***” And lo, readit was born!

Features

readit is a pretty straightforward R package. It only exports one function, readit(), which wraps most of the reader functions in readr, readxl, haven, and jsonlite. Any arguments that you would normally pass to those functions can be passed to readit() as well.
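To make the "pass any arguments through" idea concrete, here is a minimal sketch of the pattern. It is not readit's actual source: `read_any()` is a made-up name, and base R's `read.csv()` stands in for the readr readers so the example runs without extra packages.

```r
# Hypothetical wrapper, NOT readit's actual source: it forwards any extra
# arguments (...) to the underlying reader unchanged.
read_any <- function(path, ...) {
  # Base R reader used here only to keep the sketch dependency-free.
  utils::read.csv(path, ...)
}

# Write a small throwaway CSV to demonstrate.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name", "1,a", "2,b", "3,c"), tmp)

# `nrows` and `stringsAsFactors` are read.csv() arguments, passed straight through.
first_two <- read_any(tmp, nrows = 2, stringsAsFactors = FALSE)
print(first_two)
```

The wrapper never needs to know which arguments the reader accepts; anything it doesn't use itself simply travels along in `...`.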

readit() uses some basic heuristics based on the file extension to call the appropriate read function, and if the extension is too ambiguous (like .txt files), readit() will perform some commonly implemented checks to guess the correct delimiter. readit() will always print out what file type it guessed (in nice, bold, green console text, via crayon) as a sanity check, and will throw an error if the file you give it is parsed and determined to be too messy to deal with automatically. For example, say you have some .txt file that you receive from a client each month, and it’s delimited differently every time (because that’s how it goes). Instead of inspecting it with four or five different functions first, you can just call readit() on it to pass it to readr’s… readers:

> readit("path/to/frustrating/file.txt")
File guessed to be pipe-delimited ("path/to/frustrating/file.txt")
Parsed with column specification:
cols(
  testheader1 = col_character(),
  testheader2 = col_character(),
  testheader3 = col_character(),
  testheader4 = col_character(),
  testheader5 = col_character()
)
# A tibble: 5 x 5
  testheader1 testheader2 testheader3 testheader4 testheader5
  <chr>       <chr>       <chr>       <chr>       <chr>
1 testdata11  testdata12  testdata13  testdata14  testdata15
2 testdata21  testdata22  testdata23  testdata24  testdata25
3 testdata31  testdata32  testdata33  testdata34  testdata35
4 testdata41  testdata42  testdata43  testdata44  testdata45
5 testdata51  testdata52  testdata53  testdata54  testdata55

Huzzah! It turns out that someone replaced all the delimiters with pipes (|), but with readit, that’s no problem! Just throw it into the great maw, and watch as the correct data comes back out.
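The README doesn't spell out which checks readit() runs, so the following is only a sketch of one common way to guess a delimiter: count each candidate character on the first few lines and keep the one that shows up equally often on every line. `guess_delim()` and its parameters are hypothetical names, not part of readit.

```r
# Hypothetical helper, not part of readit: guess a delimiter by counting
# candidate characters on the first few lines of a file.
guess_delim <- function(path, candidates = c(",", "\t", "|", ";"), n_lines = 10) {
  lines <- readLines(path, n = n_lines, warn = FALSE)
  lines <- lines[nzchar(lines)]
  if (length(lines) == 0) stop("File is empty: ", path)

  # For each candidate, count occurrences per line (fixed = TRUE avoids
  # treating "|" as a regex alternation).
  counts <- vapply(candidates, function(d) {
    per_line <- lengths(regmatches(lines, gregexpr(d, lines, fixed = TRUE)))
    # A usable delimiter appears at least once and equally often on every line.
    if (per_line[1] > 0 && all(per_line == per_line[1])) per_line[1] else 0L
  }, integer(1))

  if (all(counts == 0)) stop("Could not guess a delimiter for: ", path)
  candidates[which.max(counts)]
}

# Try it on a throwaway pipe-delimited file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("h1|h2|h3", "a|b|c", "d|e|f"), tmp)
guessed <- guess_delim(tmp)
print(guessed)
```

This simple version ignores quoting, so a quoted field that contains the delimiter would throw the counts off. A real implementation would need to handle that.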

What about if the same file becomes a sneaky tab-delimited file next month?

> readit("path/to/frustrating/file.txt")
File guessed to be tab-delimited ("path/to/frustrating/file.txt")
Parsed with column specification:
cols(
  testheader1 = col_character(),
  testheader2 = col_character(),
  testheader3 = col_character(),
  testheader4 = col_character(),
  testheader5 = col_character()
)
# A tibble: 6 x 5
  testheader1 testheader2 testheader3 testheader4 testheader5
  <chr>       <chr>       <chr>       <chr>       <chr>
1 testdata11  testdata12  testdata13  testdata14  testdata15
2 testdata21  testdata22  testdata23  testdata24  testdata25
3 testdata31  testdata32  testdata33  testdata34  testdata35
4 testdata41  testdata42  testdata43  testdata44  testdata45
5 testdata51  testdata52  testdata53  testdata54  testdata55
6 testdata61  testdata62  testdata63  testdata64  testdata65

Nope, no problem: readit() picked it up just fine, including the newest data.

What if your client starts storing the same data in Excel files, instead?

> readit("path/to/frustrating/file.xlsx")
File guessed to be xls/xlsx (Excel) ("path/to/frustrating/file.xlsx")
Parsed with column specification:
cols(
  testheader1 = col_character(),
  testheader2 = col_character(),
  testheader3 = col_character(),
  testheader4 = col_character(),
  testheader5 = col_character()
)
# A tibble: 6 x 5
  testheader1 testheader2 testheader3 testheader4 testheader5
  <chr>       <chr>       <chr>       <chr>       <chr>
1 testdata11  testdata12  testdata13  testdata14  testdata15
2 testdata21  testdata22  testdata23  testdata24  testdata25
3 testdata31  testdata32  testdata33  testdata34  testdata35
4 testdata41  testdata42  testdata43  testdata44  testdata45
5 testdata51  testdata52  testdata53  testdata54  testdata55
6 testdata61  testdata62  testdata63  testdata64  testdata65

readit() has you covered. What if that data is on the second Excel sheet, though? Pass sheet = 2 to readit(), exactly as you would to read_excel():

> readit("path/to/frustrating/file.xlsx", sheet = 2)
File guessed to be xls/xlsx (Excel) ("path/to/frustrating/file.xlsx")
Parsed with column specification:
cols(
  testheader1 = col_character(),
  testheader2 = col_character(),
  testheader3 = col_character(),
  testheader4 = col_character(),
  testheader5 = col_character()
)
# A tibble: 6 x 5
  testheader1 testheader2 testheader3 testheader4 testheader5
  <chr>       <chr>       <chr>       <chr>       <chr>
1 testdata11  testdata12  testdata13  testdata14  testdata15
2 testdata21  testdata22  testdata23  testdata24  testdata25
3 testdata31  testdata32  testdata33  testdata34  testdata35
4 testdata41  testdata42  testdata43  testdata44  testdata45
5 testdata51  testdata52  testdata53  testdata54  testdata55
6 testdata61  testdata62  testdata63  testdata64  testdata65
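For comparison, this is what the same `sheet` argument does when handed to readxl directly. The sketch assumes readxl is installed and uses the example workbook that ships with it, rather than the client file above.

```r
# Assumes the readxl package is installed; uses its bundled example workbook.
path <- readxl::readxl_example("datasets.xlsx")

# List the sheet names so you can see what position 2 refers to.
sheets <- readxl::excel_sheets(path)
print(sheets)

# `sheet` accepts a position (as here) or a sheet name.
second <- readxl::read_excel(path, sheet = 2)
print(head(second))
```

Since readit() hands its extra arguments on to the underlying reader, `sheet = 2` should mean the same thing in both calls.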

What if your client is a bunch of academics, and they send you the same data, but in SAS format?

> readit("path/to/frustrating/file.sas7bdat")
File guessed to be .sas7b*at (SAS) ("path/to/frustrating/file.sas7bdat")
Parsed with column specification:
cols(
  testheader1 = col_character(),
  testheader2 = col_character(),
  testheader3 = col_character(),
  testheader4 = col_character(),
  testheader5 = col_character()
)
# A tibble: 6 x 5
  testheader1 testheader2 testheader3 testheader4 testheader5
  <chr>       <chr>       <chr>       <chr>       <chr>
1 testdata11  testdata12  testdata13  testdata14  testdata15
2 testdata21  testdata22  testdata23  testdata24  testdata25
3 testdata31  testdata32  testdata33  testdata34  testdata35
4 testdata41  testdata42  testdata43  testdata44  testdata45
5 testdata51  testdata52  testdata53  testdata54  testdata55
6 testdata61  testdata62  testdata63  testdata64  testdata65

Still no worries (readit will pick up both .sas7bdat and .sas7bcat extensions).
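The extension-based part of the heuristic can be pictured as a simple lookup. The sketch below is an illustration only: `reader_for()` is a hypothetical name, the mapping is my guess at a sensible pairing of extensions with functions from the packages readit wraps, and how readit really treats a .sas7bcat catalog file is an assumption.

```r
# Hypothetical lookup, not readit's real internals: map a file extension to the
# name of a reader function from the packages the README lists.
reader_for <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (!nzchar(ext)) stop("No file extension found in: ", path)
  switch(ext,
    csv      = "readr::read_csv",
    txt      = "readr::read_delim",  # the delimiter still has to be guessed
    xls      = ,                     # empty entries fall through to the next one
    xlsx     = "readxl::read_excel",
    sas7bdat = ,
    sas7bcat = "haven::read_sas",    # assumption: both SAS extensions go to haven
    dta      = "haven::read_dta",
    sav      = "haven::read_sav",
    por      = "haven::read_por",
    json     = "jsonlite::fromJSON",
    stop("Unsupported file extension: .", ext)
  )
}

print(reader_for("path/to/frustrating/file.sas7bdat"))
print(reader_for("path/to/frustrating/file.XLSX"))
```

Returning the function name as a string keeps the sketch runnable without any of those packages installed; a real dispatcher would call the function instead.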

In fact, readit is able to read all of the following file types, so long as the file has an extension, and will take any arguments you would want to pass to the underlying functions:

.txt (but not fixed-width, for obvious reasons)
.csv
.xls/.xlsx
.sas7bdat/.sas7bcat (SAS files)
.dta (Stata files)
.sav/.por (SPSS files)
.json (JSON arrays, which are parsed into data frames, like in loggit)
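As a quick illustration of the JSON case, this is how jsonlite on its own turns an array of records into a data frame. The sketch assumes jsonlite is installed, the records are made up, and the README doesn't say exactly how readit() calls jsonlite internally.

```r
# Assumes the jsonlite package is installed; the records here are made up.
json_text <- '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

# An array of objects with the same keys is simplified into a data frame,
# one row per object and one column per key.
records <- jsonlite::fromJSON(json_text)
str(records)
```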

Future work

Add support for reader functions from the foreign package.

Installation

You can install the latest CRAN release of readit via install.packages("readit").

Or, to get the latest development version from GitHub:

Via devtools:

devtools::install_github("ryapric/readit")

Or, clone & build from source:

cd /path/to/your/repos
git clone https://github.com/ryapric/readit.git readit
R CMD INSTALL readit

To use the most recent development version of readit in your own package, you can include it in the Remotes: field of your DESCRIPTION file:

Remotes: github::ryapric/readit

Note that packages being submitted to CRAN cannot have a Remotes field. Refer here for more info.

License

MIT © Ryan J. Price, 2018.

To leave a comment for the author, please follow the link and comment on their blog: Another Blog About R.

