
Create Parquet Files From R Data Frames With sergeant & Apache Drill (a.k.a. Make Parquet Files Great Again in R)


Apache Drill is a nice tool to have in the toolbox as it provides a SQL front-end to a wide array of database and file back-ends and runs in standalone/embedded mode on every modern operating system (i.e. you can get started with Drill or play with it locally without needing a Hadoop cluster, yet scale up almost effortlessly). It’s also more lightweight than Spark and a nice alternative to it if you only need data wrangling and not the machine-learning functionality in Spark’s MLlib.

When you’re in this larger-data world, parquet files are one of the core data storage formats. They’re designed to be compact and are optimized for columnar operations. Unlike CSV, JSON or even R data files, a parquet file doesn’t need to be read or scanned in its entirety to filter, select, aggregate, etc. across columns. Unfortunately, parquet files aren’t first-class citizens in R. Well, they aren’t yet, but thanks to this project it might not be too difficult to make an R interface to them. For now, though, you have to use some other means to convert or read parquet files.

Spark and sparklyr can help you write parquet files, but I don’t need to run Spark all the time.
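For the record, the sparklyr route looks something like this. A minimal sketch, assuming a working local Spark installation; the output path is illustrative:

library(sparklyr)
library(dplyr)

# connect to a local Spark instance (assumes spark_install() has been run)
sc <- spark_connect(master = "local")

# copy the data frame into Spark, then write it back out as parquet
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
spark_write_parquet(flights_tbl, "/tmp/flights_parquet")

spark_disconnect(sc)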

If you’re already a Drill user, you already know how easy it is to make parquet files with Drill:

CREATE TABLE dfs.tmp.sampleparquet AS 
  (SELECT trans_id, 
   cast(`date` AS date) transdate, 
   cast(`time` AS time) transtime, 
   cast(amount AS double) amount,
   user_info, marketing_info, trans_info 
   FROM dfs.`/Users/drilluser/sample.json`);

If you’re not used to SQL, that may seem very ugly/foreign/verbose to you, and you can thank Hadley for designing a better grammar of tidiness that seamlessly builds SQL queries like that behind the scenes for you. That SQL statement uses a JSON file as a data source (which you can do with Drill), makes sure the field data types are correct by explicitly casting them to SQL data types (which is a good habit to get into, even if it is verbose) and then tells Drill to make a parquet file (it’s actually a directory of parquet files) from it.
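If you’d rather stay in dplyr-land, the same kind of query can be built for you. A rough sketch (the JSON path reuses the Drill docs example above; show_query() is standard dplyr and prints the SQL that gets generated, though how cleanly every verb translates depends on sergeant’s SQL translation):

library(sergeant)
library(dplyr)

db <- src_drill("localhost")

# point a tbl at the JSON file and let dplyr verbs build the SQL
sample_tbl <- tbl(db, "dfs.`/Users/drilluser/sample.json`")

sample_tbl %>%
  select(trans_id, amount) %>%
  show_query()  # prints the Drill SQL behind the pipeline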

I’ve been working for a while now on an R package, sergeant, that provides RJDBC, direct REST and dplyr interfaces to Apache Drill. There are a number of complexities associated with creating a function to help users make parquet files from R data frames in Drill (which is why said function still does not exist in sergeant):

OK, perhaps not that last one (but I think it still remains a mystery despite claims by Jan and Dean).

It’s difficult to wrap something like that up in a simple package that will make 80% of the possible user-base happy (having Drill and Spark operate behind the scenes like “magic” seems like a bad idea to me despite how well sparklyr masks the complexity).

As I continue to work that out (you are encouraged to file an issue with your opinions on it at the gh repo), here’s a small R script that you can use to turn R data frames into parquet files:

library(sergeant)
library(tidyverse)

# make a place to hold our temp files
# this is kinda super destructive. make sure you have the path right
unlink("/tmp/pqtrans", recursive=TRUE, force=TRUE)
dir.create("/tmp/pqtrans", showWarnings=FALSE)

# save off a large-ish tibble
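# the .csvh extension tells Drill the file is a CSV with a header row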
write_csv(nycflights13::flights, "/tmp/pqtrans/flights.csvh")

# connect to drill
db <- src_drill("localhost")

# make the parquet file
dbGetQuery(db$con, "
CREATE TABLE dfs.tmp.`/pqtrans/flights.parquet` AS
  SELECT * FROM dfs.tmp.`/pqtrans/flights.csvh`
")
## # A tibble: 1 × 2
##   `Number of records written` Fragment
## *                       <int>    <chr>
## 1                      336776      0_0

# prove we did it
list.files("/tmp/pqtrans", recursive=TRUE, include.dirs=TRUE)
## [1] "flights.csvh"                  "flights.parquet"              
## [3] "flights.parquet/0_0_0.parquet"

# prove it again
flights <- tbl(db, "dfs.tmp.`/pqtrans/flights.parquet`")

flights
## Source:   query [?? x 19]
## Database: Drill 1.9.0 [localhost:8047] [8GB direct memory]
## 
##    flight arr_delay distance  year tailnum dep_time sched_dep_time origin
##     <int>     <dbl>    <dbl> <int>   <chr>    <int>          <int>  <chr>
## 1    1545        11     1400  2013  N14228      517            515    EWR
## 2    1714        20     1416  2013  N24211      533            529    LGA
## 3    1141        33     1089  2013  N619AA      542            540    JFK
## 4     725       -18     1576  2013  N804JB      544            545    JFK
## 5     461       -25      762  2013  N668DN      554            600    LGA
## 6    1696        12      719  2013  N39463      554            558    EWR
## 7     507        19     1065  2013  N516JB      555            600    EWR
## 8    5708       -14      229  2013  N829AS      557            600    LGA
## 9      79        -8      944  2013  N593JB      557            600    JFK
## 10    301         8      733  2013  N3ALAA      558            600    LGA
## # ... with more rows, and 11 more variables: sched_arr_time <int>,
## #   dep_delay <dbl>, dest <chr>, minute <dbl>, carrier <chr>, month <int>,
## #   hour <dbl>, arr_time <int>, air_time <dbl>, time_hour <dttm>,
## #   day <int>

# work with the drill parquet file
count(flights, year, origin) %>%
  collect()
## Source: local data frame [3 x 3]
## Groups: year [1]
## 
##    year origin      n
## * <int>  <chr>  <int>
## 1  2013    EWR 120835
## 2  2013    LGA 104662
## 3  2013    JFK 111279

That snippet:

- blows away and re-creates a scratch directory on the local filesystem
- writes the data frame out as a CSV with a header row (the .csvh extension Drill recognizes)
- has Drill run a CREATE TABLE AS (CTAS) query to turn that CSV into a parquet file
- lists the directory contents to show the parquet output is really there
- points a dplyr tbl at the new parquet file and runs a query against it

It’s a naive and destructive snippet, but it does provide a means to get your data frames into parquet and into Drill.
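Once the parquet file is exposed as a tbl, other dplyr verbs get translated to Drill SQL and run Drill-side until you collect(). A small sketch (mean() should translate to AVG() via the standard dbplyr SQL translation):

# average arrival delay by origin, computed by Drill against the parquet file
flights %>%
  group_by(origin) %>%
  summarise(avg_arr_delay = mean(arr_delay)) %>%
  collect()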

Most of my Drill parquet needs are converting ~20-100K JSON files a day into parquet, which is why I haven’t focused on making a nice interface for this particular use case (data frame to parquet) in R. Ultimately, I’ll likely go the “wrap parquet-cpp route” (unless you’re working on that, which — if you are — you should @-ref me in that gh-repo of yours so I can help out). But, if having a sergeant function to do this conversion would help you, drop an issue in the repo.
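For what it’s worth, that JSON-to-parquet conversion is the same CTAS pattern shown above, just pointed at a directory. A hedged sketch (both paths are illustrative; Drill will happily query a whole directory of JSON files as one table):

# convert a directory full of JSON files into a parquet table in one CTAS
# (paths below are placeholders; adjust for your own layout)
dbGetQuery(db$con, "
CREATE TABLE dfs.tmp.`/json_dumps.parquet` AS
  SELECT * FROM dfs.`/data/json_dumps`
")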
