Create Parquet Files From R Data Frames With sergeant & Apache Drill (a.k.a. Make Parquet Files Great Again in R)
Apache Drill is a nice tool to have in the toolbox as it provides a SQL front-end to a wide array of database and file back-ends and runs in standalone/embedded mode on every modern operating system (i.e. you can get started with or play locally with Drill w/o needing a Hadoop cluster but scale up almost effortlessly). It’s also a bit more lightweight than Spark and a nice alternative to Spark if you only need data wrangling and not the functionality in Spark’s MLlib.
When you’re in this larger-data world, parquet files are one of the core data storage formats. They’re designed to be compact and are optimized for columnar operations. Unlike with CSV, JSON, or even R Data files, it’s not necessary to read or scan an entire parquet file to filter, select, aggregate, etc. across columns. Unfortunately, parquet files aren’t first-class citizens in R. Well, they aren’t now, but thanks to this project it might not be too difficult to make an R interface to them. But, for now, you have to use some other means to convert or read parquet files.
Spark and sparklyr can help you write parquet files, but I don’t need to run Spark all the time.
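For comparison, here’s roughly what that looks like with sparklyr. This is just a sketch: it assumes a local Spark install (e.g. via sparklyr::spark_install()), and the /tmp/flights_spark_parquet output path is arbitrary.

library(sparklyr)
library(dplyr)

# spin up a throwaway local Spark session
sc <- spark_connect(master = "local")

# copy the data frame into Spark, then have Spark write it out as parquet
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
spark_write_parquet(flights_tbl, "/tmp/flights_spark_parquet")

spark_disconnect(sc)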
If you’re already a Drill user, you already know how easy it is to make parquet files with Drill:
CREATE TABLE dfs.tmp.sampleparquet AS (
  SELECT
    trans_id,
    cast(`date` AS date) transdate,
    cast(`time` AS time) transtime,
    cast(amount AS double) amountm,
    user_info, marketing_info, trans_info
  FROM dfs.`/Users/drilluser/sample.json`
);
If you’re not used to SQL, that may seem very ugly/foreign/verbose to you, and you can thank Hadley for designing a better grammar of tidiness that seamlessly builds SQL queries like that behind the scenes for you. That SQL statement uses a JSON file as a data source (which you can do with Drill), makes sure the field data types are correct by explicitly casting them to SQL data types (which is a good habit to get into even if it is verbose), and then tells Drill to make a parquet file (it’s actually a directory of parquet files) from it.
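As a quick illustration of that “behind the scenes” SQL generation (using the sergeant package introduced below, and assuming drill-embedded is running and the dfs.tmp.sampleparquet table from the statement above exists):

library(sergeant)
library(dplyr)

db <- src_drill("localhost")

# point dplyr at the parquet table the CREATE TABLE above produced
sample_tbl <- tbl(db, "dfs.tmp.sampleparquet")

sample_tbl %>%
  filter(amountm > 100) %>%
  select(trans_id, transdate, amountm) %>%
  show_query()  # prints the SELECT/WHERE SQL that Drill will actually run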
I’ve been working on an R package — sergeant — that provides RJDBC, direct REST and dplyr interfaces to Apache Drill for a while now. There are a number of complexities associated with creating a function to help users make parquet files from R data frames in Drill (which is why said function still does not exist in sergeant):
- Is Drill installed or does there need to be a helper set of functions for installing and running Drill in embedded mode?
- Even if there’s a Drill cluster running, does the user — perhaps — want to do the conversion locally in embedded mode? Embedded is way easier since all the files are local. The only real way to convert a data frame to Drill is to save a data frame to a temporary, interim file and then have Drill read it in. In a cluster mode where your local filesystem is not part of the cluster, that would mean finding the right way to get the file to the cluster. Which leads to the next item…
- Where does the user want the necessary temporary files stored? The local dfs file system? HDFS?
- Do we need two different methods? One for quick conversion and one that forces explicit column data type casting?
- Do we need to support giving the user explicit casting control and column selection capability?
- Who put the bomp in the bomp, bomp, bomp?
OK, perhaps not that last one (but I think it still remains a mystery despite claims by Jan and Dean).
It’s difficult to wrap something like that up in a simple package that will make 80% of the possible user-base happy (having Drill and Spark operate behind the scenes like “magic” seems like a bad idea to me, despite how well sparklyr masks the complexity).
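To make those trade-offs concrete, a helper could look something like the sketch below. To be clear, this function does not exist in sergeant; the name, the arguments, and the assumption that we’re in embedded mode with dfs.tmp mapped to the local /tmp directory are all mine.

# purely illustrative; assumes embedded mode with dfs.tmp mapped to /tmp
drill_df_to_parquet <- function(df, name, con) {
  # write the data frame to an interim .csvh file under /tmp (what dfs.tmp points to here)
  csvh <- file.path("/tmp", paste0(name, ".csvh"))
  readr::write_csv(df, csvh)
  # have Drill convert the interim file into a parquet "table"
  sql <- sprintf(
    "CREATE TABLE dfs.tmp.`/%s.parquet` AS SELECT * FROM dfs.tmp.`/%s.csvh`",
    name, name
  )
  DBI::dbGetQuery(con, sql)
}

# e.g. drill_df_to_parquet(nycflights13::flights, "flights", src_drill("localhost")$con)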
As I continue to work that out (you are encouraged to file an issue with your opines on it at the gh repo), here’s a small R script you can use to turn R data frames into parquet files:
library(sergeant)
library(tidyverse)

# make a place to hold our temp files
# this is kinda super destructive. make sure you have the path right
unlink("/tmp/pqtrans", recursive=TRUE, force=TRUE)
dir.create("/tmp/pqtrans", showWarnings=FALSE)

# save off a large-ish tibble
write_csv(nycflights13::flights, "/tmp/pqtrans/flights.csvh")

# connect to drill
db <- src_drill("localhost")

# make the parquet file
dbGetQuery(db$con, "
CREATE TABLE dfs.tmp.`/pqtrans/flights.parquet` AS
SELECT * FROM dfs.tmp.`/pqtrans/flights.csvh`
")
## # A tibble: 1 × 2
##   `Number of records written` Fragment
## *                        <int>    <chr>
## 1                       336776      0_0

# prove we did it
list.files("/tmp/pqtrans", recursive=TRUE, include.dirs=TRUE)
## [1] "flights.csvh"                  "flights.parquet"
## [3] "flights.parquet/0_0_0.parquet"

# prove it again
flights <- tbl(db, "dfs.tmp.`/pqtrans/flights.parquet`")

flights
## Source:   query [?? x 19]
## Database: Drill 1.9.0 [localhost:8047] [8GB direct memory]
##
##    flight arr_delay distance  year tailnum dep_time sched_dep_time origin
##     <int>     <dbl>    <dbl> <int>   <chr>    <int>          <int>  <chr>
## 1    1545        11     1400  2013  N14228      517            515    EWR
## 2    1714        20     1416  2013  N24211      533            529    LGA
## 3    1141        33     1089  2013  N619AA      542            540    JFK
## 4     725       -18     1576  2013  N804JB      544            545    JFK
## 5     461       -25      762  2013  N668DN      554            600    LGA
## 6    1696        12      719  2013  N39463      554            558    EWR
## 7     507        19     1065  2013  N516JB      555            600    EWR
## 8    5708       -14      229  2013  N829AS      557            600    LGA
## 9      79        -8      944  2013  N593JB      557            600    JFK
## 10    301         8      733  2013  N3ALAA      558            600    LGA
## # ... with more rows, and 11 more variables: sched_arr_time <int>,
## #   dep_delay <dbl>, dest <chr>, minute <dbl>, carrier <chr>, month <int>,
## #   hour <dbl>, arr_time <int>, air_time <dbl>, time_hour <dttm>,
## #   day <int>

# work with the drill parquet file
count(flights, year, origin) %>% collect()
## Source: local data frame [3 x 3]
## Groups: year [1]
##
##    year origin      n
## * <int>  <chr>  <int>
## 1  2013    EWR 120835
## 2  2013    LGA 104662
## 3  2013    JFK 111279
That snippet:
- assumes Drill is running, which is really as easy as entering drill-embedded at a shell prompt, but try out Drill in 10 Minutes if you don’t believe me
- assumes dfs.tmp points to /tmp (i.e. you need to modify that if yours doesn’t… see, I told you this wasn’t simple)
- assumes we’re OK with letting Drill figure out column types (see the explicit-casting variation below if you’re not)
- assumes we want ALL THE COLUMNS
- uses the .csvh extension, which tells Drill to read the column names from the first line so we don’t have to create the schema from scratch
- is slow because of ↑ due to the need to create the csvh file first
- exploits the fact that we can give dplyr the cold shoulder and talk directly to Drill anytime we feel like it with DBI calls by using the $con list field (the dbGetQuery(db$con, …) line).
It’s a naive and destructive snippet, but it does provide a means to get your data frames into parquet and into Drill.
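If letting Drill guess the column types (or keeping all 19 columns) bothers you, the CREATE TABLE can select and cast columns explicitly instead. Here’s a variation on the snippet above under the same assumptions (drill-embedded running, dfs.tmp mapped to /tmp, flights.csvh already written); the flights_typed.parquet name is just for illustration:

# pick specific columns and cast them explicitly instead of letting Drill guess
dbGetQuery(db$con, "
CREATE TABLE dfs.tmp.`/pqtrans/flights_typed.parquet` AS
SELECT
  CAST(`year`    AS INT)    AS `year`,
  CAST(`month`   AS INT)    AS `month`,
  CAST(`day`     AS INT)    AS `day`,
  CAST(flight    AS INT)    AS flight,
  CAST(distance  AS DOUBLE) AS distance,
  carrier, origin, dest
FROM dfs.tmp.`/pqtrans/flights.csvh`
")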
Most of my Drill parquet needs are converting ~20-100K JSON files a day into parquet, which is why I haven’t focused on making a nice interface for this particular use case (data frame to parquet) in R. Ultimately, I’ll likely go the “wrap parquet-cpp route” (unless you’re working on that, which — if you are — you should @-ref me in that gh-repo of yours so I can help out). But, if having a sergeant function to do this conversion would help you, drop an issue in the repo.