Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The reticulate package provides a very clean and concise bridge between R and Python, which makes it handy for working with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to create parquet files directly from R, using reticulate as a bridge to the pyarrow module, which can create parquet files natively.
Now, you can also create parquet files from R with Apache Drill, and I'll provide an example of that here too, but you may need to generate such files without being able to run Drill.
The Python parquet process is pretty simple, since you can convert a pandas DataFrame directly to a pyarrow Table, which can then be written out in parquet format with pyarrow.parquet. We just need to follow this process through reticulate in R:
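```r
library(reticulate)

# Import the Python modules (the second argument is just an alias)
pd <- import("pandas", "pd")
pa <- import("pyarrow", "pa")
pq <- import("pyarrow.parquet", "pq")

# R data frame -> Python object -> pandas DataFrame -> pyarrow Table
mtcars_py <- r_to_py(mtcars)
mtcars_df <- pd$DataFrame$from_dict(mtcars_py)
mtcars_tab <- pa$Table$from_pandas(mtcars_df)

# Write the Table out as a parquet file
pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python.parquet"))
```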
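A quick way to check the result is to read the file back in. The following is only a minimal sketch: it assumes pandas and pyarrow are installed in the Python environment reticulate uses, it passes the output of r_to_py() straight to from_pandas() (relying on reticulate converting R data frames to pandas DataFrames), and it writes to a temporary file rather than the ~/Data path above.

```r
library(reticulate)

pa <- import("pyarrow")
pq <- import("pyarrow.parquet")

# Temporary output path (chosen for this sketch only)
out_file <- tempfile(fileext = ".parquet")

# Assumes r_to_py() yields a pandas DataFrame for an R data frame
mtcars_tab <- pa$Table$from_pandas(r_to_py(mtcars))
pq$write_table(mtcars_tab, out_file)

# Read the parquet file back; to_pandas() returns a pandas DataFrame,
# which reticulate converts to an R data frame on the way out
roundtrip <- pq$read_table(out_file)$to_pandas()

nrow(roundtrip)
names(roundtrip)
```

If the round trip worked, the row count and the original column names should match mtcars.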
I wouldn’t want to do that for ginormous data frames, but it should work pretty well for modest use cases (you’re likely using Spark, Drill, Presto or other “big data” platforms to create larger parquet structures). Here’s how we’d do that with Drill via the sergeant package:
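```r
library(sergeant)

# Write a CSV with a header row (Drill's .csvh format)
readr::write_csv(mtcars, "~/Data/mtcars_r.csvh")

dc <- drill_connection("localhost")

# CTAS: have Drill read the CSV and write the result out as parquet
drill_query(dc, "CREATE TABLE dfs.tmp.`/mtcars_r.parquet` AS SELECT * FROM dfs.root.`/Users/bob/Data/mtcars_r.csvh`")
```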
Without additional configuration parameters, the reticulated-Python version (above) generates larger parquet files than Drill does and also includes an index column, since Python DataFrames need one (ugh). On the other hand, small-ish data frames end up in a single file, whereas the Drill-created output is a directory with an additional CRC file (though much smaller by default). NOTE: You can pass preserve_index=False to Table.from_pandas to get rid of that icky index.
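Here is a rough sketch of what those extra parameters look like from R. The preserve_index argument is the one mentioned in the note; the compression argument to write_table() is an addition of mine, as one example of a "configuration parameter" that affects file size. Whether it actually shrinks the file depends on the data and the pyarrow version, so treat the size check as something to inspect rather than a guarantee. The temporary output path is also just for illustration.

```r
library(reticulate)

pa <- import("pyarrow")
pq <- import("pyarrow.parquet")

# Assumes r_to_py() yields a pandas DataFrame for an R data frame
mtcars_df <- r_to_py(mtcars)

# preserve_index = FALSE asks pyarrow not to store the pandas index as a column
mtcars_tab <- pa$Table$from_pandas(mtcars_df, preserve_index = FALSE)

# compression is optional; "gzip" is one of the codecs pyarrow accepts
out_file <- tempfile(fileext = ".parquet")
pq$write_table(mtcars_tab, out_file, compression = "gzip")

# Inspect the stored column names and the file size in bytes
col_names <- mtcars_tab$column_names
col_names
file.size(out_file)
```

Note that dropping the index also drops the car names that mtcars keeps as row names, so copy them into a regular column first if you need them.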
It’s fairly efficient even for something like nycflights13::flights, which has ~330K rows and 19 columns:
```r
library(magrittr)  # provides the %>% pipe

system.time(
  r_to_py(nycflights13::flights) %>%
    pd$DataFrame$from_dict() %>%
    pa$Table$from_pandas() %>%
    pq$write_table(where = "/tmp/flights.parquet")
)
##    user  system elapsed
##   1.285   0.108   1.398
```
If you need to generate parquet files in a pinch, reticulate seems to be a good way to go.