Converting arbitrarily large CSVs to Parquet with R
In this recent post, we used Polars and DuckDB from Python to convert a large CSV file to Parquet in streaming mode.
Several people have contacted me and asked: “and in R?”
Simple answer: We have DuckDB, and we have several Polars bindings. Here, we are using {polars}, which is currently being overhauled into {neopolars}.
So let’s not wait any longer!
Run times are on a Windows system with an Intel i7-13700H CPU.
Generate a 2.2 GB CSV file
We use {data.table} to dump a randomly generated dataset with 100 million rows into a CSV file.
library(data.table)

set.seed(1)

n <- 1e8

df <- data.frame(
  X = sample(letters[1:3], n, TRUE),
  Y = runif(n),
  Z = sample(1:5, n, TRUE)
)
fwrite(df, "data.csv")
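As a quick sanity check that the dump really is in the expected ballpark, you can look at the file size from R (the ~2.2 GB figure is what I see on my machine):

# Size of the CSV on disk, in GB (about 2.2 here)
file.size("data.csv") / 1e9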
DuckDB
Then, we use DuckDB to run a query against the CSV file and stream the result into Parquet.
Threads and RAM can be set on the fly, which is very convenient (see the sketch after the code block below). Setting a low memory limit (e.g., 500 MB) will work – try it out!
library(duckdb)

con <- dbConnect(duckdb(config = list(threads = "8", memory_limit = "4GB")))

system.time( # 3.5s
  dbSendQuery(
    con,
    "
    COPY (
      SELECT Y, Z
      FROM 'data.csv'
      WHERE X == 'a'
      ORDER BY Y
    ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd)
    "
  )
)

# Check
dbGetQuery(con, "SELECT COUNT(*) N FROM 'data.parquet'") # 33329488
dbGetQuery(con, "SELECT * FROM 'data.parquet' LIMIT 5")
#              Y Z
# 1 5.355105e-09 4
# 2 9.080395e-09 5
# 3 2.258457e-08 2
# 4 3.445894e-08 2
# 5 6.891787e-08 1
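Regarding the "set on the fly" remark above: both settings can also be changed on an already open connection via plain SQL. A minimal sketch (the 500 MB value is just the example from the text; pick whatever fits your machine):

# Adjust resources of the open connection on the fly
dbExecute(con, "SET threads TO 4")
dbExecute(con, "SET memory_limit = '500MB'")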
3.5 seconds – wow! The resulting file looks good and is 125 MB in size.
Polars
Let’s do the same with Polars.
# Sys.setenv(NOT_CRAN = "true")
# install.packages("polars", repos = "https://community.r-multiverse.org")
library(polars)

polars_info()

system.time( # 9s
  (
    pl$scan_csv("data.csv")
    $filter(pl$col("X") == "a")
    $drop("X")
    $sort("Y")
    $sink_parquet("data.parquet", row_group_size = 1e5)
  )
)

# Check
pl$scan_parquet("data.parquet")$head()$collect()
# shape: (5, 2)
# ┌───────────┬─────┐
# │ Y         ┆ Z   │
# │ ---       ┆ --- │
# │ f64       ┆ i64 │
# ╞═══════════╪═════╡
# │ 5.3551e-9 ┆ 4   │
# │ 9.0804e-9 ┆ 5   │
# │ 2.2585e-8 ┆ 2   │
# │ 3.4459e-8 ┆ 2   │
# │ 6.8918e-8 ┆ 1   │
# └───────────┴─────┘
At nine seconds, Polars is slower than DuckDB, but the output looks as expected and has the same size as DuckDB's.
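If you want to double-check that both tools agree, one option is to point DuckDB at the file Polars just wrote (this assumes the connection con from the DuckDB section is still open):

# Row count of the Polars output should match DuckDB's result (33329488)
dbGetQuery(con, "SELECT COUNT(*) N FROM 'data.parquet'")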
Final words
- With DuckDB or Polars, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.
- We can apply filters, selects, sorts, etc. on the fly.
- Let’s keep an eye on Polars in R. It looks really interesting.