Fast remote parquet
Querying a remote parquet file via HTTP with DuckDB.
The French statistical service (INSEE) made its first parquet file available on data.gouv.fr in June.
It’s a 470 MB file (from a 1.8 GB CSV) with 16·10⁶ rows, showing for each address in France which polling station it belongs to.
It would take at least 23 s just to download the file (at 20 MB·s⁻¹), plus some more time to open it and aggregate the data. But with DuckDB coupled with parquet we don’t need to download the whole file: thanks to parquet’s columnar layout, DuckDB fetches only the metadata and the column chunks it needs via HTTP range requests. It’s just magical!
Config
dataset <- "https://static.data.gouv.fr/resources/bureaux-de-vote-et-adresses-de-leurs-electeurs/20230626-135723/table-adresses-reu.parquet"

library(duckdb)
library(tidyverse)
library(glue)

cnx <- dbConnect(duckdb())

# To do once:
# dbExecute(cnx, "INSTALL httpfs")

dbExecute(cnx, "LOAD httpfs")

dbSendQuery(cnx, glue("
  CREATE VIEW bureaux AS
  SELECT *
  FROM '{dataset}'"))
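As a quick sanity check, you can also query the remote file directly, without the intermediate view (a minimal sketch; the column and LIMIT are just illustrative):

dbGetQuery(cnx, glue("
  SELECT geo_adresse
  FROM '{dataset}'
  LIMIT 5"))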
With {DBI} and SQL
It takes less than 2 s to query the top ten communes!
# available columns
dbGetQuery(cnx, "
  DESCRIBE bureaux")
                   column_name column_type null  key default extra
1             code_commune_ref     VARCHAR  YES <NA>    <NA>    NA
2  reconstitution_code_commune     VARCHAR  YES <NA>    <NA>    NA
3               id_brut_bv_reu     VARCHAR  YES <NA>    <NA>    NA
4                           id     VARCHAR  YES <NA>    <NA>    NA
5                  geo_adresse     VARCHAR  YES <NA>    <NA>    NA
6                     geo_type     VARCHAR  YES <NA>    <NA>    NA
7                    geo_score      DOUBLE  YES <NA>    <NA>    NA
8                    longitude      DOUBLE  YES <NA>    <NA>    NA
9                     latitude      DOUBLE  YES <NA>    <NA>    NA
10                    api_line     VARCHAR  YES <NA>    <NA>    NA
11               nb_bv_commune    UINTEGER  YES <NA>    <NA>    NA
12                 nb_adresses    UINTEGER  YES <NA>    <NA>    NA
# number of rows
dbGetQuery(cnx, "
  SELECT COUNT(*)
  FROM bureaux")
  count_star()
1     15970992
# top communes by address number
dbGetQuery(cnx, "
  SELECT code_commune_ref,
         SUM(nb_adresses) AS total_nb_adresses
  FROM bureaux
  GROUP BY code_commune_ref
  ORDER BY total_nb_adresses DESC
  LIMIT 10")
   code_commune_ref total_nb_adresses
1             31555            183488
2             34172            124866
3             06088            123217
4             33063            106018
5             59350             83468
6             97411             70738
7             13001             64095
8             44109             60255
9             72181             56159
10            83137             54995
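If you want to reproduce the timing yourself, a simple sketch is to wrap the query in system.time(); the exact figures will of course depend on your bandwidth and latency:

system.time(
  dbGetQuery(cnx, "
    SELECT code_commune_ref,
           SUM(nb_adresses) AS total_nb_adresses
    FROM bureaux
    GROUP BY code_commune_ref
    ORDER BY total_nb_adresses DESC
    LIMIT 10")
)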
With {dbplyr}
The same queries work with {dbplyr}!
bureaux <- tbl(cnx, "bureaux")

# available columns
colnames(bureaux)
[1] "code_commune_ref" "reconstitution_code_commune" [3] "id_brut_bv_reu" "id" [5] "geo_adresse" "geo_type" [7] "geo_score" "longitude" [9] "latitude" "api_line" [11] "nb_bv_commune" "nb_adresses"
# number of rows
bureaux %>%
  summarize(rows = n())
# Source:   SQL [1 x 1]
# Database: DuckDB 0.8.1 [unknown@Linux 5.15.0-83-generic:R 4.3.1/:memory:]
      rows
     <dbl>
1 15970992
# top communes by address number
bureaux %>%
  group_by(code_commune_ref) %>%
  summarise(tot_nb_adresses = sum(nb_adresses)) %>%
  arrange(desc(tot_nb_adresses))
# Source:     SQL [?? x 2]
# Database:   DuckDB 0.8.1 [unknown@Linux 5.15.0-83-generic:R 4.3.1/:memory:]
# Ordered by: desc(tot_nb_adresses)
   code_commune_ref tot_nb_adresses
   <chr>                      <dbl>
 1 31555                     183488
 2 34172                     124866
 3 06088                     123217
 4 33063                     106018
 5 59350                      83468
 6 97411                      70738
 7 13001                      64095
 8 44109                      60255
 9 72181                      56159
10 83137                      54995
# ℹ more rows
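If you’re curious about the SQL that {dbplyr} generates and sends to DuckDB, show_query() will print it (output not shown here):

bureaux %>%
  group_by(code_commune_ref) %>%
  summarise(tot_nb_adresses = sum(nb_adresses)) %>%
  arrange(desc(tot_nb_adresses)) %>%
  show_query()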
Cleaning
dbDisconnect(cnx, shutdown = TRUE)
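As an aside, if you want to keep a local extract of an aggregation, DuckDB can write parquet directly with COPY; run something like this before the dbDisconnect() call above (a hedged sketch, the output file name is hypothetical):

dbExecute(cnx, "
  COPY (
    SELECT code_commune_ref,
           SUM(nb_adresses) AS total_nb_adresses
    FROM bureaux
    GROUP BY code_commune_ref
  ) TO 'total_adresses.parquet' (FORMAT PARQUET)")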
It’s so fast, I’m not sure I can believe it…