[This article was first published on Guillaume Pressiat, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
You may know that MonetDBLite was removed from CRAN. DuckDB comming up.
Breaking change
People who based their works on MonetDBLite may ask what happened, what to do. Not to play a risky game with database and tools choices for future works… (“It’s really fast but we may waste some time if we have to replace it by another solution”).
It’s the game with open source. Remember big changes in dplyr 0.7.
Sometimes we want better tools, and most of the time they become better. It’s really great.
And sometimes we don’t have time and energy to adapt our work to tools that became better in a too iterative way. Or in a too subjective way.
We want it to work, not break.
Keeping code as simple as possible (and avoid nebulous dependencies, so, tidy?) is one of the key point.
Stocking data in a database is another one.
All that we can say is that “we’re walking on works in progress”. Like number of eggshells, more works in progress here probably means more breaking changes.
Works in progress for packages, also for (embedded) databases!
From Monet to Duck
MonetDBLite philosophy is to be like a “very very fast SQLite”. But it’s time for change (or it seems to be).
Then we can thanks MonetDBLite developers as it was a nice adventure to play/work with MonetDB speed!
As a question, is there another person, some volunteers, possibilities to maintain MonetDBLite (somewhere a nice tool)?
There are not so many informations for the moment about what happened and that’s why I write this post.
Here, I read that they are now working on a new solution, under MIT License, named DuckDB, see here for more details.
As I’m just a R user and haven’t collaborate to the project, I would just say for short: DuckDB takes good parts from SQLite and PostGreSQL (Parser), see here for complete list, it looks promising. As in MonetDB, philosophy is focused on columns and speed. And dates for instance are handled correctly, not having to convert them in “ISO-8601 – like” character strings.
I also put a link at the bottom of this page which give some explanations about the name of this new tool and DuckDB developers point’s of view about data manipulation and storage1.
Sampling it to get more rows, then duplicating columns, two time.
Write in db
It take some times compared to MonetDBLite (no benchmark here, I just run this several times and it was consistent).
The three are pretty fast.
Most importantly if queries are fast, and they are, most of the time we’re allwright.
I want to say here that’s for now it’s a work in progress, we have to wait more communication from DuckDB developers. I just write this to share the news.
Glimpse
Count
Dates
Compared to SQLite it handles dates/times correctly. No need to convert in character.
Some querying
Running some queries
dplyr
It already works nicely with dplyr.
sql
Run query as a string
Like for all data sources with DBI, if the query is more complex, we can write it comfortably in an external file and launch it like this for example:
“Little” benchmarks
Collecting this big data frame
This has no sense but give some idea of read speed. We collect df_test in memory, from duckdb, monetdb and sqlite.
Really good !
Simple count
Count then collect aggregate rows.
Faster !
It remains to test joins, filters, sorts, etc.
Informations
I find that there are not so many communications for the moment about this work and binding for R, so I made this post to highlight it.
MonetDBLite speed is amazing, do you will give DuckDB a try ?
In any case thanks to DuckDB developers and welcome to the new duck.