Site icon R-bloggers

nodbi: the NoSQL Database Connector

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

DBI

What is DBI? DBI is an R package. It defines an interface to relational database management systems (R/DBMS) that other R packages build upon to interact with a specific relational database, such as SQLite or PostgreSQL.

NoSQL

NoSQL databases are a very broad class of database that can include document databases such as CouchDB and MongoDB, key-value stores such as Redis, and more. They are generally not row-column relational stores though, though can include that. NoSQL is often thought of now as “not only SQL”.

You can imagine how it is relatively straight-forward to create a common interace to row-column oriented databases, and DBI is great for that.

However, a common interface to NoSQL datbases is a bit harder to wrap your head around for various reasons. One of the most obvious of which is that they don’t share a fundamental data structure like R/DMBS do.

A sort of DBI for NoSQL

nodbi (the no for “N” and “O” in “NoSQL”) was started nearly 3 years ago during the 2015 rOpenSci Unconference on 26 March together with Rich FitzJohn and Jeroen Ooms. We’ve worked on it here and there, but it hasn’t seen any commits in nearly a year.

Given the success of dplyr it goes to reason that something that can abstract away the details for NoSQL databases – for a subset of the most common operations – could prove to be very useful.

nodbi takes a different approach than DBI (thus the “sort of” in this section title) in that instead of package X for database Y importing DBI and extending its methods, nodbi provides the connection objects to various databases, and a set of common methods for interacting with those databases. Whereas DBI doesn’t import or depend in any way on packages for individual databases, nodbi has packages for individual databases in Suggests. That is to say, nodbi can be installed and loaded without any specific database package, but to use any one database (e.g., Mongo) you need to install the package for that database (i.e., mongolite). Because database packages in nodbi are in Suggests, you can pick and choose the packages you want to have installed – and forget about the others.

Check out the package on GitHub.

data.frame’s

Thus far, we’ve built nodbi around data.frame’s. That is, we’re focusing on the data.frame use case as it’s very common that R users are dealing solely with data.frame’s in their analysis pipelines.

Let us know if you have a strong need for lists or vectors, etc. in nodbi.

supported databases

Connection functions to each database follow the pattern src_ following the pattern used by dplyr

The details of connecting to different databases vary as you’d imagine, but what’s returned from src_ functions can be passed to any of the below methods (for the most part).

supported methods

Operations in nodbi follow a key-value pattern. That is, an object (read: data.frame) is stored in your choice of database with a key (read: character string). Not every operation above needs both the key and value – for example, deleting an object only requires the key.

Internally, a key maps differently to Redis vs. Mongo vs. CouchDB, etc. But the point is that we’re abstracting away the details so that the user can focus on getting things done faster.

Setup

You’ll need to start up the database you want to work with if it’s not started already. Let’s work with Mongo for this blog post.

Check out the mongolite book for details on installing MongoDB

Start up mongo on your cli (i.e., shell or terminal)

mongod # or sudo mongod

Install database packages

install.packages(c("sofa", "etseed", "elastic", "mongolite", "redux"))

Install nodbi

devtools::install_github("ropensci/nodbi")

library(nodbi)

Examples

Connect to Mongo with src_mongo

(src <- src_mongo(verbose = FALSE))
#> MongoDB 3.6.2 (uptime: 3017s)
#> URL: leothelion.local/test

With the connection object you can access the name of the Mongo database

src$db
#> [1] "test"

And the connection mongolite connection object

src$con
#> <Mongo collection> 'test' 
#>  $aggregate(pipeline = "{}", options = "{\"allowDiskUse\":true}", handler = NULL, pagesize = 1000) 
#>  $count(query = "{}") 
#>  $distinct(key, query = "{}") 
#>  $drop() 
#>  $export(con = stdout(), bson = FALSE) 
#>  $find(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0, handler = NULL, pagesize = 1000) 
#>  $import(con, bson = FALSE) 
#>  $index(add = NULL, remove = NULL) 
#>  $info() 
#>  $insert(data, pagesize = 1000, stop_on_error = TRUE, ...) 
#>  $iterate(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0) 
#>  $mapreduce(map, reduce, query = "{}", sort = "{}", limit = 0, out = NULL, scope = NULL) 
#>  $remove(query, just_one = FALSE) 
#>  $rename(name, db = NULL) 
#>  $replace(query, update = "{}", upsert = FALSE) 
#>  $update(query, update = "{\"$set\":{}}", filters = NULL, upsert = FALSE, multiple = FALSE)

Insert a data.frame (the returned object is from mongolite) using docdb_create

library("ggplot2")
x <- docdb_create(src = src, key = "diamonds", value = diamonds)
x
#> List of 5
#>  $ nInserted  : num 53940
#>  $ nMatched   : num 0
#>  $ nRemoved   : num 0
#>  $ nUpserted  : num 0
#>  $ writeErrors: list()

Get the data.frame back from Mongo using docdb_get

head(docdb_get(src, "diamonds"))
#>   carat       cut color clarity depth table price    x    y    z
#> 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Delete the object using docdb_delete

docdb_delete(src, "diamonds")

The same pattern as above can be used for any of the other database options, except the connecting with src_* differs among databases.



similar art

The one that comes closest is storr by Rich FitzJohn – which shares Redis in common with nodbi as a backend. storr seems to be more focused on the caching use case – whereas, nodbi focuses on user X has a specific database they keep their data in and they’d like to not worry about the details of working with that database. Perhaps the distinction isn’t really there, I’m not sure.

Todo

Feedback!

We’d love to get some eyes on this; to sort out problems that will no doubt arise from real world scenarios; to flesh out new use cases we hadn’t thought of, etc.

Open an issue: https://github.com/ropensci/nodbi/issues/new.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.