Mongolite 2.0: GridFS, connection pooling, and more

rOpenSci - open tools for open science

4 years ago

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This week version 2.0 of the mongolite package has been released to CRAN. Major new features in this release include support for MongoDB 4.0, GridFS, running database commands, and connection pooling.

Mongolite is primarily an easy-to-use client to get data in and out of MongoDB. However it supports increasingly many advanced features like aggregation, indexing, map-reduce, streaming, encryption, and enterprise authentication. The mongolite user manual provides a great introduction with details and worked examples.

GridFS Support!

New in version 2.0 is support for the MongoDB GridFS system. A GridFS is a special type of Mongo collection for storing binary data, such as files. To the user, a GridFS looks like a key-value server with potentially very large values.

We support two interfaces for sending/receiving data from/to GridFS. The fs$read() and fs$write() methods are the most flexible and can stream data from/to an R connection, such as a file, socket or url. These methods support a progress counter and can be interrupted if needed. These methods are recommended for reading or writing single files.

# Assume 'mongod' is running on localhost
fs <- gridfs()
buf <- serialize(nycflights13::flights, NULL)
fs$write(buf, 'flights')
#> [flights]: written 45.11 MB (done)
#>  List of 6
#> $ id      : chr "5b70a37a47a302506117c352"
#> $ name    : chr "flights"
#> $ size    : num 45112028
#> $ date    : POSIXct[1:1], format: "2018-08-12 23:15:38"
#> $ type    : chr NA
#> $ metadata: chr NA

# Read serialized data:
out <- fs$read('flights')
flights <- unserialize(out$data)
# [flights]: read 45.11 MB (done)
#>  A tibble: 6 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int>
#> 1  2013     1     1      517            515         2      830            819        11 UA        1545
#> 2  2013     1     1      533            529         4      850            830        20 UA        1714
#> 3  2013     1     1      542            540         2      923            850        33 AA        1141
#> 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725
#> 5  2013     1     1      554            600        -6      812            837       -25 DL         461
#> 6  2013     1     1      554            558        -4      740            728        12 UA        1696
#> # ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

The fs$upload() and fs$download() methods on the other hand copy directly between GridFS and your local disk. This API is vectorized so it can transfer many files at once. Hover over to the gridfs chapter in the manual for more examples.

Running Commands

MongoDB exposes an enormous number of database commands, and mongolite cannot provide wrappers for each command. As a compromise, the new version of mongolite exposes an api to run raw commands, so you can manually run the commands for which we do not expose wrappers.

The result data from the commannd automatically gets simplified into nice R data structures using the jsonlite simplification system (but you can set simplify = FALSE if you prefer json structures).

m <- mongo()
col$run('{"ping": 1}')
#> $ok
#> [1] 1

For example we can run the listDatabases command to list all db’s on the server:

admin <- mongo(db = "admin")
admin$run('{"listDatabases":1}')
#> $databases
#>     name sizeOnDisk empty
#> 1  admin      32768 FALSE
#> 2 config      73728 FALSE
#> 3  local      73728 FALSE
#> 4   test   72740864 FALSE
#> 
#> $totalSize
#> [1] 72921088
#> 
#> $ok
#> [1] 1

The server tools chapter in the manual has some more examples.

Connection Pooling

Finally another much requested feature was connection pooling. Previously, mongolite would open a new database connection for each collection object in R. However as of version 2.0, mongolite will use existing connections to the same database when possible.

# This will use a single database connection
db.flights <- mongolite::mongo("flights")
db.iris <- mongolite::mongo("iris")
db.files <- mongolite::gridfs()

A database connection is automatically closed when all objects that were using it have either been removed, or explicitly disconnected with the disconnect() method. For example using the example above:

# Connection still alive
rm(db.flights)
db.files$disconnect()

# Now it will disconnect
db.iris$disconnect()

Mongolite collection and GridFS objects automatically reconnect if when needed if they are disconnected (either explicitly or automatically e.g. when restarting your R session). For example:

> db.files$find()
Connection lost. Trying to reconnect with mongo...
#>                         id    name     size                date type metadata
#> 1 5b70a37a47a302506117c352 flights 45112028 2018-08-12 23:15:38 <NA>     <NA>

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.