
Connecting to a MongoDB database from R using Java

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers.]

It would be nice if there were an R package, along the lines of RMySQL, for MongoDB. For now there is not – so, how best to get data from a MongoDB database into R?

One option is to retrieve JSON via the MongoDB REST interface and parse it using the rjson package. Assuming, for example, that you have retrieved your CiteULike collection in JSON format from this URL:


http://www.citeulike.org/json/user/neils

– and saved it to a database named citeulike in a collection named articles, you can fetch the first 5 articles into R like so:

library(RCurl)
library(rjson)

db <- "http://localhost:28017/citeulike/articles/?limit=5"
articles <- fromJSON(getURL(db))
articles$rows[[1]]$title
# [1] "A computational genomics pipeline for prokaryotic sequencing projects"
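To pull the title out of every returned article rather than just the first, something along these lines should do. This is a minimal sketch: the JSON literal is a made-up stand-in for what getURL(db) returns, and it assumes only that the documents sit in a rows array, as in the example above.

```r
library(rjson)

# Stand-in for the text returned by getURL(db). Only the "rows" array
# mirrors the example above; the titles are invented for illustration.
response <- '{
  "rows": [
    {"title": "First example article"},
    {"title": "Second example article"}
  ]
}'

articles <- fromJSON(response)

# Extract the title field from every document in the rows list
titles <- vapply(articles$rows, function(row) row$title, character(1))
print(titles)
```

vapply() is used instead of sapply() so that a document with a missing title fails loudly rather than silently changing the type of the result.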

That works, but you may not want to use the MongoDB REST interface: for example, it may be slow for large queries or there might be security concerns.

MongoDB has both C and Java drivers, and R can interface with both languages: through the built-in .C/.Call functions and the rJava package, respectively. My only problem is that I can write what I know about C and Java on the back of a postage stamp.

Not to be deterred, I took the approach that has served me well my whole professional life: wing it, using what I could glean from Google searches and the Web. In the end, using Java in R to connect with MongoDB was surprisingly easy. Here’s a basic how-to.

I’ll assume that MongoDB is installed and running on your machine. Packages for Ubuntu/Debian can be obtained here.

1. Install R packages
You’ll need rJava and rjson. The latter was a simple install.packages("rjson") from the R console. The former gave me some problems, so, as I use Ubuntu, I installed it with sudo apt-get install r-cran-rjava. That should also install the necessary dependencies, including a JDK if you don’t already have one.
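A quick way to confirm that rJava can actually start a JVM before going any further is to ask Java for its own version. This is just a sanity check of my own, not part of the MongoDB setup; the version string you see will depend on your JDK.

```r
library(rJava)

# Start the JVM inside the R session
.jinit()

# Call the static method System.getProperty("java.version").
# "S" is rJava shorthand for a java.lang.String return type.
java_version <- .jcall("java/lang/System", "S", "getProperty", "java.version")
print(java_version)
```

If this errors, the problem lies with the rJava/JDK installation rather than with anything MongoDB-related.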

2. Install the MongoDB Java driver
Create a directory, e.g. ~/mongodb/java, change into it and grab the latest driver from GitHub. I renamed the file to mongo.jar. Having no idea what to do with it, I searched and discovered this guide. I ran:

jar xf mongo.jar
# the directory now contains
com  git-hash  META-INF  mongo.jar  org

The Java class files are located in com/mongodb.

3. Experiment with rJava
Still in ~/mongodb/java, I started an R console and loaded the libraries:

library(rJava)
library(rjson)

Next, I added the MongoDB classes to the classpath:

.jinit()
.jaddClassPath(path.expand("~/mongodb/java/mongo.jar"))  # expand ~ in R rather than relying on the JVM
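If the driver classes can't be found later on, it's worth checking what actually ended up on the classpath. A small sketch, assuming the jar lives at the path used in this post; the file.exists() guard and the message are my own additions.

```r
library(rJava)
.jinit()

# Expand "~" on the R side so Java receives an absolute path
jar_path <- path.expand("~/mongodb/java/mongo.jar")

if (file.exists(jar_path)) {
  .jaddClassPath(jar_path)
} else {
  message("No jar found at ", jar_path, "; adjust the path to your download")
}

# List everything currently on the rJava classpath
class_path <- .jclassPath()
print(class_path)
```

The jar should appear in the printed vector alongside rJava's own entries.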

The next step was to consult the MongoDB Java tutorial and try to figure out how to convert “normal” Java syntax to rJava. First, rJava has no import, so you create a new Mongo object like this:

m <- .jnew("com/mongodb/Mongo", "localhost")
print(m)
# [1] "Java-Object{com.mongodb.Mongo@c2ea3f}"

OK – that seems to have worked; we have a Java object of class Mongo, connected to the server on localhost.
You can see the available methods like this:

.jmethods(m)
# result
 [1] "public com.mongodb.DB com.mongodb.Mongo.getDB(java.lang.String)"
 [2] "public java.util.List com.mongodb.Mongo.getDatabaseNames() throws com.mongodb.MongoException"
 [3] "public void com.mongodb.Mongo.dropDatabase(java.lang.String) throws com.mongodb.MongoException"
 [4] "public java.lang.String com.mongodb.Mongo.debugString()"
 [5] "public java.lang.String com.mongodb.Mongo.getConnectPoint()"
 [6] "public java.util.List com.mongodb.Mongo.getAllAddress()"
 [7] "public void com.mongodb.Mongo.setWriteConcern(com.mongodb.WriteConcern)"
 [8] "public com.mongodb.WriteConcern com.mongodb.Mongo.getWriteConcern()"
 [9] "public com.mongodb.ServerAddress com.mongodb.Mongo.getAddress()"
[10] "public void com.mongodb.Mongo.close()"
[11] "public static com.mongodb.DB com.mongodb.Mongo.connect(com.mongodb.DBAddress)"
[12] "public java.lang.String com.mongodb.Mongo.getVersion()"
[13] "public final native void java.lang.Object.wait(long) throws java.lang.InterruptedException"
[14] "public final void java.lang.Object.wait(long,int) throws java.lang.InterruptedException"
[15] "public final void java.lang.Object.wait() throws java.lang.InterruptedException"
[16] "public boolean java.lang.Object.equals(java.lang.Object)"
[17] "public java.lang.String java.lang.Object.toString()"
[18] "public native int java.lang.Object.hashCode()"
[19] "public final native java.lang.Class java.lang.Object.getClass()"
[20] "public final native void java.lang.Object.notify()"
[21] "public final native void java.lang.Object.notifyAll()"

As a non-Java programmer, that means very little to me. Instead, I typed m$, hit the tab key a couple of times and saw this:

m$MAJOR_VERSION       m$dropDatabase(       m$setWriteConcern(    m$connect(            m$equals(             m$notify()
m$MINOR_VERSION       m$debugString()       m$getWriteConcern()   m$getVersion()        m$toString()          m$notifyAll()
m$getDB(              m$getConnectPoint()   m$getAddress()        m$wait(               m$hashCode()
m$getDatabaseNames()  m$getAllAddress()     m$close()             m$wait()              m$getClass()

That’s much more useful – I recognise those methods. Let’s try connecting with the citeulike database:

db <- m$getDB("citeulike")
print(db)
# [1] "Java-Object{citeulike}"

Progress, no errors, it’s all good. Using the same approach (type db$ and hit tab), I saw this:

db$requestStart()             db$getCollectionFromString(   db$getLastError(              db$toString()                 db$hashCode()
db$requestDone()              db$doEval(                    db$isAuthenticated()          db$getName()                  db$getClass()
db$requestEnsureConnection()  db$eval(                      db$addUser(                   db$setReadOnly(               db$notify()
db$dropDatabase()             db$getStats()                 db$getPreviousError()         db$command(                   db$notifyAll()
db$setWriteConcern(           db$getCollectionNames()       db$resetError()               db$authenticate(
db$getWriteConcern()          db$collectionExists(          db$forceError()               db$wait(
db$getCollection(             db$resetIndexCache()          db$getMongo()                 db$wait()
db$createCollection(          db$getLastError()             db$getSisterDB(               db$equals(

Which led me to believe that I could access the articles collection like this:

col <- db$getCollection("articles")
print(col)
# [1] "Java-Object{articles}"
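The $ shortcut hides the Java type signatures. For comparison, here is roughly how the same three steps look with the lower-level .jnew()/.jcall() functions, where the return type has to be spelled out in JNI notation. The helper name get_collection is hypothetical (my own, for illustration), and the return types are taken from the method listing above and the driver's DB class.

```r
library(rJava)

# Hypothetical helper (not from the driver or rJava): connect, pick a
# database and return a collection, using explicit JNI return types.
get_collection <- function(host, db_name, coll_name) {
  m <- .jnew("com/mongodb/Mongo", host)
  # "Lcom/mongodb/DB;" is the JNI spelling of the com.mongodb.DB return type
  db <- .jcall(m, "Lcom/mongodb/DB;", "getDB", db_name)
  # getCollection() returns a com.mongodb.DBCollection
  .jcall(db, "Lcom/mongodb/DBCollection;", "getCollection", coll_name)
}
```

With the JVM started and the jar on the classpath, get_collection("localhost", "citeulike", "articles") should be equivalent to the col object above. The $ form is more convenient at the console; .jcall() is more explicit about which method is being called.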

You get the idea: the Java method names largely mirror the MongoDB shell commands. Let’s fetch the first article:

article <- col$findOne()
article <- article$toString()

Success! The toString() method converts the article to a JSON string. Now all that’s left is to get that into an R data structure:

article <- fromJSON(article)
article$title
# [1] "A computational genomics pipeline for prokaryotic sequencing projects"
article$authors
#  [1] "Andrey O. Kislyuk"    "Lee S. Katz"          "Sonia Agrawal"
#  [4] "Matthew S. Hagen"     "Andrew B. Conley"     "Pushkala Jayaraman"
#  [7] "Viswateja Nelakuditi" "Jay C. Humphrey"      "Scott A. Sammons"
# [10] "Dhwani Govil"         "Raydel D. Mair"       "Kathleen M. Tatti"
# [13] "Maria L. Tondella"    "Brian H. Harcourt"    "Leonard W. Mayer"
# [16] "I. King Jordan"
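Once you have several documents as JSON strings, a data frame is usually the next stop. A minimal sketch, assuming each document has title and authors fields like the article above; the two JSON strings are invented placeholders standing in for the output of toString().

```r
library(rjson)

# Invented placeholder documents; real ones would come from toString()
json_docs <- c(
  '{"title": "First paper", "authors": ["A. Author", "B. Author"]}',
  '{"title": "Second paper", "authors": ["C. Author"]}'
)

# Parse each JSON string into an R list
parsed <- lapply(json_docs, fromJSON)

# One row per article: its title and the number of authors
summary_df <- data.frame(
  title     = vapply(parsed, function(a) a$title, character(1)),
  n_authors = vapply(parsed, function(a) length(a$authors), integer(1)),
  stringsAsFactors = FALSE
)
print(summary_df)
```

Nested fields such as authors don't fit a flat table directly, which is why this sketch reduces them to a count.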

Let the statistical analysis of your CiteULike library (or any other data from MongoDB) begin.

