Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Suppose you would like to publish some data, for example to accompany a journal article. One way would be to put a CSV file on your website and share the URL with your colleagues. However, CSV has many limitations: it only works for tabular structures, has limited type safety (pretty much everything gets coerced into strings), and leads to loss of numeric precision.
There are many alternative data interchange formats, each with its own benefits and limitations. For example, JSON is widely supported and can be parsed in almost any language, but it can be verbose and slow. A binary format such as Protocol Buffers is more efficient, but many users might not know how to parse it. You could even use save or saveRDS in R to share the native R structures, but this limits your audience to R users.
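The type and precision losses mentioned above are easy to reproduce locally. The following is a minimal sketch in base R that round-trips a small data frame through CSV and through RDS. The data frame `df` and its columns `x` and `when` are made up for illustration and do not come from any dataset discussed here.

```r
# Hypothetical example data: a numeric column and a Date column.
df <- data.frame(
  x = c(pi, 1 / 3),
  when = as.Date(c("2014-01-01", "2014-06-30"))
)

# Round trip through CSV.
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
csv_back <- read.csv(csv_file)

# Round trip through R's native serialization.
rds_file <- tempfile(fileext = ".rds")
saveRDS(df, rds_file)
rds_back <- readRDS(rds_file)

class(csv_back$when)          # the Date class is not restored from CSV
identical(csv_back$x, df$x)   # compares the numbers after the CSV round trip
identical(rds_back, df)       # compares the whole object after the RDS round trip
```

The CSV copy comes back with a different column class and with numbers that were written using a limited number of significant digits, while the RDS copy preserves the object but can only be read from R.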
Retrieving dynamic data
What we really need is a method to publish the data itself rather than some representation of the data in a particular format. With OpenCPU you can publish R objects (including datasets) in a way that lets the clients select the format and formatting options for retrieving the dataset. This is implemented using native R functionality to include arbitrary data/objects in packages, and standard R functions for exporting these data. For example, the CRAN package MASS includes a dataset called bacteria:
    library(MASS)
    data(bacteria)
    print(bacteria)
Via OpenCPU, the dataset can be downloaded by anyone, in any of several formats:
Format | Export Function | URL (short) |
---|---|---|
text | print | cran.ocpu.io/MASS/data/bacteria/print |
CSV | write.csv | cran.ocpu.io/MASS/data/bacteria/csv |
TSV | write.table | cran.ocpu.io/MASS/data/bacteria/tab |
JSON | jsonlite::asJSON | cran.ocpu.io/MASS/data/bacteria/json |
Protocol Buffers | RProtoBuf::serialize_pb | cran.ocpu.io/MASS/data/bacteria/pb |
RData | save | cran.ocpu.io/MASS/data/bacteria/rda |
RDS | saveRDS | cran.ocpu.io/MASS/data/bacteria/rds |
ascii R | dput | cran.ocpu.io/MASS/data/bacteria/ascii |
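The URLs in the table follow a regular pattern of host, package, dataset and format, so a client can assemble them programmatically. The sketch below is only an illustration: `ocpu_data_url` and `fetch_csv` are hypothetical helpers, not part of OpenCPU, and the `https://` scheme is an assumption since the table lists the URLs without one.

```r
# Hypothetical helper (not part of OpenCPU): build the URL of a dataset
# from a CRAN package, following the pattern shown in the table.
# The https scheme is an assumption.
ocpu_data_url <- function(pkg, dataset, format, host = "cran.ocpu.io") {
  sprintf("https://%s/%s/data/%s/%s", host, pkg, dataset, format)
}

csv_url <- ocpu_data_url("MASS", "bacteria", "csv")
tab_url <- ocpu_data_url("MASS", "bacteria", "tab")

# Downloading needs network access and a reachable server, so it is
# wrapped in a function here instead of being run directly.
fetch_csv <- function(pkg, dataset) {
  read.csv(ocpu_data_url(pkg, dataset, "csv"))
}
```

With a working connection, `fetch_csv("MASS", "bacteria")` would return the dataset as a data frame parsed from the CSV export.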
The client can also control formatting options by passing HTTP parameters. These parameters map directly to function arguments for the respective export function in the table above. Some random examples:
R Export Call | Equivalent URL on Public OpenCPU Server |
---|---|
write.csv(bacteria, row.names=TRUE) | cran.ocpu.io/MASS/data/bacteria/csv?row.names=true |
jsonlite::asJSON(Boston, digits=4) | cran.ocpu.io/MASS/data/Boston/json?digits=4 |
jsonlite::asJSON(Boston, dataframe="columns") | cran.ocpu.io/MASS/data/Boston/json?dataframe=columns |
jsonlite::asJSON(Boston, pretty=FALSE) | cran.ocpu.io/MASS/data/Boston/json?pretty=false |
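To make the mapping from function arguments to HTTP parameters concrete, here is a small sketch that turns named R arguments into a query string of the kind shown in the table. The helper `with_options` is hypothetical and written only for this illustration, the `https://` scheme is an assumption, and the helper does no URL escaping, so it is only suitable for simple values like the ones above.

```r
# Hypothetical helper: append formatting options as HTTP query parameters.
with_options <- function(base_url, ...) {
  opts <- list(...)
  if (length(opts) == 0) return(base_url)
  # Logical values are written in lower case, as in the table's URLs.
  values <- vapply(opts, function(v) {
    if (is.logical(v)) tolower(as.character(v)) else as.character(v)
  }, character(1))
  query <- paste(names(opts), values, sep = "=", collapse = "&")
  paste0(base_url, "?", query)
}

csv_with_rownames <- with_options(
  "https://cran.ocpu.io/MASS/data/bacteria/csv",
  row.names = TRUE
)
compact_json <- with_options(
  "https://cran.ocpu.io/MASS/data/Boston/json",
  digits = 4, pretty = FALSE
)
```

Values that contain spaces or reserved characters would need escaping first, for example with `utils::URLencode(value, reserved = TRUE)`.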
Creating a data package
To start publishing your own dynamic data, you need to put your data objects in an R package following the standard guidelines documented in section 1.1.6 of Writing R Extensions. This might sound cumbersome, but once you get the hang of it, it only takes a few seconds. You’ll realize that packages are actually a beautiful, standardized and well-tested container format for R objects and much more. Have a look at the data folder in the opencpu/appdemo package for some examples.
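As a rough illustration of how little is involved, the sketch below writes a bare-bones data package skeleton into a temporary directory using base R only. The package name `mypackage`, the object `myobject`, and the contents of the DESCRIPTION file are all placeholders, and a real package will likely need more DESCRIPTION fields (such as an author and maintainer) before it passes R CMD check.

```r
# Hypothetical names throughout: package "mypackage", object "myobject".
pkg_dir <- file.path(tempdir(), "mypackage")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# A minimal DESCRIPTION file with placeholder values.
writeLines(c(
  "Package: mypackage",
  "Title: Example Data Package",
  "Version: 0.1.0",
  "Description: Contains an example dataset.",
  "License: CC0",
  "LazyData: true"
), file.path(pkg_dir, "DESCRIPTION"))

# An empty NAMESPACE file; a data-only package exports no functions.
file.create(file.path(pkg_dir, "NAMESPACE"))

# The data object itself; the file name matches the object name.
myobject <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))
save(myobject, file = file.path(pkg_dir, "data", "myobject.rda"))

list.files(pkg_dir, recursive = TRUE)
```

From there the directory can be installed like any other source package, for example with R CMD INSTALL, and the object becomes available through `data(myobject)`.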
After creating and installing your package on your local R, test it using the OpenCPU single user server:
    library(opencpu)
    opencpu$browse("/library/mypackage/data")
    opencpu$browse("/library/mypackage/data/myobject")
Publishing dynamic data on ocpu.io
To make your data available through the public OpenCPU server and ocpu.io, all you need to do is put your package up on GitHub. OpenCPU requires the name of the GitHub repository to match the name of the R package it contains. Use devtools to test if your package is working:
    library(devtools)
    # Current devtools versions expect the repository as "username/pkgname".
    install_github("username/pkgname")
If this succeeds, you’re good to go. Navigate to username.ocpu.io/pkgname/data, where username is your GitHub login. By default, the OpenCPU public server updates packages installed from GitHub every 24 hours. However, the GitHub webhook can be used to update the package immediately every time a commit is pushed to GitHub.
Publishing dynamic data on your own server
OpenCPU does not lock you into some commercial hosting service. Your data is stored on GitHub in a standard format under your control. The ocpu.io public server is there for your convenience. You can also install your own OpenCPU cloud server to publish data at e.g. http://opencpu.yourserver.com/ocpu/library/pkgname/data/myobject. No need to put anything on GitHub, just install the package in R on the server.