Deploying a scoring engine for predictive analytics with OpenCPU
TLDR/abstract: See the tvscore demo app or this jsfiddle for all of this in action.
This post explains how to use the OpenCPU system to set up a scoring engine for calculating real-time predictions. In our example we use the predict.gam function from the mgcv package to make predictions based on a generalized additive model. The entire process consists of four steps:
- Build a model
- Create an R package containing the model and a scoring function
- Install the package on your OpenCPU server
- Remotely call the scoring function through the OpenCPU API
Let’s get started!
Step 1: creating a model
For this example, we use data from the General Social Survey, which is a very rich dataset on demographic characteristics and attitudes of United States residents. To load the data in R:
    #Data info: http://www3.norc.org/GSS+Website/Download/SPSS+Format/
    download.file("http://publicdata.norc.org/GSS/DOCUMENTS/OTHR/2012_spss.zip", destfile="2012_spss.zip")
    unzip("2012_spss.zip")
    GSS <- foreign::read.spss("GSS2012.sav", to.data.frame=TRUE)
The GSS data has 1974 rows for 816 variables. To keep our example simple, we create a model with only 2 predictor variables. The code below fits a GAM which predicts the average number of hours per day that a person watches TV, based on their age and marital status. In these data tvhours and age are numeric variables, whereas marital is a categorical (factor) variable with levels MARRIED, SEPARATED, DIVORCED, WIDOWED and NEVER MARRIED.
    #Variable info: http://www3.norc.org/GSS+Website/Browse+GSS+Variables/Mnemonic+Index/
    library(mgcv)
    mydata <- na.omit(GSS[c("age", "tvhours", "marital")])
    tv_model <- gam(tvhours ~ s(age, by=marital), data = mydata)
The predict function is used to score data against the model. We test with some random cases:
    newdata <- data.frame(
      age = c(24, 54, 32, 75),
      marital = c("MARRIED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
    )
    predict(tv_model, newdata = newdata)

           1        2        3        4
    3.022650 3.693640 1.556342 3.665077
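For readers who want to try the scoring step without downloading the GSS file, the following is a minimal sketch on simulated data. The data frame sim and its made-up relationship between age and TV hours are assumptions for illustration only; they are not the GSS data, so the predictions will differ from those of tv_model. The sketch also uses the se.fit argument of predict.gam, which returns standard errors alongside the fitted values.

```r
library(mgcv)

# Simulated stand-in for the GSS columns (made-up data, for illustration only)
set.seed(1)
n <- 500
marital_levels <- c("MARRIED", "SEPARATED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
sim <- data.frame(
  age = sample(18:89, n, replace = TRUE),
  marital = factor(sample(marital_levels, n, replace = TRUE), levels = marital_levels)
)
sim$tvhours <- pmax(0, 2 + 0.02 * sim$age + rnorm(n))

# Same model formula as in the post, fitted to the simulated data
sim_model <- gam(tvhours ~ s(age, by = marital), data = sim)

# New cases; marital is a factor with the same levels as the training data
newdata <- data.frame(
  age = c(24, 54, 32, 75),
  marital = factor(c("MARRIED", "DIVORCED", "WIDOWED", "NEVER MARRIED"),
                   levels = marital_levels)
)

# se.fit = TRUE returns a list with fitted values ($fit) and standard errors ($se.fit)
pred <- predict(sim_model, newdata = newdata, se.fit = TRUE)
scored <- cbind(newdata, tv = as.numeric(pred$fit), se = as.numeric(pred$se.fit))
print(scored)
```

Declaring marital as a factor with explicit levels keeps the new cases aligned with the levels the model was trained on.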
All seems good, this completes step 1. But just to get a sense of what our example model actually looks like before we start scoring, a simple viz:
    library(ggplot2)
    qplot(age, predict(tv_model), color=marital, geom="line", data=mydata) +
      ggtitle("gam(tvhours ~ s(age, by=marital))") +
      ylab("Average hours of TV per day")
Seems like people that get married start watching less TV, who would have thought 🙂 In a real study we should probably tune the smoothing a bit and add parenting as predictor (also in the data), but for simplicity we’ll stick with this model for now.
Step 2: creating a package
In order to score cases via the OpenCPU API, we need to turn the model into an R package. Making R packages is very easy these days, especially when using RStudio. Our package needs to contain at least two things: the tv_model object that we created above, and a wrapper function that calls out to predict(tv_model, ...). You can make the wrapper as simple or sophisticated as you like, based on the type of input and output data that you want to send/receive from your scoring engine.
The tvscore package that is available from the opencpu github repository is an example of such a package. The important thing to note is that the tv_model object is included in the data directory of the package. Saving objects to a file is done using the save function in R:
    #Store the model as a data object
    save(tv_model, file="data/tv_model.rda")
To load the model with the package, we can either set LazyData: true in the package DESCRIPTION, or manually load it using the data() function in R. For details on including data in R packages, see section 1.1.6 of Writing R Extensions.
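As a quick sanity check of this mechanism, the round trip through an .rda file can be tried outside a package. The sketch below uses a small lm fit on the built-in cars data as a stand-in for tv_model (an assumption, chosen so that it runs without the GSS data) and a temporary file instead of the package data directory.

```r
# Stand-in model (assumption: used here only because it needs no external data)
demo_model <- lm(dist ~ speed, data = cars)

# Save the object under its own name, as save() does for tv_model
rda_file <- file.path(tempdir(), "demo_model.rda")
save(demo_model, file = rda_file)

# Load into a fresh environment to mimic a clean session
env <- new.env()
loaded_names <- load(rda_file, envir = env)

# load() returns the names of the restored objects
print(loaded_names)

# The restored model gives the same prediction as the original
same <- all.equal(
  predict(env$demo_model, newdata = data.frame(speed = 10)),
  predict(demo_model, newdata = data.frame(speed = 10))
)
print(same)
```

Note that save stores the object together with its name, which is why the package can refer to tv_model directly once the file has been loaded.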
Finally, the package contains a scoring function called tv, which calls out to predict.gam. The scoring function is what clients will call remotely through the OpenCPU API. We use a flexible function that accepts either a data frame or the path to a CSV file as input:
    tv <- function(input){
      #input can either be csv file or data
      newdata <- if(is.character(input) && file.exists(input)){
        read.csv(input)
      } else {
        as.data.frame(input)
      }
      stopifnot("age" %in% names(newdata))
      stopifnot("marital" %in% names(newdata))

      newdata$age <- as.numeric(newdata$age)

      #tv_model is included with the package
      newdata$tv <- predict.gam(tv_model, newdata = newdata)
      return(newdata)
    }
Note how the function does a bit of input validation by checking that the age and marital columns are present. As usual, the tv function is saved in the R directory of the source package. Install the package locally to verify that it works as expected in a clean R session. To install our example package from github, restart R and do:
    #install the tv score package
    library(devtools)
    install_github("opencpu/tvscore")
First we test the tv function with data frame input:
    #test scoring with data frame input
    library(tvscore)
    newdata <- data.frame(
      age = c(24, 54, 32, 75),
      marital = c("MARRIED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
    )
    tv(input = newdata)
And then we test if it works for CSV data:
    #test scoring with CSV file input
    setwd(tempdir())
    write.csv(newdata, "testdata.csv")
    library(tvscore)
    tv(input = "testdata.csv")
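One detail worth knowing when preparing CSV input: write.csv stores row names by default, so reading the file back yields an extra column (named X by read.csv). The tv function only requires that age and marital are present, so this is harmless there, but passing row.names = FALSE keeps the file clean. A small self-contained check, using temporary file names chosen for illustration:

```r
newdata <- data.frame(
  age = c(24, 54, 32, 75),
  marital = c("MARRIED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
)

# Default: row names are written as an unnamed first column
with_rownames <- file.path(tempdir(), "with_rownames.csv")
write.csv(newdata, with_rownames)
cols_default <- names(read.csv(with_rownames))

# row.names = FALSE writes only the data columns
without_rownames <- file.path(tempdir(), "without_rownames.csv")
write.csv(newdata, without_rownames, row.names = FALSE)
cols_clean <- names(read.csv(without_rownames))

print(cols_default)
print(cols_clean)
```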
If all of this works as expected, the package is ready to be deployed on your OpenCPU server!
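Before deploying, the input checks could be made stricter so that remote clients get a clear error message for bad input. The helper below is a hypothetical sketch and not part of the tvscore package: the name check_input and its error messages are made up for illustration, and the allowed levels are the five marital statuses listed in step 1.

```r
# Hypothetical helper, not part of tvscore: stricter checks before scoring
check_input <- function(newdata) {
  allowed <- c("MARRIED", "SEPARATED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
  newdata <- as.data.frame(newdata, stringsAsFactors = FALSE)

  missing_cols <- setdiff(c("age", "marital"), names(newdata))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }

  # as.numeric() turns unparseable strings into NA (with a warning), so test for NA
  age <- suppressWarnings(as.numeric(newdata$age))
  if (anyNA(age)) {
    stop("Column 'age' must be numeric without missing values")
  }

  bad <- setdiff(unique(as.character(newdata$marital)), allowed)
  if (length(bad) > 0) {
    stop("Unknown marital status: ", paste(bad, collapse = ", "))
  }

  newdata$age <- age
  newdata$marital <- factor(as.character(newdata$marital), levels = allowed)
  newdata
}

# Valid input: age arrives as text, as it might from a web form
ok <- check_input(data.frame(age = c("24", "54"), marital = c("MARRIED", "DIVORCED")))
str(ok)

# Invalid input produces an informative error message
msg <- tryCatch(
  check_input(data.frame(age = 30, marital = "SINGLE")),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Failing early with a specific message is a design choice: errors raised inside predict.gam for unknown factor levels are harder for a remote client to interpret.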
Step 3: Install the package on the server
To deploy your scoring engine, simply install the package on your OpenCPU server. If you are running the OpenCPU cloud server, make sure to install your package as root. For example if you built the package into a tar.gz archive:
sudo -i R CMD INSTALL tvscore_0.1.tar.gz
To install our example package straight from R, either on an OpenCPU cloud server or OpenCPU single-user server:
    #install the tv score package
    library(devtools)
    install_github("opencpu/tvscore")
If you are running the cloud server, you are done with this step. If you are running the single-user server, start OpenCPU using:
    library(opencpu)
    opencpu$browse()
To verify that the installation succeeded, open your browser and navigate to the /ocpu/library/tvscore path on the OpenCPU server. Also have a look at /ocpu/library/tvscore/R/tv and /ocpu/library/tvscore/man/tv.
Step 4: Scoring through the API
Once the package is installed on the server, we can remotely call the tv function via the OpenCPU API. In the examples below we use the public demo server: https://public.opencpu.org/. For example, to call the tv function with curl using basic JSON RPC:
    curl https://public.opencpu.org/ocpu/library/tvscore/R/tv/json \
      -H "Content-Type: application/json" \
      -d '{"input" : [ {"age":26, "marital" : "MARRIED"}, {"age":41, "marital":"DIVORCED"}, {"age":53, "marital":"NEVER MARRIED"} ]}'
Note how the OpenCPU server automatically converts input and output data from/to JSON using jsonlite. See the API docs for more details on this process. Alternatively we can batch score by posting a CSV file (example data):
    curl https://public.opencpu.org/ocpu/library/tvscore/R/tv -F "input=@testdata.csv"
The response to a successful HTTP POST request contains the location of the output data in the Location header. For example, if the call returned an HTTP 201 with Location header /ocpu/tmp/x036bf30e82, the client can retrieve the output data in various formats using a subsequent HTTP GET request:
    curl https://public.opencpu.org/ocpu/tmp/x036bf30e82/R/.val/csv
    curl https://public.opencpu.org/ocpu/tmp/x036bf30e82/R/.val/json
    curl https://public.opencpu.org/ocpu/tmp/x036bf30e82/R/.val/tab
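A client would typically build these URLs from the Location header. The helper below is a hypothetical sketch (output_urls is not an OpenCPU function); it only concatenates strings following the /R/.val/{format} pattern used in the curl calls and makes no HTTP request.

```r
# Hypothetical helper: build retrieval URLs for the return value of a call
output_urls <- function(server, location, formats = c("csv", "json", "tab")) {
  # Drop trailing slashes so the pieces join without producing "//"
  server <- sub("/+$", "", server)
  location <- sub("/+$", "", location)
  urls <- paste0(server, location, "/R/.val/", formats)
  names(urls) <- formats
  urls
}

urls <- output_urls("https://public.opencpu.org/", "/ocpu/tmp/x036bf30e82")
print(urls)
```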
This completes our scoring engine. Using these steps, clients from any language can remotely score cases by calling the tv function using standard HTTP and JSON libraries.
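To illustrate the JSON side of such a client, the sketch below uses jsonlite locally, without contacting a server. It builds a request body of the same shape as the one in the curl example and parses a response of the expected shape into a data frame. The tv values in the response string are placeholders made up for illustration, not real model output, and the exact conversion settings used by the OpenCPU server may differ.

```r
library(jsonlite)

cases <- data.frame(
  age = c(26, 41, 53),
  marital = c("MARRIED", "DIVORCED", "NEVER MARRIED")
)

# A data frame serializes to an array of records by default;
# wrapping it in a named list gives the {"input": [...]} body
body <- toJSON(list(input = cases))
print(body)

# Parsing an array of records yields a data frame again
# (the tv values below are made-up placeholders)
response <- '[
  {"age": 26, "marital": "MARRIED", "tv": 1.1},
  {"age": 41, "marital": "DIVORCED", "tv": 2.2},
  {"age": 53, "marital": "NEVER MARRIED", "tv": 3.3}
]'
scored <- fromJSON(response)
str(scored)
```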
Extra credit: performance optimization
When using a scoring engine based on OpenCPU in production, it is worthwhile to configure your server for performance. In particular, we can add our package to the preload field in the /etc/opencpu/server.conf file on the OpenCPU cloud server. This will automatically load (but not attach) the package when the OpenCPU server starts, which eliminates package loading time from the individual scoring requests. In our example this is important because tvscore depends on the mgcv package, which takes about 2 seconds to load.
Note that R does not load LazyData objects when the package loads. Hence, preload in combination with lazy loading of data might not have the desired effect. When using preload, make sure to design your package such that all data gets loaded when the package loads (example).
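One way to do this is an .onLoad hook that loads the dataset into the package namespace. The following is a sketch of that pattern, assuming the model is stored as data/tv_model.rda and that LazyData is not enabled; the file name R/onLoad.R is a convention, not a requirement. The hook only has an effect when it is defined inside a package, so running it in a plain R session merely defines the function.

```r
# R/onLoad.R (sketch): load tv_model when the package namespace is loaded.
# Assumes data/tv_model.rda exists in the package and LazyData is not set.
.onLoad <- function(libname, pkgname) {
  # When this function is defined inside a package, parent.env(environment())
  # is the package namespace, so tv_model becomes visible to the tv function
  utils::data("tv_model", package = pkgname, envir = parent.env(environment()))
}
```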
Finally, in production you might want to tweak the timelimit.post (timeout), rlimit.as (mem limit), rlimit.fsize (disk limit) and rlimit.nproc (parallel process limit) options in /etc/opencpu/server.conf to fit your needs. Also see the server manual on this topic.
Bonus: creating an OpenCPU app
By including web pages in the /inst/www/ directory of the source package, we can turn our scoring engine into a standalone web application. The tvscore example package contains a simple web interface that makes use of the opencpu.js JavaScript client to interact with R via OpenCPU in the browser. Navigate to /ocpu/library/tvscore/www/ on the public demo server to see it in action!
To install and run the same app in your local R session, use:
    #Install the app
    library(devtools)
    install_github("opencpu/tvscore")

    #Load the app
    library(opencpu)
    opencpu$browse("/library/tvscore/www")
We can also call the OpenCPU server from an external website using cross domain ajax requests (CORS). See this jsfiddle for a simple example that calls the public server using the ocpu.rpc function from opencpu.js.