Deploying R models in SQL Server
As an R user who is building models and analysing data, one of the key challenges is how to make those results available to those who need it. After all, data science is about making better decisions, and your results need to get into the hands of the people who make those decisions.
For reporting, there are many options from writing Excel files to rmarkdown documents and shiny apps. Many businesses already have great reporting with a business intelligence (BI) tool. For them, it is preferable that you present your results alongside a number of other critical business metrics. Moreover, your results need to be refreshed daily. In this situation, you might be working with SQL developers to integrate your work. The question is, what is the best way to deliver R code to the BI team?
In this blog post, we will be looking at the specific case of deploying a predictive model, written in R, to a Microsoft SQL Server database for consumption by a BI tool. We’ll look at some of the different options to integrate R, from in-database R services, to pushing with ODBC or picking up flat files with SSIS.
The Problem
Flight delay planning
To demonstrate, we'll use the familiar flights dataset from the nycflights13 package and imagine that we are airport planners who want to test various scenarios related to flight delays. Our data contains the departure delay of all flights leaving the New York airports (JFK, LGA, and EWR) in 2013. I'm running this code on my Windows 10 laptop, where I have a local SQL Server 2017 instance running with a database called ml. If you want to reproduce the code you'll need your own SQL Server setup (you can install it locally) and to push the flights table there (a minimal sketch of doing this with DBI follows the sample below). Here's a selection of columns:
SELECT TOP(5) flight, origin, dest, sched_dep_time, carrier, time_hour, dep_delay
FROM flights
| flight | origin | dest | sched_dep_time | carrier | time_hour | dep_delay |
|---|---|---|---|---|---|---|
| 1545 | EWR | IAH | 515 | UA | 2013-01-01 05:00:00 | 2 |
| 1714 | LGA | IAH | 529 | UA | 2013-01-01 05:00:00 | 4 |
| 1141 | JFK | MIA | 540 | AA | 2013-01-01 05:00:00 | 2 |
| 725 | JFK | BQN | 545 | B6 | 2013-01-01 05:00:00 | -1 |
| 461 | LGA | ATL | 600 | DL | 2013-01-01 06:00:00 | -6 |
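As promised, here is a minimal sketch of pushing the flights table up with DBI and odbc (more on these packages below). The connection details will differ for your instance, and the id column is an assumption on my part: the simulation output later joins back to the flights on an id, so a simple row number is added here.

library(DBI)
library(nycflights13)

# Connect to the local SQL Server instance (adjust driver/server/database for your setup)
con <- dbConnect(odbc::odbc(),
                 driver   = "SQL Server",
                 server   = "localhost\\SQL17ML",
                 database = "ml")

# Add a row id so that simulation output can be joined back to the source flights,
# then write the table into the dbo schema
flights_tbl <- cbind(id = seq_len(nrow(flights)), flights)
dbWriteTable(con, Id(schema = "dbo", name = "flights"), flights_tbl, overwrite = TRUE)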
We’ll fit a statistical model for the departure delay, and run simulations for the delay of future flights. We want to capture the natural variation from day to day so a useful approach here is a mixed-effects model where each day is a random effect.
model <- lme4::lmer(
  dep_delay ~ 1 +
    (1 | date:origin) +
    carrier +
    origin +
    sched_dep_time +
    distance +
    week_day,
  data = data_train
)
This is a simple model for demonstration purposes. For example, it doesn't capture big delays (extreme values) well, but it will serve our purpose. The full model code and data prep are available at mangothecat/dbloadss, so we won't go through every line here.
To simulate delays on future flights we can call the simulate function. Here we'll run 10 different scenarios:
sim_delays <- simulate(model, nsim = 10, newdata = data_test)
The reason we're using simulate rather than predict is that we don't just want the most likely value for each delay; we want to sample from likely scenarios. That way we can report any aspect of the result that seems relevant. A data scientist will ask: “how can I predict dep_delay as accurately as possible?”. An airport manager will want to know “how often will the last flight of the day leave after midnight?”, or another question that you haven't thought of. This type of output lets the BI team address these sorts of questions.
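As a quick illustration of why the simulated draws are useful (a sketch, not part of the package, assuming delays are in minutes), each column of sim_delays is one simulated scenario, so a question like “what proportion of flights are delayed by more than an hour?” gets a distribution of answers rather than a single point estimate:

# One value per scenario: the share of flights delayed by more than an hour
over_an_hour <- colMeans(sim_delays > 60)
summary(over_an_hour)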
Package it up
At Mango, we believe that the basic unit of work is a package. A well-written package will be self-documenting, have a familiar structure, and include unit tests. All behind-the-scenes code can be written into unexported functions, and user-facing code lives in a small number (often one) of exported functions. This single entry point should be designed so that someone who is not an experienced R user can run the code and, if anything goes wrong, get messages that are as informative as possible. R is particularly friendly for building packages, with the excellent devtools automating most of it and the wonderfully short R Packages book by Hadley Wickham guiding you through it all.
The code for this blog post lives in the dbloadss package available on GitHub. For the flights model a single function is exported, simulate_departure_delays, which is documented to explain exactly what it expects as input and what it will output. The entire model runs with the single line:
output_data <- simulate_departure_delays(input_data, nsim = 20)
where input_data is prepared from the database and output_data will be pushed/pulled back to the database.
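As a rough illustration of what that entry point might guard against (a sketch, not the actual dbloadss code, and the column names here are just examples), the exported function can validate its inputs up front so that a non-R user gets an informative error rather than a cryptic one:

simulate_departure_delays <- function(input_data, nsim = 20, split_date = "2013-07-01") {
  # Fail fast, with messages a non-R user can act on
  if (!is.data.frame(input_data)) stop("input_data must be a data frame")
  if (!is.numeric(nsim) || nsim < 1) stop("nsim must be a positive number")

  required_cols <- c("origin", "carrier", "sched_dep_time", "time_hour", "dep_delay")
  missing_cols  <- setdiff(required_cols, names(input_data))
  if (length(missing_cols) > 0) {
    stop("input_data is missing columns: ", paste(missing_cols, collapse = ", "))
  }

  # ... unexported functions handle the data prep, model fit and simulation ...
}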
Connecting to the Database
Push, Pull, or Pickup?
Once the model has been packaged and the interface decided, it remains to decide how to actually run the code. With SQL Server there are three options:
- Run the model from R and push the results to SQL Server using an ODBC connection.
- Call the model from SQL Server via a stored procedure that runs an R script with R Services, and pull the results back.
- Invoke an Rscript from SSIS and pick up flat files (csv).
Which you choose will depend on a number of factors. We’ll take some time to look at each one.
The Push (SQL from R)
The best way to talk to a database from R is to use the DBI database interface package. The DBI project has been around for a while but received a boost with R Consortium funding. It provides a common interface to many databases, with specific backend packages handling each database type. For SQL Server we're going to use the odbc backend. It has great documentation, and since Microsoft released ODBC drivers for Linux it's a cinch to set up from most operating systems.
Let’s get the flights data from SQL Server:
library(DBI)
con <- dbConnect(odbc::odbc(),
driver = "SQL Server",
server = "localhost\\SQL17ML",
database = "ml")
flights <- dbReadTable(con, Id(schema="dbo", name="flights"))
I've included the explicit schema argument because it's a recent addition to DBI and it can be a sticking point for complicated database structures.
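If you only need some of the columns, the same connection can run an arbitrary query instead of reading the whole table (a small aside, not required for the example; the column list here is just illustrative):

# Pull just the columns the model needs rather than the whole table
flights_subset <- dbGetQuery(con, "
  SELECT id, origin, dest, carrier, sched_dep_time, time_hour, dep_delay
  FROM dbo.flights
")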
Now we run the model as above:
library(dbloadss)
output_data <- simulate_departure_delays(flights, nsim = 20, split_date = "2013-07-01")
dim(output_data)
## [1] 3412360 3
So for 20 simulations, we have about 3.4 million rows of output! It's just a flight ID (for joining back to the source), a simulation ID, and a delay.
head(output_data)
##      id sim_id dep_delay
## 1 27005      1 -49.25918
## 2 27006      1  23.09865
## 3 27007      1  28.84683
## 4 27008      1 -43.68340
## 5 27009      1 -44.98220
## 6 27010      1  37.62463
We’ll do all further processing in the database so let’s push it back.
# Workaround for known issue https://github.com/r-dbi/odbc/issues/175
dbRemoveTable(con, name = Id(schema = "dbo", name = "flightdelays"))
odbctime <- system.time({
dbWriteTable(con,
name = Id(schema = "dbo", name = "flightdelays"),
value = output_data,
overwrite = TRUE)
})
odbctime
##    user  system elapsed
##   10.08    0.47  114.60
That took under 2 minutes. This post started life as a benchmark of write times from odbc vs RODBC, an alternative way to talk to SQL Server. The results are on the dbloadss README and suggest this would take several hours! RODBC is usually fine for reads but we recommend switching to odbc where possible.
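For comparison, an RODBC write looks something like the sketch below (connection string details will vary; this is not the exact benchmark code, which is in the dbloadss repository):

library(RODBC)

# RODBC builds its connection from an ODBC connection string
channel <- odbcDriverConnect(
  "driver={SQL Server};server=localhost\\SQL17ML;database=ml;trusted_connection=yes"
)

# sqlSave is RODBC's rough equivalent of dbWriteTable;
# fast = TRUE uses a parameterised INSERT rather than one statement per row
sqlSave(channel, output_data, tablename = "flightdelays",
        rownames = FALSE, fast = TRUE)

odbcClose(channel)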
It is relatively straightforward to push from R and this could run as a scheduled job from a server running R.
The Pull (R from SQL)
An alternative approach is to use the new features in SQL Server 2017 (and 2016) for calling out to R scripts from SQL. This is done via the sp_execute_external_script command, which we will wrap in a stored procedure. This method is great for SQL developers because they don't need to go outside their normal tool and they have greater control over exactly what is returned and where it goes.
A word of warning before we continue. There’s a line in the Microsoft docs:
Do not install R Services on a failover cluster. The security mechanism used for isolating R processes is not compatible with a Windows Server failover cluster environment.
that somewhat limits the ultimate use of this technique in production settings as a failover cluster is a very common configuration. Assuming this might get fixed, or perhaps it’s not an issue for you, let’s see how it works.
I'm running SQL Server 2017 with Machine Learning Services. This installs its own version of R that you can access. To do so you have to enable "the execution of scripts with certain remote language extensions". Then you need to install the dbloadss package somewhere that this version of R can see. This can require admin privileges or, alternatively, you can set .libPaths() somewhere in the stored proc.
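For example, if dbloadss has been installed into a shared library that the SQL Server R service can read, the first lines of the embedded script could point R at it (the path here is just a placeholder):

# Hypothetical shared package library visible to the in-database R service
.libPaths(c("C:/RLibrary", .libPaths()))
library(dbloadss)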
The following stored procedure is what we’ll add for our flight delays model:
use [ml];
DROP PROC IF EXISTS r_simulate_departure_delays;
GO
CREATE PROC r_simulate_departure_delays(
@nsim int = 20,
@split_date date = '2013-07-01')
AS
BEGIN
EXEC sp_execute_external_script
@language = N'R'
, @script = N'
library(dbloadss)
output_data <- simulate_departure_delays(input_data, nsim = nsim_r,
split_date = split_date_r)
'
, @input_data_1 = N' SELECT * FROM [dbo].[flights];'
, @input_data_1_name = N'input_data'
, @output_data_1_name = N'output_data'
, @params = N'@nsim_r int, @split_date_r date'
, @nsim_r = @nsim
, @split_date_r = @split_date
WITH RESULT SETS ((
"id" int not null,
"sim_id" int not null,
"dep_delay" float not null));
END;
GO
The query that goes into @input_data_1 becomes a data frame in your R session. The main things to note are that you can pass in as many parameters as you like, but only one data frame. Your R script assigns the results to a nominated output data frame, and this is picked up and returned to SQL Server.
I believe it's very important that the R script inside the stored procedure does not get too complicated. It's much better to use your single entry point and put the complex code in a package where it can be unit tested and documented.
We then call the stored procedure with another query:
INSERT INTO [dbo].[flightdelays]
EXEC [dbo].[r_simulate_departure_delays] @nsim = 20
The performance of this method seems to be good. For write speeds in our tests it was even faster than pushing with odbc, although it's harder to benchmark in the flights example because it includes running the simulation.
Overall, were it not for the issue with failover clusters, I would be recommending this as the best way to integrate R with SQL Server. As it stands, you'll have to evaluate it on your setup.
The Pickup (R from SSIS)
The final method is to use SSIS to treat running the R model as an ETL process. To keep things simple, we use SSIS to output the input data as a flat file (csv), kick off an R process to run the job, and pick up the results from another csv. This means that we'll be making our R code run as a command line tool and using a csv "air gap".
Running R from the command line is relatively straightforward. To handle parameters we've found the best way is to use argparser, with an honourable mention to optparse. Check out Mark's blog post series on building R command line applications. After you've parsed the arguments, everything is essentially the same as pushing straight to the database, except that you write to csv at the end. SSIS then picks up the csv file and loads it into the database. Performance is generally not as good as the other methods, but in our experience it was close enough, especially for a batch job.
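To give a flavour, the argument handling in such a script might look roughly like this with argparser (a sketch only; the defaults mirror the run shown below, and the rest of the script is elided):

#!/usr/bin/env Rscript
library(argparser)

# Define the command line interface
p <- arg_parser("Simulate flight departure delays and write the results to csv")
p <- add_argument(p, "--nsim", short = "-n", default = 20,
                  help = "number of simulations to run")
p <- add_argument(p, "--split_date", short = "-d", default = "2013-07-01",
                  help = "date splitting the training data from the simulated period")
p <- add_argument(p, "--input_path", short = "-i", default = "flights.csv",
                  help = "csv of flights to read")
# -o is reserved by argparser for its built-in --opts argument
p <- add_argument(p, "--output_path", short = "-p", default = "simulated_delays.csv",
                  help = "csv to write the simulated delays to")
args <- parse_args(p)

# The rest of the script reads args$input_path, calls
# dbloadss::simulate_departure_delays(), and writes to args$output_path.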
An example of what this script might look like is on GitHub. We can run it from the command line:
> Rscript blog/flight_delay.R -n 10 -d '2013-07-01' -i flights.csv
Loading required package: methods
Running delay simulations:
= FALSE
help = FALSE
verbose = FALSE
opts = NA
nsim = 10
split_date = 2013-07-01
input_path = flights.csv
output_path = simulated_delays.csv
Reading... Parsed with column specification:
cols(
.default = col_integer(),
carrier = col_character(),
tailnum = col_character(),
origin = col_character(),
dest = col_character(),
time_hour = col_datetime(format = "")
)
See spec(...) for full column specifications.
Read 336776 rows
Running simulations...
Writing 1706180 rows
Done.
When doing this from SSIS, it can call Rscript directly and the arguments can be SSIS variables.
The SSIS solution has some great advantages: it is controlled by the SQL developers, it has the greatest separation of technologies, and it's easy to test the R process in isolation. The downsides are that going via csv is unlikely to be the fastest option, and you need to be a little more careful about data types when reading the csv into R.
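On the data types point, being explicit when reading the csv back in, rather than relying on type guessing, avoids surprises. For example (a sketch with readr, mirroring the column specification printed above):

library(readr)

# Spell out the column types rather than letting read_csv guess from the first rows
flights_in <- read_csv(
  "flights.csv",
  col_types = cols(
    .default  = col_integer(),
    carrier   = col_character(),
    tailnum   = col_character(),
    origin    = col_character(),
    dest      = col_character(),
    time_hour = col_datetime(format = "")
  )
)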
The Results
If everything goes well the results of your delays simulation will land in the database every night, and every morning reports can be built and dashboards updated. The results look something like this:
SELECT TOP(5) * FROM flightdelays
| id | sim_id | dep_delay |
|---|---|---|
| 27005 | 1 | -27.48316 |
| 27006 | 1 | 46.50636 |
| 27007 | 1 | 65.94304 |
| 27008 | 1 | -78.93091 |
| 27009 | 1 | -17.86126 |
Your work is not totally done. There is one cost of getting the dashboards set up: the simulation results are not always the easiest to get your head around, so it helps if you can set the BI team up with a few queries just to get started. For example, to generate a daily average delay for the airline UA, they would need something like the following:
WITH gp AS (
SELECT sim_id
,origin
,day_date
,avg(fd.dep_delay) as mean_delay
FROM dbo.flightdelays fd
LEFT JOIN (SELECT *, convert(date, time_hour) as day_date FROM dbo.flights) fs on fd.id = fs.id
WHERE fs.carrier='UA'
GROUP BY sim_id, origin, day_date
)
SELECT day_date, origin
, avg(mean_delay) as mean_delay
, min(mean_delay) as min_delay
, max(mean_delay) as max_delay
FROM gp
GROUP BY origin, day_date
ORDER BY day_date, origin
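If it helps to sanity-check that query from the R side, the same two-stage aggregation might look roughly like this in dplyr (a sketch, assuming both tables have been pulled into data frames called flightdelays and flights that share the id key):

library(dplyr)

# Stage one: average delay per simulation, origin and day for UA flights
flights_days <- flights %>%
  transmute(id, origin, carrier, day_date = as.Date(time_hour))

per_sim <- flightdelays %>%
  left_join(flights_days, by = "id") %>%
  filter(carrier == "UA") %>%
  group_by(sim_id, origin, day_date) %>%
  summarise(mean_delay = mean(dep_delay))

# Stage two: summarise across simulations to get a range per origin and day
per_sim %>%
  group_by(origin, day_date) %>%
  summarise(
    min_delay  = min(mean_delay),
    max_delay  = max(mean_delay),
    mean_delay = mean(mean_delay)
  ) %>%
  arrange(day_date, origin)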
So first you aggregate over each simulation, then you aggregate across simulations. Your BI team's SQL is better than yours (and definitely mine), so this can be a useful way to get feedback on your code, and it's also a good way to explain to them what your model does (code speaks louder than words). Loading a table like this into a BI tool (for example Power BI), you can get all the plots you're used to.
It turns out the daily variation means you don't learn much from a plot of these daily averages. Maybe it's time to go and look at that manager's question about late flights after all.
Conclusion
After spending a bit of time delivering R models for business reporting, here are my main takeaways:
- Data Science does not exist in a vacuum. Think about how to get your results into the hands of decision-makers from the beginning. Think about their tools and workflow and work back to how you can deliver.
- Always serve your code from a package with as few exported functions as possible. Make it as simple as possible for a non-R (or Python) user to run the code. Document everything that goes in and out and defend your code like crazy from bad input. The rest is your domain.
- Push as much processing as possible back to the database. It gives maximum flexibility for the BI team to produce all the reports that they might be asked for.
- When it works, I really like the in-database R method. It gives the power to the database team and performs really well. Unfortunately, right now the issue with failover clusters has been a deal-breaker for me, so falling back to SSIS has worked just fine.