Converting Existing R Scripts to ORE – Getting Started
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Oracle R Enterprise provides a comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments. This message really resonates with our customers who are interested in executing R functions on database-resident data while seamlessly leveraging Oracle Database as a high-performance computing (HPC) environment. The ability to develop and operationalize R scripts for analytical applications in one step is quite appealing.
One frequently asked question is how to convert existing R code that access data in flat files or the database to use Oracle R Enterprise. In this blog post, we talk about a few scenarios and how to begin a conversion from existing R code to using Oracle R Enterprise.
Consider the following scenarios:
Scenario 1: A stand-alone R script that generates its own data and simply returns a result. Data is not obtained from the file system or database. This may result from performing simulations where dadta is dynamically generated, or perhaps access from a URL on the internet.
Scenario 2: An R script that loads data from a flat file such as a CSV file, performs some computations in R, and then writes the result back to a file.
Scenario 3: An R script that loads data from a database table, via one of the database connector packages like RODBC, RJDBC, or ROracle, and writes a result back to the database –using SQL statements or package functions.
Scenario 1
A stand-alone R script might normally be run on a user’s desktop, invoked as a cron job, or even via Java to spawn an R engine and retrieve the result, but we’d like to operationalize its execution as part of a database application, invoked from SQL. Here’s a simple script to illustrate the concept of converting such a script to be executed at the database server using ORE’s embedded R execution. The script generates a data.frame with some random columns, performs summary on that data and returns the summary statistics, which are represented as an R table.
# generate data
set.seed(1)
n <- 1000
df <- 3
x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))
# perform some analysis
res <- summary(x)
#return the result
res
To convert this to use ORE, create a function with appropriate arguments and body, for example:
myFunction1 <- function (n = 1000, df = 3,seed=1) {
set.seed(seed)
x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))
res <- summary(x)
res
}
Next, load the ORE packages and connect to Oracle Database using the ore.connect function. Using the all argument set to TRUE loads metadata for all the tables and views in that schema. We then store the function in the R script repository, invoking it via ore.doEval.
# load ORE packages and connect to Oracle Database
library(ORE)
ore.connect(“schema”,”sid”,”hostname”,”password”,port=1521, all=TRUE)
# load function into R script repository
ore.scriptDrop(“myFunction-1”)
ore.scriptCreate(“myFunction-1”, myFunction1)
# invoke using embedded R execution at the database server
ore.doEval(FUN.NAME=”myFunction-1″)
> ore.doEval(FUN.NAME="myFunction-1") a b c Min. : 1.0 Min. :-3.00805 Min. : 0.03449 1st Qu.: 250.8 1st Qu.:-0.69737 1st Qu.: 1.27386 Median : 500.5 Median :-0.03532 Median : 2.36454 Mean : 500.5 Mean :-0.01165 Mean : 3.07924 3rd Qu.: 750.2 3rd Qu.: 0.68843 3rd Qu.: 4.25994 Max. :1000.0 Max. : 3.81028 Max. :17.56720
Of course, we’re using default values here. To provide different arguments, change the invocation with arguments as follows:
ore.doEval(FUN.NAME=”myFunction-1″, n=500, df=5,
seed=2)
> ore.doEval(FUN.NAME="myFunction-1", n=500, df=5, seed=2) a b c Min. : 1.0 Min. :-2.72182 Min. : 0.1621 1st Qu.:125.8 1st Qu.:-0.65346 1st Qu.: 2.6144 Median :250.5 Median : 0.04392 Median : 4.4592 Mean :250.5 Mean : 0.06169 Mean : 5.0386 3rd Qu.:375.2 3rd Qu.: 0.79096 3rd Qu.: 6.8467 Max. :500.0 Max. : 2.88842 Max. :17.0367
Having successfully invoked this from the R client (my laptop), we can now invoke it from SQL. Here, we retrieve the summary result, which is an R table, as an XML string.
select *
from table(rqEval( NULL,’XML’,’myFunction-1′));
The result can be viewed from SQL Developer.
The following shows the XML output in a more structured manner.
What if we wanted to get the result to appear as a SQL table? Since the current result is an R table (an R object), we need to convert this to a data.frame to return it. We’ll make a few modifications to “myFunction-1” above. Most notably is the need to convert the table object in res to a data.frame. There are a variety of ways to do this.
myFunction2 <- function (n = 1000, df = 3,seed=1) {
# generate data
set.seed(seed)
x <- data.frame(a=1:n, b=rnorm(n), c=rchisq(n,df=df))
# perform some analysis
res <- summary(x)
# convert the table result to a data.frame
res.df <- as.matrix(res)
res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))
names(res.sum) <- c('a','b','c')
res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")
res.sum <- res.sum[,c(4,1:3)]
res.sum
}
# load function into R script repository
ore.scriptDrop(“myFunction-2”)
ore.scriptCreate(“myFunction-2”, myFunction2)
We’ll now modify the SQL statement to specify the format of the result.
select *
from table(rqEval( NULL,’select cast(”a” as VARCHAR2(12)) as “statname”,
1 “a”, 1 “b”, 1 “c” from dual ‘,’myFunction-2’));
Here’s the result as viewed from SQL Developer.
This type of result could be incorporated into any SQL application accepting table or view input from a SQL query. That is particular useful in combination with OBIEE dashboards via an RPD.
Scenario 2
If you’ve been loading data from a flat file, perhaps a CSV file, your R code may look like the following, where it specifies to builds a model and write hat model to a file for future use, perhaps in scoring. It also generates a graph of the clusters highlighting the individual points, colored by their cluster id, with the centroids indicated with a star.
# read data
setwd(“D:/datasets”)
dat <- read.csv("myDataFile.csv")
# build a clustering model
cl <- kmeans(x, 2)
# write model to file
save(cl, file=”myClusterModel.dat”)
# create a graph and write it to a file
pdf(“myGraphFile.pdf”)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex=2)
dev.off()
The resulting PDF file contains the following image.
To convert this script for use in ORE, there are several options. We’ll explore two: the first involving minimal change to use embedded R execution, and the second leveraging in-database techniques. First, we’ll want the data we used above in variable dat to be loaded into the database.
# create a row id to enable ordered results (if a key doesn’t already exist)
dat$ID <- 1:nrow(dat)
# remove the table if it exists
ore.drop(“MY_DATA”)
# create the table using the R data.frame, resulting in an ore.frame named MY_DATA
ore.create(dat,”MY_DATA”)
# assign the ID column as the row.names of the ore.frame
row.names(MY_DATA) <- MY_DATA$ID
In the first example, we’ll use embedded R execution and pass the data to the function via ore.tableApply. We’ll generate the graph, but simply display it within the function to allow embedded R execution to return the graph as a result. (Note we could also write the graph to a file in any directory accessible to the database server.) Instead of writing the model to a file, which requires keeping track of its location, as well as worring about backup and recovery, we store the model in the database R datastore using ore.save. All this requires minimal change. As above, we could store the function in the R script repository and invoke it by name – both from R and SQL. In this example, we simply provide the function itself as argument.
myClusterFunction1 <- function(x) {
cl <- kmeans(x, 2)
ore.save(cl, name=”myClusterModel”,overwrite=TRUE)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex=2)
TRUE
}
ore.tableApply(MY_DATA[,c(‘x’,’y’)], myClusterFunction1,
ore.connect=TRUE,ore.png.height=700,ore.png.width=700)
The ore.tableApply function projects the x and y columns of MY_DATA as input and also specifies ore.connect as TRUE since we are using the R datastore, which requires a database connection. Optionally, we can specify control arguments to the PNG output. In this example, these are the height and width of the image.
For the second example, we convert this to leverage the ORE Transparency Layer. We’ll use the in-database K-Means algorithm and save the model in a datastore named “myClusteringModel”, as we did above. Since ore.odmKMeans doesn’t automatically assign cluster ids (since the data may be very large or are not required), the scoring is done separately. Note, however, that the prediction results also exist in the database as an ore.frame. To ensure ordering, we also assign row.names to the ore.frame pred. Lastly, we create the plot. Coloring the nodes requires pulling the cluster assignments; however, the points themselves can be accessed from the ore.frame. The centroids points are obtained from cl$centers2 of the cluster model.
# build a clustering model in-database
cl <- ore.odmKMeans(~., MY_DATA, 2, auto.data.prep=FALSE)
# save model in database R datastore
ore.save(cl,name=”myClusterModel”,overwrite=TRUE)
# generate predictions to assign each row a cluster id, supplement with original data
pred <- predict(cl,MY_DATA,supp=c('x','y','ID'),type="class")
# assign row names to ensure ordering of results
row.names(pred) <- pred$ID
# create the graph
plot(pred[,c(‘x’,’y’)], col = ore.pull(pred$CLUSTER_ID))
points(cl$centers2[,c(‘x’,’y’)], col = c(2,3), pch = 8, cex=2)
We can also combine using the transparency layer within an embedded R function. But we’ll leave that as an exercise to the reader.
Scenario 3
In this last scenario, the data already exists in the database and one of the database interface packages, such as RODBC, RJDBC, and ROracle is be used to retrieve data from and write data to the database. We’ll illustrate this with ROracle, but the same holds for the other two packages.
# connect to the database
drv <- dbDriver("Oracle")
con <- dbConnect(drv, "mySchema", "myPassword")
# retrieve the data specifying a SQL query
dat <- dbGetQuery(con, 'select * from MY_RANDOM_DATA where "a" > 100′)
# perform some analysis
res <- summary(dat)
# convert the table result to a data.frame for output as table
res.df <- as.matrix(res)
res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))
names(res.sum) <- c('a','b','c')
res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")
res.sum <- res.sum[,c(4,1:3)]
res.sum
dbWriteTable(con, “SUMMARY_STATS”, res.sum)
Converting this to ORE is straightforward. We’re already connected to the database using ore.connect from previous scenarios, so the existing table MY_RANDOM_DATA was already loaded in the environment as an ore.frame. Executing ore.ls lists this table is the result, so we can just start using it.
> ore.ls(pattern=”MY_RAND”)
[1] “MY_RANDOM_DATA”
# no need to retrieve the data, use the transparency layer to compute summary
res <- with(MY_RANDOM_DATA , summary(MY_RANDOM_DATA[a > 100,]))
# convert the table result to a data.frame for output as table
res.df <- as.matrix(res)
res.sum <- as.data.frame(matrix(as.numeric(substr(res.df,9,20)),6,3))
names(res.sum) <- c('a','b','c')
res.sum$statname <- c("min","1stQ","median","mean","3rdQ","max")
res.sum <- res.sum[,c(4,1:3)]
# create the database table
ore.create(res.sum, “SUMMARY_STATS”)
SUMMARY_STATS
As we did in previous scenarios, this script can also be wrapped in a function and used in embedded R execution. This too is left as an exercise to the reader.
Summary
As you can see from the three scenarios discussed here, converting a script that accesses no external data, accesses and manipulates file data, or accesses and manipulates database data can be accomplished with a few strategic modifications. More involved scripts, of course, may require additional manipulation. For example, if the SQL query performs complex joins and filtering, along with derived column creation, the user may want to convert this SQL to the corresponding ORE Transparency Layer code, thereby eliminating reliance on SQL. But that’s a topic for another post.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.