
Parallelizing Data Analytics on Azure with the R Interface Tool


by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)

In data science, developing a model with optimal performance often involves exploratory experiments over different sets of hyper-parameters. Preliminary analysis on smaller data can be done on a single machine, while large-scale experiments that sweep multiple sets of parameters are better run on a cluster to speed up the computation. Such a scenario calls for scalable computation resources that are easy to manage. This blog post shares a walk-through of the R Interface tool, which

  • operates and manages Azure cloud instances directly from R using the AzureSMR package, and
  • executes scalable analytical jobs on deployed instances using customized Microsoft R Server computation contexts, which are easily specified in R using an "interface object".

The overall architecture of the Interface Tool is described in the graphic below. The Master script manages Azure instance deployment and operation, as well as the specification of how the analytics are executed. The Worker script contains the actual analytics, which are submitted to the Azure instances and run under the configured compute context.

[Figure: overall architecture of the Interface Tool]

A predictive maintenance example is presented to illustrate the efficacy of the interface tool. Predictive maintenance is widely employed in the airline and manufacturing industries to reduce operational costs. One of the problems in a predictive maintenance scenario is to diagnose whether equipment is in a healthy state by analyzing historical sensor measurements. Based on a data set of sensor measurements, a well-trained model is able to classify a machine as at “low risk” or “high risk” of failure. To obtain an optimal model for accurate prediction, the hyper-parameters should be swept over their domains of definition. Two critical hyper-parameters in the health status recognition need experimental analysis for optimal prediction:

  1. Length of historical data. The length of historical data is the period over which sensor measurements are used to characterize the health status of the machine. For the high risk group samples, the time series are taken near the end of the lifetime of a machine (as shown in the illustration below), while for the low risk group samples, the time series are taken near the beginning of the lifetime. The classification model can be subject to overfitting if the length of historical data is too small. However, enlarging the window too much also reduces the accuracy of the model, as a large window may overlap the data series of the two risk groups.
  2. Lagged feature window. To account for the time dependency of the time series data, the correlation of sensor values at lagged time points with the current time point is also taken into consideration. The strategy is to compute statistical characteristics (e.g., a rolling mean) of the historical data and aggregate them into the data at the current time point (see the sketch after this list). There are also trade-offs in the lagged feature window: if the window is too small, the progressive degradation of the features over time may not be captured; if it is too large, the dimensionality of the feature set for training grows, which can easily lead to poor performance.
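As a minimal sketch of the lag-feature strategy, a rolling mean over the lagged window can be computed as follows. The data frame sensors and the column names cycle and s1 are hypothetical, not taken from the original scripts.

library(zoo)
 
lfw <- 5  # lagged feature window (LFW), in cycles
 
# Rolling mean of the (hypothetical) sensor column s1, aligned to the
# right so that each value aggregates only the current and past cycles.
sensors <- sensors[order(sensors$cycle), ]
sensors$s1_rollmean <- rollapply(sensors$s1, width=lfw, FUN=mean,
                                 align="right", fill=NA)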

The graph below depicts the health status metric of a machine during its lifetime, partitioned into the low risk group and the high risk group.

[Figure: health status metric over a machine's lifetime, with the low risk window near the beginning and the high risk window near the end]
In our example, we create a Master script to manage Azure resources (virtual machines) and specify the computing environment. We launch five Data Science Virtual Machines (DSVMs) on the Azure platform to form a cluster, using the AzureSMR functions below to avoid the repetitive manual steps of DSVM deployment.

# Authenticate Azure account and start DSVMs in resource group under the subscription.
 
library(AzureSMR)
library(magrittr)
library(dplyr)
 
TID <- "88bf...011d"    # Tenant ID from app creation in Active Directory.
CID <- "10e3....d3d1"   # Client ID from app creation in Active Directory.
KEY <- "u/cc....53hg"   # Authentication key from app creation in Active Directory.
RG  <- "<resource-group-name>"  # Resource group containing the DSVMs.
 
sc <- createAzureContext(tenantID=TID, clientID=CID, authKey=KEY)
 
rg_list <- azureListRG(sc)
 
location <- 
  as.character(rg_list %>% filter(name == RG) %>% select(location)) %T>%
  print() 
 
vm_list <- 
  azureListVM(azureActiveContext=sc, resourceGroup=RG) %T>%
  print()
 
vm_names <- as.character(vm_list$name)
 
# Check the status of the VMs.
 
for (vm in vm_names) 
{
  vm_status <- azureVMStatus(azureActiveContext=sc, resourceGroup=RG, vmName=vm)
  print(paste(vm, vm_status, sep=", "))
}
 
# Switch on the VMs in the resource group.
 
for (vm in vm_names) 
{
  azureStartVM(azureActiveContext=sc, resourceGroup=RG, vmName=vm)
}
vm_dns_list <- paste0(vm_names, ".", location, ".cloudapp.azure.com")

Next, an Interface object is created in the Master script to specify and configure the computing context. Here the “clusterParallel” computing context is used to parallelize the analytical work across the cluster formed by the five running DSVM nodes.

index <- sample(x=seq_along(vm_dns_list), 1)

MACHINES     <- vm_names
MACHINES_URL <- vm_dns_list
MASTER_URL   <- vm_dns_list[index]
SLAVES_URL   <- vm_dns_list[-index]
 
source("rInterfaceObject.R")
 
# Create a new interface.
 
interface <- new("rInterface")
 
# Set the interface.
 
interface <- riSet(object=interface,
                   remote=MASTER_URL,
                   user="<user-name>"
)
 
# Configure the interface.
 
interface <- riConfig(object=interface, 
                      machine_list=MACHINES, 
                      data="<reference-to-data-source",
                      dns_list=MACHINES_URL, 
                      machine_user="<user-name>",
                      master=MASTER_URL, 
                      slaves=SLAVES_URL, 
                      context="clusterParallel")

A Worker script which contains the analytic logic is now created. The computation performs a health status prediction given the sensor measurements from a machine. Note that once the interface object is configured, the code that specifies the compute context in the Worker script is added automatically. (The contents of the Worker script are not shown here but are included in the GitHub repository.)
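For intuition, the auto-inserted compute-context code might resemble the following. This is a plausible sketch only, assuming the “clusterParallel” context is built on a PSOCK cluster with the doParallel backend; the code actually generated is defined by the interface object in the repository.

library(parallel)     # base R socket clusters
library(doParallel)   # foreach backend over the cluster
library(RevoScaleR)   # available with Microsoft R Server
 
# Build a socket cluster over the DSVM nodes and register it so that
# parallel loops are distributed across the machines.
cl <- makePSOCKcluster(names=MACHINES_URL, user="<user-name>")
registerDoParallel(cl)
 
# Dispatch RevoScaleR (rx*) computations to the registered backend.
rxSetComputeContext(RxForeachDoPar())

Back in the Master script, the Worker script file is created and attached to the interface as follows.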

# Create a new worker script.
 
script_path  <- "<path-to-worker-script>"
script_title <- "<name-of-worker-script>"
 
if (!file.exists(file.path(script_path, script_title))) {
  riNewScript(path=script_path, title=script_title)
}
 
# Attach the script to the interface object, and update the new worker script with the configuration information.
 
interface@script <- file.path(script_path, script_title)
riScript(interface)
file.edit(interface@script)

The engines in the data set (NASA Turbo Fan Engine Sensor Data Set) are labelled as “high risk” or “low risk”. The length of historical data is selected from two different values to investigate its impact; similarly, the lagged feature window is selected from two values. The problem can be regarded as binary classification, and a boosted tree algorithm is used to train a model which is then evaluated on a testing data set. The Worker script is run on the head node of the DSVM cluster, with the computation context specified in the Master script. The computation context is set to “clusterParallel”, which leverages the power of 4 DSVM nodes for the computation. A sketch of the kind of parameter sweep the Worker script performs follows the riExecute call below.

result <- riExecute(object=interface,
                    roptions="--verbose",
                    verbose=TRUE)
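Inside the Worker script, the hyper-parameter sweep and model training might look like the following minimal sketch. The helper build_features(), the data frames raw_train and raw_test, and the feature and label names are all hypothetical; the actual Worker script is in the repository.

library(foreach)
library(RevoScaleR)
 
# All (LHD, LFW) combinations to evaluate.
params <- expand.grid(lhd=c(40, 50), lfw=c(3, 5))
 
# Each combination is evaluated in parallel across the cluster,
# using the foreach backend registered by the compute context.
results <- foreach(i=seq_len(nrow(params)), .combine=rbind) %dopar% {
  # build_features() is a hypothetical helper that windows the raw
  # sensor series according to the current (LHD, LFW) combination.
  train_data <- build_features(raw_train, lhd=params$lhd[i], lfw=params$lfw[i])
  test_data  <- build_features(raw_test,  lhd=params$lhd[i], lfw=params$lfw[i])
 
  # Train a boosted tree classifier with RevoScaleR; the feature
  # names here are placeholders.
  model <- rxBTrees(label ~ s1_rollmean + s2_rollmean, data=train_data,
                    lossFunction="bernoulli")
 
  # Predicted probabilities; rxPredict names the column label_Pred.
  pred <- rxPredict(model, data=test_data)
 
  c(lhd=params$lhd[i], lfw=params$lfw[i],
    accuracy=mean((pred$label_Pred > 0.5) == test_data$label))
}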

The model performance is evaluated on four metrics: accuracy, precision, recall, and F-1 score. The parameter “length of historical data” (LHD) is selected from 40 and 50 cycles, while the lagged feature window (LFW) is 3 or 5 cycles. Two sets of experiments are performed: the first sweeps only LHD, and the second sweeps both LHD and LFW. The results follow.

Experiment set 1 (sweep LHD only), elapsed time 2.6 min:

(LHD, LFW)    Accuracy/Precision/Recall/F-1
(40, 3)       0.992/0.984/0.999/0.992
(50, 3)       0.969/0.966/0.972/0.969

Experiment set 2 (sweep both LHD and LFW), elapsed time 2.8 min:

(LHD, LFW)    Accuracy/Precision/Recall/F-1
(40, 3)       0.990/0.981/0.999/0.990
(40, 5)       0.990/0.984/0.997/0.990
(50, 3)       0.982/0.976/0.988/0.982
(50, 5)       0.991/0.984/0.998/0.991
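For reference, the four metrics can be derived from the binary confusion matrix. A minimal sketch in R, with hypothetical 0/1 vectors pred and actual:

# Confusion matrix counts from (hypothetical) 0/1 vectors.
tp <- sum(pred == 1 & actual == 1)
tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1)
 
accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)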

After execution, the optimal combination of parameters can be selected based on the model performance. Note that experiment set 2, which sweeps both LHD and LFW, induces only a small increase in elapsed time, owing to the computational power of the DSVM cluster. After the analytics are completed, the DSVMs can be powered down to reduce running costs while they are not required.

# Switch off the VMs in the resource group.
 
for (vm in vm_names) 
{
  azureStopVM(azureActiveContext=sc, resourceGroup=RG, vmName=vm)
}

Using the interface tool presented here can significantly reduce the time spent manually configuring DSVMs in a GUI portal and specifying the compute environment for executing the tasks. Because the tool keeps everything within an R session, it can easily be integrated into a development pipeline for scalable analytics.

The walk-through presented in this post can be found in our GitHub repository at https://github.com/yueguoguo/Azure-R-Interface. It includes the DSVM deployment script, ARM template, Interface object definition, and the Master and Worker scripts for the health status diagnosis use case. For any queries, please reach out to the authors, Le Zhang (zhle@microsoft.com) and Graham Williams (graham.williams@microsoft.com).
