
AzureDSVM: a new R package for elastic use of the Azure Data Science Virtual Machine

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)

The Azure Data Science Virtual Machine (DSVM) is a curated VM which provides commonly-used tools and software for data science and machine learning, pre-installed. AzureDSVM is a new R package that enables seamless interaction with the DSVM from a local R session, by providing functions for the following tasks:

  1. Deployment, deallocation, deletion of one or multiple DSVMs;
  2. Remote execution of local R scripts: compute contexts available in Microsoft R Server can be enabled for more efficient computation on either a single DSVM or a cluster of DSVMs;
  3. Retrieval of consumption data and the total cost of using the DSVM(s).

AzureDSVM is built upon the AzureSMR package and depends on the same set of R packages, such as httr and jsonlite. It requires the same initial setup in Azure Active Directory (for authentication).

To install AzureDSVM with the devtools package:

library(devtools)
devtools::install_github("Azure/AzureDSVM")
library("AzureDSVM")
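
For a script that may run on a machine where devtools is not yet present, the install step can be guarded. The following is a minimal sketch only: the helper names is_installed and install_azuredsvm are illustrative and are not part of devtools or AzureDSVM.

```r
# Hypothetical helpers for this sketch (not part of AzureDSVM or devtools).

# TRUE if the package can be loaded from the current library paths.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Install devtools if needed, then install AzureDSVM from GitHub.
install_azuredsvm <- function() {
  if (!is_installed("devtools")) {
    install.packages("devtools")
  }
  if (!is_installed("AzureDSVM")) {
    devtools::install_github("Azure/AzureDSVM")
  }
  invisible(is_installed("AzureDSVM"))
}

# Install only in an interactive session, so that sourcing this file
# from a batch job has no side effects.
if (interactive()) {
  install_azuredsvm()
  library("AzureDSVM")
}
```

Wrapping the install in a function keeps it repeatable: running it a second time skips whatever is already installed.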

When deploying a Data Science Virtual Machine, the machine name, size, OS type, etc. must be specified. AzureDSVM supports DSVMs on Ubuntu, CentOS, Windows, and Windows with the Deep Learning Toolkit (on GPU-class instances). For example, the following code fires up a D4 v2 Ubuntu DSVM located in Southeast Asia:

deployDSVM(context, 
           resource.group="example",
           location="southeastasia",
           size="Standard_D4_v2",
           os="Ubuntu",
           hostname="mydsvm",
           username="myname",
           pubkey="pubkey")

where context is an azureActiveContext object, created by the AzureSMR::createAzureContext() function, that encapsulates the credentials (Tenant ID, Client ID, etc.) used for Azure authentication.
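
One way to build the context without hard-coding secrets in a script is to read the credentials from environment variables. This is a sketch under stated assumptions: the variable names AZURE_TENANT_ID, AZURE_CLIENT_ID and AZURE_AUTH_KEY and the helper read_azure_credentials are made up for illustration, and the tenantID, clientID and authKey argument names follow the AzureSMR documentation as we understand it, so check them against the installed version.

```r
# Hypothetical helper: collect Azure AD credentials from environment variables.
# The environment variable names below are assumptions made for this sketch.
read_azure_credentials <- function() {
  vars <- c(tenantID = "AZURE_TENANT_ID",
            clientID = "AZURE_CLIENT_ID",
            authKey  = "AZURE_AUTH_KEY")
  creds <- Sys.getenv(vars, unset = NA)
  names(creds) <- names(vars)
  # Treat unset and empty variables as absent.
  absent <- vars[is.na(creds) | !nzchar(creds)]
  if (length(absent) > 0) {
    stop("Missing environment variable(s): ", paste(absent, collapse = ", "))
  }
  as.list(creds)
}

# Create the context only in an interactive session with AzureSMR installed.
if (interactive() && requireNamespace("AzureSMR", quietly = TRUE)) {
  creds <- read_azure_credentials()
  context <- AzureSMR::createAzureContext(tenantID = creds$tenantID,
                                          clientID = creds$clientID,
                                          authKey  = creds$authKey)
}
```

Failing early with the list of missing variables tends to be easier to debug than an authentication error returned later by Azure.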

In addition to launching a single DSVM, the AzureDSVM package makes it easy to launch a cluster with multiple virtual machines. Multi-deployment supports:

  1. creating a collection of independent DSVMs which can be distributed to a group of data scientists for collaborative projects, as well as
  2. clustering a set of connected DSVMs for high-performance computation.

To create a cluster of 5 Ubuntu DSVMs with the default VM size, where RG is a variable holding the name of the resource group, use:

cluster <- deployDSVMCluster(context,
                             resource.group=RG,
                             location="southeastasia",
                             hostnames="mydsvm",
                             usernames="myname",
                             pubkeys="pubkey",
                             count=5)

To execute a local script on a remote cluster of DSVMs with a specified Microsoft R Server compute context, use the executeScript function. (NOTE: only Linux-based DSVM instances are supported at the moment, because remote execution is carried out over SSH. Microsoft R Server 9.x allows remote interaction for both Linux and Windows; see the Microsoft R Server documentation for details.) Here, the compute.context option is set to "clusterParallel", which requests the RxForeachDoPar context:

executeScript(context,
              resource.group="example",
              machines="dsvm_names_in_the_cluster",
              remote="fqdn_of_dsvm_used_as_master",
              user="myname",
              script="path_to_the_script_for_remote_execution",
              master="fqdn_of_dsvm_used_as_master",
              slaves="fqdns_of_dsvms_used_as_slaves",
              compute.context="clusterParallel")
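
The placeholder arguments above can be derived from a vector of machine names. The sketch below is illustrative only: the helper cluster_roles and the node names are hypothetical, and it assumes each DSVM is reachable at the usual Azure DNS pattern <name>.<location>.cloudapp.azure.com, which should be checked against the names the deployment actually produced.

```r
# Hypothetical helper: choose the first machine as master, the rest as slaves.
# Assumes FQDNs of the form <name>.<location>.cloudapp.azure.com.
cluster_roles <- function(names, location) {
  stopifnot(is.character(names), length(names) >= 2)
  fqdns <- paste(names, location, "cloudapp.azure.com", sep = ".")
  list(machines = names,
       master   = fqdns[1],
       slaves   = fqdns[-1])
}

# Illustrative machine names; substitute the names of the deployed DSVMs.
roles <- cluster_roles(c("node1", "node2", "node3"), "southeastasia")

# The result can then be passed on, for example:
# executeScript(context,
#               resource.group="example",
#               machines=roles$machines,
#               remote=roles$master,
#               user="myname",
#               script="path_to_the_script_for_remote_execution",
#               master=roles$master,
#               slaves=roles$slaves,
#               compute.context="clusterParallel")
```

Deriving master and slaves from one vector keeps the three arguments in step when machines are added to or removed from the cluster.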

Consumption data and the cost incurred by DSVMs can be retrieved with:

consum <- expenseCalculator(context,
                            instance="mydsvm",
                            time.start="time_stamp_of_starting_point",
                            time.end="time_stamp_of_ending_point",
                            granularity="Daily",
                            currency="USD",
                            locale="en-US",
                            offerId="offer_id_of_azure_subscription",
                            region="southeastasia")

print(consum)
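
The two time stamps can be generated rather than typed by hand. This is a sketch under an assumption: the "YYYY-MM-DD HH:MM:SS" string format used here is a guess at what time.start and time.end accept, so consult the package documentation for the exact format. The helper billing_window is hypothetical.

```r
# Hypothetical helper: time stamps covering the `days` days before `today`,
# aligned to midnight to match a "Daily" granularity.
# The output format is an assumption made for this sketch.
billing_window <- function(days, today = Sys.Date()) {
  stopifnot(is.numeric(days), days > 0)
  start <- today - days
  list(time.start = format(start, "%Y-%m-%d 00:00:00"),
       time.end   = format(today, "%Y-%m-%d 00:00:00"))
}

# A fixed date is used here so that the result is reproducible.
window <- billing_window(7, today = as.Date("2017-06-08"))
print(window)
```

The resulting window$time.start and window$time.end could then be supplied in place of the placeholder strings.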

Detailed introductions and tutorials can be found in the AzureDSVM GitHub repository, linked below.

GitHub (Azure): AzureDSVM
