Using R in a High Performance Computing environment
In a common R workflow one works on a desktop machine or a laptop. This PC environment is convenient because you can focus mainly on coding, but sometimes a program takes a long time to run (more than an hour, say) and the same simulation needs to be repeated many times. In some cases, the program can also eat up all the available memory of the PC. In a PC environment, tools such as Task Manager (Windows), Activity Monitor (Mac), and top/htop (Linux) help you monitor resource usage.
High Performance Computing (HPC) centers offer the possibility of increasing the resources (memory and CPU power) your program can use. If you decide to move your workflow to an HPC environment, you will need to learn how to work with it in order to take full advantage of the resources provided. In this post, I share some recommendations that we give to our users at the High Performance Computing Center North (HPC2N) but that can be applied to other centers as well.
One important aspect that, in my experience, tends to create issues when moving to HPC is the terminology. Common HPC terms such as cores, CPUs, nodes, shared memory, and distributed memory computing, among others, are covered in an R for HPC course that we delivered previously in collaboration with the Parallelldatorcentrum (PDC) in Stockholm.
In an HPC environment, you allocate resources (cores and memory) for running an R program. On a PC this step is mostly hidden from the user: under the hood, R assumes that all resources of the machine are available and will try to use them. In HPC this step is explicit (through batch scripts or a web portal such as Open OnDemand), so you need to decide consciously, and efficiently, how many cores and how much memory your R program will use. For instance, if you request 10 cores and 20 GB of RAM but your application is serial (not parallelized) and uses less than 1 GB, 9 cores will sit idle during the simulation. That said, such a setup can be acceptable if your application needs more memory than a single core provides. Also take into account that most HPC centers work in a project-based manner with some possible cost (monetary, or in job priority, for instance).
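As a minimal sketch (my own example, assuming a SLURM-based center such as HPC2N), you can read the allocation from the job's environment inside R instead of assuming that the whole node is yours; SLURM_CPUS_PER_TASK is a standard SLURM variable, and the fallback value of 1 is an assumption for the illustration:
library(parallel)
n_allocated <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1")) # cores granted to this job
n_node <- detectCores() # cores physically present on the node, usually more than your share
n_workers <- min(n_allocated, n_node, na.rm = TRUE) # number of workers to create later on
n_workers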
Some R packages that make use of linear algebra libraries, such as BLAS and LAPACK, can automatically trigger the use of several threads. One way to explicitly control the number of threads is with the RhpcBLASctl package:
library(RhpcBLASctl)
blas_set_num_threads(8) #set the number of threads to 8
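As a hedged follow-up (not part of the original snippet), the same package can also report what it detects, which is useful for checking that the setting took effect; the exact numbers depend on the BLAS library your R installation is linked against:
library(RhpcBLASctl)
blas_get_num_procs() # threads the BLAS library is currently set to use
get_num_cores() # physical cores detected on the machine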
In some packages, a parallelization layer has been introduced through a backend (such as the parallel package), for instance for heavy routines like bootstrapping (the boot package). Other packages opted for a threaded mechanism; for clustering, for instance, there is the clusternor package. Examples of the usage of these packages can be found here.
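As a quick illustration (my own sketch, not taken from the linked examples), the boot package exposes this backend through its parallel and ncpus arguments; the statistic and the data below are made up for the example:
library(boot)
mean_stat <- function(data, idx) mean(data[idx]) # statistic computed on each bootstrap resample
set.seed(1)
x <- rnorm(1e4) # made-up data for the illustration
# "multicore" forks processes on Linux/macOS; on Windows use parallel = "snow" instead
b <- boot(x, statistic = mean_stat, R = 5000, parallel = "multicore", ncpus = 4)
b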
In the cases mentioned above, someone else did the job of parallelizing the application for us and we only need to set the number of threads or workers. If we are the R developers who want to turn a serial program into a parallel one, we will most likely need to refactor the code and change our programming paradigm. It is important to mention that not all parts of a program are suitable for parallelization, and there can be parts that, although parallelizable, do not show a significant speedup (the ratio of the simulation time with 1 core to the time with N cores). Thus, an important first step is to analyze the code (profiling) by timing its parts and locating the bottlenecks that are worth parallelizing.
In the following serial (and unoptimized) code, I compute the 2D integral of the sine function between 0 and π in both the x and y ranges:
∫₀^π ∫₀^π sin(x+y) dx dy = 0
integral <- function(N){
  # Function for computing a 2D sine integral
  h <- pi/N # Size of grid
  mySum <- 0 # Camel convention for variables' names
  for (i in 1:N) { # Discretization in the x direction
    x <- h*(i-0.5) # x coordinate of the grid cell
    for (j in 1:N) { # Discretization in the y direction
      y <- h*(j-0.5) # y coordinate of the grid cell
      mySum <- mySum + sin(x+y) # Computing the integral
    }
  }
  return(mySum*h*h)
}
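Before parallelizing anything, it pays off to time this serial version and see where the time goes; here is a minimal sketch with base R tools (N = 2000 is just an example value):
N <- 2000 # example problem size
system.time(res <- integral(N)) # wall-clock and CPU time of one serial run
Rprof("integral_profile.out") # start the built-in sampling profiler
invisible(integral(N))
Rprof(NULL) # stop profiling
summaryRprof("integral_profile.out")$by.self # where the time was actually spent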
One way to parallelize this code is to divide the workload (the for loop in the x direction) evenly among a number of workers. Here I will use the foreach function from the foreach package together with the doParallel backend, which allows tasks to run in parallel. Once I have decided which part of the code to parallelize (the x integration) and which tools to use (foreach), I can refactor the original code. One possible parallel version is:
integral_parallel <- function(N,i){
  # Parallel function: computes the partial sum of the 2D sine integral
  # for one slice in the x direction
  h <- pi/N # Size of grid (computed locally so workers do not depend on a global)
  myPartialSum <- 0.0
  x <- h*(i-0.5) # x coordinate of the grid cell
  for (j in 1:N) { # Discretization in the y direction
    y <- h*(j-0.5) # y coordinate of the grid cell
    myPartialSum <- myPartialSum + sin(x+y) # Computing the integral
  }
  return(myPartialSum)
}
Notice that I changed the original programming paradigm: the function now computes only a partial sum on each worker. The total value is known only after all workers have finished their tasks and their results have been combined at the end. The doParallel package requires the initialization of a cluster, and the foreach call needs the %dopar% operator to run the tasks in parallel:
library(doParallel)
N <- 2000 # Number of grid points per direction (example value)
M <- 4 # Number of workers (example value)
h <- pi/N # Size of grid, as in the serial version
cl <- makeCluster(M) # Create the cluster with M workers
registerDoParallel(cl)
r <- foreach(i=1:N, .combine = 'c') %dopar% integral_parallel(N,i)
stopCluster(cl)
integral <- sum(r)*h*h # Combine the partial sums into the final result
integral
The complete example can be found here.
A common mistake HPC users make is to reuse batch scripts from other centers, assuming that the SLURM or PBS job schedulers behave identically everywhere. Although that is true for the standard features, system administrators at one center may have enabled options that are not available, or that behave slightly differently, at another.
One recommendation is to use the HPC tools available at your center to monitor the resources used by a simulation. If you have access to the compute nodes, the most straightforward way to obtain this information is with the top/htop commands. Otherwise, tools such as Grafana or Ganglia are handy if they are available at your center.
Additional resources:
- R in HPC course offered by HPC2N/PDC