Parallel Computing Exercises: Snow and Rmpi (Part-3)
The foreach statement, which was introduced in the previous set of exercises of this series, can work with various parallel backends. This set provides practice in working with the backends supplied by the snow and Rmpi packages (on a single machine with multiple CPUs). The name of the former package stands for "Simple Network of Workstations". It can employ various parallelization techniques; socket clustering is used here. The latter package is an R wrapper for MPI (the Message-Passing Interface), which is another parallelization technique.
The set also demonstrates that inter-process communication overhead has to be taken into account when preparing to use parallelization. If short tasks are run in parallel, the overhead can offset the performance gains from using multiple CPUs, and in some cases execution can even become slower. For parallelization to be useful, the tasks that are run in parallel have to be long enough.
The exercises are based on an example of using bootstrapping to estimate the sampling distribution of linear regression coefficients. The regression is run multiple times on different sets of data derived from an original sample. The size of each derived data set is equal to the size of the original sample, which is possible because the sets are produced by random sampling with replacement. The original sample is taken from the InstEval data set, which comes with the lme4 package and represents lecture/instructor evaluations by students at ETH Zurich. The estimated distribution is not analyzed in the exercises.
The exercises require the packages foreach, snow, doSNOW, Rmpi, and doMPI to be installed.
IMPORTANT NOTE: the Rmpi package depends on MPI software, which has to be installed on the machine separately. The software can be one of the following:
- Windows: either the Microsoft MPI or the Open MPI library (the former can be installed as an ordinary application).
- OS X/macOS: the Open MPI library (available through Homebrew).
- Linux: the Open MPI library (look for packages named libopenmpi (or openmpi, lib64openmpi, or similar), as well as libopenmpi-dev (or libopenmpi-devel, or similar) in your distribution's repository).
The zipped data set can be downloaded here. For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.
Exercise 1
Load the data set, and assign it to the data_large variable.
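A minimal sketch of a possible answer; the file name inside the downloaded zip is an assumption here, so adjust it to match the actual file:

```r
# Hypothetical file name -- replace with the actual name of the unzipped file.
data_large <- read.csv("data_large.csv")
```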
Exercise 2
Create a smaller data set that will be used to compare how parallel computing performance depends on the size of the task. Use the sample function to obtain a random subset from the loaded data. Its size has to be 10% of the size of the original data set (in terms of rows). Assign the subset to the data_small variable.
For reproducibility, set the seed to 1234.
Print the number of rows in the data_large and data_small data sets.
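One possible implementation (a sketch, assuming data_large from Exercise 1):

```r
set.seed(1234)  # for reproducibility

# Draw 10% of the rows without replacement
idx <- sample(nrow(data_large), size = round(0.1 * nrow(data_large)))
data_small <- data_large[idx, ]

nrow(data_large)
nrow(data_small)
```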
Exercise 3
Write a function that will be used as a task in parallel computing. The function has to take a data set as an input, and do the following (a sketch follows the list):
- Resample the data, i.e. obtain a new sample of data based on the input data set. The number of rows in the new sample has to be equal to the one in the input data set (use the sample function as in the previous exercise, but change parameters to allow for resampling with replacement).
- Run a linear regression on the resampled data. Use y as the dependent variable, and the others as independent variables (this can be done by using the formula y ~ . as an argument to the lm function).
- Return a vector of coefficients of the linear regression.
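A minimal sketch of such a function (the name boot_task is an assumption, used again in the later sketches):

```r
# One bootstrap replicate: resample the rows with replacement,
# fit the regression, and return the estimated coefficients.
boot_task <- function(dat) {
  idx <- sample(nrow(dat), size = nrow(dat), replace = TRUE)
  fit <- lm(y ~ ., data = dat[idx, ])
  coef(fit)
}
```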
Exercise 4
Let's test how much time it takes to run the task multiple times sequentially (not in parallel). Use the foreach statement with the %do% operator (as discussed in the previous set of exercises of this series) to run it:
- 10 times with the data_large data set, and
- 100 times with the data_small data set.
Use the rbind function as an argument to foreach to combine the results.
In both cases, measure how much time is spent on execution of the task (with the system.time function). Theoretically, the length of time spent should be roughly the same because the total number of rows processed is equal (100,000 rows: 10,000 rows 10 times in the first case, and 1,000 rows 100 times in the second case), and the row length is the same. But is this the case in practice?
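A sketch of the sequential timing, using boot_task from the previous exercise:

```r
library(foreach)

# 10 sequential runs on the large data set
time_seq_large <- system.time(
  coefs_large <- foreach(i = 1:10, .combine = rbind) %do% boot_task(data_large)
)

# 100 sequential runs on the small data set
time_seq_small <- system.time(
  coefs_small <- foreach(i = 1:100, .combine = rbind) %do% boot_task(data_small)
)

time_seq_large
time_seq_small
```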
Exercise 5
Now we'll prepare to run the task in parallel using 2 CPU cores. First, we'll use a parallel computing backend for the foreach statement from the snow package. This requires two steps (a sketch follows the list):
- Make a cluster for parallel execution using the makeCluster function from the snow package. Pass two arguments to this function: the size of the cluster (i.e. the number of CPU cores that will be used in computations), and the type of the cluster ("SOCK" in this case).
- Register the cluster with the registerDoSNOW function from the doSNOW package (which provides a foreach parallel adapter for the snow package).
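For example:

```r
library(snow)
library(doSNOW)

cl <- makeCluster(2, type = "SOCK")  # socket cluster with 2 workers
registerDoSNOW(cl)                   # register it as the foreach backend
```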
Exercise 6
Run the task 10 times with the large data set in parallel, using the foreach statement with the %dopar% operator (as discussed in the previous set of exercises of this series). Measure the time spent on execution with the system.time function.
When done, use the stopCluster function from the snow package to stop the cluster.
Is the execution time smaller compared to the one measured in Exercise 4?
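A possible answer (a sketch, reusing the cluster registered in Exercise 5):

```r
# 10 parallel runs on the large data set
time_par_large <- system.time(
  coefs_par_large <- foreach(i = 1:10, .combine = rbind) %dopar% boot_task(data_large)
)
time_par_large

stopCluster(cl)  # release the workers
```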
Exercise 7
Repeat the steps listed in Exercise 5 and Exercise 6 to run the task 100 times using the small data set.
What is the change in the execution time?
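The same cycle for the small data set might look as follows:

```r
cl <- makeCluster(2, type = "SOCK")
registerDoSNOW(cl)

# 100 parallel runs on the small data set
time_par_small <- system.time(
  coefs_par_small <- foreach(i = 1:100, .combine = rbind) %dopar% boot_task(data_small)
)
time_par_small

stopCluster(cl)
```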
Exercise 8
Next, we'll use another parallel backend for the foreach function: the one provided by the Rmpi package (R's wrapper for the Message-Passing Interface), accessible through an adapter from the doMPI package. From the user's perspective, it differs from the snow-based backend in the following ways:
- as mentioned above, additional software has to be installed for this backend to work: either (a) the Open MPI library, available for Windows, macOS, and Linux, or (b) the Microsoft MPI library, which is available for Windows;
- when an mpi cluster is created, it immediately starts using CPUs as much as it can;
- when the work is complete, the mpi execution environment has to be terminated; once terminated, it can't be relaunched without restarting the R session (if you try to create an mpi cluster after the environment was terminated, the session will be aborted, which may result in a loss of data; see Exercise 10 for more details).
In this exercise, we'll create an mpi execution environment to run the task using 2 CPU cores. This requires actions similar to the ones performed in Exercise 5 (a sketch follows the list):
- Make a cluster for parallel execution using the startMPIcluster function from the doMPI package. This function can take just one argument, which is the number of CPU cores to be used in computations.
- Register the cluster with the registerDoMPI function from the doMPI package.
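For example:

```r
library(doMPI)  # loads Rmpi; requires an MPI installation (see the note above)

cl_mpi <- startMPIcluster(count = 2)  # start 2 MPI workers
registerDoMPI(cl_mpi)                 # register them as the foreach backend
```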
After creating a cluster, you may check whether the CPU usage on your machine has increased using Resource Monitor (Windows), Activity Monitor (macOS), the top or htop commands (Linux), or other tools.
Exercise 9
Stop the cluster created in the previous exercise with the closeCluster command from the doMPI package. The CPU usage should fall immediately.
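That is:

```r
closeCluster(cl_mpi)  # stop the MPI workers; CPU usage should drop at once
```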
Exercise 10
Create an mpi cluster again, and use it as a backend for the foreach statement to run the task defined above:
- 10 times with the data_large data set, and
- 100 times with the data_small data set.
In both cases, start a cluster before running the task, and stop it afterwards. Measure how much time is spent on execution of the task. How does the time compare to the execution time with the snow cluster (found in Exercises 6 and 7)?
When done working with the clusters, terminate the mpi execution environment with the mpi.finalize function. Note that this function always returns 1.
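A sketch of the whole sequence (assuming boot_task and the data sets from the earlier exercises):

```r
# Large data set: fresh cluster, 10 parallel runs, stop the cluster
cl_mpi <- startMPIcluster(count = 2)
registerDoMPI(cl_mpi)
time_mpi_large <- system.time(
  coefs_mpi_large <- foreach(i = 1:10, .combine = rbind) %dopar% boot_task(data_large)
)
closeCluster(cl_mpi)

# Small data set: the same cycle with 100 runs
cl_mpi <- startMPIcluster(count = 2)
registerDoMPI(cl_mpi)
time_mpi_small <- system.time(
  coefs_mpi_small <- foreach(i = 1:100, .combine = rbind) %dopar% boot_task(data_small)
)
closeCluster(cl_mpi)

time_mpi_large
time_mpi_small

# Terminate the MPI environment (always returns 1); after this,
# do not create another mpi cluster in the same R session.
mpi.finalize()
```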
Important! As mentioned above, if you intend to create an mpi cluster again after the environment was terminated, you have to restart the R session; otherwise the current session will be aborted, which may result in a loss of data. In RStudio, an R session can be relaunched from the Session menu (relaunching the session this way does not affect the data; you'll only need to reload libraries). In other cases, you may have to quit and restart R.