R has many tools for speeding up computations by making use of multiple CPU cores, either on one computer or on multiple machines. This series of exercises introduces the basic techniques for implementing parallel computations using multiple CPU cores on one machine.
The first step in preparing to parallelize computations is to decide whether the task can and should be run in parallel. Some tasks involve sequential computation, where operations in one round depend on the results of the previous round. Such computations cannot be parallelized. The next question is whether parallel computation is worth the effort. On the one hand, running tasks in parallel may reduce the computing time spent on calculations. On the other hand, it takes additional time to write code that can be run in parallel, and to check whether it yields correct results.
Code that implements parallel computations basically does three things:
- splits the task into pieces,
- runs them in parallel, and
- combines the results.
This set of exercises lets you practice using the `snowfall` package to perform parallel computations. The set is based on the example of parallelizing the k-means algorithm, which splits data into clusters (i.e. splits data points into groups based on their similarity). The standard k-means algorithm is sensitive to the choice of initial points, so it is advisable to run the algorithm multiple times with different initial points to get the best result. It is assumed that your computer has two or more CPU cores.
The data for the exercises can be downloaded here.
For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.
Exercise 1
Use the `detectCores` function from the `parallel` package to find the number of physical CPU cores on your computer. Then change the arguments of the function to find the number of logical CPU cores.
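A minimal sketch of the two calls, assuming only the `parallel` package that ships with R. Note that `detectCores` counts logical cores by default (`logical = TRUE`), and that, according to its documentation, `logical = FALSE` is honoured only on some platforms, so the two numbers may coincide or the physical count may come back as `NA`.

```r
library(parallel)

# Physical cores; on platforms where this setting is not supported the
# result may equal the logical count or be NA
n_physical <- detectCores(logical = FALSE)

# Logical cores (this is also what the default call returns)
n_logical <- detectCores(logical = TRUE)

print(c(physical = n_physical, logical = n_logical))
```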
Exercise 2
Load the data set, and assign it to the `df` variable.
Exercise 3
Use the `system.time` function to measure the time spent on execution of the command `fit_30 <- kmeans(df, centers = 3, nstart = 30)`, which finds three clusters in the data.
Note that this command runs the k-means algorithm 30 times sequentially with different (randomly chosen) initial points, and then selects the ‘best’ way of clustering (the one that minimizes the sum of squared distances between each data point and its cluster center).
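As an illustration of the timing pattern, here is a sketch that uses a small synthetic data frame in place of the exercise data. The synthetic `df` is an assumption made only so that the snippet runs on its own; timings on the real data will differ.

```r
# Synthetic stand-in for the exercise data (an assumption, not the real file):
# 300 two-dimensional points scattered around three centres
set.seed(1)
df <- data.frame(
  x = c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 10)),
  y = c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 0))
)

# system.time() evaluates the expression and reports user, system and
# elapsed times in seconds; the assignment stays visible afterwards
timing <- system.time(
  fit_30 <- kmeans(df, centers = 3, nstart = 30)
)

print(timing)
print(fit_30$tot.withinss)
```

The `elapsed` entry is the wall-clock time, which is the number to compare against a parallel run later.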
Exercise 4
Now we’ll try to parallelize the runs of `kmeans`. The first step is to write the code that performs a single run of the `kmeans` function. The code has to do the following:
- Randomly choose three rows in the data set (this can be done using the `sample` function).
- Subset the data set keeping only the chosen rows (they will be used as initial points in the k-means algorithm).
- Transform the obtained subset into a matrix.
- Run the `kmeans` function using the original data set, the obtained matrix (as the `centers` argument), and without the `nstart` argument.
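The four steps above can be sketched as follows. The synthetic `df` and the names `idx`, `init_centers` and `fit` are assumptions introduced for illustration; the exercise data would take the place of `df`.

```r
# Synthetic stand-in for the exercise data (an assumption)
set.seed(1)
df <- data.frame(
  x = c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 10)),
  y = c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 0))
)

# Step 1: randomly choose three row indices
idx <- sample(nrow(df), 3)

# Steps 2 and 3: keep only the chosen rows and convert them to a matrix
init_centers <- as.matrix(df[idx, ])

# Step 4: a single k-means run started from those three points
fit <- kmeans(df, centers = init_centers)

print(fit$tot.withinss)
```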
Exercise 5
The second step is to wrap the code written in the previous exercise into a function. It should take one argument, which is not used (see the explanation on the solutions page), and should return the output of the `kmeans` function.
Such functions are often named `wrapper`, but they can have any name.
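One possible shape for such a function, shown as a sketch on synthetic data (the `df` below and the argument name `i` are assumptions). The argument exists only because `lapply`-style functions pass each element of their input as the first argument; the function itself ignores it.

```r
# Synthetic stand-in for the exercise data (an assumption)
set.seed(1)
df <- data.frame(
  x = c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 10)),
  y = c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 0))
)

# `i` is accepted but deliberately unused
wrapper <- function(i) {
  idx <- sample(nrow(df), 3)
  kmeans(df, centers = as.matrix(df[idx, ]))
}

# Sequential check with plain lapply before going parallel
fits <- lapply(1:5, wrapper)
print(sapply(fits, function(fit) fit$tot.withinss))
```

Checking the function sequentially first makes it easier to separate bugs in the clustering code from problems with the parallel setup.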
Exercise 6
Let’s prepare for parallel execution of the function:
- Initialize a cluster for parallel computations using the `sfInit` function from the `snowfall` package. Set the `parallel` argument equal to `TRUE`. If your machine has two logical CPUs, assign two to the `cpus` argument; if the number of CPUs exceeds two, set this argument equal to the number of logical CPUs on your machine minus one.
- Make the data set available for parallel processes with the `sfExport` function.
- Prepare the random number generation for parallel execution using the `sfClusterSetupRNG` function. Set the `seed` argument equal to 1234.
(Note that `kmeans` comes from the `stats` package, which ships with base R. If you want to run in parallel a function from a downloaded package, you also have to make it available for parallel execution with the `sfLibrary` function.)
Exercise 7
Use the `sfLapply` function from the `snowfall` package to run the wrapper function (written in Exercise 5) 30 times in parallel, and store the output of `sfLapply` in the `result` variable. Also use the `system.time` function to measure the time spent on execution of `sfLapply`.
Note that `sfLapply` is a parallel version of the `lapply` function. It takes two main arguments: (1) a vector or a list (in this case it should be a numeric vector of length 30), and (2) the function to be applied to each element of the vector or list.
Exercise 8
Stop the cluster for parallel execution with the `sfStop` function from the `snowfall` package.
Exercise 9
Explore the output of `sfLapply` (the `result` object):
- Find out to what class it belongs.
- Print its length.
- Print the structure of its first element.
- Find the value of the `tot.withinss` sub-element in the first element (it represents the total sum of squared distances between data points and their cluster centers in a given solution to the clustering problem). Print that value.
Exercise 10
Find an element of the `result` object with the lowest `tot.withinss` value (there may be multiple such elements), and assign it to the `best_result` variable.
Compare the `tot.withinss` value of that variable with the corresponding value of the `fit_30` variable, which was obtained in Exercise 3.
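The selection step can be sketched without a cluster by building a stand-in `result` list with plain `lapply`. That substitution, together with the synthetic `df` and the helper name `tot`, is an assumption made so the snippet runs on its own; the list returned by `sfLapply` has the same structure.

```r
# Synthetic stand-in for the exercise data (an assumption)
set.seed(1)
df <- data.frame(
  x = c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 10)),
  y = c(rnorm(100, 0), rnorm(100, 5), rnorm(100, 0))
)

wrapper <- function(i) {
  idx <- sample(nrow(df), 3)
  kmeans(df, centers = as.matrix(df[idx, ]))
}

# Sequential stand-in for the sfLapply output: a list of 30 kmeans fits
result <- lapply(1:30, wrapper)

# Collect tot.withinss from every run into a numeric vector
tot <- vapply(result, function(fit) fit$tot.withinss, numeric(1))

# which.min() returns the index of the first minimum if there are ties
best_result <- result[[which.min(tot)]]

# Reference fit using the built-in multiple-start option
fit_30 <- kmeans(df, centers = 3, nstart = 30)

print(c(best_of_30 = best_result$tot.withinss,
        nstart_30  = fit_30$tot.withinss))
```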