R has many tools for speeding up computations by making use of multiple CPU cores, either on one machine or across several machines. This series of exercises introduces the basic techniques for implementing parallel computations using multiple CPU cores on one machine.
The first step in preparing to parallelize computations is to decide whether the task can and should be run in parallel. Some tasks are inherently sequential: operations in one round depend on the results of the previous round. Such computations cannot be parallelized. The next question is whether parallelization is worthwhile. On the one hand, running tasks in parallel may reduce the computing time spent on calculations. On the other hand, it takes additional time to write code that can run in parallel and to check that it yields correct results.
Code that implements parallel computations essentially does three things:
- splits the task into pieces,
- runs them in parallel, and
- combines the results.
This set of exercises provides practice in using the snowfall package to perform parallel computations. It is based on the example of parallelizing the k-means algorithm, which splits data points into clusters (groups of points that are similar to each other). The standard k-means algorithm is sensitive to the choice of initial points, so it is advisable to run the algorithm multiple times, with different initial points, to get the best result. It is assumed that your computer has two or more CPU cores.
The data for the exercises can be downloaded here.
For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.
Exercise 1
Use the detectCores function from the parallel package to find the number of physical CPU cores on your computer. Then change the arguments of the function to find the number of logical CPU cores.
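A minimal sketch of the idea (detectCores is part of the parallel package that ships with R):

```r
library(parallel)
detectCores(logical = FALSE)  # number of physical CPU cores
detectCores(logical = TRUE)   # number of logical CPU cores (the default)
```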
Exercise 2
Load the data set, and assign it to the df variable.
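One possible way to do this, assuming the file is a CSV (the actual name and format depend on the download above; "kmeans_data.csv" below is a hypothetical placeholder):

```r
# Hypothetical file name; replace with whatever the download produces:
df <- read.csv("kmeans_data.csv")
```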
Exercise 3
Use the system.time function to measure the time spent executing the command fit_30 <- kmeans(df, centers = 3, nstart = 30), which finds three clusters in the data.
Note that this command runs the kmeans function 30 times sequentially with different (randomly chosen) initial points, and then selects the ‘best’ clustering (the one that minimizes the total sum of squared distances between each data point and its cluster center).
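The timed call itself looks like this:

```r
# Time 30 sequential k-means runs with 3 clusters each:
system.time(
  fit_30 <- kmeans(df, centers = 3, nstart = 30)
)
```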
Exercise 4
Now we’ll try to parallelize the runs of kmeans. The first step is to write the code that performs a single run of the kmeans function (see the sketch after this list). The code has to do the following:
- Randomly choose three rows in the data set (this can be done using the sample function).
- Subset the data set, keeping only the chosen rows (they will be used as initial points in the k-means algorithm).
- Transform the obtained subset into a matrix.
- Run the kmeans function using the original data set, the obtained matrix (as the centers argument), and without the nstart argument.
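A sketch of these steps, assuming df is a data frame of numeric columns:

```r
rows    <- sample(nrow(df), 3)            # three randomly chosen row indices
centers <- as.matrix(df[rows, ])          # chosen rows as a matrix of initial centers
fit     <- kmeans(df, centers = centers)  # a single run; no nstart argument
```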
Exercise 5
The second step is to wrap the code written in the previous exercise into a function. It should take one argument, which is not used (see the explanation on the solutions page), and should return the output of the kmeans function.
Such functions are often called wrappers, but they may be given any name.
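One possible shape for such a function (the name wrapper and the argument name i are arbitrary):

```r
# The unused argument lets sfLapply call the function once per element
# of a vector (see Exercise 7); its value is simply ignored:
wrapper <- function(i) {
  rows    <- sample(nrow(df), 3)
  centers <- as.matrix(df[rows, ])
  kmeans(df, centers = centers)
}
```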
Exercise 6
Let’s prepare for parallel execution of the function (see the sketch after the note below):
- Initialize a cluster for parallel computations using the sfInit function from the snowfall package. Set the parallel argument equal to TRUE. If your machine has two logical CPUs, assign two to the cpus argument; if the number of CPUs exceeds two, set this argument equal to the number of logical CPUs on your machine minus one.
- Make the data set available to the parallel processes with the sfExport function.
- Prepare the random number generation for parallel execution using the sfClusterSetupRNG function. Set the seed argument equal to 1234.
(Note that kmeans is a function from base R. If you want to run a function from a downloaded package in parallel, you also have to make it available for parallel execution with the sfLibrary function.)
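A sketch of the setup, assuming a machine with four logical CPUs (so three are given to the cluster):

```r
library(snowfall)

# Start a cluster with three worker processes (adjust cpus to your machine):
sfInit(parallel = TRUE, cpus = 3)

# Copy the data set to each worker process:
sfExport("df")

# Reproducible random number streams across workers
# (the default RNGstream type requires the rlecuyer package):
sfClusterSetupRNG(seed = 1234)
```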
Exercise 7
Use the sfLapply function from the snowfall package to run the wrapper function (written in Exercise 5) 30 times in parallel, and store the output of sfLapply in the result variable. Also apply the system.time function to measure the time spent on the execution of sfLapply.
Note that sfLapply is a parallel version of the lapply function. It takes two main arguments: (1) a vector or a list (in this case it should be a numeric vector of length 30), and (2) the function to be applied to each element of the vector or list.
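Put together, the parallel run might look like this:

```r
# Each element of 1:30 triggers one call to wrapper on some worker:
system.time(
  result <- sfLapply(1:30, wrapper)
)
```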
Exercise 8
Stop the cluster for parallel execution with the sfStop function from the snowfall package.
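This is a single call:

```r
# Shut down the worker processes and release their resources:
sfStop()
```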
Exercise 9
Explore the output of sfLapply (the result object):
- Find out to what class it belongs.
- Print its length.
- Print the structure of its first element.
- Find the value of the tot.withinss sub-element in the first element (it represents the total sum of squared distances between data points and their cluster centers in a given solution to the clustering problem), and print that value (see the sketch below).
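A sketch of the exploration:

```r
class(result)             # sfLapply returns a plain list
length(result)            # one element per run, so 30
str(result[[1]])          # a "kmeans" object with several components
result[[1]]$tot.withinss  # total within-cluster sum of squares of the first run
```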
Exercise 10
Find an element of the result object with the lowest tot.withinss value (there may be multiple such elements), and assign it to the best_result variable.
Compare the tot.withinss value of that variable with the corresponding value of the fit_30 variable, which was obtained in Exercise 3.
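One possible approach:

```r
# Collect tot.withinss from all 30 runs and keep (one of) the best:
tot_ss      <- sapply(result, function(x) x$tot.withinss)
best_result <- result[[which.min(tot_ss)]]

# Compare with the sequential result from Exercise 3:
best_result$tot.withinss
fit_30$tot.withinss
```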