Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In general, foreach is a statement for iterating over items in a collection without using an explicit counter. In R, it is also a way to run code in parallel, which may be more convenient and readable than the sfLapply function (considered in the previous set of exercises in this series) or other apply-like functions.
Apart from being able to run code in parallel, R's foreach differs from the standard for loop in several other ways. Specifically, the foreach statement:
- can iterate over several variables simultaneously,
- returns a value (a list, a vector, a matrix, or another object),
- can skip some iterations based on a condition (the last two properties make it similar to list comprehensions, which are present in Python and some other languages),
- has a special syntax that includes the operators %do% (see an example in Exercise 1), %dopar%, and %:%.
The first six exercises in this set provide practice with basic operations using the foreach statement, and the last four show how to run it in parallel using multiple CPU cores on one machine. The task will be to parallelize identical operations on a set of files (the zipped data files can be downloaded here). It is assumed that your computer has two or more CPU cores.
The exercises require the packages foreach, doParallel, and parallel. The first two packages have to be installed, and the last one comes with the standard R distribution. The packages doParallel and parallel are necessary to run foreach in parallel.
For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.
Exercise 1
The foreach function (from the package of the same name) is typically used as part of a special statement. In its simplest form, the statement looks like this:
result <- foreach(i = 1:3) %do% sqrt(i)
The statement above consists of three parts:
- foreach(i = 1:3): a call to the foreach function, with an argument that includes an iteration variable (i) and a sequence to be iterated over (1:3),
- %do%: a special operator,
- sqrt(i): an R expression, which represents the operation to be performed on the iteration variable (this part of the statement is equivalent to the body of a loop).
The code iterates over the sequence, applies the operation defined in the expression to each element of the sequence, and stores the output in the result variable.
Note that if the expression extends over several lines, it has to be enclosed in curly braces. The use of the iteration variable is not mandatory: if you just want to repeat the expression n times without passing anything to it, you can pass only a sequence of length n as input to foreach.
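The two points above can be illustrated with a minimal sketch. The variable names (squares_plus_one, repeated) and the values used are chosen here purely for illustration and are not part of the exercise.

```r
library(foreach)

# A multi-line expression must be wrapped in curly braces;
# the value of the last line is what each iteration returns.
squares_plus_one <- foreach(i = 1:3) %do% {
  s <- i^2
  s + 1
}

# No iteration variable: the unnamed sequence 1:4 only determines
# how many times the expression is evaluated.
repeated <- foreach(1:4) %do% "hello"

print(unlist(squares_plus_one))
print(length(repeated))
```

Here the body of the second statement never refers to the sequence, so the sequence acts only as a repetition count.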
In this exercise:
- Run the code above, print the result object, and find which class it belongs to.
- Use the foreach function to reverse the computation, i.e. write a line of code that takes the result object as input and outputs the original sequence. Print the sequence.
Exercise 2
The foreach function allows the use of several iteration variables simultaneously. They are passed to the function as arguments, separated by commas.
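As a rough sketch of the syntax (with two equal-length vectors picked here for illustration, not the ones from the task below), two iteration variables can be advanced in lockstep like this:

```r
library(foreach)

# a and b advance together: (1, 4), then (2, 5), then (3, 6).
# The parentheses are needed because %do% binds more tightly than *.
products <- foreach(a = 1:3, b = 4:6) %do% (a * b)

print(unlist(products))
```

Without the parentheses, R would evaluate foreach(...) %do% a first and only then multiply the result by b.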
Run the foreach function with two iteration variables to get a sequence of their sums. The variables have to iterate over a vector of integers from 1 to 3, and a vector of 5 integers of value 10. Print the result.
(Tip: if you want to use an arithmetic operator to calculate the sum then the expression must be placed in parentheses or curly braces).
What is the length of the resulting object? How does the function deal with vectors of different lengths?
Exercise 3
The package iterators provides several functions that can be used to create sequences for the foreach function. For example, the irnorm function creates an object that iterates over vectors of normally distributed random numbers. It is useful when you need to use random numbers drawn from one distribution in an expression that is run in parallel.
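A minimal sketch of how such an iterator might be used follows. The vector size (4), the number of vectors (2), the seed, and the use of mean are arbitrary choices made here for illustration.

```r
library(foreach)
library(iterators)

set.seed(42)  # arbitrary seed, only to make the run repeatable

# irnorm(4, count = 2) yields two vectors of 4 standard normal draws each;
# the first argument is passed on to rnorm, count limits the number of vectors.
it <- irnorm(4, count = 2)

# Each iteration receives one whole vector as x.
means <- foreach(x = it) %do% mean(x)

print(length(means))
```

Each element of the returned list corresponds to one vector produced by the iterator.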
In this exercise, use the foreach and irnorm functions to iterate over 3 vectors, each containing 5 random numbers. Find the largest value in each vector, and print those largest values.
Before running the foreach function, set the seed to 1234.
Exercise 4
By default, the foreach function returns a list. But it can also return objects of other types. This requires changing the value of the .combine parameter of the function. This exercise provides practice in using this parameter.
As in the previous exercise, use the foreach and irnorm functions to iterate over 3 vectors, each containing 5 random numbers. But now use an expression that returns all the values generated by irnorm. Pass the .combine parameter to the foreach function with the value 'c'. Print the result, and find its class and length.
Then run the code again with the 'cbind' value assigned to the .combine parameter. Print the result, and find its class and size.
Note that 'c' and 'cbind' are R functions from the base package. Other functions (including user-written ones) can also be used to combine the outputs of the expression.
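The sketch below shows how the .combine parameter changes the type of the returned object, using a deterministic sequence instead of random numbers. The keep_max function is a hypothetical user-written combiner introduced here only as an example.

```r
library(foreach)

# 'c' concatenates the per-iteration results into a vector.
as_vector <- foreach(i = 1:3, .combine = 'c') %do% (i^2)

# 'rbind' stacks the per-iteration vectors as rows of a matrix.
as_matrix <- foreach(i = 1:3, .combine = 'rbind') %do% c(i, i^2)

# A user-written combiner (hypothetical): keeps the larger of two results.
keep_max <- function(a, b) max(a, b)
largest <- foreach(i = c(3, 7, 5), .combine = keep_max) %do% i

print(as_vector)
print(dim(as_matrix))
print(largest)
```

A combiner passed this way is called with two arguments: the result accumulated so far and the result of the next iteration.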
Exercise 5
The results of the expression placed after the %do% operator can be combined in different ways. Look at the documentation for the foreach function to find what value has to be assigned to the .combine parameter to sum the values produced by the expression in each iteration.
Run the code used in the previous exercise with that value assigned to the .combine parameter, and print the result.
Before running the code set the seed to 1234.
Exercise 6
The sequence passed to the foreach function can be filtered so that the expression after %do% is applied only to a part of the sequence. This is done using a syntax like this:
result <- foreach(i = some_sequence) %:% when(i > 0) %do% sqrt(i)
Note that the %:% operator and the when function, which contains a Boolean expression involving the iteration variable, are added to a standard foreach statement.
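To make the filtering concrete, here is a small sketch with an input vector made up for illustration (it is not the sequence from the task below). The .combine = 'c' argument is added so that the result is a vector rather than a list.

```r
library(foreach)

values <- c(-4, 1, 9, -16)  # illustrative input with some non-positive entries

# Only the elements that satisfy the condition in when() reach sqrt().
roots <- foreach(i = values, .combine = 'c') %:% when(i > 0) %do% sqrt(i)

print(roots)
```

The negative values are skipped entirely, so sqrt is never called on them and the result has only two elements.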
Modify the example above to get a vector of logs of all even integers in the range from 1 to 10. Print the result.
Exercise 7
Now let's parallelize the execution of the foreach function. We'll use it to read similarly named files and perform identical calculations on the data from each file.
As a first step, write a function to be run in parallel. The function takes an integer as input, and performs the following actions:
- Create a string (character vector) with a file name by concatenating the constant parts of the name (test_data_, .csv) with the integer (for example, with 1 as the integer the result is test_data_1.csv).
- Read the file with that name from the current working directory into a data frame.
- Calculate the mean value of each column in the data frame.
- Return a vector of those values.
Exercise 8
The second step is to create a backend for parallel execution:
- Make a cluster for parallel execution using the makeCluster function from the parallel package; pass the size of the cluster (i.e. the number of CPU cores that you want to use in the computations) as an argument to this function.
- Register the cluster with the registerDoParallel function from the doParallel package.
Note that by default the makeCluster function creates a PSOCK cluster, which is an enhanced version of the SOCK cluster implemented in the snow package. Accordingly, the PSOCK cluster is a pool of worker processes that exchange data with the master process via sockets. The makeCluster function can also create other types of clusters.
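The life cycle of such a backend might look like the sketch below. The cluster size of 2 is an assumption based on the two-core requirement stated at the beginning, and the getDoParWorkers call (from the foreach package) is used here only to check that the registration took effect.

```r
library(foreach)
library(parallel)
library(doParallel)

n_workers <- 2                  # assumed cluster size; adjust to your machine

cl <- makeCluster(n_workers)    # starts a PSOCK cluster by default
registerDoParallel(cl)          # makes %dopar% use this cluster

workers <- getDoParWorkers()    # number of workers foreach will use
print(workers)

stopCluster(cl)                 # release the worker processes when done
```

Stopping the cluster at the end matters because the worker processes otherwise keep running in the background.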
Exercise 9
The last step is to run the foreach function to read and analyze 10 test files (contained in this archive) using the function created in Exercise 7. Combine the outputs of that function using rbind.
Perform this task twice:
- with the %do% operator, which evaluates the expression sequentially, and
- with the %dopar% operator, which evaluates the expression in parallel.
In both cases, measure the execution time using the system.time function. Print the result of the last run.
IMPORTANT: after completing the parallel computations, stop the cluster (created in Exercise 8) using the stopCluster function from the parallel package.
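The difference between the two operators can be sketched with a toy task instead of file reading. The slow_square function below is a hypothetical stand-in for any slow per-item computation, and the cluster size of 2 is an assumption; actual timings depend on the machine.

```r
library(foreach)
library(parallel)
library(doParallel)

# Hypothetical stand-in for a slow per-item computation.
slow_square <- function(i) {
  Sys.sleep(0.2)
  i^2
}

cl <- makeCluster(2)   # assumed cluster size
registerDoParallel(cl)

# Sequential evaluation.
t_seq <- system.time(
  res_seq <- foreach(i = 1:4, .combine = 'c') %do% slow_square(i)
)

# Parallel evaluation on the registered cluster.
t_par <- system.time(
  res_par <- foreach(i = 1:4, .combine = 'c') %dopar% slow_square(i)
)

stopCluster(cl)

print(t_seq["elapsed"])
print(t_par["elapsed"])
```

Both runs should return the same values; only the elapsed time is expected to differ, and for very short tasks the overhead of communicating with the workers can outweigh the gain.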
Exercise 10
Modify the code written in Exercises 7 and 9 to calculate the mean and the variance of the values contained in the first column of each file. The resulting object must be a two-column matrix, with the first column containing the means and the second column containing the variances (the number of rows must be equal to the number of files).
Repeat the actions listed in Exercise 8 to prepare a cluster for parallel execution, then run the modified code in parallel.
Print the result.
Stop the cluster.