Intro
As is the case with many posts, this one was born from an interesting problem that arose recently on StackOverflow. Steffen, a scientist at an unnamed weather service, faced an issue with the amount of computational time required by a triple loop in R. Specifically, Steffen needed to sum over different contiguous regions of elements within a 3D array with dimensions 2500 x 2500 x 50
to obtain the number of neighbors. The issue is quite common within geographic information science and spatial statistics. What follows is a modified version of the response that I gave, providing additional insight into various solutions.
Problem Statement
Consider an array of matrices that contain only 1 and 0 entries and that are spread out over time, with dimensions x × y × t. Within each matrix, there are specific regions on the x-y plane at a fixed time t that must be summed over to obtain the number of neighboring elements. Each region is constrained to a four by four tile anchored at a point (i, j), i.e. the points (x, y) where i ≤ x ≤ i + 3 and j ≤ y ≤ j + 3.
Sample Result:
Without loss of generality, I’ve opted to downscale the problem to avoid a considerable amount of output while still displaying a sample result. Thus, the sample result has dimensions x = 6, y = 6, and t = 2. Therefore, the results of the neighboring clusters are:
Note: Both time points, t = 1 and t = 2, are the same!!! More on this interesting pattern later…
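For reference, the downscaled array itself can be built along the following lines; the array(0:1, dims) fill is the construction examined later in the post, and the neighbor counts come from the brute-force loop defined in the next section.

```r
# Downscaled problem: a 6 x 6 grid over 2 time points, filled with
# alternating 0/1 values (recycled column by column).
dims <- c(6, 6, 2)
a    <- array(0:1, dim = dims)

a[, , 1]   # peek at the first time slice
```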
Possible Solutions
There are many ways to approach a problem like this. The first is to try to optimize the initial code base within R. The second, and one of the primary reasons for this article, is to port the computational portion of the code to C++ and perhaps add a parallelization component, in the hope that such a setup decreases the time required to perform the computations. The third is to study the underlying structure of the arrays, searching for patterns that can be exploited to build a cache of values that can be reused instead of recomputed at each time point.
Thus, there are really three different components within this post that will be addressed:
- Optimizing R code directly in hopes of obtaining a speedup in brute forcing the problem;
- Porting over the R code into C++ and parallelizing the computation using more brute force;
- Trying to determine different patterns in the data and exploiting them.
Optimizing within R
The initial problem statement has something that all users of R dread: a loop. However, it isn’t just one loop, it’s three! As is well known, one of the key downsides to R is looping. This problem has a lot of coverage – my favorite being the straight, curly, or compiled comparison – and is one of the primary reasons why Rcpp is favored in loop-heavy problems.
There are a few ways we can aim to optimize just the R code:
- Cache subsets within the loop.
- Parallelize the time computation stage.
Original
To show differences between functions, I’ll opt to declare the base computational loop given above as:
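Here is a minimal sketch of such a brute-force triple loop; the function name cube_r and the exact signature (array, pre-allocated result array, and the three dimensions) are my own assumptions.

```r
# Brute-force approach: for every time slice and every valid (x, y) anchor,
# sum the 4 x 4 tile of neighbors starting at that anchor.
cube_r <- function(a, res, xdim, ydim, tdim) {
  for (t in 1:tdim) {
    for (x in 1:(xdim - 3)) {
      for (y in 1:(ydim - 3)) {
        res[x, y, t] <- sum(a[x:(x + 3), y:(y + 3), t])
      }
    }
  }
  res
}
```

On the downscaled array this would be called as cube_r(a, res, 6, 6, 2), with res pre-allocated via array(NA, dims).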
Cached R
The first order of business is to implement a cached value scheme so that we avoid repeatedly subsetting the same elements.
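One way to sketch this, assuming the cube_r signature above and a function name of cube_r_cache: pull each time slice out of the 3D array once, then cache the four-row band, so the innermost loop only subsets columns of a small matrix.

```r
cube_r_cache <- function(a, res, xdim, ydim, tdim) {
  for (t in 1:tdim) {
    temp_time <- a[, , t]                    # cache the current time slice
    for (x in 1:(xdim - 3)) {
      temp_rows <- temp_time[x:(x + 3), ]    # cache the 4-row band
      for (y in 1:(ydim - 3)) {
        res[x, y, t] <- sum(temp_rows[, y:(y + 3)])
      }
    }
  }
  res
}
```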
Parallelized R
Next up, let’s implement a way to parallelize the computation over time using R’s built-in parallel package.
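A sketch of one way to do this with a PSOCK cluster; the function name cube_r_parallel matches the later discussion, but the internal structure (one worker call per time slice via parLapply) is an assumption.

```r
library(parallel)

cube_r_parallel <- function(a, xdim, ydim, tdim, ncores) {
  cl <- makeCluster(ncores)
  on.exit(stopCluster(cl), add = TRUE)

  # Worker: brute-force 4 x 4 sums for a single time slice.
  slice_sum <- function(t, a, xdim, ydim) {
    slice <- a[, , t]
    res   <- matrix(NA_real_, xdim, ydim)
    for (x in 1:(xdim - 3)) {
      for (y in 1:(ydim - 3)) {
        res[x, y] <- sum(slice[x:(x + 3), y:(y + 3)])
      }
    }
    res
  }

  # a is shipped to every worker; each worker handles some of the t slices.
  slices <- parLapply(cl, 1:tdim, slice_sum, a = a, xdim = xdim, ydim = ydim)
  array(unlist(slices), dim = c(xdim, ydim, tdim))
}
```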
Benchmarking
As this is Rcpp, benchmarks are king. Here I’ve opted to create a benchmark of the different functions after trying to optimize them within R.
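A sketch of what the benchmark might look like with the microbenchmark package; the dimensions below are placeholders, not the ones used for the published timings.

```r
library(microbenchmark)

dims <- c(20, 20, 5)
a    <- array(0:1, dim = dims)
res  <- array(NA_real_, dim = dims)

microbenchmark(
  original   = cube_r(a, res, dims[1], dims[2], dims[3]),
  cached     = cube_r_cache(a, res, dims[1], dims[2], dims[3]),
  parallel_2 = cube_r_parallel(a, dims[1], dims[2], dims[3], ncores = 2),
  parallel_3 = cube_r_parallel(a, dims[1], dims[2], dims[3], ncores = 3),
  times      = 10
)
```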
Output:
Not surprisingly, the cached subset function performed better than the original function. What was particularly surprising was that the parallelization option within R took considerably longer as additional cores were added. This was partly because the a array had to be replicated out across the different R processes, and partly because of the time lag between spawning the processes and winding them down. This may prove to be a fatal flaw in the construction of cube_r_parallel, as a normal user may wish to make repeated calls within the same parallelized session. However, I digress, as I feel I’ve spent too much time describing ways to optimize the R code.
Porting into C++
To keep in the spirit of pure ‘brute force’, we can quickly port the cached R version of the computational loop over to Armadillo, a C++ linear algebra library, and use OpenMP with RcppArmadillo.
Why not just use Rcpp, though? The rationale for using RcppArmadillo over Rcpp is principally the native support for multidimensional arrays via arma::cube. In addition, it’s a bit painful in the current version of Rcpp to work with an array. Hence, I’d rather face the cost of copying an object from SEXP to arma and back again than mess around with this.
Before we get started, it is very important to provide protection for systems that still lack OpenMP by default in R (**cough** OS X **cough**). OpenMP support can be enabled using the steps described at http://thecoatlessprofessor.com/programming/openmp-in-r-on-os-x/, but we still need to protect users who do not enable it by guarding the header inclusion:
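The guard is the usual _OPENMP preprocessor check, e.g.:

```cpp
// Only pull in the OpenMP header when the compiler actually supports it
// (e.g. the default OS X clang toolchain does not).
#ifdef _OPENMP
#include <omp.h>
#endif
```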
This provides protection from the compiler throwing an error due to OpenMP not being available on the system. Note: in the event that the system does not have OpenMP, the loop will simply be executed serially, just like always.
With this being said, let’s look at the C++ port of the R function:
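A sketch of what that port can look like with RcppArmadillo and OpenMP; the function name cube_cpp_parallel and the exact signature are my assumptions, but the structure mirrors the cached R loop with the time loop split across threads.

```cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]

#ifdef _OPENMP
#include <omp.h>
#endif

// [[Rcpp::export]]
arma::cube cube_cpp_parallel(const arma::cube& a, unsigned int ncores = 1) {

  unsigned int xdim = a.n_rows, ydim = a.n_cols, tdim = a.n_slices;

  arma::cube res(xdim, ydim, tdim);
  res.fill(arma::datum::nan);

  int tmax = static_cast<int>(tdim);

#ifdef _OPENMP
  // a, res, xdim, ydim, and ncores are shared; t is private to each thread.
  #pragma omp parallel for num_threads(ncores)
#endif
  for (int t = 0; t < tmax; ++t) {
    const arma::mat& slice = a.slice(t);            // cache the time slice
    for (unsigned int x = 0; x + 3 < xdim; ++x) {
      for (unsigned int y = 0; y + 3 < ydim; ++y) {
        // Sum the 4 x 4 tile anchored at (x, y).
        res(x, y, t) = arma::accu(slice.submat(x, y, x + 3, y + 3));
      }
    }
  }

  return res;
}
```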
A few things to note here:
- a, res, xdim, ydim, and ncores are shared across threads.
- t is unique to each thread.
- We protect users that cannot use OpenMP!
To verify that this is an equivalent function, we opt to check the object equality:
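For instance, something along these lines, reusing the objects and function names from the sketches above:

```r
res <- array(NA_real_, dim = dims)

all.equal(cube_r_cache(a, res, dims[1], dims[2], dims[3]),
          cube_cpp_parallel(a, ncores = 2))
```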
Timings
Just as before, let’s check the benchmarks to see how well we did:
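A sketch of that benchmark, again with placeholder dimensions and the function names used in the sketches above:

```r
microbenchmark(
  r_original = cube_r(a, res, dims[1], dims[2], dims[3]),
  r_cached   = cube_r_cache(a, res, dims[1], dims[2], dims[3]),
  cpp_1core  = cube_cpp_parallel(a, ncores = 1),
  cpp_2core  = cube_cpp_parallel(a, ncores = 2),
  cpp_3core  = cube_cpp_parallel(a, ncores = 3),
  cpp_4core  = cube_cpp_parallel(a, ncores = 4),
  times      = 10
)
```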
Output:
Wow! The C++ version of the parallelization really did wonders compared with the previous R implementation of parallelization. The speedups are about 44x vs. the original and 34x vs. the optimized R loop (using 3 cores). Note that with the addition of the 4th core, the parallelization option performs poorly, as the system running the benchmarks has only four cores. Thus, one of the cores is trying to keep up with the parallelization while also handling operating system tasks and so on.
Now, this isn’t to say that there is no cap to parallelization benefits given infinite amounts of processing power. In fact, there are two laws that govern general speedups from parallelization: Amdahl’s Law (Fixed Problem Size) and Gustafson’s Law (Scaled Speedup).
Detecting Patterns
Firstly, notice how the array is constructed in this case with: array(0:1, dims). There seems to be some sort of pattern depending on xdim, ydim, and tdim. If we can recognize the pattern in advance, the amount of computation required decreases. However, this may also impact the ability to generalize the algorithm to other cases outside the problem statement, which is why this part of the post comes last.
After some trial and error using different dimensions, different patterns become recognizable and reproducible. Most notably, we have three different cases:
- Case 1: If xdim is even, then the rows of a matrix alternate between rows containing all 1s and rows containing all 0s.
- Case 2: If xdim is odd and ydim is even, then only the rows of a matrix alternate, with each row containing a combination of both 1s and 0s.
- Case 3: If xdim is odd and ydim is odd, then both the rows and the matrices alternate, with each containing a combination of both 1s and 0s.
Examples of Pattern Cases
Let’s see the cases in action to observe the patterns. Note that the size of these arrays is small, and the for loop given above would likely fail on them with an out-of-bounds error.
Case 1:
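One array that exhibits this case (the dimensions are my own choice for illustration):

```r
array(0:1, dim = c(4, 3, 2))   # xdim = 4 is even
```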
Output:
Case 2:
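One array that exhibits this case (again, illustrative dimensions):

```r
array(0:1, dim = c(3, 4, 2))   # xdim = 3 is odd, ydim = 4 is even
```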
Output:
Case 3:
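One array that exhibits this case (illustrative dimensions):

```r
array(0:1, dim = c(3, 3, 3))   # xdim and ydim are both odd
```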
Output:
Pattern Hacking
Based on the above discussion, we opt to write a bit of code that exploits this unique pattern. The code could be written in either R or C++; due to the nature of this website, we opt to proceed in the latter. Nevertheless, as an exercise, feel free to backport this code into R.
Having acknowledged that, we start off by trying to create code that can fill a matrix column by column with either an even or an odd vector.
Odd Vector
The odd vector is defined as having its initial value given by 1 instead of 0. It is used heavily in both Case 2 and Case 3.
Even
In comparison to the odd vector, the even vector starts at 0. This is used principally in Case 1 and then in Case 2 and Case 3 to alternate columns.
Creating Alternating Vectors
To obtain such an alternating vector that switches between two values, we create it using the modulus of the vector position, i % 2, where 0 % 2 and 2 % 2 are 0 (divisible by 2, i.e. even) and 1 % 2 is 1 (not divisible by 2, i.e. odd).
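A sketch of how such a generator might look in C++ with Armadillo; the function name and the start parameter are my own choices, with start = 0 producing the even vector and start = 1 the odd one.

```cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// Fill a column vector that alternates between 1 and 0 based on position
// parity; start = 0 gives (0, 1, 0, ...), start = 1 gives (1, 0, 1, ...).
arma::vec alternating_vec(unsigned int n, unsigned int start = 0) {
  arma::vec v(n);
  for (unsigned int i = 0; i < n; ++i) {
    v(i) = (i + start) % 2;   // modulus of the (offset) position
  }
  return v;
}
```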
Creating the three cases of matrix
With our ability to generate odd and even vectors by column, we now need to figure out how to create the matrices described in each case (a sketch of one way to build them follows the list). As mentioned above, there are three cases of matrix given as:
- The even,
- The bad odd,
- And the ugly odd.
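Here is one way the slice builder could be sketched on top of alternating_vec(): Case 1 fills every column with the even vector, while the odd cases alternate the column start, offset by the slice's parity for Case 3. The function name and signature are assumptions, not the post's original code.

```cpp
// Build one time slice; assumes alternating_vec() from the previous sketch.
arma::mat make_slice(unsigned int xdim, unsigned int ydim,
                     bool xdim_even, unsigned int start = 0) {
  arma::mat m(xdim, ydim);
  for (unsigned int j = 0; j < ydim; ++j) {
    // Case 1: every column is the even vector.
    // Cases 2 & 3: columns alternate between even and odd vectors,
    //              shifted by `start` (the slice parity in Case 3).
    unsigned int col_start = xdim_even ? 0 : (start + j) % 2;
    m.col(j) = alternating_vec(xdim, col_start);
  }
  return m;
}
```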
Calculation Engine
Next, we need to create a computational loop that subsets the appropriate contiguous areas of the matrix to figure out the number of neighbors. In comparison to the problem statement, note that this loop omits t, as we no longer need to repeat the calculation at each time point. Instead, we only need to compute the values once, cache the result, and then duplicate it across the array.
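A sketch of that engine, operating on a single Armadillo matrix (the function name is assumed):

```cpp
// 4 x 4 neighbor sums for one slice.
arma::mat calc_neighbors(const arma::mat& slice) {
  unsigned int xdim = slice.n_rows, ydim = slice.n_cols;

  arma::mat res(xdim, ydim);
  res.fill(arma::datum::nan);

  for (unsigned int x = 0; x + 3 < xdim; ++x) {
    for (unsigned int y = 0; y + 3 < ydim; ++y) {
      res(x, y) = arma::accu(slice.submat(x, y, x + 3, y + 3));
    }
  }
  return res;
}
```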
Call Main Function
Whew, that was a lot of work. But, by approaching the problem this way, we have:
- Created reusable code snippets.
- Decreased the size of the main call function.
- Improved clarity of each operation.
Now we are ready to write the glue that combines all the different components. As a result, we will obtain the desired neighbor information.
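One way the glue could be sketched, building on make_slice() and calc_neighbors() above; the function name and the case dispatch are my own reconstruction of the idea rather than the post's original code.

```cpp
// [[Rcpp::export]]
arma::cube cube_cpp_pattern(unsigned int xdim, unsigned int ydim,
                            unsigned int tdim) {
  arma::cube res(xdim, ydim, tdim);

  if (xdim % 2 == 0 || ydim % 2 == 0) {
    // Cases 1 & 2: every time slice is identical, so compute once and copy.
    arma::mat out = calc_neighbors(make_slice(xdim, ydim, xdim % 2 == 0));
    for (unsigned int t = 0; t < tdim; ++t) res.slice(t) = out;
  } else {
    // Case 3: slices alternate between two patterns, so compute both once.
    arma::mat out0 = calc_neighbors(make_slice(xdim, ydim, false, 0));
    arma::mat out1 = calc_neighbors(make_slice(xdim, ydim, false, 1));
    for (unsigned int t = 0; t < tdim; ++t) {
      res.slice(t) = (t % 2 == 0) ? out0 : out1;
    }
  }
  return res;
}
```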
Verification of Results
To verify, let’s quickly create similar cases and test them against the original R function:
Case 1:
Case 2:
Case 3:
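A sketch of those checks, reusing cube_r from the R section and cube_cpp_pattern from above; the dimensions are illustrative.

```r
check_case <- function(dims) {
  a   <- array(0:1, dim = dims)
  res <- array(NA_real_, dim = dims)
  all.equal(cube_r(a, res, dims[1], dims[2], dims[3]),
            cube_cpp_pattern(dims[1], dims[2], dims[3]))
}

check_case(c(6, 6, 2))   # Case 1: xdim even
check_case(c(7, 6, 2))   # Case 2: xdim odd, ydim even
check_case(c(7, 7, 3))   # Case 3: xdim odd, ydim odd
```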
Closing Time – You don’t have to go home but you can’t stay here.
With all of these different methods now thoroughly described, let’s do one last benchmark to figure out the best of the best.
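A sketch of that final shoot-out, again with placeholder dimensions and the function names from the sketches above:

```r
dims <- c(100, 100, 10)
a    <- array(0:1, dim = dims)
res  <- array(NA_real_, dim = dims)

microbenchmark(
  r_original  = cube_r(a, res, dims[1], dims[2], dims[3]),
  r_cached    = cube_r_cache(a, res, dims[1], dims[2], dims[3]),
  cpp_omp     = cube_cpp_parallel(a, ncores = 3),
  cpp_pattern = cube_cpp_pattern(dims[1], dims[2], dims[3]),
  times       = 10
)
```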
Output:
As can be seen, the customized approach based on the data’s pattern provided the fastest speed up. Meanwhile, the code port into C++ yielded much better results than the optimized R version. Lastly, the parallelization within R was simply too time consuming for a one-off computation.
Misc
The benchmarks listed above were taken using the microbenchmark package.
Hardware:
- iMac (27-inch, Late 2013) with:
- 3.2 GHz Intel Core i5
- 32 GB 1600 MHz DDR3
- NVIDIA GeForce GT 755M 1024MB
Software:
- R version 3.3.0
- RcppArmadillo version 0.7.100.3.1
- Rcpp version 0.12.5
- Compiler: clang-omp under c++98