Intro
As is the case with many posts, this one was born from an interesting problem that arose recently on StackOverflow. Steffen, a scientist at an unnamed weather service, faced an issue with the amount of computational time required by a triple loop in R. Specifically, Steffen needed to sum over different contiguous regions of elements within a 3D array with dimensions 2500 x 2500 x 50
to obtain the number of neighbors. The issue is quite common within geographic information science and spatial statistics. What follows is a modified version of the response that I gave, providing additional insight into various solutions.
Problem Statement
Consider an array of matrices that contain only 1 and 0 entries and that are spread out over time, with dimensions x × y × t. Within each matrix, there are specific regions on the x-y plane at a fixed time t that must be summed over to obtain the number of neighboring elements. Each region is constrained to a four by four tile anchored at a point (i, j), i.e. the points (x, y) where i ≤ x ≤ i + 3 and j ≤ y ≤ j + 3.
Sample Result:
Without loss of generality, I’ve opted to downscale the problem to avoid a considerable amount of output while still displaying a sample result. Thus, the sample result has dimensions x = 6, y = 6, and t = 2. Therefore, the results of the neighboring clusters are:
Note: Both time points, t = 1 and t = 2, are the same!!! More on this interesting pattern later…
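For reference, the downscaled array itself can be built along the following lines; the array(0:1, dims) fill is the construction examined later in the post, and the neighbor counts come from the brute-force loop defined in the next section.

```r
# Downscaled problem: a 6 x 6 grid over 2 time points, filled with
# alternating 0/1 values (recycled column by column).
dims <- c(6, 6, 2)
a    <- array(0:1, dim = dims)

a[, , 1]   # peek at the first time slice
```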
Possible Solutions
There are many ways to approach a problem like this. The first is to try to optimize the initial code base within R. The second, and one of the primary reasons for this article, is to port the computational portion of the code to C++ and perhaps add a parallelization component, in the hope that such a setup decreases the time required to perform the computations. The third is to study the underlying structure of the arrays, searching for patterns that can be exploited to build a cache of values that can be reused instead of recomputed at each time point.
Thus, there are really three different components within this post that will be addressed:
- Optimizing R code directly in hopes of obtaining a speedup in brute forcing the problem;
- Porting over the R code into C++ and parallelizing the computation using more brute force;
- Trying to determine different patterns in the data and exploiting them.
Optimizing within R
The initial problem statement has something that all users of R dread: a loop. However, it isn’t just one loop, it’s three! As is well known, one of the key downsides to R is looping. This problem has a lot of coverage – my favorite being the straight, curly, or compiled comparison – and is one of the primary reasons why Rcpp is favored in loop-heavy problems.
There are a few ways we can aim to optimize just the R code:
- Cache subsets within the loop.
- Parallelize the time computation stage.
Original
To show differences between functions, I’ll opt to declare the base computational loop given above as:
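Here is a minimal sketch of such a brute-force triple loop; the function name cube_r and the exact signature (array, pre-allocated result array, and the three dimensions) are my own assumptions.

```r
# Brute-force approach: for every time slice and every valid (x, y) anchor,
# sum the 4 x 4 tile of neighbors starting at that anchor.
cube_r <- function(a, res, xdim, ydim, tdim) {
  for (t in 1:tdim) {
    for (x in 1:(xdim - 3)) {
      for (y in 1:(ydim - 3)) {
        res[x, y, t] <- sum(a[x:(x + 3), y:(y + 3), t])
      }
    }
  }
  res
}
```

On the downscaled array this would be called as cube_r(a, res, 6, 6, 2), with res pre-allocated via array(NA, dims).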
Cached R
The first order of business is to implement a cached value scheme so that we avoid repeatedly subsetting the same elements.
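One way to sketch this, assuming the cube_r signature above and a function name of cube_r_cache: pull each time slice out of the 3D array once, then cache the four-row band, so the innermost loop only subsets columns of a small matrix.

```r
cube_r_cache <- function(a, res, xdim, ydim, tdim) {
  for (t in 1:tdim) {
    temp_time <- a[, , t]                    # cache the current time slice
    for (x in 1:(xdim - 3)) {
      temp_rows <- temp_time[x:(x + 3), ]    # cache the 4-row band
      for (y in 1:(ydim - 3)) {
        res[x, y, t] <- sum(temp_rows[, y:(y + 3)])
      }
    }
  }
  res
}
```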
Parallelized R
Next up, let’s implement a way to parallelize the computation over time using R’s built-in parallel package.
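A sketch of one way to do this with a PSOCK cluster; the function name cube_r_parallel matches the later discussion, but the internal structure (one worker call per time slice via parLapply) is an assumption.

```r
library(parallel)

cube_r_parallel <- function(a, xdim, ydim, tdim, ncores) {
  cl <- makeCluster(ncores)
  on.exit(stopCluster(cl), add = TRUE)

  # Worker: brute-force 4 x 4 sums for a single time slice.
  slice_sum <- function(t, a, xdim, ydim) {
    slice <- a[, , t]
    res   <- matrix(NA_real_, xdim, ydim)
    for (x in 1:(xdim - 3)) {
      for (y in 1:(ydim - 3)) {
        res[x, y] <- sum(slice[x:(x + 3), y:(y + 3)])
      }
    }
    res
  }

  # a is shipped to every worker; each worker handles some of the t slices.
  slices <- parLapply(cl, 1:tdim, slice_sum, a = a, xdim = xdim, ydim = ydim)
  array(unlist(slices), dim = c(xdim, ydim, tdim))
}
```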
Benchmarking
As this is Rcpp, benchmarks are king. Here I’ve opted to create a benchmark of the different functions after trying to optimize them within R.
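A sketch of what the benchmark might look like with the microbenchmark package; the dimensions below are placeholders, not the ones used for the published timings.

```r
library(microbenchmark)

dims <- c(20, 20, 5)
a    <- array(0:1, dim = dims)
res  <- array(NA_real_, dim = dims)

microbenchmark(
  original   = cube_r(a, res, dims[1], dims[2], dims[3]),
  cached     = cube_r_cache(a, res, dims[1], dims[2], dims[3]),
  parallel_2 = cube_r_parallel(a, dims[1], dims[2], dims[3], ncores = 2),
  parallel_3 = cube_r_parallel(a, dims[1], dims[2], dims[3], ncores = 3),
  times      = 10
)
```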
Output:
Not surprisingly, the cached subset function performed better than the original function. What was particularly surprising was that the parallelization option within R took considerably longer as additional cores were added. This was partly because the a array had to be replicated out across the different R processes, and partly because of the time lag between spawning the processes and winding them down. This may prove to be a fatal flaw in the construction of cube_r_parallel, as a normal user may wish to make repeated calls within the same parallelized session. However, I digress, as I feel I’ve spent too much time describing ways to optimize the R code.
Porting into C++
To keep in the spirit of pure ‘brute force’, we can quickly port the cached R version of the computational loop over to Armadillo, a C++ linear algebra library, and use OpenMP with RcppArmadillo.
Why not just use Rcpp, though? The rationale for using RcppArmadillo over Rcpp is principally the native support for multidimensional arrays via arma::cube. In addition, it’s a bit painful in the current version of Rcpp to work with an array. Hence, I’d rather face the cost of copying an object from SEXP to arma and back again than mess around with this.
Before we get started, it is very important to provide protection for systems that still lack OpenMP by default in R (**cough** OS X **cough**). OpenMP support can be enabled using the steps described at http://thecoatlessprofessor.com/programming/openmp-in-r-on-os-x/, but we still need to protect users who do not enable it by guarding the header inclusion:
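The guard is the usual _OPENMP preprocessor check, e.g.:

```cpp
// Only pull in the OpenMP header when the compiler actually supports it
// (e.g. the default OS X clang toolchain does not).
#ifdef _OPENMP
#include <omp.h>
#endif
```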
This provides protection from the compiler throwing an error due to OpenMP not being available on the system. Note: in the event that the system does not have OpenMP, the loop will simply be executed serially, just like always.
With this being said, let’s look at the C++ port of the R function:
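A sketch of what that port can look like with RcppArmadillo and OpenMP; the function name cube_cpp_parallel and the exact signature are my assumptions, but the structure mirrors the cached R loop with the time loop split across threads.

```cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]

#ifdef _OPENMP
#include <omp.h>
#endif

// [[Rcpp::export]]
arma::cube cube_cpp_parallel(const arma::cube& a, unsigned int ncores = 1) {

  unsigned int xdim = a.n_rows, ydim = a.n_cols, tdim = a.n_slices;

  arma::cube res(xdim, ydim, tdim);
  res.fill(arma::datum::nan);

  int tmax = static_cast<int>(tdim);

#ifdef _OPENMP
  // a, res, xdim, ydim, and ncores are shared; t is private to each thread.
  #pragma omp parallel for num_threads(ncores)
#endif
  for (int t = 0; t < tmax; ++t) {
    const arma::mat& slice = a.slice(t);            // cache the time slice
    for (unsigned int x = 0; x + 3 < xdim; ++x) {
      for (unsigned int y = 0; y + 3 < ydim; ++y) {
        // Sum the 4 x 4 tile anchored at (x, y).
        res(x, y, t) = arma::accu(slice.submat(x, y, x + 3, y + 3));
      }
    }
  }

  return res;
}
```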
A few things to note here:
- a, res, xdim, ydim, and ncores are shared across threads.
- t is unique to each thread.
- We protect users that cannot use OpenMP!
To verify that this is an equivalent function, we opt to check the object equality:
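For instance, something along these lines, reusing the objects and function names from the sketches above:

```r
res <- array(NA_real_, dim = dims)

all.equal(cube_r_cache(a, res, dims[1], dims[2], dims[3]),
          cube_cpp_parallel(a, ncores = 2))
```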
Timings
Just as before, let’s check the benchmarks to see how well we did:
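A sketch of that benchmark, again with placeholder dimensions and the function names used in the sketches above:

```r
microbenchmark(
  r_original = cube_r(a, res, dims[1], dims[2], dims[3]),
  r_cached   = cube_r_cache(a, res, dims[1], dims[2], dims[3]),
  cpp_1core  = cube_cpp_parallel(a, ncores = 1),
  cpp_2core  = cube_cpp_parallel(a, ncores = 2),
  cpp_3core  = cube_cpp_parallel(a, ncores = 3),
  cpp_4core  = cube_cpp_parallel(a, ncores = 4),
  times      = 10
)
```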
Output:
Wow! The C++ version of the parallelization really did wonders compared with the previous R implementation of parallelization. The speedups are about 44x vs. the original and 34x vs. the optimized R loop (using 3 cores). Note that with the addition of the 4th core, the parallelization option performs poorly, as the system running the benchmarks has only four cores. Thus, one of the cores is trying to keep up with the parallelization while also handling operating system tasks and so on.
Now, this isn’t to say that there is no cap to parallelization benefits given infinite amounts of processing power. In fact, there are two laws that govern general speedups from parallelization: Amdahl’s Law (Fixed Problem Size) and Gustafson’s Law (Scaled Speedup).
Detecting Patterns
Firstly, notice how the array is constructed in this case with: array(0:1, dims). There seems to be some sort of pattern depending on xdim, ydim, and tdim. If we can recognize the pattern in advance, the amount of computation required decreases. However, this may also impact the ability to generalize the algorithm to other cases outside the problem statement, which is why this part of the post comes last.
After some trial and error using different dimensions, different patterns become recognizable and reproducible. Most notably, we have three different cases:
- Case 1: If xdim is even, then the rows of a matrix alternate between rows containing all 1s and rows containing all 0s.
- Case 2: If xdim is odd and ydim is even, then only the rows of a matrix alternate, with each row containing a combination of both 1s and 0s.
- Case 3: If xdim is odd and ydim is odd, then both the rows and the matrices alternate, with each containing a combination of both 1s and 0s.
Examples of Pattern Cases
Let’s see the cases in action to observe the patterns. Note that the size of these arrays is small, and the for loop given above would likely fail on them with an out-of-bounds error.
Case 1:
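One array that exhibits this case (the dimensions are my own choice for illustration):

```r
array(0:1, dim = c(4, 3, 2))   # xdim = 4 is even
```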
Output:
Case 2:
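One array that exhibits this case (again, illustrative dimensions):

```r
array(0:1, dim = c(3, 4, 2))   # xdim = 3 is odd, ydim = 4 is even
```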
Output:
Case 3:
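One array that exhibits this case (illustrative dimensions):

```r
array(0:1, dim = c(3, 3, 3))   # xdim and ydim are both odd
```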
Output:
Pattern Hacking
Based on the above discussion, we opt to write a bit of code that exploits this unique pattern. The code could be written in either R or C++; due to the nature of this website, we opt to proceed in the latter. Nevertheless, as an exercise, feel free to backport this code into R.
Having acknowledged that, we start off by trying to create code that can fill a matrix column by column with either an even or an odd vector.
Odd Vector
The odd vector is defined as having its initial value given by 1 instead of 0. It is used heavily in both Case 2 and Case 3.
Even
In comparison to the odd vector, the even vector starts at 0. This is used principally in Case 1 and then in Case 2 and Case 3 to alternate columns.
Creating Alternating Vectors
To obtain such an alternating vector that switches between two values, we create it using the modulus of the vector position, i % 2, where 0 % 2 and 2 % 2 are 0 (divisible by 2, i.e. even) and 1 % 2 is 1 (not divisible by 2, i.e. odd).
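A sketch of how such a generator might look in C++ with Armadillo; the function name and the start parameter are my own choices, with start = 0 producing the even vector and start = 1 the odd one.

```cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// Fill a column vector that alternates between 1 and 0 based on position
// parity; start = 0 gives (0, 1, 0, ...), start = 1 gives (1, 0, 1, ...).
arma::vec alternating_vec(unsigned int n, unsigned int start = 0) {
  arma::vec v(n);
  for (unsigned int i = 0; i < n; ++i) {
    v(i) = (i + start) % 2;   // modulus of the (offset) position
  }
  return v;
}
```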
Creating the three cases of matrix
With our ability to generate odd and even vectors by column, we now need to figure out how to create the matrices described in each case (a sketch of one way to build them follows the list). As mentioned above, there are three cases of matrix given as:
- The even,
- The bad odd,
- And the ugly odd.
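Here is one way the slice builder could be sketched on top of alternating_vec(): Case 1 fills every column with the even vector, while the odd cases alternate the column start, offset by the slice's parity for Case 3. The function name and signature are assumptions, not the post's original code.

```cpp
// Build one time slice; assumes alternating_vec() from the previous sketch.
arma::mat make_slice(unsigned int xdim, unsigned int ydim,
                     bool xdim_even, unsigned int start = 0) {
  arma::mat m(xdim, ydim);
  for (unsigned int j = 0; j < ydim; ++j) {
    // Case 1: every column is the even vector.
    // Cases 2 & 3: columns alternate between even and odd vectors,
    //              shifted by `start` (the slice parity in Case 3).
    unsigned int col_start = xdim_even ? 0 : (start + j) % 2;
    m.col(j) = alternating_vec(xdim, col_start);
  }
  return m;
}
```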
Calculation Engine
Next, we need to create a computational loop that subsets the appropriate contiguous areas of the matrix to figure out the number of neighbors. In comparison to the problem statement, note that this loop omits t, as we no longer need to repeat the calculation at each time point. Instead, we only need to compute the values once, cache the result, and then duplicate it across the array.
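A sketch of that engine, operating on a single Armadillo matrix (the function name is assumed):

```cpp
// 4 x 4 neighbor sums for one slice.
arma::mat calc_neighbors(const arma::mat& slice) {
  unsigned int xdim = slice.n_rows, ydim = slice.n_cols;

  arma::mat res(xdim, ydim);
  res.fill(arma::datum::nan);

  for (unsigned int x = 0; x + 3 < xdim; ++x) {
    for (unsigned int y = 0; y + 3 < ydim; ++y) {
      res(x, y) = arma::accu(slice.submat(x, y, x + 3, y + 3));
    }
  }
  return res;
}
```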
Call Main Function
Whew, that was a lot of work. But, by approaching the problem this way, we have:
- Created reusable code snippets.
- Decreased the size of the main call function.
- Improved clarity of each operation.
Now we are ready to write the glue that combines all the different components. As a result, we will obtain the desired neighbor information.
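One way the glue could be sketched, building on make_slice() and calc_neighbors() above; the function name and the case dispatch are my own reconstruction of the idea rather than the post's original code.

```cpp
// [[Rcpp::export]]
arma::cube cube_cpp_pattern(unsigned int xdim, unsigned int ydim,
                            unsigned int tdim) {
  arma::cube res(xdim, ydim, tdim);

  if (xdim % 2 == 0 || ydim % 2 == 0) {
    // Cases 1 & 2: every time slice is identical, so compute once and copy.
    arma::mat out = calc_neighbors(make_slice(xdim, ydim, xdim % 2 == 0));
    for (unsigned int t = 0; t < tdim; ++t) res.slice(t) = out;
  } else {
    // Case 3: slices alternate between two patterns, so compute both once.
    arma::mat out0 = calc_neighbors(make_slice(xdim, ydim, false, 0));
    arma::mat out1 = calc_neighbors(make_slice(xdim, ydim, false, 1));
    for (unsigned int t = 0; t < tdim; ++t) {
      res.slice(t) = (t % 2 == 0) ? out0 : out1;
    }
  }
  return res;
}
```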
Verification of Results
To verify, let’s quickly create similar cases and test them against the original R function:
Case 1:
Case 2:
Case 3:
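A sketch of those checks, reusing cube_r from the R section and cube_cpp_pattern from above; the dimensions are illustrative.

```r
check_case <- function(dims) {
  a   <- array(0:1, dim = dims)
  res <- array(NA_real_, dim = dims)
  all.equal(cube_r(a, res, dims[1], dims[2], dims[3]),
            cube_cpp_pattern(dims[1], dims[2], dims[3]))
}

check_case(c(6, 6, 2))   # Case 1: xdim even
check_case(c(7, 6, 2))   # Case 2: xdim odd, ydim even
check_case(c(7, 7, 3))   # Case 3: xdim odd, ydim odd
```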
Closing Time – You don’t have to go home but you can’t stay here.
With all of these different methods now thoroughly described, let’s do one last benchmark to figure out the best of the best.
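A sketch of that final shoot-out, again with placeholder dimensions and the function names from the sketches above:

```r
dims <- c(100, 100, 10)
a    <- array(0:1, dim = dims)
res  <- array(NA_real_, dim = dims)

microbenchmark(
  r_original  = cube_r(a, res, dims[1], dims[2], dims[3]),
  r_cached    = cube_r_cache(a, res, dims[1], dims[2], dims[3]),
  cpp_omp     = cube_cpp_parallel(a, ncores = 3),
  cpp_pattern = cube_cpp_pattern(dims[1], dims[2], dims[3]),
  times       = 10
)
```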
Output:
As can be seen, the customized approach based on the data’s pattern provided the fastest speed up. Meanwhile, the code port into C++ yielded much better results than the optimized R version. Lastly, the parallelization within R was simply too time consuming for a one-off computation.
Misc
The benchmarks listed above were taken using the microbenchmark package.
Hardware:
- iMac (27-inch, Late 2013) with:
- 3.2 GHz Intel Core i5
- 32 GB 1600 MHz DDR3
- NVIDIA GeForce GT 755M 1024MB
Software:
- R version 3.3.0
- RcppArmadillo version 0.7.100.3.1
- Rcpp version 0.12.5
- Compiler: clang-omp under c++98