Divide and parallelize large data problems with Rcpp
by Błażej Moska, computer science student and data science intern
Got stuck with too large a dataset? Does R's speed drive you mad? Divide, parallelize and go with Rcpp!
One of the frustrating moments while working with data is when you need results urgently, but your dataset is so large that getting them in time seems impossible. This often happens when we need to use an algorithm with high computational complexity. I will demonstrate it on an example I've been working on.
Suppose we have a large dataset consisting of association rules and, for some reason, we want to slim it down. Whenever two rules have the same consequent and one rule's antecedent is a subset of the other rule's antecedent, we want to keep only the rule with the smaller antecedent (the probability of observing the smaller set is higher than the probability of observing the bigger one). This is illustrated below:
{A,B,C}=>{D}
{E}=>{F}
{A,B}=>{D}
{A}=>{D}
How can we achieve that? For example, with the pseudo-algorithm below:
set flag of every rule to 0
for i = 1 to n:
    for j = i+1 to n:
        if consequent[i] == consequent[j]:
            if antecedent[i] contains antecedent[j]:
                set flag of rule i to 1    # rule i has the larger antecedent
            else if antecedent[j] contains antecedent[i]:
                set flag of rule j to 1    # rule j has the larger antecedent
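To make this concrete, here is a minimal plain-R sketch of that loop; the function name and the data layout (a list of antecedent vectors plus a character vector of consequents) are my own assumptions for illustration, not the original code:

# Plain-R sketch of the quadratic pruning loop (illustrative only).
# `antecedents`: list of character vectors, `consequents`: character vector.
flag_redundant <- function(antecedents, consequents) {
  n <- length(antecedents)
  flags <- rep(0L, n)
  if (n < 2) return(flags)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (consequents[i] == consequents[j]) {
        if (all(antecedents[[j]] %in% antecedents[[i]])) {
          flags[i] <- 1L    # rule i has the larger antecedent
        } else if (all(antecedents[[i]] %in% antecedents[[j]])) {
          flags[j] <- 1L    # rule j has the larger antecedent
        }
      }
    }
  }
  flags
}

The nested loops are the source of the quadratic cost discussed next.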
How many operations do we need to perform with this simple algorithm?
For the first i we need to iterate \(n-1\) times, for the second i \(n-2\) times, for the third \(n-3\) times, and so on, finally reaching \(n-(n-1)=1\). This leads to (a proof can be found here):
\[ \sum_{i=1}^{n-1}{i}= \frac{n(n-1)}{2} \]
So the above has asymptotic complexity of \(O(n^2)\). It means, more or less, that the computation time grows with the square of the size of the data. For a dataset containing around 1,300,000 records this becomes a serious issue: with R I was unable to perform the computation in reasonable time. Since a compiled language handles simple arithmetic operations much better, the second idea was to use Rcpp. Yes, it is faster, to some extent, but with such a large data frame I was still unable to get results in satisfying time. So are there any other options?
Yes, there are. If we take a look at our dataset, we can see that it can be partitioned in such a way that each individual “chunk” consists of records with exactly the same consequent:
{A,B}=>{D}
{A}=>{D}
{C,G}=>{F}
{Y}=>{F}
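In R this partitioning is a one-liner; a minimal sketch, assuming the rules sit in a data frame with (hypothetical) columns named antecedent and consequent:

# Split the rules into chunks sharing the same consequent
# (the data frame and column names are assumed for illustration).
chunks <- split(rules, rules$consequent)
length(chunks)    # one chunk per distinct consequent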
After this division I got about 3,300 chunks, so the average number of observations per chunk was around 400. The next step was to run the algorithm sequentially on each chunk. Since the algorithm has quadratic complexity, it is faster to do it that way than on the whole dataset at once. While R failed again, Rcpp finally returned a result (after about 5 minutes). But there is still room for improvement. Since the chunks can be processed independently, we can perform the computation in parallel using, for example, the foreach package (which I demonstrated in a previous article). While passing R functions to foreach is a simple task, parallelizing Rcpp code takes a little more work. We need to follow the steps below:
- Create a .cpp file which includes all of the functions needed
- Create a package from it using Rcpp. This can be achieved, for example, with:
Rcpp.package.skeleton("nameOfYourPackage",cpp_files = "directory_of_your_cpp_file")
- Install your Rcpp package from source:
install.packages("directory_of_your_rcpp_package", repos=NULL, type="source")
- Load your library:
library(name_of_your_rcpp_package)
Now you can use your Rcpp function in foreach:
results = foreach(k = 1:length(len), .packages = c("name_of_your_package")) %dopar% { your_cpp_function(data[[k]]) }
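Note that %dopar% only runs in parallel if a backend has been registered first; the original post does not show this step, but a typical setup (using doParallel, as an assumption) looks like this:

library(foreach)
library(doParallel)

# Register a parallel backend before calling %dopar%
cl <- makeCluster(max(1, parallel::detectCores() - 1))   # leave one core free
registerDoParallel(cl)

# ... run the foreach() call shown above ...

stopCluster(cl)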
Even with foreach the pure R version kept me waiting forever, but Rcpp returned the results in approximately 2.5 minutes. Not too bad!
- Here is the Rcpp code for the issub function (a rough sketch of such a subset test is also given below)
- Here is the R code that partitions the data and calls the issub function in parallel
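The issub source itself is linked above rather than reproduced here. As a rough idea of what such a subset test can look like when prototyped directly from an R session, here is a hedged sketch using Rcpp::cppFunction; the function name, signature and logic are my own illustration, not the author's code:

library(Rcpp)

# Hypothetical prototype, NOT the author's issub: returns TRUE when every
# element of `small` also occurs in `big`.
cppFunction('
bool is_subset(std::vector<std::string> small, std::vector<std::string> big) {
  for (std::size_t i = 0; i < small.size(); i++) {
    bool found = false;
    for (std::size_t j = 0; j < big.size(); j++) {
      if (small[i] == big[j]) { found = true; break; }
    }
    if (!found) return false;
  }
  return true;
}')

is_subset(c("A"), c("A", "B"))    # TRUE

For the parallel run you still want the packaged version described above, since functions compiled in the current session this way typically cannot be shipped to the foreach workers.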
Here are some conclusions. Firstly, it's worth knowing more languages and tools than just R. Secondly, there is often an escape from the large-dataset trap. There is little chance that somebody will face exactly the same task as in the example above, but a much higher probability that someone will face a similar problem that can be solved in the same way.