Why Vectorize?
In a previous post (https://statcompute.wordpress.com/2018/09/15/how-to-avoid-for-loop-in-r), I briefly introduced the idea of vectorization and its potential use cases. One might wonder why we even need the Vectorize() function, given that it is just a wrapper, and whether there is any material efficiency gain from vectorizing a function. It is true that the Vectorize() function cannot improve the efficiency of the function it wraps. However, vectorization changes the input format of a function that would otherwise consume only scalar inputs, and that change can improve the processing efficiency. An example is given below to demonstrate the value of vectorization.
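Before turning to the lookup example, here is a minimal sketch of what Vectorize() does. The names scalar_pow and vec_pow are made up for this illustration; only the Vectorize() behavior itself is from base R.

# A toy function written for scalar inputs only: if() expects a length-1 condition
scalar_pow <- function(x, p) if (x < 0) NA else x ** p
# Vectorize() wraps it with mapply() so that the "x" argument now accepts a vector
vec_pow <- Vectorize(scalar_pow, "x")
scalar_pow(2, 3)
# [1] 8
vec_pow(c(2, 3, -1), 3)
# [1]  8 27 NA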
When we want to locate the index of a value within a long vector with millions of elements, the which() function should be the fastest, e.g. "which((0:100) == 10)". When we want to locate the indices of several values within the vector, the match() function might be the most intuitive, e.g. "match(c(10, 12), 0:100)". If we would like to take advantage of the speed offered by the which() function, then we might consider one of the following:
A. Using the “%in%” operator within the which() function, such as "which(0:100 %in% c(10, 12))", where “%in%” is the shorthand for the match() function.
B. Parsing out each lookup value and then connecting them with “|” operators, such as "which(eval(parse(text = paste('0:100 == ', c(10, 12), collapse = '|'))))".
Besides the two above, we can also leverage the idea of MapReduce discussed in https://statcompute.wordpress.com/2018/09/08/playing-map-and-reduce-in-r-subsetting, such as "Reduce(c, Map(function(x) which((0:100) == x), c(10, 12)))".
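As a quick sanity check, not part of the original benchmark, the snippet below verifies that approaches A and B and the MapReduce version all return the same indices:

# Approach A: %in% inside which()
a <- which(0:100 %in% c(10, 12))
# Approach B: parsed "|" conditions inside which()
b <- which(eval(parse(text = paste('0:100 == ', c(10, 12), collapse = '|'))))
# MapReduce: Map() over each lookup value, then Reduce() with c()
m <- Reduce(c, Map(function(x) which((0:100) == x), c(10, 12)))
identical(a, b) & identical(b, m)
# [1] TRUE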
However, since the Vectorize() function is able to change the input format from a scalar to a vector, we can also consider vectorizing the which() function so that it consumes the vector directly, such as "(Vectorize(function(s, l) which(l == s), 's'))(c(10, 12), 0:100)". In this newly defined function, there are two parameters, namely “s” and “l”, of which “s” is the input that changes from a scalar to a vector after the vectorization.
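Using the same definition, a single call now accepts the whole lookup vector and returns both positions at once (11 and 13, since the vector 0:100 starts at zero):

vec_which <- Vectorize(function(s, l) which(l == s), 's')
vec_which(c(10, 12), 0:100)
# [1] 11 13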
With all ideas on the table, a benchmark comparison is presented below to show how quickly each of the above-mentioned approaches looks up 5 values in a vector with a million elements. Additionally, since it is straightforward to extend the idea of parallelism to MapReduce and vectorization, we will add two parallel solutions to the benchmark: the parallel::pvec() function, which executes the vectorization in parallel, and the parallel::mcMap() function, which is the parallelized counterpart of the Map() function.
tbl <- 0:1000000
lst <- 10 ** (1:5)

str_which <- function(s, l) which(eval(parse(text = paste('l == ', s, collapse = '|'))))
map_which <- function(s, l) Reduce(c, Map(function(x) which(l == x), s))
vec_which <- Vectorize(function(s, l) which(l == s), 's')
mcmap_which <- function(s, l) Reduce(c, parallel::mcMap(function(x) which(l == x), s, mc.cores = length(s)))
mcvec_which <- function(s, l) parallel::pvec(s, mc.cores = length(s), function(x) which(l == x))

rbenchmark::benchmark(replications = 1000, order = "user.self", relative = "user.self",
  columns = c("test", "relative", "elapsed", "user.self", "sys.self", "user.child", "sys.child"),
  match = {match(lst, tbl)},
  which = {which(tbl %in% lst)},
  str_which = {str_which(lst, tbl)},
  vec_which = {vec_which(lst, tbl)},
  map_which = {map_which(lst, tbl)},
  mcvec_which = {mcvec_which(lst, tbl)},
  mcmap_which = {mcmap_which(lst, tbl)}
)
#        test relative elapsed user.self sys.self user.child sys.child
# mcvec_which    1.000  25.296     1.722   12.477     33.191    30.004
# mcmap_which    1.014  25.501     1.746   12.424     34.231    30.228
#   map_which    9.642  18.240    16.604    1.635      0.000     0.000
#   vec_which    9.777  18.413    16.836    1.576      0.000     0.000
#       which   12.130  22.060    20.888    1.171      0.000     0.000
#   str_which   13.467  25.355    23.191    2.164      0.000     0.000
#       match   36.659  64.114    63.126    0.986      0.000     0.000
Not surprisingly, both parallel solutions are at least 10 times faster than any single-core solution in terms of user CPU time. It is also intriguing to see that the vectorization is as efficient as the MapReduce approach, whether on a single core or on multiple cores, and is significantly faster than the three approaches shown earlier, and that the match() function, albeit the simplest, is the slowest, which in turn justifies the effort of vectorizing the which() function.