Vectorisation is your best friend: replacing many elements in a character vector
As with any programming language, R allows you to tackle the same problem in many different ways or styles. These styles differ in the amount of code, readability, and speed. In this post I want to illustrate this by tackling the following problem. We have a data.frame that contains an ID character column:
n = 9e6
df = data.frame(values = rnorm(n),
                ID = rep(LETTERS[1:3], each = n/3),
                stringsAsFactors = FALSE)
> head(df)
      values ID
1 -0.7355823  A
2 -0.4729925  A
3 -0.7417259  A
4  1.7633367  A
5 -0.3006790  A
6  0.6785947  A
We want to replace all occurrences of A by 'Text for A', and the same for B and C. One approach is to use a combination of a for-loop and some if statements, in a style that looks more like C:
translator_if_for = function(input_vector) {
  output_vector = input_vector
  for(index in seq_along(input_vector)) {
    if(input_vector[index] == 'A') {
      output_vector[index] = 'Text for A'
    } else if(input_vector[index] == 'B') {
      output_vector[index] = 'Text for B'
    } else if(input_vector[index] == 'C') {
      output_vector[index] = 'Text for C'
    }
  }
  return(output_vector)
}
dum_if_for = translator_if_for(df$ID)
This kind of imperative programming style is not typically R-like. The first response of an R-aficionado is to suggest using an apply loop. First we construct a helper function:
translator_function = function(element) {
  switch(element,
         A = 'Text for A',
         B = 'Text for B',
         C = 'Text for C')
}
which uses switch instead of the chain of if/else statements. Next we use sapply to call the helper function on each of the elements in df$ID:
dum_switch_sapply = sapply(df$ID, translator_function)
The advantage here is that we use roughly half the amount of code to express the same functionality, and I find the code more readable (seeing its purpose at a glance). Readability, however, is in the eye of the beholder, and some people used to non-functional programming languages might prefer the more explicit for-loop and if statements.
Of course, R also supports vectorisation, which is of particular interest if you care about performance. For a vectorised solution, we first create a lookup vector:
translator_vector = c(A = 'Text for A', B = 'Text for B', C = 'Text for C')
and subset this vector using df$ID:
dum_vectorized = translator_vector[df$ID]
I encourage you to spend a little time figuring out what this subsetting trick does, as I think it is quite a nice trick. The code of this final solution is even shorter, although it does take some careful consideration on the part of the reader to understand what is happening. Careful naming of variables, or encapsulation in a function can solve this issue.
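To see what the subsetting trick does, here is a minimal sketch on a tiny input. The vector `ids` is made up for illustration and stands in for df$ID. Indexing a named vector with a character vector looks up each element by name, so a repeated key simply returns the same value again.

```r
# Lookup vector: the names are the keys, the values are the replacements
translator_vector = c(A = 'Text for A', B = 'Text for B', C = 'Text for C')

# Hypothetical small input, standing in for df$ID
ids = c('B', 'A', 'C', 'A')

# Character subsetting matches every element of ids against the names
result = translator_vector[ids]
print(result)

# The result carries the keys as names; unname() drops them if unwanted
plain = unname(result)
print(plain)
```

Note that an ID without an entry in the lookup vector would come back as NA, whereas the for-loop version leaves such an element unchanged.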
All three solutions yield the same result:
all.equal(dum_if_for, dum_switch_sapply, check.attributes = FALSE) # TRUE
all.equal(dum_vectorized, dum_switch_sapply, check.attributes = FALSE) # TRUE
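The check.attributes = FALSE argument matters because the results differ in their names attribute, not in their values. The sketch below illustrates this on a tiny made-up input (`ids`, `by_loop_style`, `by_sapply` and `by_lookup` are hypothetical names, not part of the code above): sapply and named-vector subsetting both attach the input keys as names, while a plain character vector, like the one the for-loop returns, has none.

```r
# Hypothetical small input, standing in for df$ID
ids = c('A', 'B', 'C')
lookup = c(A = 'Text for A', B = 'Text for B', C = 'Text for C')

# Unnamed result, as the for-loop solution would return it
by_loop_style = c('Text for A', 'Text for B', 'Text for C')

# sapply over a character vector uses the input as names by default
by_sapply = sapply(ids, function(element) {
  switch(element, A = 'Text for A', B = 'Text for B', C = 'Text for C')
})

# Subsetting a named vector by name also keeps the names
by_lookup = lookup[ids]

print(names(by_sapply))
print(names(by_lookup))

# With attributes checked, the names mismatch is reported as a difference
print(isTRUE(all.equal(by_loop_style, by_sapply)))
# Ignoring attributes, the values themselves agree
print(isTRUE(all.equal(by_loop_style, by_sapply, check.attributes = FALSE)))
```

If the names are not wanted, sapply(..., USE.NAMES = FALSE) or unname() removes them.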
But how long do they take? To find out, we benchmark the three solutions:
library(rbenchmark)
res = benchmark(if_for_solution = translator_if_for(df$ID),
                function_solution = sapply(df$ID, translator_function),
                vector_solution = translator_vector[df$ID],
                replications = 10)
res
               test replications elapsed relative user.self sys.self
2 function_solution           10 281.326   79.158   276.235    5.193
1   if_for_solution           10 254.751   71.680   253.358    1.484
3   vector_solution           10   3.554    1.000     3.052    0.504
The benchmark clearly shows that the performance of the vectorised solution is vastly superior to the other two: it is on the order of 70 to 80 times faster. In addition, the sapply-based solution is no faster than the for-loop-based solution; in this benchmark it is actually about a factor 1.10 slower. The take-home message: apply-loops are not inherently faster, and vectorisation is your friend!
PS: In this case, making the character vector a factor, and simply replacing the levels, is probably much faster still than using a vectorised substitution. However, the point of the post was to compare different coding styles, and this problem was just a convenient example.