Sometimes R is a bit too intuitive, and the other day I wondered what was wrong with my code. The problem was vectorized functions within a mutate statement. I usually use the paste function and the ifelse function within mutate, so the vectorization is automatic. However, for a specific task at work, I was working with a non-vectorized function, and it took me a little while to figure out what was wrong.
So I decided to write a short post as a reminder for myself of how vectorized functions work in mutate.
Let’s start with some sample data.
library(dplyr)  # attaches the %>% pipe used in the examples below

sample_df <- dplyr::tibble(
  list_col = list(c("a", "b", "c"), c("a", "b"), "c", c("e", "f")),
  d = c(1, 2, 3, 4)
)
sample_df
## # A tibble: 4 × 2
##   list_col      d
##   <list>    <dbl>
## 1 <chr [3]>     1
## 2 <chr [2]>     2
## 3 <chr [1]>     3
## 4 <chr [2]>     4
In the data frame above we have two columns: a list column holding character vectors and a numeric (double) column. Now we want to get the length of the vector in each row and store it in a new column. Naively, I tried something like this…
sample_df %>%
  dplyr::mutate(
    length_vec = length(list_col)
  )
## # A tibble: 4 × 3
##   list_col      d length_vec
##   <list>    <dbl>      <int>
## 1 <chr [3]>     1          4
## 2 <chr [2]>     2          4
## 3 <chr [1]>     3          4
## 4 <chr [2]>     4          4
For my task at work, I was working with JSON data, but the example above demonstrates the problem I had. Instead of getting the length of each individual vector in the list_col rows, I was getting the length of the list_col list itself, which equals the number of rows of the data frame. Now if I do …
length(sample_df$list_col)
## [1] 4
… I get a scalar (a vector of length 1) back. R then recycles that single value and fills the length_vec column with all 4s.
To illustrate this behavior, we can create a data frame like this:
data.frame(
  a = 1,
  b = 1:2,
  c = 1:5,
  d = letters[1:10]
)
##    a b c d
## 1  1 1 1 a
## 2  1 2 2 b
## 3  1 1 3 c
## 4  1 2 4 d
## 5  1 1 5 e
## 6  1 2 1 f
## 7  1 1 2 g
## 8  1 2 3 h
## 9  1 1 4 i
## 10 1 2 5 j

dplyr::tibble(
  a = "letter:",
  d = letters[1:10]
)
## # A tibble: 10 × 2
##    a       d
##    <chr>   <chr>
##  1 letter: a
##  2 letter: b
##  3 letter: c
##  4 letter: d
##  5 letter: e
##  6 letter: f
##  7 letter: g
##  8 letter: h
##  9 letter: i
## 10 letter: j
With a tibble, the first construction above fails with an error, because tibble only recycles values of size one. With data.frame, shorter vectors are recycled only if they fit a whole number of times into the longest column; otherwise data.frame raises an error as well.
That’s what basically happened to me.
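The recycling rule for data.frame can be checked with a small base R sketch. The object names ok and msg are only illustrative and not part of the original example.

```r
# 10 is a multiple of 1, 2 and 5, so every column can be recycled to 10 rows
ok <- data.frame(a = 1, b = 1:2, c = 1:5, d = letters[1:10])
nrow(ok)
## [1] 10

# 3 does not divide 10, so data.frame() refuses to recycle and throws an error.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(
  data.frame(a = 1:3, d = letters[1:10]),
  error = function(e) conditionMessage(e)
)
msg
```

If the second call had succeeded, msg would hold a data frame; here it should hold the error message as a character string.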
Fixing Vectorization with purrr::map
To fix the issue, we can use purrr::map_int inside mutate to apply length to each element of the list column, which returns the length of each vector.
sample_df %>%
  dplyr::mutate(
    length_vec = purrr::map_int(list_col, ~ length(.))
  )
## # A tibble: 4 × 3
##   list_col      d length_vec
##   <list>    <dbl>      <int>
## 1 <chr [3]>     1          3
## 2 <chr [2]>     2          2
## 3 <chr [1]>     3          1
## 4 <chr [2]>     4          2
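As an aside, base R already has a vectorized counterpart for this particular case. The sketch below is not from the original post; it rebuilds the same list as a plain object and compares length, lengths, and vapply.

```r
# Same list as the list_col column, built with base R only
list_col <- list(c("a", "b", "c"), c("a", "b"), "c", c("e", "f"))

# length() treats the list as one object with four elements
length(list_col)
## [1] 4

# lengths() returns the length of each element as an integer vector
lengths(list_col)
## [1] 3 2 1 2

# vapply() is a base R counterpart of purrr::map_int()
vapply(list_col, length, integer(1))
## [1] 3 2 1 2
```

Inside mutate, length_vec = lengths(list_col) should therefore give the same column as the purrr::map_int call.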
To illustrate the problem more, consider the code below.
- For the first function, we use a for loop to vectorize the vec_fn_above_below function.
- The second function is vectorized by using the Vectorize function in R.
- In the mutate call, for cat_3, we use ifelse, which is vectorized by default in R.
- For cat_4, we vectorize the function by using purrr::map_chr.
vec_fn_above_below <- function(column_name) {
  res <- base::vector(mode = 'character', length = length(column_name))
  for (i in seq_along(column_name)) {
    if (column_name[i] >= 0) {
      res[i] <- "above"
    } else {
      res[i] <- "below"
    }
  }
  return(res)
}

fn_above_below <- function(column_name) {
  if (column_name >= 0) {
    res <- "above"
  } else {
    res <- "below"
  }
  return(res)
}
fn_above_below <- base::Vectorize(fn_above_below)

df <- dplyr::tibble(
  numbers = sample(-10:10, size = 10)
)

df %>%
  dplyr::mutate(
    cat = vec_fn_above_below(numbers),
    cat_2 = fn_above_below(numbers),
    cat_3 = ifelse(numbers >= 0, "above", "below"),
    cat_4 = purrr::map_chr(
      numbers,
      function(x) {
        if (x >= 0) {
          res <- "above"
        } else {
          res <- "below"
        }
        return(res)
      }
    ),
    cat_5 = sum(c(identical(cat, cat_2), identical(cat_2, cat_3), identical(cat_3, cat_4))) == 3
  )
## # A tibble: 10 × 6
##    numbers cat   cat_2 cat_3 cat_4 cat_5
##      <int> <chr> <chr> <chr> <chr> <lgl>
##  1      -8 below below below below TRUE
##  2       0 above above above above TRUE
##  3       9 above above above above TRUE
##  4       3 above above above above TRUE
##  5      -2 below below below below TRUE
##  6       7 above above above above TRUE
##  7      -9 below below below below TRUE
##  8     -10 below below below below TRUE
##  9      -7 below below below below TRUE
## 10      -3 below below below below TRUE
All four approaches give the same results, as the cat_5 column confirms.
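To see why the plain if/else function needs one of these wrappers in the first place, here is a minimal base R sketch. The names fn_scalar and numbers are hypothetical stand-ins for the unwrapped function and a column of values.

```r
# Non-vectorized version: if() expects a condition of length one
fn_scalar <- function(x) {
  if (x >= 0) "above" else "below"
}

numbers <- c(-8, 0, 9)

# Passing the whole vector gives if() a condition of length 3.
# R >= 4.2.0 raises an error; older versions warn and use only the first element.
res <- tryCatch(
  fn_scalar(numbers),
  error = function(e) "error",
  warning = function(w) "warning"
)
res

# Applying the function element by element gives one label per value
vapply(numbers, fn_scalar, character(1))
## [1] "below" "above" "above"
```

Which of the two outcomes res records depends on the installed R version, so no output is shown for it.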