Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
In data analysis and data science, it’s common to work with large datasets that require some form of manipulation to be useful. In this small article, we’ll explore how to create and modify columns in a dataframe using modern R tools from the tidyverse package. We can do that on several ways, so we are going from basic to advanced level.
Let’s use the starwars
dataset for that purpose:
data("starwars") head(starwars, 4) # # A tibble: 4 × 8 # name height mass hair_color skin_color eye_color birth_year sex # <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> # 1 Luke Skywalker 172 77 blond fair blue 19 male # 2 C-3PO 167 75 NA gold yellow 112 none # 3 R2-D2 96 32 NA white, blue red 33 none # 4 Darth Vader 202 136 none white yellow 41.9 male
The most basic example is using mutate
to create and modify variables.
starwars %>% mutate( height = height * 2, new_numeric_column = row_number(), new_char_column = "This variable is new" ) %>% select(name, height, new_numeric_column, new_char_column) %>% head(4) # # A tibble: 4 × 4 # name height new_numeric_column new_char_column # <chr> <dbl> <dbl> <chr> # 1 Luke Skywalker 344 1 This variable is new # 2 C-3PO 334 2 This variable is new # 3 R2-D2 192 3 This variable is new # 4 Darth Vader 404 4 This variable is new
Note how the second variable we created is recycled to match the length of the dataset. But you already knew that, right?
Common tricks
A common trick is making use of if_else
to conditionally modify some variables. I use this structure on a daily basis.
starwars %>% mutate(height = if_else(hair_color == "blond", height * 10, height)) %>% select(name, height, mass, hair_color) %>% head(4) # # A tibble: 4 × 8 # name height mass hair_color # <chr> <int> <dbl> <chr> # 1 Luke Skywalker 344 77 blond # 2 C-3PO 167 75 NA # 3 R2-D2 96 32 NA # 4 Darth Vader 202 136 none
Another common use case is to rely on the case_when
function to modify the variable based on several conditions:
starwars %>% mutate(mod_height = case_when(eye_color == "blue" ~ height/2, eye_color == "red" ~ height*2, eye_color == "yellow" ~ height*10, TRUE ~ height)) %>% select(name, height, eye_color, mod_height) %>% head(4) # name height eye_color mod_height # <chr> <int> <chr> <dbl> # 1 Luke Skywalker 172 blue 86 # 2 C-3PO 167 yellow 1670 # 3 R2-D2 96 red 192 # 4 Darth Vader 202 yellow 2020
Note that we should end the case_when
with an option that always yields TRUE
because the conditions are evaluated in order. If our data doesn’t meet any condition we are leaving the column as is.
All these are fairly basic examples. Let’s go with the dplyr
advanced way of creating and modifying variables.
The Advanced Way: Using across()
In modern R, we can simultaneously modify several columns at once using the verb across
. We need to pass the transformation we will be performing on those variables as well. For that, we are using a lambda function which basically means that we are creating the function on-the-fly but we are not storing it.
starwars %>% mutate(across(c(height, mass, birth_year), ~ .x*10)) %>% select(name, height, mass, birth_year) %>% head(4) # # A tibble: 4 × 4 # name height mass birth_year # <chr> <dbl> <dbl> <dbl> # 1 Luke Skywalker 1720 770 190 # 2 C-3PO 1670 750 1120 # 3 R2-D2 960 320 330 # 4 Darth Vader 2020 1360 419
That’s quite nice but sometimes you don’t want to modify the existing columns but creating new ones.
This is an important use case: batch-creating several columns at once based on the existing ones. I already discussed how to do it in How to create multiple lags like a Pro. We can use the .names
argument to dynamically specify the new column names, like this:
starwars %>% mutate(across(c(height, mass, birth_year), ~ .x * 100, .names = "{.col}_x100")) %>% select(name, height, mass, birth_year, height_x100, mass_x100, birth_year_x100) %>% head(4) # # A tibble: 4 × 7 # name height mass birth_year height_x100 mass_x100 birth_year_x100 # <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> # 1 Luke Skywalker 172 77 19 17200 7700 1900 # 2 C-3PO 167 75 112 16700 7500 11200 # 3 R2-D2 96 32 33 9600 3200 3300 # 4 Darth Vader 202 136 41.9 20200 13600 4190
Awesome, right? However, I still had to type them all manually. There is a better way.
The Pro Way: Using across() + tidyselectors
What if we want to modify a lot of columns? There must be a better way to avoid having to type them all…
Sure there is! tidyselectors
to the rescue! Those are a family of functions that allow us to dynamically select several columns based on a condition. Let’s see that with an example.
Let’s say we want to modify only the numerical variables. We can do that easily with the help of where
function and the neat part is this family of functions works with several verbs of the Tidyverse. For instance, they work with select
:
starwars %>% select(name, where(is.numeric)) %>% head(4) # # A tibble: 4 × 4 # name height mass birth_year # <chr> <int> <dbl> <dbl> # 1 Luke Skywalker 172 77 19 # 2 C-3PO 167 75 112 # 3 R2-D2 96 32 33 # 4 Darth Vader 202 136 41.9
But also with mutate
! So combining across
with where
we can apply the function only over the desired columns (without having to type them!)
starwars %>% mutate(across(where(is.numeric), ~ .x * 100)) %>% select(name, where(is.numeric)) %>% head(4) # # A tibble: 4 × 4 # name height mass birth_year # <chr> <dbl> <dbl> <dbl> # 1 Luke Skywalker 17200 7700 1900 # 2 C-3PO 16700 7500 11200 # 3 R2-D2 9600 3200 3300 # 4 Darth Vader 20200 13600 4190
Note how the name
feature hasn’t been modified, as it is not a numeric variable. This is a really handy trick specially when you are working with big datasets and need to perform an operation on many columns at once.
Also, it is worth noting that we can pass any function to across
to modify the selected columns. We don’t necessarily have to define the operation with a lambda function, but any existing function can be used.
Here is another powerful example working with character columns. We can apply an existing function to make all of them uppercase:
starwars %>% mutate(across(where(is.character), toupper)) %>% select(where(is.character)) %>% head(4) # # A tibble: 4 × 8 # name hair_color skin_color eye_color sex gender homeworld species # <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> # 1 LUKE SKYWALKER BLOND FAIR BLUE MALE MASCULINE TATOOINE HUMAN # 2 C-3PO NA GOLD YELLOW NONE MASCULINE TATOOINE DROID # 3 R2-D2 NA WHITE, BLUE RED NONE MASCULINE NABOO DROID # 4 DARTH VADER NONE WHITE YELLOW MALE MASCULINE TATOOINE HUMAN
Also, you don’t have to rely only on the where
tidyselector, you can use many others like contains
, matches
, starts_with
or ends_with
.
Here’s another example example using ends_with
:
starwars %>% mutate(across(ends_with("_color"), ~ paste0("the color is ", .x))) %>% select(name, ends_with("_color")) %>% head(4) # # A tibble: 4 × 4 # name hair_color skin_color eye_color # <chr> <chr> <chr> <chr> # 1 Luke Skywalker the color is blond the color is fair the color is blue # 2 C-3PO the color is NA the color is gold the color is yellow # 3 R2-D2 the color is NA the color is white, blue the color is red # 4 Darth Vader the color is none the color is white the color is yellow
Handy stuff, right? There is so much more possibilities to discover. You can read more about it on the across reference.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.