10 Must-Know Tidyverse Features!

[This article was first published on business-science.io, and kindly contributed to R-bloggers].

Tidyverse Updates

There is no doubt that the tidyverse, an opinionated collection of R packages, offers attractive, intuitive ways of wrangling data for data science. In earlier versions of the tidyverse, some elements of user control were sacrificed in favor of simpler functions that beginners could pick up and use easily. The 2020 updates to dplyr and tidyr have made progress toward restoring some of that finer control.

This means that there are new methods available in the tidyverse that some users may not be aware of. These methods let you transform your data more directly into the shape you want and perform operations more flexibly. They also provide new, more readable ways to perform common tasks like nesting, modeling and graphing. Often users are only scratching the surface of what can be done with the latest updates to this important set of packages.

It’s incumbent on any analyst to stay up to date with new methods. This post covers ten examples of approaches to common data tasks that are better served by the latest tidyverse updates. We will use the new Palmer Penguins dataset, a great all-round dataset for illustrating data wrangling.

First let’s load our tidyverse packages and the Palmer Penguins dataset and take a quick look at it. Please be sure to install the latest versions of these packages before trying to replicate the work here.

library(tidyverse)
library(palmerpenguins)

penguins <- palmerpenguins::penguins  %>%
           filter(!is.na(bill_length_mm))

penguins
## # A tibble: 342 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Adelie  Torge~           39.1          18.7              181        3750
##  2 Adelie  Torge~           39.5          17.4              186        3800
##  3 Adelie  Torge~           40.3          18                195        3250
##  4 Adelie  Torge~           36.7          19.3              193        3450
##  5 Adelie  Torge~           39.3          20.6              190        3650
##  6 Adelie  Torge~           38.9          17.8              181        3625
##  7 Adelie  Torge~           39.2          19.6              195        4675
##  8 Adelie  Torge~           34.1          18.1              193        3475
##  9 Adelie  Torge~           42            20.2              190        4250
## 10 Adelie  Torge~           37.8          17.1              186        3300
## # ... with 332 more rows, and 2 more variables: sex <fct>, year <int>

The dataset records anatomical measurements of penguins of different species, sexes and islands, along with the year in which each measurement was taken.

1. Selecting columns

tidyselect helper functions are now built in, so you can save time by selecting columns with dplyr::select() based on common conditions. Here, to reduce the dataset to the identifying columns plus the bill measurements, we keep every column without an underscore in its name (all measurement columns contain one) and then add back the columns that start with "bill":

penguins %>% 
  dplyr::select(!contains("_"), starts_with("bill"))
## # A tibble: 342 x 6
##    species island    sex     year bill_length_mm bill_depth_mm
##    <fct>   <fct>     <fct>  <int>          <dbl>         <dbl>
##  1 Adelie  Torgersen male    2007           39.1          18.7
##  2 Adelie  Torgersen female  2007           39.5          17.4
##  3 Adelie  Torgersen female  2007           40.3          18  
##  4 Adelie  Torgersen female  2007           36.7          19.3
##  5 Adelie  Torgersen male    2007           39.3          20.6
##  6 Adelie  Torgersen female  2007           38.9          17.8
##  7 Adelie  Torgersen male    2007           39.2          19.6
##  8 Adelie  Torgersen <NA>    2007           34.1          18.1
##  9 Adelie  Torgersen <NA>    2007           42            20.2
## 10 Adelie  Torgersen <NA>    2007           37.8          17.1
## # ... with 332 more rows

A full set of tidyselect helper functions can be found in the tidyselect documentation.
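To give a flavor of a few other helpers, here is a minimal sketch on a small made-up tibble (the `toy` data and its values are assumptions for illustration, not part of the penguins dataset). It uses `ends_with()`, `matches()` and `where()`, all of which can be used inside `dplyr::select()`:

```r
library(dplyr)

# Toy data with penguin-like column names; the values are made up for illustration
toy <- tibble(
  species        = c("Adelie", "Gentoo"),
  year           = c(2007L, 2008L),
  bill_length_mm = c(39.1, 46.1),
  bill_depth_mm  = c(18.7, 13.2),
  body_mass_g    = c(3750L, 4500L)
)

# ends_with(): columns whose names end in "mm"
mm_cols <- toy %>% select(ends_with("mm"))

# matches(): columns whose names match a regular expression
length_cols <- toy %>% select(matches("length"))

# where(): columns for which a predicate function returns TRUE
numeric_cols <- toy %>% select(where(is.numeric))

names(mm_cols)
names(length_cols)
names(numeric_cols)
```

Helpers can be combined with `!`, `&` and `|`, which is what the `!contains("_")` call in the example above relies on.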

2. Reordering columns

dplyr::relocate() offers a new way to reorder specific columns or sets of columns. For example, if we want to make sure that all of the measurement columns are at the end of the dataset, we can use this, noting that the last column is currently year:

penguins <- penguins %>% 
  dplyr::relocate(contains("_"), .after = year)

penguins
## # A tibble: 342 x 8
##    species island sex    year bill_length_mm bill_depth_mm flipper_length_~
##    <fct>   <fct>  <fct> <int>          <dbl>         <dbl>            <int>
##  1 Adelie  Torge~ male   2007           39.1          18.7              181
##  2 Adelie  Torge~ fema~  2007           39.5          17.4              186
##  3 Adelie  Torge~ fema~  2007           40.3          18                195
##  4 Adelie  Torge~ fema~  2007           36.7          19.3              193
##  5 Adelie  Torge~ male   2007           39.3          20.6              190
##  6 Adelie  Torge~ fema~  2007           38.9          17.8              181
##  7 Adelie  Torge~ male   2007           39.2          19.6              195
##  8 Adelie  Torge~ <NA>   2007           34.1          18.1              193
##  9 Adelie  Torge~ <NA>   2007           42            20.2              190
## 10 Adelie  Torge~ <NA>   2007           37.8          17.1              186
## # ... with 332 more rows, and 1 more variable: body_mass_g <int>

Similar to .after you can also use .before as an argument here.
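As a minimal sketch of the `.before` variant, the following uses a one-row made-up tibble (`toy` is an assumption for illustration). It also shows that when neither `.before` nor `.after` is given, `relocate()` moves the selected columns to the front:

```r
library(dplyr)

# One-row toy tibble; the values are made up for illustration
toy <- tibble(
  species        = "Adelie",
  bill_length_mm = 39.1,
  sex            = "male",
  year           = 2007L
)

# Move sex and year so that they sit just before the measurement column
moved_before <- toy %>% relocate(sex, year, .before = bill_length_mm)

# With no .before/.after, the selected column goes to the front
moved_front <- toy %>% relocate(year)

names(moved_before)
names(moved_front)
```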


3. Controlling mutated column locations

Note that the penguins dataset has no unique identifier for individual penguins. This can be problematic when we have multiple penguins of the same species, island, sex and year in the dataset. To address this and prepare for later examples, let’s number the penguins within each study group using dplyr::mutate(), which also illustrates how mutate() now allows us to position our new column in a similar way to relocate():

penguins_id <- penguins %>% 
  dplyr::group_by(species, island, sex, year) %>% 
  dplyr::mutate(studygroupid = row_number(), .before = contains("_"))

penguins_id
## # A tibble: 342 x 9
## # Groups:   species, island, sex, year [35]
##    species island sex    year studygroupid bill_length_mm bill_depth_mm
##    <fct>   <fct>  <fct> <int>        <int>          <dbl>         <dbl>
##  1 Adelie  Torge~ male   2007            1           39.1          18.7
##  2 Adelie  Torge~ fema~  2007            1           39.5          17.4
##  3 Adelie  Torge~ fema~  2007            2           40.3          18  
##  4 Adelie  Torge~ fema~  2007            3           36.7          19.3
##  5 Adelie  Torge~ male   2007            2           39.3          20.6
##  6 Adelie  Torge~ fema~  2007            4           38.9          17.8
##  7 Adelie  Torge~ male   2007            3           39.2          19.6
##  8 Adelie  Torge~ <NA>   2007            1           34.1          18.1
##  9 Adelie  Torge~ <NA>   2007            2           42            20.2
## 10 Adelie  Torge~ <NA>   2007            3           37.8          17.1
## # ... with 332 more rows, and 2 more variables: flipper_length_mm <int>,
## #   body_mass_g <int>
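The same positioning works with `.after`. Note also that the result above is still grouped (see the Groups line in the output), so it can be worth calling `ungroup()` once the identifier is in place if later steps should not run per group. A minimal sketch on a made-up tibble (`toy` and its values are assumptions for illustration):

```r
library(dplyr)

# Toy data; the values are made up for illustration
toy <- tibble(
  species        = c("Adelie", "Adelie", "Gentoo"),
  year           = c(2007L, 2007L, 2008L),
  bill_length_mm = c(39.1, 39.5, 46.1)
)

toy_id <- toy %>%
  group_by(species, year) %>%
  # number the rows within each group and place the new column right after year
  mutate(studygroupid = row_number(), .after = year) %>%
  # drop the grouping so later operations act on the whole table
  ungroup()

toy_id
```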

4. Transforming from wide to long

The penguins dataset is clearly in a wide form, as each row holds several measurements spread across the columns. For many reasons we may want to transform data from wide to long. In long data, each measurement has its own row. The older function gather() in tidyr was popular for this sort of task, but its successor pivot_longer() is even more powerful. In this case we have different body parts, measures and units inside the column names, but we can break them out very simply like this:

penguins_long <- penguins_id %>% 
  tidyr::pivot_longer(contains("_"), # break out the measurement cols
                      names_to = c("part", "measure", "unit"), # break them into these three columns
                      names_sep = "_") #  use the underscore to separate

penguins_long
## # A tibble: 1,368 x 9
## # Groups:   species, island, sex, year [35]
##    species island    sex     year studygroupid part    measure unit   value
##    <fct>   <fct>     <fct>  <int>        <int> <chr>   <chr>   <chr>  <dbl>
##  1 Adelie  Torgersen male    2007            1 bill    length  mm      39.1
##  2 Adelie  Torgersen male    2007            1 bill    depth   mm      18.7
##  3 Adelie  Torgersen male    2007            1 flipper length  mm     181  
##  4 Adelie  Torgersen male    2007            1 body    mass    g     3750  
##  5 Adelie  Torgersen female  2007            1 bill    length  mm      39.5
##  6 Adelie  Torgersen female  2007            1 bill    depth   mm      17.4
##  7 Adelie  Torgersen female  2007            1 flipper length  mm     186  
##  8 Adelie  Torgersen female  2007            1 body    mass    g     3800  
##  9 Adelie  Torgersen female  2007            2 bill    length  mm      40.3
## 10 Adelie  Torgersen female  2007            2 bill    depth   mm      18  
## # ... with 1,358 more rows

5. Transforming from long to wide

It’s just as easy to move back from long to wide. pivot_wider() gives much more flexibility compared to the older spread():

penguins_wide <- penguins_long %>% 
  tidyr::pivot_wider(names_from = c("part", "measure", "unit"), # pivot these columns
                     values_from = "value", # take the values from here
                     names_sep = "_") # combine col names using an underscore

penguins_wide
## # A tibble: 342 x 9
## # Groups:   species, island, sex, year [35]
##    species island sex    year studygroupid bill_length_mm bill_depth_mm
##    <fct>   <fct>  <fct> <int>        <int>          <dbl>         <dbl>
##  1 Adelie  Torge~ male   2007            1           39.1          18.7
##  2 Adelie  Torge~ fema~  2007            1           39.5          17.4
##  3 Adelie  Torge~ fema~  2007            2           40.3          18  
##  4 Adelie  Torge~ fema~  2007            3           36.7          19.3
##  5 Adelie  Torge~ male   2007            2           39.3          20.6
##  6 Adelie  Torge~ fema~  2007            4           38.9          17.8
##  7 Adelie  Torge~ male   2007            3           39.2          19.6
##  8 Adelie  Torge~ <NA>   2007            1           34.1          18.1
##  9 Adelie  Torge~ <NA>   2007            2           42            20.2
## 10 Adelie  Torge~ <NA>   2007            3           37.8          17.1
## # ... with 332 more rows, and 2 more variables: flipper_length_mm <dbl>,
## #   body_mass_g <dbl>
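One detail is visible in the output above: flipper_length_mm and body_mass_g come back as `<dbl>` rather than `<int>`, because pivot_longer() stacked integer and double columns into a single value column. The following sketch reproduces that round trip on a small made-up tibble (`toy` and its values are assumptions for illustration):

```r
library(dplyr)
library(tidyr)

# Toy data mixing a double column and an integer column; values are made up
toy <- tibble(
  id                = 1:2,
  bill_length_mm    = c(39.1, 46.1),
  flipper_length_mm = c(181L, 211L)
)

# Wide to long: split the column names on the underscore
toy_long <- toy %>%
  pivot_longer(-id,
               names_to = c("part", "measure", "unit"),
               names_sep = "_")

# Long to wide: paste the three name columns back together
toy_wide <- toy_long %>%
  pivot_wider(names_from = c(part, measure, unit),
              values_from = value,
              names_sep = "_")

class(toy$flipper_length_mm)       # type before the round trip
class(toy_wide$flipper_length_mm)  # type after the round trip
```

If the integer type matters downstream, it would need to be restored explicitly after pivoting back, for example with `as.integer()` inside `mutate()`.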

6. Running group statistics across multiple columns

dplyr can now apply multiple summary functions to grouped data using the across() adverb, helping you be more efficient. If we wanted to summarize all bill and flipper measurements in our penguins, we would do this:

penguin_stats <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarize(across(ends_with("mm"), # do this for columns ending in mm
                          list(~mean(.x, na.rm = TRUE), 
                               ~sd(.x, na.rm = TRUE)))) # calculate a mean and sd

penguin_stats
## # A tibble: 3 x 7
##   species bill_length_mm_1 bill_length_mm_2 bill_depth_mm_1 bill_depth_mm_2
##   <fct>              <dbl>            <dbl>           <dbl>           <dbl>
## 1 Adelie              38.8             2.66            18.3           1.22 
## 2 Chinst~             48.8             3.34            18.4           1.14 
## 3 Gentoo              47.5             3.08            15.0           0.981
## # ... with 2 more variables: flipper_length_mm_1 <dbl>,
## #   flipper_length_mm_2 <dbl>


7. Controlling output column names when summarizing columns

The columns in penguin_stats have been given default names which are not that intuitive. If we name our summary functions, we can then use the .names argument to control precisely how we want these columns named. This uses glue notation. For example, here we want to construct the new column names by taking the existing column names, removing any underscores or ‘mm’ metrics, and pasting to the summary function name using an underscore:

penguin_stats <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarize(across(ends_with("mm"), 
                          list(mean = ~mean(.x, na.rm = TRUE), 
                               sd = ~sd(.x, na.rm = TRUE)), # name summary functions
                          .names = "{gsub('_|_mm', '', col)}_{fn}")) # column names structure

penguin_stats
## # A tibble: 3 x 7
##   species billlength_mean billlength_sd billdepth_mean billdepth_sd
##   <fct>             <dbl>         <dbl>          <dbl>        <dbl>
## 1 Adelie             38.8          2.66           18.3        1.22 
## 2 Chinst~            48.8          3.34           18.4        1.14 
## 3 Gentoo             47.5          3.08           15.0        0.981
## # ... with 2 more variables: flipperlength_mean <dbl>, flipperlength_sd <dbl>
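For a simpler view of how the naming template works, here is a minimal sketch on a made-up tibble (`toy` and its values are assumptions for illustration). It assumes a dplyr version whose across() documentation describes the placeholders as `{.col}` and `{.fn}`, and it puts the function name first to show that the order is entirely up to the template:

```r
library(dplyr)

# Toy data; the values are made up for illustration
toy <- tibble(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(39, 41, 46, 48),
  bill_depth_mm  = c(18, 20, 13, 15)
)

toy_stats <- toy %>%
  group_by(species) %>%
  summarize(across(ends_with("mm"),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        max  = ~max(.x, na.rm = TRUE)),
                   # function name first, then the original column name
                   .names = "{.fn}_{.col}"))

names(toy_stats)
```

Because the template is evaluated with glue, any R expression can go inside the braces, which is what makes the gsub() call in the example above possible.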

8. Running models across subsets

The output of summarize() can now be literally anything, because dplyr now allows different column types. We can generate summary vectors, dataframes or other objects like models or graphs.

If we wanted to run a model for each species, we could do it like this:

penguin_models <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarize(model = list(lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm)))  # store models in a list column

penguin_models
## # A tibble: 3 x 2
##   species   model 
##   <fct>     <list>
## 1 Adelie    <lm>  
## 2 Chinstrap <lm>  
## 3 Gentoo    <lm>

It’s not usually that useful to keep model objects in a dataframe, but we could use other tidy-oriented packages to summarize the statistics of the models and return them all as nicely integrated dataframes:

library(broom)

penguin_models <- penguins %>% 
  dplyr::group_by(species) %>% 
  dplyr::summarize(broom::glance(lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm))) # summarize model stats

penguin_models
## # A tibble: 3 x 13
##   species r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
##   <fct>       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
## 1 Adelie      0.508         0.498  325.      50.6 1.55e-22     3 -1086. 2181.
## 2 Chinst~     0.504         0.481  277.      21.7 8.48e-10     3  -477.  964.
## 3 Gentoo      0.625         0.615  313.      66.0 3.39e-25     3  -879. 1768.
## # ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>
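If the models are kept in a list column, as in the first approach in this section, individual statistics can still be pulled out later. The following is a minimal sketch, assuming the purrr package is available and using R's built-in mtcars data rather than the penguins; the names `car_models` and `r_squared` are made up for illustration:

```r
library(dplyr)
library(purrr)

# Fit one linear model per cylinder group and keep the models in a list column
car_models <- mtcars %>%
  group_by(cyl) %>%
  summarize(model = list(lm(mpg ~ wt)))

# Extract one number per model from the list column
car_models <- car_models %>%
  mutate(r_squared = map_dbl(model, ~ summary(.x)$r.squared))

car_models
```

This keeps the full model objects around for later use (predictions, diagnostics) while still exposing summary statistics as ordinary columns.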

9. Nesting data

Often we have to work with subsets, and it can be useful to apply a common function across all subsets of the data. For example, maybe we want to take a look at our different species of penguins and make some different graphs of them. Grouping based on subsets would previously be achieved by the following somewhat awkward combination of tidyverse functions.

penguins %>% 
  dplyr::group_by(species) %>% 
  tidyr::nest() %>% 
  dplyr::rowwise()
## # A tibble: 3 x 2
## # Rowwise:  species
##   species   data              
##   <fct>     <list>            
## 1 Adelie    <tibble [151 x 7]>
## 2 Gentoo    <tibble [123 x 7]>
## 3 Chinstrap <tibble [68 x 7]>


The new function nest_by() provides a more intuitive way to do the same thing:

penguins %>% 
  nest_by(species)
## # A tibble: 3 x 2
## # Rowwise:  species
##   species                 data
##   <fct>     <list<tbl_df[,7]>>
## 1 Adelie             [151 x 7]
## 2 Chinstrap           [68 x 7]
## 3 Gentoo             [123 x 7]

The nested data will be stored in a column called data unless we specify otherwise using a .key argument.
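As a minimal sketch of the `.key` argument, the following nests a made-up tibble (`toy`, `measurements` and `n_penguins` are names chosen for illustration). Because the result of nest_by() is rowwise, the nested tibble for each row can be referred to directly by its column name inside mutate():

```r
library(dplyr)

# Toy data; the values are made up for illustration
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3750, 3800, 4500)
)

# Store the nested tibbles in a column called "measurements" instead of "data"
nested <- toy %>%
  nest_by(species, .key = "measurements")

# Rowwise evaluation: measurements is the tibble for the current row
counts <- nested %>%
  mutate(n_penguins = nrow(measurements))

names(nested)
counts$n_penguins
```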

10. Graphing across subsets

Armed with nest_by() and the ability to summarize or mutate virtually any type of object, we can generate graphs across subsets and store them in a dataframe for later use. Let’s draw scatter plots of bill length against bill depth for our three penguin species:

# generic function for generating a simple scatter plot in ggplot2
scatter_fn <- function(df, col1, col2, title) {
  df %>% 
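    # embrace the column arguments so the bare names passed in are used in aes()
    ggplot2::ggplot(ggplot2::aes(x = {{ col1 }}, y = {{ col2 }})) +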
    ggplot2::geom_point() +
    ggplot2::geom_smooth(method = "loess", formula = "y ~ x") +
    ggplot2::labs(title = title)
}

# run function across species and store plots in a list column
penguin_scatters <- penguins %>% 
  dplyr::nest_by(species) %>% 
  dplyr::mutate(plot = list(scatter_fn(data, bill_length_mm, bill_depth_mm, species))) 

penguin_scatters
## # A tibble: 3 x 3
## # Rowwise:  species
##   species                 data plot  
##   <fct>     <list<tbl_df[,7]>> <list>
## 1 Adelie             [151 x 7] <gg>  
## 2 Chinstrap           [68 x 7] <gg>  
## 3 Gentoo             [123 x 7] <gg>


Now we can easily display the different scatter plots to show, for example, that our penguins exemplify Simpson’s Paradox:

library(patchwork)

# generate scatter for entire dataset
p_all <- scatter_fn(penguins, bill_length_mm, bill_depth_mm, "All Species") 

# get species scatters from penguin_scatters dataframe
for (i in 1:3) {
 assign(paste("p", i, sep = "_"),
        penguin_scatters$plot[[i]])
}

# display nicely using patchwork in R Markdown
p_all /
(p_1 | p_2 | p_3) +
  plot_annotation(caption = "{palmerpenguins} dataset")
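An alternative that avoids creating separate p_1, p_2 and p_3 objects is to hand the list of plots to patchwork::wrap_plots(), which accepts a list of ggplot objects. The following is a minimal sketch using R's built-in mtcars data rather than the penguins; the names `plots` and `combined` are made up for illustration:

```r
library(ggplot2)
library(patchwork)

# One scatter plot per cylinder group, collected in a plain list
plots <- lapply(split(mtcars, mtcars$cyl), function(df) {
  ggplot(df, aes(x = wt, y = mpg)) +
    geom_point() +
    labs(title = paste(df$cyl[1], "cylinders"))
})

# Scatter plot for the entire dataset
p_all <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "All cars")

# Overall plot on top, the per-group plots side by side underneath
combined <- p_all / wrap_plots(plots, nrow = 1)

combined
```

With the penguins example, the list column `penguin_scatters$plot` could be passed to wrap_plots() in the same way.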


Author: Jim Gruman, Data Analytics Leader

Serving enterprise needs with innovators in mobile power, decision intelligence, and product management, Jim can be found at https://jimgruman.netlify.app.

