Site icon R-bloggers

A very short introduction to Tidyverse

[This article was first published on R on Dominic Royé, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

1 Tidyverse

The tidyverse universe of packages, a collection of packages specially focused on data science, marked a milestone in R programming. In this post I am going to summarize very briefly the most essential to start in this world. The tidyverse grammar follows a common structure in all functions. The most essential thing is that the first argument is the object and then come the rest of the arguments. In addition, a set of verbs is provided to facilitate the use of the functions. The tidyverse philosophy and grammar of functions are also reflected in other packages that make its use compatible with the collection. For example, the sf package (simple feature) is a standardized way to encode spatial vector data and allows the use of multiple functions that we can find in the dplyr package.

The core of the tidyverse collection is made up of the following packages:

Package Description
ggplot2 Grammar for creating graphics
purrr R functional programming
tibble Modern and effective table system
dplyr Grammar for data manipulation
tidyr Set of functions to create tidy data
stringr Function set to work with characters
readr An easy and fast way to import data
forcats Tools to easily work with factors

In addition to the mentioned packages, lubridate is also used very frequently to work with dates and times, and also readxl which allows us to import files in Excel format. To know all the available packages we can use the function tidyverse_packages().

##  [1] "broom"      "cli"        "crayon"     "dbplyr"     "dplyr"     
##  [6] "forcats"    "ggplot2"    "haven"      "hms"        "httr"      
## [11] "jsonlite"   "lubridate"  "magrittr"   "modelr"     "pillar"    
## [16] "purrr"      "readr"      "readxl"     "reprex"     "rlang"     
## [21] "rstudioapi" "rvest"      "stringr"    "tibble"     "tidyr"     
## [26] "xml2"       "tidyverse"

It is very easy to get conflicts between functions, that is, that the same function name exists in several packages. To avoid this, we can write the name of the package in front of the function we want to use, separated by the colon symbol written twice (package_name::function_name).

Before I get started with the packages, I hope it will be a really short introduction, some comments on the style when programming in R.

2 Style guide

In R there is no universal style guide, that is, in the R syntax it is not necessary to follow specific rules for our scripts. But it is recommended to work in a homogeneous, uniform, legible and clear way when writing scripts. The tidyverse collection has its own guide (https://style.tidyverse.org/).

The most important recommendations are:

  • Avoid using more than 80 characters per line to allow reading the complete code.
  • Always use a space after a comma, never before.
  • The operators (==, +, -, <-,%>%, etc.) must have a space before and after.
  • There is no space between the name of a function and the first parenthesis, nor between the last argument and the final parenthesis of a function.
  • Avoid reusing names of functions and common variables (c <- 5 vs. c())
  • Sort the script separating the parts with the comment form # Import data -----
  • Avoid accent marks or special symbols in names, files, routes, etc.
  • Object names must follow a constant structure: day_one, day_1.

It is advisable to use a correct indentation for multiple arguments of a function or functions chained by the pipe operator (%>%).

3 Pipe %>%

To facilitate working in data management, manipulation and visualization, the magrittr package introduces the famous pipe operator in the form %>% with the aim of combining various functions without the need to assign the result to a new object. The pipe operator passes the output of a function applied to the first argument of the next function. This way of combining functions allows you to chain several steps simultaneously, to perform sequential tasks. In the very simple example below, we pass the vector 1:5 to the mean() function to calculate the average. You should know that there are a couple of other pipe operators in the same package.

1:5 %>% mean()
## [1] 3

4 Tidyverse packages

4.1 Read and write data

The readr package makes it easy to read or write multiple file formats using functions that start with read_* or write_*. In comparison to R Base, readr functions are faster; they handle problematic column names, and dates are automatically converted. The imported tables are of class tibble (tbl_df), a modern version of data.frame from the tibble package. In the same sense, you can use the read_excel() function of the readxl package to import data from Excel sheets (more details also in this blog post). In the following example, we import the mobility data registered by Google (link) during the last months of the COVID-19 pandemic (download).

Function Description
read_csv() o read_csv2() coma or semicolon (CSV)
read_delim() general separator
read_table() whitespace-separated
# load package
library(tidyverse)

google_mobility <- read_csv("Global_Mobility_Report.csv")
## Parsed with column specification:
## cols(
##   country_region_code = col_character(),
##   country_region = col_character(),
##   sub_region_1 = col_character(),
##   sub_region_2 = col_logical(),
##   iso_3166_2_code = col_character(),
##   census_fips_code = col_logical(),
##   date = col_date(format = ""),
##   retail_and_recreation_percent_change_from_baseline = col_double(),
##   grocery_and_pharmacy_percent_change_from_baseline = col_double(),
##   parks_percent_change_from_baseline = col_double(),
##   transit_stations_percent_change_from_baseline = col_double(),
##   workplaces_percent_change_from_baseline = col_double(),
##   residential_percent_change_from_baseline = col_double()
## )
## Warning: 597554 parsing failures.
##    row              col           expected         actual                         file
## 200119 sub_region_2     1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv'
## 200119 census_fips_code 1/0/T/F/TRUE/FALSE 01001          'Global_Mobility_Report.csv'
## 200120 sub_region_2     1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv'
## 200120 census_fips_code 1/0/T/F/TRUE/FALSE 01001          'Global_Mobility_Report.csv'
## 200121 sub_region_2     1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv'
## ...... ................ .................. .............. ............................
## See problems(...) for more details.
google_mobility
## # A tibble: 516,697 x 13
##    country_region_~ country_region sub_region_1 sub_region_2 iso_3166_2_code
##    <chr>            <chr>          <chr>        <lgl>        <chr>          
##  1 AE               United Arab E~ <NA>         NA           <NA>           
##  2 AE               United Arab E~ <NA>         NA           <NA>           
##  3 AE               United Arab E~ <NA>         NA           <NA>           
##  4 AE               United Arab E~ <NA>         NA           <NA>           
##  5 AE               United Arab E~ <NA>         NA           <NA>           
##  6 AE               United Arab E~ <NA>         NA           <NA>           
##  7 AE               United Arab E~ <NA>         NA           <NA>           
##  8 AE               United Arab E~ <NA>         NA           <NA>           
##  9 AE               United Arab E~ <NA>         NA           <NA>           
## 10 AE               United Arab E~ <NA>         NA           <NA>           
## # ... with 516,687 more rows, and 8 more variables: census_fips_code <lgl>,
## #   date <date>, retail_and_recreation_percent_change_from_baseline <dbl>,
## #   grocery_and_pharmacy_percent_change_from_baseline <dbl>,
## #   parks_percent_change_from_baseline <dbl>,
## #   transit_stations_percent_change_from_baseline <dbl>,
## #   workplaces_percent_change_from_baseline <dbl>,
## #   residential_percent_change_from_baseline <dbl>

Important is to take a look at the argument names, since they change in the readr functions. For example, the well-known header = TRUE argument of read.csv() is in this case col_names = TRUE. More details can be found in the Cheat-Sheet of readr.

4.2 Character manipulations

For working with strings we use the stringr package, whose functions always start with str_* followed by a verb and the first argument.

Some of these functions are as follows:

Function Description
str_replace() replace patterns
str_c() combine characters
str_detect() detect patterns
str_extract() extract patterns
str_sub() extract by position
str_length() length of string

Regular expressions are often used for character patterns. For example, the regular expression [aeiou] matches any single character that is a vowel. The use of square brackets [] corresponds to character classes. For example, [abc] corresponds to each letter regardless of its position. [a-z], [A-Z] or [0-9] each between a and z or 0 and 9. And finally, [:punct:] punctuation, etc. With curly braces “{}” we can indicate the number of the previous element, {2} would be twice, {1,2} between one and two, etc. Also with $ or ^ we can indicate if the pattern starts at the beginning or ends at the end. More details and patterns can be found in the Cheat-Sheet of stringr.

# replace 'er' at the end with empty space

str_replace(month.name, "er$", "")
##  [1] "January"  "February" "March"    "April"    "May"      "June"    
##  [7] "July"     "August"   "Septemb"  "Octob"    "Novemb"   "Decemb"
str_replace(month.name, "^Ma", "")
##  [1] "January"   "February"  "rch"       "April"     "y"         "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"
# combine characters

a <- str_c(month.name, 1:12, sep = "_")
a
##  [1] "January_1"   "February_2"  "March_3"     "April_4"     "May_5"      
##  [6] "June_6"      "July_7"      "August_8"    "September_9" "October_10" 
## [11] "November_11" "December_12"
# collapse combination

str_c(month.name, collapse = ", ")
## [1] "January, February, March, April, May, June, July, August, September, October, November, December"
# detect patterns

str_detect(a, "_[1-5]{1}")
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
# extract patterns

str_extract(a, "_[1-9]{1,2}")
##  [1] "_1"  "_2"  "_3"  "_4"  "_5"  "_6"  "_7"  "_8"  "_9"  "_1"  "_11" "_12"
# extract the characters between position 1 and 2

str_sub(month.name, 1, 2)
##  [1] "Ja" "Fe" "Ma" "Ap" "Ma" "Ju" "Ju" "Au" "Se" "Oc" "No" "De"
# string length of each month

str_length(month.name)
##  [1] 7 8 5 5 3 4 4 6 9 7 8 8
# the '.' represents the object passed by the pipe operator %>%
str_length(month.name) %>% 
   str_c(month.name, ., sep = ".")
##  [1] "January.7"   "February.8"  "March.5"     "April.5"     "May.3"      
##  [6] "June.4"      "July.4"      "August.6"    "September.9" "October.7"  
## [11] "November.8"  "December.8"

A very useful function is str_glue() to interpolate characters.

name <- c("Juan", "Michael")
age <- c(50, 80) 
date_today <- Sys.Date()

str_glue(
  "My name is {name}, ",
  "I'am {age}, ",
  "and my birth year is {format(date_today-age*365, '%Y')}."
)
## My name is Juan, I'am 50, and my birth year is 1970.
## My name is Michael, I'am 80, and my birth year is 1940.

4.3 Management of dates and times

The lubridate package is very powerful in handling dates and times. It allows us to create R recognized objects with functions (like ymd() or ymd_hms()) and we can even make calculations.

We only must know the following abbreviations:

  • ymd: represents y:year, m: month, d:day
  • hms: represents h:hour, m:minutes, s:seconds
# load package
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
# date vector
dat <- c("1999/12/31", "2000/01/07", "2005/05/20","2010/03/25")

# date-time vector
dat_time <- c("1988-08-01 05:00", "2000-02-01 22:00")

# convert to date class
dat <- ymd(dat) 
dat
## [1] "1999-12-31" "2000-01-07" "2005-05-20" "2010-03-25"
# other date formats
dmy("05-02-2000")
## [1] "2000-02-05"
ymd("20000506")
## [1] "2000-05-06"
# convert to POSIXct
dat_time <- ymd_hm(dat_time)
dat_time
## [1] "1988-08-01 05:00:00 UTC" "2000-02-01 22:00:00 UTC"
# different date formats
dat_mix <- c("1999/12/05", "05-09-2008", "2000/08/09", "25-10-2019")

# mixted formats with known convention found in ?strptime
parse_date_time(dat_mix, order = c("%Y/%m/%d", "%d-%m-%Y"))
## [1] "1999-12-05 UTC" "2008-09-05 UTC" "2000-08-09 UTC" "2019-10-25 UTC"

More useful functions:

# extract the year
year(dat)
## [1] 1999 2000 2005 2010
# the month
month(dat)
## [1] 12  1  5  3
month(dat, label = TRUE) # as label
## [1] dic ene may mar
## 12 Levels: ene < feb < mar < abr < may < jun < jul < ago < sep < ... < dic
# the day of the week
wday(dat)
## [1] 6 6 6 5
wday(dat, label = TRUE) # as label
## [1] vi\\. vi\\. vi\\. ju\\.
## Levels: do\\. < lu\\. < ma\\. < mi\\. < ju\\. < vi\\. < sá\\.
# the hour
hour(dat_time)
## [1]  5 22
# add 10 days
dat + days(10)
## [1] "2000-01-10" "2000-01-17" "2005-05-30" "2010-04-04"
# add 1 month
dat + months(1)
## [1] "2000-01-31" "2000-02-07" "2005-06-20" "2010-04-25"

Finally, the make_date() function is very useful to create dates from different date parts, such as the year, month, etc.

# create date from its elements, here with year and month
make_date(2000, 5)
## [1] "2000-05-01"
# create date with time
make_datetime(2005, 5, 23, 5)
## [1] "2005-05-23 05:00:00 UTC"

More details can be found in the Cheat-Sheet of lubridate.

4.4 Table and vector manipulation

The dplyr and tidyr packages provide us with a data manipulation grammar, a set of useful verbs to solve common problems. The most important functions are:

Function Description
mutate() add new variables or modify existing ones
select() select variables
filter() filter
summarise() summarize/reduce
arrange() sort
group_by() group
rename() rename columns

In case you haven’t done it before, we import the mobility data.

google_mobility <- read_csv("Global_Mobility_Report.csv")
## Parsed with column specification:
## cols(
##   country_region_code = col_character(),
##   country_region = col_character(),
##   sub_region_1 = col_character(),
##   sub_region_2 = col_logical(),
##   iso_3166_2_code = col_character(),
##   census_fips_code = col_logical(),
##   date = col_date(format = ""),
##   retail_and_recreation_percent_change_from_baseline = col_double(),
##   grocery_and_pharmacy_percent_change_from_baseline = col_double(),
##   parks_percent_change_from_baseline = col_double(),
##   transit_stations_percent_change_from_baseline = col_double(),
##   workplaces_percent_change_from_baseline = col_double(),
##   residential_percent_change_from_baseline = col_double()
## )
## Warning: 597554 parsing failures.
##    row              col           expected         actual                         file
## 200119 sub_region_2     1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv'
## 200119 census_fips_code 1/0/T/F/TRUE/FALSE 01001          'Global_Mobility_Report.csv'
## 200120 sub_region_2     1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv'
## 200120 census_fips_code 1/0/T/F/TRUE/FALSE 01001          'Global_Mobility_Report.csv'
## 200121 sub_region_2     1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv'
## ...... ................ .................. .............. ............................
## See problems(...) for more details.

4.4.1 Select and rename

We can select or remove columns with the select() function, using the name or index of the column. To delete columns we make use of the negative sign. The rename function helps in renaming columns with either the same name or their index.

residential_mobility <- select(google_mobility, 
                               country_region_code:sub_region_1, 
                               date, 
                               residential_percent_change_from_baseline) %>% 
                        rename(resi = 5)

4.4.2 Filter and sort

To filter data, we use filter() with logical operators (|, ==, >, etc) or functions that return a logical value (str_detect(), is.na() , etc.). The arrange() function sorts from least to greatest for one or multiple variables (with the negative sign - the order is reversed from greatest to least).

filter(residential_mobility, 
       country_region_code == "US")
## # A tibble: 304,648 x 5
##    country_region_code country_region sub_region_1 date        resi
##    <chr>               <chr>          <chr>        <date>     <dbl>
##  1 US                  United States  <NA>         2020-02-15    -1
##  2 US                  United States  <NA>         2020-02-16    -1
##  3 US                  United States  <NA>         2020-02-17     5
##  4 US                  United States  <NA>         2020-02-18     1
##  5 US                  United States  <NA>         2020-02-19     0
##  6 US                  United States  <NA>         2020-02-20     1
##  7 US                  United States  <NA>         2020-02-21     0
##  8 US                  United States  <NA>         2020-02-22    -1
##  9 US                  United States  <NA>         2020-02-23    -1
## 10 US                  United States  <NA>         2020-02-24     0
## # ... with 304,638 more rows
filter(residential_mobility, 
       country_region_code == "US", 
       sub_region_1 == "New York")
## # A tibble: 7,068 x 5
##    country_region_code country_region sub_region_1 date        resi
##    <chr>               <chr>          <chr>        <date>     <dbl>
##  1 US                  United States  New York     2020-02-15     0
##  2 US                  United States  New York     2020-02-16    -1
##  3 US                  United States  New York     2020-02-17     9
##  4 US                  United States  New York     2020-02-18     3
##  5 US                  United States  New York     2020-02-19     2
##  6 US                  United States  New York     2020-02-20     2
##  7 US                  United States  New York     2020-02-21     3
##  8 US                  United States  New York     2020-02-22    -1
##  9 US                  United States  New York     2020-02-23    -1
## 10 US                  United States  New York     2020-02-24     0
## # ... with 7,058 more rows
filter(residential_mobility, 
       resi > 50) %>% 
          arrange(-resi)
## # A tibble: 32 x 5
##    country_region_co~ country_region sub_region_1               date        resi
##    <chr>              <chr>          <chr>                      <date>     <dbl>
##  1 KW                 Kuwait         Al Farwaniyah Governorate  2020-05-14    56
##  2 KW                 Kuwait         Al Farwaniyah Governorate  2020-05-21    55
##  3 SG                 Singapore      <NA>                       2020-05-01    55
##  4 KW                 Kuwait         Al Farwaniyah Governorate  2020-05-28    54
##  5 PE                 Peru           Metropolitan Municipality~ 2020-04-10    54
##  6 EC                 Ecuador        Pichincha                  2020-03-27    53
##  7 KW                 Kuwait         Al Farwaniyah Governorate  2020-05-11    53
##  8 KW                 Kuwait         Al Farwaniyah Governorate  2020-05-13    53
##  9 KW                 Kuwait         Al Farwaniyah Governorate  2020-05-20    53
## 10 SG                 Singapore      <NA>                       2020-04-10    53
## # ... with 22 more rows

4.4.3 Group and summarize

Where do we find greater variability between regions in each country on April 1, 2020?

To answer this question, we first filter the data and then we group by the country column. When we use the summarize() function after grouping, it allows us to summarize by these groups. Moreover, combining group_by() with the mutate() function modifies columns in each group separately. In summarize() we calculate the maximum, minimum value and the difference between both extremes creating new columns.

resi_variability <- residential_mobility %>% 
                        filter(date == ymd("2020-04-01"),
                               !is.na(sub_region_1)) %>% 
                          group_by(country_region) %>% 
                           summarise(mx = max(resi, na.rm = TRUE), 
                                    min = min(resi, na.rm = TRUE),
                                    range = abs(mx)-abs(min))
## `summarise()` ungrouping output (override with `.groups` argument)
arrange(resi_variability, -range)
## # A tibble: 94 x 4
##    country_region    mx   min range
##    <chr>          <dbl> <dbl> <dbl>
##  1 Nigeria           43     6    37
##  2 United States     35     6    29
##  3 India             36    15    21
##  4 Malaysia          45    26    19
##  5 Philippines       40    21    19
##  6 Vietnam           28     9    19
##  7 Colombia          41    24    17
##  8 Ecuador           44    27    17
##  9 Argentina         35    19    16
## 10 Chile             30    14    16
## # ... with 84 more rows

4.4.4 Join tables

How can we filter the data to get a subset of Europe?

To do this, we import a spatial dataset with the country code and a column of regions. Detailed explanations about the sf (simple feature) package, I’ll leave for another post.

library(rnaturalearth) # package of spatial vectorial data

# world limits
wld <- ne_countries(returnclass = "sf")

# filter the countries with iso code and select the two columns of interest
wld <- filter(wld, !is.na(iso_a2)) %>% select(iso_a2, subregion)

# plot
plot(wld)

Other dplyr functions allow us to join tables: *_join (). Depending on which table (left or right) you want to join, the functions change: left_join(), right_join() or even full_join(). The by argument is not necessary as long as both tables have a column in common. However, in this case the variable names are different, so we use the following way: c("country_region_code"="iso_a2"). The forcats package of tidyverse has many useful functions for handling categorical variables (factors), variables that have a fixed and known set of possible values. All forcats functions have the prefix fct_*. For example, in this case we use fct_reorder() to reorder the country labels in order of the maximum based on the residential mobility records. Finally, we create a new column "resi_real" to change the reference value, the average or baseline, from 0 to 100.

subset_europe <- filter(residential_mobility, 
                        is.na(sub_region_1),
                        !is.na(resi)) %>%
                 left_join(wld, by = c("country_region_code"="iso_a2")) %>% 
                 filter(subregion %in% c("Northern Europe",
                                         "Southern Europe",
                                          "Western Europe",
                                          "Eastern Europe")) %>%
                 mutate(resi_real = resi + 100,
                        region = fct_reorder(country_region, 
                                             resi, 
                                            .fun = "max", 
                                            .desc = FALSE)) %>% 
                select(-geometry, -sub_region_1)

str(subset_europe)
## tibble [3,988 x 7] (S3: tbl_df/tbl/data.frame)
##  $ country_region_code: chr [1:3988] "AT" "AT" "AT" "AT" ...
##  $ country_region     : chr [1:3988] "Austria" "Austria" "Austria" "Austria" ...
##  $ date               : Date[1:3988], format: "2020-02-15" "2020-02-16" ...
##  $ resi               : num [1:3988] -2 -2 0 0 1 0 1 -2 0 -1 ...
##  $ subregion          : chr [1:3988] "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
##  $ resi_real          : num [1:3988] 98 98 100 100 101 100 101 98 100 99 ...
##  $ region             : Factor w/ 35 levels "Belarus","Ukraine",..: 18 18 18 18 18 18 18 18 18 18 ...
##  - attr(*, "problems")= tibble [597,554 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ row     : int [1:597554] 200119 200119 200120 200120 200121 200121 200122 200122 200123 200123 ...
##   ..$ col     : chr [1:597554] "sub_region_2" "census_fips_code" "sub_region_2" "census_fips_code" ...
##   ..$ expected: chr [1:597554] "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ...
##   ..$ actual  : chr [1:597554] "Autauga County" "01001" "Autauga County" "01001" ...
##   ..$ file    : chr [1:597554] "'Global_Mobility_Report.csv'" "'Global_Mobility_Report.csv'" "'Global_Mobility_Report.csv'" "'Global_Mobility_Report.csv'" ...

4.4.5 Long and wide tables

Before we go to create graphics with ggplot2, it is very common to modify the table between two main formats, long and wide. A table is tidy when 1) each variable is a column 2) each observation/case is a row and 3) each type of observational unit forms a table.

# subset
mobility_selection <- select(subset_europe, country_region_code, date:resi)
mobility_selection
## # A tibble: 3,988 x 3
##    country_region_code date        resi
##    <chr>               <date>     <dbl>
##  1 AT                  2020-02-15    -2
##  2 AT                  2020-02-16    -2
##  3 AT                  2020-02-17     0
##  4 AT                  2020-02-18     0
##  5 AT                  2020-02-19     1
##  6 AT                  2020-02-20     0
##  7 AT                  2020-02-21     1
##  8 AT                  2020-02-22    -2
##  9 AT                  2020-02-23     0
## 10 AT                  2020-02-24    -1
## # ... with 3,978 more rows
# wide table
mobi_wide <- pivot_wider(mobility_selection, 
                         names_from = country_region_code,
                         values_from = resi)
mobi_wide
## # A tibble: 114 x 36
##    date          AT    BA    BE    BG    BY    CH    CZ    DE    DK    EE    ES
##    <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2020-02-15    -2    -1    -1     0    -1    -1    -2    -1     0     0    -2
##  2 2020-02-16    -2    -1     1    -3     0    -1    -1     0     1     0    -2
##  3 2020-02-17     0    -1     0    -2     0     1     0     0     1     1    -1
##  4 2020-02-18     0    -1     0    -2     0     1     0     1     1     1     0
##  5 2020-02-19     1    -1     0    -1    -1     1     0     1     1     0    -1
##  6 2020-02-20     0    -1     0     0    -1     0     0     1     1     0    -1
##  7 2020-02-21     1    -2     0    -1    -1     1     0     2     1     1    -2
##  8 2020-02-22    -2    -1     0     0    -2    -2    -3     0     1     0    -2
##  9 2020-02-23     0    -1     0    -3    -1    -1     0     0     0    -2    -3
## 10 2020-02-24    -1    -1     4    -1     0     0     0     4     0    16     0
## # ... with 104 more rows, and 24 more variables: FI <dbl>, FR <dbl>, GB <dbl>,
## #   GR <dbl>, HR <dbl>, HU <dbl>, IE <dbl>, IT <dbl>, LT <dbl>, LU <dbl>,
## #   LV <dbl>, MD <dbl>, MK <dbl>, NL <dbl>, NO <dbl>, PL <dbl>, PT <dbl>,
## #   RO <dbl>, RS <dbl>, RU <dbl>, SE <dbl>, SI <dbl>, SK <dbl>, UA <dbl>
# back to long table
pivot_longer(mobi_wide,
             2:36,
             names_to = "country_code",
             values_to = "resi")
## # A tibble: 3,990 x 3
##    date       country_code  resi
##    <date>     <chr>        <dbl>
##  1 2020-02-15 AT              -2
##  2 2020-02-15 BA              -1
##  3 2020-02-15 BE              -1
##  4 2020-02-15 BG               0
##  5 2020-02-15 BY              -1
##  6 2020-02-15 CH              -1
##  7 2020-02-15 CZ              -2
##  8 2020-02-15 DE              -1
##  9 2020-02-15 DK               0
## 10 2020-02-15 EE               0
## # ... with 3,980 more rows

Another group of functions you should take a look at are: separate(), case_when(), complete(). More details can be found in the Cheat-Sheet of dplyr.

4.5 Visualize data

ggplot2 is a modern system for data visualization with a huge variety of options. Unlike the R Base graphic system, in ggplot2 a different grammar is used. The grammar of graphics (gg) consists of the sum of several independent layers or objects that are combined using + to construct the final graph. ggplot differentiates between data, what is displayed and how it is displayed.

  • data: our dataset (data.frame or tibble)

  • aesthetics: with the aes() function we indicate the variables that correspond to the x, y, z, … axes, or when it is intended to apply graphic parameters (color, size, shape) according to a variable. It is possible to include aes() in ggplot() or in the corresponding function to a geometry geom_ *.

  • geometries: are geom_ * objects that indicate the geometry to be used, (eg: geom_point(), geom_line(), geom_boxplot(), etc.).

  • scales: are objects of type scales_ * (eg, scale_x_continous(), scale_colour_manual()) to manipulate axes, define colors, etc.

  • statistics: are stat_ * objects (eg, stat_density()) that allow to apply statistical transformations.

More details can be found in the Cheat-Sheet of ggplot2. ggplot is constantly supplemented by extensions for geometries or other graphical options (see https://exts.ggplot2.tidyverse.org/ggiraph.html), for graphical ideas have a look a the R Graph Gallery (https://www.r-graph-gallery.com/).

4.5.1 Line and scatter plot

We create a subset of our mobility data for residences and parks, filtering the records for Italian regions. In addition, we divide the mobility values in percentage by 100 to obtain the fraction, since ggplot2 allows us to indicate the unit of percentage in the label argument (see last plot in this section).

# create subset
it <- filter(google_mobility, 
             country_region == "Italy", 
             is.na(sub_region_1)) %>% 
      mutate(resi = residential_percent_change_from_baseline/100,   
             parks = parks_percent_change_from_baseline/100)


# line plot
ggplot(it, 
       aes(date, resi)) + 
  geom_line()

# scatter plot
ggplot(it, 
       aes(parks, resi)) + 
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

To modify the axes, we use the different scale_* functions that we must adapt to the scales of measurement (date, discrete, continuous, etc.). The labs() function helps us define the axis, legend and plot titles. Finally, we add the style of the graph with theme_light() (others are theme_bw(), theme_minimal(), etc.). We could also make changes to all graphic elements through theme().

# time serie plot
ggplot(it, 
       aes(date, resi)) + 
  geom_line(colour = "#560A86", size = 0.8) +
  scale_x_date(date_breaks = "10 days", 
               date_labels = "%d %b") +
  scale_y_continuous(breaks = seq(-0.1, 1, 0.1), 
                     labels = scales::percent) +
  labs(x = "", 
       y = "Residential mobility",
       title = "Mobility during COVID-19") +
  theme_light()

# scatter plot
ggplot(it, 
       aes(parks, resi)) + 
  geom_point(alpha = .4, size = 2) +
  geom_smooth(method = "lm") +
  scale_x_continuous(breaks = seq(-1, 1.4, 0.2), 
                     labels = scales::percent) +
  scale_y_continuous(breaks = seq(-1, 1, 0.1), 
                     labels = scales::percent) +
  labs(x = "Park mobility", 
       y = "Residential mobility",
       title = "Mobility during COVID-19") +
  theme_light()
## `geom_smooth()` using formula 'y ~ x'

4.5.2 Boxplot

We can visualize different aspects of the mobility with other geometries. Here we will create boxplots for each European country representing the variability of mobility between and within countries during the COVID-19 pandemic.

# subset
subset_europe_reg <- filter(residential_mobility, 
                           !is.na(sub_region_1),
                           !is.na(resi)) %>%
                     left_join(wld, by = c("country_region_code"="iso_a2")) %>% 
                     filter(subregion %in% c("Northern Europe",
                                         "Southern Europe",
                                          "Western Europe",
                                          "Eastern Europe")) %>% 
                     mutate(resi = resi/100, 
                            country_region = fct_reorder(country_region, resi))

# boxplot
ggplot(subset_europe_reg, 
       aes(country_region, resi, fill = subregion)) + 
  geom_boxplot() +
  scale_y_continuous(breaks = seq(-0.1, 1, 0.1), labels = scales::percent) +
  scale_fill_brewer(palette = "Set1") +
  coord_flip() +
   labs(x = "", 
       y = "Residential mobility",
       title = "Mobility during COVID-19", 
       fill = "") +
  theme_minimal()

4.5.3 Heatmap

To visualize the mobility trend of all European countries it is recommended to use a heatmap instead of a bundle of lines. Before building the graph, we will create a vector of Sundays for the x-axis labels in the observation period.

# sequence of dates
df <- data.frame(d = seq(ymd("2020-02-15"), ymd("2020-06-07"), "day"))

# filter on Sundays 
sundays <- df %>% 
            mutate(wd = wday(d, week_start = 1)) %>% 
             filter(wd == 7) %>% 
              pull(d)

To difference between European regions, we will use a color fill for the boxplots. We can set the color type with scale_fill_*, in this case, from the viridis scheme. In addition, the guides() function can modify the color bar of the legend. Finally, here we see the use of theme() with additional changes to theme_minimal().

# headmap
ggplot(subset_europe, 
       aes(date, region, fill = resi_real)) +
  geom_tile() +
  scale_x_date(breaks = sundays,
               date_labels = "%d %b") +
  scale_fill_viridis_c(option = "A", 
                       breaks = c(91, 146),
                       labels = c("Less", "More"), 
                       direction = -1) +
  theme_minimal() +
  theme(legend.position = "top", 
        title = element_text(size = 14),
        panel.grid.major.x = element_line(colour = "white", linetype = "dashed"),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.ontop = TRUE,
        plot.margin = margin(r = 1, unit = "cm")) +
  labs(y = "", 
       x = "", 
       fill = "", 
       title = "Mobility trends for places of residence",
       caption = "Data: google.com/covid19/mobility/") +
  guides(fill = guide_colorbar(barwidth = 10, 
                               barheight = .5,
                               label.position = "top", 
                               ticks = FALSE)) +
  coord_cartesian(expand = FALSE)

4.6 Apply functions on vectors or lists

The purrr package contains a set of advanced functional programming functions for working with functions and vectors. The known lapply() family of R Base corresponds to the map() functions in this package. One of the biggest advantages is being able to reduce the use of loops (for, etc.).

# list of two vectors
vec_list <- list(x = 1:10, y = 50:70)

# calculate the average for each one
map(vec_list, mean)
## $x
## [1] 5.5
## 
## $y
## [1] 60
# change the output type map_* (dbl, chr, lgl, etc.)
map_dbl(vec_list, mean)
##    x    y 
##  5.5 60.0

Finally, a more complex example. We calculate the correlation coefficient between residential and park mobility in all European countries. To get a tidy summary of a model or test we use the tidy() function of the broom package.

library(broom) # tidy outputs

# custom function
cor_test <- function(x, formula) { 
  
df <- cor.test(as.formula(formula), data = x) %>% tidy()

return(df)
  
}

# prepare the data
europe_reg <- filter(google_mobility, 
                           !is.na(sub_region_1),
                           !is.na(residential_percent_change_from_baseline)) %>%
                     left_join(wld, by = c("country_region_code"="iso_a2")) %>% 
                     filter(subregion %in% c("Northern Europe",
                                         "Southern Europe",
                                          "Western Europe",
                                          "Eastern Europe"))

# apply the function to each country creating a list
cor_mobility <- europe_reg %>%
                 split(.$country_region_code) %>% 
                   map(cor_test, formula = "~ residential_percent_change_from_baseline + parks_percent_change_from_baseline")  

cor_mobility[1:5]
## $AT
## # A tibble: 1 x 8
##   estimate statistic  p.value parameter conf.low conf.high method    alternative
##      <dbl>     <dbl>    <dbl>     <int>    <dbl>     <dbl> <chr>     <chr>      
## 1   -0.360     -12.3 2.68e-32      1009   -0.413    -0.305 Pearson'~ two.sided  
## 
## $BE
## # A tibble: 1 x 8
##   estimate statistic  p.value parameter conf.low conf.high method    alternative
##      <dbl>     <dbl>    <dbl>     <int>    <dbl>     <dbl> <chr>     <chr>      
## 1   -0.312     -6.06  3.67e-9       340   -0.405    -0.213 Pearson'~ two.sided  
## 
## $BG
## # A tibble: 1 x 8
##   estimate statistic   p.value parameter conf.low conf.high method   alternative
##      <dbl>     <dbl>     <dbl>     <int>    <dbl>     <dbl> <chr>    <chr>      
## 1   -0.677     -37.8 1.47e-227      1694   -0.702    -0.650 Pearson~ two.sided  
## 
## $CH
## # A tibble: 1 x 8
##   estimate statistic p.value parameter conf.low conf.high method     alternative
##      <dbl>     <dbl>   <dbl>     <int>    <dbl>     <dbl> <chr>      <chr>      
## 1  -0.0786     -2.91 0.00370      1360   -0.131   -0.0256 Pearson's~ two.sided  
## 
## $CZ
## # A tibble: 1 x 8
##   estimate statistic  p.value parameter conf.low conf.high method    alternative
##      <dbl>     <dbl>    <dbl>     <int>    <dbl>     <dbl> <chr>     <chr>      
## 1  -0.0837     -3.35 0.000824      1593   -0.132   -0.0347 Pearson'~ two.sided

As we’ve seen before, there are subfunctions of map_* to get an object of another class instead of a list, here for a bind data.frame.

cor_mobility <- europe_reg %>%
                  split(.$country_region_code) %>% 
                     map_df(cor_test, 
                            formula = "~ residential_percent_change_from_baseline + parks_percent_change_from_baseline", 
                            .id = "country_code")

arrange(cor_mobility, estimate)
## # A tibble: 27 x 9
##    country_code estimate statistic   p.value parameter conf.low conf.high method
##    <chr>           <dbl>     <dbl>     <dbl>     <int>    <dbl>     <dbl> <chr> 
##  1 IT             -0.831    -71.0  0.             2250   -0.844    -0.818 Pears~
##  2 ES             -0.825    -65.4  0.             2005   -0.839    -0.811 Pears~
##  3 PT             -0.729    -46.9  2.12e-321      1938   -0.749    -0.707 Pears~
##  4 FR             -0.698    -37.4  3.29e-216      1474   -0.723    -0.671 Pears~
##  5 GR             -0.692    -27.0  1.03e-114       796   -0.726    -0.654 Pears~
##  6 BG             -0.677    -37.8  1.47e-227      1694   -0.702    -0.650 Pears~
##  7 RO             -0.640    -56.0  0.             4517   -0.657    -0.623 Pears~
##  8 SI             -0.627    -11.4  1.98e- 23       200   -0.704    -0.535 Pears~
##  9 HR             -0.579    -21.9  9.32e- 87       954   -0.620    -0.536 Pears~
## 10 LV             -0.544     -6.87 3.84e- 10       112   -0.662    -0.401 Pears~
## # ... with 17 more rows, and 1 more variable: alternative <chr>

Other practical examples here in this post or this other. More details can be found in the Cheat-Sheet of purrr.

To leave a comment for the author, please follow the link and comment on their blog: R on Dominic Royé.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.