A very short introduction to Tidyverse
1 Tidyverse
The tidyverse, a collection of packages specially focused on data science, marked a milestone in R programming. In this post I am going to summarize very briefly the essentials to get started in this world. The tidyverse grammar follows a common structure in all functions: most importantly, the first argument is the object, and then come the rest of the arguments. In addition, a set of verbs is provided to facilitate the use of the functions. The tidyverse philosophy and grammar of functions are also reflected in other packages that are compatible with the collection. For example, the sf package (simple features) is a standardized way to encode spatial vector data and supports many of the functions we find in the dplyr package.
The core of the tidyverse collection is made up of the following packages:
Package | Description |
---|---|
ggplot2 | Grammar for creating graphics |
purrr | R functional programming |
tibble | Modern and effective table system |
dplyr | Grammar for data manipulation |
tidyr | Set of functions to create tidy data |
stringr | Function set to work with characters |
readr | An easy and fast way to import data |
forcats | Tools to easily work with factors |
In addition to the packages mentioned, lubridate is used very frequently to work with dates and times, as is readxl, which allows us to import files in Excel format. To list all the available packages, we can use the tidyverse_packages() function.
## [1] "broom" "cli" "crayon" "dbplyr" "dplyr" ## [6] "forcats" "ggplot2" "haven" "hms" "httr" ## [11] "jsonlite" "lubridate" "magrittr" "modelr" "pillar" ## [16] "purrr" "readr" "readxl" "reprex" "rlang" ## [21] "rstudioapi" "rvest" "stringr" "tibble" "tidyr" ## [26] "xml2" "tidyverse"
It is very easy to get conflicts between functions, that is, the same function name may exist in several packages. To avoid this, we can prefix the function with the name of its package, separated by two colons (package_name::function_name).
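For example, both the stats package (loaded by default) and dplyr export a function called filter(); the :: prefix makes the intended package explicit. A minimal sketch:

```r
# After library(dplyr), a bare filter() call refers to dplyr's version,
# which masks stats::filter(). The :: prefix removes the ambiguity.
library(dplyr)

df <- data.frame(a = 1:5)

rows_kept <- dplyr::filter(df, a > 3)        # dplyr's row-filtering verb
smoothed  <- stats::filter(1:5, rep(1/2, 2)) # base R's linear filter
```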
Before getting started with the packages (I promise this will be a really short introduction), a few comments on style when programming in R.
2 Style guide
In R there is no universal style guide; the syntax does not force us to follow specific rules in our scripts. But it is recommended to write scripts in a homogeneous, uniform, legible and clear way. The tidyverse collection has its own guide (https://style.tidyverse.org/).
The most important recommendations are:
- Avoid using more than 80 characters per line to allow reading the complete code.
- Always use a space after a comma, never before.
- The operators (==, +, -, <-, %>%, etc.) must have a space before and after.
- There is no space between the name of a function and the first parenthesis, nor between the last argument and the final parenthesis of a function.
- Avoid reusing the names of common functions and variables (c <- 5 vs. c()).
- Organize the script, separating its parts with comments of the form # Import data -----.
- Avoid accent marks or special symbols in names, files, paths, etc.
- Object names should follow a consistent structure: day_one, day_1.
It is advisable to use correct indentation for multiple arguments of a function or for functions chained with the pipe operator (%>%).
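As an illustration of these recommendations, a chained call might be indented like this (a sketch using the built-in mtcars data):

```r
library(dplyr)

# one pipe step per line; one named argument per line when a call grows long
result <- mtcars %>%
  filter(cyl == 4) %>%
  summarise(
    mean_mpg = mean(mpg),
    max_hp   = max(hp)
  )
result
```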
3 Pipe %>%
To facilitate data management, manipulation and visualization, the magrittr package introduces the famous pipe operator, written %>%, with the aim of combining several functions without assigning the result to a new object. The pipe passes the output of the expression on its left as the first argument of the function on its right. This way of combining functions allows you to chain several steps to perform sequential tasks. In the very simple example below, we pass the vector 1:5 to the mean() function to calculate the average. Note that the same package provides a couple of other pipe operators as well.
1:5 %>% mean() ## [1] 3
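Two of those other magrittr operators can be sketched briefly:

```r
library(magrittr)

# %<>% is the compound assignment pipe: it assigns the result
# back to the left-hand side object
x <- c(1, 4, 9)
x %<>% sqrt()
x
## [1] 1 2 3

# %$% is the exposition pipe: it exposes the columns of a data frame
# to the expression on the right
correlation <- mtcars %$% cor(mpg, wt)
```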
4 Tidyverse packages
4.1 Read and write data
The readr package makes it easy to read or write multiple file formats using functions that start with read_* or write_*.
Compared to base R, readr functions are faster; they handle problematic column names, and dates are converted automatically. The imported tables are of class tibble (tbl_df), a modern version of data.frame from the tibble package. Similarly, you can use the read_excel() function of the readxl package to import data from Excel sheets (more details also in this blog post). In the following example, we import the mobility data registered by Google (link) during the last months of the COVID-19 pandemic (download).
Function | Description |
---|---|
read_csv() or read_csv2() | comma or semicolon separated (CSV) |
read_delim() | general separator |
read_table() | whitespace-separated |
# load package library(tidyverse) google_mobility <- read_csv("Global_Mobility_Report.csv") ## Parsed with column specification: ## cols( ## country_region_code = col_character(), ## country_region = col_character(), ## sub_region_1 = col_character(), ## sub_region_2 = col_logical(), ## iso_3166_2_code = col_character(), ## census_fips_code = col_logical(), ## date = col_date(format = ""), ## retail_and_recreation_percent_change_from_baseline = col_double(), ## grocery_and_pharmacy_percent_change_from_baseline = col_double(), ## parks_percent_change_from_baseline = col_double(), ## transit_stations_percent_change_from_baseline = col_double(), ## workplaces_percent_change_from_baseline = col_double(), ## residential_percent_change_from_baseline = col_double() ## ) ## Warning: 597554 parsing failures. ## row col expected actual file ## 200119 sub_region_2 1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv' ## 200119 census_fips_code 1/0/T/F/TRUE/FALSE 01001 'Global_Mobility_Report.csv' ## 200120 sub_region_2 1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv' ## 200120 census_fips_code 1/0/T/F/TRUE/FALSE 01001 'Global_Mobility_Report.csv' ## 200121 sub_region_2 1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv' ## ...... ................ .................. .............. ............................ ## See problems(...) for more details. google_mobility ## # A tibble: 516,697 x 13 ## country_region_~ country_region sub_region_1 sub_region_2 iso_3166_2_code ## <chr> <chr> <chr> <lgl> <chr> ## 1 AE United Arab E~ <NA> NA <NA> ## 2 AE United Arab E~ <NA> NA <NA> ## 3 AE United Arab E~ <NA> NA <NA> ## 4 AE United Arab E~ <NA> NA <NA> ## 5 AE United Arab E~ <NA> NA <NA> ## 6 AE United Arab E~ <NA> NA <NA> ## 7 AE United Arab E~ <NA> NA <NA> ## 8 AE United Arab E~ <NA> NA <NA> ## 9 AE United Arab E~ <NA> NA <NA> ## 10 AE United Arab E~ <NA> NA <NA> ## # ... 
with 516,687 more rows, and 8 more variables: census_fips_code <lgl>, ## # date <date>, retail_and_recreation_percent_change_from_baseline <dbl>, ## # grocery_and_pharmacy_percent_change_from_baseline <dbl>, ## # parks_percent_change_from_baseline <dbl>, ## # transit_stations_percent_change_from_baseline <dbl>, ## # workplaces_percent_change_from_baseline <dbl>, ## # residential_percent_change_from_baseline <dbl>
It is important to take a look at the argument names, since they change in the readr functions. For example, the well-known header = TRUE argument of read.csv() is here called col_names = TRUE. More details can be found in the Cheat-Sheet of readr.
4.2 Character manipulations
For working with strings we use the stringr package, whose functions always start with str_* followed by a verb, with the string as the first argument.
Some of these functions are as follows:
Function | Description |
---|---|
str_replace() | replace patterns |
str_c() | combine characters |
str_detect() | detect patterns |
str_extract() | extract patterns |
str_sub() | extract by position |
str_length() | length of string |
Regular expressions are often used to describe character patterns. Square brackets [] define character classes: for example, [aeiou] matches any single character that is a vowel, [abc] matches a, b or c, and [a-z], [A-Z] or [0-9] match any character between a and z, A and Z, or 0 and 9. There are also predefined classes such as [:punct:] for punctuation, etc. With curly braces {} we indicate how many times the previous element is repeated: {2} means exactly twice, {1,2} between one and two times, etc. Finally, with ^ or $ we can anchor the pattern to the beginning or the end of the string, respectively. More details and patterns can be found in the Cheat-Sheet of stringr.
# replace 'er' at the end with empty space str_replace(month.name, "er$", "") ## [1] "January" "February" "March" "April" "May" "June" ## [7] "July" "August" "Septemb" "Octob" "Novemb" "Decemb" str_replace(month.name, "^Ma", "") ## [1] "January" "February" "rch" "April" "y" "June" ## [7] "July" "August" "September" "October" "November" "December" # combine characters a <- str_c(month.name, 1:12, sep = "_") a ## [1] "January_1" "February_2" "March_3" "April_4" "May_5" ## [6] "June_6" "July_7" "August_8" "September_9" "October_10" ## [11] "November_11" "December_12" # collapse combination str_c(month.name, collapse = ", ") ## [1] "January, February, March, April, May, June, July, August, September, October, November, December" # detect patterns str_detect(a, "_[1-5]{1}") ## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE # extract patterns str_extract(a, "_[1-9]{1,2}") ## [1] "_1" "_2" "_3" "_4" "_5" "_6" "_7" "_8" "_9" "_1" "_11" "_12" # extract the characters between position 1 and 2 str_sub(month.name, 1, 2) ## [1] "Ja" "Fe" "Ma" "Ap" "Ma" "Ju" "Ju" "Au" "Se" "Oc" "No" "De" # string length of each month str_length(month.name) ## [1] 7 8 5 5 3 4 4 6 9 7 8 8 # the '.' represents the object passed by the pipe operator %>% str_length(month.name) %>% str_c(month.name, ., sep = ".") ## [1] "January.7" "February.8" "March.5" "April.5" "May.3" ## [6] "June.4" "July.4" "August.6" "September.9" "October.7" ## [11] "November.8" "December.8"
A very useful function is str_glue()
to interpolate characters.
name <- c("Juan", "Michael") age <- c(50, 80) date_today <- Sys.Date() str_glue( "My name is {name}, ", "I'm {age}, ", "and my birth year is {format(date_today-age*365, '%Y')}." ) ## My name is Juan, I'm 50, and my birth year is 1970. ## My name is Michael, I'm 80, and my birth year is 1940.
4.3 Management of dates and times
The lubridate package is very powerful for handling dates and times. It allows us to create date objects recognized by R with functions such as ymd() or ymd_hms(), and we can even do calculations with them.
We only need to know the following abbreviations:
- ymd: y = year, m = month, d = day
- hms: h = hours, m = minutes, s = seconds
# load package library(lubridate) ## ## Attaching package: 'lubridate' ## The following objects are masked from 'package:base': ## ## date, intersect, setdiff, union # date vector dat <- c("1999/12/31", "2000/01/07", "2005/05/20","2010/03/25") # date-time vector dat_time <- c("1988-08-01 05:00", "2000-02-01 22:00") # convert to date class dat <- ymd(dat) dat ## [1] "1999-12-31" "2000-01-07" "2005-05-20" "2010-03-25" # other date formats dmy("05-02-2000") ## [1] "2000-02-05" ymd("20000506") ## [1] "2000-05-06" # convert to POSIXct dat_time <- ymd_hm(dat_time) dat_time ## [1] "1988-08-01 05:00:00 UTC" "2000-02-01 22:00:00 UTC" # different date formats dat_mix <- c("1999/12/05", "05-09-2008", "2000/08/09", "25-10-2019") # mixted formats with known convention found in ?strptime parse_date_time(dat_mix, order = c("%Y/%m/%d", "%d-%m-%Y")) ## [1] "1999-12-05 UTC" "2008-09-05 UTC" "2000-08-09 UTC" "2019-10-25 UTC"
More useful functions:
# extract the year year(dat) ## [1] 1999 2000 2005 2010 # the month month(dat) ## [1] 12 1 5 3 month(dat, label = TRUE) # as label ## [1] dic ene may mar ## 12 Levels: ene < feb < mar < abr < may < jun < jul < ago < sep < ... < dic # the day of the week wday(dat) ## [1] 6 6 6 5 wday(dat, label = TRUE) # as label ## [1] vi vi vi ju ## Levels: do < lu < ma < mi < ju < vi < sá # the hour hour(dat_time) ## [1] 5 22 # add 10 days dat + days(10) ## [1] "2000-01-10" "2000-01-17" "2005-05-30" "2010-04-04" # add 1 month dat + months(1) ## [1] "2000-01-31" "2000-02-07" "2005-06-20" "2010-04-25"
Finally, the make_date()
function is very useful to create dates from different date parts, such as the year, month, etc.
# create date from its elements, here with year and month make_date(2000, 5) ## [1] "2000-05-01" # create date with time make_datetime(2005, 5, 23, 5) ## [1] "2005-05-23 05:00:00 UTC"
More details can be found in the Cheat-Sheet of lubridate
.
4.4 Table and vector manipulation
The dplyr
package provides us with a data manipulation grammar, a set of useful verbs to solve common problems. The most important functions are:
Function | Description |
---|---|
mutate() | add new variables or modify existing ones |
select() | select variables |
filter() | filter |
summarise() | summarize/reduce |
arrange() | sort |
group_by() | group |
rename() | rename columns |
In case you haven’t done it before, we import the mobility data.
google_mobility <- read_csv("Global_Mobility_Report.csv") ## Parsed with column specification: ## cols( ## country_region_code = col_character(), ## country_region = col_character(), ## sub_region_1 = col_character(), ## sub_region_2 = col_logical(), ## iso_3166_2_code = col_character(), ## census_fips_code = col_logical(), ## date = col_date(format = ""), ## retail_and_recreation_percent_change_from_baseline = col_double(), ## grocery_and_pharmacy_percent_change_from_baseline = col_double(), ## parks_percent_change_from_baseline = col_double(), ## transit_stations_percent_change_from_baseline = col_double(), ## workplaces_percent_change_from_baseline = col_double(), ## residential_percent_change_from_baseline = col_double() ## ) ## Warning: 597554 parsing failures. ## row col expected actual file ## 200119 sub_region_2 1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv' ## 200119 census_fips_code 1/0/T/F/TRUE/FALSE 01001 'Global_Mobility_Report.csv' ## 200120 sub_region_2 1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv' ## 200120 census_fips_code 1/0/T/F/TRUE/FALSE 01001 'Global_Mobility_Report.csv' ## 200121 sub_region_2 1/0/T/F/TRUE/FALSE Autauga County 'Global_Mobility_Report.csv' ## ...... ................ .................. .............. ............................ ## See problems(...) for more details.
4.4.1 Select and rename
We can select or remove columns with the select() function, using the name or the index of the column. To remove columns we use the negative sign. The rename() function renames columns, referring to them either by their current name or by their index.
residential_mobility <- select(google_mobility, country_region_code:sub_region_1, date, residential_percent_change_from_baseline) %>% rename(resi = 5)
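The same pattern can be tried without the mobility file; a minimal sketch with the built-in iris data:

```r
library(dplyr)

iris_small <- iris %>%
  select(Sepal.Length:Petal.Length, Species) %>% # select a range of columns by name
  select(-Petal.Length) %>%                      # drop a column with the negative sign
  rename(species = 3)                            # rename the 3rd column by index
head(iris_small, 3)
```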
4.4.2 Filter and sort
To filter data, we use filter() with logical operators (|, ==, >, etc.) or functions that return a logical value (str_detect(), is.na(), etc.). The arrange() function sorts from least to greatest by one or multiple variables (with the negative sign -, the order is reversed, from greatest to least).
filter(residential_mobility, country_region_code == "US") ## # A tibble: 304,648 x 5 ## country_region_code country_region sub_region_1 date resi ## <chr> <chr> <chr> <date> <dbl> ## 1 US United States <NA> 2020-02-15 -1 ## 2 US United States <NA> 2020-02-16 -1 ## 3 US United States <NA> 2020-02-17 5 ## 4 US United States <NA> 2020-02-18 1 ## 5 US United States <NA> 2020-02-19 0 ## 6 US United States <NA> 2020-02-20 1 ## 7 US United States <NA> 2020-02-21 0 ## 8 US United States <NA> 2020-02-22 -1 ## 9 US United States <NA> 2020-02-23 -1 ## 10 US United States <NA> 2020-02-24 0 ## # ... with 304,638 more rows filter(residential_mobility, country_region_code == "US", sub_region_1 == "New York") ## # A tibble: 7,068 x 5 ## country_region_code country_region sub_region_1 date resi ## <chr> <chr> <chr> <date> <dbl> ## 1 US United States New York 2020-02-15 0 ## 2 US United States New York 2020-02-16 -1 ## 3 US United States New York 2020-02-17 9 ## 4 US United States New York 2020-02-18 3 ## 5 US United States New York 2020-02-19 2 ## 6 US United States New York 2020-02-20 2 ## 7 US United States New York 2020-02-21 3 ## 8 US United States New York 2020-02-22 -1 ## 9 US United States New York 2020-02-23 -1 ## 10 US United States New York 2020-02-24 0 ## # ... with 7,058 more rows filter(residential_mobility, resi > 50) %>% arrange(-resi) ## # A tibble: 32 x 5 ## country_region_co~ country_region sub_region_1 date resi ## <chr> <chr> <chr> <date> <dbl> ## 1 KW Kuwait Al Farwaniyah Governorate 2020-05-14 56 ## 2 KW Kuwait Al Farwaniyah Governorate 2020-05-21 55 ## 3 SG Singapore <NA> 2020-05-01 55 ## 4 KW Kuwait Al Farwaniyah Governorate 2020-05-28 54 ## 5 PE Peru Metropolitan Municipality~ 2020-04-10 54 ## 6 EC Ecuador Pichincha 2020-03-27 53 ## 7 KW Kuwait Al Farwaniyah Governorate 2020-05-11 53 ## 8 KW Kuwait Al Farwaniyah Governorate 2020-05-13 53 ## 9 KW Kuwait Al Farwaniyah Governorate 2020-05-20 53 ## 10 SG Singapore <NA> 2020-04-10 53 ## # ... with 22 more rows
4.4.3 Group and summarize
Where do we find greater variability between regions in each country on April 1, 2020?
To answer this question, we first filter the data and then group by the country column. Using the summarise() function after grouping allows us to summarize by these groups. Likewise, combining group_by() with the mutate() function modifies columns within each group separately. In summarise() we calculate the maximum and minimum values, and the difference between both extremes, creating new columns.
resi_variability <- residential_mobility %>% filter(date == ymd("2020-04-01"), !is.na(sub_region_1)) %>% group_by(country_region) %>% summarise(mx = max(resi, na.rm = TRUE), min = min(resi, na.rm = TRUE), range = abs(mx)-abs(min)) ## `summarise()` ungrouping output (override with `.groups` argument) arrange(resi_variability, -range) ## # A tibble: 94 x 4 ## country_region mx min range ## <chr> <dbl> <dbl> <dbl> ## 1 Nigeria 43 6 37 ## 2 United States 35 6 29 ## 3 India 36 15 21 ## 4 Malaysia 45 26 19 ## 5 Philippines 40 21 19 ## 6 Vietnam 28 9 19 ## 7 Colombia 41 24 17 ## 8 Ecuador 44 27 17 ## 9 Argentina 35 19 16 ## 10 Chile 30 14 16 ## # ... with 84 more rows
4.4.4 Join tables
How can we filter the data to get a subset of Europe?
To do this, we import a spatial dataset with the country code and a column of regions. Detailed explanations about the sf (simple features) package I will leave for another post.
library(rnaturalearth) # package of spatial vectorial data # world limits wld <- ne_countries(returnclass = "sf") # filter the countries with iso code and select the two columns of interest wld <- filter(wld, !is.na(iso_a2)) %>% select(iso_a2, subregion) # plot plot(wld)
Other dplyr functions allow us to join tables: the *_join() family. Depending on which table (left or right) you want to keep, the function changes: left_join(), right_join() or even full_join(). The by argument is not necessary as long as both tables have a column in common. In this case, however, the variable names differ, so we specify the mapping as c("country_region_code" = "iso_a2"). The forcats package of the tidyverse has many useful functions for handling categorical variables (factors), variables with a fixed and known set of possible values. All forcats functions have the prefix fct_*. For example, here we use fct_reorder() to reorder the country labels by the maximum of the residential mobility records. Finally, we create a new column, resi_real, to change the reference value (the average or baseline) from 0 to 100.
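A minimal self-contained sketch of left_join() with differing key names (toy data, invented for illustration):

```r
library(dplyr)

mobility <- tibble(country_region_code = c("ES", "FR", "XX"),
                   resi = c(20, 25, 10))
codes    <- tibble(iso_a2 = c("ES", "FR"),
                   subregion = c("Southern Europe", "Western Europe"))

# the named vector in `by` maps the differing column names;
# keys without a match ("XX") get NA in the joined columns
joined <- left_join(mobility, codes, by = c("country_region_code" = "iso_a2"))
joined
```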
subset_europe <- filter(residential_mobility, is.na(sub_region_1), !is.na(resi)) %>% left_join(wld, by = c("country_region_code"="iso_a2")) %>% filter(subregion %in% c("Northern Europe", "Southern Europe", "Western Europe", "Eastern Europe")) %>% mutate(resi_real = resi + 100, region = fct_reorder(country_region, resi, .fun = "max", .desc = FALSE)) %>% select(-geometry, -sub_region_1) str(subset_europe) ## tibble [3,988 x 7] (S3: tbl_df/tbl/data.frame) ## $ country_region_code: chr [1:3988] "AT" "AT" "AT" "AT" ... ## $ country_region : chr [1:3988] "Austria" "Austria" "Austria" "Austria" ... ## $ date : Date[1:3988], format: "2020-02-15" "2020-02-16" ... ## $ resi : num [1:3988] -2 -2 0 0 1 0 1 -2 0 -1 ... ## $ subregion : chr [1:3988] "Western Europe" "Western Europe" "Western Europe" "Western Europe" ... ## $ resi_real : num [1:3988] 98 98 100 100 101 100 101 98 100 99 ... ## $ region : Factor w/ 35 levels "Belarus","Ukraine",..: 18 18 18 18 18 18 18 18 18 18 ... ## - attr(*, "problems")= tibble [597,554 x 5] (S3: tbl_df/tbl/data.frame) ## ..$ row : int [1:597554] 200119 200119 200120 200120 200121 200121 200122 200122 200123 200123 ... ## ..$ col : chr [1:597554] "sub_region_2" "census_fips_code" "sub_region_2" "census_fips_code" ... ## ..$ expected: chr [1:597554] "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" "1/0/T/F/TRUE/FALSE" ... ## ..$ actual : chr [1:597554] "Autauga County" "01001" "Autauga County" "01001" ... ## ..$ file : chr [1:597554] "'Global_Mobility_Report.csv'" "'Global_Mobility_Report.csv'" "'Global_Mobility_Report.csv'" "'Global_Mobility_Report.csv'" ...
4.4.5 Long and wide tables
Before creating graphics with ggplot2, it is very common to convert the table between two main formats, long and wide. A table is tidy when 1) each variable is a column, 2) each observation/case is a row, and 3) each type of observational unit forms a table.
# subset mobility_selection <- select(subset_europe, country_region_code, date:resi) mobility_selection ## # A tibble: 3,988 x 3 ## country_region_code date resi ## <chr> <date> <dbl> ## 1 AT 2020-02-15 -2 ## 2 AT 2020-02-16 -2 ## 3 AT 2020-02-17 0 ## 4 AT 2020-02-18 0 ## 5 AT 2020-02-19 1 ## 6 AT 2020-02-20 0 ## 7 AT 2020-02-21 1 ## 8 AT 2020-02-22 -2 ## 9 AT 2020-02-23 0 ## 10 AT 2020-02-24 -1 ## # ... with 3,978 more rows # wide table mobi_wide <- pivot_wider(mobility_selection, names_from = country_region_code, values_from = resi) mobi_wide ## # A tibble: 114 x 36 ## date AT BA BE BG BY CH CZ DE DK EE ES ## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2020-02-15 -2 -1 -1 0 -1 -1 -2 -1 0 0 -2 ## 2 2020-02-16 -2 -1 1 -3 0 -1 -1 0 1 0 -2 ## 3 2020-02-17 0 -1 0 -2 0 1 0 0 1 1 -1 ## 4 2020-02-18 0 -1 0 -2 0 1 0 1 1 1 0 ## 5 2020-02-19 1 -1 0 -1 -1 1 0 1 1 0 -1 ## 6 2020-02-20 0 -1 0 0 -1 0 0 1 1 0 -1 ## 7 2020-02-21 1 -2 0 -1 -1 1 0 2 1 1 -2 ## 8 2020-02-22 -2 -1 0 0 -2 -2 -3 0 1 0 -2 ## 9 2020-02-23 0 -1 0 -3 -1 -1 0 0 0 -2 -3 ## 10 2020-02-24 -1 -1 4 -1 0 0 0 4 0 16 0 ## # ... with 104 more rows, and 24 more variables: FI <dbl>, FR <dbl>, GB <dbl>, ## # GR <dbl>, HR <dbl>, HU <dbl>, IE <dbl>, IT <dbl>, LT <dbl>, LU <dbl>, ## # LV <dbl>, MD <dbl>, MK <dbl>, NL <dbl>, NO <dbl>, PL <dbl>, PT <dbl>, ## # RO <dbl>, RS <dbl>, RU <dbl>, SE <dbl>, SI <dbl>, SK <dbl>, UA <dbl> # back to long table pivot_longer(mobi_wide, 2:36, names_to = "country_code", values_to = "resi") ## # A tibble: 3,990 x 3 ## date country_code resi ## <date> <chr> <dbl> ## 1 2020-02-15 AT -2 ## 2 2020-02-15 BA -1 ## 3 2020-02-15 BE -1 ## 4 2020-02-15 BG 0 ## 5 2020-02-15 BY -1 ## 6 2020-02-15 CH -1 ## 7 2020-02-15 CZ -2 ## 8 2020-02-15 DE -1 ## 9 2020-02-15 DK 0 ## 10 2020-02-15 EE 0 ## # ... with 3,980 more rows
Another group of functions you should take a look at are separate() and complete() (from tidyr) and case_when() (from dplyr). More details can be found in the Cheat-Sheet of dplyr.
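A brief sketch of these three functions together (the toy data is invented for illustration):

```r
library(dplyr)
library(tidyr)

df <- tibble(id = c("a_1", "a_2", "b_1"), value = c(10, 2, 5))

df_out <- df %>%
  separate(id, into = c("group", "rep"), sep = "_") %>% # split one column into two
  mutate(level = case_when(                             # vectorized if/else
    value > 7 ~ "high",
    TRUE      ~ "low"
  )) %>%
  complete(group, rep)                                  # add missing combinations
df_out
```

complete() adds the missing group/rep combination ("b", "2") as a row filled with NA.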
4.5 Visualize data
ggplot2 is a modern system for data visualization with a huge variety of options. Unlike the base R graphics system, ggplot2 uses a different grammar. The grammar of graphics (gg) consists of the sum of several independent layers or objects that are combined using + to construct the final graph. ggplot2 differentiates between the data, what is displayed, and how it is displayed.
- data: our dataset (data.frame or tibble).
- aesthetics: with the aes() function we indicate the variables that correspond to the x, y, z, … axes, or the variables to which graphic parameters (color, size, shape) are mapped. aes() can be included in ggplot() or in the function of the corresponding geometry geom_*.
- geometries: geom_* objects that indicate the geometry to be used (e.g. geom_point(), geom_line(), geom_boxplot(), etc.).
- scales: objects of type scale_* (e.g. scale_x_continuous(), scale_colour_manual()) to manipulate axes, define colors, etc.
- statistics: stat_* objects (e.g. stat_density()) that apply statistical transformations.
More details can be found in the Cheat-Sheet of ggplot2
.
4.5.1 Line and scatter plot
We create a subset of our mobility data for residences and parks, filtering the records for Italy at the national level. In addition, we divide the percentage mobility values by 100 to obtain the fraction, since ggplot2 can then format the axis labels as percentages (see the last plot in this section).
# create subset it <- filter(google_mobility, country_region == "Italy", is.na(sub_region_1)) %>% mutate(resi = residential_percent_change_from_baseline/100, parks = parks_percent_change_from_baseline/100) # line plot ggplot(it, aes(date, resi)) + geom_line()
# scatter plot ggplot(it, aes(parks, resi)) + geom_point() + geom_smooth(method = "lm") ## `geom_smooth()` using formula 'y ~ x'
To modify the axes, we use the different scale_*
functions that we must adapt to the scales of measurement (date, discrete, continuous, etc.). The labs()
function helps us define the axis, legend and plot titles. Finally, we add the style of the graph with theme_light()
(others are theme_bw()
, theme_minimal()
, etc.). We could also make changes to all graphic elements through theme()
.
# time series plot ggplot(it, aes(date, resi)) + geom_line(colour = "#560A86", size = 0.8) + scale_x_date(date_breaks = "10 days", date_labels = "%d %b") + scale_y_continuous(breaks = seq(-0.1, 1, 0.1), labels = scales::percent) + labs(x = "", y = "Residential mobility", title = "Mobility during COVID-19") + theme_light()
# scatter plot ggplot(it, aes(parks, resi)) + geom_point(alpha = .4, size = 2) + geom_smooth(method = "lm") + scale_x_continuous(breaks = seq(-1, 1.4, 0.2), labels = scales::percent) + scale_y_continuous(breaks = seq(-1, 1, 0.1), labels = scales::percent) + labs(x = "Park mobility", y = "Residential mobility", title = "Mobility during COVID-19") + theme_light() ## `geom_smooth()` using formula 'y ~ x'
4.5.2 Boxplot
We can visualize different aspects of the mobility with other geometries. Here we will create boxplots for each European country representing the variability of mobility between and within countries during the COVID-19 pandemic.
# subset subset_europe_reg <- filter(residential_mobility, !is.na(sub_region_1), !is.na(resi)) %>% left_join(wld, by = c("country_region_code"="iso_a2")) %>% filter(subregion %in% c("Northern Europe", "Southern Europe", "Western Europe", "Eastern Europe")) %>% mutate(resi = resi/100, country_region = fct_reorder(country_region, resi)) # boxplot ggplot(subset_europe_reg, aes(country_region, resi, fill = subregion)) + geom_boxplot() + scale_y_continuous(breaks = seq(-0.1, 1, 0.1), labels = scales::percent) + scale_fill_brewer(palette = "Set1") + coord_flip() + labs(x = "", y = "Residential mobility", title = "Mobility during COVID-19", fill = "") + theme_minimal()
4.5.3 Heatmap
To visualize the mobility trend of all European countries, it is advisable to use a heatmap instead of a tangle of lines. Before building the graph, we create a vector of the Sundays in the observation period for the x-axis labels.
# sequence of dates df <- data.frame(d = seq(ymd("2020-02-15"), ymd("2020-06-07"), "day")) # filter on Sundays sundays <- df %>% mutate(wd = wday(d, week_start = 1)) %>% filter(wd == 7) %>% pull(d)
In the heatmap, the fill color of the tiles represents the mobility value. We set the color scale with scale_fill_*, in this case from the viridis scheme. In addition, the guides() function can modify the color bar of the legend. Finally, here we see the use of theme() to make additional changes on top of theme_minimal().
# heatmap ggplot(subset_europe, aes(date, region, fill = resi_real)) + geom_tile() + scale_x_date(breaks = sundays, date_labels = "%d %b") + scale_fill_viridis_c(option = "A", breaks = c(91, 146), labels = c("Less", "More"), direction = -1) + theme_minimal() + theme(legend.position = "top", title = element_text(size = 14), panel.grid.major.x = element_line(colour = "white", linetype = "dashed"), panel.grid.minor.x = element_blank(), panel.grid.major.y = element_blank(), panel.ontop = TRUE, plot.margin = margin(r = 1, unit = "cm")) + labs(y = "", x = "", fill = "", title = "Mobility trends for places of residence", caption = "Data: google.com/covid19/mobility/") + guides(fill = guide_colorbar(barwidth = 10, barheight = .5, label.position = "top", ticks = FALSE)) + coord_cartesian(expand = FALSE)
4.6 Apply functions on vectors or lists
The purrr package contains a set of functions for advanced functional programming with functions and vectors. The well-known lapply() family from base R corresponds to the map() family in this package. One of its biggest advantages is reducing the use of loops (for, etc.).
# list of two vectors vec_list <- list(x = 1:10, y = 50:70) # calculate the average for each one map(vec_list, mean) ## $x ## [1] 5.5 ## ## $y ## [1] 60 # change the output type map_* (dbl, chr, lgl, etc.) map_dbl(vec_list, mean) ## x y ## 5.5 60.0
Finally, a more complex example: we calculate the correlation coefficient between residential and park mobility for all European countries. To get a tidy summary of a model or test, we use the tidy() function of the broom package.
library(broom) # tidy outputs # custom function cor_test <- function(x, formula) { df <- cor.test(as.formula(formula), data = x) %>% tidy() return(df) } # prepare the data europe_reg <- filter(google_mobility, !is.na(sub_region_1), !is.na(residential_percent_change_from_baseline)) %>% left_join(wld, by = c("country_region_code"="iso_a2")) %>% filter(subregion %in% c("Northern Europe", "Southern Europe", "Western Europe", "Eastern Europe")) # apply the function to each country creating a list cor_mobility <- europe_reg %>% split(.$country_region_code) %>% map(cor_test, formula = "~ residential_percent_change_from_baseline + parks_percent_change_from_baseline") cor_mobility[1:5] ## $AT ## # A tibble: 1 x 8 ## estimate statistic p.value parameter conf.low conf.high method alternative ## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr> ## 1 -0.360 -12.3 2.68e-32 1009 -0.413 -0.305 Pearson'~ two.sided ## ## $BE ## # A tibble: 1 x 8 ## estimate statistic p.value parameter conf.low conf.high method alternative ## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr> ## 1 -0.312 -6.06 3.67e-9 340 -0.405 -0.213 Pearson'~ two.sided ## ## $BG ## # A tibble: 1 x 8 ## estimate statistic p.value parameter conf.low conf.high method alternative ## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr> ## 1 -0.677 -37.8 1.47e-227 1694 -0.702 -0.650 Pearson~ two.sided ## ## $CH ## # A tibble: 1 x 8 ## estimate statistic p.value parameter conf.low conf.high method alternative ## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr> ## 1 -0.0786 -2.91 0.00370 1360 -0.131 -0.0256 Pearson's~ two.sided ## ## $CZ ## # A tibble: 1 x 8 ## estimate statistic p.value parameter conf.low conf.high method alternative ## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr> ## 1 -0.0837 -3.35 0.000824 1593 -0.132 -0.0347 Pearson'~ two.sided
As we’ve seen before, there are variants of map_* that return an object of another class instead of a list; here map_df() returns a single bound data.frame.
cor_mobility <- europe_reg %>% split(.$country_region_code) %>% map_df(cor_test, formula = "~ residential_percent_change_from_baseline + parks_percent_change_from_baseline", .id = "country_code") arrange(cor_mobility, estimate) ## # A tibble: 27 x 9 ## country_code estimate statistic p.value parameter conf.low conf.high method ## <chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> ## 1 IT -0.831 -71.0 0. 2250 -0.844 -0.818 Pears~ ## 2 ES -0.825 -65.4 0. 2005 -0.839 -0.811 Pears~ ## 3 PT -0.729 -46.9 2.12e-321 1938 -0.749 -0.707 Pears~ ## 4 FR -0.698 -37.4 3.29e-216 1474 -0.723 -0.671 Pears~ ## 5 GR -0.692 -27.0 1.03e-114 796 -0.726 -0.654 Pears~ ## 6 BG -0.677 -37.8 1.47e-227 1694 -0.702 -0.650 Pears~ ## 7 RO -0.640 -56.0 0. 4517 -0.657 -0.623 Pears~ ## 8 SI -0.627 -11.4 1.98e- 23 200 -0.704 -0.535 Pears~ ## 9 HR -0.579 -21.9 9.32e- 87 954 -0.620 -0.536 Pears~ ## 10 LV -0.544 -6.87 3.84e- 10 112 -0.662 -0.401 Pears~ ## # ... with 17 more rows, and 1 more variable: alternative <chr>
Other practical examples can be found in this post or in this other one. More details can be found in the Cheat-Sheet of purrr.
.