Thoughts on Teaching R and Yet Another Tidyverse Intro

[This article was first published on R Bloggers on syknapptic, and kindly contributed to R-bloggers.]

Image credit to R Memes for Statistical Fiends


Considering this is a blog post, I’m going to get all bloggy here before jumping into the code.

Context

I recently had the opportunity to teach some R coding to colleagues and classmates in a series of workshops. Some had already dabbled in R or other programming languages, but it was the first time that the majority of participants had written a single line of code.

A few things happened in the week following the last session that I didn’t expect.

First, I saw a bit of R code written on a campus whiteboard that had nothing to do with me, but was straight out of the workshop. It may have come from some of my data-centric colleagues who use R, so I didn’t think too much of it.

Then, I overheard a conversation involving R from those in a program that doesn’t require any data-related coursework. Many folks are familiar with the name as the school’s primary data analysis course uses the {Rcmdr} GUI for statistical analysis, but these students would not have necessarily taken the course. I wondered if there was a connection.

Finally, a student who didn’t even attend came to my office hours asking for resources. Why exactly? Some of his work colleagues attended. It turns out that they are now trying to incorporate some R-powered analysis in their work and he doesn’t want to miss out.

The workshop consisted of 3 consecutive Fridays lasting 90 minutes each. That’s only a total of 4.5 hours.

That’s a relatively tiny amount of time.

Wait. Scratch that.

That’s a negligible amount of time.

… but it was enough to convince some participants and non-participants that they should take advantage of the power that a bit of data-centric coding can offer.

Reflection

I taught a similar 90 minute workshop last spring using R, but focused on base R and a few data types. 10 minutes in, I’m trying to explain the difference between a data.frame and a matrix and the person asking the question says something along the lines of “I guess I’m kinda dumb. Don’t worry about it.”

For context, these were international policy graduate students. While some have completed a bit of quantitative coursework, most don’t have a hardcore math or science background and programming is seen as something akin to wizardry. However, they hold domain expertise in some rather important subjects. These include WMD nonproliferation, international development, economic diplomacy, conflict studies, and environmental policy. Nearly half of the participants were international students and everyone is proficient in at least a second natural language. Most have already tackled big, complicated problems in their careers and the others are on their way to doing so following graduation. In a nutshell, they’re not dumb. The way I was teaching was dumb. They knew that they’re supposed to want to learn new skills, but they didn’t know why. Focusing on the “basics” didn’t show them anything immediately useful. It didn’t show them the why.

After the workshop, I never heard anyone mention R outside my circle of fellow data folks.

Since that time, I started using R more. Like, a lot more. I have found a way to use R in nearly everything I’ve done since May 2017. As a policy student myself, that has not always been very straightforward and I was still avoiding the strange “tidy” code I’d encounter on Stack Overflow and elsewhere. I realized the error of my ways when I came across Julia Silge and David Robinson’s Text Mining with R. It was like discovering that you’re still in the stone age while most people are off partying on spaceships.

In preparation for this workshop series, I found a lot of inspiration in Michael Levy’s presentation on teaching R, which itself echoes principles preached by other tidyverse advocates.

A huge takeaway: live coding works.

Writing code in real time shows every single step we make from opening the IDE, to reshaping the data, to debugging inevitable errors, to rendering a final report.

Within a few short weeks of learning to code, it might be surprising how many tiny steps become automatic and taken for granted. Tack on a couple more months and newcomers will think you’re speaking in an entirely different language because you’re explaining something requiring context they simply haven’t yet encountered. Add a few years and… yeesh.

Something that frustrated me when I first started is that code explanations often seem to be written in such a way that dismisses how difficult establishing the basics can be. I’m half-convinced that, for some folks, the trauma was so great that they have simply blocked it from memory. Code is intimidating enough, but if an instructor doesn’t make a conscious effort to empathize, students will question their ability to learn. The goal is empowerment, not intimidation.

Live coding enforces a maximum speed in moving through exercises, which not only gives students more time to digest what you’re doing, but also provides more opportunities for them to ask questions about details you might find trivial only because you already suffered through them.

I also think that the benefits of live coding extend to the instructor as well. I found myself answering questions that framed things in ways that I had not even considered, but were exactly how multiple participants saw the task. Additionally, I have a better sense of which concepts need to be covered in more detail, as they weren’t necessarily as intuitive for others as they were for me. On the flip-side, concepts with which I remember struggling may not be difficult at all for others to understand.


… and now that we got the bloggyness of a blog post out of the way…

Here is the workflow I used for the first session. The goal was to introduce the primary {dplyr} verbs, functions that accomplish tasks necessary in nearly every project. Between each section is an exercise using {ggplot2}.

tidyverse::tidyverse_logo()
## * __  _    __   .    o           *  . 
##  / /_(_)__/ /_ ___  _____ _______ ___ 
## / __/ / _  / // / |/ / -_) __(_-</ -_)
## \__/_/\_,_/\_, /|___/\__/_/ /___/\__/ 
##      *  . /___/      o      .       *


# install.packages("tidyverse")
library(tidyverse)

# install.packages("gapminder")
library(gapminder)
# loads the gapminder data set

## just to prettify printed tables when knitting
# install.packages("kableExtra")
library(knitr)
library(kableExtra)

Workflow


Resources Up Front

Data Carpentry




Plotting



Our Data

In the following exercises, gm.data.frame will be used to demonstrate actions that use {base} R methods for data.frame operations while gm_df will be used to demonstrate {tidyverse} methods for tibble operations.

gm.data.frame <- as.data.frame(gapminder)

gm_df <- gapminder

tibble

class(gm.data.frame)
## [1] "data.frame"
class(gm_df)
## [1] "tbl_df"     "tbl"        "data.frame"

tibbles are opinionated data.frames that keep everything that is helpful about data.frames, change some of their quirks, and add methods that make them even more useful.
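One of those quirks is easy to see with single-column subsetting. The sketch below uses a tiny made-up data set (toy_df and toy_tbl are illustrative names, not part of the workshop material) to show that `[` on a data.frame quietly drops down to a bare vector, while the same call on a tibble still returns a tibble.

```r
library(tibble)

# a tiny illustrative data set (hypothetical, not taken from gapminder)
toy_df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
toy_tbl <- as_tibble(toy_df)

# data.frame: pulling one column with [ drops to a plain integer vector
df_col <- toy_df[, "x"]
is.data.frame(df_col)

# tibble: the same call keeps returning a one-column tibble
tbl_col <- toy_tbl[, "x"]
is_tibble(tbl_col)
```

Getting the same type back every time is what makes tibbles easier to reason about inside longer chains of operations.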

Printing

Printing gm.data.frame dumps the whole data set to the console, typically requiring head() to limit the output.

head(gm.data.frame)
##       country continent year lifeExp      pop gdpPercap
## 1 Afghanistan      Asia 1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia 1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia 1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia 1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia 1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia 1977  38.438 14880372  786.1134

Printing gm_df provides the dimensions, data type of each column, and only prints the first 10 rows.

gm_df
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

%>%


The pipe (%>%) is used to chain operations together. Underneath the hood, it’s taking the value on the left-hand side of %>% and using it as the first argument of the function on the right-hand side of %>%.

For example, these 2 lines are doing the exact same thing.

head(gm_df)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
gm_df %>% head()
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
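The first-argument rule is perhaps easiest to see when the right-hand function takes extra arguments. This is a minimal sketch using a throwaway numeric vector (x is just an illustrative name): the piped value fills the first slot and everything else stays where you wrote it.

```r
library(magrittr)   # provides %>%; library(tidyverse) attaches it as well

x <- c(3.14159, 2.71828)

# the left-hand side becomes the FIRST argument; other arguments stay in place
piped  <- x %>% round(2)
direct <- round(x, 2)
identical(piped, direct)

# chained calls read left to right instead of inside out
chained <- x %>% sum() %>% round(1)
nested  <- round(sum(x), 1)
identical(chained, nested)
```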

For simple operations involving 1 function, %>% is only (arguably) beneficial in that it improves readability as the flow of operations goes from left to right.

%>% becomes truly useful when you need to perform multiple operations in succession, which is the vast majority of data carpentry.

As an arbitrary example, let’s say that we want to select the head() (first 6 rows) of gm.data.frame and convert it to a tibble.

Without %>%, we can do this in a few ways.

  1. Use intermediate variables.
    • get gm.data.frame’s head() and assign it to no_pipe_1
    • convert no_pipe_1 to a tibble with as_tibble() and assign it to no_pipe_2
no_pipe_1 <- head(gm.data.frame)

no_pipe_2 <- as_tibble(no_pipe_1)

no_pipe_2
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
## * <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
  2. Nest gm.data.frame inside of head(), which is itself nested inside of as_tibble().
as_tibble(head(gm.data.frame))
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
## * <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

With %>%, we can chain these actions together in the order in which they occur, which is also the way we read English.

  • Here, we do the same thing by:
    • taking gm_df
    • piping it to head() (keeping the top 6 rows)
    • piping it to as_tibble() (converting it to a tibble data frame)
gm_df %>% head() %>% as_tibble()
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

In practice, it’s usually best to place each of the functions on a separate line as it facilitates debugging and further improves readability.

gm_df %>%
  as_tibble() %>%
  head()
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

From here on, you’ll notice prettify(). This is only being used to print tables in a clean format when the document is knit()ted.

I’m choosing to include it here as I often find myself reading similar pages where I come across a really effective way to format some output. I understand why the author chooses to set echo=FALSE, but it can be nice to see the underlying code without having to hunt through their GitHub.

By default, prettify() prints a maximum of 3 rows for data.frames and a maximum of 10 rows for tibbles.

# Print the first `n` rows of a data frame as a styled HTML table.
# `cols_changed` / `rows_changed` optionally highlight column or row indices.
prettify <- function(df, n = NULL, cols_changed = NULL, rows_changed = NULL){
  # default row count: 10 for tibbles, 3 for plain data.frames
  if(is.null(n)) n <- if(is_tibble(df)) 10 else 3
  pretty_df <- df %>%
    head(n) %>%
    kable(format = "html") %>%
    kable_styling(bootstrap_options = c("striped", "bordered", "condensed",
                                        "hover", "responsive"),
                  full_width = FALSE)
  
  # highlight the requested columns, if any
  if(!is.null(cols_changed)){
    pretty_df <- pretty_df %>%
      column_spec(cols_changed, bold = TRUE, color = "black", background = "#C8FAE3")
  }
  
  # highlight the requested rows, if any
  if(!is.null(rows_changed)){
    pretty_df <- pretty_df %>%
      row_spec(rows_changed, bold = TRUE, color = "black", background = "#C8FAE3")
  }
  
  return(pretty_df)
}
gm.data.frame %>%
  prettify()
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
gm_df %>%
  prettify()
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
Afghanistan Asia 1967 34.020 11537966 836.1971
Afghanistan Asia 1972 36.088 13079460 739.9811
Afghanistan Asia 1977 38.438 14880372 786.1134
Afghanistan Asia 1982 39.854 12881816 978.0114
Afghanistan Asia 1987 40.822 13867957 852.3959
Afghanistan Asia 1992 41.674 16317921 649.3414
Afghanistan Asia 1997 41.763 22227415 635.3414

Sample Data

You’ll also see a toy data set for the introductory examples that start each section.

sample_countries <- c("Tunisia", "Nicaragua", "Singapore", "Hungary",
                      "New Zealand", "Nigeria", "Brazil", "Sri Lanka",
                      "Ireland", "Australia")
  
sample_df <- gm_df %>%
  filter(year == 2007,
         country %in% sample_countries)

sample_df %>%     
  prettify()
country continent year lifeExp pop gdpPercap
Australia Oceania 2007 81.235 20434176 34435.367
Brazil Americas 2007 72.390 190010647 9065.801
Hungary Europe 2007 73.338 9956108 18008.944
Ireland Europe 2007 78.885 4109086 40675.996
New Zealand Oceania 2007 80.204 4115771 25185.009
Nicaragua Americas 2007 72.899 5675356 2749.321
Nigeria Africa 2007 46.859 135031164 2013.977
Singapore Asia 2007 79.972 4553009 47143.180
Sri Lanka Asia 2007 72.396 20378239 3970.095
Tunisia Africa 2007 73.923 10276158 7092.923

“Tidy” Data

If you’re unsure of what “Tidy” data is actually describing and want to learn more, you can read Hadley Wickham’s article here. Otherwise, these graphics are likely the most concise explanation you’ll find.
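As a rough illustration of the idea, here is a sketch that starts from a "wide" table where the years are hidden in the column names and reshapes it so that each variable has its own column and each observation has its own row. Everything in it is an assumption made for the example: the countries "A" and "B", the numbers, and the choice of {tidyr}'s pivot_longer() as the reshaping tool.

```r
library(dplyr)
library(tidyr)

# made-up "wide" data: one column per year, so `year` only exists in the names
wide_df <- data.frame(
  country = c("A", "B"),
  `1952`  = c(10, 20),
  `2007`  = c(30, 40),
  check.names = FALSE
)

# tidy version: one row per country-year, one column per variable
tidy_df <- wide_df %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp")

tidy_df
```

The tidy version has 4 rows (2 countries x 2 years) and 3 columns, which is the shape that the {dplyr} verbs and {ggplot2} expect.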

With tibbles, %>%, and the concept of tidy data covered, let’s take a dive.

{dplyr}

{dplyr} provides a grammar of data manipulation and a set of verb functions that solve most common data carpentry challenges in a consistent fashion.

  • glimpse()
  • select()
  • filter()
  • arrange()
  • mutate()
  • summarize()
  • group_by()

Taking a glimpse()

In addition to the summary(), dim()ensions, and str()ucture functions that can be used to inspect data, you can now use {dplyr}’s glimpse().

summary(gm.data.frame)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
dim(gm.data.frame)
## [1] 1704    6
str(gm.data.frame)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
glimpse(gm_df)
## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

select() columns

Quick Example

Initial Data

sample_df %>%
  prettify()
country continent year lifeExp pop gdpPercap
Australia Oceania 2007 81.235 20434176 34435.367
Brazil Americas 2007 72.390 190010647 9065.801
Hungary Europe 2007 73.338 9956108 18008.944
Ireland Europe 2007 78.885 4109086 40675.996
New Zealand Oceania 2007 80.204 4115771 25185.009
Nicaragua Americas 2007 72.899 5675356 2749.321
Nigeria Africa 2007 46.859 135031164 2013.977
Singapore Asia 2007 79.972 4553009 47143.180
Sri Lanka Asia 2007 72.396 20378239 3970.095
Tunisia Africa 2007 73.923 10276158 7092.923

End Data

sample_df %>%
  select(country, pop) %>%
  prettify()
country pop
Australia 20434176
Brazil 190010647
Hungary 9956108
Ireland 4109086
New Zealand 4115771
Nicaragua 5675356
Nigeria 135031164
Singapore 4553009
Sri Lanka 20378239
Tunisia 10276158

The select() family is used to choose columns to keep. You can use bare (unquoted) names.

  • select() columns by specific names.
    • select only gm_df’s country, year, and pop columns
gm_df %>%
  select(country, year, pop) %>%            # select columns by specific names
  prettify()
country year pop
Afghanistan 1952 8425333
Afghanistan 1957 9240934
Afghanistan 1962 10267083
Afghanistan 1967 11537966
Afghanistan 1972 13079460
Afghanistan 1977 14880372
Afghanistan 1982 12881816
Afghanistan 1987 13867957
Afghanistan 1992 16317921
Afghanistan 1997 22227415
  • select() a range of columns by name
    • select gm_df’s continent column and all columns from lifeExp to gdpPercap
gm_df %>%
  select(continent, lifeExp:gdpPercap) %>%  # select columns name range
  prettify()
continent lifeExp pop gdpPercap
Asia 28.801 8425333 779.4453
Asia 30.332 9240934 820.8530
Asia 31.997 10267083 853.1007
Asia 34.020 11537966 836.1971
Asia 36.088 13079460 739.9811
Asia 38.438 14880372 786.1134
Asia 39.854 12881816 978.0114
Asia 40.822 13867957 852.3959
Asia 41.674 16317921 649.3414
Asia 41.763 22227415 635.3414
  • deselect a column with -
    • select() all of gm_df’s columns except lifeExp
gm_df %>%
  select(-lifeExp) %>%                      # deselect column by name
  prettify()
country continent year pop gdpPercap
Afghanistan Asia 1952 8425333 779.4453
Afghanistan Asia 1957 9240934 820.8530
Afghanistan Asia 1962 10267083 853.1007
Afghanistan Asia 1967 11537966 836.1971
Afghanistan Asia 1972 13079460 739.9811
Afghanistan Asia 1977 14880372 786.1134
Afghanistan Asia 1982 12881816 978.0114
Afghanistan Asia 1987 13867957 852.3959
Afghanistan Asia 1992 16317921 649.3414
Afghanistan Asia 1997 22227415 635.3414
  • deselect a range of columns by name
    • select() all of gm_df’s columns except those between lifeExp and gdpPercap
gm_df %>%
  select(-c(lifeExp:gdpPercap)) %>%         # deselect column by name range
  prettify()
country continent year
Afghanistan Asia 1952
Afghanistan Asia 1957
Afghanistan Asia 1962
Afghanistan Asia 1967
Afghanistan Asia 1972
Afghanistan Asia 1977
Afghanistan Asia 1982
Afghanistan Asia 1987
Afghanistan Asia 1992
Afghanistan Asia 1997
  • select() column by index
    • select() gm_df’s 4th column
gm_df %>%
  select(4) %>%                             # select column by index
  prettify()
lifeExp
28.801
30.332
31.997
34.020
36.088
38.438
39.854
40.822
41.674
41.763
  • deselect a column by index
    • select() all of gm_df’s columns except for the 4th column
gm_df %>%
  select(-4) %>%                         # deselect column by index
  prettify()
country continent year pop gdpPercap
Afghanistan Asia 1952 8425333 779.4453
Afghanistan Asia 1957 9240934 820.8530
Afghanistan Asia 1962 10267083 853.1007
Afghanistan Asia 1967 11537966 836.1971
Afghanistan Asia 1972 13079460 739.9811
Afghanistan Asia 1977 14880372 786.1134
Afghanistan Asia 1982 12881816 978.0114
Afghanistan Asia 1987 13867957 852.3959
Afghanistan Asia 1992 16317921 649.3414
Afghanistan Asia 1997 22227415 635.3414
  • deselect a range of columns by index
    • select() all of gm_df’s columns except those between the 3rd and 5th columns
gm_df %>%
  select(-c(3:5)) %>%                    # deselect columns by index range
  prettify()
country continent gdpPercap
Afghanistan Asia 779.4453
Afghanistan Asia 820.8530
Afghanistan Asia 853.1007
Afghanistan Asia 836.1971
Afghanistan Asia 739.9811
Afghanistan Asia 786.1134
Afghanistan Asia 978.0114
Afghanistan Asia 852.3959
Afghanistan Asia 649.3414
Afghanistan Asia 635.3414
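For anyone coming from {base} R, it may help to see select() next to bracket subsetting. This is only a sketch: toy_df is a hand-typed copy of three sample_df rows so the code runs on its own, and the base R lines are one of several ways to get the same columns.

```r
library(dplyr)

# three rows copied by hand from sample_df so this runs without gapminder
toy_df <- data.frame(
  country   = c("Australia", "Brazil", "Hungary"),
  continent = c("Oceania", "Americas", "Europe"),
  year      = c(2007L, 2007L, 2007L),
  lifeExp   = c(81.235, 72.390, 73.338),
  pop       = c(20434176L, 190010647L, 9956108L),
  gdpPercap = c(34435.367, 9065.801, 18008.944)
)

# keep columns by name: quoted names in base R, bare names in select()
by_name_base  <- toy_df[, c("country", "pop")]
by_name_dplyr <- toy_df %>% select(country, pop)
identical(names(by_name_base), names(by_name_dplyr))

# drop a range of columns: base R has to spell out the names to exclude
drop_base  <- toy_df[, !(names(toy_df) %in% c("lifeExp", "pop", "gdpPercap"))]
drop_dplyr <- toy_df %>% select(-c(lifeExp:gdpPercap))
identical(names(drop_base), names(drop_dplyr))
```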

ggplot() Exercise 1

{ggplot2} is a monster of a package used for data visualization that follows The Grammar of Graphics.

{ggplot2} takes R’s powerful graphics capabilities and makes them more accessible by taking care of many plotting tasks that are often tedious, while still allowing for lower-level customization.

  • Basic Setup

ggplot(<your data>, aes(x = <x values>, y = <y values>)) +
  geom_boxplot()    # the type of plot geometry desired

Steps

  1. Using gm_df, select the lifeExp column
  2. Pipe (%>%) the result to ggplot()
  3. Select the plot’s aes()thetic values
    • lifeExp for the x values
      • a histogram’s y are counts of its x values, so we don’t provide them here
  4. Add geom_histogram() as the geometry of the plot
gm_df %>%                                     # data frame: Data
  select(lifeExp) %>%                         # columns to keep: Data
  ggplot(aes(x = lifeExp)) +                  # x values: Aesthetics
  geom_histogram()                            # histogram: Geometries

Figure 1: Histogram of lifeExp
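Called with no arguments, geom_histogram() falls back to 30 bins and prints a message suggesting you pick a binwidth yourself. Here is a minimal sketch of doing that; life_df is a hand-typed copy of the ten 2007 lifeExp values from sample_df, and the 5-year bin width is an arbitrary choice for illustration.

```r
library(ggplot2)

# the ten 2007 lifeExp values from sample_df, typed in by hand
life_df <- data.frame(
  lifeExp = c(81.235, 72.390, 73.338, 78.885, 80.204,
              72.899, 46.859, 79.972, 72.396, 73.923)
)

p <- ggplot(life_df, aes(x = lifeExp)) +    # x values: Aesthetics
  geom_histogram(binwidth = 5)              # each bar spans 5 years: Geometries

p
```

Storing the plot in p before printing it is optional, but it makes it easy to add more layers later with +.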

filter() Rows

Quick Example

Initial Data

sample_df %>%
  select(country, lifeExp) %>%
  prettify()
country lifeExp
Australia 81.235
Brazil 72.390
Hungary 73.338
Ireland 78.885
New Zealand 80.204
Nicaragua 72.899
Nigeria 46.859
Singapore 79.972
Sri Lanka 72.396
Tunisia 73.923

End Data

sample_df %>%
  select(country, lifeExp) %>%
  filter(lifeExp > 75) %>%
  prettify(cols_changed = 2)
country lifeExp
Australia 81.235
Ireland 78.885
New Zealand 80.204
Singapore 79.972

Use filter() to select rows using logic. Rows where a logical expression returns TRUE are kept and others are dropped.

  • filter() rows where numeric values are greater than or less than another value
    • filter() gm_df to only keep rows where gdpPercap < 500
gm_df %>%
  filter(gdpPercap < 500) %>%
  prettify(cols_changed = 6)
country continent year lifeExp pop gdpPercap
Burundi Africa 1952 39.031 2445618 339.2965
Burundi Africa 1957 40.533 2667518 379.5646
Burundi Africa 1962 42.045 2961915 355.2032
Burundi Africa 1967 43.548 3330989 412.9775
Burundi Africa 1972 44.057 3529983 464.0995
Burundi Africa 1997 45.326 6121610 463.1151
Burundi Africa 2002 47.360 7021078 446.4035
Burundi Africa 2007 49.580 8390505 430.0707
Cambodia Asia 1952 39.417 4693836 368.4693
Cambodia Asia 1957 41.366 5322536 434.0383
  • filter() rows using multiple logical expressions where all must be TRUE
    • filter() gm_df to only keep rows where year > 1990 and lifeExp < 40
    • , and & are evaluated identically in filter()
gm_df %>%
  filter(year > 1990, lifeExp < 40) %>%
  prettify(cols_changed = 3:4)
country continent year lifeExp pop gdpPercap
Rwanda Africa 1992 23.599 7290203 737.0686
Rwanda Africa 1997 36.087 7212583 589.9445
Sierra Leone Africa 1992 38.333 4260884 1068.6963
Sierra Leone Africa 1997 39.897 4578212 574.6482
Somalia Africa 1992 39.658 6099799 926.9603
Swaziland Africa 2007 39.613 1133066 4513.4806
Zambia Africa 2002 39.193 10595811 1071.6139
Zimbabwe Africa 2002 39.989 11926563 672.0386
  • filter() rows using multiple logical expressions where one must be TRUE
    • filter() gm_df to only keep rows where pop < 10000 or gdpPercap > 100000
    • | means or
gm_df %>%
  filter(pop < 10000 | gdpPercap > 100000) %>%
  prettify(cols_changed = 5:6)
country continent year lifeExp pop gdpPercap
Kuwait Asia 1952 55.565 160000 108382.4
Kuwait Asia 1957 58.033 212846 113523.1
Kuwait Asia 1972 67.712 841934 109347.9
  • filter() rows using a string
    • filter() gm_df to only keep rows where year is 1997 and continent is "Europe"
    • == means is equal to
gm_df %>%
  filter(year == 1997 & continent == "Europe") %>%
  prettify(cols_changed = 2:3)
country continent year lifeExp pop gdpPercap
Albania Europe 1997 72.950 3428038 3193.055
Austria Europe 1997 77.510 8069876 29095.921
Belgium Europe 1997 77.530 10199787 27561.197
Bosnia and Herzegovina Europe 1997 73.244 3607000 4766.356
Bulgaria Europe 1997 70.320 8066057 5970.389
Croatia Europe 1997 73.680 4444595 9875.605
Czech Republic Europe 1997 74.010 10300707 16048.514
Denmark Europe 1997 76.110 5283663 29804.346
Finland Europe 1997 77.130 5134406 23723.950
France Europe 1997 78.640 58623428 25889.785
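To double-check the claim that , and & behave the same inside filter(), and to see | next to them, here is a small sketch. toy_df is a hand-typed copy of a few sample_df columns, and the thresholds are arbitrary values picked for the example.

```r
library(dplyr)

# country, lifeExp, pop, and gdpPercap copied by hand from sample_df
toy_df <- data.frame(
  country   = c("Australia", "Brazil", "Hungary", "Ireland", "New Zealand",
                "Nicaragua", "Nigeria", "Singapore", "Sri Lanka", "Tunisia"),
  lifeExp   = c(81.235, 72.390, 73.338, 78.885, 80.204,
                72.899, 46.859, 79.972, 72.396, 73.923),
  pop       = c(20434176, 190010647, 9956108, 4109086, 4115771,
                5675356, 135031164, 4553009, 20378239, 10276158),
  gdpPercap = c(34435.367, 9065.801, 18008.944, 40675.996, 25185.009,
                2749.321, 2013.977, 47143.180, 3970.095, 7092.923)
)

# both conditions must be TRUE: comma and & give the same rows
with_comma <- toy_df %>% filter(lifeExp > 75, gdpPercap > 30000)
with_and   <- toy_df %>% filter(lifeExp > 75 & gdpPercap > 30000)
identical(with_comma, with_and)
with_comma$country
# → "Australia" "Ireland" "Singapore"

# at least one condition must be TRUE
with_or <- toy_df %>% filter(lifeExp < 50 | pop > 100000000)
with_or$country
# → "Brazil" "Nigeria"

# the {base} R equivalent repeats the data frame name in every condition
base_and <- toy_df[toy_df$lifeExp > 75 & toy_df$gdpPercap > 30000, ]
```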

ggplot() Exercise 2

Steps

  1. Using gm_df, select the continent, country, and gdpPercap columns
  2. filter() the rows to only keep those where continent == "Oceania"
  3. Pipe (%>%) the result to ggplot()
  4. Select the plot’s aes()thetic values
    • country for the x values
    • gdpPercap for the y values
  5. Add geom_boxplot() as the geometry of the plot
gm_df %>%                                         # data frame: Data
  select(continent, country, gdpPercap) %>%       # columns to keep: Data
  filter(continent == "Oceania") %>%              # rows to keep: Data
  ggplot(aes(x = country, y = gdpPercap)) +       # x and y values: Aesthetics
  geom_boxplot()                                  # box plot: Geometries

mutate() Columns

Quick Example

Initial Data

sample_df %>%
  select(country, pop) %>%
  prettify()
country pop
Australia 20434176
Brazil 190010647
Hungary 9956108
Ireland 4109086
New Zealand 4115771
Nicaragua 5675356
Nigeria 135031164
Singapore 4553009
Sri Lanka 20378239
Tunisia 10276158

End Data

sample_df %>%
  select(country, pop) %>%
  mutate(pop_in_thousands = pop / 1000) %>%
  prettify(cols_changed = 3)
country pop pop_in_thousands
Australia 20434176 20434.176
Brazil 190010647 190010.647
Hungary 9956108 9956.108
Ireland 4109086 4109.086
New Zealand 4115771 4115.771
Nicaragua 5675356 5675.356
Nigeria 135031164 135031.164
Singapore 4553009 4553.009
Sri Lanka 20378239 20378.239
Tunisia 10276158 10276.158

Use mutate() to manipulate column values and create new columns.

In order to mutate() a column, use the name of the column you are manipulating and set its value using =.

Here’s a silly example:

  • Add a new column to gm_df
    • mutate() gm_df to create a column named planet and set its value to "Earth"
gm_df %>%
  mutate(planet = "Earth") %>%
  prettify(cols_changed = 7)
country continent year lifeExp pop gdpPercap planet
Afghanistan Asia 1952 28.801 8425333 779.4453 Earth
Afghanistan Asia 1957 30.332 9240934 820.8530 Earth
Afghanistan Asia 1962 31.997 10267083 853.1007 Earth
Afghanistan Asia 1967 34.020 11537966 836.1971 Earth
Afghanistan Asia 1972 36.088 13079460 739.9811 Earth
Afghanistan Asia 1977 38.438 14880372 786.1134 Earth
Afghanistan Asia 1982 39.854 12881816 978.0114 Earth
Afghanistan Asia 1987 40.822 13867957 852.3959 Earth
Afghanistan Asia 1992 41.674 16317921 649.3414 Earth
Afghanistan Asia 1997 41.763 22227415 635.3414 Earth

Since we have gdpPercap and pop, we can calculate the values for a total_GDP column.

  • mutate() gm_df to set the results of a calculation on each row to a new column
    • multiply pop * gdpPercap and assign the result to total_GDP inside mutate()
gm_df %>%
  mutate(total_GDP = pop * gdpPercap) %>%
  prettify(cols_changed = 7)
country continent year lifeExp pop gdpPercap total_GDP
Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330
Afghanistan Asia 1957 30.332 9240934 820.8530 7585448670
Afghanistan Asia 1962 31.997 10267083 853.1007 8758855797
Afghanistan Asia 1967 34.020 11537966 836.1971 9648014150
Afghanistan Asia 1972 36.088 13079460 739.9811 9678553274
Afghanistan Asia 1977 38.438 14880372 786.1134 11697659231
Afghanistan Asia 1982 39.854 12881816 978.0114 12598563401
Afghanistan Asia 1987 40.822 13867957 852.3959 11820990309
Afghanistan Asia 1992 41.674 16317921 649.3414 10595901589
Afghanistan Asia 1997 41.763 22227415 635.3414 14121995875
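Two details of mutate() are worth a quick sketch: assigning to an existing column name overwrites that column, and later expressions in the same mutate() call already see the columns created or changed before them. toy_df below is a hand-typed copy of three sample_df rows, and the new column names are made up for the example.

```r
library(dplyr)

# three rows copied by hand from sample_df
toy_df <- data.frame(
  country   = c("Australia", "Brazil", "Hungary"),
  pop       = c(20434176, 190010647, 9956108),
  gdpPercap = c(34435.367, 9065.801, 18008.944)
)

result <- toy_df %>%
  mutate(
    pop           = pop / 1e6,              # overwrites pop: now in millions
    total_GDP     = pop * 1e6 * gdpPercap,  # sees the NEW pop, so scale it back up
    total_GDP_bil = total_GDP / 1e9         # uses the column created one line earlier
  )

names(result)
```

Because pop is overwritten rather than added, the result has five columns instead of six.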

Typically, mutate() is used to perform operations across columns in each individual row. You can also use summary functions to perform operations on individual columns (acting as vectors) that result in a vector that can be assigned to a column.

Makes sense, right??

Let’s calculate the z-score of each gdpPercap value for a specific year.

\[ z_i = \frac{x_i - \mu_x}{\sigma_x} \]

  • \(x\) = gdpPercap
  • \(\mu_x\) = the mean of \(x\) = mean(gdpPercap)
  • \(\sigma_x\) = the standard deviation of \(x\) = sd(gdpPercap)

  • Use a summary function to perform a calculation involving summary statistics of a column
    • subtract mean(gdpPercap) from gdpPercap
    • divide the result by sd(gdpPercap)
    • set the results as the values of a new column called gdp_per_cap_z_score
gm_df %>%
  filter(year == 1977) %>%
  mutate(gdp_per_cap_z_score = (gdpPercap - mean(gdpPercap)) / sd(gdpPercap)) %>%
  prettify(cols_changed = 7)
country continent year lifeExp pop gdpPercap gdp_per_cap_z_score
Afghanistan Asia 1977 38.438 14880372 786.1134 -0.7805156
Albania Europe 1977 68.930 2509048 3533.0039 -0.4520380
Algeria Africa 1977 58.014 17152804 4910.4168 -0.2873247
Angola Africa 1977 39.483 6162675 3008.6474 -0.5147414
Argentina Americas 1977 68.481 26983828 10079.0267 0.3307461
Australia Oceania 1977 73.490 14074100 18334.1975 1.3179128
Austria Europe 1977 72.170 7568430 19749.4223 1.4871476
Bahrain Asia 1977 65.593 297410 19340.1020 1.4382004
Bangladesh Asia 1977 46.923 80428306 659.8772 -0.7956111
Belgium Europe 1977 72.800 9821800 19117.9745 1.4116381
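
One way to sanity-check a z-score column: it should have a mean of (roughly) 0 and a standard deviation of 1, and it should agree with base R's scale(). The sketch below assumes a toy data frame (toy_1977 is a made-up name) built from the first five gdpPercap values in the output above. Because it only uses five rows, its z-scores will not match the table, which was computed over every country in 1977.

```r
library(dplyr)

# Toy data (hypothetical name): first five rows of the 1977 output above
toy_1977 <- data.frame(
  country   = c("Afghanistan", "Albania", "Algeria", "Angola", "Argentina"),
  gdpPercap = c(786.1134, 3533.0039, 4910.4168, 3008.6474, 10079.0267),
  stringsAsFactors = FALSE
)

toy_z <- toy_1977 %>%
  mutate(
    z_manual = (gdpPercap - mean(gdpPercap)) / sd(gdpPercap),  # formula from the text
    z_scale  = as.numeric(scale(gdpPercap))                    # base R equivalent
  )

mean(toy_z$z_manual)  # approximately 0
sd(toy_z$z_manual)    # approximately 1
```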

Here are other functions that can be used similarly:

Summary Functions
first() min()
last() max()
nth() mean()
n() median()
n_distinct() var()
IQR() sd()
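
Here is a small sketch of a few of these in action. It assumes a toy data frame (toy_df and toy_summary are made-up names) with four rows copied from the tables above. Note that n() takes no arguments and only works inside dplyr verbs such as mutate().

```r
library(dplyr)

# Toy data (hypothetical name): four rows copied from the 1977 output above
toy_df <- data.frame(
  country   = c("Afghanistan", "Albania", "Algeria", "Angola"),
  continent = c("Asia", "Europe", "Africa", "Africa"),
  gdpPercap = c(786.1134, 3533.0039, 4910.4168, 3008.6474),
  stringsAsFactors = FALSE
)

toy_summary <- toy_df %>%
  mutate(
    first_country = first(country),         # value in the first row
    last_country  = last(country),          # value in the last row
    second_gdp    = nth(gdpPercap, 2),      # value in the second row
    n_rows        = n(),                    # number of rows
    n_continents  = n_distinct(continent),  # number of unique continents
    gdp_range     = max(gdpPercap) - min(gdpPercap)
  )

toy_summary
```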

ggplot() Exercise 3

Steps

  1. Using gm_df, select() country, year, and gdpPercap
  2. filter() the rows to keep only those where country is "Korea, Rep.", "Korea, Dem. Rep.", "Japan", or "China"
  3. Pipe the result to ggplot()
  4. Select the plot’s aes()thetic values
    • year for the x values
    • gdpPercap for the y values
    • country for the color values
  5. Add geom_line() as the geometry of the plot
  6. Add a title to the plot with labs()
gm_df %>%
  select(country, year, gdpPercap) %>%
  filter(country %in% c("Korea, Rep.", "Korea, Dem. Rep.", "Japan", "China")) %>%
  ggplot(aes(x = year, y = gdpPercap, color = country)) +
  geom_line() +
  labs(title = "GDP Over Time")
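
A quick aside on %in%, since it is easy to reach for == instead. The sketch below uses a made-up toy data frame (toy_df, targets, and kept are hypothetical names). %in% asks "is this value anywhere in the set?", whereas == compares the two vectors position by position, which is rarely what you want when filtering against a list of names.

```r
library(dplyr)

# Toy data (hypothetical names), not the full gm_df
toy_df <- data.frame(
  country = c("China", "Japan", "Korea, Rep.", "Mexico"),
  year    = c(2007, 2007, 2007, 2007),
  stringsAsFactors = FALSE
)

targets <- c("Korea, Rep.", "Korea, Dem. Rep.", "Japan", "China")

# Membership test: one TRUE/FALSE per row of toy_df
toy_df$country %in% targets   # TRUE TRUE TRUE FALSE

# Position-by-position comparison: "China" vs "Korea, Rep.", and so on
toy_df$country == targets     # FALSE FALSE FALSE FALSE

kept <- toy_df %>%
  filter(country %in% targets)

kept
```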

arrange() Rows

Quick Example

Initial Data

sample_df %>%
  select(country, gdpPercap) %>%
  prettify()
country gdpPercap
Australia 34435.367
Brazil 9065.801
Hungary 18008.944
Ireland 40675.996
New Zealand 25185.009
Nicaragua 2749.321
Nigeria 2013.977
Singapore 47143.180
Sri Lanka 3970.095
Tunisia 7092.923

End Data

sample_df %>%
  select(country, gdpPercap) %>%
  arrange(gdpPercap) %>%
  prettify(cols_changed = 2)
country gdpPercap
Nigeria 2013.977
Nicaragua 2749.321
Sri Lanka 3970.095
Tunisia 7092.923
Brazil 9065.801
Hungary 18008.944
New Zealand 25185.009
Australia 34435.367
Ireland 40675.996
Singapore 47143.180

Use arrange() to sort rows.

  • arrange() by ascending number (smallest to largest)
    • arrange() gm_df’s pop column so that smallest populations are on top
gm_df %>%
  arrange(pop) %>%
  prettify(cols_changed = 5)
country continent year lifeExp pop gdpPercap
Sao Tome and Principe Africa 1952 46.471 60011 879.5836
Sao Tome and Principe Africa 1957 48.945 61325 860.7369
Djibouti Africa 1952 34.812 63149 2669.5295
Sao Tome and Principe Africa 1962 51.893 65345 1071.5511
Sao Tome and Principe Africa 1967 54.425 70787 1384.8406
Djibouti Africa 1957 37.328 71851 2864.9691
Sao Tome and Principe Africa 1972 56.480 76595 1532.9853
Sao Tome and Principe Africa 1977 58.550 86796 1737.5617
Djibouti Africa 1962 39.693 89898 3020.9893
Sao Tome and Principe Africa 1982 60.351 98593 1890.2181
  • arrange() by desc() number (largest to smallest)
    • arrange() the lifeExp column so that largest values are on top
gm_df %>%
  arrange(desc(lifeExp)) %>%
  prettify(cols_changed = 4)
country continent year lifeExp pop gdpPercap
Japan Asia 2007 82.603 127467972 31656.07
Hong Kong, China Asia 2007 82.208 6980412 39724.98
Japan Asia 2002 82.000 127065841 28604.59
Iceland Europe 2007 81.757 301931 36180.79
Switzerland Europe 2007 81.701 7554661 37506.42
Hong Kong, China Asia 2002 81.495 6762476 30209.02
Australia Oceania 2007 81.235 20434176 34435.37
Spain Europe 2007 80.941 40448191 28821.06
Sweden Europe 2007 80.884 9031088 33859.75
Israel Asia 2007 80.745 6426679 25523.28
  • arrange() alphabetically
    • filter() gm_df to keep only those rows where year == 2007 and continent == "Americas"
    • arrange() the country column alphabetically
gm_df %>%
  filter(year == 2007, continent == "Americas") %>%
  arrange(country) %>%
  prettify(cols_changed = 2:3)
country continent year lifeExp pop gdpPercap
Argentina Americas 2007 75.320 40301927 12779.380
Bolivia Americas 2007 65.554 9119152 3822.137
Brazil Americas 2007 72.390 190010647 9065.801
Canada Americas 2007 80.653 33390141 36319.235
Chile Americas 2007 78.553 16284741 13171.639
Colombia Americas 2007 72.889 44227550 7006.580
Costa Rica Americas 2007 78.782 4133884 9645.061
Cuba Americas 2007 78.273 11416987 8948.103
Dominican Republic Americas 2007 72.235 9319622 6025.375
Ecuador Americas 2007 74.994 13755680 6873.262
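
The examples above each sort by a single column, but arrange() also accepts several columns at once, with later columns breaking ties in earlier ones. This goes a little beyond what the post covers, so treat it as a sketch: toy_df and toy_sorted are made-up names, and the rows are copied (in shuffled order) from the lifeExp output above.

```r
library(dplyr)

# Toy data (hypothetical name): 2007 rows from the lifeExp output, shuffled
toy_df <- data.frame(
  country   = c("Switzerland", "Australia", "Hong Kong, China", "Iceland", "Japan"),
  continent = c("Europe", "Oceania", "Asia", "Europe", "Asia"),
  lifeExp   = c(81.701, 81.235, 82.208, 81.757, 82.603),
  stringsAsFactors = FALSE
)

toy_sorted <- toy_df %>%
  arrange(continent, desc(lifeExp))  # continents A to Z, then highest lifeExp first

toy_sorted
```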

group_by() for Grouped Data

Quick Example

Initial Data

sample_df %>%
  select(country, continent, pop) %>%
  prettify()
country continent pop
Australia Oceania 20434176
Brazil Americas 190010647
Hungary Europe 9956108
Ireland Europe 4109086
New Zealand Oceania 4115771
Nicaragua Americas 5675356
Nigeria Africa 135031164
Singapore Asia 4553009
Sri Lanka Asia 20378239
Tunisia Africa 10276158

End Data

sample_df %>%
  select(country, continent, pop) %>%
  group_by(continent) %>%
  mutate(pop_by_continent = sum(pop)) %>%
  ungroup() %>%
  arrange(pop_by_continent) %>%
  prettify(cols_changed = 4)
country continent pop pop_by_continent
Hungary Europe 9956108 14065194
Ireland Europe 4109086 14065194
Australia Oceania 20434176 24549947
New Zealand Oceania 4115771 24549947
Singapore Asia 4553009 24931248
Sri Lanka Asia 20378239 24931248
Nigeria Africa 135031164 145307322
Tunisia Africa 10276158 145307322
Brazil Americas 190010647 195686003
Nicaragua Americas 5675356 195686003

group_by() allows us to group rows together based on column values.

Let’s say we wanted to compute summary values for each country for all years.

  • Calculate the median_gdp_per_cap of each country with group_by()
    • take gm_df and group_by() country to group rows of the same country together
    • use median() to calculate the median_gdp_per_cap
    • ungroup() the rows
      • a habit worth building: later steps then operate on the whole data frame instead of within each group
    • keep only those rows with distinct() combinations of country and median_gdp_per_cap
      • distinct()’s default is to only keep columns used as arguments
gm_df %>%
  group_by(country) %>%
  mutate(median_gdp_per_cap = median(gdpPercap)) %>%
  ungroup() %>%
  distinct(country, median_gdp_per_cap) %>%
  prettify(cols_changed = 2)
country median_gdp_per_cap
Afghanistan 803.4832
Albania 3253.2384
Algeria 4853.8559
Angola 3264.6288
Argentina 9068.7844
Australia 18905.6034
Austria 20673.2530
Bahrain 18779.8016
Bangladesh 703.7638
Belgium 20048.9102

ggplot() Exercise 4

Steps

  1. Using gm_df, group_by() the continent and year
  2. mutate() to add a column called mean_gdp for the average gdpPercap of each continent in each year
  3. ungroup() the data, because this is a habit that will save you headaches later
  4. Keep only distinct() combinations of continent, year, and mean_gdp
  5. Pipe the result to ggplot()
  6. Select the plot’s aes()thetic values
    • year for the x values
    • mean_gdp for the y values
    • continent for the color values
  7. Add geom_line() as the geometry of the plot
  8. Add a title and a caption (for the source of the data) to the plot with labs()
gm_df %>%
  group_by(year, continent) %>%
  mutate(mean_gdp = mean(gdpPercap)) %>%
  ungroup() %>%
  distinct(continent, year, mean_gdp) %>%
  ggplot(aes(x = year, y = mean_gdp, color = continent)) +
  geom_line() +
  labs(title = "Mean GDPs by Continent Over Time",
       caption = "Source: Free material from www.gapminder.org")

summarize()

Quick Example

Initial Data

sample_df %>%
  select(country, continent, lifeExp, pop) %>%
  prettify()
country continent lifeExp pop
Australia Oceania 81.235 20434176
Brazil Americas 72.390 190010647
Hungary Europe 73.338 9956108
Ireland Europe 78.885 4109086
New Zealand Oceania 80.204 4115771
Nicaragua Americas 72.899 5675356
Nigeria Africa 46.859 135031164
Singapore Asia 79.972 4553009
Sri Lanka Asia 72.396 20378239
Tunisia Africa 73.923 10276158

End Data

sample_df %>%
  select(country, continent, lifeExp, pop) %>%
  group_by(continent) %>%
  summarise(max_pop = max(pop),
            mean_life_exp = mean(lifeExp)) %>%
  prettify(cols_changed = 2:3)
continent max_pop mean_life_exp
Africa 135031164 60.3910
Americas 190010647 72.6445
Asia 20378239 76.1840
Europe 9956108 76.1115
Oceania 20434176 80.7195

Now that we know how to use group_by(), we can summarize() data by group. This can be done using all of the summary functions seen earlier.

Summary Functions
first() min()
last() max()
nth() mean()
n() median()
n_distinct() var()
IQR() sd()
  • Calculate some summary statistics for each continent.
    • take gm_df and group_by() continent
    • using summarize() or summarise(), calculate:
      • count with n()
      • mean_pop with mean()
      • max_gdp_per_cap with max()
gm_df %>%
  group_by(continent) %>%
  summarise(count = n(),
            mean_pop = mean(pop),
            max_gdp_per_cap = max(gdpPercap)) %>%
  prettify(cols_changed = 2:4)
continent count mean_pop max_gdp_per_cap
Africa 624 9916003 21951.21
Americas 300 24504795 42951.65
Asia 396 77038722 113523.13
Europe 360 17169765 49357.19
Oceania 24 8874672 34435.37
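
If summarize() feels familiar, that is because it does in one step what the group_by() section did with mutate(), ungroup(), and distinct(). The sketch below rebuilds the quick example's sample data by hand (toy_df, via_summarise, and via_mutate are made-up names; the values are copied from the Initial Data table above) and checks that both routes land on the same table.

```r
library(dplyr)

# Toy data (hypothetical name): values copied from the Initial Data table
toy_df <- data.frame(
  continent = c("Oceania", "Americas", "Europe", "Europe", "Oceania",
                "Americas", "Africa", "Asia", "Asia", "Africa"),
  lifeExp   = c(81.235, 72.390, 73.338, 78.885, 80.204,
                72.899, 46.859, 79.972, 72.396, 73.923),
  pop       = c(20434176, 190010647, 9956108, 4109086, 4115771,
                5675356, 135031164, 4553009, 20378239, 10276158),
  stringsAsFactors = FALSE
)

# Route 1: summarise() collapses each group to a single row
via_summarise <- toy_df %>%
  group_by(continent) %>%
  summarise(max_pop = max(pop),
            mean_life_exp = mean(lifeExp))

# Route 2: mutate() keeps every row, so distinct() removes the repeats
via_mutate <- toy_df %>%
  group_by(continent) %>%
  mutate(max_pop = max(pop),
         mean_life_exp = mean(lifeExp)) %>%
  ungroup() %>%
  distinct(continent, max_pop, mean_life_exp) %>%
  arrange(continent)

all.equal(as.data.frame(via_summarise), as.data.frame(via_mutate))
```

The summarise() route is shorter and says what you mean, which is usually reason enough to prefer it.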

ggplot() Exercise 5

Steps

  1. Using gm_df, filter() the data to keep only rows where continent is not "Oceania"
  2. group_by() continent and year
  3. summarize() the groups by calculating the mean() of pop and naming the result mean_pop
  4. ungroup() the data, because this is a habit that will save you headaches later
  5. Pipe the results to ggplot()
  6. Select the plot’s aes()thetics
    • year for the x values
    • mean_pop for the y values
    • continent for the color values
  7. Add geom_line() for the first geometry
  8. Add geom_point() for the second geometry
  9. Change the theme by adding theme_minimal()
  10. Using facet_wrap(), split the plot into panels for each continent
    • ~ is used as a formula to select the facet variable
  11. Add a title and a caption with labs()
gm_df %>%
  filter(continent != "Oceania") %>%
  group_by(continent, year) %>%
  summarise(mean_pop = mean(pop)) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = mean_pop,
             color = continent)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  facet_wrap(~ continent) +
  labs(title = "Mean Continent Populations over Time",
       caption = "Source: Free material from www.gapminder.org")
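
One detail behind the ungroup() step: when you group by more than one column, summarise() by default only peels off the last grouping variable, so the result is still grouped by the others. You can check with group_vars(). This is a sketch on made-up toy data (toy_df and summarised are hypothetical names; the rows are copied from the lifeExp output earlier in the post). Newer dplyr versions may also print a message about the remaining grouping.

```r
library(dplyr)

# Toy data (hypothetical name): rows copied from the lifeExp output above
toy_df <- data.frame(
  country   = c("Japan", "Japan", "Hong Kong, China", "Hong Kong, China",
                "Iceland", "Switzerland"),
  continent = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  year      = c(2002, 2007, 2002, 2007, 2007, 2007),
  pop       = c(127065841, 127467972, 6762476, 6980412, 301931, 7554661),
  stringsAsFactors = FALSE
)

summarised <- toy_df %>%
  group_by(continent, year) %>%
  summarise(mean_pop = mean(pop))

group_vars(summarised)           # "continent": year was dropped, continent was not
group_vars(ungroup(summarised))  # character(0): no grouping left
```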

The End

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2       kableExtra_0.9.0     knitr_1.20.8        
##  [4] gapminder_0.3.0      forcats_0.3.0        stringr_1.3.1       
##  [7] dplyr_0.7.6          purrr_0.2.5          readr_1.1.1         
## [10] tidyr_0.8.1          tibble_1.4.2.9004    ggplot2_3.0.0.9000  
## [13] tidyverse_1.2.1.9000
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4  xfun_0.3          reshape2_1.4.3   
##  [4] haven_1.1.2       lattice_0.20-35   colorspace_1.3-2 
##  [7] viridisLite_0.3.0 htmltools_0.3.6   yaml_2.1.19      
## [10] utf8_1.1.4        rlang_0.2.1       pillar_1.3.0.9000
## [13] withr_2.1.2       foreign_0.8-70    glue_1.2.0       
## [16] modelr_0.1.2      readxl_1.1.0      bindr_0.1.1      
## [19] plyr_1.8.4        munsell_0.5.0     blogdown_0.7.1   
## [22] gtable_0.2.0      cellranger_1.1.0  rvest_0.3.2      
## [25] codetools_0.2-15  psych_1.8.4       evaluate_0.10.1  
## [28] labeling_0.3      parallel_3.5.1    fansi_0.2.3      
## [31] highr_0.7         broom_0.4.5       Rcpp_0.12.17     
## [34] scales_0.5.0.9000 jsonlite_1.5      mnormt_1.5-5     
## [37] hms_0.4.2         digest_0.6.15     stringi_1.2.3    
## [40] bookdown_0.7      grid_3.5.1        cli_1.0.0        
## [43] tools_3.5.1       magrittr_1.5      lazyeval_0.2.1   
## [46] crayon_1.3.4      pkgconfig_2.0.1   xml2_1.2.0       
## [49] lubridate_1.7.4   assertthat_0.2.0  rmarkdown_1.10.7 
## [52] httr_1.3.1        rstudioapi_0.7    htmldeps_0.1.0   
## [55] R6_2.2.2          nlme_3.1-137      compiler_3.5.1
