Image credit to R Memes for Statistical Fiends
Considering this is a blog post, I’m going to get all bloggy here before jumping into the code.
Context
I recently had the opportunity to teach some R coding to colleagues and classmates in a series of workshops. Some had already dabbled in R or other programming languages, but it was the first time that the majority of participants had written a single line of code.
A few things happened in the week following the last session that I didn’t expect.
First, I saw a bit of R code written on a campus whiteboard that had nothing to do with me, but was straight out of the workshop. It may have come from some of my data-centric colleagues who use R, so I didn’t think too much of it.
Then, I overheard a conversation involving R from those in a program that doesn’t require any data-related coursework. Many folks are familiar with the name as the school’s primary data analysis course uses the {Rcmdr}
GUI for statistical analysis, but these students would not have necessarily taken the course. I wondered if there was a connection.
Finally, a student who didn’t even attend came to my office hours asking for resources. Why exactly? Some of his work colleagues attended. It turns out that they are now trying to incorporate some R-powered analysis in their work and he doesn’t want to miss out.
The workshop consisted of 3 consecutive Fridays lasting 90 minutes each. That’s only a total of 4.5 hours.
That’s a relatively tiny amount of time.
Wait. Scratch that.
That’s a negligible amount of time.
… but it was enough to convince some participants and non-participants that they should take advantage of the power that a bit of data-centric coding can offer.
Reflection
I taught a similar 90-minute workshop last spring using R, but focused on base R and a few data types. Ten minutes in, I was trying to explain the difference between a data.frame and a matrix when the person asking the question said something along the lines of “I guess I’m kinda dumb. Don’t worry about it.”
For context, these were international policy graduate students. While some have completed a bit of quantitative coursework, most don’t have a hardcore math or science background and programming is seen as something akin to wizardry. However, they hold domain expertise in some rather important subjects. These include WMD nonproliferation, international development, economic diplomacy, conflict studies, and environmental policy. Nearly half of the participants were international students and everyone is proficient in at least a second natural language. Most have already tackled big, complicated problems in their careers and the others are on their way to doing so following graduation. In a nutshell, they’re not dumb. The way I was teaching was dumb. They knew that they’re supposed to want to learn new skills, but they didn’t know why. Focusing on the “basics” didn’t show them anything immediately useful. It didn’t show them the why.
After the workshop, I never heard anyone mention R outside my circle of fellow data folks.
Since that time, I started using R more. Like, a lot more. I have found a way to use R in nearly everything I’ve done since May 2017. As a policy student myself, that has not always been very straightforward and I was still avoiding the strange “tidy” code I’d encounter on Stack Overflow and elsewhere. I realized the error of my ways when I came across Julia Silge and David Robinson’s Text Mining with R. It was like discovering that you’re still in the stone age while most people are off partying on spaceships.
In preparation for this workshop series, I found a lot of inspiration in Michael Levy’s presentation on teaching R, which itself echoes principles preached by other tidyverse
advocates.
A huge takeaway: live coding works.
Writing code in real time shows every single step we make from opening the IDE, to reshaping the data, to debugging inevitable errors, to rendering a final report.
Within a few short weeks of learning to code, it might be surprising how many tiny steps become automatic and taken for granted. Tack on a couple more months and newcomers will think you’re speaking in an entirely different language because you’re explaining something requiring context they simply haven’t yet encountered. Add a few years and… yeesh.
Something that frustrated me when I first started is that code explanations often seem to be written in such a way that dismisses how difficult establishing the basics can be. I’m half-convinced that, for some folks, the trauma was so great that they have simply blocked it from memory. Code is intimidating enough, but if an instructor doesn’t make a conscious effort to empathize, students will question their ability to learn. The goal is empowerment, not intimidation.
Live coding enforces a maximum speed in moving through exercises, which not only gives students more time to digest what you’re doing, but also provides more opportunities for them to ask questions on details you might find trivial, but only because you already suffered through them.
I also think that the benefits of live coding extend to the instructor as well. I found myself answering questions that framed things in ways that I had not even considered, but were exactly how multiple participants saw the task. Additionally, I have a better sense of which concepts need to be covered in more detail, as they weren’t necessarily as intuitive for others as they were for me. On the flip-side, concepts with which I remember struggling may not be difficult at all for others to understand.
… and now that we got the bloggyness of a blog post out of the way…
Here is the workflow I used for the first session. The goal was to introduce the primary {dplyr} verbs, functions that accomplish tasks necessary in nearly every project. Between each section is an exercise using {ggplot2}.
tidyverse::tidyverse_logo()
## *  __  _    __   .    o           *  .
##  / /_(_)__/ /_ ___  _____ _______ ___
## / __/ / _  / // / |/ / -_) __(_-</ -_)
## \__/_/\_,_/\_, /|___/\__/_/ /___/\__/
##      *  . /___/      o      .       *
# install.packages("tidyverse")
library(tidyverse)
# install.packages("gapminder")
library(gapminder) # loads the gapminder data set
## just to prettify printed tables when knitting
# install.packages("kableExtra")
library(knitr)
library(kableExtra)
Workflow
Resources Up Front
Our Data
In the following exercises, gm.data.frame will be used to demonstrate actions that use {base} R methods for data.frame operations, while gm_df will be used to demonstrate {tidyverse} methods for tibble operations.
gm.data.frame <- as.data.frame(gapminder)
gm_df <- gapminder
tibble
class(gm.data.frame) ## [1] "data.frame" class(gm_df) ## [1] "tbl_df" "tbl" "data.frame"
tibbles are opinionated data.frames that keep everything that is helpful about data.frames, change some of their quirks, and add methods that make them even more useful.
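To make "quirks" a bit more concrete, here is a minimal sketch of two well-known differences in subsetting. The objects toy_data_frame and toy_tibble and their values are made up for illustration and are not part of the workshop data.

```r
library(tibble)

# Hypothetical toy data, invented for this illustration
toy_data_frame <- data.frame(country = c("A", "B"), pop = c(100, 200))
toy_tibble     <- as_tibble(toy_data_frame)

# Quirk 1: pulling a single column with [ , ] turns a data.frame into a bare vector,
# while a tibble stays a tibble
class(toy_data_frame[, "pop"])
class(toy_tibble[, "pop"])

# Quirk 2: $ on a data.frame partially matches column names ("po" finds "pop"),
# while a tibble refuses to guess and returns NULL with a warning
toy_data_frame$po
suppressWarnings(toy_tibble$po)
```

Neither behavior is wrong, but the tibble versions tend to produce fewer surprises when a pipeline expects a data frame at every step.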
Printing gm.data.frame
dumps the whole data set to the console, typically requiring head()
to limit the output.
Printing
head(gm.data.frame) ## country continent year lifeExp pop gdpPercap ## 1 Afghanistan Asia 1952 28.801 8425333 779.4453 ## 2 Afghanistan Asia 1957 30.332 9240934 820.8530 ## 3 Afghanistan Asia 1962 31.997 10267083 853.1007 ## 4 Afghanistan Asia 1967 34.020 11537966 836.1971 ## 5 Afghanistan Asia 1972 36.088 13079460 739.9811 ## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
Printing gm_df
provides the dimensions, data type of each column, and only prints the first 10 rows.
gm_df ## # A tibble: 1,704 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786. ## 7 Afghanistan Asia 1982 39.9 12881816 978. ## 8 Afghanistan Asia 1987 40.8 13867957 852. ## 9 Afghanistan Asia 1992 41.7 16317921 649. ## 10 Afghanistan Asia 1997 41.8 22227415 635. ## # ... with 1,694 more rows
%>%
The pipe (%>%
) is used to chain operations together. Underneath the hood, it’s taking the value on the left-hand side of %>%
and using it as the first argument of the function on the right-hand side of %>%
.
For example, these 2 lines are doing the exact same thing.
head(gm_df) ## # A tibble: 6 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786. gm_df %>% head() ## # A tibble: 6 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786.
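The first-argument rule also holds when the function on the right-hand side takes additional arguments. A small sketch, assuming a made-up vector (toy_values is hypothetical):

```r
library(dplyr)  # re-exports %>% from {magrittr}

# Hypothetical vector, invented for this illustration
toy_values <- c(1, 4, 9, 16)

# The left-hand side becomes the FIRST argument; the remaining arguments stay put
piped  <- toy_values %>% head(2)
nested <- head(toy_values, 2)
identical(piped, nested)

# Chained calls read left to right instead of inside out
piped_sum  <- toy_values %>% sqrt() %>% sum()
nested_sum <- sum(sqrt(toy_values))
piped_sum == nested_sum
```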
For simple operations involving 1 function, %>% is only (arguably) beneficial in that it improves readability, as the flow of operations goes from left to right.
%>% becomes truly useful when you need to perform multiple operations in succession, which is the vast majority of data carpentry.
As an arbitrary example, let’s say that we want to select the head()
(first 6 rows) of gm.data.frame
and convert it to a tibble
.
Without %>%
, we can do this in a few ways.
- Use intermediate variables.
  - get gm.data.frame’s head() and assign it to no_pipe_1
  - convert no_pipe_1 to a tibble with as_tibble() and assign it to no_pipe_2
no_pipe_1 <- head(gm.data.frame) no_pipe_2 <- as_tibble(no_pipe_1) no_pipe_2 ## # A tibble: 6 x 6 ## country continent year lifeExp pop gdpPercap ## * <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786.
- Nest gm.data.frame inside of head(), which is itself nested inside of as_tibble().
as_tibble(head(gm.data.frame)) ## # A tibble: 6 x 6 ## country continent year lifeExp pop gdpPercap ## * <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786.
With %>%, we can chain these actions together in the order in which they occur, which is also the way we read English.
- Here, we do the same thing by:
  - taking gm_df
  - piping it to head() (keeping the top 6 rows)
  - piping it to as_tibble() (converting it to a tibble data frame)
gm_df %>% head() %>% as_tibble() ## # A tibble: 6 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786.
In practice, it’s usually best to place each of the functions on a separate line as it facilitates debugging and further improves readability.
gm_df %>%
  as_tibble() %>%
  head()
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
From here on, you’ll notice prettify()
. This is only being used to print tables in a clean format when the document is knit()
ted.
I’m choosing to include it here as I often find myself reading similar pages where I come across a really effective way to format some output. I understand why the author chooses to set echo=FALSE
, but it can be nice to see the underlying code without having to hunt through their GitHub.
By default, prettify() prints a maximum of 3 rows for data.frames and a maximum of 10 rows for tibbles.
prettify <- function(df, n = NULL, cols_changed = NULL, rows_changed = NULL){
  # default row limit: 10 for tibbles, 3 for plain data.frames
  # (is_tibble() replaces the deprecated is.tibble())
  if(is.null(n)) n <- if(is_tibble(df)) 10 else 3
  pretty_df <- df %>%
    head(n) %>%
    kable(format = "html") %>%
    kable_styling(bootstrap_options = c("striped", "bordered", "condensed", "hover", "responsive"),
                  full_width = FALSE)
  # highlight the columns that an operation changed
  if(!is.null(cols_changed)){
    pretty_df <- pretty_df %>%
      column_spec(cols_changed, bold = TRUE, color = "black", background = "#C8FAE3")
  }
  # highlight the rows that an operation changed
  if(!is.null(rows_changed)){
    pretty_df <- pretty_df %>%
      row_spec(rows_changed, bold = TRUE, color = "black", background = "#C8FAE3")
  }
  return(pretty_df)
}

gm.data.frame %>% prettify()
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
gm_df %>% prettify()
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 |
Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 |
Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 |
Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 |
Sample Data
You’ll also see a toy data set for the introductory examples that start each section.
sample_countries <- c("Tunisia", "Nicaragua", "Singapore", "Hungary", "New Zealand",
                      "Nigeria", "Brazil", "Sri Lanka", "Ireland", "Australia")
sample_df <- gm_df %>%
  filter(year == 2007, country %in% sample_countries)
sample_df %>% prettify()
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.367 |
Brazil | Americas | 2007 | 72.390 | 190010647 | 9065.801 |
Hungary | Europe | 2007 | 73.338 | 9956108 | 18008.944 |
Ireland | Europe | 2007 | 78.885 | 4109086 | 40675.996 |
New Zealand | Oceania | 2007 | 80.204 | 4115771 | 25185.009 |
Nicaragua | Americas | 2007 | 72.899 | 5675356 | 2749.321 |
Nigeria | Africa | 2007 | 46.859 | 135031164 | 2013.977 |
Singapore | Asia | 2007 | 79.972 | 4553009 | 47143.180 |
Sri Lanka | Asia | 2007 | 72.396 | 20378239 | 3970.095 |
Tunisia | Africa | 2007 | 73.923 | 10276158 | 7092.923 |
“Tidy” Data
If you’re unsure of what “Tidy” data is actually describing and want to learn more, you can read Hadley Wickham’s article here. Otherwise, these graphics are likely the most concise explanation you’ll find.
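As a rough illustration in code: in tidy data, each variable is a column, each observation is a row, and each value is a cell. The sketch below reshapes a made-up "wide" table into that form. The table untidy and its values are invented, and pivot_longer() assumes {tidyr} 1.0.0 or later, which is newer than the functions this workshop relied on.

```r
library(tibble)
library(tidyr)

# Hypothetical "wide" table: the year variable is hiding in the column headers
untidy <- tibble(
  country = c("A", "B"),
  `2002`  = c(70, 50),
  `2007`  = c(72, 53)
)

# Move the year headers into a `year` column and their cells into `lifeExp`
tidy <- pivot_longer(untidy, cols = -country, names_to = "year", values_to = "lifeExp")
tidy
```

After reshaping, every row is one country in one year, which is the same layout gapminder already uses.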
With tibble
s, %>%
, and the concept of tidy data covered, let’s take a dive.
{dplyr}
{dplyr}
provides a grammar of data manipulation and a set of verb functions that solve most common data carpentry challenges in a consistent fashion.
glimpse()
select()
filter()
arrange()
mutate()
summarize()
group_by()
Taking a glimpse()
In addition to the summary()
, dim()
ensions, and str()
ucture functions that can be used to inspect data, you can now use {dplyr}
’s glimpse()
.
summary(gm.data.frame) ## country continent year lifeExp ## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 ## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 ## Algeria : 12 Asia :396 Median :1980 Median :60.71 ## Angola : 12 Europe :360 Mean :1980 Mean :59.47 ## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 ## Australia : 12 Max. :2007 Max. :82.60 ## (Other) :1632 ## pop gdpPercap ## Min. :6.001e+04 Min. : 241.2 ## 1st Qu.:2.794e+06 1st Qu.: 1202.1 ## Median :7.024e+06 Median : 3531.8 ## Mean :2.960e+07 Mean : 7215.3 ## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5 ## Max. :1.319e+09 Max. :113523.1 ## dim(gm.data.frame) ## [1] 1704 6 str(gm.data.frame) ## 'data.frame': 1704 obs. of 6 variables: ## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ... ## $ lifeExp : num 28.8 30.3 32 34 36.1 ... ## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ... ## $ gdpPercap: num 779 821 853 836 740 ... glimpse(gm_df) ## Observations: 1,704 ## Variables: 6 ## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ... ## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia... ## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992... ## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8... ## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488... ## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
select()
columns
Quick Example
Initial Data
sample_df %>% prettify()
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.367 |
Brazil | Americas | 2007 | 72.390 | 190010647 | 9065.801 |
Hungary | Europe | 2007 | 73.338 | 9956108 | 18008.944 |
Ireland | Europe | 2007 | 78.885 | 4109086 | 40675.996 |
New Zealand | Oceania | 2007 | 80.204 | 4115771 | 25185.009 |
Nicaragua | Americas | 2007 | 72.899 | 5675356 | 2749.321 |
Nigeria | Africa | 2007 | 46.859 | 135031164 | 2013.977 |
Singapore | Asia | 2007 | 79.972 | 4553009 | 47143.180 |
Sri Lanka | Asia | 2007 | 72.396 | 20378239 | 3970.095 |
Tunisia | Africa | 2007 | 73.923 | 10276158 | 7092.923 |
End Data
sample_df %>% select(country, pop) %>% prettify()
country | pop |
---|---|
Australia | 20434176 |
Brazil | 190010647 |
Hungary | 9956108 |
Ireland | 4109086 |
New Zealand | 4115771 |
Nicaragua | 5675356 |
Nigeria | 135031164 |
Singapore | 4553009 |
Sri Lanka | 20378239 |
Tunisia | 10276158 |
The select() family is used to choose columns to keep. You can use bare (unquoted) names.
- select() columns by specific names
  - select only gm_df’s country, year, and pop columns
gm_df %>%
  select(country, year, pop) %>% # select columns by specific names
  prettify()
country | year | pop |
---|---|---|
Afghanistan | 1952 | 8425333 |
Afghanistan | 1957 | 9240934 |
Afghanistan | 1962 | 10267083 |
Afghanistan | 1967 | 11537966 |
Afghanistan | 1972 | 13079460 |
Afghanistan | 1977 | 14880372 |
Afghanistan | 1982 | 12881816 |
Afghanistan | 1987 | 13867957 |
Afghanistan | 1992 | 16317921 |
Afghanistan | 1997 | 22227415 |
- select() a range of columns by name
  - select gm_df’s continent column and all columns from lifeExp to gdpPercap
gm_df %>%
  select(continent, lifeExp:gdpPercap) %>% # select columns name range
  prettify()
continent | lifeExp | pop | gdpPercap |
---|---|---|---|
Asia | 28.801 | 8425333 | 779.4453 |
Asia | 30.332 | 9240934 | 820.8530 |
Asia | 31.997 | 10267083 | 853.1007 |
Asia | 34.020 | 11537966 | 836.1971 |
Asia | 36.088 | 13079460 | 739.9811 |
Asia | 38.438 | 14880372 | 786.1134 |
Asia | 39.854 | 12881816 | 978.0114 |
Asia | 40.822 | 13867957 | 852.3959 |
Asia | 41.674 | 16317921 | 649.3414 |
Asia | 41.763 | 22227415 | 635.3414 |
- de-select() a column with -
  - select() all of gm_df’s columns except lifeExp
gm_df %>%
  select(-lifeExp) %>% # deselect column by name
  prettify()
country | continent | year | pop | gdpPercap |
---|---|---|---|---|
Afghanistan | Asia | 1952 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 14880372 | 786.1134 |
Afghanistan | Asia | 1982 | 12881816 | 978.0114 |
Afghanistan | Asia | 1987 | 13867957 | 852.3959 |
Afghanistan | Asia | 1992 | 16317921 | 649.3414 |
Afghanistan | Asia | 1997 | 22227415 | 635.3414 |
- de-select() a range of columns by name
  - select() all of gm_df’s columns except those from lifeExp through gdpPercap
gm_df %>%
  select(-c(lifeExp:gdpPercap)) %>% # deselect column by name range
  prettify()
country | continent | year |
---|---|---|
Afghanistan | Asia | 1952 |
Afghanistan | Asia | 1957 |
Afghanistan | Asia | 1962 |
Afghanistan | Asia | 1967 |
Afghanistan | Asia | 1972 |
Afghanistan | Asia | 1977 |
Afghanistan | Asia | 1982 |
Afghanistan | Asia | 1987 |
Afghanistan | Asia | 1992 |
Afghanistan | Asia | 1997 |
- select() a column by index
  - select() gm_df’s 4th column
gm_df %>%
  select(4) %>% # select column by index
  prettify()
lifeExp |
---|
28.801 |
30.332 |
31.997 |
34.020 |
36.088 |
38.438 |
39.854 |
40.822 |
41.674 |
41.763 |
- de-select() a column by index
  - select() all of gm_df’s columns except for the 4th column
gm_df %>%
  select(-4) %>% # deselect column by index
  prettify()
country | continent | year | pop | gdpPercap |
---|---|---|---|---|
Afghanistan | Asia | 1952 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 14880372 | 786.1134 |
Afghanistan | Asia | 1982 | 12881816 | 978.0114 |
Afghanistan | Asia | 1987 | 13867957 | 852.3959 |
Afghanistan | Asia | 1992 | 16317921 | 649.3414 |
Afghanistan | Asia | 1997 | 22227415 | 635.3414 |
- de-select() a range of columns by index
  - select() all of gm_df’s columns except the 3rd through 5th columns
gm_df %>%
  select(-c(3:5)) %>% # deselect columns by index range
  prettify()
country | continent | gdpPercap |
---|---|---|
Afghanistan | Asia | 779.4453 |
Afghanistan | Asia | 820.8530 |
Afghanistan | Asia | 853.1007 |
Afghanistan | Asia | 836.1971 |
Afghanistan | Asia | 739.9811 |
Afghanistan | Asia | 786.1134 |
Afghanistan | Asia | 978.0114 |
Afghanistan | Asia | 852.3959 |
Afghanistan | Asia | 649.3414 |
Afghanistan | Asia | 635.3414 |
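The select() "family" also includes helper functions that match columns by pattern rather than by exact name or position. A minimal sketch on a made-up one-row tibble (toy_df and its values are hypothetical; the column names simply mirror gapminder’s):

```r
library(dplyr)

# Hypothetical one-row tibble reusing gapminder's column names
toy_df <- tibble(country = "A", continent = "B", lifeExp = 70, gdpPercap = 1000)

starts_co  <- toy_df %>% select(starts_with("co"))  # country, continent
ends_cap   <- toy_df %>% select(ends_with("cap"))   # gdpPercap
has_exp    <- toy_df %>% select(contains("Exp"))    # lifeExp

names(starts_co)
names(ends_cap)
names(has_exp)
```

These helpers can be handy once a data set has too many columns to list by hand.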
ggplot()
Exercise 1
{ggplot2} is a monster of a package used for data visualization that follows The Grammar of Graphics.
{ggplot2} takes R’s powerful graphics capabilities and makes them more accessible by taking care of many plotting tasks that are often tedious, while still allowing for lower-level customization.
- Basic Setup
  - ggplot(your data, aes(x = x values, y = y values)) +
  - geom_boxplot(): the type of plot geometry desired
Steps
- Using gm_df, select the lifeExp column
- Pipe (%>%) the result to ggplot()
- Select the plot’s aes()thetic values
  - lifeExp for the x values
  - a histogram’s y values are counts of its x values, so we don’t provide them here
- Add geom_histogram() as the geometry of the plot
gm_df %>%                     # data frame: Data
  select(lifeExp) %>%         # columns to keep: Data
  ggplot(aes(x = lifeExp)) +  # x values: Aesthetics
  geom_histogram()            # histogram: Geometries
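Run as written, geom_histogram() falls back to 30 bins and prints a message suggesting that you pick a binwidth. A hedged sketch of setting it explicitly, using simulated values rather than the gapminder data (toy_ages and the width of 5 are arbitrary choices for illustration):

```r
library(ggplot2)

# Hypothetical simulated data, invented for this illustration
set.seed(1)
toy_ages <- data.frame(age = rnorm(200, mean = 60, sd = 10))

histogram_plot <- ggplot(toy_ages, aes(x = age)) +
  geom_histogram(binwidth = 5)  # each bar spans 5 units of x

histogram_plot
```

Trying a few different widths is usually worthwhile, since the shape of a histogram can change noticeably with the bin size.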
filter()
Rows
Quick Example
Initial Data
sample_df %>% select(country, lifeExp) %>% prettify()
country | lifeExp |
---|---|
Australia | 81.235 |
Brazil | 72.390 |
Hungary | 73.338 |
Ireland | 78.885 |
New Zealand | 80.204 |
Nicaragua | 72.899 |
Nigeria | 46.859 |
Singapore | 79.972 |
Sri Lanka | 72.396 |
Tunisia | 73.923 |
End Data
sample_df %>% select(country, lifeExp) %>% filter(lifeExp > 75) %>% prettify(cols_changed = 2)
country | lifeExp |
---|---|
Australia | 81.235 |
Ireland | 78.885 |
New Zealand | 80.204 |
Singapore | 79.972 |
Use filter()
to select rows using logic. Rows where a logical expression returns TRUE
are kept and others are dropped.
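One consequence worth knowing: a row whose expression evaluates to NA is not TRUE, so filter() drops it as well. A minimal sketch with made-up values (toy_df is hypothetical):

```r
library(dplyr)

# Hypothetical data with one missing value
toy_df <- tibble(country = c("A", "B", "C"), lifeExp = c(80, NA, 60))

# lifeExp > 75 evaluates to TRUE, NA, FALSE; only the TRUE row survives
kept <- toy_df %>% filter(lifeExp > 75)
kept

# Keep the missing rows on purpose by asking for them explicitly
kept_with_na <- toy_df %>% filter(lifeExp > 75 | is.na(lifeExp))
kept_with_na
```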
- filter() rows where numeric() values are greater or lesser than another value
  - filter() gm_df to only keep rows where gdpPercap < 500
gm_df %>% filter(gdpPercap < 500) %>% prettify(cols_changed = 6)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Burundi | Africa | 1952 | 39.031 | 2445618 | 339.2965 |
Burundi | Africa | 1957 | 40.533 | 2667518 | 379.5646 |
Burundi | Africa | 1962 | 42.045 | 2961915 | 355.2032 |
Burundi | Africa | 1967 | 43.548 | 3330989 | 412.9775 |
Burundi | Africa | 1972 | 44.057 | 3529983 | 464.0995 |
Burundi | Africa | 1997 | 45.326 | 6121610 | 463.1151 |
Burundi | Africa | 2002 | 47.360 | 7021078 | 446.4035 |
Burundi | Africa | 2007 | 49.580 | 8390505 | 430.0707 |
Cambodia | Asia | 1952 | 39.417 | 4693836 | 368.4693 |
Cambodia | Asia | 1957 | 41.366 | 5322536 | 434.0383 |
- filter() rows using multiple logical expressions where all must be TRUE
  - filter() gm_df to only keep rows where year > 1990 and lifeExp < 40
  - , and & are evaluated identically in filter()
gm_df %>% filter(year > 1990, lifeExp < 40) %>% prettify(cols_changed = 3:4)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Rwanda | Africa | 1992 | 23.599 | 7290203 | 737.0686 |
Rwanda | Africa | 1997 | 36.087 | 7212583 | 589.9445 |
Sierra Leone | Africa | 1992 | 38.333 | 4260884 | 1068.6963 |
Sierra Leone | Africa | 1997 | 39.897 | 4578212 | 574.6482 |
Somalia | Africa | 1992 | 39.658 | 6099799 | 926.9603 |
Swaziland | Africa | 2007 | 39.613 | 1133066 | 4513.4806 |
Zambia | Africa | 2002 | 39.193 | 10595811 | 1071.6139 |
Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.0386 |
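The claim that a comma and & behave the same inside filter() is easy to check on a small example. A sketch with made-up values (toy_df is hypothetical):

```r
library(dplyr)

# Hypothetical data, invented for this illustration
toy_df <- tibble(year    = c(1987, 1992, 1997, 2002),
                 lifeExp = c(38, 45, 39, 41))

with_comma     <- toy_df %>% filter(year > 1990, lifeExp < 40)
with_ampersand <- toy_df %>% filter(year > 1990 & lifeExp < 40)

# Both versions keep only the 1997 row
all.equal(with_comma, with_ampersand)
```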
- filter() rows using multiple logical expressions where one must be TRUE
  - filter() gm_df to only keep rows where pop < 10000 or gdpPercap > 100000
  - | means or
gm_df %>% filter(pop < 10000 | gdpPercap > 100000) %>% prettify(cols_changed = 5:6)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Kuwait | Asia | 1952 | 55.565 | 160000 | 108382.4 |
Kuwait | Asia | 1957 | 58.033 | 212846 | 113523.1 |
Kuwait | Asia | 1972 | 67.712 | 841934 | 109347.9 |
- filter() rows using a string
  - filter() gm_df to only keep rows where year is 1997 and continent is "Europe"
  - == means is equal to
gm_df %>% filter(year == 1997 & continent == "Europe") %>% prettify(cols_changed = 2:3)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Albania | Europe | 1997 | 72.950 | 3428038 | 3193.055 |
Austria | Europe | 1997 | 77.510 | 8069876 | 29095.921 |
Belgium | Europe | 1997 | 77.530 | 10199787 | 27561.197 |
Bosnia and Herzegovina | Europe | 1997 | 73.244 | 3607000 | 4766.356 |
Bulgaria | Europe | 1997 | 70.320 | 8066057 | 5970.389 |
Croatia | Europe | 1997 | 73.680 | 4444595 | 9875.605 |
Czech Republic | Europe | 1997 | 74.010 | 10300707 | 16048.514 |
Denmark | Europe | 1997 | 76.110 | 5283663 | 29804.346 |
Finland | Europe | 1997 | 77.130 | 5134406 | 23723.950 |
France | Europe | 1997 | 78.640 | 58623428 | 25889.785 |
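When the same column is compared against several values, %in% (used earlier to build sample_df) is a compact alternative to chaining == with |. A minimal sketch on made-up data (toy_df and its pop values are hypothetical):

```r
library(dplyr)

# Hypothetical data, invented for this illustration
toy_df <- tibble(country = c("Japan", "China", "Peru", "Mali"),
                 pop     = c(1, 2, 3, 4))

with_or <- toy_df %>% filter(country == "Japan" | country == "China")
with_in <- toy_df %>% filter(country %in% c("Japan", "China"))

# Both approaches keep the same two rows
all.equal(with_or, with_in)
```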
ggplot()
Exercise 2
Steps
- Using gm_df, select the continent, country, and gdpPercap columns
- filter() the rows to only keep those where continent == "Oceania"
- Pipe (%>%) the result to ggplot()
- Select the plot’s aes()thetic values
  - country for the x values
  - gdpPercap for the y values
- Add geom_boxplot() as the geometry of the plot
gm_df %>%                                    # data frame: Data
  select(continent, country, gdpPercap) %>%  # columns to keep: Data
  filter(continent == "Oceania") %>%         # rows to keep: Data
  ggplot(aes(x = country, y = gdpPercap)) +  # x and y values: Aesthetics
  geom_boxplot()                             # box plot: Geometries
mutate()
Columns
Quick Example
Initial Data
sample_df %>% select(country, pop) %>% prettify()
country | pop |
---|---|
Australia | 20434176 |
Brazil | 190010647 |
Hungary | 9956108 |
Ireland | 4109086 |
New Zealand | 4115771 |
Nicaragua | 5675356 |
Nigeria | 135031164 |
Singapore | 4553009 |
Sri Lanka | 20378239 |
Tunisia | 10276158 |
End Data
sample_df %>% select(country, pop) %>% mutate(pop_in_thousands = pop / 1000) %>% prettify(cols_changed = 3)
country | pop | pop_in_thousands |
---|---|---|
Australia | 20434176 | 20434.176 |
Brazil | 190010647 | 190010.647 |
Hungary | 9956108 | 9956.108 |
Ireland | 4109086 | 4109.086 |
New Zealand | 4115771 | 4115.771 |
Nicaragua | 5675356 | 5675.356 |
Nigeria | 135031164 | 135031.164 |
Singapore | 4553009 | 4553.009 |
Sri Lanka | 20378239 | 20378.239 |
Tunisia | 10276158 | 10276.158 |
Use mutate()
to manipulate column values and create new columns.
In order to mutate()
a column, use the name of the column you are manipulating and set its value using =
.
Here’s a silly example:
- Add a new column to gm_df
  - mutate() gm_df to create a column named planet and set its value to "Earth"
gm_df %>% mutate(planet = "Earth") %>% prettify(cols_changed = 7)
country | continent | year | lifeExp | pop | gdpPercap | planet |
---|---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | Earth |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 | Earth |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 | Earth |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 | Earth |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 | Earth |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | Earth |
Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 | Earth |
Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 | Earth |
Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 | Earth |
Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 | Earth |
Since we have gdpPercap and pop, we can calculate the values for a total_GDP column.
- mutate() gm_df to set the results of a calculation on each row to a new column
  - multiply pop * gdpPercap and assign the result to total_GDP inside mutate()
gm_df %>% mutate(total_GDP = pop * gdpPercap) %>% prettify(cols_changed = 7)
country | continent | year | lifeExp | pop | gdpPercap | total_GDP |
---|---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | 6567086330 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 | 7585448670 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 | 8758855797 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 | 9648014150 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 | 9678553274 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | 11697659231 |
Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 | 12598563401 |
Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 | 11820990309 |
Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 | 10595901589 |
Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 | 14121995875 |
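A single mutate() call can also create several columns at once, and later columns may refer to ones created earlier in the same call. A minimal sketch with made-up numbers (toy_df and total_GDP_thousands are hypothetical names):

```r
library(dplyr)

# Hypothetical data, invented for this illustration
toy_df <- tibble(pop = c(1000, 2000), gdpPercap = c(5, 10))

result <- toy_df %>%
  mutate(total_GDP           = pop * gdpPercap,   # created first
         total_GDP_thousands = total_GDP / 1000)  # uses the column defined just above
result
```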
Typically, mutate()
is used to perform operations across columns in each individual row. You can also use summary functions to perform operations on individual columns (acting as vectors) that result in a vector that can be assigned to a column.
Makes sense, right??
Let’s calculate the z-score of each gdpPercap
value for a specific year.
\[ z = \frac {x_i -\mu_x} {\sigma_x}\]
- \(x\) = gdpPercap, with \(x_i\) the value in row \(i\)
- \(\mu_x\) = the mean of \(x\) = mean(gdpPercap)
- \(\sigma_x\) = the standard deviation of \(x\) = sd(gdpPercap)
- Use a summary function to perform a calculation involving summary statistics of a column
  - subtract mean(gdpPercap) from gdpPercap
  - divide the result by sd(gdpPercap)
  - set the results as the values of a new column called gdp_per_cap_z_score
gm_df %>% filter(year == 1977) %>% mutate(gdp_per_cap_z_score = (gdpPercap - mean(gdpPercap)) / sd(gdpPercap)) %>% prettify(cols_changed = 7)
country | continent | year | lifeExp | pop | gdpPercap | gdp_per_cap_z_score |
---|---|---|---|---|---|---|
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | -0.7805156 |
Albania | Europe | 1977 | 68.930 | 2509048 | 3533.0039 | -0.4520380 |
Algeria | Africa | 1977 | 58.014 | 17152804 | 4910.4168 | -0.2873247 |
Angola | Africa | 1977 | 39.483 | 6162675 | 3008.6474 | -0.5147414 |
Argentina | Americas | 1977 | 68.481 | 26983828 | 10079.0267 | 0.3307461 |
Australia | Oceania | 1977 | 73.490 | 14074100 | 18334.1975 | 1.3179128 |
Austria | Europe | 1977 | 72.170 | 7568430 | 19749.4223 | 1.4871476 |
Bahrain | Asia | 1977 | 65.593 | 297410 | 19340.1020 | 1.4382004 |
Bangladesh | Asia | 1977 | 46.923 | 80428306 | 659.8772 | -0.7956111 |
Belgium | Europe | 1977 | 72.800 | 9821800 | 19117.9745 | 1.4116381 |
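If you want to convince yourself that the formula behaves as advertised, here is a minimal sketch on a made-up data frame (toy_df and its values are assumptions for illustration, not part of gapminder). The resulting z-scores should average 0 with a standard deviation of 1, and they should match what base R's scale() returns.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy_df <- data.frame(
  country = c("A", "B", "C", "D"),
  gdpPercap = c(1000, 2000, 3000, 6000),
  stringsAsFactors = FALSE
)

# Same pattern as above: subtract the column mean, divide by the column sd
z_df <- toy_df %>%
  mutate(z = (gdpPercap - mean(gdpPercap)) / sd(gdpPercap))

# Sanity checks: z-scores are centered on 0 with a standard deviation of 1
mean(z_df$z)
sd(z_df$z)

# Base R's scale() performs the same standardization
all.equal(z_df$z, as.numeric(scale(toy_df$gdpPercap)))
```

Because mean() and sd() each return a single value, R recycles them across every row, which is why this works inside mutate().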
Here are other functions that can be used similarly:
Summary Functions | |
---|---|
first() | min() |
last() | max() |
nth() | mean() |
n() | median() |
n_distinct() | var() |
IQR() | sd() |
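To see a few of these in action, here is a small sketch that uses several of them inside a single mutate() call. The scores data frame and the new column names are hypothetical and only there for illustration.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
scores <- data.frame(
  team = c("x", "x", "x", "y", "y"),
  points = c(3, 7, 5, 10, 2),
  stringsAsFactors = FALSE
)

scores_plus <- scores %>%
  mutate(
    n_rows = n(),                            # number of rows
    n_teams = n_distinct(team),              # number of unique teams
    first_pts = first(points),               # first value in the column
    second_pts = nth(points, 2),             # second value in the column
    pts_vs_median = points - median(points), # distance from the median
    share_of_max = points / max(points)      # fraction of the largest value
  )

scores_plus
```

Each summary function returns one value, and mutate() repeats that value on every row (or combines it with the row's own value, as in the last two columns).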
ggplot() Exercise 3
Steps
- Using gm_df, select() country, year, and gdpPercap
- filter() the rows to keep only those where country is "Korea, Rep.", "Korea, Dem. Rep.", "Japan", or "China"
- Pipe the result to ggplot()
- Select the plot’s aes()thetic values
  - year for the x values
  - gdpPercap for the y values
  - country for the color values
- Add geom_line() as the geometry of the plot
- Add a title to the plot with labs()
gm_df %>%
  select(country, year, gdpPercap) %>%
  filter(country %in% c("Korea, Rep.", "Korea, Dem. Rep.", "Japan", "China")) %>%
  ggplot(aes(x = year, y = gdpPercap, color = country)) +
  geom_line() +
  labs(title = "GDP per Capita Over Time")
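The %in% operator in that filter() call may be new to you. As a rough sketch, it is shorthand for chaining several == comparisons together with |. The toy_df data frame and the wanted vector below are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy_df <- data.frame(
  country = c("Japan", "China", "Brazil", "Japan", "China"),
  year = c(2002, 2002, 2002, 2007, 2007),
  stringsAsFactors = FALSE
)

wanted <- c("Japan", "China")

# Keep rows whose country matches any value in `wanted`
with_in <- toy_df %>%
  filter(country %in% wanted)

# The long-hand equivalent: one comparison per value, joined with |
with_or <- toy_df %>%
  filter(country == "Japan" | country == "China")

# Both approaches keep the same rows
all.equal(with_in, with_or)
```

The %in% version scales better: adding a fifth or sixth country only means adding it to the vector.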
arrange() Rows
Quick Example
Initial Data
sample_df %>% select(country, gdpPercap) %>% prettify()
country | gdpPercap |
---|---|
Australia | 34435.367 |
Brazil | 9065.801 |
Hungary | 18008.944 |
Ireland | 40675.996 |
New Zealand | 25185.009 |
Nicaragua | 2749.321 |
Nigeria | 2013.977 |
Singapore | 47143.180 |
Sri Lanka | 3970.095 |
Tunisia | 7092.923 |
End Data
sample_df %>%
  select(country, gdpPercap) %>%
  arrange(gdpPercap) %>%
  prettify(cols_changed = 2)
country | gdpPercap |
---|---|
Nigeria | 2013.977 |
Nicaragua | 2749.321 |
Sri Lanka | 3970.095 |
Tunisia | 7092.923 |
Brazil | 9065.801 |
Hungary | 18008.944 |
New Zealand | 25185.009 |
Australia | 34435.367 |
Ireland | 40675.996 |
Singapore | 47143.180 |
Use arrange() to sort rows.
- arrange() by ascending number (smallest to largest)
  - arrange() gm_df’s pop column so that smallest populations are on top
gm_df %>% arrange(pop) %>% prettify(cols_changed = 5)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Sao Tome and Principe | Africa | 1952 | 46.471 | 60011 | 879.5836 |
Sao Tome and Principe | Africa | 1957 | 48.945 | 61325 | 860.7369 |
Djibouti | Africa | 1952 | 34.812 | 63149 | 2669.5295 |
Sao Tome and Principe | Africa | 1962 | 51.893 | 65345 | 1071.5511 |
Sao Tome and Principe | Africa | 1967 | 54.425 | 70787 | 1384.8406 |
Djibouti | Africa | 1957 | 37.328 | 71851 | 2864.9691 |
Sao Tome and Principe | Africa | 1972 | 56.480 | 76595 | 1532.9853 |
Sao Tome and Principe | Africa | 1977 | 58.550 | 86796 | 1737.5617 |
Djibouti | Africa | 1962 | 39.693 | 89898 | 3020.9893 |
Sao Tome and Principe | Africa | 1982 | 60.351 | 98593 | 1890.2181 |
- arrange() by descending number with desc() (largest to smallest)
  - arrange() the lifeExp column so that largest values are on top
gm_df %>% arrange(desc(lifeExp)) %>% prettify(cols_changed = 4)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Japan | Asia | 2007 | 82.603 | 127467972 | 31656.07 |
Hong Kong, China | Asia | 2007 | 82.208 | 6980412 | 39724.98 |
Japan | Asia | 2002 | 82.000 | 127065841 | 28604.59 |
Iceland | Europe | 2007 | 81.757 | 301931 | 36180.79 |
Switzerland | Europe | 2007 | 81.701 | 7554661 | 37506.42 |
Hong Kong, China | Asia | 2002 | 81.495 | 6762476 | 30209.02 |
Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.37 |
Spain | Europe | 2007 | 80.941 | 40448191 | 28821.06 |
Sweden | Europe | 2007 | 80.884 | 9031088 | 33859.75 |
Israel | Asia | 2007 | 80.745 | 6426679 | 25523.28 |
- arrange() alphabetically
  - filter() gm_df to keep only those rows where year == 2007 and continent == "Americas"
  - arrange() the country column alphabetically
gm_df %>% filter(year == 2007, continent == "Americas") %>% arrange(country) %>% prettify(cols_changed = 2:3)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.380 |
Bolivia | Americas | 2007 | 65.554 | 9119152 | 3822.137 |
Brazil | Americas | 2007 | 72.390 | 190010647 | 9065.801 |
Canada | Americas | 2007 | 80.653 | 33390141 | 36319.235 |
Chile | Americas | 2007 | 78.553 | 16284741 | 13171.639 |
Colombia | Americas | 2007 | 72.889 | 44227550 | 7006.580 |
Costa Rica | Americas | 2007 | 78.782 | 4133884 | 9645.061 |
Cuba | Americas | 2007 | 78.273 | 11416987 | 8948.103 |
Dominican Republic | Americas | 2007 | 72.235 | 9319622 | 6025.375 |
Ecuador | Americas | 2007 | 74.994 | 13755680 | 6873.262 |
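arrange() also accepts more than one column, in which case later columns break ties in earlier ones. Here is a minimal sketch that sorts alphabetically by one column and then in descending order by another. The toy_df data frame holds rounded, illustrative values and is an assumption, not a slice of gm_df.

```r
library(dplyr)

# Hypothetical toy data; rounded values used for illustration only
toy_df <- data.frame(
  continent = c("Asia", "Europe", "Asia", "Europe"),
  country = c("Japan", "Spain", "China", "Iceland"),
  lifeExp = c(82.6, 80.9, 73.0, 81.8),
  stringsAsFactors = FALSE
)

# Sort by continent (A to Z), then by lifeExp (largest first) within each continent
sorted <- toy_df %>%
  arrange(continent, desc(lifeExp))

sorted
```

The order of the arguments matters: swapping them would sort by lifeExp first and only use continent to break exact ties.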
group_by() for Grouped Data
Quick Example
Initial Data
sample_df %>% select(country, continent, pop) %>% prettify()
country | continent | pop |
---|---|---|
Australia | Oceania | 20434176 |
Brazil | Americas | 190010647 |
Hungary | Europe | 9956108 |
Ireland | Europe | 4109086 |
New Zealand | Oceania | 4115771 |
Nicaragua | Americas | 5675356 |
Nigeria | Africa | 135031164 |
Singapore | Asia | 4553009 |
Sri Lanka | Asia | 20378239 |
Tunisia | Africa | 10276158 |
End Data
sample_df %>%
  select(country, continent, pop) %>%
  group_by(continent) %>%
  mutate(pop_by_continent = sum(pop)) %>%
  ungroup() %>%
  arrange(pop_by_continent) %>%
  prettify(cols_changed = 4)
country | continent | pop | pop_by_continent |
---|---|---|---|
Hungary | Europe | 9956108 | 14065194 |
Ireland | Europe | 4109086 | 14065194 |
Australia | Oceania | 20434176 | 24549947 |
New Zealand | Oceania | 4115771 | 24549947 |
Singapore | Asia | 4553009 | 24931248 |
Sri Lanka | Asia | 20378239 | 24931248 |
Nigeria | Africa | 135031164 | 145307322 |
Tunisia | Africa | 10276158 | 145307322 |
Brazil | Americas | 190010647 | 195686003 |
Nicaragua | Americas | 5675356 | 195686003 |
group_by() allows us to group rows together based on column values.
Let’s say we wanted to compute summary values for each country for all years.
- Calculate the median_gdp_per_cap of each country with group_by()
  - take gm_df and group_by() country to group rows of the same country together
  - use median() to calculate the median_gdp_per_cap
  - ungroup() the rows
    - a habit you want to build
  - keep only those rows with distinct() combinations of country and median_gdp_per_cap
    - distinct()’s default is to only keep columns used as arguments
gm_df %>%
  group_by(country) %>%
  mutate(median_gdp_per_cap = median(gdpPercap)) %>%
  ungroup() %>%
  distinct(country, median_gdp_per_cap) %>%
  prettify(cols_changed = 2)
country | median_gdp_per_cap |
---|---|
Afghanistan | 803.4832 |
Albania | 3253.2384 |
Algeria | 4853.8559 |
Angola | 3264.6288 |
Argentina | 9068.7844 |
Australia | 18905.6034 |
Austria | 20673.2530 |
Bahrain | 18779.8016 |
Bangladesh | 703.7638 |
Belgium | 20048.9102 |
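Why bother with ungroup()? A grouped data frame stays grouped, so any later mutate() keeps working group by group whether you meant it to or not. The sketch below shows the difference on a made-up data frame (toy_df, its values, and the total column are assumptions for illustration), and then shows distinct() dropping the columns that were not passed to it.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy_df <- data.frame(
  continent = c("Oceania", "Oceania", "Asia", "Asia"),
  pop = c(20, 4, 5, 21),
  stringsAsFactors = FALSE
)

grouped <- toy_df %>%
  group_by(continent) %>%
  mutate(pop_by_continent = sum(pop))

# Still grouped: sum(pop) is computed within each continent again
still_grouped <- grouped %>%
  mutate(total = sum(pop))

# Ungrouped first: sum(pop) is computed across all rows
ungrouped <- grouped %>%
  ungroup() %>%
  mutate(total = sum(pop))

still_grouped$total
ungrouped$total

# distinct() keeps only the columns named in the call (one row per combination)
per_continent <- ungrouped %>%
  distinct(continent, pop_by_continent)

per_continent
```

If you do want to keep the other columns, distinct() has a .keep_all = TRUE argument for that.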
ggplot() Exercise 4
Steps
- Using gm_df, group_by() the continent and year
- mutate() to add a column called mean_gdp for the average GDP per capita of each continent in each year
- ungroup() the data, because this is a habit that will save you headaches later
- Keep only distinct() combinations of continent, year, and mean_gdp
- Pipe the result to ggplot()
- Select the plot’s aes()thetic values
  - year for the x values
  - mean_gdp for the y values
  - continent for the color values
- Add geom_line() as the geometry of the plot
- Add a title and a caption (for the source of the data) to the plot with labs()
gm_df %>%
  group_by(year, continent) %>%
  mutate(mean_gdp = mean(gdpPercap)) %>%
  ungroup() %>%
  distinct(continent, year, mean_gdp) %>%
  ggplot(aes(x = year, y = mean_gdp, color = continent)) +
  geom_line() +
  labs(title = "Mean GDPs by Continent Over Time",
       caption = "Source: Free material from www.gapminder.org")
summarize()
Quick Example
Initial Data
sample_df %>% select(country, continent, lifeExp, pop) %>% prettify()
country | continent | lifeExp | pop |
---|---|---|---|
Australia | Oceania | 81.235 | 20434176 |
Brazil | Americas | 72.390 | 190010647 |
Hungary | Europe | 73.338 | 9956108 |
Ireland | Europe | 78.885 | 4109086 |
New Zealand | Oceania | 80.204 | 4115771 |
Nicaragua | Americas | 72.899 | 5675356 |
Nigeria | Africa | 46.859 | 135031164 |
Singapore | Asia | 79.972 | 4553009 |
Sri Lanka | Asia | 72.396 | 20378239 |
Tunisia | Africa | 73.923 | 10276158 |
sample_df %>%
  select(country, continent, lifeExp, pop) %>%
  group_by(continent) %>%
  summarise(max_pop = max(pop), mean_life_exp = mean(lifeExp)) %>%
  prettify(cols_changed = 2:3)
continent | max_pop | mean_life_exp |
---|---|---|
Africa | 135031164 | 60.3910 |
Americas | 190010647 | 72.6445 |
Asia | 20378239 | 76.1840 |
Europe | 9956108 | 76.1115 |
Oceania | 20434176 | 80.7195 |
Now that we know how to use group_by(), we can summarize() data by group. This can be done using all of the summary functions seen earlier.
Summary Functions | |
---|---|
first() | min() |
last() | max() |
nth() | mean() |
n() | median() |
n_distinct() | var() |
IQR() | sd() |
- Calculate some summary statistics for each continent.
  - take gm_df and group_by() continent
  - using summarize() or summarise(), calculate:
    - count with n()
    - mean_pop with mean()
    - max_gdp_per_cap with max()
gm_df %>%
  group_by(continent) %>%
  summarise(count = n(),
            mean_pop = mean(pop),
            max_gdp_per_cap = max(gdpPercap)) %>%
  prettify(cols_changed = 2:4)
continent | count | mean_pop | max_gdp_per_cap |
---|---|---|---|
Africa | 624 | 9916003 | 21951.21 |
Americas | 300 | 24504795 | 42951.65 |
Asia | 396 | 77038722 | 113523.13 |
Europe | 360 | 17169765 | 49357.19 |
Oceania | 24 | 8874672 | 34435.37 |
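If this looks a lot like the group_by(), mutate(), and distinct() pattern from the previous section, that's because it is. As a rough sketch on a made-up data frame (toy_df and its values are assumptions for illustration), both routes land on the same per-group numbers, but summarise() gets there in one step and can return several statistics at once.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy_df <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia", "Asia"),
  pop = c(10, 30, 5, 15, 10),
  stringsAsFactors = FALSE
)

# The longer route: add the group mean to every row, then collapse duplicates
via_mutate <- toy_df %>%
  group_by(continent) %>%
  mutate(mean_pop = mean(pop)) %>%
  ungroup() %>%
  distinct(continent, mean_pop)

# The shorter route: one row per group, straight away
via_summarise <- toy_df %>%
  group_by(continent) %>%
  summarise(count = n(), mean_pop = mean(pop))

via_mutate
via_summarise

# Same per-continent means either way
all.equal(via_mutate$mean_pop, via_summarise$mean_pop)
```

The practical difference is that mutate() keeps every original row while summarise() returns one row per group.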
ggplot() Exercise 5
Steps
- Using gm_df, filter() the data to remove rows where continent is "Oceania"
- group_by() continent and year
- summarize() the groups by calculating the mean() of pop as mean_pop
- ungroup() the data, because this is a habit that will save you headaches later
- Pipe the results to ggplot()
- Select the plot’s aes()thetics
  - year for the x values
  - mean_pop for the y values
  - continent for the color values
- Add geom_line() for the first geometry
- Add geom_point() for the second geometry
- Change the theme by adding theme_minimal()
- Using facet_wrap(), split the plot into panels for each continent
  - ~ is used as a formula to select the facet variable
- Add a title and a caption with labs()
gm_df %>%
  filter(continent != "Oceania") %>%
  group_by(continent, year) %>%
  summarise(mean_pop = mean(pop)) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = mean_pop, color = continent)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  facet_wrap(~ continent) +
  labs(title = "Mean Continent Populations over Time",
       caption = "Source: Free material from www.gapminder.org")
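A quick aside on facet_wrap(): the formula is not the only way to pick the facet variable. The sketch below assumes ggplot2 3.0.0 or later, where vars() is available, and uses a made-up data frame (toy_df and its values are assumptions for illustration). It also shows scales = "free_y", which may help when one panel's values dwarf the others.

```r
library(ggplot2)

# Hypothetical toy data; the values are made up for illustration
toy_df <- data.frame(
  year = rep(c(2000, 2005), times = 2),
  mean_pop = c(10, 12, 30, 33),
  continent = rep(c("Africa", "Asia"), each = 2),
  stringsAsFactors = FALSE
)

base_plot <- ggplot(toy_df, aes(x = year, y = mean_pop, color = continent)) +
  geom_line() +
  geom_point()

# Formula interface, as used in the exercise
p_formula <- base_plot + facet_wrap(~ continent)

# vars() interface: same panels, no formula needed
p_vars <- base_plot + facet_wrap(vars(continent))

# Let each panel pick its own y-axis range
p_free <- base_plot + facet_wrap(~ continent, scales = "free_y")
```

Print any of the three objects (for example, p_free) to draw the plot.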
The End
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 kableExtra_0.9.0 knitr_1.20.8
## [4] gapminder_0.3.0 forcats_0.3.0 stringr_1.3.1
## [7] dplyr_0.7.6 purrr_0.2.5 readr_1.1.1
## [10] tidyr_0.8.1 tibble_1.4.2.9004 ggplot2_3.0.0.9000
## [13] tidyverse_1.2.1.9000
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.4 xfun_0.3 reshape2_1.4.3
## [4] haven_1.1.2 lattice_0.20-35 colorspace_1.3-2
## [7] viridisLite_0.3.0 htmltools_0.3.6 yaml_2.1.19
## [10] utf8_1.1.4 rlang_0.2.1 pillar_1.3.0.9000
## [13] withr_2.1.2 foreign_0.8-70 glue_1.2.0
## [16] modelr_0.1.2 readxl_1.1.0 bindr_0.1.1
## [19] plyr_1.8.4 munsell_0.5.0 blogdown_0.7.1
## [22] gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2
## [25] codetools_0.2-15 psych_1.8.4 evaluate_0.10.1
## [28] labeling_0.3 parallel_3.5.1 fansi_0.2.3
## [31] highr_0.7 broom_0.4.5 Rcpp_0.12.17
## [34] scales_0.5.0.9000 jsonlite_1.5 mnormt_1.5-5
## [37] hms_0.4.2 digest_0.6.15 stringi_1.2.3
## [40] bookdown_0.7 grid_3.5.1 cli_1.0.0
## [43] tools_3.5.1 magrittr_1.5 lazyeval_0.2.1
## [46] crayon_1.3.4 pkgconfig_2.0.1 xml2_1.2.0
## [49] lubridate_1.7.4 assertthat_0.2.0 rmarkdown_1.10.7
## [52] httr_1.3.1 rstudioapi_0.7 htmldeps_0.1.0
## [55] R6_2.2.2 nlme_3.1-137 compiler_3.5.1