[This article was first published on   Deeply Trivial, and kindly contributed to R-bloggers].  (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
            Today, we’ll start digging into the wonderful world of joins! The tidyverse offers several different types of joins between two datasets, X and Y:Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
- left_join – keeps all rows from X and adds columns from Y to any that match cases in X; if there is no matching record from Y, the Y columns will be NA for those cases
- right_join – keeps all rows from Y and adds columns from Y; if there is no matching record in X, the new row from Y will have NAs for all X variables
- full_join – keeps all rows from both X and Y, matching where it can and filling in NAs where data are missing
- inner_join – only keeps rows in common between X and Y
- semi_join – does not add any columns from Y, but only keeps rows in X that have a matching record in Y
- anti_join – removes rows from X with a matching case in Y
For instance, with my reading dataset, I have multiple variables I could use as identifiers. Book.ID uniquely identifies each book in the set, and I could use that variable to match up additional information about the book. Author is another identifier, and I could use that to match up with a dataset that includes more information about the authors. It’s okay that my book dataset includes multiple books by the same author, as long as the author information set I merge with only has one record for each author. I could even use the dates in the dataset to match up with a dataset providing information on what I was doing on those days, such as whether I was traveling, working, or at home. Any of those variables could be used as an identifier for a merge, depending on what research question I want to answer.
In the dummy_cols post, I used the left_join to merge the book genres with my reading data (download the genre flag file and reading set to test out this code):
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --
## <U+2713> ggplot2 3.2.1     <U+2713> purrr   0.3.3
## <U+2713> tibble  2.1.3     <U+2713> dplyr   0.8.3
## <U+2713> tidyr   1.0.0     <U+2713> stringr 1.4.0
## <U+2713> readr   1.3.1     <U+2713> forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
reads2019 <- read_csv("~/Downloads/Blogging A to Z/ReadsNoGenre.csv",
                      col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Pages = col_double(),
##   date_started = col_character(),
##   date_read = col_character(),
##   Book.ID = col_double(),
##   Author = col_character(),
##   AdditionalAuthors = col_character(),
##   AverageRating = col_double(),
##   OriginalPublicationYear = col_double(),
##   read_time = col_double(),
##   MyRating = col_double(),
##   Gender = col_double(),
##   NewRating = col_double(),
##   FinalRating = col_double()
## )
genres <- read_csv("~/Downloads/Blogging A to Z/genres.csv",
                   col_names = TRUE)
## Parsed with column specification:
## cols(
##   Book.ID = col_double(),
##   Fiction = col_double(),
##   Childrens = col_double(),
##   Fantasy = col_double(),
##   SciFi = col_double(),
##   Mystery = col_double(),
##   SelfHelp = col_double()
## )
reads2019 <- reads2019 %>%
  left_join(genres, by = "Book.ID")
For joins, variable names have to be formatted as strings (meaning, in quotes). We’re joining on a single variable (Book.ID), which has the same name in both files, so we only need one string after by = . If we were joining on two variables, we’d need a list of two string. To demonstrate, I created a file by Title and Author that gives the main reason why I read the book: 1) it was recommended to me (either single book or a whole series), 2) it was a personal goal to read/re-read this book, 3) I randomly found the book at a shop, or 4) I planned to buy that specific book (usually because I follow the author’s work). Let’s merge that file in to reads2019 by Title and Author:reason <- read_csv("~/Downloads/Blogging A to Z/reason.csv",
                   col_names = TRUE)
## Parsed with column specification:
## cols(
##   Title = col_character(),
##   Author = col_character(),
##   Reason = col_character()
## )
reads2019 <- reads2019 %>%
  left_join(reason, by = c("Title", "Author"))
reads2019 %>%
  ggplot(aes(Reason)) +
  geom_bar() +
  coord_flip()
