[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Today, we’ll start digging into the wonderful world of joins! The tidyverse offers several different types of joins between two datasets, X and Y:Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
- left_join – keeps all rows from X and adds columns from Y to any that match cases in X; if there is no matching record from Y, the Y columns will be NA for those cases
- right_join – keeps all rows from Y and adds columns from Y; if there is no matching record in X, the new row from Y will have NAs for all X variables
- full_join – keeps all rows from both X and Y, matching where it can and filling in NAs where data are missing
- inner_join – only keeps rows in common between X and Y
- semi_join – does not add any columns from Y, but only keeps rows in X that have a matching record in Y
- anti_join – removes rows from X with a matching case in Y
For instance, with my reading dataset, I have multiple variables I could use as identifiers. Book.ID uniquely identifies each book in the set, and I could use that variable to match up additional information about the book. Author is another identifier, and I could use that to match up with a dataset that includes more information about the authors. It’s okay that my book dataset includes multiple books by the same author, as long as the author information set I merge with only has one record for each author. I could even use the dates in the dataset to match up with a dataset providing information on what I was doing on those days, such as whether I was traveling, working, or at home. Any of those variables could be used as an identifier for a merge, depending on what research question I want to answer.
In the dummy_cols post, I used the left_join to merge the book genres with my reading data (download the genre flag file and reading set to test out this code):
library(tidyverse) ## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 -- ## <U+2713> ggplot2 3.2.1 <U+2713> purrr 0.3.3 ## <U+2713> tibble 2.1.3 <U+2713> dplyr 0.8.3 ## <U+2713> tidyr 1.0.0 <U+2713> stringr 1.4.0 ## <U+2713> readr 1.3.1 <U+2713> forcats 0.4.0 ## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() -- ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() reads2019 <- read_csv("~/Downloads/Blogging A to Z/ReadsNoGenre.csv", col_names = TRUE) ## Parsed with column specification: ## cols( ## Title = col_character(), ## Pages = col_double(), ## date_started = col_character(), ## date_read = col_character(), ## Book.ID = col_double(), ## Author = col_character(), ## AdditionalAuthors = col_character(), ## AverageRating = col_double(), ## OriginalPublicationYear = col_double(), ## read_time = col_double(), ## MyRating = col_double(), ## Gender = col_double(), ## NewRating = col_double(), ## FinalRating = col_double() ## ) genres <- read_csv("~/Downloads/Blogging A to Z/genres.csv", col_names = TRUE) ## Parsed with column specification: ## cols( ## Book.ID = col_double(), ## Fiction = col_double(), ## Childrens = col_double(), ## Fantasy = col_double(), ## SciFi = col_double(), ## Mystery = col_double(), ## SelfHelp = col_double() ## ) reads2019 <- reads2019 %>% left_join(genres, by = "Book.ID")For joins, variable names have to be formatted as strings (meaning, in quotes). We’re joining on a single variable (Book.ID), which has the same name in both files, so we only need one string after by = . If we were joining on two variables, we’d need a list of two string. To demonstrate, I created a file by Title and Author that gives the main reason why I read the book: 1) it was recommended to me (either single book or a whole series), 2) it was a personal goal to read/re-read this book, 3) I randomly found the book at a shop, or 4) I planned to buy that specific book (usually because I follow the author’s work). Let’s merge that file in to reads2019 by Title and Author:
reason <- read_csv("~/Downloads/Blogging A to Z/reason.csv", col_names = TRUE) ## Parsed with column specification: ## cols( ## Title = col_character(), ## Author = col_character(), ## Reason = col_character() ## ) reads2019 <- reads2019 %>% left_join(reason, by = c("Title", "Author")) reads2019 %>% ggplot(aes(Reason)) + geom_bar() + coord_flip()