Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Joins Are No Mystery Anymore: Hands-On Tutorial — Part 1
Welcome! In this tutorial, I’ll be your guide as we unravel the mysteries of data joins in R. Whether you’re working with customer records, inventory lists, or historical documents, mastering data joins is essential for any data analyst or scientist. Together, we’ll explore a variety of join types through real-life examples and datasets, making complex concepts easy to understand and apply. By the end of this tutorial, you’ll be equipped with the knowledge and skills to confidently join data and uncover the valuable insights hidden within. Let’s get started and make joins a breeze!
At the very beginning… All datasets I am working on and getting using load() function are prepared for you and uploaded to Github.
Inner Join
An Inner Join is used to combine rows from two tables based on a related column between them. It returns only the rows where there is a match in both tables. If there are no matches, the result set will not include those rows.
Explanation of the Scenario
In our scenario, we have customer orders and payments. We want to find orders that have been paid. This will help us understand which customers have completed their payments and which orders are still pending.
Data file: https://github.com/kgryczan/medium_publishing/blob/main/inner_join_data.RData
Description of the Datasets
We will use two datasets:
- orders: Contains information about customer orders. Columns: order_id, customer_id, order_date
- payments: Contains information about payments made for orders. Columns: payment_id, order_id, amount, payment_date
Step-by-Step Code Examples
Loading the datasets
# Load the necessary libraries library(dplyr) # Load the datasets load("inner_join_data.RData") # Display the datasets print(orders, n=5) # A tibble: 30 × 3 order_id customer_id order_date <int> <int> <date> 1 1 102 2024-01-01 2 2 113 2024-01-02 3 3 108 2024-01-03 4 4 111 2024-01-04 5 5 106 2024-01-05 # ℹ 25 more rows print(payments, n=5) # A tibble: 20 × 4 payment_id order_id amount payment_date <int> <int> <dbl> <date> 1 201 27 167. 2024-01-05 2 202 30 80.2 2024-01-06 3 204 28 110. 2024-01-08 4 206 24 159. 2024-01-10 5 207 4 173. 2024-01-11 # ℹ 15 more rows
Performing the Inner Join
# Perform the inner join orders_paid <- inner_join(orders, payments, by = "order_id") # Display the result print(orders_paid) # A tibble: 20 × 6 order_id customer_id order_date payment_id amount payment_date <int> <int> <date> <int> <dbl> <date> 1 2 113 2024-01-02 209 90.6 2024-01-13 2 2 113 2024-01-02 229 129. 2024-02-02 3 4 111 2024-01-04 207 173. 2024-01-11 4 6 104 2024-01-06 228 181. 2024-02-01 5 7 111 2024-01-07 217 80.8 2024-01-21 6 11 109 2024-01-11 212 108. 2024-01-16 7 12 109 2024-01-12 224 183. 2024-01-28 8 13 105 2024-01-13 216 194. 2024-01-20 9 13 105 2024-01-13 226 117. 2024-01-30 10 16 112 2024-01-16 223 176. 2024-01-27 11 20 104 2024-01-20 208 131. 2024-01-12 12 20 104 2024-01-20 225 66.2 2024-01-29 13 21 115 2024-01-21 227 130. 2024-01-31 14 23 104 2024-01-23 230 103. 2024-02-03 15 24 111 2024-01-24 206 159. 2024-01-10 16 26 111 2024-01-26 213 89.4 2024-01-17 17 27 110 2024-01-27 201 167. 2024-01-05 18 28 101 2024-01-28 204 110. 2024-01-08 19 28 101 2024-01-28 221 156. 2024-01-25 20 30 114 2024-01-30 202 80.2 2024-01-06
Explanation of the Code
- We first load the datasets using the load function.
- We then use the inner_join function from the dplyr package to join the orders and payments datasets on the order_id column.
- Finally, we display the result to see which orders have been paid.
Interpretation of Results
The resulting dataset orders_paid contains only the rows where there is a match in both orders and payments datasets. This means that only the orders that have been paid are included in the result. Each row in the result represents an order that has been matched with a corresponding payment, showing details from both the orders and payments tables.
Homework for Readers
In the same inner_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:
- enrollments: Contains information about student enrollments.
- Columns: student_id, course_id, enrollment_date
- exam_results: Contains information about exam results.
- Columns: student_id, course_id, exam_score, exam_date
Your task is to perform an inner join on these datasets to find students who have both enrolled and taken exams. Use the student_id and course_id columns for joining.
Left Join (Left Outer Join)
A Left Join returns all rows from the left table, and the matched rows from the right table. If there is no match, the result is NULL on the side of the right table.
Explanation of the Scenario
In this scenario, we have product information and sales records. We want to find all products, including those that haven’t been sold. This helps in understanding which products are in stock and which are moving in the market.
Data file: https://github.com/kgryczan/medium_publishing/blob/main/left_join_data.RData
Description of the Datasets
We will use two datasets:
- products: Contains information about the products.
- Columns: product_id, product_name, category
- sales: Contains information about the sales made.
- Columns: sale_id, product_id, quantity_sold, sale_date
Step-by-Step Code Examples
Loading the datasets
# Load the necessary libraries library(dplyr) # Load the datasets load("left_join_data.RData") # Display the datasets print(products, n = 5) # A tibble: 30 × 3 product_id product_name category <int> <chr> <chr> 1 1 Product A Category 1 2 2 Product B Category 3 3 3 Product C Category 3 4 4 Product D Category 3 5 5 Product E Category 3 # ℹ 25 more rows print(sales, n = 5) # A tibble: 30 × 4 sale_id product_id quantity_sold sale_date <int> <int> <int> <date> 1 101 2 10 2024-02-01 2 102 29 10 2024-02-02 3 103 16 6 2024-02-03 4 104 30 5 2024-02-04 5 105 25 4 2024-02-05 # ℹ 25 more rows
Performing the Left Join
# Perform the left join products_sales <- left_join(products, sales, by = "product_id") # Display the result print(products_sales) # A tibble: 41 × 6 product_id product_name category sale_id quantity_sold sale_date <int> <chr> <chr> <int> <int> <date> 1 1 Product A Category 1 106 7 2024-02-06 2 2 Product B Category 3 101 10 2024-02-01 3 2 Product B Category 3 118 8 2024-02-18 4 3 Product C Category 3 NA NA NA 5 4 Product D Category 3 107 4 2024-02-07 6 4 Product D Category 3 127 2 2024-02-27 7 5 Product E Category 3 113 9 2024-02-13 8 6 Product F Category 3 NA NA NA 9 7 Product G Category 1 NA NA NA 10 8 Product H Category 2 NA NA NA # ℹ 31 more rows
Explanation of the Code
- We first load the datasets using the load function.
- We then use the left_join function from the dplyr package to join the products and sales datasets on the product_id column.
- Finally, we display the result to see all products, including those that haven’t been sold.
Interpretation of Results
The resulting dataset products_sales contains all rows from the products dataset, with matched rows from the sales dataset. If a product hasn’t been sold, the columns from the sales dataset will have NULL values.
Homework for Readers
In the same left_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:
- employees: Contains information about employees.
- Columns: employee_id, name, department
- parking_permits: Contains information about parking permits issued.
- Columns: permit_id, employee_id, permit_date
Your task is to perform a left join on these datasets to find all employees, including those without a parking permit. Use the employee_id column for joining.
Right Join (Right Outer Join)
A Right Join returns all rows from the right table, and the matched rows from the left table. If there is no match, the result is NULL on the side of the left table.
Explanation of the Scenario
In this scenario, we have marketing campaigns and responses to those campaigns. We want to find all responses, including those that did not belong to a campaign. This helps in understanding the effectiveness of marketing campaigns and identifying responses that might be related to other activities.
Data file: https://github.com/kgryczan/medium_publishing/blob/main/right_join_data.RData
Description of the Datasets
We will use two datasets:
- campaigns: Contains information about marketing campaigns.
- Columns: campaign_id, campaign_name, start_date
- responses: Contains information about responses to campaigns.
- Columns: response_id, campaign_id, response_date
Step-by-Step Code Examples
Loading the datasets
# Load the necessary libraries library(dplyr) # Load the datasets load("right_join_data.RData") # Display the datasets print(campaigns, n = 5) # A tibble: 20 × 3 campaign_id campaign_name start_date <int> <chr> <date> 1 2 Campaign B 2024-01-02 2 4 Campaign D 2024-01-04 3 5 Campaign E 2024-01-05 4 7 Campaign G 2024-01-07 5 8 Campaign H 2024-01-08 # ℹ 15 more rows print(responses, n = 5) # A tibble: 30 × 3 response_id campaign_id response_date <int> <int> <date> 1 101 11 2024-01-05 2 102 27 2024-01-06 3 103 2 2024-01-07 4 104 16 2024-01-08 5 105 22 2024-01-09 # ℹ 25 more rows
Performing the Right Join
# Perform the right join responses_campaigns <- right_join(campaigns, responses, by = "campaign_id") # Display the result print(responses_campaigns, n = 30) # A tibble: 30 × 5 campaign_id campaign_name start_date response_id response_date <int> <chr> <date> <int> <date> 1 2 Campaign B 2024-01-02 103 2024-01-07 2 4 Campaign D 2024-01-04 112 2024-01-16 3 4 Campaign D 2024-01-04 121 2024-01-25 4 5 Campaign E 2024-01-05 127 2024-01-31 5 8 Campaign H 2024-01-08 130 2024-02-03 6 15 Campaign O 2024-01-15 119 2024-01-23 7 15 Campaign O 2024-01-15 129 2024-02-02 8 16 Campaign P 2024-01-16 104 2024-01-08 9 16 Campaign P 2024-01-16 106 2024-01-10 10 16 Campaign P 2024-01-16 110 2024-01-14 11 16 Campaign P 2024-01-16 116 2024-01-20 12 16 Campaign P 2024-01-16 124 2024-01-28 13 17 Campaign Q 2024-01-17 126 2024-01-30 14 18 Campaign R 2024-01-18 123 2024-01-27 15 27 Campaign NA 2024-01-27 102 2024-01-06 16 28 Campaign NA 2024-01-28 108 2024-01-12 17 28 Campaign NA 2024-01-28 109 2024-01-13 18 28 Campaign NA 2024-01-28 117 2024-01-21 19 30 Campaign NA 2024-01-30 113 2024-01-17 20 11 NA NA 101 2024-01-05 21 22 NA NA 105 2024-01-09 22 19 NA NA 107 2024-01-11 23 6 NA NA 111 2024-01-15 24 14 NA NA 114 2024-01-18 25 3 NA NA 115 2024-01-19 26 9 NA NA 118 2024-01-22 27 9 NA NA 120 2024-01-24 28 11 NA NA 122 2024-01-26 29 9 NA NA 125 2024-01-29 30 11 NA NA 128 2024-02-01
Explanation of the Code:
- We first load the datasets using the load function.
- We then use the right_join function from the dplyr package to join the campaigns and responses datasets on the campaign_id column.
- Finally, we display the result to see all responses, including those that did not belong to a campaign.
Homework for Readers
In the same right_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:
- online_courses: Contains information about online courses.
- Columns: course_id, course_name, launch_date
- completions: Contains information about course completions.
- Columns: completion_id, course_id, student_id, completion_date
Your task is to perform a right join on these datasets to find all completions, including those for courses that may have been removed. Use the course_id column for joining.
Full Join (Full Outer Join)
A Full Join returns all rows when there is a match in either the left or right table. If there is no match, the result is NULL on the side where there is no match.
Explanation of the Scenario
In this scenario, we have inventory records from two warehouses. We want to get a complete list of all products and quantities, whether they are in one warehouse or the other. This helps in having a comprehensive view of inventory across multiple locations.
Data file: https://github.com/kgryczan/medium_publishing/blob/main/full_join_data.RData
Description of the Datasets
We will use two datasets:
- warehouse1: Contains inventory information from warehouse 1.
- Columns: product_id, product_name, quantity
- warehouse2: Contains inventory information from warehouse 2.
- Columns: product_id, product_name, quantity
Step-by-Step Code Examples
Loading the datasets
# Load the necessary libraries library(dplyr) # Load the datasets load("full_join_data.RData") # Display the datasets print(warehouse1, n = 5) # A tibble: 20 × 3 product_id product_name quantity <int> <chr> <int> 1 1 Product A 153 2 2 Product B 200 3 3 Product C 111 4 4 Product D 108 5 5 Product E 177 # ℹ 15 more rows print(warehouse2, n = 5) # A tibble: 16 × 3 product_id product_name quantity <int> <chr> <int> 1 15 Product O 161 2 16 Product P 94 3 17 Product Q 63 4 18 Product R 94 5 19 Product S 111 # ℹ 11 more rows
Performing the Full Join
# Perform the full join inventory_full <- full_join(warehouse1, warehouse2, by = "product_id", suffix = c("_wh1", "_wh2")) # Display the result print.AsIs(inventory_full) product_id product_name_wh1 quantity_wh1 product_name_wh2 quantity_wh2 1 1 Product A 153 <NA> NA 2 2 Product B 200 <NA> NA 3 3 Product C 111 <NA> NA 4 4 Product D 108 <NA> NA 5 5 Product E 177 <NA> NA 6 6 Product F 161 <NA> NA 7 7 Product G 175 <NA> NA 8 8 Product H 70 <NA> NA 9 9 Product I 72 <NA> NA 10 10 Product J 89 <NA> NA 11 11 Product K 189 <NA> NA 12 12 Product L 109 <NA> NA 13 13 Product M 177 <NA> NA 14 14 Product N 124 <NA> NA 15 15 Product O 123 Product O 161 16 16 Product P 188 Product P 94 17 17 Product Q 119 Product Q 63 18 18 Product R 188 Product R 94 19 19 Product S 169 Product S 111 20 20 Product T 124 Product T 197 21 21 <NA> NA Product U 81 22 22 <NA> NA Product V 93 23 23 <NA> NA Product W 199 24 24 <NA> NA Product X 80 25 25 <NA> NA Product Y 104 26 26 <NA> NA Product Z 65 27 27 <NA> NA Product NA 112 28 28 <NA> NA Product NA 116 29 29 <NA> NA Product NA 58 30 30 <NA> NA Product NA 167
Explanation of the Code:
- We first load the datasets using the load function.
- We then use the full_join function from the dplyr package to join the warehouse1 and warehouse2 datasets on the product_id column. The suffix argument is used to distinguish between columns from the two warehouses.
- Finally, we display the result to see a comprehensive inventory list.
Interpretation of Results
The resulting dataset inventory_full contains all rows from both the warehouse1 and warehouse2 datasets. If a product is only in one warehouse, the columns from the other warehouse will have NULL values. As we see in our result products O to T, are available in both warehouses.
Homework for Readers
In the same full_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:
- companyA_employees: Contains information about employees from company A.
- Columns: employee_id, name, department
- companyB_employees: Contains information about employees from company B.
- Columns: employee_id, name, department
Your task is to perform a full join on these datasets to ensure all employees are accounted for from both companies and who is working for both. Use the employee_id column for joining.
Semi Join
Introduction to Semi Join
A Semi Join returns all rows from the left table where there are matching values in the right table, but does not duplicate columns from the right table. It is useful for filtering the left table based on the presence of matching rows in the right table.
Explanation of the Scenario
In this scenario, we have customer information and order records. We want to find all customers who have made orders. This helps in identifying active customers.
Data file: https://github.com/kgryczan/medium_publishing/blob/main/semi_join_data.RData
Description of the Datasets
We will use two datasets:
- customers: Contains information about customers.
- Columns: customer_id, name, address
- orders: Contains information about customer orders.
- Columns: order_id, customer_id, order_date
Step-by-Step Code Examples
Loading the datasets
# Load the necessary libraries library(dplyr) # Load the datasets load("semi_join_data.RData") # Display the datasets print(customers, n=5) # A tibble: 30 × 3 customer_id name address <int> <chr> <chr> 1 1 Alice F 423 Pine St 2 2 Bob NA 779 Elm St 3 3 Carol B 257 Oak St 4 4 Zoe O 452 Elm St 5 5 Alice F 73 Pine St # ℹ 25 more rows print(orders, n=5) # A tibble: 30 × 3 order_id customer_id order_date <int> <int> <date> 1 101 11 2024-01-01 2 102 3 2024-01-02 3 103 18 2024-01-03 4 104 29 2024-01-04 5 105 9 2024-01-05 # ℹ 25 more rows
Performing the Semi Join
# Perform the semi join customers_with_orders <- semi_join(customers, orders, by = "customer_id") # Display the result print.AsIs(customers_with_orders) customer_id name address 1 1 Alice F 423 Pine St 2 3 Carol B 257 Oak St 3 5 Alice F 73 Pine St 4 6 Bob NA 587 Pine St 5 8 Zoe V 475 Elm St 6 9 Alice P 397 Oak St 7 10 Bob P 804 Pine St 8 11 Carol O 961 Pine St 9 12 Zoe I 14 Pine St 10 13 Alice X 104 Pine St 11 14 Bob I 981 Elm St 12 17 Alice R 295 Elm St 13 18 Bob NA 393 Maple St 14 20 Zoe NA 845 Maple St 15 21 Alice X 145 Elm St 16 22 Bob I 179 Maple St 17 23 Carol W 140 Oak St 18 24 Zoe Y 431 Elm St 19 25 Alice M 261 Oak St 20 26 Bob E 4 Maple St 21 29 Alice Z 609 Pine St
Explanation of the Code:
- We first load the datasets using the load function.
- We then use the semi_join function from the dplyr package to filter the customers dataset to include only those customers who have matching entries in the orders dataset, based on the customer_id column.
- Finally, we display the result to see which customers have made orders.
Interpretation of Results
The resulting dataset customers_with_orders contains only the rows from the customers dataset where there is a matching row in the orders dataset. This means that only customers who have made at least one order are included.
Homework for Readers
In the same semi_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:
- products: Contains information about products.
- Columns: product_id, product_name, category
- reviews: Contains information about product reviews.
- Columns: review_id, product_id, review_date, rating
Your task is to perform a semi join on these datasets to identify products that have been reviewed by customers. Use the product_id column for joining.
In this first part of our series, we’ve embarked on a journey to demystify data joins in R. We’ve covered the foundational types of joins that are essential for any data analyst: Inner Join, Left Join, Right Join, Full Join, and Semi Join. Through practical, real-life scenarios and step-by-step code examples, we explored how to combine datasets to gain valuable insights.
We’ve seen how Inner Joins help us find orders that have been paid, Left Joins reveal products that haven’t been sold, Right Joins show responses that didn’t belong to any campaign, Full Joins provide a comprehensive view of inventory across warehouses, and Semi Joins filter customers who have made orders. Each of these joins plays a critical role in data analysis, enabling us to connect disparate pieces of information in meaningful ways.
Next week, we’ll continue our exploration by diving into more advanced join techniques. We’ll cover Anti Joins, Cross Joins, Natural Joins, Self Joins, and Equi Joins, each with their own unique applications and benefits. Additionally, we’ll set some challenging exercises to reinforce your learning and build confidence in applying these joins to your own data projects.
Stay tuned for the next installment, where we continue to unlock the power of data joins in R and take your data analysis skills to the next level. Happy coding!
Joins Are No Mystery Anymore: Hands-On Tutorial — Part 1 was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.