Joins Are No Mystery Anymore: Hands-On Tutorial — Part 3

[This article was first published on Numbers around us - Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Joins Are No Mystery Anymore: Hands-On Tutorial — Part 3

Welcome back to the final installment of our series, “Joins Are No Mystery Anymore: Hands-On Tutorial.” In the previous weeks, we covered the foundational and advanced join techniques, including Inner Join, Left Join, Right Join, Full Join, Semi Join, Anti Join, Cross Join, Natural Join, Self Join, and Equi Join. We’ve seen how these joins can be applied to real-life scenarios to solve various data problems.

Today, we’ll dive even deeper into the world of data joins by exploring Non-Equi Joins, Rolling Joins, Overlap Joins, and Fuzzy Joins. These specialized joins will help you handle more complex data scenarios, such as matching based on non-equality conditions, finding the nearest matches, and dealing with approximate or fuzzy data.

Let’s get started with our first join of the day: Non-Equi Join.

Non-Equi Join

A Non-Equi Join is used to join tables based on non-equality conditions, such as greater than (>) or less than (<). This type of join is particularly useful when dealing with ranges or thresholds.

Explanation of the Scenario

In this scenario, we have sales and targets. We want to find sales that exceeded the targets. This helps in identifying successful sales that met or surpassed the set goals.

Description of the Datasets

We will use two datasets:

  • sales: Contains information about sales.
  • Columns: sale_id, amount, date
  • targets: Contains information about sales targets.
  • Columns: target_id, target_amount, target_date

Step-by-Step Code Examples

Loading the datasets:

# Load the necessary libraries
library(dplyr)
library(data.table)

# Load the datasets
load("non_equi_join_data.RData")

# Display the datasets
print(sales)

# A tibble: 20 × 3
   sale_id amount date      
     <int>  <int> <date>    
 1       1    178 2024-01-01
 2       2    219 2024-01-02
 3       3    111 2024-01-03
 4       4    266 2024-01-04
 5       5    208 2024-01-05
 6       6    231 2024-01-06
 7       7    296 2024-01-07
 8       8    242 2024-01-08
 9       9    149 2024-01-09
10      10    245 2024-01-10
11      11    137 2024-01-11
12      12    175 2024-01-12
13      13    209 2024-01-13
14      14    133 2024-01-14
15      15    256 2024-01-15
16      16    128 2024-01-16
17      17    107 2024-01-17
18      18    220 2024-01-18
19      19    295 2024-01-19
20      20    235 2024-01-20

print(targets)

# A tibble: 10 × 3
   target_id target_amount target_date
       <int>         <dbl> <date>     
 1         1           100 2024-01-01 
 2         2           120 2024-01-06 
 3         3           140 2024-01-11 
 4         4           160 2024-01-16 
 5         5           180 2024-01-21 
 6         6           200 2024-01-26 
 7         7           220 2024-01-31 
 8         8           240 2024-02-05 
 9         9           260 2024-02-10 
10        10           280 2024-02-15 

Performing the Non-Equi Join

# Convert to data.table
sales_dt <- as.data.table(sales)
targets_dt <- as.data.table(targets)

# Perform the non-equi join
successful_sales <- sales_dt[targets_dt, on = .(amount > target_amount), nomatch = 0]

# Display the result
print(successful_sales)

     sale_id amount       date target_id target_date
       <int>  <int>     <Date>     <int>      <Date>
  1:       1    100 2024-01-01         1  2024-01-01
  2:       2    100 2024-01-02         1  2024-01-01
  3:       3    100 2024-01-03         1  2024-01-01
  4:       4    100 2024-01-04         1  2024-01-01
  5:       5    100 2024-01-05         1  2024-01-01
 ---                                                
106:       4    260 2024-01-04         9  2024-02-10
107:       7    260 2024-01-07         9  2024-02-10
108:      19    260 2024-01-19         9  2024-02-10
109:       7    280 2024-01-07        10  2024-02-15
110:      19    280 2024-01-19        10  2024-02-15

Explanation of the Code:

  • We first load the datasets using the load function.
  • We then convert the sales and targets datasets to data.tables for efficient non-equi joins.
  • We perform the non-equi join using the on argument to specify the non-equality condition (amount > target_amount).
  • The nomatch = 0 argument ensures that only rows with matches are included in the result.
  • Finally, we display the result to see which sales exceeded the targets.

Interpretation of Results

The resulting dataset successful_sales contains only the rows from the sales dataset where the amount exceeds the target_amount from the targets dataset.

Homework for Readers

In the same non_equi_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:

  • students: Contains information about students.
  • Columns: student_id, name, grade
  • scholarship_thresholds: Contains information about scholarship thresholds.
  • Columns: threshold_id, min_grade

Your task is to perform a non-equi join on these datasets to match students with scholarship thresholds they exceed. Use the grade and min_grade columns for the non-equality condition.

Rolling Join

A Rolling Join is used to join two tables based on a key column, with the ability to match the nearest value when an exact match is not found. This is particularly useful for time series data or any scenario where you need to find the closest preceding or following value.

Explanation of the Scenario

In this scenario, we have stock prices and company events. We want to join these tables to match stock prices with the nearest company events. This helps in understanding how company events might have influenced stock prices.

Description of the Datasets

We will use two datasets:

  • stock_prices: Contains information about stock prices.
  • Columns: date, stock_id, price
  • events: Contains information about company events.
  • Columns: event_id, stock_id, event_date, description

Step-by-Step Code Examples

Loading the datasets:

# Load the necessary libraries
library(dplyr)
library(data.table)

# Load the datasets
load("rolling_join_data.RData")

# Display the datasets
print(stock_prices)

# A tibble: 20 × 3
   date       stock_id price
   <date>        <dbl> <int>
 1 2024-01-01        1   106
 2 2024-01-02        1   106
 3 2024-01-03        1   108
 4 2024-01-04        1   103
 5 2024-01-05        1   103
 6 2024-01-06        1   109
 7 2024-01-07        1   101
 8 2024-01-08        1   104
 9 2024-01-09        1   107
10 2024-01-10        1   101
11 2024-01-11        1   108
12 2024-01-12        1   102
13 2024-01-13        1   107
14 2024-01-14        1   108
15 2024-01-15        1   101
16 2024-01-16        1   100
17 2024-01-17        1   104
18 2024-01-18        1   104
19 2024-01-19        1   104
20 2024-01-20        1   101

print(events)

# A tibble: 3 × 4
  event_id stock_id event_date description      
     <int>    <dbl> <date>     <chr>            
1        1        1 2024-01-05 Quarterly Meeting
2        2        1 2024-01-15 Product Launch   
3        3        1 2024-01-25 Earnings Call 

Performing the Rolling Join

# Convert to data.table
stock_prices_dt <- as.data.table(stock_prices)
events_dt <- as.data.table(events)

# Set keys for rolling join
setkey(stock_prices_dt, stock_id, date)
setkey(events_dt, stock_id, event_date)

# Perform the rolling join
stock_events <- events_dt[stock_prices_dt, roll = "nearest", on = .(stock_id, event_date = date)]

# Display the result
print(stock_events)

Key: <stock_id, event_date>
    event_id stock_id event_date       description price
       <int>    <num>     <Date>            <char> <int>
 1:        1        1 2024-01-01 Quarterly Meeting   106
 2:        1        1 2024-01-02 Quarterly Meeting   106
 3:        1        1 2024-01-03 Quarterly Meeting   108
 4:        1        1 2024-01-04 Quarterly Meeting   103
 5:        1        1 2024-01-05 Quarterly Meeting   103
 6:        1        1 2024-01-06 Quarterly Meeting   109
 7:        1        1 2024-01-07 Quarterly Meeting   101
 8:        1        1 2024-01-08 Quarterly Meeting   104
 9:        1        1 2024-01-09 Quarterly Meeting   107
10:        1        1 2024-01-10 Quarterly Meeting   101
11:        2        1 2024-01-11    Product Launch   108
12:        2        1 2024-01-12    Product Launch   102
13:        2        1 2024-01-13    Product Launch   107
14:        2        1 2024-01-14    Product Launch   108
15:        2        1 2024-01-15    Product Launch   101
16:        2        1 2024-01-16    Product Launch   100
17:        2        1 2024-01-17    Product Launch   104
18:        2        1 2024-01-18    Product Launch   104
19:        2        1 2024-01-19    Product Launch   104
20:        2        1 2024-01-20    Product Launch   101
    event_id stock_id event_date       description price

Explanation of the Code:

  • We first load the datasets using the load function.
  • We then convert the stock_prices and events datasets to data.tables for efficient rolling joins.
  • We set the keys for the rolling join using the setkey function on the stock_id and date columns for stock_prices, and stock_id and event_date columns for events.
  • We perform the rolling join using the roll argument set to “nearest”, which finds the closest match in terms of date.
  • Finally, we display the result to see the stock prices matched with the nearest company events.

Interpretation of Results

The resulting dataset stock_events contains the rows from the stock_prices dataset matched with the nearest preceding or following event from the events dataset based on the date and event_date columns.

Homework for Readers

In the same rolling_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:

  • weather_records: Contains information about weather records.
  • Columns: record_id, date, temperature
  • weather_events: Contains information about significant weather events.
  • Columns: event_id, event_date, event_description

Your task is to perform a rolling join on these datasets to match weather records with the nearest significant weather events. Use the date and event_date columns for the rolling join.

Overlap Join

An Overlap Join is used to join tables based on overlapping ranges of values. This type of join is particularly useful for scenarios where you need to find overlapping time periods or other ranges.

Explanation of the Scenario

In this scenario, we have hotel bookings and we want to find overlapping bookings. This helps in identifying potential overbookings and managing reservations effectively.

Description of the Datasets

We will use one dataset:

  • bookings: Contains information about hotel bookings.
  • Columns: booking_id, room_id, start_date, end_date

Step-by-Step Code Examples

Loading the dataset:

# Load the necessary libraries
library(dplyr)
library(fuzzyjoin)

# Load the dataset
load("overlap_join_data.RData")

# Display the dataset
print(bookings)

# A tibble: 20 × 4
   booking_id room_id start_date end_date  
        <int>   <int> <date>     <date>    
 1          1     109 2024-01-01 2024-01-05
 2          2     111 2024-01-03 2024-01-07
 3          3     118 2024-01-05 2024-01-09
 4          4     104 2024-01-07 2024-01-11
 5          5     103 2024-01-09 2024-01-13
 6          6     103 2024-01-11 2024-01-15
 7          7     101 2024-01-13 2024-01-17
 8          8     101 2024-01-15 2024-01-19
 9          9     101 2024-01-17 2024-01-21
10         10     103 2024-01-19 2024-01-23
11         11     101 2024-01-21 2024-01-25
12         12     116 2024-01-23 2024-01-27
13         13     111 2024-01-25 2024-01-29
14         14     116 2024-01-27 2024-01-31
15         15     104 2024-01-29 2024-02-02
16         16     103 2024-01-31 2024-02-04
17         17     110 2024-02-02 2024-02-06
18         18     114 2024-02-04 2024-02-08
19         19     111 2024-02-06 2024-02-10
20         20     102 2024-02-08 2024-02-12

Performing the Overlap Join

# Perform the overlap join
overlapping_bookings <- fuzzy_left_join(
  bookings, bookings,
  by = c("room_id" = "room_id", "start_date" = "end_date", "end_date" = "start_date"),
  match_fun = list(`==`, `<=`, `>=`)
) %>%
filter(booking_id.x != booking_id.y)

# Display the result
print(overlapping_bookings)

# A tibble: 12 × 8
   booking_id.x room_id.x start_date.x end_date.x booking_id.y room_id.y start_date.y end_date.y
          <int>     <int> <date>       <date>            <int>     <int> <date>       <date>    
 1            5       103 2024-01-09   2024-01-13            6       103 2024-01-11   2024-01-15
 2            6       103 2024-01-11   2024-01-15            5       103 2024-01-09   2024-01-13
 3            7       101 2024-01-13   2024-01-17            8       101 2024-01-15   2024-01-19
 4            7       101 2024-01-13   2024-01-17            9       101 2024-01-17   2024-01-21
 5            8       101 2024-01-15   2024-01-19            7       101 2024-01-13   2024-01-17
 6            8       101 2024-01-15   2024-01-19            9       101 2024-01-17   2024-01-21
 7            9       101 2024-01-17   2024-01-21            7       101 2024-01-13   2024-01-17
 8            9       101 2024-01-17   2024-01-21            8       101 2024-01-15   2024-01-19
 9            9       101 2024-01-17   2024-01-21           11       101 2024-01-21   2024-01-25
10           11       101 2024-01-21   2024-01-25            9       101 2024-01-17   2024-01-21
11           12       116 2024-01-23   2024-01-27           14       116 2024-01-27   2024-01-31
12           14       116 2024-01-27   2024-01-31           12       116 2024-01-23   2024-01-27

Explanation of the Code:

  • We first load the dataset using the load function.
  • We then use the fuzzy_left_join function from the fuzzyjoin package to perform the overlap join. The by argument specifies the columns to join on, and the match_fun argument specifies the matching conditions for each column.
  • We filter the results to exclude self-joins by ensuring booking_id.x is not equal to booking_id.y.
  • Finally, we display the result to see the overlapping bookings.

Interpretation of Results

The resulting dataset overlapping_bookings contains pairs of rows from the bookings dataset where the bookings overlap based on the room_id, start_date, and end_date columns.

Homework for Readers

In the same overlap_join_data.RData file, there is another set of datasets for a more creative scenario. You will find:

  • projects: Contains information about projects.
  • Columns: project_id, project_name, start_date, end_date

Your task is to perform an overlap join on these datasets to find overlapping project timelines. Use the start_date and end_date columns for the overlap join.

Fuzzy Join

A Fuzzy Join is used to join tables based on approximate or “fuzzy” matching of key columns. Unlike traditional joins, which require exact matches between columns, fuzzy joins allow for matches based on similarity, proximity, or other non-exact criteria. This is particularly useful when dealing with data that has inconsistencies, such as typos, different naming conventions, or slight variations in values.

Fuzzy joins can be used in various scenarios, such as:

  • Merging customer records from different sources where names or addresses might be slightly different.
  • Matching products from different databases where product names might vary.
  • Combining historical documents with different naming conventions.

Fuzzy joins leverage different methods of similarity measurement, such as string distance (e.g., Levenshtein distance), numeric proximity, or custom matching functions, to find the best possible matches between rows.

Explanation of the Scenario

In this scenario, we have customer records from two different sources. We want to join these tables to combine records that refer to the same customers, even if there are slight differences in the names or addresses. This helps in consolidating customer data from multiple sources into a single, unified view.

Description of the Datasets

We will use two datasets:

  • customer_records_A: Contains customer information from source A.
  • Columns: customer_id, name, address
  • customer_records_B: Contains customer information from source B.
  • Columns: customer_id, name, address

Step-by-Step Code Examples

Loading the datasets:

# Load the necessary libraries
library(dplyr)
library(fuzzyjoin)

# Load the datasets
load("fuzzy_join_data.RData")

# Display the datasets
print(customer_records_A)

                Name                Address ID
1        Terry Welsh       5028 Paddock Way  1
2        Robin Lewis      3415 Photinia Ave  2
3      Tyrone Carter         5697 Smokey Ln  3
4        Noah Fuller      4028 Northaven Rd  4
5     Heather Barnes      5530 First Street  5
6          Alex Peck  7132 Mockingbird Hill  6
7   Alfredo Martinez         8920 Smokey Ln  7
8     Adrian Morales         6567 Nowlin Rd  8
9       Melvin Paine        8310 Fincher Rd  9
10     Harry Edwards    8848 Valley View Ln 10
11 Esther Williamson          2644 Daisy Dr 11
12   Stella Campbell   9021 E Sandy Lake Rd 12
13    Lawrence Grant      3221 First Street 13
14       Vivan Perez        996 Wycliff Ave 14
15        Eli Brewer  7344 Wheeler Ridge Dr 15
16     Edward Wagner      7174 W Sherman Dr 16
17        Dwayne Day 2115 Groveland Terrace 17
18      Erika Flores         3566 Nowlin Rd 18
19   Nicholas Nelson           7867 Dane St 19
20       Wade Willis    8608 Pecan Acres Ln 20

print(customer_records_B)

               Name                Address ID
1       Terry Welch       5028 Paddock Way  1
2  Courtney Elliott        6526 Cackson St  2
3      Tyron Carter         5697 Smokey Ln  3
4      Noah Fueller      4028 Northaven Rd  4
5     Jesus Herrera       722 Hillcrest Rd  5
6        Isaac Neal     308 W Campbell Ave  6
7      Annette Carr      2087 Photinia Ave  7
8      Rebecca Boyd      7584 Homestead Rd  8
9      Melvin Payne        8310 Fincher Rd  9
10      Irma Bowman      9065 Valwood Pkwy 10
11  Heather Wallace         95 Railroad St 11
12      Janice West        1545 W Pecan St 12
13    Dianne Chavez        326 Robinson Rd 13
14     Vivian Perez        996 Wycliff Ave 14
15      Dustin Wood 2677 Groveland Terrace 15
16     Calvin Jones      5859 Samaritan Dr 16
17     Tara Carroll  2215 Rolling Green Rd 17
18  Francis Gardner         1242 Sunset St 18
19  Bryan Henderson      5781 Ranchview Dr 19
20    Michelle Bell     9072 Westheimer Rd 20

Performing the Fuzzy Join

# Perform the fuzzy join
customer_matches <- stringdist_left_join(
  customer_records_A, customer_records_B,
  by = "Name",
  max_dist = 2,
  distance_col = "dist"
) %>% 
  filter(dist < 2)

# Display the result
print(customer_matches)

         Name.x         Address.x ID.x       Name.y         Address.y ID.y dist
1   Terry Welsh  5028 Paddock Way    1  Terry Welch  5028 Paddock Way    1    1
2 Tyrone Carter    5697 Smokey Ln    3 Tyron Carter    5697 Smokey Ln    3    1
3   Noah Fuller 4028 Northaven Rd    4 Noah Fueller 4028 Northaven Rd    4    1
4  Melvin Paine   8310 Fincher Rd    9 Melvin Payne   8310 Fincher Rd    9    1
5   Vivan Perez   996 Wycliff Ave   14 Vivian Perez   996 Wycliff Ave   14    1

Explanation of the Code:

  • We first load the datasets using the load function.
  • We then use the stringdist_left_join function from the fuzzyjoin package to perform the fuzzy join. The by argument specifies the column to join on (name), and the max_dist argument specifies the maximum allowable distance for matches (2 in this case).
  • The distance_col argument adds a column to the result showing the computed distance between the matched names.
  • Finally, we display the result to see which customer records were matched based on fuzzy name matching.

Interpretation of Results

The resulting dataset customer_matches contains rows from customer_records_A matched with the closest approximate rows from customer_records_B based on the name column. The dist column shows the computed distance between the matched names.

In this final installment of our series, “Joins Are No Mystery Anymore: Hands-On Tutorial,” we’ve taken a deep dive into specialized join techniques that are essential for handling more complex data scenarios. We’ve covered:

  • Non-Equi Join: Matching rows based on non-equality conditions, useful for comparing ranges or thresholds.
  • Rolling Join: Joining tables to find the nearest matches when an exact match is not found, ideal for time series data.
  • Overlap Join: Identifying overlapping ranges, such as booking dates or project timelines.
  • Fuzzy Join: Combining tables based on approximate matches, invaluable for dealing with inconsistent data.

Through practical examples and detailed code walkthroughs, we demonstrated how these advanced joins can solve real-world data problems. We’ve also provided homework tasks to reinforce your learning and give you hands-on experience with these techniques.

What’s Next?

In the bonus section, “Anatomy of a Basic Joining Function,” we’ll explore the different arguments in joining functions, explaining what each one does and how they change the output. This deep dive will enhance your understanding and give you even greater control over your data analysis.

Stay tuned as we uncover the intricacies of joining functions and provide you with the tools to master data joins in R. Thank you for joining us on this journey, and happy coding!

Anatomy of Basic Join Functions (from dplyr)

In this section, we’ll delve into the anatomy of basic joining functions in the dplyr package. We'll explain the different arguments you can use, how they affect the output, and provide examples to illustrate their usage.

1. by

Specifies the columns to join by. If not provided, dplyr will join by columns with the same name in both tables.

2. suffix

Determines the suffixes added to duplicate column names from the left and right tables. By default, it is set to c(".x", ".y").

3. copy

A logical argument that allows joining of data frames located in different databases. By default, it is set to FALSE.

4. keep

An argument in full_join that keeps the join columns in the output.

Example

# Example of different arguments in a join function
joined_data <- left_join(
  x = df1, 
  y = df2, 
  by = "id", 
  suffix = c("_left", "_right"), 
  copy = TRUE, 
  keep = TRUE
)

Joins Are No Mystery Anymore: Hands-On Tutorial — Part 3 was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: Numbers around us - Medium.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)