The unequalled joy of non-equi joins

David Selby

1 year ago

[This article was first published on R on Tea & Stats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A common task in data analysis is to merge or join two tables according to shared keys or values. The operation is perhaps most commonly associated with relational databases and structured query language (SQL) but it’s just as useful in R with data frames.

Most joins are equi-joins, matching rows according to two columns having exactly equal values. These are easy to perfom in R using the base merge() function, the various join() functions in dplyr and the X[i] syntax of data.table.

But sometimes we need non-equi joins or \(\theta\)-joins, where the matching condition is an interval or a set of inequalities. Other situations call for a rolling join, used to link records according to their proximity in a time sequence.

How do you perform non-equi joins and rolling joins in R?

Motivating example

A famous YouTuber is testing out a new marketing strategy, promoting specific videos on social media. They collect daily view counts of old videos and want to summarise these counts in the days following each promotion.

The tables promos and views are as follows. The data span a two-week period, but for certain days and certain videos, the view counts are missing.

video	promo_date
A	2020-04-01
A	2020-04-07
B	2020-04-03
B	2020-04-08
C	2020-04-03

video	view_date	views
A	2020-04-01	992
A	2020-04-02	3304
A	2020-04-03	1417
A	2020-04-04	191
A	2020-04-05	2366
A	2020-04-06	7318
A	2020-04-07	1570
A	2020-04-08	2051
A	2020-04-09	5958
A	2020-04-10	3574
A	2020-04-11	6724
A	2020-04-12	12043
A	2020-04-13	481
A	2020-04-14	286
B	2020-04-01	6270
B	2020-04-02	1549
B	2020-04-03	2376
B	2020-04-04	3111
B	2020-04-05	6228
B	2020-04-06	1852
B	2020-04-07	24314
B	2020-04-08	3329
B	2020-04-09	24980
B	2020-04-10	1118
B	2020-04-11	6057
B	2020-04-14	1400
C	2020-04-01	1156
C	2020-04-02	6435
C	2020-04-03	2847
C	2020-04-04	15093
C	2020-04-05	2488
C	2020-04-06	1773
C	2020-04-07	8782
C	2020-04-08	3687
C	2020-04-11	1963
C	2020-04-12	9510
C	2020-04-13	3891
C	2020-04-14	3282

What are the mean view counts on videos in the three days immediately following promotions for those videos?

I will show how you might accomplish this task using either non-equi joins or rolling joins in R.

Crossing + filter with dplyr

The package dplyr has no function for joining on anything other than an equality relation. However, you can get the same results (possibly less efficiently) using an outer join or Cartesian product, followed by a filtering operation.

library(dplyr)
views %>%
  # Crossing
  full_join(promos) %>%
  # Filter
  filter(view_date >= promo_date,
         view_date <= promo_date + 3) %>%
  # Aggregate
  group_by(video) %>%
  summarise(mean(views), sd(views), `n days` = n())

video	mean(views)	sd(views)	n days
A	2382	1832	8
B	6131	7836	8
C	5550	6377	4

It would also be possible to compare promoted days with non-promoted days for each video, either by creating a binary indicator (instead of a filter) or by rejoining the table with the original dataset after the filter step.

Non-equi joins with sqldf

The sqldf package lets you query R data frames with SQL, as if you were working with a relational database. The result of the query is another data frame, and performance is sometimes better than equivalent R functions.

library(sqldf)
sqldf('SELECT v.video, view_date, views
       FROM views v
       JOIN promos p
       ON v.video = p.video AND
          view_date BETWEEN promo_date AND promo_date + 3'
      ) %>%
  group_by(video) %>%
  summarise(mean(views), sd(views), `n days` = n())

video	mean(views)	sd(views)	n days
A	2382	1832	8
B	6131	7836	8
C	5550	6377	4

The aggregation step could also be written in SQL, but it makes sense only to use SQL where it is absolutely needed, and to use native R functions for everything else.

Having access to this functionality is very powerful, but has the obvious disadvantage that you need to learn a bit of SQL to understand the syntax.

Non-equi joins with data.table

The high-performance data manipulation package data.table now (as of v1.9.8) supports non-equi joins.

Non-equi joins are made possible with the X[i] merging syntax and the on argument. It’s slightly less flexible than the equivalent SQL, because you can’t just write promo_date + 3 in the inequality: instead it needs to be an explicit column in the table. You also can’t use the infix %between% operator, so two inequalities have to do instead. Otherwise the syntax is similar. Like in SQL, a prefix is used to disambiguate the column names: here it’s x.name.

library(data.table)
setDT(views)
setDT(promos)[, promo_end := promo_date + 3]

video	promo_date	promo_end
A	2020-04-01	2020-04-04
A	2020-04-07	2020-04-10
B	2020-04-03	2020-04-06
B	2020-04-08	2020-04-11
C	2020-04-03	2020-04-06

views[promos,
      .(video, views, x.view_date),
      # Non equi join:
      on = .(video,
             view_date >= promo_date,
             view_date <= promo_end)
      ][, # Chain into aggregate:
        .(mean = mean(views), sd = sd(views), .N),
        by = video]

video	mean	sd	N
A	2382	1832	8
B	6131	7836	8
C	5550	6377	4

Rolling joins with data.table

This particular example, since it involves a time variable, is even simpler using a rolling join. The concept is a bit confusing, but essentially it attributes records in one table with the most recent preceding records in the second table. You can read more about rolling joins in this blog post by Robert Norberg.

When performing rolling joins in data.table, one of the joining time columns gets dropped, which can make it hard to identify your records if they don’t have an explicit ID. To mitigate this, we will copy each date column to the name join_date and join on that.

views[, join_date := view_date]
promos[, join_date := promo_date]

setkey(views, video, join_date)
setkey(promos, video, join_date)
promos[views, roll = TRUE]

video	promo_date	join_date	view_date	views
A	2020-04-01	2020-04-01	2020-04-01	992
A	2020-04-01	2020-04-02	2020-04-02	3304
A	2020-04-01	2020-04-03	2020-04-03	1417
A	2020-04-01	2020-04-04	2020-04-04	191
A	2020-04-01	2020-04-05	2020-04-05	2366
A	2020-04-01	2020-04-06	2020-04-06	7318
A	2020-04-07	2020-04-07	2020-04-07	1570
A	2020-04-07	2020-04-08	2020-04-08	2051
A	2020-04-07	2020-04-09	2020-04-09	5958
A	2020-04-07	2020-04-10	2020-04-10	3574
A	2020-04-07	2020-04-11	2020-04-11	6724
A	2020-04-07	2020-04-12	2020-04-12	12043
A	2020-04-07	2020-04-13	2020-04-13	481
A	2020-04-07	2020-04-14	2020-04-14	286
B	NA	2020-04-01	2020-04-01	6270
B	NA	2020-04-02	2020-04-02	1549
B	2020-04-03	2020-04-03	2020-04-03	2376
B	2020-04-03	2020-04-04	2020-04-04	3111
B	2020-04-03	2020-04-05	2020-04-05	6228
B	2020-04-03	2020-04-06	2020-04-06	1852
B	2020-04-03	2020-04-07	2020-04-07	24314
B	2020-04-08	2020-04-08	2020-04-08	3329
B	2020-04-08	2020-04-09	2020-04-09	24980
B	2020-04-08	2020-04-10	2020-04-10	1118
B	2020-04-08	2020-04-11	2020-04-11	6057
B	2020-04-08	2020-04-14	2020-04-14	1400
C	NA	2020-04-01	2020-04-01	1156
C	NA	2020-04-02	2020-04-02	6435
C	2020-04-03	2020-04-03	2020-04-03	2847
C	2020-04-03	2020-04-04	2020-04-04	15093
C	2020-04-03	2020-04-05	2020-04-05	2488
C	2020-04-03	2020-04-06	2020-04-06	1773
C	2020-04-03	2020-04-07	2020-04-07	8782
C	2020-04-03	2020-04-08	2020-04-08	3687
C	2020-04-03	2020-04-11	2020-04-11	1963
C	2020-04-03	2020-04-12	2020-04-12	9510
C	2020-04-03	2020-04-13	2020-04-13	3891
C	2020-04-03	2020-04-14	2020-04-14	3282

The syntax promos[views, roll=TRUE] means “which promotion immediately preceded each viewing date?” Conversely, views[promos, roll=TRUE] means “which viewing dates immediately preceded each promotion?”

In this case, we want something close to the former, but we’re only interested in promos in the past 3 days, whereas by default it’ll extend back into the depths of time looking for the last one, regardless of how long ago.

By changing roll = TRUE to roll = 3 the join will fail to match when the join_date differs by more than three days between the two tables. If we wanted to go in the opposite direction in time, we could use roll = -Inf to search forwards in time for future promotions, and roll = -3 for only those in the following three days.

promos[views, roll = 3]

video	promo_date	join_date	view_date	views
A	2020-04-01	2020-04-01	2020-04-01	992
A	2020-04-01	2020-04-02	2020-04-02	3304
A	2020-04-01	2020-04-03	2020-04-03	1417
A	2020-04-01	2020-04-04	2020-04-04	191
A	NA	2020-04-05	2020-04-05	2366
A	NA	2020-04-06	2020-04-06	7318
A	2020-04-07	2020-04-07	2020-04-07	1570
A	2020-04-07	2020-04-08	2020-04-08	2051
A	2020-04-07	2020-04-09	2020-04-09	5958
A	2020-04-07	2020-04-10	2020-04-10	3574
A	NA	2020-04-11	2020-04-11	6724
A	NA	2020-04-12	2020-04-12	12043
A	NA	2020-04-13	2020-04-13	481
A	NA	2020-04-14	2020-04-14	286
B	NA	2020-04-01	2020-04-01	6270
B	NA	2020-04-02	2020-04-02	1549
B	2020-04-03	2020-04-03	2020-04-03	2376
B	2020-04-03	2020-04-04	2020-04-04	3111
B	2020-04-03	2020-04-05	2020-04-05	6228
B	2020-04-03	2020-04-06	2020-04-06	1852
B	NA	2020-04-07	2020-04-07	24314
B	2020-04-08	2020-04-08	2020-04-08	3329
B	2020-04-08	2020-04-09	2020-04-09	24980
B	2020-04-08	2020-04-10	2020-04-10	1118
B	2020-04-08	2020-04-11	2020-04-11	6057
B	NA	2020-04-14	2020-04-14	1400
C	NA	2020-04-01	2020-04-01	1156
C	NA	2020-04-02	2020-04-02	6435
C	2020-04-03	2020-04-03	2020-04-03	2847
C	2020-04-03	2020-04-04	2020-04-04	15093
C	2020-04-03	2020-04-05	2020-04-05	2488
C	2020-04-03	2020-04-06	2020-04-06	1773
C	NA	2020-04-07	2020-04-07	8782
C	NA	2020-04-08	2020-04-08	3687
C	NA	2020-04-11	2020-04-11	1963
C	NA	2020-04-12	2020-04-12	9510
C	NA	2020-04-13	2020-04-13	3891
C	NA	2020-04-14	2020-04-14	3282

Since the default in data.table’s X[i] merge syntax is nomatch = NA, we get all of the views back, with the column promo_date equal to the date of the last promotion (in the last three days), or NA if no such promotion was found. If we set nomatch = 0 then these non-matching values are dropped from the result.

So the full operation to calculate the summary figures is

promos[views, roll = 3, nomatch = 0
       ][j = .(mean = mean(views), sd = sd(views), .N),
         by = video]

video	mean	sd	N
A	2382	1832	8
B	6131	7836	8
C	5550	6377	4

If the intervals needed to be different lengths for each of the campaigns (i.e. not all equal to 3), then you would probably want a non-equi join rather than a rolling join in this case.

Extensions

Arguably, for these examples you’d want to compare the view counts in promotional periods with those outside promotional periods. This does not require a special kind of join; rather you perform the non-equi or rolling join as above and then wrangle the output accordingly.

If the non-matching rows are filtered out, you need to re-join with the original dataset. Otherwise, you need to use some sort of indicator variable for whether the viewing date falls within a promotional period or not.

In dplyr:

promos %>%
  full_join(views) %>%
  mutate(promo = between(view_date - promo_date, 0, 3)) %>%
  group_by(video, view_date) %>%
  summarise(promo = any(promo),
            views = unique(views)) %>%
  group_by(promo) %>%
  summarise(mean(views), sd(views), n = n())

promo	mean(views)	sd(views)	n
FALSE	5637	5772	18
TRUE	4515	5790	20

In data.table:

promos[views, roll = 3][
         j = .(mean = mean(views), sd = sd(views), .N),
         by = .(promo = !is.na(promo_date))]

promo	mean	sd	N
TRUE	4515	5790	20
FALSE	5637	5772	18

To leave a comment for the author, please follow the link and comment on their blog: R on Tea & Stats.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.