Understanding data.table Rolling Joins
[This article was first published on bRogramming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Understanding data.table Rolling Joins
Robert Norberg
June 5, 2016
Introduction
Rolling joins in
data.table
are incredibly useful, but not that well documented. I wrote this to help myself figure out how to use them and perhaps it can help you too.library(data.table)
The Setup
Imagine we have an eCommerce website that uses a third party (like PayPal) to handle payments. We track user sessions on our website and PayPal tracks our payments, and we would like to attribute sales to user sessions. This way we can answer all kinds of questions along the lines of “What types of sessions lead to sales?”
Lets create some example data for a diverse group of customers.
Indecisive Isabel shops a lot before she buys anything.
isabel_website <- data.table(name = rep('Indecisive Isabel', 5), session_start_time = as.POSIXct(c('2016-01-01 11:01', '2016-01-02 8:59', '2016-01-05 18:18', '2016-01-07 19:03', '2016-01-08 19:01'))) isabel_paypal <- data.table(name = 'Indecisive Isabel', purchase_time = as.POSIXct('2016-01-08 19:10'))
Spendy Sally visits the website once and makes multiple purchases.
sally_website <- data.table(name = 'Spendy Sally', session_start_time = as.POSIXct('2016-01-03 10:00')) sally_paypal <- data.table(name = rep('Spendy Sally', 2), purchase_time = as.POSIXct(c('2016-01-03 10:06', '2016-01-03 10:15')))
Frequent Francis comes to the site a lot and sometimes buys things.
francis_website <- data.table(name = rep('Frequent Francis', 6), session_start_time = as.POSIXct(c('2016-01-02 13:09', '2016-01-03 19:22', '2016-01-08 8:44', '2016-01-08 20:22', '2016-01-10 17:36', '2016-01-15 16:56'))) francis_paypal <- data.table(name = rep('Frequent Francis', 3), purchase_time = as.POSIXct(c('2016-01-03 19:28', '2016-01-08 20:33', '2016-01-10 17:46')))
Error-prone Erica has mysteriously managed to make a purchase before ever visiting the site!
erica_website <- data.table(name = rep('Error-prone Erica', 2), session_start_time = as.POSIXct(c('2016-01-04 19:12', '2016-01-04 21:05'))) erica_paypal <- data.table(name = 'Error-prone Erica', purchase_time = as.POSIXct('2016-01-03 08:02'))
Visitor Vivian visits our website a couple times, but never makes a purchase (so she appears in the website data, but not in the payment data).
vivian_website <- data.table(name = rep('Visitor Vivian', 2), session_start_time = as.POSIXct(c('2016-01-01 9:10', '2016-01-09 2:15'))) vivian_paypal <- erica_paypal[0] # has 0 rows, but the same column names/classes
And Mom sent money to my PayPal account before my website was up and running (so she appears in the payment data, but not in the website data).
mom_website <- vivian_website[0] # has 0 rows, but the same column names/classes mom_paypal <- data.table(name = 'Mom', purchase_time = as.POSIXct('2015-12-02 17:58'))
Combine these into two
data.table
s:website <- rbindlist(list(isabel_website, sally_website, francis_website, erica_website, vivian_website, mom_website)) paypal <- rbindlist(list(isabel_paypal, sally_paypal, francis_paypal, erica_paypal, vivian_paypal, mom_paypal))
To keep things straight, lets give each website session a unique ID and each payment a unique ID.
website[, session_id:=.GRP, by = .(name, session_start_time)] paypal[, payment_id:=.GRP, by = .(name, purchase_time)]
Finally, have a look at the data.
website ## name session_start_time session_id ## 1: Indecisive Isabel 2016-01-01 11:01:00 1 ## 2: Indecisive Isabel 2016-01-02 08:59:00 2 ## 3: Indecisive Isabel 2016-01-05 18:18:00 3 ## 4: Indecisive Isabel 2016-01-07 19:03:00 4 ## 5: Indecisive Isabel 2016-01-08 19:01:00 5 ## 6: Spendy Sally 2016-01-03 10:00:00 6 ## 7: Frequent Francis 2016-01-02 13:09:00 7 ## 8: Frequent Francis 2016-01-03 19:22:00 8 ## 9: Frequent Francis 2016-01-08 08:44:00 9 ## 10: Frequent Francis 2016-01-08 20:22:00 10 ## 11: Frequent Francis 2016-01-10 17:36:00 11 ## 12: Frequent Francis 2016-01-15 16:56:00 12 ## 13: Error-prone Erica 2016-01-04 19:12:00 13 ## 14: Error-prone Erica 2016-01-04 21:05:00 14 ## 15: Visitor Vivian 2016-01-01 09:10:00 15 ## 16: Visitor Vivian 2016-01-09 02:15:00 16 paypal ## name purchase_time payment_id ## 1: Indecisive Isabel 2016-01-08 19:10:00 1 ## 2: Spendy Sally 2016-01-03 10:06:00 2 ## 3: Spendy Sally 2016-01-03 10:15:00 3 ## 4: Frequent Francis 2016-01-03 19:28:00 4 ## 5: Frequent Francis 2016-01-08 20:33:00 5 ## 6: Frequent Francis 2016-01-10 17:46:00 6 ## 7: Error-prone Erica 2016-01-03 08:02:00 7 ## 8: Mom 2015-12-02 17:58:00 8
The Joins
Before doing any rolling joins, I like to create a separate date/time column in each table to join on because one of the two tables loses it’s date/time field and I can never remember which.
website[, join_time:=session_start_time] paypal[, join_time:=purchase_time]
Next, set keys on each table. The last key column is the one the rolling join will “roll” on. We want to first join on
name
and then within each name
, match website sessions to purchases. So we key on name
first, then on the newly created join_time
.setkey(website, name, join_time) setkey(paypal, name, join_time)
Rolling Forward
Now let’s answer the question “what website session immediately preceded each payment?”
website[paypal, roll = T] # equivalent to website[paypal, roll = Inf] ## name session_start_time session_id join_time ## 1: Error-prone Erica <NA> NA 2016-01-03 08:02:00 ## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00 ## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00 ## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00 ## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00 ## 6: Mom <NA> NA 2015-12-02 17:58:00 ## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00 ## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00 ## purchase_time payment_id ## 1: 2016-01-03 08:02:00 7 ## 2: 2016-01-03 19:28:00 4 ## 3: 2016-01-08 20:33:00 5 ## 4: 2016-01-10 17:46:00 6 ## 5: 2016-01-08 19:10:00 1 ## 6: 2015-12-02 17:58:00 8 ## 7: 2016-01-03 10:06:00 2 ## 8: 2016-01-03 10:15:00 3
Notice several things about this result:
- Each payment is matched to the closest preceding payment.
all(purchase_time > session_start_time, na.rm = T)
evaluates toTRUE
. - Payments with no preceding sessions still appear in the result (that is,
nrow(result) == nrow(paypal)
). - Visitor Vivian does not appear in the results because she does not appear in the
paypal
table. - Mom’s “purchase” has no website session associated with it because she has never visited the website at all.
- Error-prone Erica’s mysterious purchase has no website session associated with it because she never visited the website prior to her purchase.
- Spendy Sally’s one website session is matched to both of her purchases.
Rolling Backward
Now lets switch the order of the two tables and answer the question “which sessions led to a purchase?” In this case, we want to match payments to website sessions, so long as the payment occurred after the beginning of the website session.
paypal[website, roll = -Inf] ## name purchase_time payment_id join_time ## 1: Error-prone Erica <NA> NA 2016-01-04 19:12:00 ## 2: Error-prone Erica <NA> NA 2016-01-04 21:05:00 ## 3: Frequent Francis 2016-01-03 19:28:00 4 2016-01-02 13:09:00 ## 4: Frequent Francis 2016-01-03 19:28:00 4 2016-01-03 19:22:00 ## 5: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 08:44:00 ## 6: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 20:22:00 ## 7: Frequent Francis 2016-01-10 17:46:00 6 2016-01-10 17:36:00 ## 8: Frequent Francis <NA> NA 2016-01-15 16:56:00 ## 9: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-01 11:01:00 ## 10: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-02 08:59:00 ## 11: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-05 18:18:00 ## 12: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-07 19:03:00 ## 13: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-08 19:01:00 ## 14: Spendy Sally 2016-01-03 10:06:00 2 2016-01-03 10:00:00 ## 15: Visitor Vivian <NA> NA 2016-01-01 09:10:00 ## 16: Visitor Vivian <NA> NA 2016-01-09 02:15:00 ## session_start_time session_id ## 1: 2016-01-04 19:12:00 13 ## 2: 2016-01-04 21:05:00 14 ## 3: 2016-01-02 13:09:00 7 ## 4: 2016-01-03 19:22:00 8 ## 5: 2016-01-08 08:44:00 9 ## 6: 2016-01-08 20:22:00 10 ## 7: 2016-01-10 17:36:00 11 ## 8: 2016-01-15 16:56:00 12 ## 9: 2016-01-01 11:01:00 1 ## 10: 2016-01-02 08:59:00 2 ## 11: 2016-01-05 18:18:00 3 ## 12: 2016-01-07 19:03:00 4 ## 13: 2016-01-08 19:01:00 5 ## 14: 2016-01-03 10:00:00 6 ## 15: 2016-01-01 09:10:00 15 ## 16: 2016-01-09 02:15:00 16
In this result
- Each website session is match to the nearest following payment.
all(session_start_time > purchase_time, na.rm = T)
evaluates toTRUE
. - Mom does not appear because she has no record in the
website
table. - Visitor Vivian’s sessions are not matched to any purchases because she hasn’t purchased anything.
- Neither of Erica’s website sessions are matched to her purchase because it took place before both sessions.
- Frequent Francis’s most recent session isn’t matched to a purchase because she hasn’t made a purchase after that session.
- All of Indecisive Isabel’s sessions are matched to her one purchase. In fact, several purchases appear more than once.
Rolling Windows
What if we wanted to add an additional criteria to the rolling join above: match payments to website sessions, so long as the payment occurred after the beginning of the website session and within 12 hours of the website session?
twelve_hours <- 60*60*20 # 12 hours = 60 sec * 60 min * 12 hours paypal[website, roll = -twelve_hours] ## name purchase_time payment_id join_time ## 1: Error-prone Erica <NA> NA 2016-01-04 19:12:00 ## 2: Error-prone Erica <NA> NA 2016-01-04 21:05:00 ## 3: Frequent Francis <NA> NA 2016-01-02 13:09:00 ## 4: Frequent Francis 2016-01-03 19:28:00 4 2016-01-03 19:22:00 ## 5: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 08:44:00 ## 6: Frequent Francis 2016-01-08 20:33:00 5 2016-01-08 20:22:00 ## 7: Frequent Francis 2016-01-10 17:46:00 6 2016-01-10 17:36:00 ## 8: Frequent Francis <NA> NA 2016-01-15 16:56:00 ## 9: Indecisive Isabel <NA> NA 2016-01-01 11:01:00 ## 10: Indecisive Isabel <NA> NA 2016-01-02 08:59:00 ## 11: Indecisive Isabel <NA> NA 2016-01-05 18:18:00 ## 12: Indecisive Isabel <NA> NA 2016-01-07 19:03:00 ## 13: Indecisive Isabel 2016-01-08 19:10:00 1 2016-01-08 19:01:00 ## 14: Spendy Sally 2016-01-03 10:06:00 2 2016-01-03 10:00:00 ## 15: Visitor Vivian <NA> NA 2016-01-01 09:10:00 ## 16: Visitor Vivian <NA> NA 2016-01-09 02:15:00 ## session_start_time session_id ## 1: 2016-01-04 19:12:00 13 ## 2: 2016-01-04 21:05:00 14 ## 3: 2016-01-02 13:09:00 7 ## 4: 2016-01-03 19:22:00 8 ## 5: 2016-01-08 08:44:00 9 ## 6: 2016-01-08 20:22:00 10 ## 7: 2016-01-10 17:36:00 11 ## 8: 2016-01-15 16:56:00 12 ## 9: 2016-01-01 11:01:00 1 ## 10: 2016-01-02 08:59:00 2 ## 11: 2016-01-05 18:18:00 3 ## 12: 2016-01-07 19:03:00 4 ## 13: 2016-01-08 19:01:00 5 ## 14: 2016-01-03 10:00:00 6 ## 15: 2016-01-01 09:10:00 15 ## 16: 2016-01-09 02:15:00 16
Now Indecisive Isabel’s last session only is associated with a purchase.
The rollends
Argument
Recall the first join from above, matching the preceding website session to each payment.
website[paypal, roll = T] # equivalent to website[paypal, roll = T, rollends = c(F, T)] ## name session_start_time session_id join_time ## 1: Error-prone Erica <NA> NA 2016-01-03 08:02:00 ## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00 ## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00 ## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00 ## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00 ## 6: Mom <NA> NA 2015-12-02 17:58:00 ## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00 ## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00 ## purchase_time payment_id ## 1: 2016-01-03 08:02:00 7 ## 2: 2016-01-03 19:28:00 4 ## 3: 2016-01-08 20:33:00 5 ## 4: 2016-01-10 17:46:00 6 ## 5: 2016-01-08 19:10:00 1 ## 6: 2015-12-02 17:58:00 8 ## 7: 2016-01-03 10:06:00 2 ## 8: 2016-01-03 10:15:00 3
What if we want the rolling join to handle Error-prone Erica’s case differently? Perhaps in cases like hers, where there is a purchase with no preceding session, we prefer the user’s first website session to be matched to the offending purchase. We can use the
rollends
argument for this. From the data.table
documentation (?data.table
),rollends[1]=TRUE will roll the first value backwards if the value is before it website[paypal, roll = T, rollends = c(T, T)] # equivalent to website[paypal, roll = T, rollends = T] ## name session_start_time session_id join_time ## 1: Error-prone Erica 2016-01-04 19:12:00 13 2016-01-03 08:02:00 ## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00 ## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00 ## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00 ## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00 ## 6: Mom <NA> NA 2015-12-02 17:58:00 ## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00 ## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00 ## purchase_time payment_id ## 1: 2016-01-03 08:02:00 7 ## 2: 2016-01-03 19:28:00 4 ## 3: 2016-01-08 20:33:00 5 ## 4: 2016-01-10 17:46:00 6 ## 5: 2016-01-08 19:10:00 1 ## 6: 2015-12-02 17:58:00 8 ## 7: 2016-01-03 10:06:00 2 ## 8: 2016-01-03 10:15:00 3
In this result, Erica’s first session is matched to her purchase, even though the session was after her purchase. Mom’s “purchase” still has no matching session because Mom does not appear in the
website
table. So all(purchase_time > session_start_time, na.rm = T)
no longer evaluates to TRUE
.What if we want to perform the same join as above, but only returning matches for payments with sessions before and after?
website[paypal, roll = T, rollends = c(F, F)] # equivalent to website[paypal, roll = T, rollends = F] ## name session_start_time session_id join_time ## 1: Error-prone Erica <NA> NA 2016-01-03 08:02:00 ## 2: Frequent Francis 2016-01-03 19:22:00 8 2016-01-03 19:28:00 ## 3: Frequent Francis 2016-01-08 20:22:00 10 2016-01-08 20:33:00 ## 4: Frequent Francis 2016-01-10 17:36:00 11 2016-01-10 17:46:00 ## 5: Indecisive Isabel <NA> NA 2016-01-08 19:10:00 ## 6: Mom <NA> NA 2015-12-02 17:58:00 ## 7: Spendy Sally <NA> NA 2016-01-03 10:06:00 ## 8: Spendy Sally <NA> NA 2016-01-03 10:15:00 ## purchase_time payment_id ## 1: 2016-01-03 08:02:00 7 ## 2: 2016-01-03 19:28:00 4 ## 3: 2016-01-08 20:33:00 5 ## 4: 2016-01-10 17:46:00 6 ## 5: 2016-01-08 19:10:00 1 ## 6: 2015-12-02 17:58:00 8 ## 7: 2016-01-03 10:06:00 2 ## 8: 2016-01-03 10:15:00 3
In this result, the purchases of Error-prone Erica and Mom are unmatched because they have no preceding sessions, and Spendy Sally’s two purchases are unmatched because they have no following website session.
Note that when
roll
is set to a negative number, the meaning of the two rollends
elements kind of flip-flops:website[paypal, roll = -Inf, rollends = c(F, T)] # default when roll < 0 is rollends = c(T, F) ## name session_start_time session_id join_time ## 1: Error-prone Erica <NA> NA 2016-01-03 08:02:00 ## 2: Frequent Francis 2016-01-08 08:44:00 9 2016-01-03 19:28:00 ## 3: Frequent Francis 2016-01-10 17:36:00 11 2016-01-08 20:33:00 ## 4: Frequent Francis 2016-01-15 16:56:00 12 2016-01-10 17:46:00 ## 5: Indecisive Isabel 2016-01-08 19:01:00 5 2016-01-08 19:10:00 ## 6: Mom <NA> NA 2015-12-02 17:58:00 ## 7: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:06:00 ## 8: Spendy Sally 2016-01-03 10:00:00 6 2016-01-03 10:15:00 ## purchase_time payment_id ## 1: 2016-01-03 08:02:00 7 ## 2: 2016-01-03 19:28:00 4 ## 3: 2016-01-08 20:33:00 5 ## 4: 2016-01-10 17:46:00 6 ## 5: 2016-01-08 19:10:00 1 ## 6: 2015-12-02 17:58:00 8 ## 7: 2016-01-03 10:06:00 2 ## 8: 2016-01-03 10:15:00 3
In this example,
- Each payment is matched to the nearest following website session (because of
roll = -Inf
). - Error-prone Erica’s purchase is not matched to the following session because there is no previous session (due to
rollends[1] = F
). - Spendy Sally’s purchases are joined to her most recent website session, even though it occurred before her purchases (because of
rollends[2] = T
).
To leave a comment for the author, please follow the link and comment on their blog: bRogramming.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.