non-equi joins in data.table
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have been toying with some of the advent of code challenges (I am way behind though!).
For day 5, I had to create a function, and I’m writing this up, because it’s an example of a non-equi join between two tables.
In this particular sitation, there are are no common columns between the two tables, so my usual data.table
hack of copying the columns of interest, renaming themjoin_col
, and then keying them both does not work.
Here’s the function:
find_matches <- function(input_dt, lookup_dt) { res <- lookup_dt[input_dt, .(V1 = i.V1, dest_start = x.dest_start, source_start = x.source_start, source_end = x.source_end), on = .(source_start <= V1, source_end >= V1) ][, dest := fcase(is.na(source_start), V1, !is.na(source_start),V1 - source_start + dest_start) ][,.(V1 = dest)] return(res) }
I want to join the main table, input_dt
with the lookup_dt
.
Because data.table uses right joins, and removes the joining column from the main table (unless you tell it otherwise), I am being very specific.
This bit is where I specify the column names I want to return, and which table they come from.
i.col_name
is the main / large/ right hand side table
x.col_name
is the smaller/ left hand side table
Here I’m saying to keep V1
from the main table, and dest_start
, source_start
and source_end
from the smaller table:
.(V1 = i.V1, dest_start = x.dest_start, source_start = x.source_start, source_end = x.source_end),
Here is the non-equi join bit.
on = .(source_start <= V1, source_end >= V1)
The tables should join when V1 is between the source start and source end.
There will be rows where these conditions are not met, these will return NA
.
This section of code deals with those – any NA’s are replaced with the original value and no further calculation is required.
[, dest := fcase(is.na(source_start), V1, !is.na(source_start),V1 - source_start + dest_start) ]
The rest of the code returns the vector of interest.
The main takeway from this is to understand:
- data.table uses right joins by default
- the join column in the main table will be removed from the resulting joined table, unless you tell it otherwise
- Use
i.col_name
to specify columns from the main table - Use
x.col_name
to specify columns from the smaller / lookup table - you can use conditions in your
on
syntax to create equi or non equi joins
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.