Site icon R-bloggers

The differences of left join in SQL and R

[This article was first published on Aster Hu's Blog | Asteroid, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently, I encountered a situation where I needed to translate an Access SQL query to R, and I noticed the contrasting behaviors of these two languages when it comes to handling NA/NULL values in left joins.

< section id="the-impact-of-nanull-values-on-joins" class="level2">

The impact of NA/NULL values on joins

When performing a join operation between two tables, it is essential to consider the presence of NA/NULL values. Let’s take a left join as an example, where all records from the left table are retained and are attempted to be matched with corresponding rows in the right table. However, the presence of NA/NULL values raises questions about whether they should be matched with any corresponding values in the right table. As a result, this can lead to unexpected results due to the uncertainty inherent in the logic of the chosen tool.

< section id="sqls-approach-to-handle-null-in-joins" class="level2">

SQL’s approach to handle NULL in joins

Let’s look at a simple example by creating two tables named fname and lname.

< details> < summary> Click to expand the code
CREATE TABLE
  fname (id int, firstname varchar(50));

INSERT INTO fname
VALUES
  (1, 'Ada'),
  (2, 'Bob'),
  (NULL, 'Unknown');

CREATE TABLE
  lname (id int, lastname varchar(50));

INSERT INTO lname
VALUES
  (1, 'Smith'),
  (NULL, 'To be decided');
# fname
    id  firstname
1    1        Ada
2    2        Bob
3 NULL    Unknown

# lname
    id      lastname
1    1         Smith
2 NULL To be decided

In the fname table, the 3rd observation contains a NULL value in id, while still having a placeholder value of “Unknown” in the firstname column. The same situation also occurs in the lname table, where the 2nd observation has a NULL value in id and it’s being labeled as “To be decided” in the lastname column.

It may seem unusual, but this type of data is not uncommom in business settings, where it often occurs due to low-quality data or when businesses require a placeholder for empty observations that can be analyzed later.

If we apply a LEFT JOIN in SQL, the code would be:

SELECT *
FROM fname
  LEFT JOIN lname ON fname.id = lname.id;

And this is the outcome we will get.

    id firstname   id lastname
1    1       Ada    1    Smith
2    2       Bob NULL     NULL
3 NULL   Unknown NULL     NULL

As we can see, the placeholder value “To be decided” in the lname table has disappeared. This occurs because in SQL, NULL values in tables or views being joined never match each other1, resulting in SQL not returning a matched result.

The question is, is this the expected outcome? Based on above result, it’s impossible to tell whether a match couldn’t be found, or if it represent a NULL value. Besides, what if we want to match these two NULL values and have “To be decided” correspond with “Unknown”? These questions remain until we have a clear mind of how we wish to handle NULL values.

< section id="how-r-handles-na-in-joins" class="level2">

How R handles NA in joins

More precisely, we are referring to dplyr::left_join. Let’s create the same tables in R and perform a left join.

library(dplyr)

fname <- tibble(
  id = c(1:2, NA_character_),
  firstname = c("Ada", "Bob", "Unknown")
)

lname <- tibble(
  id = c(1, NA_character_),
  lastname = c("Smith", "To be decided")
)

left_join(fname, lname, by = "id")
# A tibble: 3 × 3
  id    firstname lastname     
  <chr> <chr>     <chr>        
1 1     Ada       Smith        
2 2     Bob       <NA>         
3 <NA>  Unknown   To be decided
Note

If you see a warning message saying “Each row in x is expected to match at most 1 row in y”, this is a result of the multiple matches warning in dplyr 1.1.0. Tidyverse has modified the behaviour of multiple matches warning since dplyr 1.1.1, significantly reducing the number of warnings. For more information, please refer to the release note of dplyr 1.1.1

From the above table, it actually returned a matched result “To be decided” for the NA value! This is because in R, by default two NA or NaN values are considered as equal, like %in%, match(), and merge().2

The good news is that we can modify this behaviour in R by utilizing the na_matches argument. When we set na_matches = "never", it will behave the same way as in SQL.

left_join(fname, lname, by = "id", na_matches = "never")
# A tibble: 3 × 3
  id    firstname lastname
  <chr> <chr>     <chr>   
1 1     Ada       Smith   
2 2     Bob       <NA>    
3 <NA>  Unknown   <NA>    
< section id="what-if-there-are-multiple-nas" class="level3">

What if there are multiple NAs?

Generally speaking, having multiple NAs in key variable in joins is not considered a best practice and it is usually preferable to omit them. However, we may come across such situations from time to time.

Assuming we have another NA in the lname table:

lname <- data.frame(id = c(1, NA_character_, NA_character_),
                     lastname = c("Smith", "To be decided", "Pending"))
lname
    id      lastname
1    1         Smith
2 <NA> To be decided
3 <NA>       Pending

What would happen if we join the two tables again?

left_join(fname, lname, by = "id")
# A tibble: 4 × 3
  id    firstname lastname     
  <chr> <chr>     <chr>        
1 1     Ada       Smith        
2 2     Bob       <NA>         
3 <NA>  Unknown   To be decided
4 <NA>  Unknown   Pending      

Since R treats two NAs as equal, similar to handling duplicates in key variables, it will return all the information from the right table due to multiple matches.

< section id="conclusion-and-takeaway" class="level2">

Conclusion and takeaway

In SQL, NULL values in tables being joined are not considered as equal. When both table have NA/NULL value in the key variable, SQL treats them as unmatched, which differs from the default behaviour of dplyr::left_join, where NA is treated as a a match. When working with both SQL and R for data analysis, it is crucial to be mindful of the divergent handling of NULL values and adjust the join behaviour accordingly in order to deliver consistent outcomes.

< section id="footnotes" class="footnotes footnotes-end-of-document">

Footnotes

  1. SyBooks Online. (2011). Infocenter.sybase.com. https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc32300.1570/html/sqlug/X78356.htm↩︎

  2. Mutating joins — mutate-joins. (n.d.). Dplyr.tidyverse.org. https://dplyr.tidyverse.org/reference/mutate-joins.html↩︎

To leave a comment for the author, please follow the link and comment on their blog: Aster Hu's Blog | Asteroid.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version