Something to note when using the merge function in R

kjytay

5 days ago

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Base R has a merge function which does join operations on data frames. As the documentation says, the function

[merges] two data frames by common columns or row names, or do other versions of database join operations.

One thing that I realized which may not be obvious is that merge can have somewhat unexpected behavior regarding the ordering of rows in the result. Let’s see an example with the mtcars dataset:

data(mtcars)
mtcars$ID <- 1:nrow(mtcars)
head(mtcars)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb ID
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  1
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  2
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  3
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  4
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  5
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  6

Let’s say we want to add a row cyl_word which is the value in the cyl column in words (e.g. 6 -> "Six"). The code below creates a reference table and joins the two tables to get the desired result:

cyl_df <- data.frame(cyl = c(4, 6, 8),
                     cyl_word = c("four", "six", "eight"))

joined_df <- merge(mtcars, cyl_df, by = "cyl")
head(joined_df)
#   cyl  mpg  disp hp drat    wt  qsec vs am gear carb ID cyl_word
# 1   4 22.8 140.8 95 3.92 3.150 22.90  1  0    4    2  9     four
# 2   4 22.8 108.0 93 3.85 2.320 18.61  1  1    4    1  3     four
# 3   4 24.4 146.7 62 3.69 3.190 20.00  1  0    4    2  8     four
# 4   4 21.5 120.1 97 3.70 2.465 20.01  1  0    3    1 21     four
# 5   4 30.4  75.7 52 4.93 1.615 18.52  1  1    4    2 19     four
# 6   4 33.9  71.1 65 4.22 1.835 19.90  1  1    4    1 20     four

Look at the ID column: the rows are not returned in the same order! Reading the documentation tells us that merge has an argument sort which has default value TRUE, meaning that the results are sorted by the values in the columns that we merged on (cyl in this case). That is why all the cyl == 4 rows appear first in the return value.

(Notice also that the row names have disappeared: this may not be something you want!)

One might think that if we set sort = FALSE, the output would have the same row ordering as the input. Unexpectedly (at least to me), this is not the case:

joined_df <- merge(mtcars, cyl_df, by = "cyl", sort = FALSE)
head(joined_df)
#   cyl  mpg  disp  hp drat    wt  qsec vs am gear carb ID cyl_word
# 1   6 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4  1      six
# 2   6 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4  2      six
# 3   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4 11      six
# 4   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1  4      six
# 5   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1  6      six
# 6   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4 10      six

The documentation says as much: under the “Value” section, it writes that (emphasis mine)

The rows are by default lexicographically sorted on the common columns, but for sort = FALSE are in an unspecified order.

How can we get the output back in the original row order? One way is to add an ID column like we did above, then sort the result by that column:

joined_df <- joined_df[order(joined_df$ID), ]
head(joined_df)
#    cyl  mpg disp  hp drat    wt  qsec vs am gear carb ID cyl_word
# 1    6 21.0  160 110 3.90 2.620 16.46  0  1    4    4  1      six
# 2    6 21.0  160 110 3.90 2.875 17.02  0  1    4    4  2      six
# 9    4 22.8  108  93 3.85 2.320 18.61  1  1    4    1  3     four
# 4    6 21.4  258 110 3.08 3.215 19.44  1  0    3    1  4      six
# 19   8 18.7  360 175 3.15 3.440 17.02  0  0    3    2  5    eight
# 5    6 18.1  225 105 2.76 3.460 20.22  1  0    3    1  6      six

testthat::expect_equal(joined_df$ID, 1:nrow(joined_df))  # test passes

Alternatively, use functions in the dplyr package:

library(dplyr)
joined_df <- left_join(mtcars, cyl_df, by = "cyl")
head(joined_df)
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb ID cyl_word
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  1      six
# 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  2      six
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  3     four
# 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  4      six
# 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  5    eight
# 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  6      six

testthat::expect_equal(joined_df$ID, 1:nrow(joined_df))  # test passes

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.