Something to note when using the merge function in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Base R has a merge
function which does join operations on data frames. As the documentation says, the function
[merges] two data frames by common columns or row names, or do other versions of database join operations.
One thing that I realized which may not be obvious is that merge
can have somewhat unexpected behavior regarding the ordering of rows in the result. Let’s see an example with the mtcars
dataset:
data(mtcars) mtcars$ID <- 1:nrow(mtcars) head(mtcars) # mpg cyl disp hp drat wt qsec vs am gear carb ID # Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1 # Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2 # Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3 # Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4 # Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5 # Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 6
Let’s say we want to add a row cyl_word
which is the value in the cyl
column in words (e.g. 6 -> "Six"
). The code below creates a reference table and joins the two tables to get the desired result:
cyl_df <- data.frame(cyl = c(4, 6, 8), cyl_word = c("four", "six", "eight")) joined_df <- merge(mtcars, cyl_df, by = "cyl") head(joined_df) # cyl mpg disp hp drat wt qsec vs am gear carb ID cyl_word # 1 4 22.8 140.8 95 3.92 3.150 22.90 1 0 4 2 9 four # 2 4 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1 3 four # 3 4 24.4 146.7 62 3.69 3.190 20.00 1 0 4 2 8 four # 4 4 21.5 120.1 97 3.70 2.465 20.01 1 0 3 1 21 four # 5 4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2 19 four # 6 4 33.9 71.1 65 4.22 1.835 19.90 1 1 4 1 20 four
Look at the ID
column: the rows are not returned in the same order! Reading the documentation tells us that merge
has an argument sort
which has default value TRUE
, meaning that the results are sorted by the values in the columns that we merged on (cyl
in this case). That is why all the cyl == 4
rows appear first in the return value.
(Notice also that the row names have disappeared: this may not be something you want!)
One might think that if we set sort = FALSE
, the output would have the same row ordering as the input. Unexpectedly (at least to me), this is not the case:
joined_df <- merge(mtcars, cyl_df, by = "cyl", sort = FALSE) head(joined_df) # cyl mpg disp hp drat wt qsec vs am gear carb ID cyl_word # 1 6 21.0 160.0 110 3.90 2.620 16.46 0 1 4 4 1 six # 2 6 21.0 160.0 110 3.90 2.875 17.02 0 1 4 4 2 six # 3 6 17.8 167.6 123 3.92 3.440 18.90 1 0 4 4 11 six # 4 6 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1 4 six # 5 6 18.1 225.0 105 2.76 3.460 20.22 1 0 3 1 6 six # 6 6 19.2 167.6 123 3.92 3.440 18.30 1 0 4 4 10 six
The documentation says as much: under the “Value” section, it writes that (emphasis mine)
The rows are by default lexicographically sorted on the common columns, but for
sort = FALSE
are in an unspecified order.
How can we get the output back in the original row order? One way is to add an ID
column like we did above, then sort the result by that column:
joined_df <- joined_df[order(joined_df$ID), ] head(joined_df) # cyl mpg disp hp drat wt qsec vs am gear carb ID cyl_word # 1 6 21.0 160 110 3.90 2.620 16.46 0 1 4 4 1 six # 2 6 21.0 160 110 3.90 2.875 17.02 0 1 4 4 2 six # 9 4 22.8 108 93 3.85 2.320 18.61 1 1 4 1 3 four # 4 6 21.4 258 110 3.08 3.215 19.44 1 0 3 1 4 six # 19 8 18.7 360 175 3.15 3.440 17.02 0 0 3 2 5 eight # 5 6 18.1 225 105 2.76 3.460 20.22 1 0 3 1 6 six testthat::expect_equal(joined_df$ID, 1:nrow(joined_df)) # test passes
Alternatively, use functions in the dplyr
package:
library(dplyr) joined_df <- left_join(mtcars, cyl_df, by = "cyl") head(joined_df) # mpg cyl disp hp drat wt qsec vs am gear carb ID cyl_word # 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1 six # 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2 six # 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3 four # 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4 six # 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5 eight # 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 6 six testthat::expect_equal(joined_df$ID, 1:nrow(joined_df)) # test passes
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.