Joining data frames in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to join two R data frames on a common key? Here’s one way do a SQL database style join operation in R.
We start with a data frame describing probes on a microarray. The key is the probe_id and the rest of the information describes the location on the genome targeted by that probe.
> head(probes) probe_id sequence strand start end 1 mm_ex_fwd_000541 Chr + 1192448 1192507 2 mm_ex_fwd_000542 Chr + 1192453 1192512 3 mm_ex_fwd_000543 Chr + 1192458 1192517 4 mm_ex_fwd_000544 Chr + 1192463 1192522 5 mm_ex_fwd_000545 Chr + 1192468 1192527 6 mm_ex_fwd_000546 Chr + 1192473 1192532 > dim(probes) [1] 241019 5
We also have a bunch of measurements in a numeric vector. For each probe (well, a few probes missing due to bad data) we have a value.
> head(value) mm_fwd_000002 mm_fwd_000003 mm_fwd_000004 mm_fwd_000005 mm_fwd_000006 mm_fwd_000007 0.05294899 0.11979251 0.28160017 0.57284569 0.74402510 0.78644199 > length(value) [1] 241007
Let’s join up these tables, er data frame and vector. We’ll use the match function. Match returns a vector of positions of the (first) matches of its first argument in its second (or NA if there is no match). So, we’re matching our values into our probes.
> joined = cbind(probes[match(names(value), probes$probe_id),], value) > dim(joined) [1] 241007 6 > head(joined) probe_id sequence strand start end value 3695 mm_fwd_000002 Chr + 15 74 0.05294899 3696 mm_fwd_000003 Chr + 29 88 0.11979251 3697 mm_fwd_000004 Chr + 43 102 0.28160017 3698 mm_fwd_000005 Chr + 57 116 0.57284569 3699 mm_fwd_000006 Chr + 71 130 0.74402510 3700 mm_fwd_000007 Chr + 85 144 0.78644199
Merge is probably more similar to a database join.
Inner join | merge(df1, df2, by=”common_key_column”) |
Outer join | merge(df1, df2, by=”common_key_column”, all=TRUE) |
Left outer | merge(df1, df2, by=”common_key_column”, all.x=TRUE) |
Right outer | merge(df1, df2, by=”common_key_column”, all.y=TRUE) |
If we have two data frames, we can use merge. Let’s convert our vector tp to a data frame and merge, getting the same result (in a different sort order).
> tp.df = data.frame(probe_id=names(tp), value=tp) > head(tp.df) probe_id value mm_fwd_000002 mm_fwd_000002 0.05294899 mm_fwd_000003 mm_fwd_000003 0.11979251 mm_fwd_000004 mm_fwd_000004 0.28160017 mm_fwd_000005 mm_fwd_000005 0.57284569 mm_fwd_000006 mm_fwd_000006 0.74402510 mm_fwd_000007 mm_fwd_000007 0.78644199 > m = merge(probes, tp.df, by="probe_id") > dim(m) [1] 241007 6 > head(mmm) probe_id sequence strand start end value 1 mm_ex_fwd_000541 Chr + 1192448 1192507 0.1354668 2 mm_ex_fwd_000542 Chr + 1192453 1192512 0.1942794 3 mm_ex_fwd_000543 Chr + 1192458 1192517 0.1924457 4 mm_ex_fwd_000544 Chr + 1192463 1192522 0.2526351 5 mm_ex_fwd_000545 Chr + 1192468 1192527 0.1922655 6 mm_ex_fwd_000546 Chr + 1192473 1192532 0.2610747
There’s a good discussion of merge on Stack Overflow, which includes right, left, inner and outer joins. Also the R wiki covers both match and merge. See also, the prior entry on select operations on R data frames.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.