An efficient way to do dataset intersection
[This article was first published on One Tip Per Day, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The main message is to use “match” to get index of needed rows and then get the rows by the index, instead of using the row names to select, which is much slower. Here is example:Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In example above, we know that the same values of column 2nd have same values of columns from 4th to the end. So, instead of doing unique on whole matrix, getting the unique of column 2nd and then getting the index of unique ones by match. Match(a,b) only return the index of first occurrence of a in b. For example
This tips also help in intersecting two big dataframes. For example,
To leave a comment for the author, please follow the link and comment on their blog: One Tip Per Day.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.