[This article was first published on Mollie's Research Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Often, you’ll need to merge two data frames based on multiple variables. For this example, we’ll use the common case of needing to merge by city and state.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
First, you need to read in both your data sets:
< !-- HTML generated using hilite.me -->
# import city coordinate data: coords <- read.csv("cities-coords.csv", header = TRUE, sep = ",") # import population data: data <- read.csv("cities-data.csv", header = TRUE, sep = ",")
Next comes the merge. You can use by.x and by.y to declare which variables the merge will be based on. If the variables have exactly the same name in both data sets, you can use by instead of by.x and by.y.
x and y represent the two data sets you are merging, in that order.
You also want to state whether you want to include all data from either data set, using all or all.x and all.y. In this case, we want to make sure we hold onto all our city data, even data for the cities we do not have coordinates for.
< !-- HTML generated using hilite.me -->
# merge data & coords by city & state: dataCoords <- merge(coords, data, by.x = c("City", "State"), by.y = c("city", "state"), all.x = FALSE, all.y = TRUE)
Running that code shows what we would expect. Houston is included in the final data set even though there are no coordinates for it, while Dallas is not included since it has coordinates but no data:
< !-- HTML generated using hilite.me -->
City State Latitude Longitude year population 1 Chicago IL 41.85003 -87.65005 2012 2714856 2 Columbus GA 32.46098 -84.98771 2012 198413 3 Columbus OH 39.96118 -82.99879 2012 809798 4 Columbus OH 39.96118 -82.99879 2010 787033 5 Los Angeles CA 34.05223 -118.24368 2012 3857799 6 New York NY 40.71427 -74.00597 2012 8336697 7 New York NY 40.71427 -74.00597 2010 8175133 8 San Francisco CA 37.77823 -122.44250 2012 825863 9 San Francisco CA 37.77823 -122.44250 2010 805235 10 Houston TX NA NA 2012 2160821
Bonus
If you’d like to get a list of which cases got merged in but lack coordinate data, there’s a simple line of code to do that:< !-- HTML generated using hilite.me -->
> dataCoords[!complete.cases(dataCoords[,c(3,4)]),] City State Latitude Longitude year population 10 Houston TX NA NA 2012 2160821
Also, you might want to tidy up the names of your variables, if they followed different conventions in their respective initial data sets:
< !-- HTML generated using hilite.me -->
> names(dataCoords) <- c("City", "State", "Latitude", "Longitude", "Year", "Population") > dataCoords City State Latitude Longitude Year Population 1 Chicago IL 41.85003 -87.65005 2012 2714856 2 Columbus GA 32.46098 -84.98771 2012 198413 3 Columbus OH 39.96118 -82.99879 2012 809798 4 Columbus OH 39.96118 -82.99879 2010 787033 5 Los Angeles CA 34.05223 -118.24368 2012 3857799 6 New York NY 40.71427 -74.00597 2012 8336697 7 New York NY 40.71427 -74.00597 2010 8175133 8 San Francisco CA 37.77823 -122.44250 2012 825863 9 San Francisco CA 37.77823 -122.44250 2010 805235 10 Houston TX NA NA 2012 2160821
The full sample code is available as a gist.
References
gist
Change R code tabs to 2 spaces.
bold package names
italicize functions
http://hilite.me/
manni
for R posts, use: sessionInfo()< !----->< !----->
To leave a comment for the author, please follow the link and comment on their blog: Mollie's Research Blog.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.