Merging Multiple Data Frames in R
[This article was first published on Econometric Sense, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Earlier I had a problem that required merging 3 years of trade data, with about 12 csv files per year. Merging all of these data sets with pairwise left joins using the R merge statement worked (especially after correcting some errors pointed out by Hadley Wickham However, in both my hobby hacking and on the job, I was curious if there might be a better way to do this than countless sets of merge statements (not to mention the multiple lines of code required for reading in the csv files)Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
So, I sent a tweet to the #rstats followers with a link to where I posted the problem on this blog to see if I could get a hint. (twitter has been a very valuable networking tool for me, I’ve learned a lot about data mining, machine learning, and R from the tweet-stream. Tweets and blog posts from people like Hadley Wickham, Drew Conway, and J.D. Long have been tremendously helpful to me as I’ve taken up R.
Back to the topic at hand, below is my new code, based on suggestions from Hadley Wickham and a comment (from Harlan) that lead me to some answers to a similar question on stack overflow. The code below requires he reshape library as well as plyr, which I should mention appears to have been created by Hadley Wickham himself.
# read all Y2000 csv files filenames <- list.files(path="/Users/wkuuser/Desktop/R Data Sets/TRADE_DATA/TempData00", full.names=TRUE) import.list <- llply(filenames, read.csv) # left join all Y2000 csv files AllData00 <- Reduce(function(x, y) merge(x, y, all=FALSE,by.x="Reporting.Countries.Partner.Countries",by.y="Reporting.Countries.Partner.Countries",all.x =TRUE, all.y =FALSE),import.list,accumulate=F) dim(AllData00) # n = 211 211 # rename common key variable to something less awkward and change World to World00 AllData00 <- rename(AllData00, c(Reporting.Countries.Partner.Countries="Partner", World = "World00")) names(AllData00) # list all variable names # keep only the partner name variable and total world trade AllData00 <-AllData00[c("Partner","World00")] dim(AllData00) # data dimensions names(AllData00) # variable names fix(AllData00) # view in data editor
That pretty well gives me the data set I need, for year 2000 data. I repeated the process for 2004 and 2008 data sets I had and then merged them with left joins to get the final data set. All I am after at this point is the total world trade for each of the countries/groups listed. This could probably be made even more efficient, but is is a lot less coding than what I initially used for the project. (see below- and this doesn't even include some of the renaming and sub-setting functions I performed above) And, this process would have to be repeated 2 more times for 2004 and 2008 data. To say that the above code is much more efficient is an understatement. (note the code below actually contains some mistakes as Hadley Wickham pointed out. For instance, in the merge statement, I have by.a and by.b, or by.'dataset', while in every case it should be by.x and by.y. I guess x and y are alias's for the data sets being merged, sort of like a and b would be in SQL, if you were to say:
create table newdataset as
select *
from dat1 a left join dat2 b
on a.partner=b.partner
So, I'm not sure why my code even worked to begin with. I do realize that instead of the merge statement in R I could have used the sqldf package in R, but I have had issues with my mac crashing when I try to load the library. Still, I don't think SQL would have made things any better, as I would still be doing a series of left joins vs. the more compact code using the reduce function in R. I've used sqldf in a windows environment before and it worked great by the way.
The code below first reads in each data file individually, and then executes the endless number of left joins to give me the same data set I got above with a fraction of the amount of required code.
# a a <- read.csv("X_A.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(a) a <- rename(a, c(Reporting.Countries.Partner.Countries="Partner")) names(a) dim(a) # b b <- read.csv("X_B.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(b) b <- rename(b, c(Reporting.Countries.Partner.Countries="Partner")) names(b) dim(b) # c c <- read.csv("X_C.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(c) c <- rename(c, c(Reporting.Countries.Partner.Countries="Partner")) names(c) dim(c) # de de <- read.csv("X_DE.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(de) de <- rename(de, c(Reporting.Countries.Partner.Countries="Partner")) names(de) dim(de) # fgh fgh <- read.csv("X_FGH.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(fgh) fgh <- rename(fgh, c(Reporting.Countries.Partner.Countries="Partner")) names(fgh) dim(fgh) # ijk ijk <- read.csv("X_IJK.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(c) ijk <- rename(ijk, c(Reporting.Countries.Partner.Countries="Partner")) names(ijk) dim(ijk) # lm lm <- read.csv("X_LM.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(lm) lm <- rename(lm, c(Reporting.Countries.Partner.Countries="Partner")) names(lm) dim(lm) # nop nop <- read.csv("X_NOP.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(nop) nop <- rename(nop, c(Reporting.Countries.Partner.Countries="Partner")) names(nop) dim(nop) # qr qr <- read.csv("X_QR.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(qr) qr <- rename(qr, c(Reporting.Countries.Partner.Countries="Partner")) names(qr) dim(qr) # s odd name changed to 'SaloTomaPrincip' manaully in excel s <- read.csv("X_S.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(s) s <- rename(s, c(Reporting.Countries.Partner.Countries="Partner")) names(s) dim(s) # tuv tuv <- read.csv("EX_TUV.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(tuv) tuv <- rename(tuv, c(Reporting.Countries.Partner.Countries="Partner")) names(tuv) dim(tuv) # wxyz wxyz <- read.csv("X_WXYZ.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8") names(wxyz) wxyz <- rename(wxyz, c(Reporting.Countries.Partner.Countries="Partner")) names(wxyz) dim(wxyz) # ------------------------------------------------------------------ # sequentially left join data sets # ------------------------------------------------------------------ # a & b ab <- merge(a,b, by.a = Partner, by.b =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(ab) names(ab) 14 + 18 - 1 # r = 211 c = 31 # abc abc <- merge(ab,c, by.ab = Partner, by.c =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(abc) names(abc) 31 + 24 -1 # n = 54 # a_e a_e <- merge(abc,de, by.abc = Partner, by.de =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(a_e) names(a_e) 54 + 18 -1 # n = 71 # a_h a_h <- merge(a_e,fgh, by.a_e = Partner, by.fgh =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(a_h) names(a_h) 71 + 23 -1 # n = 93 # a_k a_k <- merge(a_h,ijk, by.a_h = Partner, by.ijk =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(a_k) names(a_k) 93 + 16 -1 # n = 108 # a_m a_m <- merge(a_k,lm, by.a_k = Partner, by.lm =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(a_m) names(a_m) 108 + 26 - 1 # n = 133 # a_p a_p <- merge(a_m,nop, by.a_m = Partner, by.nop =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(a_p) names(a_p) 133 + 20 - 1 # n = 152 # a_r a_r <- merge(a_p,qr, by.a_p = Partner, by.qr =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(a_r) names(a_r) 152 + 6 -1 # n = 157 # a_s a_s <- merge(a_r,s, by.a_r = Partner, by.s =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(a_s) names(a_s) 157 + 27 - 1 # n = 183 # a_v a_v <- merge(a_s,tuv, by.a_s = Partner, by.tuv =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(a_v) names(a_v) 183 + 21 - 1 # n = 203 # a_z (complete data set after this merge) a_z <- merge(a_v,wxyz, by.a_v = Partner, by.wxyz =Partner, all = FALSE, all.x = TRUE, all.y = FALSE) dim(a_z) # n = 211 names(a_z)
To leave a comment for the author, please follow the link and comment on their blog: Econometric Sense.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.