R: Combining vectors or data frames of unequal length into one data frame

markheckmann

13 years ago

[This article was first published on "R" you ready?, and kindly contributed to R-bloggers].

Today I will treat a problem I encounter every once in a while. Let’s suppose we have several data frames or vectors of unequal length but with partly matching column names, just like the following ones:

df1 <- data.frame(Intercept = .4, x1=.4, x2=.2, x3=.7)
df2 <- data.frame(Intercept = .5,        x2=.8       )

This may occur, for example, when fitting several multiple regression models, each with a different combination of regressors. Now I would like to combine the results into one data frame. Neither rbind() nor merge() helps here: rbind() requires the data frames to have the same columns, and merge() performs a join on the shared columns rather than simply stacking the rows.
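To see the problem concretely, here is a minimal sketch that calls rbind() on the two data frames from above and captures the error instead of stopping. The variable name msg is only illustrative.

```r
df1 <- data.frame(Intercept = .4, x1 = .4, x2 = .2, x3 = .7)
df2 <- data.frame(Intercept = .5, x2 = .8)

# rbind() on data frames with a different number of columns raises an error;
# tryCatch() returns the error message instead of aborting the script
msg <- tryCatch(
  {
    rbind(df1, df2)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```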

I posted this matter on r-help, as my first solution was somewhat awkward and could not be generalized to arbitrary data frames or lists of data frames. The first solution was posted by Charles C. Berry. Here, myList is a list containing the data frames as elements:

myList <- list(df1, df2)

He uses a nested loop. The outer loop runs over the data frames in the list, the inner loop over the column names of each data frame. For each column name j, the corresponding entry of the i-th data frame ( myList[[i]] ) is written into row i and column j of an initially empty data frame (dat). A column with that name is created in dat if it does not exist yet. Cells that are never assigned are automatically set to NA.

dat <- data.frame()
for (i in seq_along(myList)) {
  for (j in names(myList[[i]])) {
    # row i, column j; new rows and columns are created on the fly
    dat[i, j] <- myList[[i]][j]
  }
}
dat
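The loop can also be wrapped in a small function so that it can be reused for any list of data frames. This is only a sketch: the function name combine_by_name and its argument dfList are hypothetical, and it assumes one-row data frames like the ones in this post.

```r
# Hypothetical helper wrapping the nested loop (name is an assumption)
combine_by_name <- function(dfList) {
  out <- data.frame()
  for (i in seq_along(dfList)) {
    for (j in names(dfList[[i]])) {
      # assigning to a row or column that does not exist yet creates it;
      # cells that are never assigned stay NA
      out[i, j] <- dfList[[i]][j]
    }
  }
  out
}

df1 <- data.frame(Intercept = .4, x1 = .4, x2 = .2, x3 = .7)
df2 <- data.frame(Intercept = .5, x2 = .8)

res <- combine_by_name(list(df1, df2))
res
```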

Note that the order of the output columns depends on the order of the list elements. The list below contains the same data frames in reverse order and therefore yields a different column order.

myList <- list(df2, df1)

  Intercept  x2  x1  x3
1       0.5 0.8  NA  NA
2       0.4 0.2 0.4 0.7

Another solution was posted by Henrique Dallazuanna. This one has the advantage that it does not use explicit loops.

l <- myList
do.call(rbind, lapply(lapply(l, unlist), "[",
        unique(unlist(c(sapply(l,names))))))

It looks a bit scary at first, so let’s examine it starting from the inside.

# a list of names from each list element
c(sapply(l,names))

# unlist them and find unique names
unique(unlist(c(sapply(l,names))))

# gives unlisted vectors with column names for each list element
lapply(l, unlist)

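As a next step, each of these named vectors is indexed by the full set of unique column names. Indexing a named vector by a name it does not contain returns NA, so the columns that are not present in a vector are filled with NA values.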

listOfVectors <- lapply(lapply(l, unlist), "[",
                        unique(unlist(c(sapply(l,names)))))
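The padding step relies on how R indexes named vectors. The following small sketch isolates that behaviour; the names v, allNames and padded are only illustrative.

```r
# a named vector that lacks x1 and x3
v <- c(Intercept = 0.5, x2 = 0.8)
allNames <- c("Intercept", "x1", "x2", "x3")

# indexing by name: names that are not present yield NA values,
# and the names of those elements are NA as well
padded <- v[allNames]
padded
```

The NA names of the padded elements are the reason why the column names have to be repaired later on.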

As a last step, the vectors, which now all have the same length and column order, are combined row-wise.

do.call(rbind, listOfVectors)
# or in full
DF <- do.call(rbind, lapply(lapply(l, unlist), "[",
              unique(unlist(c(sapply(l,names))))))

The only small flaw in this approach is that rbind() takes the column names from the first vector, and that the result is a matrix rather than a data frame. Using the second list from above gives the following:

l <- list(df2, df1) 
     Intercept  x2 <NA> <NA>
[1,]       0.5 0.8   NA   NA
[2,]       0.4 0.2  0.4  0.7

Thus, in a last step we need to convert the matrix to a data frame and set its column names.

DF <- as.data.frame(DF)
names(DF) <- unique(unlist(c(sapply(l,names))))
DF
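For reuse, the individual steps can be collected in one function. The following is a sketch rather than a definitive implementation: the name rbind_fill_base is hypothetical, and the function assumes that every list element is a one-row data frame with numeric columns, as in the examples of this post.

```r
# Hypothetical helper (name is an assumption); expects one-row,
# all-numeric data frames
rbind_fill_base <- function(dfList) {
  # union of all column names, in order of first appearance
  allNames <- unique(unlist(lapply(dfList, names)))
  # turn each data frame into a named vector and pad it with NA
  rows <- lapply(dfList, function(d) unlist(d)[allNames])
  m <- do.call(rbind, rows)
  # set the column names on the matrix before converting it
  colnames(m) <- allNames
  as.data.frame(m)
}

df1 <- data.frame(Intercept = .4, x1 = .4, x2 = .2, x3 = .7)
df2 <- data.frame(Intercept = .5, x2 = .8)

res <- rbind_fill_base(list(df2, df1))
res
```

The sketch uses lapply(dfList, names) instead of sapply(), because lapply() always returns a list, even when all data frames happen to have the same number of columns.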

Well, this works, but it would be much more convenient to get this done with one single function. And since October 2008 there is one: rbind.fill() from the plyr package written by Hadley Wickham. So the solution is as easy as:

library(plyr)
l <- myList
do.call(rbind.fill, l)

# another example

l <- list(data.frame(a=1, b=2), data.frame(a=2, c=3, d=5))
do.call(rbind.fill, l)

The result for the first example, with myList <- list(df1, df2):

  Intercept  x1  x2  x3
1       0.4 0.4 0.2 0.7
2       0.5  NA 0.8  NA

Now, this is nice! It is really worthwhile to have a look at Hadley Wickham's plyr package, as it provides many functions that make life a lot easier when it comes to splitting lists or data frames, optionally doing a calculation on the pieces, and merging them again afterwards. More on that another day.

Cheers, Mark