Aggregating and sorting data from different files is a typical task for anyone who works with data. It is not uncommon to gather information from many different sources before performing some kind of statistical analysis; in fact, collecting and preprocessing data is often the step that takes up most of the time and effort.
Say you have gathered data on two variables, x and y, from three different sources and you are left with three files. Suppose, for instance, that each file has the following structure:
x1,y1
x2,y2
...
xn,yn
Now let’s say you would like to merge all the x and y datapoints in two dataframes while at the same time keeping a reference to remember the source they came from, such as follows
$x
df_1,df_2,df_3
x1_1,x2_1,x3_1
...
x1_n,x2_n,x3_n
and the same for y. As you can see, the aim of the sorting strategy is to group all the values of each variable in a specific dataframe and keep a reference that reminds you of the origin of each column of samples. The final output should be a single R object containing both the x and y dataframes. The following picture pretty much sums up the sorting process just described
Note that this task generalizes easily to the case of j different sources of data and k different variables.
How can you perform this sorting task in a fast and concise way in R?
Loading the data to be sorted in R
First of all, you need to load the data. You can either use read.csv() or read.table() to load the data according to the source file format. A quick way to load a collection of dataframes into a list is the following:
path <- list("file1", "file2", "file3")
df_list_in <- lapply(path, read.csv)
lapply() lets you quickly call read.csv() on each file and returns a list with the output of each call; in this case, a list of dataframes. By running the code above, you should obtain a list structured like the one below, which I am going to use as a toy example.
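One detail worth flagging: lapply() over an unnamed list returns a list with NULL names, while the sorting functions presented in this post rely on the element names to label the output columns. A quick sketch (the file names here are hypothetical) of how to carry names through the loading step:

```r
path <- list("file1.csv", "file2.csv", "file3.csv")  # hypothetical file names

# Name each path after its file (minus the extension) before reading;
# lapply() preserves the names of its input, so the resulting list of
# dataframes will be named as well.
path <- setNames(path, tools::file_path_sans_ext(unlist(path)))
df_list_in <- lapply(path, read.csv)
```

Alternatively, you can assign names(df_list_in) after the fact.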
df_list_in <- list(
  df_1 = data.frame(x = 1:5, y = 5:1),
  df_2 = data.frame(x = 6:10, y = 10:6),
  df_3 = data.frame(x = 11:15, y = 15:11)
)
Note that each dataframe in the example list has exactly the same structure as specified in the introduction above.
In order to organize the data in the desired format, you need a short sorting function which can get a list of dataframes as input, merge the values of each variable and return as output a list of k dataframes, where k is the number of different variables (in this case k is equal to 2, namely x and y). Ideally each dataframe in the output list should be accessible by indexing the list with the variable name.
In short, below is the desired output structure
df_list_out <- list(
  df_x = data.frame(x_1 = 1:5, x_2 = 6:10, x_3 = 11:15),
  df_y = data.frame(y_1 = 5:1, y_2 = 10:6, y_3 = 15:11)
)
The sorting function should not depend on the user setting how many variables and how many dataframes are in the input list: it should be able to get the df_list_in as an argument and output the df_list_out without any additional arguments or information.
Note that the only assumption is that each dataframe in the input list should have the same structure, i.e. the dataframes in the input list are assumed to differ only in the particular values of x and y stored.
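Since everything that follows rests on this assumption, it can be worth verifying it up front. A minimal sanity check, written here against the toy list from above:

```r
df_list_in <- list(
  df_1 = data.frame(x = 1:5, y = 5:1),
  df_2 = data.frame(x = 6:10, y = 10:6),
  df_3 = data.frame(x = 11:15, y = 15:11)
)

# Every dataframe must have the same column names as the first one
same_structure <- vapply(
  df_list_in,
  function(d) identical(names(d), names(df_list_in[[1]])),
  logical(1)
)
stopifnot(all(same_structure))
```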
The sorting function(s)
A first option might be the following sorting function:
sort_list_df <- function(df_l) {
  out <- do.call(cbind, df_l)
  number_of_vars <- ncol(df_l[[1]])
  name_of_dfs <- names(out)
  out_l <- list()
  for (i in 1:number_of_vars) {
    index <- seq(i, ncol(out), number_of_vars)
    tempdf <- out[, names(out) %in% name_of_dfs[index]]
    names(tempdf) <- names(df_l)
    out_l[[i]] <- tempdf
  }
  names(out_l) <- names(df_l[[1]])
  return(out_l)
}
As you can see, first all the columns of all the dataframes in the input list are coerced into a single dataframe (out). The out dataframe is subsequently subset by assigning the values of each variable to a temporary dataframe (tempdf), which is eventually appended to the output list. The for loop repeats this procedure k times, once per variable. Before the output list is returned, names are assigned.
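The column-picking logic rests on the fact that do.call(cbind, df_l) lays the columns out in a repeating pattern (x, y, x, y, x, y), so seq() can stride through it. With three dataframes of two variables:

```r
# 3 dataframes of 2 variables = 6 columns, ordered x, y, x, y, x, y
seq(1, 6, 2)  # positions of the x columns: 1 3 5
seq(2, 6, 2)  # positions of the y columns: 2 4 6
```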
The function above works fine but it is rather lengthy, uses for loops and it is not using the full potential of R functionals such as lapply. It probably serves its purpose, however there is also another, in my opinion, more elegant solution. Take a look at the function below:
arrange_col <- function(dl) {
  n_col <- ncol(dl[[1]])
  out_names <- names(dl[[1]])
  out <- lapply(1:n_col, function(i, x) as.data.frame(lapply(x, '[[', i = i)), dl)
  names(out) <- out_names
  return(out)
}
This solution is much neater and more concise than the first one. It is also very easy to read through, which enhances code readability, always a plus.
How it works: first the number of variables is assessed and stored, and likewise the column names, which are kept in the out_names variable to be assigned to the final output list. The core of this solution is the lapply() functional, which applies the function passed as an argument to each element of its input. Note how a single line of code replaces the for loop and the six lines used in the previous "naive" solution: lapply() walks through each column index, calling the inline function that merges the corresponding variable into the desired data structure.
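The trick at the core is that '[[' can be passed to lapply() like any other function: lapply(x, '[[', i) extracts the i-th column from every dataframe in the list, and as.data.frame() then binds the extracted columns together. For the toy list, extracting the first variable looks like this:

```r
df_list_in <- list(
  df_1 = data.frame(x = 1:5, y = 5:1),
  df_2 = data.frame(x = 6:10, y = 10:6),
  df_3 = data.frame(x = 11:15, y = 15:11)
)

# Pull the first column out of each dataframe in the list
cols <- lapply(df_list_in, '[[', 1)  # list(df_1 = 1:5, df_2 = 6:10, df_3 = 11:15)

# Bind the extracted columns into a single 5 x 3 dataframe
as.data.frame(cols)
```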
By running both functions on the test example you can actually see they output the same result.
sort_list_df(df_list_in)
arrange_col(df_list_in)

## $x
##   df_1 df_2 df_3
## 1    1    6   11
## 2    2    7   12
## 3    3    8   13
## 4    4    9   14
## 5    5   10   15
##
## $y
##   df_1 df_2 df_3
## 1    5   10   15
## 2    4    9   14
## 3    3    8   13
## 4    2    7   12
## 5    1    6   11
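Beyond eyeballing the printed output, you can also check the agreement programmatically. all.equal() is a safer choice than identical() here, since the two code paths may build the dataframes with slightly different internal attributes. Assuming df_list_in and both functions from above are defined:

```r
# Should evaluate to TRUE if the two functions produce equivalent output
isTRUE(all.equal(sort_list_df(df_list_in), arrange_col(df_list_in)))
```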
Performance and efficiency considerations
In order to assess which one of the two options is faster, you can use the microbenchmark package. microbenchmark() gives you basic statistics on the evaluation time of each expression passed as an argument (timings are taken at nanosecond resolution and printed in the most convenient unit).
library(microbenchmark)
benc <- microbenchmark(
  arrange_col(df_list_in),
  sort_list_df(df_list_in),
  times = 100
)
In this example the times parameter has been set to 100: this means that each function will be evaluated and timed 100 times. This should be enough to get a solid grasp of the average time each function needs. Here is the output
print(benc)
## Unit: microseconds
##                      expr     min       lq     mean   median       uq
##   arrange_col(df_list_in) 438.232 454.8690 506.4917 471.1295 527.4780
##  sort_list_df(df_list_in) 149.672 159.2715 178.2066 167.9120 183.4115
##       max neval
##  1405.249   100
##   572.220   100
As you can see, sort_list_df() seems to be much faster than arrange_col(): on average, sort_list_df() needs roughly a third of the time required by arrange_col(). In order to get a better understanding of the performance, you can plot the output of the microbenchmark as a violin plot
library(ggplot2)
plot1 <- ggplot(benc, aes(factor(expr), log(time)))
plot1 + geom_violin(alpha = 0.5, aes(color = expr)) +
  theme(panel.background = element_rect(fill = 'white', colour = 'black'))
or as a more traditional boxplot
plot2 <- ggplot(benc, aes(factor(expr), log(time)))
plot2 + geom_boxplot(aes(color = expr)) +
  theme(panel.background = element_rect(fill = 'white', colour = 'black'))
A quick look at the plots clearly shows that sort_list_df() is, on average, considerably faster than arrange_col().
A practical sorting example with stocks data
This practical example may explain a little better why such sorting functions can be useful. Suppose you have gathered .csv stock data from Google Finance or Yahoo Finance in the OHLC format and you would like to compare a particular variable, such as volume, across different stocks. For instance, if you have downloaded Cree and Yahoo stock data, by running the following code
stocks <- list(cree = "cree.csv", yahoo = "yahoo.csv")
df_list_in <- lapply(stocks, read.csv)
out_list <- sort_list_df(df_list_in)
the output you would get is the one below
print(out_list)

## $Date
##         cree      yahoo
## 1 2016-02-11 2016-02-11
## 2 2016-02-10 2016-02-10
## ...
##
## $Open
##    cree yahoo
## 1 29.01 26.46
## 2 29.60 27.11
## ...
##
## $High
##    cree yahoo
## 1 29.24 26.97
## 2 29.90 27.81
## ...
##
## $Low
##    cree yahoo
## 1 27.95 26.15
## 2 29.23 26.84
## ...
##
## $Close
##    cree yahoo
## 1 28.44 26.76
## 2 29.30 27.10
## ...
##
## $Volume
##      cree    yahoo
## 1 1254300 11154900
## 2 1666600  8933600
## ...
##
## $Adj.Close
##    cree yahoo
## 1 28.44 26.76
## 2 29.30 27.10
## ...
Note how the output list has merged each variable (open price, volume, etc.) across stocks into a single dataframe, accessible by indexing the output list with the variable name. It is also worth noting that the output list keeps track of the name of each stock. Furthermore, and this is where this sorting process becomes handy, the same call works with hundreds of stocks or other files, provided that the structure of the source files is identical.
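Scaling this up to many files is mostly a matter of building the input list programmatically. A sketch, assuming the .csv files sit in a hypothetical data/ directory and all share the same OHLC structure:

```r
# List all csv files in a (hypothetical) data/ directory
paths <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

# Name each list element after its file, e.g. "cree" for data/cree.csv,
# so the sorting function can label the output columns
names(paths) <- tools::file_path_sans_ext(basename(paths))

df_list_in <- lapply(paths, read.csv)
out_list <- sort_list_df(df_list_in)  # one dataframe per OHLC variable
```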
Final considerations
sort_list_df() is much faster than arrange_col(), but it uses for loops and probably wastes memory storing temporary variables, especially when the input list contains a significant number of dataframes. arrange_col(), on the other hand, is slower but neater and more concise, using fewer lines of code: it is a great example of how the apply() family of functionals can considerably shorten the code needed to achieve a certain task.
Both functions run smoothly and reach the same goal using two different approaches: depending on your preferences and needs you should be able to use either one or the other (or both!) successfully.
Thank you for reading this article, please feel free to leave a comment if you have any questions or suggestions.
The post How to sort a list of dataframes appeared first on MilanoR.