Site icon R-bloggers

Read a lot of datasets at once with R

[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I often have to read a lot of datasets at once using R. So I’ve wrote the following function to solve this issue:

read_list <- function(list_of_datasets, read_func){

        read_and_assign <- function(dataset, read_func){
                dataset_name <- as.name(dataset)
                dataset_name <- read_func(dataset)
        }

        # invisible is used to suppress the unneeded output
        output <- invisible(
                sapply(list_of_datasets,
                           read_and_assign, read_func = read_func, simplify = FALSE, USE.NAMES = TRUE))

        # Remove the extension at the end of the data set names
        names_of_datasets <- c(unlist(strsplit(list_of_datasets, "[.]"))[c(T, F)])
        names(output) <- names_of_datasets
        return(output)
}

You need to supply a list of datasets as well as the function to read the datasets to read_list. So for example to read in .csv files, you could use read.csv() (or read_csv() from the readr package, which I prefer to use), or read_dta() from the package haven for STATA files, and so on.

Now imagine you have some data in your working directory. First start by saving the name of the datasets in a variable:

data_files <- list.files(pattern = ".csv")

print(data_files)
## [1] "data_1.csv" "data_2.csv" "data_3.csv"

Now you can read all the data sets and save them in a list with read_list():

library("readr")
library("tibble")

list_of_data_sets <- read_list(data_files, read_csv)


glimpse(list_of_data_sets)
## List of 3
##  $ data_1:Classes 'tbl_df', 'tbl' and 'data.frame':  19 obs. of  3 variables:
##   ..$ col1: chr [1:19] "0,018930679" "0,8748013128" "0,1025635934" "0,6246140983" ...
##   ..$ col2: chr [1:19] "0,0377725807" "0,5959457638" "0,4429121533" "0,558387159" ...
##   ..$ col3: chr [1:19] "0,6241767189" "0,031324594" "0,2238059868" "0,2773350732" ...
##  $ data_2:Classes 'tbl_df', 'tbl' and 'data.frame':  19 obs. of  3 variables:
##   ..$ col1: chr [1:19] "0,9098418493" "0,1127788509" "0,5818891392" "0,1011773532" ...
##   ..$ col2: chr [1:19] "0,7455905887" "0,4015039612" "0,6625796605" "0,029955339" ...
##   ..$ col3: chr [1:19] "0,327232932" "0,2784035673" "0,8092386735" "0,1216045306" ...
##  $ data_3:Classes 'tbl_df', 'tbl' and 'data.frame':  19 obs. of  3 variables:
##   ..$ col1: chr [1:19] "0,9236124896" "0,6303271761" "0,6413583054" "0,5573887416" ...
##   ..$ col2: chr [1:19] "0,2114708388" "0,6984538266" "0,0469865249" "0,9271510226" ...
##   ..$ col3: chr [1:19] "0,4941919971" "0,7391538511" "0,3876723797" "0,2815014394" ...

If you prefer not to have the datasets in a list, but rather import them into the global environment, you can change the above function like so:

read_list <- function(list_of_datasets, read_func){

        read_and_assign <- function(dataset, read_func){
                assign(dataset, read_func(dataset), envir = .GlobalEnv)
        }

        # invisible is used to suppress the unneeded output
        output <- invisible(
                sapply(list_of_datasets,
                           read_and_assign, read_func = read_func, simplify = FALSE, USE.NAMES = TRUE))

}

But I personnally don’t like this second option, but I put it here for completeness.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.