Read a lot of datasets at once with R
I often have to read a lot of datasets at once using R, so I wrote the following function to do it:
read_list <- function(list_of_datasets, read_func){

    # Read every file with read_func and collect the results in a list
    output <- sapply(list_of_datasets, read_func,
                     simplify = FALSE, USE.NAMES = TRUE)

    # Remove the extension at the end of the data set names
    names(output) <- tools::file_path_sans_ext(list_of_datasets)

    return(output)
}
You need to supply read_list() with a character vector of file names as well as the function that reads them. For example, to read in .csv files you could use read.csv() (or read_csv() from the readr package, which I prefer), or read_dta() from the haven package for Stata files, and so on.
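As a minimal sketch of how the pieces fit together, the example below writes two small CSV files to a temporary directory and reads them back with base R's read.csv(). The file names, the toy data and the compact body of read_list() are assumptions made for illustration; they are not the data used in this post.

```r
# Compact read_list(): read each file and name the result after the file
read_list <- function(list_of_datasets, read_func) {
  output <- sapply(list_of_datasets, read_func,
                   simplify = FALSE, USE.NAMES = TRUE)
  names(output) <- tools::file_path_sans_ext(basename(list_of_datasets))
  output
}

# Hypothetical toy files, written to a temporary directory
tmp_dir <- tempfile("read_list_demo")
dir.create(tmp_dir)
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")),
          file.path(tmp_dir, "toy_1.csv"), row.names = FALSE)
write.csv(data.frame(x = 4:6, y = c("d", "e", "f")),
          file.path(tmp_dir, "toy_2.csv"), row.names = FALSE)

# full.names = TRUE returns paths that work from any working directory
toy_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Any function that takes a file path as its first argument can be passed in
toy_data <- read_list(toy_files, read.csv)

names(toy_data)
str(toy_data$toy_1)
```

Swapping read.csv for another reader only requires that the reader accepts a file path as its first argument.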
Now imagine you have some data in your working directory. Start by saving the names of the datasets in a variable:
# pattern is a regular expression: escape the dot and anchor it to the end
data_files <- list.files(pattern = "\\.csv$")

print(data_files)
## [1] "data_1.csv" "data_2.csv" "data_3.csv"
Now you can read all the data sets and save them in a list with read_list():
library("readr")
library("tibble")

list_of_data_sets <- read_list(data_files, read_csv)

glimpse(list_of_data_sets)
## List of 3
##  $ data_1:Classes 'tbl_df', 'tbl' and 'data.frame':  19 obs. of 3 variables:
##   ..$ col1: chr [1:19] "0,018930679" "0,8748013128" "0,1025635934" "0,6246140983" ...
##   ..$ col2: chr [1:19] "0,0377725807" "0,5959457638" "0,4429121533" "0,558387159" ...
##   ..$ col3: chr [1:19] "0,6241767189" "0,031324594" "0,2238059868" "0,2773350732" ...
##  $ data_2:Classes 'tbl_df', 'tbl' and 'data.frame':  19 obs. of 3 variables:
##   ..$ col1: chr [1:19] "0,9098418493" "0,1127788509" "0,5818891392" "0,1011773532" ...
##   ..$ col2: chr [1:19] "0,7455905887" "0,4015039612" "0,6625796605" "0,029955339" ...
##   ..$ col3: chr [1:19] "0,327232932" "0,2784035673" "0,8092386735" "0,1216045306" ...
##  $ data_3:Classes 'tbl_df', 'tbl' and 'data.frame':  19 obs. of 3 variables:
##   ..$ col1: chr [1:19] "0,9236124896" "0,6303271761" "0,6413583054" "0,5573887416" ...
##   ..$ col2: chr [1:19] "0,2114708388" "0,6984538266" "0,0469865249" "0,9271510226" ...
##   ..$ col3: chr [1:19] "0,4941919971" "0,7391538511" "0,3876723797" "0,2815014394" ...
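The columns in the output above come back as character vectors, which is what one would expect when the values use a comma as the decimal mark. Because read_func can be any function of a file path, reader options can be passed by wrapping the reader in a small function. The sketch below does this with base R's read.csv() and its dec argument. The file contents and the name read_csv_dec_comma are made up for illustration, and the sapply() call stands in for read_list().

```r
# Hypothetical file whose values use a decimal comma (quoted so that the
# comma is not taken for a field separator)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c('col1,col2',
             '"0,25","1,5"',
             '"0,75","2,5"'),
           tmp_file)

# Read with the defaults: the values are not parsed as numbers
as_text <- read.csv(tmp_file)

# A wrapper that sets the decimal mark; it takes a file path like any reader
read_csv_dec_comma <- function(path) read.csv(path, dec = ",")

# The same kind of sapply() call that read_list() makes internally
converted <- sapply(tmp_file, read_csv_dec_comma,
                    simplify = FALSE, USE.NAMES = TRUE)

is.numeric(as_text$col1)
is.numeric(converted[[1]]$col1)
```

With readr, the analogous option would be the locale argument of read_csv(); the wrapper pattern stays the same.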
If you prefer not to have the datasets in a list, but rather import them into the global environment, you can change the above function like so:
read_list <- function(list_of_datasets, read_func){

    read_and_assign <- function(dataset, read_func){
        assign(dataset, read_func(dataset), envir = .GlobalEnv)
    }

    # invisible is used to suppress the unneeded output
    output <- invisible(
        sapply(list_of_datasets,
               read_and_assign, read_func = read_func,
               simplify = FALSE, USE.NAMES = TRUE))
}
I personally don’t like this second option, but I include it here for completeness.
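If the datasets are already in a named list, base R's list2env() can give a similar result without a second version of the function. This is only a sketch: the list below is a made-up stand-in for the output of read_list(), and it is copied into a fresh environment rather than the global one.

```r
# Made-up stand-in for the named list returned by read_list()
toy_list <- list(data_a = data.frame(x = 1:2),
                 data_b = data.frame(x = 3:4))

# Copy each list element into an environment under its list name.
# A fresh environment is used here; envir = globalenv() would create
# the objects in the global environment instead.
target_env <- new.env()
invisible(list2env(toy_list, envir = target_env))

ls(target_env)
```

Since the list names have no file extension, the resulting objects can be referred to without backticks.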