Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Importing multiple excel files (not sheets, but files) and flattening into one data frame / tibble / data.table should be straightforward – shouldn’t it? How many ways can there be?
What are the pros and cons of each?
I’ve made three copies of an excel file from a PreppinData challenge.
For this, I just want to import the first sheet from each workbook, into one table. Ideally I’d like to identify the sheets – but it’s not a deal breaker.
suppressPackageStartupMessages(library(rio)) suppressPackageStartupMessages(library(readxl)) suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(purrr)) suppressPackageStartupMessages(library(data.table)) files <- dir(pattern = "*.xlsx") # in case anything else sneaks into folder
base….how low can you go?
As above, no packages, what can we do with base R?
method_lapply <- lapply(files, read_excel) method_lapply <- do.call(rbind, Map(data.frame, method_lapply)) head(method_lapply) ## Date New...Saddles New...Mudguards New...Wheels New...Bags ## 1 2021-01-21 13 42 19 38 ## 2 2021-02-21 1 9 14 6 ## 3 2021-03-21 8 22 6 35 ## 4 2021-04-21 3 9 8 16 ## 5 2021-05-21 2 8 5 34 ## 6 2021-06-21 11 2 6 8 ## Existing...Saddles Existing...Mudguards Existing...Wheels Existing...Bags ## 1 17 48 19 13 ## 2 2 4 19 24 ## 3 0 48 17 16 ## 4 18 50 18 25 ## 5 17 3 12 19 ## 6 2 8 3 1
Is it a data.frame? Yes.
Is it one I’d want to work with? No – those column names are horrible.
Purrr
method_purrr <- map_dfr(files, read_excel, sheet = "Manchester") str(method_purrr) ## tibble [36 x 9] (S3: tbl_df/tbl/data.frame) ## $ Date : POSIXct[1:36], format: "2021-01-21" "2021-02-21" ... ## $ New - Saddles : num [1:36] 13 1 8 3 2 11 16 10 15 9 ... ## $ New - Mudguards : num [1:36] 42 9 22 9 8 2 5 7 25 11 ... ## $ New - Wheels : num [1:36] 19 14 6 8 5 6 15 18 1 11 ... ## $ New - Bags : num [1:36] 38 6 35 16 34 8 37 27 38 0 ... ## $ Existing - Saddles : num [1:36] 17 2 0 18 17 2 19 10 18 18 ... ## $ Existing - Mudguards: num [1:36] 48 4 48 50 3 8 1 4 9 10 ... ## $ Existing - Wheels : num [1:36] 19 19 17 18 12 3 7 8 0 17 ... ## $ Existing - Bags : num [1:36] 13 24 16 25 19 1 28 9 23 7 ...
Let’s verify the sheet
argument is working by pulling in the London
data instead:
method_purrr2 <- map_dfr(files, read_excel, sheet = "London") str(method_purrr2) ## tibble [36 x 9] (S3: tbl_df/tbl/data.frame) ## $ Date : POSIXct[1:36], format: "2021-01-21" "2021-02-21" ... ## $ New - Saddles : num [1:36] 3 19 16 6 5 13 10 16 1 9 ... ## $ New - Mudguards : num [1:36] 44 32 52 56 3 6 8 9 32 8 ... ## $ New - Wheels : num [1:36] 18 2 2 1 6 16 7 10 2 17 ... ## $ New - Bags : num [1:36] 30 38 1 14 36 22 25 2 23 23 ... ## $ Existing - Saddles : num [1:36] 5 3 17 18 16 12 20 20 14 18 ... ## $ Existing - Mudguards: num [1:36] 35 7 0 15 0 4 9 0 36 15 ... ## $ Existing - Wheels : num [1:36] 19 17 18 1 1 18 3 7 17 8 ... ## $ Existing - Bags : num [1:36] 19 14 14 9 26 20 12 6 23 36 ...
If I want to identify the source workbook, how do I do that? Turns out I
need to specify the .id
argument, and I need to give it a name wrapped
in quotes.
This fails, because I didn’t wrap source_wb
( for source workbook
)
in quotes :
method_purrr2 <- map_dfr(files, read_excel, sheet = "London", .id = source_wb)
The .id
argument needs to be quoted. This now works, and it returns a
character vector to identify the source – although these are not the
actual names of the workbooks, so still not ideal.
method_purrr2 <- map_dfr(files, read_excel, sheet = "London", .id = "source_wb") method_purrr2 %>% select(source_wb) %>% distinct() ## # A tibble: 3 x 1 ## source_wb ## <chr> ## 1 1 ## 2 2 ## 3 3
This shows I have 3 unique source workbook values, and that the structure is as expected.
str(method_purrr2) ## tibble [36 x 10] (S3: tbl_df/tbl/data.frame) ## $ source_wb : chr [1:36] "1" "1" "1" "1" ... ## $ Date : POSIXct[1:36], format: "2021-01-21" "2021-02-21" ... ## $ New - Saddles : num [1:36] 3 19 16 6 5 13 10 16 1 9 ... ## $ New - Mudguards : num [1:36] 44 32 52 56 3 6 8 9 32 8 ... ## $ New - Wheels : num [1:36] 18 2 2 1 6 16 7 10 2 17 ... ## $ New - Bags : num [1:36] 30 38 1 14 36 22 25 2 23 23 ... ## $ Existing - Saddles : num [1:36] 5 3 17 18 16 12 20 20 14 18 ... ## $ Existing - Mudguards: num [1:36] 35 7 0 15 0 4 9 0 36 15 ... ## $ Existing - Wheels : num [1:36] 19 17 18 1 1 18 3 7 17 8 ... ## $ Existing - Bags : num [1:36] 19 14 14 9 26 20 12 6 23 36 ...
How do I get the actual workbook names? I need to pipe the files vector
to set_names
and then onto map_dfr
method_purrr3 <- files %>% set_names() %>% map_dfr(read_excel, sheet = "Manchester",.id = "source_wb") str(method_purrr3) ## tibble [36 x 10] (S3: tbl_df/tbl/data.frame) ## $ source_wb : chr [1:36] "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" ... ## $ Date : POSIXct[1:36], format: "2021-01-21" "2021-02-21" ... ## $ New - Saddles : num [1:36] 13 1 8 3 2 11 16 10 15 9 ... ## $ New - Mudguards : num [1:36] 42 9 22 9 8 2 5 7 25 11 ... ## $ New - Wheels : num [1:36] 19 14 6 8 5 6 15 18 1 11 ... ## $ New - Bags : num [1:36] 38 6 35 16 34 8 37 27 38 0 ... ## $ Existing - Saddles : num [1:36] 17 2 0 18 17 2 19 10 18 18 ... ## $ Existing - Mudguards: num [1:36] 48 4 48 50 3 8 1 4 9 10 ... ## $ Existing - Wheels : num [1:36] 19 19 17 18 12 3 7 8 0 17 ... ## $ Existing - Bags : num [1:36] 13 24 16 25 19 1 28 9 23 7 ...
And let’s verify that those (admittedly horrible) source workbook names are coming through correctly
method_purrr3 %>% select(source_wb) %>% distinct() ## # A tibble: 3 x 1 ## source_wb ## <chr> ## 1 copy2.xlsx ## 2 copy3.xlsx ## 3 PD 2021 Wk 4 Input.xlsx
for loop and data.table’s rbindlist()
filecount <- as.numeric(length(files)) temp_list <- list() for (i in seq_along(files)) { filename <- files[i] df <- read_excel(path = filename, sheet = "Manchester") df$source_wb <- filename temp_list[[i]] <- df rm(df) method_datatable <- data.table::rbindlist(temp_list, fill = TRUE) } rm(temp_list) str(method_datatable) ## Classes 'data.table' and 'data.frame': 36 obs. of 10 variables: ## $ Date : POSIXct, format: "2021-01-21" "2021-02-21" ... ## $ New - Saddles : num 13 1 8 3 2 11 16 10 15 9 ... ## $ New - Mudguards : num 42 9 22 9 8 2 5 7 25 11 ... ## $ New - Wheels : num 19 14 6 8 5 6 15 18 1 11 ... ## $ New - Bags : num 38 6 35 16 34 8 37 27 38 0 ... ## $ Existing - Saddles : num 17 2 0 18 17 2 19 10 18 18 ... ## $ Existing - Mudguards: num 48 4 48 50 3 8 1 4 9 10 ... ## $ Existing - Wheels : num 19 19 17 18 12 3 7 8 0 17 ... ## $ Existing - Bags : num 13 24 16 25 19 1 28 9 23 7 ... ## $ source_wb : chr "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" ... ## - attr(*, ".internal.selfref")=<externalptr>
Ugly, but effective. This is fast, as you’d expect, and we get the source workbook name along with the data.
rio
Not only is rio the misspelled name of my favourite Japanese, Liverpool
supporting rstats football guru, but it also excels at – well, importing
excel.
Check it out – the source file name comes through with no additional
code required from us.
method_rio <- import_list(files, rbind = TRUE, sheet = "Manchester") str(method_rio) ## 'data.frame': 36 obs. of 10 variables: ## $ Date : POSIXct, format: "2021-01-21" "2021-02-21" ... ## $ New - Saddles : num 13 1 8 3 2 11 16 10 15 9 ... ## $ New - Mudguards : num 42 9 22 9 8 2 5 7 25 11 ... ## $ New - Wheels : num 19 14 6 8 5 6 15 18 1 11 ... ## $ New - Bags : num 38 6 35 16 34 8 37 27 38 0 ... ## $ Existing - Saddles : num 17 2 0 18 17 2 19 10 18 18 ... ## $ Existing - Mudguards: num 48 4 48 50 3 8 1 4 9 10 ... ## $ Existing - Wheels : num 19 19 17 18 12 3 7 8 0 17 ... ## $ Existing - Bags : num 13 24 16 25 19 1 28 9 23 7 ... ## $ _file : chr "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" ... ## - attr(*, ".internal.selfref")=<externalptr>
Yes, it works, it’s quick, it identifies the source, and it’s ridiculously easy. We can also specify a name for the identity column:
method_rio2 <- import_list(files, rbind = TRUE, rbind_label = "source_wb", sheet = "London") str(method_rio2) ## 'data.frame': 36 obs. of 10 variables: ## $ Date : POSIXct, format: "2021-01-21" "2021-02-21" ... ## $ New - Saddles : num 3 19 16 6 5 13 10 16 1 9 ... ## $ New - Mudguards : num 44 32 52 56 3 6 8 9 32 8 ... ## $ New - Wheels : num 18 2 2 1 6 16 7 10 2 17 ... ## $ New - Bags : num 30 38 1 14 36 22 25 2 23 23 ... ## $ Existing - Saddles : num 5 3 17 18 16 12 20 20 14 18 ... ## $ Existing - Mudguards: num 35 7 0 15 0 4 9 0 36 15 ... ## $ Existing - Wheels : num 19 17 18 1 1 18 3 7 17 8 ... ## $ Existing - Bags : num 19 14 14 9 26 20 12 6 23 36 ... ## $ source_wb : chr "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" "copy2.xlsx" ... ## - attr(*, ".internal.selfref")=<externalptr>
So there you are, several methods of importing multiple excel files,
take your pick, I have to say, the rio
function is rather sweet and
might be the one I use in future.
purrrr’s map_dfr
is also pretty cool, once you realise that the
additional arguments to the read_excel
function don’t go inside the
brackets of the call to read_excel
.
Hope this has been useful / helpful to some of you. If so, let me know.
The source files are here if you want to try this yourself
Please also let me know if you spot the obscure rock references in this blog post – there are at least 2!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.