An Update on Importing Excel Data in R
Introduction
Back in 2015 I wrote a long blog post on importing Excel tables into R. Happily for everyone, this is now a lot easier than it was. This post provides an update on importing spreadsheets into R and exporting from R to Excel. I’ll also cover reading an Excel file into R from a URL, as that seems to be an ongoing struggle.
Reading an Excel file from a URL
In the 2015 post on importing Excel I wrote:
“It is faster to simply download the file to your drive, or swim the Atlantic ocean, than to successfully download an excel file on http: or, in particular https:. So maybe ask yourself what is the path of least resistance and run with that.”
As far as I can tell the situation is not radically different now. However, this is something that lots of people have logically wanted to do. By the power of Stack Overflow, a solution can be found. Luke A provided the following answer to this question on downloading Excel files.
library(readxl)
library(httr)
packageVersion("readxl")
# [1] ‘0.1.1’
GET(url1, write_disk(tf <- tempfile(fileext = ".xls")))
df <- read_excel(tf, 2L)
str(df)
This code uses the httr package to download a .xls file from a URL, write it to disk, and then pass it to readxl.
We can wrap this into a small function with some adjustments. In this case we use str_detect to detect whether the file type is included in the URL. Note that this will not address cases (such as Google Drive) where the Excel file type is not included in the URL (see the googledrive package). Nor will it detect other Excel file types such as .xlsm for macro-enabled workbooks. As this suggests, the task is more complex than it might at first appear. This small function addresses common use cases but will not address all of them.
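For the Google Drive case, a minimal sketch using the googledrive package might look like the following. The file ID here is a placeholder of my own, and this assumes the stored file is a native .xlsx rather than a Google Sheet:

```r
# a sketch, separate from the function below: fetching an Excel file from
# Google Drive with the googledrive package (the file ID is a placeholder)
library(googledrive)

drive_download(as_id("YOUR_FILE_ID"), path = "tmp.xlsx", overwrite = TRUE)
df <- readxl::read_excel("tmp.xlsx")
```

Note that drive_download will prompt for authentication the first time it is used in a session.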
The function assumes that the file extension is contained in the URL and will spot it for us. In the case of a .zip extension it will download and attempt to extract the file, and if all else fails we can provide the file extension ourselves through the type argument. The ... means that other arguments, such as col_types =, can be passed to the function and will be picked up by read_excel.
readxl_online <- function(url, type = NULL, ...) {
  test <- stringr::str_detect(url, "[.]xls|[.]zip")
  if (test == FALSE && is.null(type)) {
    message("Expecting file extension of type .xlsx, .xls or .zip. Check the URL or the data source for the correct file extension and use the type argument")
  }
  # test for individual file extensions; for .xls use a negative lookahead,
  # i.e. .xls not followed by x
  t1 <- stringr::str_detect(url, "[.]xlsx")
  t2 <- stringr::str_detect(url, "[.]xls(?!x)")
  tz <- stringr::str_detect(url, "[.]zip")
  if (t1 == TRUE) {
    type <- ".xlsx"
  }
  if (t2 == TRUE) {
    type <- ".xls"
  }
  if (tz == TRUE) {
    httr::GET(url, httr::write_disk("tmp.zip", overwrite = TRUE))
    tmp <- unzip("tmp.zip")
    # on OS X more than one file name is returned, so select the first element
    df <- readxl::read_excel(tmp[[1]])
    return(df)
  }
  httr::GET(url, httr::write_disk(paste0("tmp", type), overwrite = TRUE))
  df <- readxl::read_excel(paste0("tmp", type))
  df
}
This is not perfect, but it is a start. We can now run a test on different file types to see if it will work. These URLs all point to Excel files on GitHub. GitHub file URLs are actually placeholders, so we need to follow the link and copy the Raw file URL (note the raw=true in the URL). Note also that these URLs are all https:
The .xls case:
dfxls <- readxl_online("https://github.com/wipo-analytics/opensource-patent-analytics/blob/master/2_datasets/pizza_all_24294/patentscope_pizza_1940_2005_9659.xls?raw=true")
The .xlsx case:
dfxlsx <- readxl_online("https://github.com/wipo-analytics/opensource-patent-analytics/blob/master/2_datasets/ewaste/ewaste.xlsx?raw=true")
The zip file case:
dfzip <- readxl_online("https://github.com/poldham/opensource-patent-analytics/blob/master/2_datasets/taxonomy_final.zip?raw=true")
It is always a good thing if functions fail fast and provide a helpful message.
error <- readxl_online("https://www.google.co.uk/")
This prints the expected message.
“Expecting file extension of type .xlsx, .xls or .zip. Check the URL or the data source for the correct file extension and use the type argument”
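Where the URL contains no extension at all, the type argument lets us tell the function what to expect. A quick sketch (the URL below is an invented placeholder):

```r
# hypothetical URL with no file extension in it; supply the type manually
df <- readxl_online("https://example.com/download/report", type = ".xlsx")
```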
Tidying column names with janitor
One issue once you have your data in R is that column names in Excel files often contain mixed-case names and spaces or other characters, such as brackets, that can be awkward to work with in R. An easy solution is the recent janitor package.
install.packages("janitor")
For this we need an Excel worksheet with noisy names. For R-coding Blue Peter fans… “Here is one we prepared earlier”.
noisycols <- read_excel("/Users/pauloldham17inch/blog/content/post/noisydf.xlsx")
| noisy(yes) | really_,Noisy;! | EVEN noisier !?*$! | OMG- I_can’t-***//believe?it |
|---|---|---|---|
| these | are | not | the |
| noisiest | column | names | in |
| the | world, | just | a |
| tribute | NA | NANA | NANANANA |
library(janitor)
noisycols1 <- janitor::clean_names(noisycols)
| noisy_yes | really_noisy | even_noisier | omg_i_can_t_believe_it |
|---|---|---|---|
| these | are | not | the |
| noisiest | column | names | in |
| the | world, | just | a |
| tribute | NA | NANA | NANANANA |
This does a very good job of cleaning up names but may not always catch everything. If you have particular needs, the stringr package (now installed with the tidyverse) is the go-to package. Try the str_replace_all function.
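As a rough sketch of what that might look like, assuming a vector of messy names loosely based on the table above, we could lower-case them and collapse any run of non-alphanumeric characters to a single underscore:

```r
library(stringr)

# hypothetical messy names for illustration
messy <- c("EVEN noisier !?*$!", "really_,Noisy;!")

# lower-case, then collapse runs of non-alphanumeric characters to "_"
clean <- str_replace_all(str_to_lower(messy), "[^a-z0-9]+", "_")

# drop any trailing underscore (str_remove is in recent stringr versions)
clean <- str_remove(clean, "_$")
clean
```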
If you need more help, try the Basic Regular Expressions Cheatsheet in R or the chapter on strings in Hadley Wickham’s book R for Data Science. Charlotte Wickham also offers a DataCamp course on String Manipulation in R with stringr. When it comes to working with strings, stringr is your friend, and if that doesn’t solve the problem then try the stringi package that powers stringr.
Exporting to Excel
In the earlier post I wrote about using write.xlsx() from the xlsx package. That is still a very good option. However, as a personal preference I have now switched over to the writexl package, as I find it easier to remember and use. It is also an rOpenSci package, and I use a lot of rOpenSci packages. writexl has the added bonus that Clippy appears in the documentation to brighten up your day… or drive you insane… as the case may be. So, if you prefer to be a curmudgeon about Clippy you may want to use the xlsx package.
install.packages("writexl")
We simply specify the data frame and the path of the file we want to write. The additional argument col_names = TRUE is set to TRUE by default, so you only need to specify it if you want the value to be FALSE.
library(writexl)
writexl::write_xlsx(df, path = "df.xlsx", col_names = TRUE)
You can also now write multiple sheets by specifying the data frames in a list and passing them to write_xlsx.
library(writexl)
tmp <- list(df, noisycols1)
write_xlsx(tmp, "tmp.xlsx")
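One extra point worth knowing: if the list is named, write_xlsx uses the names as the worksheet names. A quick sketch (the sheet names here are my own choice):

```r
library(writexl)

# named list entries become the worksheet names in the output file
sheets <- list(raw = df, cleaned = noisycols1)
write_xlsx(sheets, "tmp_named.xlsx")
```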
Round Up
There we have it. Reading and writing Excel files in R is now way, way easier than it was just a couple of years ago, thanks to the dedicated work of those behind readxl (Hadley Wickham and Jenny Bryan) and writexl (Jeroen Ooms and John McNamara). Other packages will take you to the same place, but these are my go-to packages. Community contributions are helping to solve the mystery of reading Excel files from URLs, and we might hope that at some point readxl will address this problem directly.
If you would like to learn more on importing data into R then try the DataCamp course on Importing Data & Cleaning with R that covers Excel.