An Update on Importing Excel Data in R

Paul Oldham's Analytics Blog

4 years ago

[This article was first published on Paul Oldham's Analytics Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Back in 2015 I wrote a long blog post on importing Excel tables into R. Happily for everyone this is now a lot easier than it was. This post provides an update on importing spreadsheets into R and exporting from R to Excel. I’ll also cover reading an excel file into R from a url as that seems to be an ongoing struggle.

Import Directly from the RStudio Menu

The big change is that it is now very easy to import from Excel using the RStudio Menu: File > Import Dataset > From Excel.

Next, navigate to the file that you want to import and select it. You will then see something like this.

One point to bear in mind is that the import will often default to the name dataset so that you need to make sure you enter a meaningful name for the dataset.

If your workbook has multiple sheets then you can choose a sheet number using Sheet, choose the maximum number of rows or skip rows if you have a bunch of filler junk in the top rows. Regular Excel users may also want to select columns by Range.

You can also click on a column and choose to skip it or change the format.

It is worth bearing in mind that if you are importing a number of worksheets you can easily lose track. I sometimes use the approach of copying the import chunk into an Rmarkdown document to keep track of what I am doing and where a file came from.

When copying chunks note the small clipboard icon in the top right above the chunk that will copy the chunk to the clipboard for pasting into the console or an R markdown code chunk to document your import steps for the future. My approach when working with multiple sheets is to create an R markdown file and copy and paste the import code into chunks that I then save. That allows “future me”, to borrow from Hadley Wickham, to understand where the datasets came from.

As we can see from importing the file behind the scenes RStudio is using the readxl library to import the file.

readxl will commonly generate warning messages during the import process. For example this dataset generated a long long string of warnings that looked like this.

“Expecting logical in AH5501 / R5501C34: got ‘Aaptos suberitoides’Expecting logical in AH5502 / R5502C34: got ’Abdopus abaculus’Expecting logical in AH5503 / R5503C34: got ’Abdopus aculeatus’Expecting logical in AH5504 / R5504C34: got ’Abralia armata’Expecting logical in AH5505 / R5505C34: got ’Abraliopsis hoylei’Expecting logical in AH5506 / R5506C34: got ’Abudefduf bengalensis’Expecting logical in AH5507 / R5507C34: got ’Abudefduf sexfasciatus’”

These warnings arise because readxl guesses the column type by reading the top 1000 rows for each column. However, where a column contains a mix of numbers or characters this can lead to an expecting logical/expecting integer type of error. A lot of the time this is not actually a problem. However, it is important to pay attention to the warnings because they may indicate an actual problem with your data (such as lines spilling across rows).

To fix this there are a number of options to try.

1. Use the guess_max argument

Use the guess_max argument to increase the number of rows that are read to guess the column type. The default is 1000 and here we reset it to 2000. In the case of our example dataset this didn’t work because the problems appeared lower down but it often will. You can add an n_max value (shown below as NULL) where you know the maximum number of rows.

library(readxl)
taxonomy <- read_excel("/Users/pauloldham17inch/Desktop/open_source_master/asean/data-taxonomy/taxonomy_final.xlsx", 
    guess_max = min(2000, n_max = NULL))

An alternative to this approach is simply to set min as the maximum number of rows. The issue here is that you would of course need to already have opened the spreadsheet to identify the number of rows, but there is no reason not to simply guess large.

library(readxl)
taxonomy <- read_excel("/Users/pauloldham17inch/Desktop/taxonomy_final.xlsx", 
    guess_max = min(8400, n_max = NULL))
taxonomy
## # A tibble: 8,400 x 50
##    scientificname  type  genusorabove specificepithet parsed authorsparsed
##    <chr>           <chr> <chr>        <chr>           <lgl>  <lgl>        
##  1 abramis brama   SCIE… Abramis      brama           TRUE   TRUE         
##  2 acanthamoeba p… SCIE… Acanthamoeba polyphaga       TRUE   TRUE         
##  3 acaudina molpa… SCIE… Acaudina     molpadioides    TRUE   TRUE         
##  4 acipenser dabr… SCIE… Acipenser    dabryanus       TRUE   TRUE         
##  5 acipenser fulv… SCIE… Acipenser    fulvescens      TRUE   TRUE         
##  6 acipenser mika… SCIE… Acipenser    mikadoi         TRUE   TRUE         
##  7 acipenser oxyr… SCIE… Acipenser    oxyrinchus      TRUE   TRUE         
##  8 acipenser ruth… SCIE… Acipenser    ruthenus        TRUE   TRUE         
##  9 acipenser schr… SCIE… Acipenser    schrencki       TRUE   TRUE         
## 10 acrocalanus gr… SCIE… Acrocalanus  gracilis        TRUE   TRUE         
## # ... with 8,390 more rows, and 44 more variables: canonicalname <chr>,
## #   canonicalnamewithmarker <chr>, canonicalnamecomplete <chr>,
## #   rankmarker <chr>, gbif_id <chr>, db <chr>, match <chr>,
## #   multiple_matches <lgl>, pattern_match <lgl>, uri <chr>, kingdom <chr>,
## #   phylum <chr>, class <chr>, order <chr>, family <chr>, genus <chr>,
## #   species <chr>, kingdom_id <chr>, phylum_id <chr>, class_id <chr>,
## #   order_id <chr>, family_id <chr>, genus_id <chr>, species_id <chr>,
## #   query <chr>, scientificname1 <chr>, required_fields_check <dbl>,
## #   environment_aphia_worms <chr>, name_aphia_worms <chr>,
## #   aphiaid_worms <dbl>, accepted_name_aphia_worms <chr>,
## #   valid_aphiaid_worms <dbl>, status_aphia_worms <chr>,
## #   taxonmatch_matchcount_worms <dbl>, taxonmatch_note_worms <chr>,
## #   species1 <chr>, match1 <lgl>, environment <chr>, marine <chr>,
## #   brackish <chr>, freshwater <chr>, terrestrial <chr>,
## #   noenvironment <chr>, worms_id <dbl>

2. Specify the column types

If that doesn’t work for you then a third option is to work out what the format should be and pass it as a string. Arguably, this should be the first option. However, it can also be the most time consuming.

A toy example is the following data frame that we can write to excel (see below on writing files).

library(tidyverse)
library(writexl)
df <- tibble(a = c(1,2,3), 
             b = c("a", "b", "c"), 
             c = c(TRUE, FALSE, TRUE), 
             d = c("2017-12-10", "20170815", "2017_06_12")) %>%
  writexl::write_xlsx(., "df.xlsx")

When we read this in we specify the column types. Note that in this case we need to use the term “text” rather than the familiar “character” or we get an error.

df <- read_excel("/Users/pauloldham17inch/blog/content/post/df.xlsx", col_names = TRUE, 
    col_types = c("numeric", "text", "logical", "text"))
df
## # A tibble: 3 x 4
##       a b     c     d         
##   <dbl> <chr> <lgl> <chr>     
## 1    1. a     TRUE  2017-12-10
## 2    2. b     FALSE 20170815  
## 3    3. c     TRUE  2017_06_12

The documentation for read_excel (?read_excel) sets out quite a few other options. For example we could specify the format of some columns and leave the function to guess the others. That would look like this.

df <- read_excel("/Users/pauloldham17inch/blog/content/post/df.xlsx", col_names = TRUE, 
    col_types = c("guess", "guess", "logical", "guess"))

3. Convert all columns to a single type

For a dataset with a lot of columns trying to work out the column types or writing guess, logical, character can rapidly become painful. Depending on your needs it may be easier to simply use the col_types = "text" for all columns and change them where needed later using as.character(), as.logical(), as.numeric() or as.Date().

library(readxl)
taxonomy <- read_excel("/Users/pauloldham17inch/Desktop/taxonomy_final.xlsx", 
    col_types = "text")
taxonomy
## # A tibble: 8,400 x 50
##    scientificname  type  genusorabove specificepithet parsed authorsparsed
##    <chr>           <chr> <chr>        <chr>           <chr>  <chr>        
##  1 abramis brama   SCIE… Abramis      brama           TRUE   TRUE         
##  2 acanthamoeba p… SCIE… Acanthamoeba polyphaga       TRUE   TRUE         
##  3 acaudina molpa… SCIE… Acaudina     molpadioides    TRUE   TRUE         
##  4 acipenser dabr… SCIE… Acipenser    dabryanus       TRUE   TRUE         
##  5 acipenser fulv… SCIE… Acipenser    fulvescens      TRUE   TRUE         
##  6 acipenser mika… SCIE… Acipenser    mikadoi         TRUE   TRUE         
##  7 acipenser oxyr… SCIE… Acipenser    oxyrinchus      TRUE   TRUE         
##  8 acipenser ruth… SCIE… Acipenser    ruthenus        TRUE   TRUE         
##  9 acipenser schr… SCIE… Acipenser    schrencki       TRUE   TRUE         
## 10 acrocalanus gr… SCIE… Acrocalanus  gracilis        TRUE   TRUE         
## # ... with 8,390 more rows, and 44 more variables: canonicalname <chr>,
## #   canonicalnamewithmarker <chr>, canonicalnamecomplete <chr>,
## #   rankmarker <chr>, gbif_id <chr>, db <chr>, match <chr>,
## #   multiple_matches <chr>, pattern_match <chr>, uri <chr>, kingdom <chr>,
## #   phylum <chr>, class <chr>, order <chr>, family <chr>, genus <chr>,
## #   species <chr>, kingdom_id <chr>, phylum_id <chr>, class_id <chr>,
## #   order_id <chr>, family_id <chr>, genus_id <chr>, species_id <chr>,
## #   query <chr>, scientificname1 <chr>, required_fields_check <chr>,
## #   environment_aphia_worms <chr>, name_aphia_worms <chr>,
## #   aphiaid_worms <chr>, accepted_name_aphia_worms <chr>,
## #   valid_aphiaid_worms <chr>, status_aphia_worms <chr>,
## #   taxonmatch_matchcount_worms <chr>, taxonmatch_note_worms <chr>,
## #   species1 <chr>, match1 <chr>, environment <chr>, marine <chr>,
## #   brackish <chr>, freshwater <chr>, terrestrial <chr>,
## #   noenvironment <chr>, worms_id <chr>

With our toy dataset we could easily change the columns that are our target as needed.

df$a <- as.numeric(df$a)

Dates can be troublesome and in cases where you need to format date fields the lubridate package will really make your life a whole lot easier.

In our toy dataset while the dates are all in YYYYMMDD format (and those in your dataset may not be) the separators are different. Using as.Date() won’t work for the second and third dates.

as.Date(df$d)

However, this problem is easily handled by lubridate::as_date

library(lubridate)
df$e <- lubridate::as_date(df$d)
df
## # A tibble: 3 x 5
##       a b     c     d          e         
##   <dbl> <chr> <lgl> <chr>      <date>    
## 1    1. a     TRUE  2017-12-10 2017-12-10
## 2    2. b     FALSE 20170815   2017-08-15
## 3    3. c     TRUE  2017_06_12 2017-06-12

Charlotte Wickham offers an incredibly useful DataCamp course Working with Dates and Times in R that will have you up and running in no time.

Reading an Excel file from a URL

In the 2015 post on importing Excel I wrote:

“It is faster to simply download the file to your drive, or swim the Atlantic ocean, than to successfully download an excel file on http: or, in particular https:. So maybe ask yourself what is the path of least resistance and run with that.”

As far as I can tell the situation is not radically different now. However, this is something that lots of people have logically wanted to do. By the power of Stack Overflow, a solution can be found. Luke A provided the following answer to this question on downloading excel files.

library(readxl)
library(httr)
packageVersion("readxl")
# [1] ‘0.1.1’

GET(url1, write_disk(tf <- tempfile(fileext = ".xls")))
df <- read_excel(tf, 2L)
str(df)

This code uses the httr package to read in a .xls file from a url that is written to disk and then passed to readxl.

We can wrap this into a small function with some adjustments. In this case we use str_detect to detect whether the file type is included in the URL. Note that this will not address those cases (such as Google Drive) where the Excel file type is not included (see the googledrive package). Nor will it detect other Excel file types such as .xlsm for macro enabled workbooks. As this suggests the task is more complex than it might at first appear. This small function addresses common use cases but will not address all use cases.

The function assumes that the file extension is contained in the URL and will spot that for us, in the case of a zip extension it will download and attempt to extract the file and if all else fails, we can provide the file extension. the ... informs us that other arguments such as col_types = can be passed to the function and will be picked up by read_excel.

readxl_online <- function(url, type = NULL, ...) {
    test <- stringr::str_detect(url, "[.]xls|[.]zip")
    if (test == FALSE) {
        print(message("Expecting file extension of type .xlsx, .xls or .zip. Check the URL or the data source for the correct file extension and use the type argument"))
    }
    # test for individual file extensions for xls use look forward, xls not
    # followed by x
    t1 <- stringr::str_detect(url, "[.]xlsx")
    t2 <- stringr::str_detect(url, "[.]xls(?!x)")
    tz <- stringr::str_detect(url, "[.]zip")
    if (t1 == TRUE) {
        type = ".xlsx"
    }
    if (t2 == TRUE) {
        type = ".xls"
    }
    if (tz == TRUE) {
        httr::GET(url, write_disk("tmp.zip", overwrite = TRUE))
        tmp <- unzip("tmp.zip")
        # On osx more than one file name is returned, select first element.
        df <- readxl::read_excel(tmp[[1]])
        return(df)
    }
    if (!is.null(type)) {
        type = type
        
    }
    df <- httr::GET(url, write_disk(paste0("tmp", type), overwrite = TRUE))
    df <- readxl::read_excel(paste0("tmp", type))
    
}

This is not perfect, but it is a start. We can now run a test on different data types to see if it will work. These urls are all from excel files on Github. Github file urls are actually placeholders and so we need to follow the link and copy the Raw file url (see raw=true in the url). Note also that these urls are all https:

The .xls case:

dfxls <- readxl_online("https://github.com/wipo-analytics/opensource-patent-analytics/blob/master/2_datasets/pizza_all_24294/patentscope_pizza_1940_2005_9659.xls?raw=true")

The xlsx case:

dfxlsx <- readxl_online("https://github.com/wipo-analytics/opensource-patent-analytics/blob/master/2_datasets/ewaste/ewaste.xlsx?raw=true")

The zip file case:

dfzip <- readxl_online("https://github.com/poldham/opensource-patent-analytics/blob/master/2_datasets/taxonomy_final.zip?raw=true")

It is always a good thing if functions fail fast and provide a helpful message.

error <- readxl_online("https://www.google.co.uk/")

This prints the expected message.

“Expecting file extension of type .xlsx, .xls or .zip. Check the URL or the data source for the correct file extension and use the type argument”

Tidying column names with `janitor`

One issue once you have your data in R is that column names in excel files often contain mixed case names and spaces or other characters such as brackets that can be awkward to work with in R. To solve that an easy option is to use the recent janitor package.

install.packages("janitor")

For this we need an excel worksheet with noisy names. For R coding Blue Peter fans…“Here is one we prepared earlier”.

noisycols <- read_excel("/Users/pauloldham17inch/blog/content/post/noisydf.xlsx")

noisy(yes)	really_,Noisy;!	EVEN noisier !?*$!	OMG- I_can’t-***//believe?it\|
these	are	not	the
noisiest	column	names	in
the	world,	just	a
tribute	NA	NANA	NANANANA

library(janitor)
noisycols1 <- janitor::clean_names(noisycols)

noisy_yes	really_noisy	even_noisier	omg_i_can_t_believe_it
these	are	not	the
noisiest	column	names	in
the	world,	just	a
tribute	NA	NANA	NANANANA

This does a very good job of cleaning up names but may not always catch everything. If you have particular needs the stringr package (now installed with the tidyverse) is the go to package. Try the str_replace_all function.

If you need more help try the Basic Regular Expressions Cheatsheet in R or the chapter on strings in Hadley Wickham’s book R for Data Science. Charlotte Wickham also offers a Data Camp course on String Manipulation in R with stringr. When it comes to working with strings stringr is your friend and if that doesn’t solve the problem then try the stringi package that powers stringr.

Exporting to Excel

In the earlier post I wrote about using write.xlsx() from the xlsx package. That is still a very good option. However, as a personal preference I have now switched over to the writexl package as I find it easier to remember and use. It is also an ROpenSci package and I use a lot of ROpenSci packages. writexl has the added bonus that Clippy appears in the documentation to brighten up your day… or drive you insane… as the case may be. So, if you prefer to be a curmudgeon about Clippy you may want to use the xlsx package.

install.packages("writexl")

We simply specify the file and the name of the file we want to write. An additional argument col_names = TRUE is set to TRUE by default so you only need to specify that if you want the value to be FALSE.

library(writexl)
writexl::write_xlsx(df, path = "df.xlsx", col_names = TRUE)

You can also now write multiple sheets by specifying the data frames in a list and passing them to write_xlsx.

library(writexl)
tmp <- list(df, noisycols1)
write_xlsx(tmp, "tmp.xlsx")

Round Up

There we have it. Reading and writing Excel files in R is now way way easier than it was just a couple of years ago thanks to the dedicated work of those behind readxl (Hadley Wickham and Jenny Bryan) and writexl by Jeroen Ooms and John McNamara. Other packages will take you to the same place but these are my go to packages. Community contributions are helping to solve the mystery of reading Excel files from urls and we might hope that at some point readxl may address this problem.

If you would like to learn more on importing data into R then try the DataCamp course on Importing Data & Cleaning with R that covers Excel.

To leave a comment for the author, please follow the link and comment on their blog: Paul Oldham's Analytics Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.