Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
Back in 2015 I wrote a long blog post on importing Excel tables into R. Happily for everyone this is now a lot easier than it was. This post provides an update on importing spreadsheets into R and exporting from R to Excel. I’ll also cover reading an excel file into R from a url as that seems to be an ongoing struggle.
Import Directly from the RStudio Menu
The big change is that it is now very easy to import from Excel using the RStudio Menu: File > Import Dataset > From Excel
.
Next, navigate to the file that you want to import and select it. You will then see something like this.
One point to bear in mind is that the import will often default to the name dataset
so that you need to make sure you enter a meaningful name for the dataset.
If your workbook has multiple sheets then you can choose a sheet number using Sheet
, choose the maximum number of rows or skip rows if you have a bunch of filler junk in the top rows. Regular Excel users may also want to select columns by Range.
You can also click on a column and choose to skip it or change the format.
It is worth bearing in mind that if you are importing a number of worksheets you can easily lose track. I sometimes use the approach of copying the import chunk into an Rmarkdown document to keep track of what I am doing and where a file came from.
When copying chunks note the small clipboard icon in the top right above the chunk that will copy the chunk to the clipboard for pasting into the console or an R markdown code chunk to document your import steps for the future. My approach when working with multiple sheets is to create an R markdown file and copy and paste the import code into chunks that I then save. That allows “future me”, to borrow from Hadley Wickham, to understand where the datasets came from.
As we can see from importing the file behind the scenes RStudio is using the readxl
library to import the file.
readxl
will commonly generate warning messages during the import process. For example this dataset generated a long long string of warnings that looked like this.
“Expecting logical in AH5501 / R5501C34: got ‘Aaptos suberitoides’Expecting logical in AH5502 / R5502C34: got ’Abdopus abaculus’Expecting logical in AH5503 / R5503C34: got ’Abdopus aculeatus’Expecting logical in AH5504 / R5504C34: got ’Abralia armata’Expecting logical in AH5505 / R5505C34: got ’Abraliopsis hoylei’Expecting logical in AH5506 / R5506C34: got ’Abudefduf bengalensis’Expecting logical in AH5507 / R5507C34: got ’Abudefduf sexfasciatus’”
These warnings arise because readxl
guesses the column type by reading the top 1000 rows for each column. However, where a column contains a mix of numbers or characters this can lead to an expecting logical/expecting integer
type of error. A lot of the time this is not actually a problem. However, it is important to pay attention to the warnings because they may indicate an actual problem with your data (such as lines spilling across rows).
To fix this there are a number of options to try.
1. Use the guess_max argument
Use the guess_max
argument to increase the number of rows that are read to guess the column type. The default is 1000 and here we reset it to 2000. In the case of our example dataset this didn’t work because the problems appeared lower down but it often will. You can add an n_max
value (shown below as NULL) where you know the maximum number of rows.
library(readxl) taxonomy <- read_excel("/Users/pauloldham17inch/Desktop/open_source_master/asean/data-taxonomy/taxonomy_final.xlsx", guess_max = min(2000, n_max = NULL))
An alternative to this approach is simply to set min
as the maximum number of rows. The issue here is that you would of course need to already have opened the spreadsheet to identify the number of rows, but there is no reason not to simply guess large.
library(readxl) taxonomy <- read_excel("/Users/pauloldham17inch/Desktop/taxonomy_final.xlsx", guess_max = min(8400, n_max = NULL)) taxonomy ## # A tibble: 8,400 x 50 ## scientificname type genusorabove specificepithet parsed authorsparsed ## <chr> <chr> <chr> <chr> <lgl> <lgl> ## 1 abramis brama SCIE… Abramis brama TRUE TRUE ## 2 acanthamoeba p… SCIE… Acanthamoeba polyphaga TRUE TRUE ## 3 acaudina molpa… SCIE… Acaudina molpadioides TRUE TRUE ## 4 acipenser dabr… SCIE… Acipenser dabryanus TRUE TRUE ## 5 acipenser fulv… SCIE… Acipenser fulvescens TRUE TRUE ## 6 acipenser mika… SCIE… Acipenser mikadoi TRUE TRUE ## 7 acipenser oxyr… SCIE… Acipenser oxyrinchus TRUE TRUE ## 8 acipenser ruth… SCIE… Acipenser ruthenus TRUE TRUE ## 9 acipenser schr… SCIE… Acipenser schrencki TRUE TRUE ## 10 acrocalanus gr… SCIE… Acrocalanus gracilis TRUE TRUE ## # ... with 8,390 more rows, and 44 more variables: canonicalname <chr>, ## # canonicalnamewithmarker <chr>, canonicalnamecomplete <chr>, ## # rankmarker <chr>, gbif_id <chr>, db <chr>, match <chr>, ## # multiple_matches <lgl>, pattern_match <lgl>, uri <chr>, kingdom <chr>, ## # phylum <chr>, class <chr>, order <chr>, family <chr>, genus <chr>, ## # species <chr>, kingdom_id <chr>, phylum_id <chr>, class_id <chr>, ## # order_id <chr>, family_id <chr>, genus_id <chr>, species_id <chr>, ## # query <chr>, scientificname1 <chr>, required_fields_check <dbl>, ## # environment_aphia_worms <chr>, name_aphia_worms <chr>, ## # aphiaid_worms <dbl>, accepted_name_aphia_worms <chr>, ## # valid_aphiaid_worms <dbl>, status_aphia_worms <chr>, ## # taxonmatch_matchcount_worms <dbl>, taxonmatch_note_worms <chr>, ## # species1 <chr>, match1 <lgl>, environment <chr>, marine <chr>, ## # brackish <chr>, freshwater <chr>, terrestrial <chr>, ## # noenvironment <chr>, worms_id <dbl>
2. Specify the column types
If that doesn’t work for you then a third option is to work out what the format should be and pass it as a string. Arguably, this should be the first option. However, it can also be the most time consuming.
A toy example is the following data frame that we can write to excel (see below on writing files).
library(tidyverse) library(writexl) df <- tibble(a = c(1,2,3), b = c("a", "b", "c"), c = c(TRUE, FALSE, TRUE), d = c("2017-12-10", "20170815", "2017_06_12")) %>% writexl::write_xlsx(., "df.xlsx")
When we read this in we specify the column types. Note that in this case we need to use the term “text” rather than the familiar “character” or we get an error.
df <- read_excel("/Users/pauloldham17inch/blog/content/post/df.xlsx", col_names = TRUE, col_types = c("numeric", "text", "logical", "text")) df ## # A tibble: 3 x 4 ## a b c d ## <dbl> <chr> <lgl> <chr> ## 1 1. a TRUE 2017-12-10 ## 2 2. b FALSE 20170815 ## 3 3. c TRUE 2017_06_12
The documentation for read_excel (?read_excel
) sets out quite a few other options. For example we could specify the format of some columns and leave the function to guess the others. That would look like this.
df <- read_excel("/Users/pauloldham17inch/blog/content/post/df.xlsx", col_names = TRUE, col_types = c("guess", "guess", "logical", "guess"))
3. Convert all columns to a single type
For a dataset with a lot of columns trying to work out the column types or writing guess, logical, character
can rapidly become painful. Depending on your needs it may be easier to simply use the col_types = "text"
for all columns and change them where needed later using as.character()
, as.logical()
, as.numeric()
or as.Date()
.
library(readxl) taxonomy <- read_excel("/Users/pauloldham17inch/Desktop/taxonomy_final.xlsx", col_types = "text") taxonomy ## # A tibble: 8,400 x 50 ## scientificname type genusorabove specificepithet parsed authorsparsed ## <chr> <chr> <chr> <chr> <chr> <chr> ## 1 abramis brama SCIE… Abramis brama TRUE TRUE ## 2 acanthamoeba p… SCIE… Acanthamoeba polyphaga TRUE TRUE ## 3 acaudina molpa… SCIE… Acaudina molpadioides TRUE TRUE ## 4 acipenser dabr… SCIE… Acipenser dabryanus TRUE TRUE ## 5 acipenser fulv… SCIE… Acipenser fulvescens TRUE TRUE ## 6 acipenser mika… SCIE… Acipenser mikadoi TRUE TRUE ## 7 acipenser oxyr… SCIE… Acipenser oxyrinchus TRUE TRUE ## 8 acipenser ruth… SCIE… Acipenser ruthenus TRUE TRUE ## 9 acipenser schr… SCIE… Acipenser schrencki TRUE TRUE ## 10 acrocalanus gr… SCIE… Acrocalanus gracilis TRUE TRUE ## # ... with 8,390 more rows, and 44 more variables: canonicalname <chr>, ## # canonicalnamewithmarker <chr>, canonicalnamecomplete <chr>, ## # rankmarker <chr>, gbif_id <chr>, db <chr>, match <chr>, ## # multiple_matches <chr>, pattern_match <chr>, uri <chr>, kingdom <chr>, ## # phylum <chr>, class <chr>, order <chr>, family <chr>, genus <chr>, ## # species <chr>, kingdom_id <chr>, phylum_id <chr>, class_id <chr>, ## # order_id <chr>, family_id <chr>, genus_id <chr>, species_id <chr>, ## # query <chr>, scientificname1 <chr>, required_fields_check <chr>, ## # environment_aphia_worms <chr>, name_aphia_worms <chr>, ## # aphiaid_worms <chr>, accepted_name_aphia_worms <chr>, ## # valid_aphiaid_worms <chr>, status_aphia_worms <chr>, ## # taxonmatch_matchcount_worms <chr>, taxonmatch_note_worms <chr>, ## # species1 <chr>, match1 <chr>, environment <chr>, marine <chr>, ## # brackish <chr>, freshwater <chr>, terrestrial <chr>, ## # noenvironment <chr>, worms_id <chr>
With our toy dataset we could easily change the columns that are our target as needed.
df$a <- as.numeric(df$a)
Dates can be troublesome and in cases where you need to format date fields the lubridate
package will really make your life a whole lot easier.
In our toy dataset while the dates are all in YYYYMMDD format (and those in your dataset may not be) the separators are different. Using as.Date()
won’t work for the second and third dates.
as.Date(df$d)
However, this problem is easily handled by lubridate::as_date
library(lubridate) df$e <- lubridate::as_date(df$d) df ## # A tibble: 3 x 5 ## a b c d e ## <dbl> <chr> <lgl> <chr> <date> ## 1 1. a TRUE 2017-12-10 2017-12-10 ## 2 2. b FALSE 20170815 2017-08-15 ## 3 3. c TRUE 2017_06_12 2017-06-12
Charlotte Wickham offers an incredibly useful DataCamp course Working with Dates and Times in R that will have you up and running in no time.
Reading an Excel file from a URL
In the 2015 post on importing Excel I wrote:
“It is faster to simply download the file to your drive, or swim the Atlantic ocean, than to successfully download an excel file on http: or, in particular https:. So maybe ask yourself what is the path of least resistance and run with that.”
As far as I can tell the situation is not radically different now. However, this is something that lots of people have logically wanted to do. By the power of Stack Overflow, a solution can be found. Luke A provided the following answer to this question on downloading excel files.
library(readxl) library(httr) packageVersion("readxl") # [1] ‘0.1.1’ GET(url1, write_disk(tf <- tempfile(fileext = ".xls"))) df <- read_excel(tf, 2L) str(df)
This code uses the httr
package to read in a .xls file from a url that is written to disk and then passed to readxl
.
We can wrap this into a small function with some adjustments. In this case we use str_detect
to detect whether the file type is included in the URL. Note that this will not address those cases (such as Google Drive) where the Excel file type is not included (see the googledrive package). Nor will it detect other Excel file types such as .xlsm
for macro enabled workbooks. As this suggests the task is more complex than it might at first appear. This small function addresses common use cases but will not address all use cases.
The function assumes that the file extension is contained in the URL and will spot that for us, in the case of a zip extension it will download and attempt to extract the file and if all else fails, we can provide the file extension. the ...
informs us that other arguments such as col_types =
can be passed to the function and will be picked up by read_excel
.
readxl_online <- function(url, type = NULL, ...) { test <- stringr::str_detect(url, "[.]xls|[.]zip") if (test == FALSE) { print(message("Expecting file extension of type .xlsx, .xls or .zip. Check the URL or the data source for the correct file extension and use the type argument")) } # test for individual file extensions for xls use look forward, xls not # followed by x t1 <- stringr::str_detect(url, "[.]xlsx") t2 <- stringr::str_detect(url, "[.]xls(?!x)") tz <- stringr::str_detect(url, "[.]zip") if (t1 == TRUE) { type = ".xlsx" } if (t2 == TRUE) { type = ".xls" } if (tz == TRUE) { httr::GET(url, write_disk("tmp.zip", overwrite = TRUE)) tmp <- unzip("tmp.zip") # On osx more than one file name is returned, select first element. df <- readxl::read_excel(tmp[[1]]) return(df) } if (!is.null(type)) { type = type } df <- httr::GET(url, write_disk(paste0("tmp", type), overwrite = TRUE)) df <- readxl::read_excel(paste0("tmp", type)) }
This is not perfect, but it is a start. We can now run a test on different data types to see if it will work. These urls are all from excel files on Github. Github file urls are actually placeholders and so we need to follow the link and copy the Raw file url (see raw=true in the url). Note also that these urls are all https:
The .xls case:
dfxls <- readxl_online("https://github.com/wipo-analytics/opensource-patent-analytics/blob/master/2_datasets/pizza_all_24294/patentscope_pizza_1940_2005_9659.xls?raw=true")
The xlsx case:
dfxlsx <- readxl_online("https://github.com/wipo-analytics/opensource-patent-analytics/blob/master/2_datasets/ewaste/ewaste.xlsx?raw=true")
The zip file case:
dfzip <- readxl_online("https://github.com/poldham/opensource-patent-analytics/blob/master/2_datasets/taxonomy_final.zip?raw=true")
It is always a good thing if functions fail fast and provide a helpful message.
error <- readxl_online("https://www.google.co.uk/")
This prints the expected message.
“Expecting file extension of type .xlsx, .xls or .zip. Check the URL or the data source for the correct file extension and use the type argument”
Tidying column names with janitor
One issue once you have your data in R is that column names in excel files often contain mixed case names and spaces or other characters such as brackets that can be awkward to work with in R. To solve that an easy option is to use the recent janitor
package.
install.packages("janitor")
For this we need an excel worksheet with noisy names. For R coding Blue Peter fans…“Here is one we prepared earlier”.
noisycols <- read_excel("/Users/pauloldham17inch/blog/content/post/noisydf.xlsx")
noisy(yes) | really_,Noisy;! | EVEN noisier !?*$! | OMG- I_can’t-***//believe?it| |
---|---|---|---|
these | are | not | the |
noisiest | column | names | in |
the | world, | just | a |
tribute | NA | NANA | NANANANA |
library(janitor) noisycols1 <- janitor::clean_names(noisycols)
noisy_yes | really_noisy | even_noisier | omg_i_can_t_believe_it |
---|---|---|---|
these | are | not | the |
noisiest | column | names | in |
the | world, | just | a |
tribute | NA | NANA | NANANANA |
This does a very good job of cleaning up names but may not always catch everything. If you have particular needs the stringr
package (now installed with the tidyverse
) is the go to package. Try the str_replace_all
function.
If you need more help try the Basic Regular Expressions Cheatsheet in R or the chapter on strings in Hadley Wickham’s book R for Data Science. Charlotte Wickham also offers a Data Camp course on String Manipulation in R with stringr. When it comes to working with strings stringr
is your friend and if that doesn’t solve the problem then try the stringi
package that powers stringr
.
Exporting to Excel
In the earlier post I wrote about using write.xlsx()
from the xlsx
package. That is still a very good option. However, as a personal preference I have now switched over to the writexl
package as I find it easier to remember and use. It is also an ROpenSci package and I use a lot of ROpenSci packages. writexl
has the added bonus that Clippy appears in the documentation to brighten up your day… or drive you insane… as the case may be. So, if you prefer to be a curmudgeon about Clippy you may want to use the xlsx package.
install.packages("writexl")
We simply specify the file and the name of the file we want to write. An additional argument col_names = TRUE
is set to TRUE by default so you only need to specify that if you want the value to be FALSE.
library(writexl) writexl::write_xlsx(df, path = "df.xlsx", col_names = TRUE)
You can also now write multiple sheets by specifying the data frames in a list and passing them to write_xlsx.
library(writexl) tmp <- list(df, noisycols1) write_xlsx(tmp, "tmp.xlsx")
Round Up
There we have it. Reading and writing Excel files in R is now way way easier than it was just a couple of years ago thanks to the dedicated work of those behind readxl
(Hadley Wickham and Jenny Bryan) and writexl
by Jeroen Ooms and John McNamara. Other packages will take you to the same place but these are my go to packages. Community contributions are helping to solve the mystery of reading Excel files from urls and we might hope that at some point readxl
may address this problem.
If you would like to learn more on importing data into R then try the DataCamp course on Importing Data & Cleaning with R that covers Excel.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.