Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The problem of openxlsx
At work, I rely on the R package openxlsx
heavily and use openxlsx::read.xlsx
to import excel spreadsheet, which is the most common format in business settings.
Now here is the issue I discovered recently. I have a spreadsheet that contains excel formula which generates value with special character &
. However when importing the file using openxlsx::read.xlsx
, all &
become &
Here is the sample of spreadsheet sample_data.xlsx. This is what it looks like in excel:
colA | colB |
---|---|
colB is text | US & Canada |
colB is formula (=B2) | US & Canada |
Note that the 2nd observation in colB is a formula “=B2”. Therefore the expected result should be the same as the 1st observation.
# Import spreadshet to R df <- openxlsx::read.xlsx("sample_data.xlsx") df colA colB 1 colB is text US & Canada 2 colB is formula (=B2) US & Canada
However, after importing to R, the value &
becomes &
instead. This is what the result shows in R:
colA | colB |
---|---|
colB is text | US & Canada |
colB is formula (=B2) | US & Canada |
Solution 1: Resolve it in excel
The obvious solution is to get rid of the problem at the beginning. There are many ways can achieve the desired outcome – in the end, what we want is the value in R matches the value in the spreadsheet.
- Convert to csv. Csv is my preferred format as it eliminates lots of format issues in xlsx. The problem is that this will creates multiple files , and people can get confused on which files to use. Besides, it might not be a good option if the file is meant to be a living dataset that need continuous edits.
- Paste as value. Another way to do it is to transform the formula to plain text in excel by pasting as value. If the spreadsheet is okay to be edited and is not too big, it would be an easy solution.
Solution 2: Use other packages
Sometimes you probably don’t want to touch the raw data. Also, if it’s a large dataset, it might be unrealistic to perform such functionality in excel as the solution 1 suggested. Since this issue only occurs when using openxlsx
package, we can choose other packages as a substitute.
readxl
is one of the packages belong to tidyverse collections. It’s designed to read data out of excel into R for both xls and xlsx, and does not have other dependencies. It’s a dependent package outside of tidyverse library so we would need to specifically call it out in R.
readxl::read_excel
is the substitute we’re looking for and it does the job well.
library(readxl) # Import df <- read_excel("sample_data.xlsx") df # A tibble: 4 × 2 colA colB <chr> <chr> 1 colB is text US & Canada 2 colB is formula (=B2) US & Canada
However, readxl
has its limitation. Comparing with openxlsx
, readxl::read_excel
has less arguments and therefore will possibly generates other format issue. For example, detectDates
is the argument I use often that readxl::read_excel
does not have, so I have to recognize dates and perform conversion later. Also, if I need to save the output to spreadsheet, I would need to call out another library writexl
and it’s just not as convenient as the all-in-one package openxlsx
.
Looking foward
I still prefer openxlsx
to other as I like its arguments and it fits the business world well. There’s a successor package openxlsx2
on the way – Let’s wait and see if it resolves the issue.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.