Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft Excel
spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we Excel-like-formats-to-exchange-data/”>don’t recommend using Excel
-line formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of gdata
‘s underlying Perl
script to work around a bug).
But one thing that continues to confound us is how hard it is to read Excel
data correctly. When Excel
exports into CSV/TSV
style formats it uses fairly clever escaping rules about quotes and new-lines. Most CSV/TSV
readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is Excel
itself often transforms data without any user verification or control. For example: Excel
routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable Excel
transform: changing the strings “TRUE
” and “FALSE
” into 1 and 0 inside the actual “.xlsx
” file. That is Excel
does not faithfully store the strings “TRUE
” and “FALSE
” even in its native format. Most Excel
users do not know about this, so they certainly are in no position to warn you about it.
This would be a mere annoyance, except it turns out Libre Office
(or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.
We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug.
Our example Excel
spreadsheet was produced using Microsoft Excel
2011 for OSX. We started a new sheet and typed in a few cells by hand. We formatted the header and the numeric column, but did not move off default settings for any of the TRUE/FALSE
cells. The spreadsheet looks like the following:
Original
Excel
spreadsheet (TRUE/FALSE typed in as text, no formatting commands on those cells).You can also download the spreadsheet here.
On OSX
Apple Numbers
can read the sheet correctly. We demonstrate this below.
Sheet looks okay in Apple Numbers.
However, Libre Office
doesn’t reverse the encoding (as it may not know some details of Excel
‘s encoding practices) and also shows corrupted data as we see below.
TRUE/FALSE
represented as 1/0
in Libre Office
, and third row damaged.In practice we have seen the data damage is pervasive and not limited to columns who’s original value was FALSE
. It also does not seem to be a mere presentation problem as the blanked cells behave like blanks in sums and other operators.
Apple Preview
and Quick Look
both also fail to understand the Excel
data encoding, as we show below.
Sheet damaged in Apple Preview (same for Apple Quick Look).
Our favorite analysis hammer (R) appears to read the data correctly (with only the undesired translation of TRUE/FALSE
to 1/0
):
R appears to load what was stored correctly.
But what is going on? It turns out Excel
.xlsx
files are actually zip
archives storing a directory tree of xml
artificts. By changing the file extension from .xlsx
to .zip
we can treat the spreadsheet as a zip
archive and inflate it to see the underlying files. The inflated file tree is shown below.
The file tree representing the
Excel
workbook on disk.Of particular interest are the files xl/worksheets/sheet1.xml
and xl/sharedStrings.xml
. sheet1.xml
contains the worksheet data and sharedStrings.xml
is a shared string table containing all strings used in the worksheet (the worksheet stores no user supplied strings, only indexes into the shared string table). Let’s look into sheet1.xml
:
The sheet data is arranged into rows that contain columns. It is easy to match these rows and cells to our original spreadsheet. For cells containing uninterpreted strings the <c>
tag has has an attributed set to t="s"
(probably denoting type is “string” and to use the <v>
value as a string index). Notice floating point numbers are not treated as shared strings, but stored directly in the <v>
tag. Further notice that the last three columns are stored as 0/1
and have the attribute t="b"
set. My guess is this is declaring the type is “boolean” which then must have the convention that 1
represents TRUE
and 0
represents FALSE
.
This doesn’t seem that complicated, but clearly of all the “Excel
compatible” tools we tried only Apple Numbers
knew all of the details of this encoding (and was able to reverse it). Other than Numbers
only R
‘s gdata
package was able to extract usable data (and event it only recovered the encoded version of the field, not the original user value).
And these are our issue with working with data that has passed through Excel
.
- Excel has a lot of non-controllable data transforms including booleans, and dates. Some of these transforms are non-faithful or not reversible.
- Very few tools that claim to interoperate with
Excel
actually get the corner cases right. Even for simple well-documented data types likeExcel
CSV
export. And definitely not for the native.xlsx
format.
Because working with data that has passed through Excel
is hard to get right, data that has passed through Excel
is often wrong.
Related posts:
- Please stop using Excel-like formats to exchange data
- Must Have Software
- Reading the Gauss-Markov theorem
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.