Site icon R-bloggers

Excel spreadsheets are hard to get right

[This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Any practicing data scientist is going to eventually have to work with a data stored in a Microsoft Excel spreadsheet. A lot of analysts use this format, so if you work with others you are going to run into it. We have already written how we Excel-like-formats-to-exchange-data/”>don’t recommend using Excel-line formats to exchange data. But we know if you are going to work with others you are going to have to make accommodations (we even built our own modified version of gdata‘s underlying Perl script to work around a bug).

But one thing that continues to confound us is how hard it is to read Excel data correctly. When Excel exports into CSV/TSV style formats it uses fairly clever escaping rules about quotes and new-lines. Most CSV/TSV readers fail to correctly implement these rules and often fail on fields that contain actual quote characters, separators (tab or comma), or new-lines. Another issue is Excel itself often transforms data without any user verification or control. For example: Excel routinely turns date-like strings into time since epoch (which it then renders as a date). We recently ran into another uncontrollable Excel transform: changing the strings “TRUE” and “FALSE” into 1 and 0 inside the actual “.xlsx” file. That is Excel does not faithfully store the strings “TRUE” and “FALSE” even in its native format. Most Excel users do not know about this, so they certainly are in no position to warn you about it.

This would be a mere annoyance, except it turns out Libre Office (or at least LibreOffice_4.3.4_MacOS_x86-64) has a severe and silent data mangling bug on this surprising Microsoft boolean type.

We first ran into this in client data (and once the bug triggered it seemed to alter most of the columns), but it turns out the bug is very easy to trigger. In this note we will demonstrate the data representation issue and bug.

Our example Excel spreadsheet was produced using Microsoft Excel 2011 for OSX. We started a new sheet and typed in a few cells by hand. We formatted the header and the numeric column, but did not move off default settings for any of the TRUE/FALSE cells. The spreadsheet looks like the following:



Original Excel spreadsheet (TRUE/FALSE typed in as text, no formatting commands on those cells).

You can also download the spreadsheet here.

On OSX Apple Numbers can read the sheet correctly. We demonstrate this below.



Sheet looks okay in Apple Numbers.

However, Libre Office doesn’t reverse the encoding (as it may not know some details of Excel‘s encoding practices) and also shows corrupted data as we see below.



TRUE/FALSE represented as 1/0 in Libre Office, and third row damaged.

In practice we have seen the data damage is pervasive and not limited to columns who’s original value was FALSE. It also does not seem to be a mere presentation problem as the blanked cells behave like blanks in sums and other operators.

Apple Preview and Quick Look both also fail to understand the Excel data encoding, as we show below.



Sheet damaged in Apple Preview (same for Apple Quick Look).

Our favorite analysis hammer (R) appears to read the data correctly (with only the undesired translation of TRUE/FALSE to 1/0):



R appears to load what was stored correctly.

But what is going on? It turns out Excel .xlsx files are actually zip archives storing a directory tree of xml artificts. By changing the file extension from .xlsx to .zip we can treat the spreadsheet as a zip archive and inflate it to see the underlying files. The inflated file tree is shown below.



The file tree representing the Excel workbook on disk.

Of particular interest are the files xl/worksheets/sheet1.xml and xl/sharedStrings.xml. sheet1.xml contains the worksheet data and sharedStrings.xml is a shared string table containing all strings used in the worksheet (the worksheet stores no user supplied strings, only indexes into the shared string table). Let’s look into sheet1.xml:


The XML representing the sheet data.

The sheet data is arranged into rows that contain columns. It is easy to match these rows and cells to our original spreadsheet. For cells containing uninterpreted strings the <c> tag has has an attributed set to t="s" (probably denoting type is “string” and to use the <v> value as a string index). Notice floating point numbers are not treated as shared strings, but stored directly in the <v> tag. Further notice that the last three columns are stored as 0/1 and have the attribute t="b" set. My guess is this is declaring the type is “boolean” which then must have the convention that 1 represents TRUE and 0 represents FALSE.

This doesn’t seem that complicated, but clearly of all the “Excel compatible” tools we tried only Apple Numbers knew all of the details of this encoding (and was able to reverse it). Other than Numbers only R‘s gdata package was able to extract usable data (and event it only recovered the encoded version of the field, not the original user value).

And these are our issue with working with data that has passed through Excel.

Because working with data that has passed through Excel is hard to get right, data that has passed through Excel is often wrong.

Related posts:

  1. Please stop using Excel-like formats to exchange data
  2. Must Have Software
  3. Reading the Gauss-Markov theorem

To leave a comment for the author, please follow the link and comment on their blog: Win-Vector Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.