Spreadsheet errors
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
For my sins, I have done more than my fair share of analysis in Excel. I am quite capable of building and maintaining 130Mb spreadsheets (I had a dozen of them for one client). Excel is pretty much installed everywhere, so it is sometimes the only way to get started getting commercial value of the data in the organisation. But I don’t like it and let’s have a look at one reason why. In order not to always pick on Microsoft, we use another application, but you get the same results with Excel.
Y | X1 | X2 | X3 | X4 |
---|---|---|---|---|
5.88 | 1 | 1 | 1 | 1 |
2.56 | 6 | 1 | 1 | 1 |
11.11 | 1 | 1 | 1 | 1 |
0.79 | 6 | 1 | 1 | 1 |
0.00 | 6 | 1 | 1 | 1 |
0.00 | 0 | 1 | 1 | 1 |
15.6 | 8 | 1 | 1 | 1 |
3.7 | 4 | 1 | 1 | 1 |
8.49 | 3 | 1 | 1 | 1 |
51.2 | 6 | 1 | 1 | 1 |
14.2 | 7 | 1 | 1 | 1 |
7.14 | 5 | 1 | 1 | 1 |
4.2 | 7 | 1 | 1 | 1 |
6.15 | 4 | 1 | 1 | 1 |
10.46 | 6 | 1 | 1 | 1 |
0.00 | 8 | 1 | 1 | 1 |
10.42 | 2 | 1 | 1 | 1 |
17.36 | 5 | 1 | 1 | 1 |
13.41 | 8 | 1 | 1 | 1 |
41.67 | 0 | 1 | 1 | 1 |
2.78 | 0 | 1 | 1 | 1 |
2.98 | 8 | 1 | 1 | 1 |
9.62 | 7 | 1 | 1 | 1 |
0.00 | 0 | 1 | 1 | 1 |
4.65 | 5 | 1 | 0 | 2 |
3.13 | 3 | 1 | 0 | 2 |
24.58 | 6 | 1 | 0 | 2 |
0.00 | 1 | 1 | 0 | 2 |
5.56 | 4 | 1 | 0 | 2 |
9.26 | 3 | 1 | 0 | 2 |
0.00 | 0 | 1 | 0 | 2 |
0.00 | 0 | 1 | 0 | 2 |
3.13 | 1 | 1 | 0 | 2 |
0.00 | 0 | 1 | 0 | 2 |
7.56 | 5 | 0 | 1 | 3 |
9.93 | 6 | 0 | 1 | 3 |
0.00 | 8 | 0 | 1 | 3 |
16.67 | 6 | 0 | 1 | 3 |
16.89 | 7 | 0 | 1 | 3 |
13.71 | 6 | 0 | 1 | 3 |
6.35 | 5 | 0 | 1 | 3 |
2.5 | 3 | 0 | 1 | 3 |
2.47 | 7 | 0 | 1 | 3 |
21.74 | 3 | 0 | 1 | 3 |
23.6 | 8 | 0 | 0 | 4 |
11.11 | 8 | 0 | 0 | 4 |
0.00 | 7 | 0 | 0 | 4 |
3.57 | 8 | 0 | 0 | 4 |
2.9 | 5 | 0 | 0 | 4 |
2.94 | 3 | 0 | 0 | 4 |
2.42 | 8 | 0 | 0 | 4 |
18.75 | 4 | 0 | 0 | 4 |
0.00 | 5 | 0 | 0 | 4 |
2.27 | 3 | 0 | 0 | 4 |
However, the predictors are (accidentally) collinear so no meaningful fit is possible, unless one of them are dropped. We see that very easily if we try to do the analysis using the R statistical computing and analysis platform:
> d <- read.delim("clipboard") # Read DATA range from clipboard > summary(lm(Y ~ ., data = d)) Call: lm(formula = Y ~ ., data = d) Residuals: Min 1Q Median 3Q Max -11.222 -5.821 -2.546 3.171 40.750 Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 4.1945 3.9749 1.055 0.296 X1 0.3862 0.5652 0.683 0.497 X2 0.2308 3.1590 0.073 0.942 X3 3.7072 2.9922 1.239 0.221 X4 NA NA NA NA Residual standard error: 10.14 on 50 degrees of freedom Multiple R-squared: 0.04767, Adjusted R-squared: -0.009466 F-statistic: 0.8343 on 3 and 50 DF, p-value: 0.4814
We have highlighted the message that R has automatically dropped one of the predictors.
Everybody likes to pick on Excel, so let us load the data into version 3.3.2 of LibreOffice, the free Open Source personal productivity suite, instead. It faithfully implements many of the worst features of Excel. You can grab a copy of the spreadsheet GS-spreadsheet-error.ods yourself and see the results. The relevant function in both Excel and LibreOffice for linear regression is LINEST and applying it to the data set give us:
Of the 16 values returned by the function, fully 12 of them are incorrect (highlighted in red), and the ‘#VALUE!’ entries are the only thing that suggests we may have a problem. (The ‘#N/A’ values are a feature of the function and not a problem.) Excluding the X4 values from the function call gives meaningful (and correct) results:
There is so much wrong with doing even this trivial analysis in a spreadsheet that it is hard to know where to start. Some of the problems:
- Garbage results instead of errors
- Instead of giving meaningful errors or warnings, the spreadsheets simply produce garbage results. This is nearly impossible to debug.
- No help on how to correct the problem
- In the erroneous results of the first figure, there is no clue, no hint, no help to figure out how to correct the problem. You could argue about R correcting the issue ”automagically”, but at least it finds a solution to the problem and tells you about it.
- Error prone output formats
- I put in the row and column headings because otherwise it is just too hard to read the data. Where does the function stuff the F statistics again?
And don’t get me started on version control and documentation. Don’t even mention that the maths in Excel are wrong. Remember: Friends do not let friends do data analysis in spreadsheets.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.