Can you spot the Error?
Peter Huber referred to “the rawness of raw data”, a kind of data we would not expect to find in a textbook. The book of Fahrmeir and Tutz on multivariate modelling refers to the visual impairment data from Liang et al., 1992 in table 3.12:
Nothing wrong here at first sight; but how would you tell? There are some people who are actually able to look at non-trivial table data and spot “the round peg in the square hole”, but that just won’t work for the rest of us.
As you might guess, I am going to make a case for graphics here.
Let’s start with what the mainstream would do: plot the data in a dotplot-like display using the trellis paradigm of conditioning. I used ggplot2 to make sure the trellis display is state of the art. A simple
qplot(count, side, data=visual2, colour=impaired) + facet_grid(age ~ race)
gives me:
(I still have a hard time finding that syntax intuitive …) Surprisingly, this plot is already sufficient to spot the “problem” in the data, although some important properties of the data can’t be seen here.
A mosaic plot makes the whole thing even easier:
(impairment cases highlighted; the left and right bars are the left and right eyes)
The left and right groups are (what a surprise) always of the same size, except for the 70+ black group – hard to believe that 110 cyclopes showed up in this group, lacking a right eye.
In the mosaic plot the higher proportion of impaired right eyes for 70+ blacks immediately catches the eye, but what reveals the error is the missing independence between race and side for 70+. That implies that we have too few cases here, and what is '226' in the table should actually be '336'.
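The "same size for left and right" rule can also be checked numerically. The following is a minimal sketch in base R, assuming a data frame with the same columns as `visual2` in the `qplot()` call (`count`, `side`, `impaired`, `age`, `race`). The data frame `toy`, its counts, the "60-69" age label and the level names "left"/"right" are invented for illustration; they are not the values from table 3.12.

```r
# Toy data with the same columns as `visual2`.
# All counts are made up; they are NOT the numbers from table 3.12.
toy <- data.frame(
  age      = rep(c("60-69", "70+"), each = 4),
  race     = "black",
  side     = rep(c("left", "left", "right", "right"), times = 2),
  impaired = rep(c("yes", "no"), times = 4),
  count    = c(10, 90, 12, 88,   # 60-69: both sides sum to 100
               20, 80, 30, 40)   # 70+: left sums to 100, right only to 70
)

# Sum the counts over `impaired` to get eyes per age x race x side.
totals <- aggregate(count ~ age + race + side, data = toy, FUN = sum)

# One row per age x race group, one column per side
# (reshape() names them count.left and count.right).
wide <- reshape(totals, idvar = c("age", "race"),
                timevar = "side", direction = "wide")

# Every person has two eyes, so the difference should be zero.
wide$diff <- wide$count.left - wide$count.right
mismatch <- wide[wide$diff != 0, ]
print(mismatch)
```

Any row left in `mismatch` is a group where the two sides do not add up, which is the kind of discrepancy the mosaic plot shows visually for the 70+ group.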
Here is the (corrected) data.