A more systematic look at suppressed data by @ellis2013nz
free range statistics - R
[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last week I blogged about some different ways of dealing with data in a cross tab that has been suppressed as a means of disclosure control, when the count in a cell is less than six. I tried simple replacement of those cells with “3”, two different multiple imputation methods, and left-censored Poisson regression based on survival methods. I tested those methods on a single two-way simulated cross-tab of the counts of three different types of animals in four different regions, with two suppressed cells.
An obvious piece of unfinished business is to extend the comparison I made to many different randomly generated tables. I have now done this, and I’ve also extended out the methods I’m comparing to cover three different multiple imputation methods and four different simple substitutions (with 0, 1, 3 and 5).
Because my tables are generated by a data generating process under my control, I know the true values of coefficients for the fully saturated Poisson family generalized linear model that is being used to analyse each table with each imputation method. So I can compare the true coefficients with those for each method, including using the full observed data (as though you had access to it, perhaps because you work for the agency that suppressed the data in the first instance). Note – if a fully saturated Poisson family generalized linear model sounds complicated, you can remember that it is the Chi square test for independence of variables in a cross tab which is usually one of the first two or three tools taught in a basic statistics methods class.
Some surprising results
It’s fair to say the results surprised me.
The best of the nine methods was simply to replace all values below six with the number 5
Four of the methods delivered better results on average in uncovering the true data generating process than did fitting the model to the real, unsuppressed data
My multiple imputation that provides numbers from zero to five with probabilities based on a truncated Poisson distribution is the best of the fancy methods, but only very slightly better than the default imputation from the mice package, which allows imputed values to be greater than 5 even though we know the reality wasn’t that big
left-censored Poisson regression with survival analysis methods still performs not particularly well.
My simulations were of tables from two to ten columns and two to ten rows. All the true coefficients for effect sizes including interactions were from a uniform distribution between -1 and 1. The intercept was drawn from a uniform distribution from 1 to 3. The intercept was excluded from the calculation of the mean square error of coefficients, because it’s usually not of substantive interest and because it is on a different scale to the other coefficients.
Looking at the relationship of error in coefficients to the size of the table and to the proportion of cells that were suppressed for being below six, we get these charts:
There are some interesting findings here too:
all methods perform worse as the proportion of cells under six increases, but for the poorer performing methods this is a stronger effect
all methods perform worse with more coefficents to estimate, but this is strongest for the method using the unsuppressed data
the unsuppressed data is clearly bimodal, with a cluster of poorly performing models well separated from an otherwise good performance.
Unusual behaviour of model with the model with full unsuppressed data
This last point about bimodality of performance of the model with the full unsuppressed data is particularly interesting. It wasn’t obvious in my first boxplot, because while that’s a nice simple comparison graphic it will usually hide bi-modality in a distribution. Here’s density plots instead:
It’s worth having a closer look at this. Unlike the other methods, using the full data normally performs exceptionally well (as would be expected) but sometimes extremely badly. Is there any pattern? Let’s look at a plot of the mean squared error across the size of the cross tab and the proportion of cells that were suppressed, just for the method that uses the full data despite suppression:
There’s a definite pattern that the more data were below 6, the worse this model does, but there are still occasions even when half the data are below 6 the model does ok (small reddish dots).
Luckily I had written the function that compares the models so it can be reproducible and provide the full set of coefficients. I had a look at one run that was particularly poor performing for the full data model, number 125. Here is the actual data for run 125:
animals
region
count
expected
a
A
0
4.9508590
b
A
5
3.7161291
c
A
6
12.5528917
d
A
17
12.6124215
e
A
2
5.2896170
f
A
1
3.5390988
g
A
10
6.7345340
h
A
6
6.2255400
i
A
3
1.8735223
a
B
7
8.3839561
b
B
5
2.8352860
c
B
20
23.8147302
d
B
21
18.3545554
e
B
6
7.6222396
f
B
14
14.2599204
g
B
18
16.0890494
h
B
17
17.6865488
i
B
2
2.4653642
a
C
5
3.7365829
b
C
1
4.4282008
c
C
15
13.6843526
d
C
6
6.3213678
e
C
7
4.6640287
f
C
4
1.1177665
g
C
9
8.5946791
h
C
1
2.2103981
i
C
0
0.7286345
… and here are the coefficients from all the models, compared to the true coefficients of the underlying data generating process:
So that’s interesting. It’s revealing that the full data model performs badly in a similar way to the absurb “replace all suppressed cells with zero” model. What’s going on here is that the observed data has an unusual number of very low values (eg a-A-0, f-A-1, b-C-1). The low values lead to a very unstable estimate of our Poisson regression model, and coefficients leap up to +25 and -25 (when the real values are between -1 and 1). This happens in enough runs that these wildly underperforming models totally mess up the record of the full data model.
What we’re seeing here is that this sort of model – a Poisson family GLM fit to a cross tab – performs erratically (and potentially delivers ridiculous results) when lots of counts are low. Which of course has been known since the early days of the Chi square test and the warning not to use it when expected counts are less than five.
A probable solution is clear – some kind of shrinkage of coefficient estimates, either through regularization if we’re a frequentist or a more sensible prior if we’re a Bayesian. Then I suspect the full data model would perform quite adequately.
Code
Here’s the R code for today’s exercise, all in one chunk:
To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.