[This article was first published on SAS and R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Nick has a paper in the American Statistician warning about bias in multiple imputation arising from rounding data imputed under a normal assumption. One example where you might run afoul of this is if the data are truly dichotomous or count variables, but you model it as normal (either because your software is unable to model dichotomous values directly or because you prefer the theoretical soundness of multivariate normal imputation to, e.g., chained equations). In such cases, one might impute assuming normality, then round the imputed values to plausible integers. The paper shows theoretically the bias that can result if this process is pursued, and also that allowing the “implausible values” will eliminate the bias. (Of course, modeling the missing variable using a logistic regression model will be most appropriate here).Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In another paper, Nick and Stuart Lipsitz (TAS 2001) comment that the method of predictive mean matching (PMM) “ensures that imputed values are plausible, and may be more appropriate if the normality assumption is violated.” Briefly, the PMM method predicts a value from a model for both missing and observed values. The imputation for a subject with a missing value is the observed value of the subject with the nearest predicted value (or random draw of observed values from among the subjects with the nearest predicted values).
How does this play out in practice? Can the PMM method overcome the theoretical rounding bias while still generating only plausible imputed values?
SAS
We begin by simulating dichotomous data, choosing the value of p (probability of 1) = .25, a value with a large absolute bias, according to the paper. We set values to missing with probability 0.5, using a MCAR mechanism. Then we use proc mi (section 6.5, example 9.4) to impute the missing values, assuming normality. The mean and standard error of the mean of y are calculated in proc summary (section 2.1.1) and combined in proc mianalyze. Then the values are rounded manually and the analysis repeated. Next, we impute separately with PMM. Finally, we impute again with a logistic imputation. We use 5 imputations throughout, though 50 would likely be preferable.
Note that a Poisson regression imputation is not yet available for proc mi, so that the exercise is not wholly academic–if you needed to impute count values, you’d have to choose among implausible values, rounding, and PMM. Also note our use of the fcs imputation method, though it is not needed here with an obviously monotone missingness pattern. Finally, note that proc mi here requires at least two variables, for no reason we know of. We generate a normally-distributed and uncorrelated covariate.
data testpmm; do i = 1 to 5000; x = normal(0); y = rand('BINOMIAL', .25, 1); missprob = ranuni(0); if missprob le .5 then y = .; output; end; run; title "Normal imputation"; proc mi data=testpmm out=normal nimpute=5; var x y; fcs reg; run; title2 "Implausible values"; proc summary data = normal mean stderr; by _imputation_; var y; output out=outnormal mean=meany stderr=stderry; run; proc mianalyze data = outnormal; modeleffects meany; stderr stderry; run; title2 "Rounded"; /* make the rounded data */ data normalrnd; set normal; if y lt .5 then y=0; else y=1; run; proc summary data = normalrnd mean stderr; by _imputation_; var y; output out=outnormalrnd mean=meany stderr=stderry; run; proc mianalyze data = outnormalrnd; modeleffects meany; stderr stderry; run; title "regpmm imputation"; proc mi data=testpmm out=pmm nimpute=5; var x y; fcs regpmm; run; ... title "logistic imputation"; proc mi data=testpmm out=logistic nimpute=5; class y; var x y; fcs logistic; run; ...
We omit the summary and mianalyze procedures for the latter imputations. Ordinarily, it would be easiest to do this kind of repetitive task with a macro, but we leave it in open code here for legibility.
The results are shown below
Normal imputation-- Implausible values Parameter Estimate Std Error 95% Confidence Limits meany 0.249105 0.008634 0.230849 0.267362 Normal imputation-- Rounded meany 0.265280 0.006408 0.252710 0.277850 regpmm imputation meany 0.246320 0.006642 0.233204 0.259436 logistic imputation meany 0.255120 0.008428 0.237449 0.272791
As theory suggests, rounding the normally imputed values leads to bias, while using the normal imputations does not (though it results in implausible values). Nether PMM imputation nor direct logistic imputation appear to be biased.
R
We will use the mice package written by Stef van Buuren, one of the key developers of chained imputation. Stef also has a new book describing the package and demonstrating its use in many applied examples. We use 5 imputations throughout, though 50 would likely be preferable.
We begin by creating the data. Note that mice(), like proc mi, requires at least two columns of data. To do the logistic regression imputation, mice() wants the missing data to be a factor, so we make a copy of the data as a data frame object as well.
library(mice) n = 5000 # number of observations m = 5 # number of imputations (should be 25-50 in practice) x = rnorm(n) y = rbinom(n, 1, .25) # interesting point according to Horton and Lipsitz (TAS 2004) unif = runif(n) y[unif < .5] = NA # make half of the Y's be missing ds = cbind(y, x) ds2 = data.frame(factor(y), x)
The mice package works analogously to proc mi/proc mianalyze. The mice() function performs the imputation, while the pool() function summarizes the results across the completed data sets. The method option to mice() specifies an imputation method for each column in the input object. Here we fit the simplest linear regression model (intercept only).
# normal model with implausible values impnorm = mice(ds, method="norm", m=m) summary(pool(with(impnorm, lm(y ~ 1))))
Rounding could be done by tampering with the mids-type object that mice() produces, but there is a more direct way to do this through the post= option. It accepts text strings with R commands that will be applied to the imputed values. Here we use the ifelse() function to make the normal values equal to 0 or 1. The code for the predictive mean matching and logistic regression follow.
impnormround = mice(ds, method="norm", m=m, post= c("imp[[j]][,i] = ifelse(imp[[j]][,i] < .5, 0, 1)","")) imppmm = mice(ds, method="pmm", m=m) implog = mice(ds2, method="logreg", m=m)
The results of summary(pool()) calls are shown below..
> summary(pool(with(impnorm, lm(y ~ 1)))) est se lo 95 hi 95 (Intercept) 0.272912 0.007008458 0.2589915 0.2868325 > summary(pool(with(impnormround, lm(y ~ 1)))) est se lo 95 hi 95 (Intercept) 0.28544 0.00854905 0.2676263 0.3032537 > summary(pool(with(imppmm, lm(y ~ 1)))) est se lo 95 hi 95 (Intercept) 0.277636 0.03180604 0.2145564 0.3407156 > summary(pool(with(implog, lm(y ~ 1)))) est se lo 95 hi 95 (Intercept) 0.2652899 0.00879988 0.2480342 0.2825457
The message on bias is similar, though there is some hint of trouble in the CI for the PMM method (it seems to have a bias towards 0.5). The default option of 3 donors may be too few (this can be tweaked by use of the donors = NUMBER option).
To leave a comment for the author, please follow the link and comment on their blog: SAS and R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.