Site icon R-bloggers

Why multiple imputation?

[This article was first published on StaTEAstics., and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< !DOCTYPE html>< !-- saved from url=(0014)about:internet --> < !-- Styles for R syntax highlighter --> < !-- R syntax highlighter --> < !-- MathJax scripts -->

Background

In the forth coming week, I will be giving a presentation on the fundamentals of imputation to my colleagues. One of the most important idea I would like to present is multiple imputation.

In my last post, I have given a small example of multiple imputation, but it does not provide the evidence why we should use it. This is the aim of this post in which I have taken a small example to illustrate how inference may change under single and multiple imputation.

Like most international organization and statistical institutes, we use the least-squares growth rate whenever possible to estimate the growth of a certain indicator. In brevity, a linear regression on the logged variable with respect to time is fitted, then the exponential of the time coefficient minus one will be the growth rate. \[ LSGR = (e^{\hat{\beta}} - 1) \times 100 \]

Since a linear regression is fitted, we can conveniently test whether a growth exist. In this example, I have chosen the production of wheat of Romania to illustrate my point of why multiple imputation should be used for drawing inference and to avoid being over confident.

Method

Lets load the data and simulate some missing value.


## Load the data, the production domain of Romania
romaniaMaize.df = structure(list(
    year = 1961:2011,
    production = c(5739600, 4932400, 6022700, 6691700,
        5877000, 8021600, 6857879, 7105287, 7675808, 6535512,
        7850300, 9816700, 7397200, 7439600, 9240661,
        11583200, 10113559, 9671200, 11768100, 10563300,
        8321500, 10550000, 10296000, 9891000, 11903200,
        10900900, 7526900, 7182200, 6761800, 6809604,
        10497338, 6828270, 7987450, 9343000, 9923132,
        9607944, 12686700, 8623370, 10934800, 4897600,
        9119200, 8399800, 9576985, 14541564, 10388499,
        8984729, 3853920, 7849080, 7973258, 9042032,
        11717591),
    areaHarvested = c(3428400, 3106766, 3379300, 3319100,
        3305800, 3287700, 3221075, 3344013, 3293068,
        3084036, 3131400, 3196500, 2956800, 2963000,
        3304691, 3377600, 3317671, 3178798, 3311291,
        3287560, 3077900, 2764387, 2935120, 3090530,
        3090100, 2858100, 2787100, 2579400, 2733400, 2466735,
        2574999, 3335920, 3065682, 2983400, 3109236, 3277041,
        3037700, 3128900, 3013400, 3049400, 2974000, 2761223,
        3119104, 3196130, 2609110, 2512944, 2263080, 2432210,
        2333501, 2094249, 2587102),
    seed = c(124000, 135000, 133000, 132000, 139000, 137000,
        140000, 140000, 135000, 135000, 138000, 135000,
        133000, 140000, 120000, 107000, 134000, 139000,
        133000, 135000, 124000, 144000, 148000, 145000,
        135000, 130000, 122000, 130000, 139000, 145000,
        180000, 180000, 142500, 125600, 130900, 90587,
        124753, 77000, 76000, 69907, 70692, 72785, 79065,
        67000, 64600, 63829, 65123, 62851, 59985, 53880,
        64744)),
    .Names = c("year", "production", "areaHarvested", "seed"),
    row.names = 6110:6160, class = "data.frame")

## Simulate missing value in production
n = NROW(romaniaMaize.df)
pctMiss = 0.5
myseed = 587
set.seed(myseed)
missIndex = sample(n, n * pctMiss)
romaniaMaize.df$productionSim = romaniaMaize.df$production
romaniaMaize.df[missIndex, "productionSim"] = NA

Then I will impute the data using the mi package assuming that production depends on the seed utilized and area harvested. I have imputed the data 50 times for the plot, but according to Rubin 3 to 10 imputation would be sufficient.


## Create missing information matrix
info = mi.info(romaniaMaize.df)
info = mi.info.update.include(info,
    list(areaCode = FALSE, unsdSubReg = FALSE, production = FALSE,
         year = TRUE))

## I assume that the production has a trend and is determined by the
## area harvested and seed utilized.
info = mi.info.update.imp.formula(info,
    list(productionSim = "productionSim ~ year + areaHarvested + seed"))


## multiple imputation via mi
romaniaMaize.mi = mi(romaniaMaize.df, info = info, n.imp = 50,
    n.iter = 100, seed = myseed, run.past.convergence = TRUE)

Then we compute the single imputation by taking the average value of all the multiple imputation.

## Compute the single imputation
romaniaMaize.df$imputed = romaniaMaize.df$productionSim
romaniaMaize.df[is.na(romaniaMaize.df$imputed), "imputed"] =
    rowMeans(sapply(romaniaMaize.mi@imp,
                    FUN = function(chain) chain$productionSim@random))

The following plot illustrate the difference between multiple and single imputation. For single imputation we impute only once, while in multiple imputation we impute multiple times to reflect the uncertainty, each set of imputation can be interpreted as a potentially observed realization.


## Plot and compare the imputations
with(romaniaMaize.df, plot(year, production, type = "n",
                          ylim = c(0, max(production, imputed))))
sapply(mi.completed(romaniaMaize.mi), FUN = function(chain)
       lines(romaniaMaize.df$year, chain$productionSim,
             col = rgb(0.5, 0.5, 0.5, alpha = 0.3)))
with(romaniaMaize.df, lines(year, production, 
                          ylim = c(0, max(production, imputed))))
with(romaniaMaize.df, points(year, production, pch = 19, 
                          ylim = c(0, max(production, imputed)),
                          col = is.na(productionSim) + 1))
with(romaniaMaize.df, lines(year, imputed, col = "steelblue", lwd = 2))

Compute the growth rate on single and multiple imputation and test using the frequentists approach.

> ## Fit the regression on single/multiple imputed data set
> single.fit = lm(log(romaniaMaize.df$imputed) ~ year,
+     data = romaniaMaize.df)
> multiple.fit = lm.mi(log(productionSim) ~ year, romaniaMaize.mi)
> 
> ## Compare the growth rate
> (exp(coef(single.fit)[2]) - 1) * 100
     year 
0.4365756 
> ((exp(coef(multiple.fit)[2]) - 1) * 100)
     year 
0.4377279 
> 
> ## Compare the P-value
> 2 * pt(single.fit$coef[2]/sqrt(diag(vcov(single.fit)))[2],
+        df = single.fit$df.residual, lower.tail = FALSE)
       year 
0.001873983 
> 2 * pt(multiple.fit@mi.pooled$coef[2]/multiple.fit@mi.pooled$se[2],
+        df = single.fit$df.residual, lower.tail = FALSE)
     year 
0.1726464 

From the result we can see that the point estimates are very similar. However, if we perform the hypothesis testing on the singularly imputed dataset, we would have came to the conclusion that the production of Romania has been growing at a rate which is statistical significant over the past half a century. On the other hand, the test on the multiplied dataset has a P-value greater than the typical threshold of 5 percent.

Why are the results different? Lets not get into all the issues of the hypothesis testing under the Frequentist paradigm. The main disparity results from the fact that we take into account of the uncertainty of imputation and correctly estimate the standard error. Rather than providing a single point we provide a distribution of possible points.

To leave a comment for the author, please follow the link and comment on their blog: StaTEAstics..

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.