Non-randomly missing data is hard, or why weights won’t solve your survey problems and you need to think generatively
Throw this onto the big pile of stats problems that are a lot more subtle than they seem at first glance. This all started when Lauren pointed me at the post Another way to see why mixed models in survey data are hard on Thomas Lumley’s blog. Part of the problem is all the jargon in survey sampling—I couldn’t understand Lumley’s language of estimators and least squares; part of it is that missing data is hard.
The full data model
Imagine we have a very simple population of $N$ items $y_1, \ldots, y_N$, normally distributed with a standard deviation known to be 2,

$y_n \sim \textrm{normal}(\mu, 2)$.
To complete the Bayesian model, we'll assume a standard normal prior on $\mu$,

$\mu \sim \textrm{normal}(0, 1)$.
Now we're not going to observe all $N$ values $y_n$, but only a sample of the elements. If the model is correct, our inferences will be calibrated in expectation given a random sample of items $y_n$ from the population.
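To make that concrete, here's a minimal R sketch of the full-data model. With the normal(0, 1) prior and known $\sigma = 2$, the posterior for $\mu$ given a simple random sample is available in closed form, so we can simulate a population and check the inference directly. The seed, population size, and sample size are arbitrary choices for illustration.

# Minimal sketch: simulate the full-data model and compute the conjugate
# posterior for mu from a simple random sample. Seed and sizes are arbitrary.
set.seed(1234)
N <- 1000                       # population size (same as the later simulation)
sigma <- 2                      # known standard deviation
mu <- rnorm(1, 0, 1)            # draw mu from its standard normal prior
y <- rnorm(N, mu, sigma)        # population values

n_samp <- 100                   # arbitrary simple random sample size
y_samp <- sample(y, n_samp)

# Conjugate update: normal(0, 1) prior, normal likelihood with known sigma.
post_prec <- 1 / 1^2 + n_samp / sigma^2
post_mean <- (sum(y_samp) / sigma^2) / post_prec
post_sd   <- sqrt(1 / post_prec)
c(mu = mu, post_mean = post_mean, post_sd = post_sd)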
Missing data
Now let's assume the sample of $y_n$ we observe is not drawn at random from the population. Imagine instead that we have a subset of the $N$ items from the population, and for each item $y_n$, there is a probability $\pi_n$ that the item will be included in the sample. We'll take the log odds of inclusion to be equal to the item's value,

$\pi_n = \textrm{logit}^{-1}(y_n)$.
Now when we collect our sample, we'll do something like poll $N$ people from the population, but each person $n$ only has a $\pi_n$ chance of responding. So we only wind up with $N^{\textrm{obs}}$ observations, with $N^{\textrm{miss}} = N - N^{\textrm{obs}}$ observations missing.
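To see why this matters, here's a quick R sketch with arbitrary values: because items with larger $y_n$ are more likely to respond, the mean of the observed values overestimates $\mu$.

# Quick sketch (arbitrary seed and values): simulate the response mechanism
# and compare the naive observed mean to the true mu.
set.seed(42)
inv_logit <- function(u) 1 / (1 + exp(-u))
N <- 1000
mu <- 0                             # arbitrary true value for illustration
y <- rnorm(N, mu, 2)
z <- rbinom(N, 1, inv_logit(y))     # 1 = responded, 0 = missing
y_obs <- y[z == 1]
mean(y_obs)                         # well above mu = 0: the naive estimate is biased upward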
This situation arises in surveys, where non-response can bias results without careful adjustment (e.g., see Andrew’s post on pre-election polling, Don’t believe the bounce).
So how do we do the careful adjustment?
Approach 1: Weighted likelihood
A traditional approach is to inverse weight the log likelihood terms by the inclusion probability, replacing the log likelihood with

$\sum_{n=1}^{N^{\textrm{obs}}} \frac{1}{\pi_n} \, \log \textrm{normal}(y^{\textrm{obs}}_n \mid \mu, 2)$.
Thus if an item has a 20% chance of being included, its weight is 5.
In Stan, we can code the weighted likelihood as follows (assuming pi is given as data).
for (n in 1:N_obs) target += inv(pi[n]) * normal_lpdf(y[n] | mu, 2);
If we optimize with the weighted likelihood, the estimates are unbiased (i.e., the expectation of the estimate is the true value $\mu$). This is borne out in simulation.
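Here's a sketch of the kind of simulation that shows this (not the full calibration code; seed and replication count are arbitrary). For this model the weighted likelihood has a closed-form optimum, the inverse-probability-weighted mean of the observed values, so we can run many replications and look at the average error of the point estimate.

# Sketch: with known sigma, maximizing the inverse-probability-weighted
# log likelihood gives
#   mu_hat = sum(y_obs / pi_obs) / sum(1 / pi_obs),
# the weighted mean of the observed values.
set.seed(99)                        # arbitrary seed
inv_logit <- function(u) 1 / (1 + exp(-u))
M <- 500                            # number of replications (arbitrary)
err <- numeric(M)
for (m in 1:M) {
  mu <- rnorm(1, 0, 1)
  y <- rnorm(1000, mu, 2)
  z <- rbinom(1000, 1, inv_logit(y))
  y_obs <- y[z == 1]
  pi_obs <- inv_logit(y_obs)        # known inclusion probabilities of respondents
  mu_hat <- sum(y_obs / pi_obs) / sum(1 / pi_obs)
  err[m] <- mu_hat - mu
}
mean(err)                           # near zero: the point estimates look unbiased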
Although the parameter estimates are unbiased, the same cannot be said of the uncertainties. The posterior intervals are too narrow. Specifically, this approach fails simulation-based calibration; for background on SBC, see Dan’s blog post You better check yo self before you wreck yo self.
One reason the intervals are too narrow is that we are weighting the data as if we had observed items when we’ve only observed items. That is, their weights are what we’d expect to get if we’d observed items.
So my next thought was to standardize. Let's take the inverse weights and normalize them so that they sum to $N^{\textrm{obs}}$. That also fails. The posterior intervals are still too narrow under simulation.
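For concreteness, the kind of rescaling I mean looks like this (a sketch of the idea with arbitrary values, not the exact code I ran):

# Sketch (arbitrary setup): rescale the inverse-probability weights so they
# sum to N_obs instead of summing, in expectation, to the population size N.
set.seed(7)
inv_logit <- function(u) 1 / (1 + exp(-u))
y <- rnorm(1000, 0, 2)                       # population with mu = 0, for illustration
y_obs <- y[rbinom(1000, 1, inv_logit(y)) == 1]
w <- 1 / inv_logit(y_obs)                    # raw inverse-probability weights
w <- w * length(y_obs) / sum(w)              # rescaled: sum(w) == N_obs
# In the Stan snippet above, the rescaled w[n] would replace inv(pi[n]):
#   target += w[n] * normal_lpdf(y[n] | mu, 2);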
Sure, we could keep fiddling with the weights in an ad hoc way for this problem until the intervals were empirically better calibrated, but this is clearly the wrong approach. We're Bayesians and should be thinking generatively. Maybe that's why Lauren and Andrew kept telling me I should be thinking generatively (even though they work on a survey weighting project!).
Approach 2: Missing data
What is going on generatively? We poll $N$ people, each of whom has a $\pi_n$ chance of responding, leading to a set of responses of size $N^{\textrm{obs}}$.
Given that we know how $\pi_n$ relates to $y_n$, we can just model everything (in the real world, this stuff is really hard and everything's estimated jointly).
Specifically, the missing items $y^{\textrm{miss}}_n$ each get a parameter representing the value they would've reported had they responded. We also model the response process, so we have an extra $\textrm{bernoulli}(0 \mid \textrm{logit}^{-1}(y^{\textrm{miss}}_n))$ factor for each unobserved value and an extra $\textrm{bernoulli}(1 \mid \textrm{logit}^{-1}(y^{\textrm{obs}}_n))$ factor for each observed value.
This works. Here’s the Stan program.
data {
  int N_miss;
  int N_obs;
  vector[N_obs] y_obs;
}
parameters {
  real mu;
  vector[N_miss] y_miss;
}
model {
  // prior
  mu ~ normal(0, 1);

  // observed data likelihood
  y_obs ~ normal(mu, 2);
  1 ~ bernoulli_logit(y_obs);

  // missing data likelihood and missingness
  y_miss ~ normal(mu, 2);
  0 ~ bernoulli_logit(y_miss);
}
The Bernoulli sampling statements are vectorized, applying to each element of y_obs and y_miss. The suffix _logit indicates the argument is on the log odds scale, so the missing-data statement could equivalently have been written as:
for (n in 1:N_miss)
  0 ~ bernoulli(inv_logit(y_miss[n]));
And here’s the simulation code, including a cheap run at SBC:
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores(logical = FALSE))

printf <- function(msg, ...) { cat(sprintf(msg, ...)); cat("\n") }
inv_logit <- function(u) 1 / (1 + exp(-u))

printf("Compiling model.")
model <- stan_model('missing.stan')

for (m in 1:20) {
  # SIMULATE DATA
  mu <- rnorm(1, 0, 1)
  N_tot <- 1000
  y <- rnorm(N_tot, mu, 2)
  z <- rbinom(N_tot, 1, inv_logit(y))
  y_obs <- y[z == 1]
  N_obs <- length(y_obs)
  N_miss <- N_tot - N_obs

  # FIT STAN MODEL (compiled above)
  fit <- sampling(model,
                  data = list(N_miss = N_miss, N_obs = N_obs, y_obs = y_obs),
                  chains = 1, iter = 5000, refresh = 0)
  mu_ss <- extract(fit)$mu
  mu_hat <- mean(mu_ss)
  q25 <- quantile(mu_ss, 0.25)
  q75 <- quantile(mu_ss, 0.75)
  printf("mu = %5.2f in 50pct(%5.2f, %5.2f) = %3s; mu_hat = %5.2f",
         mu, q25, q75, ifelse(q25 <= mu && mu <= q75, "yes", "no"), mu_hat)
}
Here's some output with random seeds, showing mu, the 50% posterior interval, an indicator of whether mu falls in that interval, and the posterior mean mu_hat.
mu =  0.60 in 50pct( 0.50,  0.60) =  no; mu_hat =  0.55
mu = -0.73 in 50pct(-0.67, -0.56) =  no; mu_hat = -0.62
mu =  1.13 in 50pct( 1.00,  1.10) =  no; mu_hat =  1.05
mu =  1.71 in 50pct( 1.67,  1.76) = yes; mu_hat =  1.71
mu =  0.03 in 50pct(-0.02,  0.08) = yes; mu_hat =  0.03
mu =  0.80 in 50pct( 0.76,  0.86) = yes; mu_hat =  0.81
The only problem I'm having is that this crashes RStan 2.19.2 on my Mac fairly regularly.
Exercise
How would the generative model differ if we polled members of the population at random until we got 1000 respondents? Conceptually it's more difficult in that we don't know how many non-respondents were approached on the way to 1000 respondents. This would be tricky in Stan as we don't have discrete parameter sampling---it'd have to be marginalized out.
Lauren started this conversation saying it would be hard. It took me several emails, part of a Stan meeting, buttonholing Andrew to give me an interesting example to test, lots of coaching from Lauren, then a day of working out the above simulations to convince myself the weighting wouldn't work and code up a simple version that would work. Like I said, not easy. But at least doable with patient colleagues who know what they're doing.