Throw this onto the big pile of stats problems that are a lot more subtle than they seem at first glance. This all started when Lauren pointed me at the post Another way to see why mixed models in survey data are hard on Thomas Lumley’s blog. Part of the problem is all the jargon in survey sampling—I couldn’t understand Lumley’s language of estimators and least squares; part of it is that missing data is hard.
The full data model
Imagine we have a very simple population of $N$ items $y_1, \ldots, y_N$, each drawn independently from a normal distribution with unknown location and known scale,

$y_n \sim \textrm{normal}(\mu, 2).$

To complete the Bayesian model, we'll assume a standard normal prior on the location,

$\mu \sim \textrm{normal}(0, 1).$

Now we're not going to observe all $N$ of the $y_n$; we only get to see a subset of them.
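To make that concrete, here's a minimal R sketch of the full data model; the population size of 1000 is borrowed from the simulation code further down, and the variable names are mine.

# Full data model: standard normal prior on the location, normal(mu, 2) items
mu <- rnorm(1, 0, 1)       # mu ~ normal(0, 1)
N_tot <- 1000              # population size, as in the simulation below
y <- rnorm(N_tot, mu, 2)   # y_n ~ normal(mu, 2)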
Missing data
Now let's assume the sample of $y_n$ we observe is not drawn at random from the population. Instead, suppose the probability that an item is observed depends on its value: item $n$ responds with probability $\textrm{logit}^{-1}(y_n)$, so the response indicator is

$z_n \sim \textrm{bernoulli}(\textrm{logit}^{-1}(y_n)).$

Now when we collect our sample, we'll do something like poll the $N$ members of the population, but we only hear back from the $N_{\textrm{obs}}$ of them with $z_n = 1$; the other $N_{\textrm{miss}} = N - N_{\textrm{obs}}$ responses are missing. Because items with larger values are more likely to respond, the observed responses are biased upward.
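Here's a small R sketch of that missingness mechanism (my own illustration, reusing the setup from the snippet above); since high values are more likely to respond, the naive mean of the observed values overshoots mu.

inv_logit <- function(u) 1 / (1 + exp(-u))

mu <- rnorm(1, 0, 1)
N_tot <- 1000
y <- rnorm(N_tot, mu, 2)               # full population
z <- rbinom(N_tot, 1, inv_logit(y))    # 1 = responded, 0 = missing
y_obs <- y[z == 1]

mean(y_obs) - mu                       # typically positive: naive mean is biased upward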
This situation arises in surveys, where non-response can bias results without careful adjustment (e.g., see Andrew’s post on pre-election polling, Don’t believe the bounce).
So how do we do the careful adjustment?
Approach 1: Weighted likelihood
A traditional approach is to weight the log likelihood terms by the inverse of the inclusion probability, so that an observed item $y_n$ with inclusion probability $\pi_n$ contributes

$\frac{1}{\pi_n} \log \textrm{normal}(y_n \mid \mu, 2)$

to the log likelihood.
Thus if an item has a 20% chance of being included, its weight is 5.
In Stan, we can code the weighted likelihood as follows (assuming pi is given as data).
for (n in 1:N_obs) target += inv(pi[n]) * normal_lpdf(y[n] | mu, 2);
If we optimize with the weighted likelihood, the estimates are unbiased (i.e., the expectation of the estimate $\hat{\mu}$ matches the true value of $\mu$).
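As a quick check of that claim: with the scale fixed at 2, maximizing the weighted log likelihood just gives the inverse-probability-weighted mean, so we can verify the rough unbiasedness directly in R. This is my own sketch, reusing y_obs and inv_logit from the earlier snippet and treating the true inclusion probabilities as known; pi_obs and w are names I've made up for illustration.

pi_obs <- inv_logit(y_obs)     # inclusion probability for each respondent
w <- 1 / pi_obs                # inverse-probability weights
sum(w * y_obs) / sum(w)        # weighted-likelihood estimate of mu; roughly recovers mu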
Although the parameter estimates are unbiased, the same cannot be said of the uncertainties. The posterior intervals are too narrow. Specifically, this approach fails simulation-based calibration; for background on SBC, see Dan’s blog post You better check yo self before you wreck yo self.
One reason the intervals are too narrow is that we are weighting the data as if we had observed $N$ items when we only observed $N_{\textrm{obs}}$ items; the inverse weights sum to roughly $N$, so the weighted likelihood acts as if there were far more data than we actually have.

So my next thought was to standardize. Let's take the inverse weights and normalize so the sum of inverse weights is equal to $N_{\textrm{obs}}$, the number of observations. That helps, but the posterior intervals are still too narrow and still fail calibration.
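Concretely, the renormalization looks something like this (a sketch reusing w and y_obs from the snippet above; w_std is a name I've introduced): rescale the weights so they sum to the number of observations rather than to roughly the population size.

w_std <- w * length(y_obs) / sum(w)   # rescaled inverse-probability weights
sum(w_std)                            # now equals N_obs rather than roughly N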
Sure, we could keep fiddling weights in an ad hoc way for this problem until they were better calibrated empirically, but this is clearly the wrong approach. We’re Bayesians and should be thinking generatively. Maybe that’s why Lauren and Andrew kept telling me I should be thinking generatively (even though they work on a survey weighting project!).
Approach 2: Missing data
What is going on generatively? We poll $N$ members of the population. Each one either responds ($z_n = 1$), in which case we observe $y_n$, or doesn't ($z_n = 0$), in which case $y_n$ is missing.

Given that we know how the missingness depends on the value, namely that the response probability is $\textrm{logit}^{-1}(y_n)$, we can write down the joint generative model and treat the missing $y_n$ as parameters to be estimated along with $\mu$.

Specifically, each observed item contributes its normal likelihood and the probability that it responded, while each missing item contributes its normal likelihood and the probability that it did not respond, with its unknown value averaged over by the sampler.
This works. Here’s the Stan program.
data {
  int N_miss;
  int N_obs;
  vector[N_obs] y_obs;
}
parameters {
  real mu;
  vector[N_miss] y_miss;
}
model {
  // prior
  mu ~ normal(0, 1);

  // observed data likelihood
  y_obs ~ normal(mu, 2);
  1 ~ bernoulli_logit(y_obs);

  // missing data likelihood and missingness
  y_miss ~ normal(mu, 2);
  0 ~ bernoulli_logit(y_miss);
}
The Bernoulli sampling statements are vectorized, applying to each element of y_obs and y_miss. The suffix _logit indicates the argument is on the log odds scale; the missing-data statement could equivalently have been written as:
for (n in 1:N_miss)
  0 ~ bernoulli(inv_logit(y_miss[n]));
And here’s the simulation code, including a cheap run at SBC:
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores(logical = FALSE))

printf <- function(msg, ...) { cat(sprintf(msg, ...)); cat("\n") }
inv_logit <- function(u) 1 / (1 + exp(-u))

printf("Compiling model.")
model <- stan_model('missing.stan')

for (m in 1:20) {
  # SIMULATE DATA
  mu <- rnorm(1, 0, 1)
  N_tot <- 1000
  y <- rnorm(N_tot, mu, 2)
  z <- rbinom(N_tot, 1, inv_logit(y))
  y_obs <- y[z == 1]
  N_obs <- length(y_obs)
  N_miss <- N_tot - N_obs

  # FIT STAN MODEL
  fit <- sampling(model,
                  data = list(N_miss = N_miss, N_obs = N_obs, y_obs = y_obs),
                  chains = 1, iter = 5000, refresh = 0)

  mu_ss <- extract(fit)$mu
  mu_hat <- mean(mu_ss)
  q25 <- quantile(mu_ss, 0.25)
  q75 <- quantile(mu_ss, 0.75)
  printf("mu = %5.2f in 50pct(%5.2f, %5.2f) = %3s; mu_hat = %5.2f",
         mu, q25, q75, ifelse(q25 <= mu && mu <= q75, "yes", "no"), mu_hat)
}
Here's some output from runs with random seeds, showing mu, mu_hat, the 50% posterior interval, and an indicator of whether mu falls in that interval.
mu =  0.60 in 50pct( 0.50,  0.60) =  no; mu_hat =  0.55
mu = -0.73 in 50pct(-0.67, -0.56) =  no; mu_hat = -0.62
mu =  1.13 in 50pct( 1.00,  1.10) =  no; mu_hat =  1.05
mu =  1.71 in 50pct( 1.67,  1.76) = yes; mu_hat =  1.71
mu =  0.03 in 50pct(-0.02,  0.08) = yes; mu_hat =  0.03
mu =  0.80 in 50pct( 0.76,  0.86) = yes; mu_hat =  0.81
The only problem I'm having is that this crashes RStan 2.19.2 on my Mac fairly regularly.
Exercise
How would the generative model differ if we polled members of the population at random until we got 1000 respondents? Conceptually it's more difficult in that we don't know how many non-respondents were approached on the way to 1000 respondents. This would be tricky in Stan, as we don't have discrete parameter sampling; the number of non-respondents would have to be marginalized out.
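For what it's worth, here's a hedged R sketch of just the data-generating side of that alternative design (polling at random until 1000 respondents are collected); N_approached and y_i are names I've made up. The number of non-respondents approached along the way is a latent quantity the analyst never sees, and it's exactly what would have to be marginalized out in Stan.

inv_logit <- function(u) 1 / (1 + exp(-u))

mu <- rnorm(1, 0, 1)
y_obs <- numeric(0)
N_approached <- 0
while (length(y_obs) < 1000) {
  y_i <- rnorm(1, mu, 2)
  N_approached <- N_approached + 1
  if (rbinom(1, 1, inv_logit(y_i)) == 1)   # did this member respond?
    y_obs <- c(y_obs, y_i)
}
N_miss <- N_approached - 1000   # latent: unknown to the analyst under this design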
Lauren started this conversation saying it would be hard. It took me several emails, part of a Stan meeting, buttonholing Andrew to give me an interesting example to test, lots of coaching from Lauren, then a day of working out the above simulations to convince myself the weighting wouldn't work and code up a simple version that would work. Like I said, not easy. But at least doable with patient colleagues who know what they're doing.