Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data that’s been only partially observed
I’ve been updating my skills in fitting models to truncated data and was pleased to find that, like so much else in statistics, it’s easier than it used to be.
First, some definitions:
- censored data is where some observations are cut off at some maximum or minimum level; those data points are “off the scale” but at least we know they exist and we know which direction they are off the scale. For example, if we were analysing the life span of people born in 1980, anyone who has survived to the end of 2017 has their age at death recorded as “37 or higher”. We know they’re in the data, and their value is at least some minimum amount, but we don’t know with precision what it will end up being.
- truncated data is where data beyond some maximum or minimum level is just missing. Typically this is because of some feature of the measurement process eg anything smaller than X just doesn’t show up.
I’ve got some future blog posts on a more substantive real life issue where I have count data for which, in some situations, I only see the observations with a count higher than some threshold. Let’s imagine, for example, we are looking at deaths per vehicle crash, and are dependent for measurement on newspapers that only report crashes with at least two deaths, even though many crashes have one or zero deaths.
Here’s a greatly simplified example. I generate 1,000 observations of counts, with an average value of 1.3. Then I compare that original distribution with what I’d get if only those of two or higher were observed. It looks like this:
…generated by this code:
Estimating the key parameter lambda
for the full data (a
) works well, giving an estimate of 1.347 that is just over one standard error from the true value of 1.3. The fitdistr
function from the MASS
package distributed with base R does a nice job in such circumstances.
But the mean value of b
is badly biased upwards if used to estimate lambda
; at 2.6 the mean of b
is roughly twice the correct value of the mean of the underlying distribution. Obviously, removing a whole bunch of data at one end of the distribution is going to make naive estimation methods biased. So we need specialist methods that try to estimate lambda on the assumption that the data come from a Poisson distribution, but only the right-most part of it.
Maximum likelihood
The fitdistrplus
package by Aurélie Siberchicot, Marie Laure Delignette-Muller and Christophe Dutang in combination with truncdist
by Frederick Novomestky and Saralees Nadarajah gives a straightforward way to implement maximum likelihood estimation of a truncated distribution. Methods other than maximum likelihood are also available if required.
You need to make truncated versions of the dpois
and ppois
functions (or their equivalents for whatever distribution you are modelling) and use these within fitdistrplus::fitdist
, which has some added functionality over MASS::fitdistr
used in the previous chunk of code.
Note that to do this I specified the lower threshold as 1.5; as all the data are integers this effectively means we only observe the observations of 2 or more, as is the case. We also needed to specify a reasonably starting value for the estimate of lambda
– getting this too far out will lead to errors.
This method gives us an estimate of 1.34 with a standard error of 0.08, which is pretty good given we’ve only got 398 observations now. Of course, we’ve got the luxury of knowing for sure the true data generating process was Poisson.
Bayesian
For an alternative Bayesian method, Stan makes it easy to describe data and probability distributions as truncated. The Stan manual has an entire chapter on truncated or censored data. Here’s an example Stan program to estimate the mean of the original Poisson distribution from our truncated data. As well as the original data, which I call x
in this program, we need to tell it how many observations (n
), the lower_limit
that it was truncated by, and whatever is needed to characterise a prior distribution for the parameter we’re estimating.
The key bits of the program below are:
- In the
data
chunk, specify that the data forx
has a lower limit oflower_limit
- In the
model
chunk, specify that distribution ofx
is truncated viaT[lower_limit, ]
With a little more effort it’s possible to extend this by making Stan estimate lower_limit
from the data; not necessary in my hypothetical example because I know where the minimum cut-off point of observed data lies.
Here’s how the data are fed to Stan from R:
This gives us a posterior distribution for lambda
that matches that from the fitdistrplus
method: 1.35 with a standard deviation of 0.08. The rstan
package automatically turns this into a ggplot2
image of a credibility interval:
So, nice. Two simple ways to estimate the original distribution from truncated data.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.