Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data are censored when samples from one or both ends of a continuous distribution are cut off and assigned one value. Environmental chemical exposure datasets are often left-censored: instruments can detect levels of chemicals down to a limit, underneath which you can’t say for sure how much of the chemical was present. The chemical may not have been present at all, or was present at a low enough level that the instrument couldn’t pick it up.
There are a number of methods of dealing with censored data in statistics. The crudest approach is to substitute a value for every censored value. This may be zero, the detection limit, the detection limit over the square root of two, or some other value. This method, however, biases the statistical results depending on what you choose to substitute.
A more nuanced alternative is to assume the data follows a particular distribution. In the case of environmental data, it often makes sense to assume a log-normal distribution. When we do this, while we may not know what the precise levels of every non-detect, we can at least assume that they follow a log-normal distribution.
In this article, we will walk through fitting a log-normal distribution to a censored dataset using the method of maximum likelihood.
Fitting a censored log-normal distribution
First, we need to generate a contrived dataset. Here we take 10000 samples from a log-normal distribution with a log-mean of 5 and log-standard deviation of 1; call those values \( y \). Any value below \( L = 50 \) will be censored. Let \( y^{\ast} \) be the censored dataset in which every censored point is fixed to the censoring limit. Formally, \( y^{\ast} \) is defined as:
Here we generate a dataset in R:
The spike on the left represents the censored values. The detection frequency, or the proportion of detected values, is 0.8669.
First, let’s do a better job of defining the problem we’re trying to solve. We set out to fit a log-normal distribution to some censored data. Our fundamental problem here is that we want to find the right parameters (the mean and standard deviation) for a log-normal distribution that “fits”” the observed data. But how do you tell what the “right” parameters are, or what the best “fit” is? The maximum likelihood approach is to say “the best distribution is the one that has the highest likelihood of generating the data we’ve observed.” To actually find those parameters, you use an optimization algorithm that tries a bunch of different parameters and hones in on the one that defines a distribution with the highest likelihood of generating the observed data.1
Let’s develop some intuition about how we are going to handle this censored data. With censored data, we don’t know the exact value of each datapoint that is below a censoring limit. The true value could be anywhere below the limit. To deal with this, we assume that the entire dataset is log-normally distributed, so even if we don’t know the exact value of the censored data, we know something about how those points are distributed.
Specifically, let \( P(x \vert \mu, \sigma) \) and \( D(x \vert \mu, \sigma) \) be the log-normal probability density function (PDF) and cumulative density function (CDF), respectively, taking as parameters a mean \( \mu \) and standard deviation \( \sigma \). We seek to find values for \( \mu \) and \( \sigma \) that maximize the likelihood function \( L(y^{\ast} \vert \mu, \sigma) \). A key assumption is that that the measurements in \( y^{\ast} \) are independent.
If the point is above the censoring limit, then its likelihood is defined by the log-normal PDF. If it below the censoring limit, we don’t know where the real value was exactly; all we know is that it’s less than \( L \). As such, its likelihood is the cumulative probability of a value being below \( L \) in a log-normal distribution with mean \( \mu \) and standard deviation \( \sigma \).
The total likelihood is found by multiplying all the individual likelihoods.2 Formally:
It’s often more convenient, especially for computers, to work with the log-likelihood function. It’s easy to get it, just take the log of both sides:
The R translation uses the built in dlnorm
and plnorm
functions for the log-normal PDF and CDF, choosing which one to use based on whether the data point is censored. It also checks the parameters to make sure they make sense: a negative standard deviation is invalid, so the function rejects that parameter set.
We use the maxLik
function from the maxLik
3 package to perform the estimation. We need to supply starting values for the parameters, so let’s have it start with a mean of 0 and standard deviation of 1.
The fitted distribution gets pretty close to recovering the correct mean and standard deviation. Censored data loses information, so in general you shouldn’t expect to get perfect estimates here.
Let’s plot the original, uncensored data \( y \) against the distribution fitted to the censored data to see how they compare visually:
The core idea with this overall approach is that we assume something about the distribution of the censored data points so that we can get more nuanced parameter estimates than if we replaced them with a constant, like zero or the detection limit.
Future posts will go into regression techniques with censored datasets.
-
In “generative” statistical modelling you assume that the data you have was generated by some definable process. The task of the modeller is to write down how they think the data was generated, this is your “model” of how the world works. Everything that follows (parameter fitting, inference) is based off of this core assumption. In our case, we assume the data was generated from a log-normal distribution; that’s our “model” of how the data came to be. As it happens, this is exactly where the data came from! ↩
-
This comes from the assumption that each measurement is independent. Recall that the probability of two independent events occurring is the product of the individual probabilities: \( P(A \, \mathrm{and} \, B) = P(A) \cdot P(B) \) ↩
-
Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1. ↩
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.