Site icon R-bloggers

The Problem with Percentiles

[This article was first published on Not Normal Consulting, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< !DOCTYPE html>< !-- saved from url=(0014)about:internet --> < !-- Styles for R syntax highlighter --> < !-- R syntax highlighter --> < !-- MathJax scripts -->

The Problem with Percentiles

Percentiles (or, more accurately, quantiles) are deeply embedded in the psyche of actuaries, statisticians and similar beasts. They are referred to implicitly in the Solvency 2 directive (Article 100, Value at Risk) without explanation. They are so ingrained that we normally use them without thinking, which I was doing until an issue came up at a client which forced me to look deeper.

In this article I'll consider what we actually mean by percentiles / quantiles, what their relationship is to samples generated by Monte Carlo methods, and how to estimate confidence intervals for them.

Sample Quantiles

Let's generate samples from a simple frequency-severity model like one we might us for cat or op risk modelling. We'll have an average arrival rate of 1 per year, an “average” loss-given-incidence of 1,000,000 and a “worst case” loss of around 8,000,000.

First lets define a function to generate the sample in R:

fGenerateSimulations <- function(freq, shape, scale, nb_sims) {

    # A vector of labels
    sim_nos <- 1:nb_sims

    # A vector of poisson arrivals
    poisson_counts <- cbind(sim = sim_nos, incidents = rpois(n = nb_sims, lambda = freq))

    # One for each incident
    severities <- rlnorm(n = sum(poisson_counts[, 2]), meanlog = scale, sdlog = shape)

    # Sum the incidents and sum up severities per simulation
    sims <- merge(poisson_counts, aggregate(severities, by = list(rep(sim_nos, 
        poisson_counts[, 2])), FUN = sum), all.x = TRUE, by.x = 1, by.y = 1)

    # The merge will generate NA where there was no incident, change it to
    # zero
    sims[which(0L == sims[, 2]), 3] <- 0

    return(sims)

}

Now we'll generate a thousand simulations:

set.seed(1)
one_simulation_set <- fGenerateSimulations(1, 2, 12, 1000)  # parameters set to roughly match the calibration notes above

Now standard operation procedure should just be to apply the quantile function to read off (say) a 99.5th percentile Value at Risk. Or we might shudder load the simulations into Excel and apply the PERCENTILE function to the same end. But let's take a step back and consider what we are doing. We want the value for which 99.5% of the distribution is less than that value. Mathematically we are looking for \( q \) defined by \( \int_{-\infty}^qf(x)dx = 0.995 \). If the integral has a convenient form then we have (say) \( F(q) = 0.995 \) and hopefully \( F^{-1}(0.995) = G(0.995) = q \) for some known function \( G \).

Which is a lot of if's and hopefully's. What we actually have is a set of discrete sample points. But that should make things easy - each point contributes one over the number of simulations to the total, so let's just order the simulations and count up until we hit the requred number.

That sounds reasonable until you start to consider the corner cases. What if we didn't have a thousand simulations, but only 100. At (ordered) simulation 99 we would have 99% of the probability accounted for, but at simulation 100 we would have all of it. So 99.5% would be somewhere in between.

The order-and-count method also fails when we remember that the simulations are just a sample from a deeper distribution. We've effectively approximated the integral above by a sum, so the distribution resulting distribution is a step function. We should acknowledge this in some way, and perhaps try to smooth it.

# Zoom on a small section of interest
rng <- c(0.99, 0.999)

# Plot the sample CDF
par(bg = "white")
plot(ecdf(one_simulation_set[, 3]), ylim = rng, xlim = quantile(one_simulation_set[, 
    3], probs = rng), main = "Section of sample CDF \n near 99.5%", xlab = "Losses")

# Show the locations of possible definitions of quantiles at a certain
# probability
some_colours <- rainbow(9)
prob <- 0.998
for (i in 1:9) points(x = quantile(one_simulation_set[, 3], probs = prob, type = i), 
    y = prob, col = some_colours[i], pch = i)
legend("bottomright", legend = paste("type", 1:9), col = some_colours, pch = 1:9)

# As a footnote - one thing about R that really irritates me is how
# painfull it is to draw graphs. I did start putting together a GUI to
# ggplot2 in shiny but never finished it

Fortunately someone has already done this kind of thinking, as well as considering other factors. Even more fortunately the R implementation gives you access to many variations, documents them and gives a link to the reference. Which is more than we can say for Excel. In the plot above I've indicated where the 99.8th percentile would be under each. In this case a lot of them coincide, which we can check explicitly:

for (i in 1:9) print(paste("Type ", i, "=", prettyNum(quantile(one_simulation_set[, 
    3], probs = prob, type = i), big.mark = ",")))

[1] "Type  1 = 34,413,581"
[1] "Type  2 = 53,908,634"
[1] "Type  3 = 34,413,581"
[1] "Type  4 = 34,413,581"
[1] "Type  5 = 53,908,634"
[1] "Type  6 = 73,325,708"
[1] "Type  7 = 34,491,561"
[1] "Type  8 = 60,380,992"
[1] "Type  9 = 58,762,903"

and a couple of these values do actually coincide with the 998th largest loss:

prettyNum(sort(one_simulation_set[, 3])[prob * 1000], big.mark = ",")

[1] "34,413,581"

Some of these values are very different, although I have cheated slightly by using the quantile with the largest distance to the next point to illustrate. But given the potentially material differences between the various definitions, it might be worth checking which one you are using! R defaults to variant 7. A bit of experimentation reveals that the old Excel PERCENTILE and the new PERCENTILE.INC also use that variant, whereas the new PERCENTILE.EXC uses variant 6.

I'll leave the interested reader to go through the R documentation or the original reference, and just mention that each method is some sort of linear combination of two order statistics. Which brings us neatly to…

Order Statistics

So an order statistic is an element of the sorted sample, i.e. the smallest, the 10th smallest, the 995th biggest, etc. Clearly they are closely related to our “naive” order-and-count quantile method, as demonstrated above. They're also often confused with quantiles: frequently I've heard people refer to the “50th worst of 10,000” and the 99.5th percentile loss as being equivalent. This is especially common when dealing with multi-dimensional models like the ones would use for risk analysis - the order statistics of the loss / PnL distribution refer to actual simulations which can be examined (what did equities do? Where were rates?). On the other hand the quantiles are abstract numbers lying somewhere between two simulations, with no clear link to the driving risk factors.

A Demonstration

Now to get to the crux of the situation: let's take our simple model, run off two sets of a million simulations and compare the results.

format <- function(x) formatC(x, big.mark = ",", format = "d")

set.seed(1)
first_simulation_set <- fGenerateSimulations(1, 2, 12, 1e+06)
first_mean <- mean(first_simulation_set[, 3])
first_var <- quantile(first_simulation_set[, 3], probs = 0.995, type = 1)

set.seed(2)
second_simulation_set <- fGenerateSimulations(1, 2, 12, 1e+06)
second_mean <- mean(second_simulation_set[, 3])
second_var <- quantile(second_simulation_set[, 3], probs = 0.995, type = 1)

print(data.frame(Mean = c(format(first_mean), format(second_mean), format(first_mean - 
    second_mean), format((first_mean - second_mean)/first_mean)), VaR = c(format(first_var), 
    format(second_var), format(first_var - second_var), ((first_var - second_var)/first_var)), 
    row.names = c("First", "Second", "AbsDiff", "PercDiff")))

              Mean                 VaR
First    1,208,472          29,191,390
Second   1,199,516          29,081,081
AbsDiff      8,956             110,309
PercDiff         0 0.00377883602690822

Now a million simulations is a substantial number. Intuitively we would expect statistics from such samples to be very close, and indeed the mean is. However the quantile is still different, possibly different enough to be material (remember, this is a pretty arbitrary scale).

Sampling distribution of percentiles

The intuition that the mean of two million simulation samples should be very close comes from Stats 101. Recall the distribution of the mean of a sample from a population with mean \( \mu \) and variance \( \sigma \):

\[ \bar{x} \sim Normal\left( \mu, \frac{ \sigma^2 } { n } \right) \]

which is pretty much the central limit theorem rearranged. An analagous result exists for the quantiles of a sample (the Bahadur representation):

\[ Q(p) \sim Normal\left( q_p, \frac{ p ( 1 - p ) } {f(q_p)n} \right) \]

Attentive readers will immediately spot our problem: the variance of the quantile is inversely dependent on the denisty (\( f() \)) of the variable at the the required quantile. At the 99.5th percentile the distribution function will be tiny - of the order of 1-10. So we require many more simulations to get tight confidence intervals for an extreme percentile than we do for the mean. There's also a more practical problem in that we don't know the value of the density at \( q_p \), and, unlike the population mean, we have no simple way to approximate it. Below we implement the solution suggested here, to estimate the 100(1-\( \delta \))% confidence interval for the population quantile \( q_p \).

Again, I'll leave the reader to peruse the details in the link, but briefly the estimator works by jacknifing (using distinct subsets of the data) and sectioning (using overlapping subsets), taking the appropriate order statistics for each and combining them. This gives an estimator with a small bias, for which we can calculate a confidence interval. For the jacknifing and sectioning we need to specify a number of subsets, \( m \) which should be of the order of 10 - 20.

#' Estimate the population quantile $q_p$ from a sample
#'
#' Uses jacknifing and sectioning
#' 
#'@references http://www.stanford.edu/class/msande223/handouts/lecturenotes09.pdf
#'
#'@param x A vector of observations
#'@param m A number of subsets to use (should be in the range 10 <= m <= 20)
#'@param p a probability
#'@param delta a confidence level
#'
#'@example jacknife.estimator( rnorm(1e6), 10, 0.995, 0.05 )
#'
jacknife.estimator <- function(x, m, p, delta) {

    n <- length(x)

    k <- n/m

    Qn <- sort(x)[p * n]

    alpha <- 0
    V <- 0

    for (i in 1:m) {

        idx <- ((i - 1) * k + 1):(i * k)

        Qni <- sort(x[-idx])[p * (m - 1) * k]

        alpha_i <- (m * Qn - (m - 1) * Qni)

        alpha <- alpha + alpha_i/m

        V <- V + alpha_i^2/(m - 1)

    }

    V <- V - m/(m - 1) * alpha^2

    t <- qt(1 - delta, m - 1)

    return(list(alpha = alpha, V = V, interval = alpha + c(-1, 1) * t * sqrt(V/m)))

}

And we demonstrate:

first_quantile <- jacknife.estimator(first_simulation_set[, 3], 10, 0.995, 0.05)
second_quantile <- jacknife.estimator(second_simulation_set[, 3], 10, 0.995, 
    0.05)

summary <- data.frame(Set = c(rep("first", 3), rep("second", 3)), Value = c(first_quantile$interval[1], 
    first_quantile$alpha, first_quantile$interval[2], second_quantile$interval[1], 
    second_quantile$alpha, second_quantile$interval[2]))
ggplot(summary, aes(Set, Value)) + geom_point()

The middle points of each column are the point estimates, and the others are the end points of the 95% confidence intervals for the population or underlying “true” quantiles. So we see that the differences between the two point estimates are comfortably within the implied variances (I know that's not a rigourous test, but that's not the purpose of this illustration).

The Bottom Line

Anyway, the point of all this is pretty simple: as statisticians (or at least, people who use statistics), we should always attempt to quantify the uncertainty about the numbers we use. So rather than presenting a single number (say \( q \)) derived from a sample and calling the “the” 99.5th percentile, what we should actually be doing is saying that we are 95% sure that the 99.5th percentile lies in the range \( (q-s, q+s) \), with a central estimate of \( q \).

This may leave you with some further education and explanation to do, particularly if your stakeholders have become used to numbers being presented as facts, not estimates. But I think in the long run this will be rewarded. I think that a grasp of the uncertainties inherent in models will paradoxically increase trust in the risk modelling / control systems, as opposed to the actual models themselves. Acknowledging or quantifying the “known unknowns” shows that the people running those models have their eyes open, and are alert to the fact that model outputs are simply best guesses.

To leave a comment for the author, please follow the link and comment on their blog: Not Normal Consulting.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.