
Frequentist German Tank Problem


The German Tank Problem: The Frequentist Way

Many things are given serial numbers, and often that serial number logically starts at 1 and increases by 1 with each new unit. For example, German tanks in World War II had several parts with serial numbers. By collecting these numbers, Allied statisticians could produce estimates of the total number of tanks produced by the Germans in a given time period.

The idea behind this was that the serial numbers encountered in the field were samples from a discrete uniform distribution starting at 1 and terminating at some unknown value \( N \), where \( N \) is the true number of tanks built. The problem then became to use the information from the tanks captured or destroyed in battle to estimate the value of \( N \).

If you have an urn with balls numbered 1 to \( N \) and draw \( k \) balls with replacement, the maximum value observed in the sample is a reasonable estimator of \( N \).

A simple simulation shows that this is the case:

library(ggplot2)

# Draw k balls with replacement from an urn numbered 1:N and return the
# sample maximum for each of r replicates
sampleUrn <- function(k, N = 10000, r = 100) {
    urn <- 1:N
    sampleMax <- replicate(r, max(sample(urn, k, replace = TRUE)))
    sampleMax
}
# 100 draw sizes from 1 to 10000, with 100 replicates at each size
k <- seq(1, 10000, length.out = 100)
sampleMax <- lapply(k, sampleUrn)
sampleMax <- data.frame(maxValue = do.call(c, sampleMax), k = rep(k, each = 100))
ggplot(sampleMax, aes(x = k, y = maxValue)) + geom_point() + scale_y_continuous("Sample Maximum")

Indeed, as suggested by the plot, the estimator is consistent for \( N \) (a formal treatment is in Example 5.1.1 in Hogg, McKean and Craig 6th edition).

However, try as they might, the Allies couldn't capture enough tanks to make an appeal to asymptotic theory. For fairly small sample sizes, the bias of this estimator is significant. This makes sense: the sample maximum can never exceed the maximum value of the support, only equal it, and the probability that a draw contains that maximum value depends on the size of the draw (and of the sample space).
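
As a quick illustration of that bias, we can reuse the sampleUrn() function from above with small draw sizes and compare the average sample maximum to the true \( N \) (the draw sizes of 10 and 100 are arbitrary choices for this sketch):

# Average sample maximum over 1000 replicates for small draws from an urn with N = 10000
mean(sampleUrn(k = 10, N = 10000, r = 1000))   # well below 10000
mean(sampleUrn(k = 100, N = 10000, r = 1000))  # closer, but still biased low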

Furthermore, they weren't sampling the tanks with replacement. The information about a tank's serial numbers came when the tank was captured or disabled in combat; once observed, the tank could never be observed again. This complicates the pmf for \( M \) somewhat, but accepting that complexity yields more accurate estimates of \( N \) for a fixed sample size.

Nevertheless, it is possible to produce an estimator of \( N \), \( \hat{N} \), that is unbiased for small sample sizes and captures the fact that the sampling is done without replacement. To start simply,

\[ E(M; k) = \sum_{m = 0}^{N} m \text{Pr}(m) \]

or the expectation of the sample maximum, \( M \), given a draw of size \( k \), is simply the sum over \( m \) of \( m \) times \( \text{Pr}(m) \), where \( m \in \{0, 1, \ldots, N\} \).

If we recognize that we have observed \( k \) tanks, then because we are sampling without replacement and have seen \( k \) unique tanks, the sample maximum \( M \) cannot be less than \( k \). This lets us split the summation into two parts

\[ E(M; k) = \sum_{m = 0}^{k-1} m \text{Pr}(m) + \sum_{m = k}^{N} m \text{Pr}(m) \]

and since \( \text{Pr}(M = m) = 0 \) for \( m < k \), this reduces to

\[ E(M; k) = \sum_{m = k}^{N} m \text{Pr}(m) \]

Since the sampling is without replacement, the event \( M = m \) requires that tank \( m \) was drawn and that the other \( k-1 \) tanks came from the \( m-1 \) tanks with smaller serial numbers. The pmf of \( M \) is therefore the number of ways to select \( k-1 \) tanks from a set of \( m-1 \) tanks divided by the total number of ways to select \( k \) tanks from the total set of \( N \) tanks. After questions involving cards, this is my least favorite part of math stats, but this expression is given by

\[ \text{Pr}(M = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}} \]
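
Before plugging this in, it is easy to sanity check the pmf numerically: the probabilities computed with choose() should sum to one and should match the empirical distribution of the maximum of a without-replacement draw. A minimal sketch, with \( N = 20 \) and \( k = 5 \) chosen arbitrarily:

# Theoretical pmf of the sample maximum M for N = 20 tanks and k = 5 observed
N_check <- 20
k_check <- 5
m_values <- k_check:N_check
pmf <- choose(m_values - 1, k_check - 1) / choose(N_check, k_check)
sum(pmf)  # should be 1
# Empirical distribution of M from 10,000 simulated draws without replacement
sims <- replicate(10000, max(sample(1:N_check, k_check)))
empirical <- table(factor(sims, levels = m_values)) / length(sims)
round(cbind(m = m_values, theoretical = pmf, empirical = as.numeric(empirical)), 3)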

If we plug this into the summation above for \( \text{Pr}(m) \), we get

\[ E(M; k) = \sum_{m = k}^{N} m \frac{\binom{m-1}{k-1}}{\binom{N}{k}} \]

Even without expanding the binomial coefficients into factorial form, this is ugly. And so expansion is exactly what we do

\[ E(M; k) = \sum_{m = k}^{N} m \frac{\frac{(m-1)!}{(k-1)!((m-1)-(k-1))!}}{\binom{N}{k}} \]

And if we bring the \( m \) into the expression

\[ E(M; k) = \sum_{m = k}^{N} \frac{\frac{m!}{(k-1)!(m - k)!}}{\binom{N}{k}} \]

The expansion would be much nicer now if we could somehow put it back into a binomial coefficient. It is close to \( \binom{m}{k} \) but we would need a \( \frac{1}{k} \) multiplier. If we multiply the top and bottom of the expansion by \( k \), we do not change the value but we arrive at

\[ E(M; k) = \sum_{m = k}^{N} \frac{\frac{k \, m!}{k(k-1)!(m - k)!}}{\binom{N}{k}} \]

which can be rewritten as

\[ E(M; k) = \sum_{m = k}^{N} \frac{k\binom{m}{k}}{\binom{N}{k}} \]

As far as the summation is concerned, \( k \) and \( \binom{N}{k} \) are constants and we can rewrite the expression as

\[ E(M; k) = \frac{k}{\binom{N}{k}} \sum_{m = k}^{N} \binom{m}{k} \]

At first glance, this doesn't look like much of an improvement. However, now we just have the summation of a single binomial coefficient without any multiplication inside the summation.

Recall from earlier, the pmf of \( M \) is

\[ \text{Pr}(M = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}} \]

By the definition of a pmf, the sum of \( \text{Pr}(M = m) \) over the support of \( m \) is 1. As the values of \( \text{Pr}(m) \) are 0 for \( m < k \), this becomes

\[ 1 = \sum_{m = k}^{N} \frac{\binom{m-1}{k-1}}{\binom{N}{k}} \]

The \( \frac{1}{\binom{N}{k}} \) can be pulled out giving

\[ 1 = \frac{1}{\binom{N}{k}} \sum_{m = k}^{N} \binom{m-1}{k-1} \]

In order for this expression to be true,

\[ \binom{N}{k} = \sum_{m = k}^{N} \binom{m-1}{k-1} \]
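
(This is the hockey stick identity.) A quick numerical check in R, with \( N = 30 \) and \( k = 4 \) chosen arbitrarily:

# Numerical check of the identity for N = 30, k = 4
N_check <- 30
k_check <- 4
sum(choose((k_check:N_check) - 1, k_check - 1))  # 27405
choose(N_check, k_check)                         # also 27405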

Now returning to the expectation, we see something similar with \( \sum_{m = k}^{N} \binom{m}{k} \); however, we lack the "\( -1 \)" bit. Shifting the summation index up by one (so each \( m \) inside the sum becomes \( m - 1 \) and both limits increase by one) doesn't change the value of the expression, so let's rewrite the expectation as

\[ E(M; k) = \frac{k}{\binom{N}{k}} \sum_{m = k+1}^{N+1} \binom{m - 1}{k} \]

We now have an expression in the right form for the trick based on the pmf to apply. Using the relationship \( \binom{N}{k} = \sum_{m = k}^{N} \binom{m-1}{k-1} \) with \( N + 1 \) in place of \( N \) and \( k + 1 \) in place of \( k \), the summation equals \( \binom{N+1}{k+1} \), and we arrive at

\[ E(M; k) = \frac{k}{\binom{N}{k}} \binom{N + 1}{k + 1} \]

No more ugly summations! If we expand the binomial coefficients we get to

\[ E(M; k) = \frac{k \frac{(N + 1)!}{(k+1)!(N-k)!}}{\frac{N!}{k! (N-k)!}} \]

\[ E(M; k) = \frac{k(N+1)!\,k!}{(k+1)!\,N!} \]

If we realize that \( \frac{k!}{(k+1)!} \) cancels to \( \frac{1}{k+1} \) and that \( \frac{(N+1)!}{N!} \) reduces to \( N+1 \), we can write this as

\[ E(M; k) = \frac{k (N+1)}{k+1} \]
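
As a sanity check on the algebra, the closed form agrees closely with the simulated mean of the sample maximum; a small sketch with \( N = 300 \) and \( k = 10 \) (arbitrary values):

# Closed-form expectation of M versus a simulated mean (N = 300, k = 10)
N_check <- 300
k_check <- 10
k_check * (N_check + 1) / (k_check + 1)                  # about 273.6
mean(replicate(10000, max(sample(1:N_check, k_check))))  # should be close to 273.6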

By solving this expression for \( N \), we get

\[ N = \frac{E(M)(k + 1)}{k} - 1 \]

Replacing \( E(M) \) with the observed sample maximum \( M \) gives the estimator

\[ \hat{N} = \frac{M k + M}{k} - 1 \]

which is unbiased, since

\[ E(\hat{N}) = \frac{E(M)(k+1)}{k} - 1 = \frac{k(N+1)}{k+1} \cdot \frac{k+1}{k} - 1 = N \]

Note that this can be reduced to

\[ \hat{N} = M \bigg(1 + \frac{1}{k}\bigg) - 1 \]

for simplicity.

Since this is an R-centric blog after all, let's check the performance of this estimator against using just the sample maximum for a collection of fairly small samples. According to Wikipedia, a reasonable number for \( N \) under the historical context is about 300.

# Simulate observing k of N tanks without replacement and return the
# adjusted estimate of N based on the sample maximum
germanTankSim <- function(k, N) {
    tanks <- 1:N
    m <- max(sample(tanks, k))
    nhat <- m * (1 + 1/k) - 1
    nhat
}
# Same simulation, but return the raw sample maximum as the estimate
sampleMaxEstimator <- function(k, N) {
    tanks <- 1:N
    m <- max(sample(tanks, k))
    m
}

N <- 300
k <- round(seq(0.01, 1, 0.01) * N)
k[k == 0] <- 1  # doesn't make sense with k = 0
tanks <- sapply(rep(k, each = 1000), germanTankSim, N)
tanks <- data.frame(adjustedMax = tanks, k = rep(k, each = 1000))
tanks$sampleMax <- sapply(rep(k, each = 1000), sampleMaxEstimator, N)

If you visualize the corrected estimator's estimate as a function of the number of observed tanks, you see a pretty smooth decline in the variance of the estimate and a relatively unbiased estimator of \( N \). Note that I've used both jitter and alpha blending to make the changes in distribution a little more clear.

ggplot(tanks, aes(x = k)) + geom_jitter(aes(y = adjustedMax), alpha = 0.1, position = position_jitter(width = 1, 
    height = 1)) + geom_hline(aes(yintercept = 300), lty = 2) + scale_x_continuous("Number of Observed Tanks") + 
    scale_y_continuous("Unbiased Estimate of Number of Total Tanks")

Turning to the sample maximum, it is clearly biased towards lower values for the total number of tanks. Given that a draw of very small size, say 10% of the population, is unlikely to contain the population maximum value, this makes sense. The bias decays as the number of observed tanks increases, and the estimate converges to the true value as \( k \) approaches \( N \).

ggplot(tanks, aes(x = k)) + geom_jitter(aes(y = sampleMax), alpha = 0.1, position = position_jitter(width = 1, 
    height = 1)) + geom_hline(aes(yintercept = 300), lty = 2) + scale_x_continuous("Number of Observed Tanks") + 
    scale_y_continuous("Sample Max")

If we calculate the mean squared error at each value of \( k \) for the two estimators, we see that the adjusted estimator gives a slightly more accurate estimate of \( N \) than the raw sample maximum.

mse <- function(value, N) {
    mean((value - N)^2)
}
# Mean squared error of each estimator at each value of k
esd <- aggregate(cbind(tanks$adjustedMax, tanks$sampleMax), by = list(tanks$k), 
    mse, N = 300)
names(esd) <- c("k", "adjustMax", "sampleMax")
# Reshape to long format for plotting
esd <- data.frame(k = rep(esd$k, 2), mse = c(esd$adjustMax, esd$sampleMax), 
    estimator = rep(c("Adjusted", "Sample Max"), each = 100))
ggplot(esd, aes(x = k, y = mse, color = estimator)) + geom_line()

While it started out with tanks, this method can be useful for estimating a large number of different maximum values based on sequential serial numbers. In the absence of sales data, estimation based on serial numbers observed in the wild can often provide reasonable estimates of the total number of units. It may also spell out the end of humanity.

Bayesian approaches give somewhat different estimates; in particular, without a prior upper limit on the number of units, the posterior mean is not finite for \( k = 1 \) or \( k = 2 \).

And most importantly, it is fun math, statistics and computing.
