
Numerical pitfalls in computing variance

[This article was first published on BayesFactor: Software for Bayesian inference, and kindly contributed to R-bloggers.]
One of the most common tasks in statistical computing is computation of the sample variance. This would seem to be straightforward; there are a number of algebraically equivalent ways of representing the sum of squares \(S\), such as \[ S = \sum_{k=1}^n ( x_k - \bar{x})^2 \] or \[ S = \sum_{k=1}^n x_k^2 - n\bar{x}^2, \] and the sample variance is simply \(S/(n-1)\).
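For example, a quick check in R shows both forms agreeing with the built-in var() function on well-behaved data:
w = rnorm(10)
n = length(w)
S1 = sum( (w - mean(w))^2 )      # sum of squared deviations
S2 = sum(w^2) - n * mean(w)^2    # "sum of squares" form
c( S1/(n-1), S2/(n-1), var(w) )  # all three agree for well-behaved data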

What is straightforward algebraically, however, is sometimes not so straightforward in the floating-point arithmetic used by computers. Computers cannot represent numbers to infinite precision, and arithmetic operations can affect the precision of floating-point numbers in unexpected ways.

Consider the numbers .1 and .3/3. These two numbers are equal. However,
.1 - .3/3

## [1] 0.00000000000000001388
is not exactly 0, as one would expect it to be (for more, see “Falling into the Floating Point Trap”, Chapter 4 in the “R Inferno” by Patrick Burns). Multiple ways of computing the variance that are algebraically equivalent do not necessarily yield equal answers in software such as R, and some ways are better than others.
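A practical consequence is that floating-point results should be compared within a tolerance rather than tested for exact equality; in R, all.equal() does this:
.1 == .3/3                    # FALSE: exact comparison is foiled by rounding error
isTRUE( all.equal(.1, .3/3) ) # TRUE: comparison within a small tolerance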

In a series of posts, John D. Cook shows that the seemingly reasonable, commonly used second method above, which he calls the “sum of squares” method, can be extremely unstable in certain circumstances, even giving impossible negative values. He also discusses how to compute the sample variance in a numerically stable way using Welford’s method. Both of these posts are well worth reading.

When I read them, I thought two things. First, I was reminded that I use [used; this was written some time ago] the “sum of squares” method in the BayesFactor package. Second, I thought I would not be affected by the problem, because I represent numbers as logarithms internally for numerical stability and ease of division. Logarithms make many things easier: very large and very small numbers become easier to work with; exponentiation becomes multiplication; and multiplication and division become addition and subtraction. The tricky part of working with logarithms is addition and subtraction. If we have two numbers \(\exp(a)\) and \(\exp(b)\), represented by their logarithms \(a\) and \(b\), and we want the logarithm of their sum or difference, we can make use of the identities
\[
\begin{eqnarray*}
\log(\exp(a) + \exp(b)) &=& a + \log(1 + \exp(b - a))\\
\log(\exp(a) - \exp(b)) &=& a + \log(1 - \exp(b - a))
\end{eqnarray*}
\]
Now arithmetic with \(a\) and \(b\) is addition and subtraction, and we can use accurate floating point approximations of \(\log(1+\exp(x))\) and \(\log(1-\exp(x))\).
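In base R, those approximations are available as log1p() and expm1(). A minimal sketch of the two identities, assuming \(a \ge b\) so that \(\exp(b - a) \le 1\) (the helper names below are just illustrative):
# Log-space addition and subtraction, assuming a >= b
logspace_add = function(a, b) a + log1p(exp(b - a))
logspace_sub = function(a, b) a + log(-expm1(b - a))

logspace_add(log(3), log(2))  # log(5)
logspace_sub(log(3), log(2))  # log(1) = 0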

When logarithms don’t help

But I wasn’t really sure whether I would be affected by the instability of the “sum of squares” method, so I decided to check. It turns out that representing numbers logarithmically doesn’t necessarily help. To demonstrate this easily, I created an R S4 class that eases arithmetic on logarithmically-represented values. First, we load the necessary libraries/files:
# Install the BayesFactor and devtools packages, if you don't already have them:
# install.packages(c("BayesFactor", "devtools"))

# Load my S4 class for representing real numbers with logarithms
# and performing arithmetic on them
# See the code at https://gist.github.com/richarddmorey/3c77d0065983e31241bff3807482443e
devtools::source_gist('3c77d0065983e31241bff3807482443e')

# set random seed so results are reproducible
set.seed(2)


To see the S4 class in action, we need to generate some numbers that are logarithmically-represented. The variables x and y below are equal to \(\exp(1)=2.718\) and \(\exp(2)=7.389\), respectively. The modulo argument gives the log-represented magnitude of the number, and the sign argument gives the sign (with 1L meaning the integer representation of 1):
x = logRepresentedReal(modulo = 1, sign = 1L)
y = logRepresentedReal(modulo = 2, sign = 1L)
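The class also supplies a coercion back to an ordinary double (used later via sapply(z, as.numeric)), which lets us check the stored values:
as.numeric(x)  # exp(1) = 2.718282
as.numeric(y)  # exp(2) = 7.389056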
We can add the two numbers together, for instance:
x + y

##  10.10734
Although the result does not look logarithmically-represented, we can verify that it is by using the str function:
str( x + y )

## Formal class 'logRepresentedReal' [package ".GlobalEnv"] with 2 slots
##   ..@ modulo: num 2.31
##   ..@ sign  : int 1
The result is of class logRepresentedReal, and the modulo slot tells us \(\log(x+y)\). With arithmetic on logarithmically-represented numbers defined by the logRepresentedReal class, we can test whether our logarithms help stabilize the estimate of the variance. Following Cook, we will sample values from a uniform distribution, making use of the fact that if \(z\) has a uniform distribution, then \(-\log(z)\) has a standard exponential distribution:
runif2 = function(n){
  # Sample log values from exponential distribution
  x = -rexp(n)
  # represent all values logarithmically in a list
  lapply(x, logRepresentedReal, sign = 1L)
}

n = 100
z = runif2(n)
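As a quick spot check, coercing a few of the sampled values back to doubles confirms that they lie in the unit interval, as a uniform sample should:
# The exponentiated values should all lie in (0, 1)
head( sapply(z, as.numeric) )
range( sapply(z, as.numeric) )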
We sampled \(n=100\) values from a uniform distribution. We can now compute the variance in several ways. The first way is to use the “sum of squares” method on the exponentiated values:
# Sum of squares method
var.sumsq.exp = function(z)
{
  n = length(z)
  z = sapply(z, as.numeric)
  (sum(z^2) - n*mean(z)^2)/(n-1)
}

var.sumsq.exp(z)

## [1] 0.07419988
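As a quick cross-check, R's built-in var() applied to the exponentiated values should give the same answer up to floating-point noise:
# Compare against R's built-in variance function
var( sapply(z, as.numeric) )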
This presents no problem, since our uniformly-distributed values are rather moderate. We now use Welford’s method on the logarithmically-represented values to compute the variance:
var.welford <- function(z){
  n = length(z)
  # M[[k]] holds the running mean after k values;
  # S[[k]] holds the running sum of squared deviations
  M = list()
  S = list()
  M[[1]] = z[[1]]
  S[[1]] = 0

  for(k in 2:n){
    M[[k]] = M[[k-1]] + ( z[[k]] - M[[k-1]] ) / k
    S[[k]] = S[[k-1]] + ( z[[k]] - M[[k-1]] ) * ( z[[k]] - M[[k]] )
  }
  return(S[[n]] / (n - 1))
}

var.welford(z)

##  0.07419988
And finally, we can use the “sum of squares” method on the logarithmically-represented values:
var.sumsq = function(z){
  n = length(z)
  zsqr = sapply(z, function(x) x^2)

  sumz = 0
  sumz2 = 0
  for(k in 1:n){
    sumz = sumz + z[[k]]
    sumz2 = sumz2 + zsqr[[k]] 
  }
  mnz = sumz/n
  ssz = sumz2 - n * mnz^2 
  return(ssz / (n-1))  
}

var.sumsq(z)

##  0.07419988
Again, this presents no problem, since our uniformly-distributed values are moderate. So far, we see no signs of numerical instability, but none are expected. As Cook did in his example, we can add a very large number — in this case, one billion — to all sampled values. This makes the variance quite small compared to the mean, and would be expected to make the sum of squares estimates unstable.
const = 1e9
z = sapply(z, function(x) x + const)

var.sumsq.exp(z)

## [1] -165.4949

var.welford(z)

##  0.07419973

var.sumsq(z)

##  35886
Notice that both of the sum-of-squares estimates fail; the logarithmic representation used by var.sumsq does not help the numerical stability. The Welford method, on the other hand, yields an accurate value. The reason the logarithms do not help is that the same cancellation simply reappears in the log domain: subtracting two nearly equal log-represented numbers means evaluating \(\log(1 - \exp(b - a))\) with \(b - a\) extremely close to zero, and that small difference \(b - a\) has itself already lost most of its relative precision.
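The failure can be seen in miniature by trying to recover a small difference from the logarithms of two nearly equal numbers (an illustrative example, using the identity \(\exp(a) - \exp(b) = \exp(a)\,(-\mathrm{expm1}(b - a))\)):
# Recover the difference 0.1 from the logs of 1e9 + 0.1 and 1e9
a = log(1e9 + 0.1)
b = log(1e9)
exp(a) * (-expm1(b - a))  # algebraically 0.1, but b - a is so close to zero
                          # that its relative error, and hence the relative
                          # error of the result, is far larger than machine epsilon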


Conclusion 


There are many ways of combating numerical instability, and representing numbers as logarithms is an important one. But while logarithmic representation protects against some sources of instability, it does not help in all cases. The superiority of Welford’s method for computing the variance, even when the numbers are represented logarithmically, shows this clearly.