Simulation (is where it’s happening)
Jim Silverton wrote to the Allstat mailing list recently:
“Hi,
Anyone up for a challenge?
Suppose we have [4] random variables that are random points on the surface of a sphere. What is the probability that the tetrahedron made by joining these points, contain the origin?
I tried over and over – no idea.”
This is really a classic probability “hard” question. If you’re comfortable with geometry and probability (and you happen to think of it in the right way), you’ll be fine. I looked it up online and found it came from the 1992 Putnam math competition (and could well have existed before that). There’s a solution at Wolfram so scary it would make Steven Seagal weep, while Howard and Sisson riff on it in a paper from 1996. You have to mess about with calculus and vector geometry and probability to get the extended answers.
I have to say, I hate this kind of thing. My family have pretty much learned not to show me logic puzzles in the newspaper or anything like that. I don’t see the point. There clearly is an answer, and you just have to struggle to find it. Where’s the satisfaction in that?But this innocuous post caught my imagination because it seemed to me a good vehicle to promote computing skills and simulation over teeth-grinding math. Wolfram says the answer is 1/8. I wrote this R code to make random tetrahedra 100,000 times and in each case check whether they contained the origin. It takes 61 seconds to run on my office PC. The answer is apparently 0.12391 (95% confidence interval 0.122 to 0.126), which is pretty neat compared to the analytical result of 0.125.
# x is the matrix of 4 points on the unit sphere
# d is in 1:4, indicating which vector to compare with the plane joining
# the other three.
# If the 4th point and the plane are in opposite directions, this supports
# the tetrahedron containing the origin. If all four are opposite, it
# is confirmed (so it seems to me, but I leave proof to others with
# time on their hands!)
m <- function(xx) {
return(sqrt(sum(xx^2)))
}
proj<-function(x,d) {
j<-(1:4)[-d]
pq<-x[j[2],]-x[j[1],]
pr<-x[j[3],]-x[j[1],]
orth<-c((pq[2]*pr[3])-(pq[3]*pr[2]),
(pq[3]*pr[1])-(pq[1]*pr[3]),
(pq[1]*pr[2])-(pq[2]*pr[1]))
unitorth<-orth/m(orth)
plane<-unitorth*sum(x[j[1],]*unitorth)
point4<-unitorth*sum(x[d,]*unitorth)
oppo<-((plane[1]>0 & point4[1]<0)|(point4[1]>0 & plane[1]<0))
return(list(oppo,point4,plane))
}
set.seed(434)
iter<-100000
results<-rep(NA,iter)
vol<-rep(NA,iter)
pb<-txtProgressBar(min=0,max=iter,style=3)
Sys.time()
for (i in 1:iter) {
a <- matrix(runif(12,min=-1,max=1), nrow=4)
mags <- apply(a, 1, m)
mags <- matrix(rep(mags, 3), ncol=3)
x <- a/mags
temp<-rep(NA,4)
for (k in 1:4) {
temp[k]<-proj(x,k)[[1]]
}
results[i]<-all(temp)
vol[i]<-(1/6)*abs(det(cbind(x,rep(1,4))))
setTxtProgressBar(pb,i)
}
table(results)/iter
png("volume.png")
plot(density(vol))
dev.off()
Sys.time()
My approach to checking whether a tetrahedron contains the origin is a little heuristic, as you’ll see. It looks sensible enough though; if you draw a line from the origin to the kth vertex, and also from the origin, orthogonal to the plane joining the other three points, and they are in the opposite direction for all k, then the origin is contained.
My point is simply that simulation gives you enormous flexibility with gnarly problems. That code above throws in the extra stuff they do at Wolfram because the original question wasn’t hard enough. What are the central moments of the distribution of the volumes? Who cares. Let’s just get the distribution straight out of the simulation! After doing the quintuple integral and then some, Wolfram say the mean volume is (4/105)pi = 0.11968. My code says 0.11923.
There are times when you can get an analytical formula and evaluate that in a split second for each scenario, which is obviously a good move. For example, you could solve a multiple linear regression by MCMC, or by matrix algebra, and the latter is going to be lightning fast, always reliable, but not extendable to anything more complex. There are other times when you can’t get the formula but by thinking about the data-generating process, you can write a simulation program really quickly and get on with your life. How about the sampling distribution of the third quartile of dodecahedral volumes in 11-dimensional space whose vertices are sampled from autocorrelated mixtures of multivariate Cauchy distributions? No problem. Last week I did some sample size calculations for students, all by simulation, because there was no simple applicable formula. It just works, but surprisingly few places teach people these skills.
PS Do you teach this kind of stuff in an introductory stats setting? I’d like to hear from you! We might be able to put together a scientific meeting around this theme. There is interest and support and a central London venue but I need speakers…