Recent revelations about PRISM, the NSA’s massive program of surveillance of civilian communications, have caused quite a stir. And rightfully so, as it appears that the agency has been granted warrantless direct access to just about any form of digital communication engaged in by American citizens, and that its access to such data has grown significantly over the past few years.
Some may argue that there is a necessary trade-off between civil liberties and public safety, and that others should just quit their whining. Let’s take a look at this proposition (not the whining part). Specifically, let’s ask: how much benefit, in terms of thwarted would-be attacks, does this level of surveillance confer?
Let’s start by recognizing that terrorism is extremely rare, so the probability that an individual under surveillance (and now everyone is under surveillance) is a terrorist is also extremely low. Let’s also assume that the neck-beards at the NSA are fairly clever, if exceptionally creepy, and that they have devised an algorithm that can detect ‘terrorist communications’ (as opposed to, for instance, pizza orders) with 99% accuracy. That is, if a communication really does come from a bad guy, the algorithm flags it 99% of the time:
P(+ | bad guy) = 0.99
So when the algorithm gets a hit, they’ve caught themselves a bad guy. A job well done, and Murica lives to fight another day. Well, not quite. What we really want to know is: what is the probability that they’ve found a bad guy, given that they’ve gotten a hit on their screen? Or,
P(bad guy | +) = ??
This is quite a different question altogether. To figure it out, we need a bit more information. Recall that bad guys (specifically terrorists) are extremely rare, say on the order of one in a million (this is a wild overestimate, of course, with the true rate being much lower – but let’s not let that stop us). So,
P(bad guy) = 1/1,000,000
Further, let’s say that the spooks have a pretty good algorithm that comes up falsely positive (i.e., flags a person who is actually a good guy) only one in one hundred times.
P(+ | good guy) = 0.01
And now we have all that we need. Apply a little special Bayes sauce:
P(bad guy | +) = P(+ | bad guy) P(bad guy) / [ P(+ | bad guy) P(bad guy) + P(+ | good guy) P(good guy) ]
and we get:
P(bad guy | +) = 1/10,102
That is, for every positive (the NSA calls these ‘reports’), there is only a 1 in 10,102 chance (using our rough assumptions) that they’ve found a real bad guy.
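For those who want to check the arithmetic, here is a minimal R sketch of the same calculation. The variable names are mine, and the rates are just the rough assumptions made above:

# Assumed rates from the discussion above -- rough guesses, not real NSA figures
p_bad      <- 1 / 1e6   # P(bad guy): prior probability that a person is a terrorist
p_pos_bad  <- 0.99      # P(+ | bad guy): probability of a hit given a real bad guy
p_pos_good <- 0.01      # P(+ | good guy): probability of a false positive

# Bayes' theorem: P(bad guy | +)
p_bad_pos <- (p_pos_bad * p_bad) /
  (p_pos_bad * p_bad + p_pos_good * (1 - p_bad))

p_bad_pos      # about 9.9e-05
1 / p_bad_pos  # about 10,102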
UPDATE: While former NSA analyst turned whistleblower William Binney thinks this is a plausible estimate, the point here is not that this is the ‘correct probability’ (remember that we based our calculations on very rough assumptions). The takeaway message is simply that whenever the rate of an event of interest is extremely low, even a very accurate test will fail very often.
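To see that general point, here is another small R sketch (again just an illustration, not anyone’s real numbers) that recomputes the posterior over a range of base rates while keeping the 99%-accurate test fixed:

p_pos_bad  <- 0.99
p_pos_good <- 0.01
prevalence <- 10^(-(2:8))   # base rates from 1 in 100 down to 1 in 100 million
posterior  <- (p_pos_bad * prevalence) /
  (p_pos_bad * prevalence + p_pos_good * (1 - prevalence))
data.frame(prevalence, posterior = signif(posterior, 3))

Even at a one-in-a-thousand base rate, a hit implies only about a 9% chance of a real bad guy; as the event gets rarer, the posterior keeps collapsing no matter how good the test is.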
Big brother is always watching, but he’s still got a needle in a haystack problem.