Data visualization with statistical reasoning: seeing uncertainty with the bootstrap
This blog post is one of a series highlighting specific images from my book Data Visualization: charts, maps and interactive graphics. This is from Chapter 8, “Visualizing Uncertainty”.
Why?
One of the most common concerns that I hear from dataviz people is that they need to visualise not just a best estimate about the behaviour of their data, but also the uncertainty around that estimate. Sometimes, the estimate is a statistic, like the risk of side effects of a particular drug, or the percentage of voters intending to back a given candidate. Sometimes, it is a prediction of future data, and sometimes it is a more esoteric parameter in a statistical model. The objective is always the same: if they just show a best estimate, some readers may conclude that it is known with 100% certainty, and generally that’s not the case.
I want to describe a very simple and flexible technique for quantifying uncertainty called the bootstrap. This tries to tackle the problem that your data are often just a sample from a bigger population, and so that sample could yield an under- or over-estimate just by chance. We can't tell how far off the true value the sample's estimate is, because we don't know the true value, but (and I found this incredible when I first learnt it) statistical theory allows us to work out how likely we are to be off by a given distance. That lets us put bounds on the uncertainty.
Now, it is worth saying here, before we go on, that this is not the only type of uncertainty you might come across. The poll of voters is uncertain because you didn’t ask every voter, just a sample, and we can quantify that as I’m describing here, but it’s also likely to be uncertain because the voters who agreed to answer your questions are not like the ones who did not agree. That latter source of uncertainty calls for other methods.
The underlying task is to work out what the estimates would look like if you had obtained a different sample from the same population. Sometimes, a mathematical shortcut formula (the familiar standard error, for example) gives you this immediately: you just plug the right stats into the formula. But there are some difficulties. For one, the datavizzer needs to know that these formulas exist, which one applies to their purpose, and how to obtain it from analytical software or program it themselves. The second problem is that these formulas are sometimes approximations, which might be fine or might be well off, and it takes experience and skill to know the difference. The third is that there are several useful stats, like the median, for which no decent shortcut formula exists, only rough approximations. The fourth problem is that shortcut formulas (I take this term from the Locks) mask the thought process and logic behind quantifying uncertainty, while the bootstrap opens it up to examination and critical thought.
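To make the shortcut-formula idea concrete, here is a minimal sketch in R; the data vector x is made up purely for illustration. The shortcut for the standard error of a mean is the sample standard deviation divided by the square root of the sample size; the median has no comparably tidy formula.

```r
# A made-up sample of 100 observations, purely for illustration
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)

# Shortcut formula: standard error of the mean = sd / sqrt(n)
se_mean <- sd(x) / sqrt(length(x))

# Approximate 95% confidence interval for the mean
mean(x) + c(-1.96, 1.96) * se_mean
```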
The American Statistical Association’s GAISE guidelines for teaching stats now recommend starting with the bootstrap and related methods before you bring in shortcut formulas. So, if you didn’t study stats, yet want to visualise uncertainty from sampling, read on.
How?
If you do dataviz, and you come from a non-statistical background, you will probably find bootstrapping useful. Here it is in a nutshell. If we had lots of samples (of the same size, picked the same way) from the same population, then it would be simple. We could get an estimate from each of the samples and look at how variable those estimates are. Of course, that would also be pointless because we could just put all the samples together to make a megasample. Real life isn’t like that. The next best thing to having another sample from the same population is having a pseudo-sample by picking from our existing data. Say you have 100 observations in your sample. Pick one at random, record it, and put it back — repeat one hundred times. Some observations will get picked more than once, some not at all. You will have a new sample that behaves like it came from the whole population.
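In R, that resampling step is a single line. A minimal sketch, with a made-up vector x of 100 observations:

```r
# A made-up sample of 100 observations
set.seed(1)
x <- rexp(100, rate = 0.1)

# One bootstrap pseudo-sample: draw 100 values from x, with replacement
pseudo <- sample(x, size = length(x), replace = TRUE)

# Some original observations get picked more than once, some not at all
sum(!(x %in% pseudo))   # how many were never picked
```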
Sounds too easy to be true, huh? Most people think that when they first hear about it. Yet its mathematical behaviour was established back in 1979 by Brad Efron.
Now, work out the estimate of interest from that pseudo-sample, and do this a lot: the computer's doing it for you, no sweat, so you can generate 1000 pseudo-samples and their estimates of interest. Then look at the distribution of those bootstrap estimates. Their average should be similar to your original estimate, but you can shift them up or down to match (a bias-corrected bootstrap). How far away from the original do they stretch? If you pick the central 95% of the bootstrap estimates, that gives you a 95% bootstrap confidence interval. You can draw that as an error bar, or an ellipse, or a shaded region around a line. Or, you could draw the bootstrap estimates themselves, all 1000 of them, and just make them very faint and semi-transparent. There are other, more experimental approaches too.
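Continuing the sketch (the made-up data and the choice of the median as the estimate of interest are just illustrative assumptions), here is one way to generate 1000 bootstrap estimates, read off the central 95%, and draw the estimates themselves as faint marks:

```r
set.seed(2)
x <- rexp(100, rate = 0.1)   # made-up data; the median is our estimate of interest
n_boot <- 1000

# 1000 bootstrap estimates of the median
boot_medians <- replicate(n_boot, median(sample(x, replace = TRUE)))

# Percentile bootstrap 95% confidence interval: the central 95% of the estimates
ci <- quantile(boot_medians, probs = c(0.025, 0.975))
ci

# Draw all 1000 estimates as faint, semi-transparent marks,
# with the original estimate and the interval on top
plot(boot_medians, rep(1, n_boot), pch = 16,
     col = rgb(0, 0, 0, alpha = 0.05),
     xlab = "Bootstrap estimates of the median", ylab = "", yaxt = "n")
abline(v = median(x), lwd = 2)
abline(v = ci, lty = 2)
```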
You can apply the bootstrap to a lot of different statistics and a lot of different data, but use some common sense. If you are interested in the maximum value in a population, then your sample maximum is almost certainly an underestimate, and bootstrapping will not help; it will just reproduce the highest few values in your sample. If your data are very unrepresentative of the population for some reason, bootstrapping won't help either. If you only have a handful of observations, bootstrapping isn't going to fill in more detail than you already have. But, in that way, it can be more honest than the shortcut formulas.
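A quick illustration of the caution about the maximum, again with made-up data: the bootstrap estimates can never exceed the largest value you happened to observe.

```r
set.seed(3)
x <- runif(100, min = 0, max = 1000)   # the true population maximum is 1000

# Bootstrap the sample maximum
boot_max <- replicate(1000, max(sample(x, replace = TRUE)))

# The bootstrap estimates are only ever the top few observed values,
# all of which sit below the true maximum of 1000
table(boot_max)
max(x)
```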
If you want to read more about bootstrapping, you'll need some algebra at the ready. There are two key books: An Introduction to the Bootstrap, by bootstrap-meister Brad Efron with Stanford colleague Rob Tibshirani, and Bootstrap Methods and Their Application, by Davison and Hinkley. They are about equally accessible. I own a copy of Davison and Hinkley, for what it's worth.
You could do bootstrapping in pretty much any software you like, as long as you know how to pick one observation out of your data at random. You could do it in a spreadsheet, though you should be aware of the heightened risk of programming errors. I wrote a simple R function for bootstraps a while back, for my students when I was teaching intro stats at St George’s Medical School & Kingston Uni. If you use R, check that out.
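That function isn't reproduced here, but as a rough sketch of the same idea, the widely used boot package (bundled with most R installations) can do the resampling and the interval calculation for you; the data and the choice of the median below are, again, just for illustration.

```r
library(boot)

set.seed(4)
x <- rexp(100, rate = 0.1)   # made-up data

# The statistic function takes the data and a vector of resampled indices
boot_median <- function(data, indices) median(data[indices])

b <- boot(data = x, statistic = boot_median, R = 1000)

# Percentile and bias-corrected-and-accelerated (BCa) intervals
boot.ci(b, type = c("perc", "bca"))
```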