Uncertainty in Data Science (Transcript)
Here is a link to the podcast.
Introducing Allen Downey
Hugo: Hi, there, Allen, and welcome to DataFramed.
Allen: Hey, Hugo. Thank you very much.
Hugo: Such a pleasure to have you on the show, and I’m really excited to have you here to talk about uncertainty in data science, how we think about prediction, and how we can think probabilistically, and how we do it right, and how we can get it wrong as well, but before we get into that, I’d love to find out a bit about you, and so I’m wondering what you’re known for in the data community.
Allen: Right. Well, I’m working on a book series that’s called Think X, for all X, so hopefully some people know about that. Think Python is kind of the starting point, and then there are Think Stats and Think Bayes, for data science and for Bayesian statistics.
Hugo: Great, and so why Think?
Allen: It came about in a roundabout way. The original book was called How to Think Like a Computer Scientist, and it was originally a Java book, and then it became a Python book, and then it wasn’t really about programming. It was about bigger ideas, and so then when I started the other books, the premise of the books is that you’re using computation as a tool to learn something else. It’s a way of thinking, it’s an approach to the topic, and so that’s how we got to the schema that’s always Think Something, for various values of Something.
Computation
Hugo: Right. I like that a lot, and speaking to this idea of computation, I know you’re a huge proponent of the role of computation in helping us to think, so maybe you can speak to that for a minute.
Allen: Sure. I mean, it partly comes … I’ve been teaching in an engineering program, and engineering education has been very math-focused for a long time, so the curriculum, you have to take a lot of calculus and linear algebra before you get to do any engineering, and it doesn’t have to be that way at all. I think there are a lot of ideas in engineering that you can get to very quickly computationally that are much harder mathematically.
Allen: One of the examples that comes up all the time is integration, which is a little bit of a difficult idea. Students, when they see an integral sign, immediately there’s gonna be some challenge there, but if you do everything discretely, you can take all of those integrals, you just turn them into summations, and then if you do it computationally, you take all of the summations and turn them into for loops, and then you can have very clear code where you’re looping through space, you’re adding up all of the elements. That’s what an integral is.
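For example, here is a minimal sketch of that idea in Python (illustrative code, not from any of Allen’s books): the integral of f(x) = x² over [0, 1] becomes a loop that adds up the areas of thin rectangles.

```python
def integrate(f, low, high, step=0.001):
    """Approximate the integral of f from low to high as a discrete sum."""
    total = 0.0
    x = low
    while x < high:
        total += f(x) * step  # area of one thin rectangle
        x += step
    return total

# The exact answer is 1/3; the loop gets close.
print(integrate(lambda x: x**2, 0, 1))
```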
Hugo: Absolutely, and I think another place where this rears its head, one that you’ve thought about a lot and that a lot of us have worked in, is the idea of using computation and sampling and re-sampling of datasets to get an idea about statistics. Right?
Allen: Right. Yeah. I think classical statistical inference, looking at things like confidence intervals and hypothesis tests, re-sampling is a very powerful tool. You’re running simulations of the system, and you can compute things like sampling distribution or a p-value in a very straightforward way, meaning that it’s easy to do, but it also just makes the concept transparent. It’s really obvious what’s going on.
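As a concrete illustration (a sketch with made-up data, assuming NumPy), resampling the observed data with replacement and recomputing the statistic each time gives you the sampling distribution directly, and a confidence interval falls out of its percentiles:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(100, 15, size=200)  # stand-in for an observed sample

# Bootstrap: simulate the sampling process by resampling with replacement
# and recomputing the statistic each time.
means = [rng.choice(data, size=len(data), replace=True).mean()
         for _ in range(10_000)]

# The simulated means are the sampling distribution of the mean;
# the 2.5th and 97.5th percentiles give a 95% confidence interval.
print(np.percentile(means, [2.5, 97.5]))
```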
Hugo: That’s right, and you actually … We’ve had a segment on the podcast previously, which is … It’s blog post of the week, and we had one on your blog post, There Is Only One Test, which really spells out the idea that in the world of statistical hypothesis testing, there is really only one test, and, this is one of your great points, you can actually see that when you take the sampling, re-sampling, bootstrapping approach. Right?
Allen: Right. Yeah. I think it makes the framework visible, that hypothesis tests, there’s a model of the null hypothesis, and that’s gonna be different for different scenarios, and there’s the test statistic, and that’s gonna be different for different scenarios, but once you’ve specified those two pieces, everything else is the same. You’re running the same framework. So, I think it makes the concept much clearer.
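A sketch of that framework in code (an illustration of the idea, not Allen’s exact implementation): the model of the null hypothesis here is “the group labels don’t matter, so shuffle them,” and the test statistic is a function you can swap out per scenario; everything else stays the same.

```python
import numpy as np

rng = np.random.default_rng(2)

def p_value(group1, group2, test_stat, iters=10_000):
    """The 'one test' framework: a null model plus a test statistic."""
    observed = test_stat(group1, group2)
    pooled = np.concatenate([group1, group2])
    count = 0
    for _ in range(iters):
        # Model of the null hypothesis: the labels don't matter,
        # so shuffle the pooled data and re-split it.
        rng.shuffle(pooled)
        sim1, sim2 = pooled[:len(group1)], pooled[len(group1):]
        if test_stat(sim1, sim2) >= observed:
            count += 1
    return count / iters

# The test statistic is the piece that changes per scenario.
def diff_in_means(a, b):
    return abs(a.mean() - b.mean())

print(p_value(rng.normal(0, 1, 50), rng.normal(0.5, 1, 50), diff_in_means))
```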
Hugo: Great, and we’ll link to that in the show notes. We’ll also link to your fantastic followup post called “There Is Still Only One Test”.
Allen: Well, that’s just because I didn’t explain it very well the first time, so I had to try again.
How did you get into data science?
Hugo: It also proves the point, though, that there is still only one test, and I’ll repeat that, that there is still only one test. So, how did you get into data science originally?
Allen: Well, my background is computer science, so there are a lot of ways, a lot of doors into data science, but I think computer science is certainly one of the big ones. I did … My master’s thesis was on computer vision, so that was kind of a step in that direction. My PhD was all about measuring and modeling computational systems, so there are a lot of things that come in there like long tail distributions, and then in 2009 I did a sabbatical, and I was working at Google in a group that was working on internet performance, so we were doing a lot of measurement, modeling, statistical descriptions, and predictive modeling, so that’s kind of where it started to get serious, and that’s where I started when I was working on Think Stats for the first time.
Hugo: So, this origin story of you getting involved in data science I think makes an interesting point, that you’ve actually touched a lot of different types of data, and I know that you’re a huge fan of the idea that data science isn’t necessarily only for data scientists, that it actually could be of interest to everyone, because there are so many touch points between data science and the way we live. Right?
Allen: Right. Yeah. This is one of my things that I get a little upset about, is when people talk about data science, and then they talk about big data, and then they talk about quantitative finance and business analytics, like that’s all there is, and I use a broader notion of what data science is. I’d like to push the idea that it’s any time that you’re using data to answer questions and to guide decision making, because that includes a lot of science, which is often about answering questions, a lot about engineering where you’re designing a system to achieve a particular goal, and of course, decision making, both on an individual or a business or a national public policy level. So, I’d like to see data science involved in all of those pieces.
Hugo: Absolutely. So, we’re here to talk about uncertainty today. One part of data science is making predictions, which we’ll get to, but the fact that we live in an uncertain world is incredibly interesting, because as a culture and a society we use probability to think about uncertainty, so I’m wondering your thoughts on whether we humans are actually good at thinking probabilistically.
Allen: Right. It’s funny because we are and we are not at the same time.
Hugo: I’m glad you didn’t say we probably are.
Allen: Right. Yeah. That would’ve been good. So, we do seem to have some instinct for probabilistic thinking, even for young children. We do something that’s like a Bayesian update. When we get new data, if we’re uncertain about something, we get new evidence, we update our beliefs, and in some cases we actually do a pretty good approximation of an accurate Bayesian update, typically for things that are kind of in the middling range of probability, maybe from about 25% to 75%. At the same time, we’re terrible at very rare things. Small probabilities we’re pretty bad at, and then there are a bunch of ways that we can be consistently fooled because we’re not actually doing the math. We’re doing approximations to it, and those approximations fail consistently in ways that behavioral psychologists have pointed out, things like confirmation bias and other cognitive failures like that.
“Why Are We So Surprised?”
Hugo: Absolutely. So, I want to speak to an article you wrote on your blog called Why Are We So Surprised?, in which you stated, “In theory, we should not be surprised by the outcome of the 2016 presidential election, but in practice, we are.” So, I’m wondering why you think we shouldn’t have been surprised.
Allen: Right. Well, a lot of the forecasts, a lot of the models coming from FiveThirtyEight and from The New York Times, they were predicting that Trump had about a 25% chance, maybe more, of winning the election. So, if something’s got a 25% chance, that’s the same as flipping a coin twice and getting heads twice. You wouldn’t be particularly surprised by that. So, in theory a 25% risk shouldn’t be surprising, but in practice, I think people still don’t really understand probabilistic predictions.
Allen: One reason we can see that is the lack of symmetry, which is, if I tell you that Trump has a 25% chance of winning, you think, “Well, okay. That might happen,” but when FiveThirtyEight said that Hillary Clinton had a 70% chance of winning, I think a lot of people interpreted that as a deterministic prediction, that FiveThirtyEight was saying, “Hillary Clinton is going to win,” and then when that didn’t happen, they said, “Well, then FiveThirtyEight was wrong,” and I don’t think that’s the right interpretation of a probabilistic prediction. If someone tells you there’s a 70% chance and it doesn’t happen, that should be mildly surprising, but it doesn’t necessarily mean that the prediction was wrong.
Hugo: Yeah, and in your article, you actually make a related point that everybody predicted at some level, well, predicted that Hillary had over a 50% chance of winning, and you made the point that people interpreted this as there was consensus that Hillary would win with different degrees of confidence, but that’s … So, as you stated, that’s interpreting it as deterministic predictions, not probabilistic predictions. Right?
Allen: Yeah, I think that’s right, and it also … It fails the symmetry test again because different predictions, they ranged all the way from 70% to 99%, and people reacted as if that was a consensus, but that’s not a consensus. If you flip it around, that’s the range from saying that Trump has anywhere between 1% and 30% chance of winning, and if the predictions had been expressed that way, I think people would’ve looked at that and said, “Oh, clearly there’s not a consensus there, because there’s a big difference between 1% and 30%.”
Hugo: I really like this analogy to flipping coins, because it puts a lot of things in perspective, and another example, as you mention in your article, The New York Times gave Trump a 9% chance of winning, and if you flip a coin four times in a row and get four heads, that’s relatively surprising, but you wouldn’t be like, “Oh, I can’t believe that happened,” and that has a 6.25% chance of happening. Right?
Allen: Right. Yeah, I think that’s a good way to get a sense for what these probabilities mean.
Hugo: Absolutely. So, you mentioned also that these models were actually relatively credible models, so maybe you can speak to that.
Allen: Yeah. I think going in, two reasons to think that these predictions were credible, one of them was just past performance, that FiveThirtyEight and The New York Times had done well in previous elections, but maybe more important, their methodology was transparent. They were showing you all of the poll data that they were using as inputs, and I think they weren’t actually publishing the algorithms, but they gave a lot of detail about how these things were working. Some polls are more believable than others. They were applying correction factors, and they also had … They were taking time into account. So, a more recent poll would be weighted more heavily than a poll that was farther into the past. So, all of those, I think ahead of the fact, we had good reasons to believe the predictions, and after the fact, even though the outcome wasn’t what we expected, that really just doesn’t mean that the models are wrong.
Hugo: So, with all of this knowledge around how uncertain we are about uncertainty and how we can be good and bad about thinking probabilistically, what approaches can we as a data reporting community take to communicate around uncertainty better in the future?
Allen: Right. I think we don’t know yet, but one of the things that I think is good is that people are trying a lot of different things. So, again, taking the election as an example, The New York Times had the twitchy needle that was sort of famously maybe not the best way to represent that information. There were other examples. Nate Silver’s predictions are based on running many simulations, so he would show a histogram of the outcomes of many, many simulations, and that I think probably works for some audiences. I think it’s tough for other audiences.
Allen: One of the suggestions I made that I would love to see someone try is instead of running many simulations and trying to summarize the results, I’d love to see one simulation per day with the results of one simulation presented in detail. So, thinking back to 2016, suppose that every day you looked in the paper, and it showed you one possible outcome of the election, and let’s say that Nate Silver’s predictions were right, and there was a 70% chance that Clinton would win. So, in a given week, you would see Clinton win maybe four or five times. You would see Trump win two or three times, and I think at the end of that week, your intuition would actually have a good sense for that probability.
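A sketch of what that daily simulation might look like (hypothetical code, using the 70% figure from the example):

```python
import random

random.seed(2016)
P_CLINTON = 0.70  # the probability from the example

# One simulated outcome per day for a week: over a week you'd expect
# to "experience" roughly five Clinton wins and two Trump wins.
for day in range(1, 8):
    winner = "Clinton" if random.random() < P_CLINTON else "Trump"
    print(f"Day {day}: simulated winner is {winner}")
```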
Hugo: I think that’s an incredible idea, because what it speaks to for me personally is you’re not really looking at these simulations or these results in the abstract. You’re actually experiencing them firsthand in some way.
Allen: Exactly. So, you get the emotional effect of opening the paper and seeing that Trump won, and if that’s already happened a few times in simulation, then the reality would be a lot less surprising.
Hugo: Absolutely. Are there any other types of approaches or ways of thinking that you’d like to see more in the future?
Allen: Well, as I said, I think there are a lot of experiments going on, so I think we will get better at communicating these ideas, and I think the audience is also learning. Different visualizations that wouldn’t have worked very well a few years ago now work because people are, I think, just better at interpreting data and interpreting visualizations, because it’s become part of the media in a way that it wasn’t. If you look back not that long ago, I don’t know if you remember when USA Today started doing infographics, and that was a thing. People were really excited about those infographics, and you look back at those things now, and they’re terrible. It’ll be like-
Hugo: Mm-hmm (affirmative). We’ve come a long way.
Allen: It’s something that’s really just a bar chart, except that the bar is made up of stacked up apples and stacked up oranges, and that was data visualization, say, 20 years ago, and now you look at the things that The New York Times is doing with interactive visualizations. I saw one the other day, which is their three-dimensional visualization of the yield curve, which is a tough idea in finance and economics, and a 3-D visualization is tough, and interactive visualization is challenging, so maybe it doesn’t work for every audience, but I really appreciated just the ambition of it.
Hugo: So, you mentioned the role of data science in decision making in general, and I think in a lot of ways, we make decisions based on all the data we have, and then a decision is made, but a lot of the time, the quality of the decision will be rated on the quality of the outcome, which isn’t necessarily the correct way to think about these things. Right?
Allen: Right. I gave an example about Blackjack, that you can make the right play in Blackjack. You take a hit when you’re supposed to take a hit, and if you go bust, it’s tempting to say, “Oh. Well, I guess I shouldn’t have done that,” but that’s not correct. You made the right play, and in the long run that’s the right decision. Any specific outcome is not necessarily gonna go your way.
Hugo: Yeah, but we know that in that case because we can evaluate the predictions based on the theory we have and the simulations we have in our mind or computationally. Right? On long-term rates, essentially.
Allen: Right. Yeah. Blackjack is easy because every game of Blackjack is kind of the same, so you’ve got these identical trials. You’ve got long-term rates. We have a harder time with single-case predictions, single-case probabilities.
Hugo: Like election forecasting?
Allen: Like elections, right, but in that case, right, you can’t evaluate a single prediction. You can’t say specifically whether it’s right or wrong, but you can evaluate the prediction process. You can check to make sure that probabilistic predictions are calibrated. So, maybe getting back to Nate Silver again, in The Signal and the Noise, he uses a nice example, which is the National Weather Service, which is, they make probabilistic predictions. They say, “20% chance of rain, 80% chance of rain,” and on any given day, you don’t know if they were wrong.
Allen: So, if they say 20% and then it rains, or if they say 80% and it doesn’t rain, that’s a little bit surprising, but it doesn’t make them wrong. But in the long run, if you keep track of every single time that they say 20%, and then you count up how many times it actually rains on 20% days, and how many times it rains on 80% days, if the answers are 20% and 80%, then that’s a well-calibrated probabilistic prediction.
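That long-run check is easy to express in code. A sketch (with simulated forecasts standing in for a real forecast log): group the days by the stated probability and compare it to the observed frequency of rain.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical forecast log: a stated probability of rain for each day,
# and whether it actually rained (here, a perfectly calibrated world).
stated = rng.choice([0.2, 0.5, 0.8], size=5000)
rained = rng.random(5000) < stated

# Calibration check: among days with the same stated probability,
# how often did it actually rain?
for p in [0.2, 0.5, 0.8]:
    days = stated == p
    print(f"said {p:.0%}: rained {rained[days].mean():.0%} "
          f"of {days.sum()} days")
```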
Where is uncertainty prevalent in society?
Hugo: Absolutely. So, this is another example. The weather is one. We’ve talked about election forecasting, and these are both examples where we really need to think about uncertainty. I’m wondering what other examples there are in society where we need to think about uncertainty, and why they’re important.
Allen: Yep. Well, a big one … Anything that’s related to health and safety, those are all cases where we’re talking about risks, we’re talking about interventions that have certain probabilities of good outcomes, certain probabilities of side effects, and those are other cases, I think, where sometimes our heuristics are good, and other times we make really consistent cognitive errors.
Hugo: There are a lot of cognitive biases, and one that I fall prey to constantly is, I’m not even sure what it’s called, but it’s when I have a small sample size, and I see something occur several times, and I’m like, “Oh, that’s probably the way things work.”
Allen: Right. Yeah. I guess that’s a form of over-fitting. In statistics, there’s sort of a joke that people talk about the law of small numbers, but that’s right. I think that’s a version of jumping to conclusions. That’s an example where I think doctors have had a version of that in the past, which is they make decisions often about treatment that are based on their own patients, so, “Such-and-such a drug has worked well for my patients, and I’ve seen bad outcomes with my patients,” as contrasted with using large randomized trials, which we’ve got a lot of evidence now that randomized trials are a more reliable form of evidence than the example that you gave of generalizing from small numbers.
Hugo: So, health and safety, as you said, are two relevant examples. What can we do to combat this, do you think?
Allen: That one’s tough. I’m thinking about some of the ways that we get health wrong, some of the ways that we get safety wrong. Certainly, one of the problems is that we’re very bad at small risks, small probabilities. There’s some evidence that we can do a little bit better if we express things in terms of natural frequencies, so if I tell you that something has a 0.01% probability, you might have a really hard time making sense of that, but if I tell you that it’s something like one person out of 10,000, then you might have a way to picture that. You could say, “Well, okay. At a baseball game, there might be 30,000 people, so there could be three people here right now who have such-and-such a condition.” So, I think expressing things in terms of natural frequencies might be one thing that helps.
Hugo: Interesting. So, essentially, these are, I suppose, linguistic technologies and adopting things that we know work in language.
Allen: Yeah, I think so. I think graphical visualizations are important, too. Certainly, we have this incredibly powerful tool, which is our vision system, that’s able to take a huge amount of data and process it quickly, so that’s, I think, one of the best ways to get information off a page and into someone’s brain.
Hugo: Yeah. Look, this actually just reminded me of something I haven’t thought about in years, but it must’ve been 10 or 15 years ago, I was at an art show in Melbourne, Australia, and there was an artwork that visualized how many people had been in certain situations or done certain things, using grains of rice. So, they had a bowl for the total population of Australia, the total population of the US, and then the number of people who were killed during the Holocaust and the number of people who’ve stepped on the moon, and that type of stuff, and it was actually incredibly vivid and memorable, and you got a strong sense of magnitude there.
Allen: Yes. I think that works. There’s a video I saw, we’ll have to find this and maybe put in a link, about war casualties, showing a little individual person for each casualty, but then adding it up and showing colored rectangles for the casualties in different wars and the number of people from each country, and that was very effective. And then I’m reminded that XKCD has done several really nice examples to show the relative sizes of things, just by mapping them onto area on the page. One of the ones that I think is really good is different doses of radioactivity, where he was able to show many different orders of magnitude by starting with a small unit that was represented by a single square, and then scaling it up, and then scaling it up, so that you could see that there are orders of magnitude between things like dental x-rays, which we really should not be worrying about, and other kinds of exposure that are actual health risks.
Uncertainty Misconceptions
Hugo: Incredible. So, what are the most important misconceptions regarding uncertainty that we, as data-oriented educators, need to correct?
Allen: Right. Well, we talked about probabilistic predictions. I think that’s a big one. I think the other big one that I think about is the shapes of distributions, that when you try to summarize a distribution, if I just tell you the mean, then people generally assume that it’s something like a bell-shaped curve, and we have some intuition for what that’s like, that if I tell you that the average human being is about 165 centimeters tall, or I think it’s more than that, but anyway, you get a sense of, “Okay. So, probably there are some people who are over 200, and probably there are some people who are less than 60, but there probably isn’t anybody who is a kilometer tall.” We have a sense of that distribution.
Allen: But then you get things like the Pareto distribution, and this is one of the examples I use in my book, is what I call Pareto World, which is same as our world, because the average height is about the same, but the distribution is shaped like a Pareto distribution, which is one of these crazy long-tailed distributions, and in Pareto World, the average height is between one and two meters, but the vast majority of people are only a centimeter tall, and if you have seven billion people in Pareto World, the tallest one is probably a hundred kilometers tall.
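A quick way to get a feel for Pareto World is to simulate it (a sketch with illustrative parameters, not necessarily the ones Allen uses in his book):

```python
import numpy as np

rng = np.random.default_rng(42)

# Classic Pareto with minimum height xm and tail exponent alpha;
# these parameters are chosen only so the mean lands near human scale.
xm, alpha = 0.6, 1.7  # meters

n = 10_000_000
heights = xm * (1 + rng.pareto(alpha, size=n))

print(f"analytic mean: {alpha * xm / (alpha - 1):.2f} m")
print(f"median:        {np.median(heights):.2f} m")
print(f"tallest of {n:,}: {heights.max():,.0f} m")  # kilometers tall
```

Even with “only” ten million people, the tallest is measured in kilometers; scale up to seven billion and the tail gets far more extreme.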
Pareto Distributions
Hugo: That’s incredible, and just quickly, what types of phenomena are Pareto distributions known to model?
Allen: Right. Well, I think wealth and income are two of the big ones. In fact, I think that’s the original domain where Pareto was looking at these long-tailed distributions, and that’s the case where a few people have almost all of the wealth, and the vast majority of people have almost none. So, that’s a case where if I tell you the mean and you are imagining a bell-shaped distribution, you have totally the wrong picture of what’s going on. The mean is really not telling you what a typical person has. In fact, there may be no typical person.
Hugo: Absolutely, and in fact, that’s a great example. Another example is if you have a bimodal distribution with nothing in the middle: the mean lands right in that empty middle, so there could actually be no one with that particular quantity of whatever we’re talking about.
Allen: Yeah, that’s a good example.
Hugo: So Allen, when you were discussing the Pareto distribution and the normal distribution, something really struck me: as stakeholders and decision makers and research scientists and data scientists, we seem to be more comfortable thinking about summary statistics and concrete numbers instead of distributions. What I mean by that is, we like to report the mean, the mode, the median, and measures of spread such as the variance. And there seems to be some sort of discomfort we feel, and we’re not great at thinking about distributions, which seem kind of necessary to quantify and think about uncertainty.
Allen: No, I think that’s right. It doesn’t come naturally. You know, I work with students. It takes awhile to just understand the idea of what a distribution is. But I think it’s important because it captures all of the information that you have about a prediction. You want to know all possible outcomes, and the probability for each possible outcome. That’s what a distribution is. It captures exactly the information that you need as a decision maker.
Hugo: Exactly. So, I mean, instead of communicating, for example, P-values in hypothesis testing, we can actually show the distribution of the possible effect sizes, right?
Allen: Right, and this is the strength of Bayesian methods, because what you’ve got is a posterior distribution that captures this information. And if you now feed that into a decision making process, it answers all the questions that you might want to ask. If you only care about the central tendency you can get that, but very often there’s a cost function that says, you know, if this value turns out to be very high, there’s a cost associated with that. If it’s low, there’s a cost associated with that. So if you’ve got the whole distribution, you can feed that into a cost benefit analysis and make better decisions.
Hugo: Absolutely. And I love the point that you made, which I think about a lot of the time, and when I teach Bayesian thinking and Bayesian inference, I make this incredibly explicit all the time, that from the posterior, from the distribution, you can get out so many of the other things that you need and you would want to report.
Allen: Right, so maybe you care, you know, what’s the probability of a given catastrophic output. So, in that case you would be looking at, you know, the tails of that distribution. Or something like, you know, what’s the probability that I’ll be off by a certain amount or again, you know, things like the mean and the spread. Whatever the number is, you can get it from the distribution.
What technologies are best suited for thinking and communicating around uncertainty?
Hugo: Absolutely. And this is actually … this leads to another question which I wanted to talk about. Bayesian inference I think of in a number of ways, as a technology that we’ve developed to deal with these types of questions and concepts. I think also we have reached a point in the past decades where Bayesian inference now, because of computational power we have, is actually far more feasible to do in a robust and efficient manner. And I think we may get to that in a bit. But I’m wondering in general, so what technologies, to your mind, are best suited for thinking and communicating around uncertainty, Allen?
Allen: Well, you know, a couple of the visualizations that people use all the time, and of course, you know, the classic one is a histogram. And that one, I think, is most appropriate for a general audience. Most people understand histograms. Violin plots are kinda similar, that’s just two histograms back-to-back. And I think those are good because people understand them, but they can be problematic. I mean, I’ve seen a number of articles where people point out that you kinda have to get histograms right. If the bin size is too big, then you’re smoothing away a lot of information that you might care about. If the bin size is too small, you’re getting a lot of noise, and it can be hard to see the shape of the distribution through the noise.
Allen: So, one of the things I advocate for is using CDFs instead of histograms, or PDFs, as the default visualization. And when I’m exploring a data set, I’m almost always looking at CDFs because you get the best view of the shape of the distribution, you can see modes, you can see central tendencies, you can see spread. But also if you’ve got weird outliers, they jump out, and if you’ve got repeated values, you can see those clearly in a CDF, with less visual noise that distracts you from the important stuff. So I love CDFs. The only problem is that people don’t understand them. But I think this is another case where the audience is getting educated, that the more people are consuming data journalism, the more they’re seeing visualizations like this. And there’s some implicit learning that’s going on.
Allen: I saw one example very recently, someone showing the altitude that human populations live at, ’cause they were talking about sea levels rising and talking about the fraction of people who live less than four meters above sea level. But the visualization was kind of a sneaky CDF: what they showed was actually a CDF sideways. But it was done in a way where a person who doesn’t necessarily have technical training would be able to figure out what that graph was showing. So I think that’s a step in a good direction.
Hugo: I like that a lot. And just to clarify, a CDF is a cumulative distribution function?
Allen: Yes. Sorry, I should’ve said that.
Hugo: Yeah.
Allen: And in particular I’m talking about empirical CDFs, where you’re just taking it straight from data and generating the cumulative distribution function.
Hugo: Fantastic. And one of the nice things there is that for each point on the x-axis, the y value corresponds to the fraction of data points less than or equal to that particular point. And one of the great things is, you can also read off all your percentiles, right?
Allen: Exactly, right. You can read it in both directions. So, if you start on the y-axis, you can pick the percentile you want, like the median, the 50th percentile, and then read off the corresponding x value. Or, the flip side is exactly what you said: if you want to know what fraction of the values are below a certain threshold, then you just look up that threshold and read off the corresponding y value.
Hugo: Yeah. And one of the other things that I love, you mentioned a bunch of, well several very attractive characteristics of empirical CDF, ECDFs. I also love that you can plot, you know, your control and a lot of different experiments just on the same figure and actually see how they differ, as opposed to you try to plot a bunch of histograms together, you gotta do wacky transparencies and all this stuff, right?
Allen: Yes, that’s exactly right. And you can stack lots of CDFs on the same axes, and the differences that you see are really the differences that matter. When you compare histograms, you’re seeing a lot of noise and you can see differences between histograms that are just random. When you’re looking at CDFs, you get a pretty robust view of what the differences are and where in the distribution those differences happen.
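A minimal ECDF sketch along those lines (illustrative data; assumes NumPy and matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(sample):
    """Empirical CDF: sorted values vs. fraction at or below each one."""
    xs = np.sort(sample)
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

rng = np.random.default_rng(5)
control = rng.normal(10, 2, 500)
treated = rng.normal(11, 2, 500)

# Several ECDFs overlay cleanly on the same axes: no bin sizes to tune,
# no transparency tricks.
for label, sample in [("control", control), ("treated", treated)]:
    xs, ys = ecdf(sample)
    plt.plot(xs, ys, label=label)
plt.xlabel("value")
plt.ylabel("fraction <= value")
plt.legend()
plt.show()

# Reading it both ways: the median of the control group...
print(np.percentile(control, 50))
# ...and the fraction of control values at or below a threshold of 12.
print((control <= 12).mean())
```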
Hugo: Yeah. Fantastic. Look, I’m very excited for a day in which the general populace appreciates CDFs and they appear in the mainstream media. I think that’s a bright future.
Allen: Yeah, and I think we’re close. I’ve seen one example, there have got to be more.
Hugo: Are there any other technologies or ways of thinking about uncertainty that you think are useful?
Allen: Well, we talked a little bit about visualizing simulations. I think that matters. There’s one example, maybe getting back to … if we have to get back to the 2016 election, I think one of the issues that came up is that a lot of the predictions, when they showed you a map of the different states, used a color scale where there would be a red state and a blue state, but also pink and light blue and purple. They were trying to show uncertainty using that color map, but, you know, that’s not how the electoral college works. In the electoral college, every state is either all red or all blue, with just a couple of exceptions. So that was a case where the predictions ended up looking very different from what the final results looked like, and I think that’s part of why we were uncomfortable with the predictions and the results.
Hugo: Interesting. So what is a fix for that, do you think?
Allen: Well, again coming back to my suggestion about, you know, don’t try to show me all possible simulation outcomes, but show me one simulation per day. And in that case, the result that you show me, the daily result, would be all red or all blue. So, the predictions in that sense would look exactly like the outcome. And then when you see the outcome, the chances are that it’s gonna resemble at least one of the predictions that you made.
Hugo: Great. Now I just had kind of a future flash, a brainwave into a future where we can use virtual reality technologies to drop people into potential simulations. But that’s definitely future music.
Allen: Yes. I think that’s interesting.
What does the future of data science look like to you?
Hugo: Yeah. So speaking of the future, we’ve talked a lot about modern data science and uncertainty. I’m wondering what the future of data science looks like to you?
Allen: I think a big part of it looks like more people being involved. So not just highly trained technical statisticians but, as we’ve been saying, data journalists, for example: people who have the technical skill to look at data, but also the storytelling skill to ask interesting questions, get answers, and then communicate those answers. I’d love to see all of that become more a part of general education, starting in primary school and secondary school: working with data, working with some of these visualizations we’ve been talking about, using data to answer questions, using data to explore and find out about the world, at whatever stage is appropriate at different levels of education.
Allen: There’s a lot of talk about trying to get maybe less calculus in the world and more data science, and I think that’s gotta be the direction we go. If you look at what people really need to know and what they’re likely to use, practically everybody is going to be a consumer of data science and I think more and more people are gonna be producers of data science. So I think that’s gotta be part of a core education. And calculus, I love calculus. But, it’s just not as important for as many people.
Hugo: Yeah. And arguably, coming from your engineering background, I mean, calculus is incredibly important for engineers and physicists, but for other people who need to be quantitative, I think your point is very strong: learning how to actually work with data, and the statistics around that, is arguably a lot more essential.
Allen: Yeah. I think, as I said, more and more people are gonna be doing at least some kind of data science where they’re taking advantage of all of the data now that’s freely available, and that’s, you know, government agencies are producing huge volumes of data and often they don’t have the resources to really do anything with it. They’ve got a mandate to produce the data, but they don’t have the people to do that. But the flip side of that is there’s a huge opportunity for anyone with basic data skills to get in there and find interesting things. Often, you’re one of the first people to explore a data set, you know, if you jump in there on the day it’s published, you can find all kinds of things, not necessarily using, you know, powerful or complex statistical methods, just basic exploratory data analysis.
Hugo: Yeah, and the ability now to get, you know, learners, students, people in educational institutions involved in data science, by letting them realize that it’s relevant to them, that there’s data about their lives or about their physiological systems that they can analyze and explore, I think, is a huge win.
Allen: It is. It’s really empowering, and this is one of the reasons that I … I call myself a data optimist. And what I mean by that is I think there are huge opportunities here to use data science for social good. Getting into these data sets, as you said, they are relevant to people’s lives. You can find things. I saw a great example at a conference recently. I was talking to a young guy from Brazil who had worked on an application that was going through government data that was available online and flagging evidence of corruption, evidence of budgets that were being misspent. And they would tweet about it. There was just a robot that would find suspicious things in these accounts and tweet them out there, which is, you know, the kind of transparency that I think makes governments better. So I think there’s a lot of potential there.
Hugo: That’s incredible. Actually, that reminded me. I met a lawyer a while ago who was non-technical and non-computational, but he was learning a bit of machine learning, a bit of Python. He was trying to figure out whether you could predict judgements handed down by the Supreme Court based on previous judgements, and who would vote in a particular way. And that’s just because that’s something that really interests him professionally, and in terms of social justice as well.
Allen: Right. And I think, you know, people who are not necessarily experts in that field, amateurs for lack of a better word, can get in there and really do useful work. I think, you know, there are a lot of concerns, too. And this is getting a lot of attention right now. I’m actually in the middle of reading Weapons of Math Destruction, Cathy O’Neil’s book. And there are a lot of concerns, and I think there are things that are scary that we should be thinking about, but one of the things I’m actually thinking about now and trying to figure out is, how do we balance this discussion? ’Cause I think we’re having, or at least starting, a good public discussion about this. It’s good to get the problems on the table and address them, but how do we get the right balance between the optimism that I think is appropriate, but also the concerns that we should be dealing with?
Hugo: Yeah, absolutely. And as you say, there are more and more books being published, more and more conversations happening in public. Just in the past several weeks, Mike Loukides, Hilary Mason, and DJ Patil have posted their series of articles on data ethics and what they would like to see adopted in culture and in tech, among other places. I do think Weapons of Math Destruction is very interesting as part of this conversation, because of course one of the key parts of Cathy O’Neil’s definition of a Weapon of Math Destruction is that it’s not transparent, right? So all the cases we’re talking about involve transparency as a necessary ingredient, so if we see more of that going forward, we’ll at least be able to have a conversation around it.
Allen: Right, and I agree with both O’Neil and with you. I think that’s a crucial part of these algorithms, and, you know, open science and reproducible science are based on transparency: open data, but also open code and open methodology.
Hugo: Absolutely. And this actually brings me to another question. A through line here is the ability of everybody, every citizen, to interact with data science in some sense. And I’m wondering, for you in your practice as a data scientist and an educator, what is the role of open source in the ability of everybody to interact with data science?
Allen: Right, I think it’s huge. You know, reproducible science doesn’t work if your code is proprietary. If you only share your data but not your methods, that only goes so far. It also doesn’t help very much if I publish my code but it’s in a language that’s not accessible to everybody, you know, languages that are very expensive to get your hands on. Even in relatively affluent countries, you’re not necessarily gonna have access to that code, and then when you go worldwide, the great majority of people in the world are not gonna have access to it, as contrasted with languages like R and Python that are freely available. Now, you still have to have access to technology, and that’s not universal, but it’s better, and I think free software is an important part of that.
Hugo: Yeah.
Allen: This is, you know, part of the reason that I put my books up under free licenses: I know that there are a lot of people in the world who are not gonna buy hard copies of these books, but I want to make them available, and I do, you know, get a lot of correspondence from people who are using my books in electronic form, who would not have access to them in hard copy.
Favorite Data Science Technique
Hugo: So, Allen, we’ve talked about a bunch of techniques that are dear to your heart. I’m wondering what one of your favorite data science-y techniques or methodologies is.
Allen: Right. I have a lot.
Hugo: Let’s do it.
Allen: This might not be a short list.
Hugo: Sure.
Allen: So I am at heart a Bayesian. I do a certain amount of computational inference, you know, as you do in classical statistical inference, but I’m really interested in helping Bayesian methods spread. And I think one of the challenges there is just understanding the ideas. It’s one of these ideas that seems hard when you first encounter it, and then at some point there’s a breakthrough, and then it seems obvious. Once you’ve got it, it is such a beautiful, simple idea that it changes how you see everything. So that’s what I want to help readers, and my students, get to: that transition from the initial confusion into that moment of clarity.
Allen: One of the methods I use for that, and this is what I use in Think Bayes a lot, is just grid algorithms where you take everything that’s continuous and break it up into discrete chunks, and then all the integrals become for loops, and I think it makes the ideas very clear. And then I think the other part of it that’s important is the algorithms, particularly MCMC algorithms, which, you know, that’s what makes Bayesian methods practical for substantial problems. You mentioned earlier that, you know, the computational power has become available. And that’s a big part of what makes Bayes practical. But I think the algorithms are just as important, and particularly when you start to get up into higher dimensions. It’s just not feasible without modern algorithms that are really quite new, developed in the last decade or so.
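A minimal grid-algorithm sketch in that spirit (an illustration of the approach, not code from Think Bayes): estimating the probability of heads after observing, say, 140 heads in 250 flips. The continuous parameter is broken into discrete chunks, and a normalizing sum stands in for the integral.

```python
import numpy as np

ps = np.linspace(0, 1, 1001)           # grid of hypotheses for p
prior = np.ones_like(ps)               # uniform prior
likelihood = ps**140 * (1 - ps)**110   # binomial likelihood, up to a constant

posterior = prior * likelihood
posterior /= posterior.sum()           # the sum replaces the integral

print(ps[posterior.argmax()])   # MAP estimate, about 0.56
print(np.sum(ps * posterior))   # posterior mean
```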
Hugo: Yeah. And I just want to speak to the idea of grid methods and, as you said, turning integrals into for loops. I think this is something that has actually been behind a lot of what we’ve been discussing, and something that attracted me to your pedagogy and all of your work initially: this idea of turning math into computation. We see the same with techniques such as the bootstrap and resampling: taking concepts that seem relatively abstract, seeing how they actually play out in a computational structure, and making that translational step.
Allen: Right. Yeah, I’ve found that very powerful for me as a learner. I’ve had that experience over and over, of reading something expressed using mathematical concepts, and then I turn it into code and I feel like that’s how I get to understand it. Partly because you get to see it happening, often it’s very visual in a way that the math is not, at least for me. But the other is it’s debuggable. That if you have a misunderstanding, then when you try to represent it in code, you’re gonna see evidence of the misunderstanding. It’s gonna pop up as a bug. So, when you’re debugging your code, you’re also debugging your understanding. Which, for me, builds the confidence that when I’ve got working code, it also makes me believe that I understand the thing.
Hugo: Absolutely, and a related concept is the idea that breaking it down into chunks of code allows you to understand smaller concepts and build up the entire concept in smaller steps.
Allen: Right, yeah. I think that’s a good point, too.
Hugo: Great. So, are there any other favorite techniques? You can have one or two more if you’d like.
Allen: I’ll mention one, which is survival analysis, partly because it doesn’t come up in an introductory class most of the time, but it’s something I keep coming back to. I’ve used it for several projects, not necessarily looking at survival or medicine, but things like studies of marriage and family patterns: how long it is until someone has a first child, or gets married for the first time, or how long the marriage itself lasts until a divorce. So, as I say, it’s not an idea that everybody sees, but once you learn it, you start seeing a lot of applications for it.
Hugo: Absolutely. And this did make it into your Think Stats book, do I recall correctly, or?
Allen: Yes. Yeah, I’ve got a section on survival analysis.
Call to Action
Hugo: Yeah, fantastic. So I’ll definitely link to that in the show notes, as well. So, my last question is, do you have a call to action for our listeners out there?
Allen: Maybe two. I think if you have not yet had a chance to study data science, you should. And I think there are a lot of great resources that are available now that just weren’t around not too long ago. And especially if you took a statistics class in high school or college and it did not connect with you, the problem is not necessarily you. The standard curriculum in statistics for a long time, I think, has just not been right for most people. It spends way too much time on esoteric hypothesis tests, and it gets bogged down in some statistical philosophy that’s actually not very good philosophy, and not very good science either.
Allen: If you come back to it now from a data science point of view, it’s much more likely that you’re gonna find classes and educational resources that are much more relevant. They’re gonna be based on data. They’re gonna be much more compelling. So give it another shot. I think that’s my first call to action.
Hugo: I would second that.
Allen: And then the other is, for people who have got data science skills, there are a lot of ways to use that to do social good in the world. I think a lot of data scientists end up doing, you know, quantitative finance and business analytics, those are kinda the two big application domains. And there’s nothing wrong with that, but I also think there are a lot of ways to use the skills that you’ve got to do something good, to, you know, find stories about what’s happening and get those stories out. To, you know, use those stories as a way to effect change. Or if nothing else, just to answer questions about the world. If there’s something that interests you, very often you can find data and answer questions.
Hugo: And there are a lot of very interesting data-for-social-good programs out there. We’ve actually had Peter Bull on the podcast to talk about data for good in general, and I’ll put some links in the show notes as well.
Allen: Yes, and then I’ve got actually a talk that I want to link to that I’ve done a couple of times, and it’s called Data Science, Data Optimism. And the last part of the talk is my call for data science for social good. I’ve got a bunch of links there that I’ve collected, that are just really the people that I know and groups that I know who are working in this area, but it’s not complete by any means. So I would love to hear more from people, and maybe help me to expand my list.
Hugo: Fantastic. And people can reach out to you on Twitter, as well? Is that right?
Allen: Yes. I’m Allen Downey.
Hugo: Fantastic. Allen, it’s been an absolute pleasure having you on the show.
Allen: Thank you very much. It’s been great talking with you.