
Finally understanding what “statistical significance” and p-values mean: A simple example (with R code)

[This article was first published on For-loops and piep kicks, and kindly contributed to R-bloggers.]

One day I realized that I finally really understood what “statistical significance” means (p < .01). I had probably heard the term hundreds of times by then. If you are still struggling with the concept, I hope it doesn’t take you this long and perhaps this post can be of help. The article will be quite lengthy (since previously this was the content of several hours of class), but if you have a few minutes, give it a try.

Let’s assume it is the 2020 US presidential election, and you are tasked with carrying out exit polls in two states to determine which of the candidates won them. We take Georgia and Alaska as examples, since the former was a very close race (Biden won), whereas in the latter Trump won by a big margin.

These are the actual results in these two states:

library(tidyverse)

georgia <- c(rep("Biden", 2473707), rep("Trump", 2461779))
alaska <- c(rep("Biden", 153778), rep("Trump", 189951))

proportions(table(georgia))
proportions(table(alaska))

Which gives us:

Percentages of Trump/Biden votes in Georgia and Alaska

We thus have created two objects “alaska” and “georgia” and stored the actual election results in them. These true results are unknown to you on election day, of course, which is the whole point of conducting an exit poll, asking a random sample of voters who they just voted for.

A sample of 1,000 voters

We start by simulating an exit poll of 1,000 voters in Alaska:

set.seed(2020) 

survey_a1 <- sample(alaska,1000,replace=F)

table(survey_a1) %>% as.data.frame %>%
  ggplot(aes(x=survey_a1, y = Freq, label = Freq)) +
  geom_col(fill=c("blue", "red")) + 
  geom_label(label = paste(proportions(table(survey_a1))*100,"%")) +
  theme_minimal() + 
  ggtitle("Results of our first survey in Alaska (N = 1,000)")

prop.test(table(survey_a1))

What the code does is first take a random sample of 1,000 voters (from our “alaska” object where we stored the true voting results), and then visualize the results and run a significance test. We’ll explain in a bit what the significance test means; let’s first look at the result of running the code:

Figure 1: Result of our first survey in Alaska
Significance test of our first survey (to be explained later)

Why are we not getting the exact true value in our survey?

We know that the true percentage of Biden votes in Alaska was 44.7%. In our survey of 1,000 voters it was 43.6%, so a bit lower. Let’s first establish why the percentage in our sample is not exactly the same as the true value. We disregard here any bias coming from, e.g., social desirability in surveys; our survey respondents are all completely honest (they are simulated, after all). Since we drew a random sample from the population, our result can differ a bit from the true result purely by chance: we may happen to select a few more Biden voters, or a few more Trump voters, than their true shares would suggest.

This is probably easy to understand. The probability of rolling a 6 when throwing a die is 1/6. This does not mean that if you throw a die 600 times, you will always get exactly 100 6’s. Maybe you’ll get a 6 in 98, or 105, or even only 88 out of the 600 throws. But you will intuitively agree that it is more likely that you will get 100 6’s as opposed to, say, 50 or 150.
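If you want to convince yourself of this, here is a quick simulation (a small sketch of my own, not part of the exit-poll example) that throws a die 600 times over and over again and counts the 6’s:

# Throw a fair die 600 times and count the 6's; repeat this 10,000 times
set.seed(1)
sixes <- replicate(10000, sum(sample(1:6, 600, replace = TRUE) == 6))
summary(sixes)                      # the counts center around 100
mean(sixes >= 95 & sixes <= 105)    # share of runs that land close to 100
mean(sixes <= 50 | sixes >= 150)    # counts as extreme as 50 or 150 essentially never occur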

This is what the so-called Central Limit Theorem (CLT) in statistics posits: if you repeatedly draw random samples from a population and calculate an estimate (such as a mean or a proportion) in each sample, these estimates will be approximately normally distributed around the true population value, and their spread shrinks as the sample size grows.

Translated to our example: if we conducted many exit polls of 1,000 voters each in Alaska, the Trump vote shares found in these polls would scatter around the true share of about 55.3%, roughly following a bell curve, with most polls close to the true value and only a few far off.

I think this becomes clearer once you look at the following plots.

Repeating a survey many times

If you had endless time and resources, you could of course conduct a second survey: your first survey indicated that Trump won Alaska, but you want to be sure that you are not, by random chance, a bit above the true Trump vote while in reality Biden won (again, keep in mind that you wouldn’t know the true outcome at the time of the exit poll, which is the whole reason for conducting it).

A second survey:

survey_a2 <- sample(alaska,1000,replace=F)

table(survey_a2) %>% as.data.frame %>%
  ggplot(aes(x=survey_a2, y = Freq, label = Freq)) +
  geom_col(fill=c("blue", "red")) + 
  geom_label(label = paste(proportions(table(survey_a2))*100,"%")) +
  theme_minimal() + 
  ggtitle("Results of our second survey in Alaska (N = 1,000)")
Figure 2: Results of our second simulated survey

In your second survey, too, Trump leads. This time his vote share is a bit lower, but he is still leading by a big margin.

But you really want to know the truth and therefore, you resort to an extreme measure: You conduct 1,000 surveys of 1,000 voters each (thereby terrorizing most of Alaska’s population at least once) and you write down the percentage of Trump vs. Biden votes in each survey, and at the end you count how many of the surveys show Trump in the lead, hoping this will settle it.

You can simulate this with R the following way:

draw_sample <- function(x,n=1000){
  s <- sample(x, n, F)
  proportions(table(s))
}

set.seed(2020)
alaska_sim <- replicate(1000,draw_sample(alaska))

alaska_sim[1:2,1:20]

This code creates a function that gives you a sample of 1,000 voters and returns the percentages of Biden and Trump votes. We then run (replicate) this function 1,000 times and store the results in the object “alaska_sim”. Look at the first couple results of our meta-study:

First few results of 1,000 simulated surveys with 1,000 voters each

The first few results paint a very clear picture: Trump always leads by a large margin. Sometimes it’s 54% to 46%, sometimes 58% to 42%. So there is a certain variability, but we never see Biden in the lead in any of these surveys.

Let’s plot the results from all our 1,000 surveys:

dat <- data.frame(t(alaska_sim))
qplot() + theme_minimal() + 
  geom_histogram(aes(x=dat$Biden),fill="blue", alpha=.5) + 
  geom_histogram(aes(x=dat$Trump), fill="red", alpha=.5) +
  geom_vline(xintercept = mean(dat$Biden), color="blue") +
  geom_vline(xintercept = mean(dat$Trump), color="red") +
  ggtitle("1,000 surveys in Alaska of 1,000 voters each", 
          subtitle="Histograms of simulated values for Trump (red) and Biden (blue) percentages")

c(mean(dat$Biden), mean(dat$Trump))

Which gives us this result:

Figure 3: Distributions of sample mean values in 1,000 simulations (Alaska)

The plot shows two histograms of Biden (blue) and Trump (red) shares of votes. A histogram is a bar chart which shows how often a value occurs in our dataset. For instance, the low red bar at the x = 0.50 mark indicates that there were only 3 out of our 1,000 simulated surveys in which Trump had between 50% and 51% of the votes. By contrast, in more than 150 surveys, the share of Trump votes was 55-56% (high bar at the x = 0.55 mark). Here you can perfectly see what the “Central Limit Theorem” states: the simulated survey results follow a bell-shaped (normal) distribution, and that distribution is centered on the true vote share, so the mean of all survey results is almost exactly the true election result.

The latter can be seen here:

Mean values for Trump and Biden share of votes among our 1,000 Alaska surveys

Notice that the real share of Biden votes in Alaska was 44.73% – the average of all our 1,000 surveys was 44.75% and thus very, very close! Thus, if we could indeed conduct 1,000 surveys and average the results, we would be very close to the true election result. But of course in a real-life scenario you can’t do that; you can usually only conduct one survey. By now we have thus established that a single survey will, purely by chance, deviate somewhat from the true result, but that across many hypothetical surveys the estimates follow a roughly normal distribution centered on the truth.

How can we know if our single survey is close to the truth?

Since we would not know the true value, we could not compute the distributions shown in Figure 3. Thus, if we conduct only one survey, we can’t know if we’re very close to the true value or, by bad luck, very far off.

But: We know that our survey result (share of Trump votes) is somewhere on a normal distribution around the true value (as posited by the Central Limit Theorem and as shown in our simulation above). From the well-known shape and properties of a normal distribution, we can derive how likely it is that our result of 56% Trump votes (in our first survey) is, say, 7 percentage points away from the true value (meaning that actually Biden could be in the lead).

There are two parameters that determine the “width” of the normal distribution: the variability of the underlying variable (for a vote share p, this is p*(1-p)) and the sample size N. The larger the sample, the narrower the distribution.

Here you see a normal distribution with N = 1,000 around the Alaska vote (from our first survey) for Trump:

Figure 4: Normal distribution with 95% interval for N = 1,000

Code for this plot:

vote_share <- .564
N <- 1000
mean <- vote_share*N
sd <- sqrt(vote_share*(1-vote_share)*N)
x <- (mean - 5*sd):(mean + 5*sd)
norm1 <- dnorm(x,mean,sd)
p <- pnorm(x,mean,sd)

qplot() + theme_minimal() +
  geom_line(aes(x=x,y=norm1), color="red") +
  geom_area(aes(x=x,y=norm1),fill="red",alpha=.1) +
  geom_vline(xintercept = x[which.min(abs(p-0.025))], color="red") +
  geom_vline(xintercept = x[which.min(abs(p-0.975))], color="red") +
  ggtitle("Normal distribution with N = 1,000",
          subtitle= paste0("Vertical lines = 95% interval around the mean: ",
                           round(x[which.min(abs(p-0.025))]/N*100,1), "% to ", 
                           round(x[which.min(abs(p-0.975))]/N*100,1),"%"))

In the plot I have marked with the two vertical lines the borders of the so-called 95%-interval. This is the area where 95% of cases are located. Outside (left tail or right tail) are together only 5% of observations. This is important because, as is common convention, we use the 95% interval to denote the values that we still deem “probable”.

The plot shows basically what we already saw in Figure 3 (red histogram), but as a smoothed line (the probability density function). It tells us that values close to our survey estimate of 56.4% are the most plausible, that 95% of the distribution lies between roughly 53.3% and 59.5% (the two vertical lines), and that values at or below the 50% mark are far out in the left tail.

The important message from this analysis is: our confidence interval is far away from the 50% mark. Meaning, we can be very sure that Trump is indeed in the lead over Biden in Alaska. Yes, there is still some uncertainty; the 95% interval is more than six percentage points wide, so we can’t really say whether Trump in reality got 54% or rather 59% of the votes, but we can be extremely sure that Biden has not received as many or more votes than Trump.

There is of course always a small risk that our conclusion is wrong if we base it on the 95% interval. The probability of obtaining a value outside the 95% interval in Figure 4 is, well, 5%. In our example, the probability that Trump has in fact fewer than 50% is very, very low, as you can see from the small tail of the distribution to the left of the 500 mark, it’s around 0.002%. This is the infamous p-value (we’ll discuss this in detail in a minute).
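If you want to check this number yourself, you can compute the tail probability directly with pnorm(), using the same mean and standard deviation as in the code for Figure 4 above:

# Probability mass of the Figure 4 distribution to the left of 500 votes (i.e., Trump not in the lead)
pnorm(500, mean = 0.564*1000, sd = sqrt(0.564*0.436*1000))
# ~2.2e-05, i.e. around 0.002%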

Since this value is well below 5%, we can say “Trump’s lead in Alaska is statistically significant at the 5% level”. In addition to the 5% significance level (which corresponds to the 95% confidence interval), you also sometimes see the 1% or 0.1% significance levels. So, because our p-value is well below 0.1%, we could also say “the difference between Trump and Biden is statistically significant at the 0.1% level”. These three (5%, 1%, 0.1%) are the common significance levels – so even though the probability of obtaining our result if Biden were in fact on par with Trump (0.002%) is even smaller than 0.1%, we wouldn’t write something like “it’s significant at the 0.01% level”.

The effect of larger sample sizes

As posited by the Central Limit Theorem and the law of large numbers (and as you will intuitively agree), a larger sample size leads to a smaller margin of error. Let’s see how large the confidence interval is with a sample of 10,000 voters instead of 1,000:

N <- 10000
mean <- vote_share*N
sd <- sqrt(vote_share*(1-vote_share)*N)
x <- (mean - 5*sd):(mean + 5*sd)   # recompute the grid for the larger sample
norm1 <- dnorm(x,mean,sd)
p <- pnorm(x,mean,sd)

qplot() + theme_minimal() +
  geom_line(aes(x=x,y=norm1), color="red") +
  geom_area(aes(x=x,y=norm1),fill="red",alpha=.1) +
  geom_vline(xintercept = x[which.min(abs(p-0.025))], color="red") +
  geom_vline(xintercept = x[which.min(abs(p-0.975))], color="red") +
  ggtitle("Normal distribution with N = 10,000",
          subtitle= paste0("Vertical lines = 95% interval around the mean: ",
                           round(x[which.min(abs(p-0.025))]/N*100,1), "% to ", 
                           round(x[which.min(abs(p-0.975))]/N*100,1),"%"))

Now we have a normal distribution for a share of 56.4% Trump votes, but obtained among a sample of 10,000 voters:

Figure 5: Normal distribution with mean = 56.4% and N = 10,000

As you can see, the distribution is much narrower compared with Figure 4, where we had only 1,000 voters. The margin of error is much smaller: instead of 56.4% plus or minus 3.1 percentage points, we can now say that Trump’s vote share is 56.4% plus or minus about 1 percentage point. This is obviously a huge improvement. But in reality it would also be much more costly to recruit 10,000 voters, of course. Thus, there is a trade-off between getting results fast and cheap, and getting results that are as accurate as possible.

Keep this in mind when you read something along the lines of “43% of people want X whereas only 41% of people favor Y”, based on a survey of 1,000 people: now you know how big the margin of error usually is for this sample size, and so in reality, people might just as well favor Y over X.

What is a “standard error”?

By the way, you can easily see from the formula for the standard deviation of a binomial distribution mentioned earlier, sqrt(p*q*n) [where p = share of Trump votes, q = share of Biden votes, n = number of observations], why increasing the number of observations leads to a narrower distribution: if n is increased by a factor of 100, the standard deviation increases only by a factor of sqrt(100) = 10. The size of the standard deviation relative to the number of voters has therefore decreased.
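A quick check of this scaling with the numbers from our example (a small sketch, using the 56.4% Trump share from our first survey):

p <- 0.564; q <- 1 - p
sqrt(p*q*1000)              # absolute SD of the vote count: ~15.7 votes out of 1,000
sqrt(p*q*100000)            # ~157 votes out of 100,000, i.e. only 10 times larger
sqrt(p*q*1000)/1000         # relative to the sample size: ~1.6%
sqrt(p*q*100000)/100000     # ~0.16%, i.e. 10 times smaller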

The standard deviation of the distribution of all our (hypothetical) survey results is also called the standard error (SE) of our estimate. This is important because many people confuse standard deviation and standard error.

If we know the distribution of all our 1,000 surveys (because we simulated them ourselves), we can easily get the standard error just from the standard deviation of the distribution of all our repeated experiments. Let’s try this:

dat <- data.frame(t(alaska_sim))
sd(dat$Trump)

Result:

Standard error – obtained as the standard deviation from all our 1,000 simulated results of Trump vote shares

But, if we don’t have thousands of repeated experiments but only one actual survey, we have to estimate the standard error based on our existing data. The estimate for the standard error for a binomial distribution (i.e. two outcomes, Trump or Biden) from a sample is given by the formula sqrt(p*(1-p)/n), where p is the share of Trump votes and n is the sample size (see, e.g., here).

sqrt(0.564*0.436/1000)
Standard error – estimated from one single sample where the share of Trump votes was 56.4%

As you can see, the values are very similar. This is important for us to know, because it means we can infer the standard error from a single survey – without having to simulate or actually conduct thousands.

By the way: If you don’t have a binary outcome (Trump/Biden) but a continuous variable (e.g., income, body weight, etc.), the formula for the standard error of the mean is given by: sd/sqrt(n)
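For illustration, here is a minimal sketch with simulated body weights (the numbers are made up; they are not from the election example):

set.seed(1)
weights <- rnorm(1000, mean = 75, sd = 12)      # 1,000 simulated body weights in kg
se_mean <- sd(weights)/sqrt(length(weights))    # standard error of the mean: sd/sqrt(n)
se_mean
c(mean(weights) - 2*se_mean, mean(weights) + 2*se_mean)  # rule-of-thumb 95% interval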

An important rule of thumb about the normal distribution as shown in Figures 4 and 5 is that the limits of the 95% confidence interval are approximately: mean value +/- 2 * standard error. This means that if you compute the standard error of your survey estimate, you can derive the margin of error by simply multiplying it by 2! (To be a bit more precise, the exact multiplier for the normal distribution is 1.96; in practice the t-distribution is often used instead, which is virtually identical to the normal except for very small sample sizes, where its multiplier is slightly larger than 1.96.)

Try it out:

se = sqrt(0.564*0.436/1000)
c(.564-2*se, .564+2*se)

The result:

Confidence interval calculated by rule of thumb (mean value +/- 2 * standard error)

Compare this with Figure 4 (confidence interval given in the subtitle) – it is indeed a very good approximation of the distribution of sample results if we had repeated the sampling many times.

Summary of this section: If you have a random sample from some population, and you want to know how large the margin of error for your sample estimate is (i.e., if you took many other random samples, how much would they differ), you can compute the standard error in your sample, multiply it by 2, and that is the margin of error (for a 95% interval). We have seen here with simulated data that this rule of thumb is quite accurate, and in an applied use case you therefore can make use of this without having to actually repeat your experiment countless times or simulate data.
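If you find yourself doing this often, you could wrap the rule of thumb in a small convenience function (this helper is my own addition, not from the analysis above):

margin_of_error <- function(p, n) 2 * sqrt(p * (1 - p) / n)   # rule of thumb: 2 * standard error
margin_of_error(0.564, 1000)    # ~0.031, i.e. about +/- 3.1 percentage points
margin_of_error(0.564, 10000)   # ~0.010, i.e. about +/- 1 percentage point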

The null hypothesis and the p-value

You have probably heard these terms often already, so it’s time to properly explain why researchers formulate and test a “null hypothesis” and what the p-value means. We have implicitly done this above, where we asked whether we can be sure that Trump’s vote share in Alaska (56.4% according to our survey) is really above 50% (exactly 50% would mean he is tied with Biden), but we haven’t done it the technically correct way.

In Figure 3, we could nicely see where the true results for Biden vs. Trump are, and where we likely (or unlikely) end up with a survey. Our problem is of course that we don’t know the true values. Therefore, we also don’t know where the distributions shown in Figures 4 and 5 are really located on the X axis – which would be good to know because we want to know how likely it is that Trump is actually in the lead over Biden.

Therefore, we proceed the other way round: we assume, for the sake of the statistical test, that the difference between Trump and Biden among all voters is zero. We can then construct a normal distribution around this null value (a Trump share of 50%) with our known parameters (sample size, standard error), and check how probable it would be to obtain a result like ours under that assumption.

This is the essence of the statistical significance test: If the true effect is zero, what is the probability of obtaining my data from a random sample of size N?

This probability is the infamous p-value. Formally, we want to know: p(D|H0), i.e. the probability (“p-value”) of our data (D) [or a more extreme result] given the null hypothesis (H0).

It’s probably best understood looking at the following graph:

This is a visualisation of the null hypothesis: we assume for a moment that Trump and Biden won an equal number of votes in Alaska, so the difference between them is zero and Trump’s vote share is 50%. Our sample is N = 1,000. Our standard error is, as calculated above, sqrt(0.564*(1-0.564)/N) = 1.568%. The result in our survey was 56.4% (“our data”), which we will compare against the null hypothesis:

N <- 1000
null_hypothesis <- 0.5
our_result <- 0.564
se = sqrt(our_result*(1-our_result)/N)

x <- seq((null_hypothesis - 5*se),(null_hypothesis + 5*se),by=.001)
norm1 <- dnorm(x,null_hypothesis,se)

qplot() + theme_minimal() +
  geom_line(aes(x=x,y=norm1), color="black") +
  geom_area(aes(x=x,y=norm1),fill="gray",alpha=.1) +
  geom_vline(xintercept = our_result, color="red") +
  geom_label(aes(x=our_result,y=max(norm1), label = paste0("Our data:\n",our_result*100,"%"))) +
  ggtitle("Null hypothesis: True Trump votes = Biden votes",
          subtitle="vs. the result of our survey")

(Just a note here, if you are going to perform a significance test yourself, you don’t have to write that much code and calculate standard errors by hand etc. I’m doing that only for didactical reasons here so you can see how these things are related to one another (I hope some readers are still following…), but we’ll soon look at easy R functions that give us the significance values.) Here are the results:

Figure 6: Visualisation of the null hypothesis, compared with our results from our first survey

We’re assuming the true vote share of Trump = Biden’s = 50%, i.e. zero difference (“null hypothesis”). If that were the case, we could of course, by random chance, have a bit of a different result in a survey of N = 1,000 randomly picked voters. We could get 51% or 49%, or maybe 52% or 48%. In fact, in almost all cases (95%), we would end up somewhere between 46.9% and 53.1% (remember: mean +/- 2*SE). Thus the grey normal distribution around the 50% mark in the Figure above.

Our data, however, show Trump at 56.4%. How likely is that if the null hypothesis were in fact true? Not very likely, as you can see from the figure: the vertical red line is far away from the vast majority of values we would be likely to obtain under the null scenario. Let’s get the exact p-value:

p_value <- 1 - pnorm(our_result, mean=0.5, sd=se)
p_value
p-value of our survey result (Trump votes = 56.4% in Alaska) against the null hypothesis Trump = Biden = 50% [one-sided test]

This means: If there was no difference between Trump and Biden among all voters in Alaska, then the probability would be 0.002% to obtain our result of 56.4% Trump votes (or higher) in a sample of N = 1,000 voters.

I know this sentence is quite cumbersome, but this is exactly what the p-value tells us! So it’s very important that you memorize this. More often than not, you see wrong statements about the p-value (“the probability of the null hypothesis is low”, “the probability that our data are true is high”, etc.).

Thus – the probability of our data given the null hypothesis is 0.002%, so very low. Which is why we would say: we can reject the null hypothesis, and instead we are quite sure that Trump is indeed in the lead in Alaska. The difference between Trump’s and Biden’s votes is statistically significant (p < .001), which is also sometimes marked with three stars (***) (one star usually denotes the .05 level, two stars the .01 level).

How you can quickly test for significance in R

Since you have (hopefully) understood the basic idea behind the statistical significance test, there is no need for manual calculations, simulations, and lengthy code anymore in your own applications. Here are a few common examples for how you can quickly perform statistical significance tests with one line of code:

binom.test(564,1000, p=0.5, alternative="greater")

Which gives us the p-value:

Output of binomial test in R

An alternative would be prop.test(), as used in the beginning of the article.
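If your data look different, there are similar one-liners in base R. For instance (with made-up numbers for a second poll and for the two groups, just to illustrate the function calls):

# Comparing proportions from two independent samples (e.g., two separate exit polls)
prop.test(c(564, 520), c(1000, 1000))

# Comparing the mean of a continuous variable between two groups
set.seed(1)
group_a <- rnorm(100, mean = 50, sd = 10)
group_b <- rnorm(100, mean = 53, sd = 10)
t.test(group_a, group_b)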

Examples: When to perform a significance test

The punchline of this article is still to come, namely the Georgia results. Let’s first look at a few other examples where it makes sense to calculate a significance test (and where it does not).

In general, you run a significance test if you have a random sample from a total population and you want to make an inference about that total population, which is unknown to you (hence the term “inferential statistics”). In some cases you don’t literally have a sample from a population, but there is some other source of random variation in the data-generating process, which is why a significance test can also make sense there.

Examples:

  1. An exit poll, as in this article: you survey a random sample of voters to infer the result among all voters in a state.
  2. Quality control: you inspect a random sample of parts from a delivery to draw conclusions about the quality of the whole delivery.
  3. An A/B test: you show design A to one random group of customers and design B to another, and want to know which design your customers in general prefer.
  4. A clinical or epidemiological study: you observe a sample of patients to infer whether a drug or a risk factor is associated with an outcome in the population of all patients.

What statistical significance does NOT mean

There are many types of applications, on the other hand, where testing for statistical significance does not make much sense. Let’s look at common misconceptions first about what the test actually tells you: a significant result does not tell you that the effect is large or practically important (with a huge sample, even a tiny and irrelevant difference becomes “significant”); the p-value is not the probability that the null hypothesis is true (it is the probability of your data given the null hypothesis, see above); and a non-significant result does not prove that the effect is zero (it may simply mean your sample is too small, as we will see with Georgia below).

Here are a few examples where testing for statistical significance hardly makes any sense: if your data already cover the entire population you are interested in, there is no sampling error to quantify; if your sample is enormous, practically every difference will be “significant” and the test adds little beyond the effect size; and if you already know from prior knowledge that the null hypothesis of exactly zero difference cannot be true, the test answers a question you never had.

The latter point is often brought forward by proponents of Bayesian statistics against p-values and significance testing. They argue that, in many cases, we have substantial prior knowledge that suggests a null hypothesis is not very realistic. A null-hypothesis significance test (NHST), however, as you know, gives us the probability of our data under the assumption that the null hypothesis is true. If you know a priori that the null hypothesis is most likely not true, then the test is of little help to you.

However, in many cases, you simply don’t know how large the effect under study will likely be, and the information provided by the null-hypothesis significance test is thus valuable: Did Biden really receive more votes than Trump in a certain state? Does the new supplier deliver parts that are most certainly of worse quality as compared with the old supplier? Do my customers prefer design A over design B in my product? Has a certain government policy had any effect on the unemployment rate? Is taking a certain drug really associated with an increased risk of developing cancer? In all these cases, knowing that one candidate/supplier/product/policy/treatment group is most likely performing better than the other would be important information, and thus testing against a null hypothesis certainly makes sense.

Finally, a problem to avoid when testing for statistical significance is what is known as multiple comparisons: you want to discern real effects from random noise with a certain level of confidence (e.g., p < 0.05), but often your analysis involves performing more than one test, e.g., comparing men and women, different income groups, as well as Blacks, Whites, Hispanics, Asians, etc. The chances are then high that one or more of all tested effects will turn out significant at p < .05 purely by chance, even if all differences across all groups were in fact zero. [The probability of wrongly rejecting the null hypothesis in any single test is 5%, but with ten tests the probability that at least one of them comes out significant for random reasons increases to (at most) 1 - .95^10 = 40.1%, which is not that unlikely.]
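You can check the 40.1% figure with a quick simulation, and base R’s p.adjust() implements standard corrections for multiple comparisons (the p-values below are made up for illustration):

# If all 10 null hypotheses are true, how often is at least one test "significant"?
set.seed(1)
false_alarm <- replicate(10000, any(runif(10) < 0.05))   # under H0, p-values are uniformly distributed
mean(false_alarm)   # ~0.40, matching 1 - 0.95^10

# Adjusting a set of p-values for multiple comparisons
p_values <- c(0.004, 0.020, 0.049, 0.130, 0.450)
p.adjust(p_values, method = "holm")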

A survey in Georgia

If you ever wondered why for days after the 2020 US presidential election, no one knew which candidate won, here is one important reason.

Recall that with the true Alaska results, our simulated survey showed Trump in the lead with a statistically significant result. What does that mean again? If in reality, Trump and Biden had equal share of votes, it would be very very unlikely for us to find our result in our sample. So we could be very confident in concluding, after just one exit poll among as few as 1,000 voters, that Trump most certainly won Alaska. [In fact, simulating 1,000 surveys, Trump was in the lead in every single one of them.]

Now let’s do the same thing with the Georgia election results:

set.seed(2020) 

survey_g1 <- sample(georgia,1000,replace=F)

table(survey_g1) %>% as.data.frame %>%
  ggplot(aes(x=survey_g1, y = Freq, label = Freq)) +
  geom_col(fill=c("blue", "red")) + 
  geom_label(label = paste(proportions(table(survey_g1))*100,"%")) +
  theme_minimal() + 
  ggtitle("Results of our first survey in Georgia (N = 1,000)")

prop.test(table(survey_g1))

We have drawn a random sample of N = 1,000 voters and these are the results:

Figure 7: Random sample of 1,000 voters in Georgia
Significance test from the result in Georgia

Our survey sees Biden in the lead with 50.7% of the votes, but the difference between Trump and Biden is not statistically significant. The test gives us a p-value of 0.681, which means: if Biden and Trump had an equal share of votes, the probability of drawing a random sample like ours and finding one of the candidates in the lead with 50.7% (or even more) of the votes would be 68.1%. There is thus a high risk that a result like this is produced by random chance alone. We therefore can’t reject the null hypothesis.
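As a sanity check, you can reproduce this p-value by hand. Assuming the survey counted 507 Biden votes out of 1,000 (the 50.7% shown above), prop.test() performs a chi-square test with Yates’ continuity correction against the null proportion of 0.5:

chisq <- 2 * (abs(507 - 500) - 0.5)^2 / 500   # Yates-corrected chi-square statistic
1 - pchisq(chisq, df = 1)                     # ~0.68, matching the prop.test output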

Reasons for why a result is not significant

If we get an insignificant effect, there can be two reasons:

  1. The true effect is indeed zero (although in our case we would assume that both candidates will not have the exact same number of votes, so one of them almost certainly won).
  2. Our test power is too low.

To understand what the latter means, consider the two types of error a test can make: a Type I error means we reject the null hypothesis although it is actually true (a false positive), whereas a Type II error means we fail to reject the null hypothesis although there is a real effect (a false negative). The power of a test is the probability of avoiding a Type II error, i.e. of actually detecting an effect that exists; it increases with the sample size and with the size of the true effect.

Our sample size is thus insufficient for Georgia, whereas the same number of voters was sufficient to determine the winner in Alaska, simply because the result was much closer in Georgia.

In sum, these are the main factors determining whether your results will be statistically significant: The size of the effect in the population, its variance, and your sample size.

How large of a sample do we need in Georgia?

We increase the sample size and ask 100,000 voters, instead of only 1,000, at the Georgia exit polls:

set.seed(2020)
survey_g2 <- sample(georgia,100000,replace=F)

table(survey_g2) %>% as.data.frame %>%
  ggplot(aes(x=survey_g2, y = Freq, label = Freq)) +
  geom_col(fill=c("blue", "red")) + 
  geom_label(label = paste(proportions(table(survey_g2))*100,"%")) +
  theme_minimal() + 
  ggtitle("Results of our second survey in Georgia (N = 100,000)")

prop.test(table(survey_g2))

Which gives us:

Figure 8: Survey in Georgia with N = 100,000
Significance test for the survey in Georgia with N = 100,000

We are now much closer to the actual election result (Biden = 50.12% of votes) with our sample estimate of 50.07%. However, again, our result is not statistically significant. So even with this sample which is 100 times larger, our result lies within the margin of error and we can’t be sure that Biden really won.
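Instead of increasing the sample size by trial and error, we could also estimate up front how many voters we would need. A standard back-of-the-envelope power calculation (my own addition, using the usual normal approximation) suggests:

# Sample size needed to detect a true share of 50.12% against the null of 50%
# with a two-sided 5% test and 80% power
p0 <- 0.5        # null hypothesis: a tie
p1 <- 0.5012     # true Biden share in Georgia
n_needed <- ((qnorm(0.975)*sqrt(p0*(1-p0)) + qnorm(0.80)*sqrt(p1*(1-p1)))^2) / (p1 - p0)^2
ceiling(n_needed)   # roughly 1.4 million voters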

We have unlimited funds, and so we decide to ask 3 million people, i.e. well over half of all Georgia voters:

survey_g4 <- sample(georgia,3000000,replace=F)
table(survey_g4) %>% as.data.frame %>%
  ggplot(aes(x=survey_g4, y = Freq, label = Freq)) +
  geom_col(fill=c("blue", "red")) + 
  geom_label(label = paste(round(proportions(table(survey_g4))*100,4),"%")) +
  theme_minimal() + 
  ggtitle("Results of our third survey in Georgia (N = 3,000,000)")

prop.test(table(survey_g4))
Figure 9: Survey results in Georgia with huge sample size (N = 3,000,000)
Significance test for Georgia survey with N = 3,000,000

Finally, our result is statistically significant and we can be quite sure that Biden actually won Georgia! But we had to ask well over half of all voters to get this result, so our “exit poll” at this point is not much less effort than simply counting all the votes.

This is of course one important reason why, immediately after the election, exit polls in several states where the race was very close could not deliver conclusive results as to which candidate had won. The Alaska and Georgia examples thus nicely show that whether a given sample size is sufficient depends on the size of the true effect: a clear lead like Trump’s in Alaska can be established with a sample of only 1,000 voters, whereas a razor-thin margin like Biden’s in Georgia cannot be reliably distinguished from a tie even with a sample of 100,000.

The end

If you’re still with me, then thank you for your time, and I hope this article has helped you understand what statistical significance does and does not mean. If you have any suggestions about what is missing or what might be explained in a different way, let me know in the comments!
