
Finally understanding what “statistical significance” and p-values mean: A simple example (with R code)

[This article was first published on For-loops and piep kicks, and kindly contributed to R-bloggers.]

One day I realized that I finally really understood what “statistical significance” means (p < .01). I had probably heard the term hundreds of times by then. If you are still struggling with the concept, I hope it doesn’t take you this long and perhaps this post can be of help. The article will be quite lengthy (since previously this was the content of several hours of class), but if you have a few minutes, give it a try.

Let’s assume it is the 2020 US presidential election, and you are tasked with carrying out exit polls in two states to determine which of the candidates won them. We take Georgia and Alaska as examples, since the former was a very close race (Biden won), whereas in the latter Trump won by a big margin.

These are the actual results in these two states:

library(tidyverse)

georgia <- c(rep("Biden", 2473707), rep("Trump", 2461779))
alaska <- c(rep("Biden", 153778), rep("Trump", 189951))

proportions(table(georgia))
proportions(table(alaska))

Which gives us:

Percentages of Trump/Biden votes in Georgia and Alaska

We thus have created two objects “alaska” and “georgia” and stored the actual election results in them. These true results are unknown to you on election day, of course, which is the whole point of conducting an exit poll, asking a random sample of voters who they just voted for.

A sample of 1,000 voters

We start by simulating an exit poll of 1,000 voters in Alaska:

set.seed(2020) 

survey_a1 <- sample(alaska,1000,replace=F)

table(survey_a1) %>% as.data.frame %>%
  ggplot(aes(x=survey_a1, y = Freq, label = Freq)) +
  geom_col(fill=c("blue", "red")) + 
  geom_label(label = paste(proportions(table(survey_a1))*100,"%")) +
  theme_minimal() + 
  ggtitle("Results of our first survey in Alaska (N = 1,000)")

prop.test(table(survey_a1))

What the code does is first take a random sample of 1,000 voters (from our “alaska” object where we stored the true voting results), and then visualize the results and run a significance test. We’ll explain in a bit what the significance test means; let’s first look at the result of running the code:

Figure 1: Result of our first survey in Alaska
Significance test of our first survey (to be explained later)

Why are we not getting the exact true value in our survey?

We know that the true percentage of Biden votes in Alaska was 44.7%. In our survey of 1,000 voters it was 43.6%, so a bit lower. Let’s first establish why the percentage in our sample is not exactly the same as the true value. We disregard here any bias coming from, e.g., social desirability in surveys; our survey respondents are all completely honest (they are simulated, after all). Since we drew a random sample from the population, our result can differ a bit from the true result purely by chance: we may happen to select a few more Biden voters, or a few more Trump voters, than their true shares would suggest.

This is probably easy to understand. The probability of rolling a 6 when throwing a die is 1/6. This does not mean that if you throw a die 600 times, you will always get exactly 100 6’s. Maybe you’ll get a 6 in 98, or 105, or even only 88 out of the 600 throws. But you will intuitively agree that it is more likely that you will get 100 6’s as opposed to, say, 50 or 150.
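If you want to convince yourself of this, here is a quick simulation (a small sketch of my own, not part of the exit-poll example) that throws a die 600 times over and over again and counts the 6’s:

# Throw a fair die 600 times and count the 6's; repeat this 10,000 times
set.seed(1)
sixes <- replicate(10000, sum(sample(1:6, 600, replace = TRUE) == 6))
summary(sixes)                      # the counts center around 100
mean(sixes >= 95 & sixes <= 105)    # share of runs that land close to 100
mean(sixes <= 50 | sixes >= 150)    # counts as extreme as 50 or 150 essentially never occur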

This is what the so-called Central Limit Theorem (CLT) in statistics posits: if you repeatedly draw random samples from a population and calculate an estimate (such as a mean or a proportion) in each sample, these estimates will be approximately normally distributed around the true population value, and their spread shrinks as the sample size grows.

Translated to our example: if we conducted many exit polls of 1,000 voters each in Alaska, the Trump vote shares found in these polls would scatter around the true share of about 55.3%, roughly following a bell curve, with most polls close to the true value and only a few far off.

I think this becomes clearer once you look at the following plots.

Repeating a survey many times

If you had endless time and resources, you could of course conduct a second survey: your first survey indicated that Trump won Alaska, but you want to be sure that you are not, by random chance, a bit above the true Trump vote while in reality Biden won (again, keep in mind that you wouldn’t know the true outcome at the time of the exit poll, which is the whole reason for conducting it).

A second survey:

survey_a2 <- sample(alaska,1000,replace=F)

table(survey_a2) %>% as.data.frame %>%
  ggplot(aes(x=survey_a2, y = Freq, label = Freq)) +
  geom_col(fill=c("blue", "red")) + 
  geom_label(label = paste(proportions(table(survey_a2))*100,"%")) +
  theme_minimal() + 
  ggtitle("Results of our second survey in Alaska (N = 1,000)")
Figure 2: Results of our second simulated survey

In your second survey, too, Trump leads. This time his vote share is a bit lower, but he is still leading by a big margin.

But you really want to know the truth and therefore, you resort to an extreme measure: You conduct 1,000 surveys of 1,000 voters each (thereby terrorizing most of Alaska’s population at least once) and you write down the percentage of Trump vs. Biden votes in each survey, and at the end you count how many of the surveys show Trump in the lead, hoping this will settle it.

You can simulate this with R the following way:

draw_sample <- function(x,n=1000){
  s <- sample(x, n, F)
  proportions(table(s))
}

set.seed(2020)
alaska_sim <- replicate(1000,draw_sample(alaska))

alaska_sim[1:2,1:20]

This code creates a function that gives you a sample of 1,000 voters and returns the percentages of Biden and Trump votes. We then run (replicate) this function 1,000 times and store the results in the object “alaska_sim”. Look at the first couple results of our meta-study:

First few results of 1,000 simulated surveys with 1,000 voters each

The first few results paint a very clear picture: Trump always leads by a large margin. Sometimes it’s 54% to 46%, sometimes 58% to 42%. So there is a certain variability, but we never see Biden in the lead in any of these surveys.

Let’s plot the results from all our 1,000 surveys:

dat <- data.frame(t(alaska_sim))
qplot() + theme_minimal() + 
  geom_histogram(aes(x=dat$Biden),fill="blue", alpha=.5) + 
  geom_histogram(aes(x=dat$Trump), fill="red", alpha=.5) +
  geom_vline(xintercept = mean(dat$Biden), color="blue") +
  geom_vline(xintercept = mean(dat$Trump), color="red") +
  ggtitle("1,000 surveys in Alaska of 1,000 voters each", 
          subtitle="Histograms of simulated values for Trump (red) and Biden (blue) percentages")

c(mean(dat$Biden), mean(dat$Trump))

Which gives us this result:

Figure 3: Distributions of sample mean values in 1,000 simulations (Alaska)

The plot shows two histograms of Biden (blue) and Trump (red) shares of votes. A histogram is a bar chart which shows how often a value occurs in our dataset. For instance, the low red bar at the x = 0.50 mark indicates that there were only 3 out of our 1,000 simulated surveys in which Trump had between 50% and 51% of the votes. By contrast, in more than 150 surveys, the share of Trump votes was 55-56% (high bar at the x = 0.55 mark). Here you can perfectly see what the “Central Limit Theorem” states: the simulated survey results follow a bell-shaped (normal) distribution, and that distribution is centered on the true vote share, so the mean of all survey results is almost exactly the true election result.

The latter can be seen here:

Mean values for Trump and Biden share of votes among our 1,000 Alaska surveys

Notice that the real share of Biden votes in Alaska was 44.73% – the average of all our 1,000 surveys was 44.75% and thus very, very close! Thus, if we could indeed conduct 1,000 surveys and average the results, we would be very close to the true election result. But of course in a real-life scenario you can’t do that; you can usually only conduct one survey. By now we have thus established that a single survey will, purely by chance, deviate somewhat from the true result, but that across many hypothetical surveys the estimates follow a roughly normal distribution centered on the truth.

How can we know if our single survey is close to the truth?

Since we would not know the true value, we could not compute the distributions shown in Figure 3. Thus, if we conduct only one survey, we can’t know if we’re very close to the true value or, by bad luck, very far off.

But: We know that our survey result (share of Trump votes) is somewhere on a normal distribution around the true value (as posited by the Central Limit Theorem and as shown in our simulation above). From the well-known shape and properties of a normal distribution, we can derive how likely it is that our result of 56% Trump votes (in our first survey) is, say, 7 percentage points away from the true value (meaning that actually Biden could be in the lead).

There are two parameters that determine the “width” of the normal distribution: the variability of the underlying variable (for a vote share p, this is p*(1-p)) and the sample size N. The larger the sample, the narrower the distribution.

Here you see a normal distribution with N = 1,000 around the Alaska vote (from our first survey) for Trump:

Figure 4: Normal distribution with 95% interval for N = 1,000

Code for this plot:

vote_share <- .564
N <- 1000
mean <- vote_share*N
sd <- sqrt(vote_share*(1-vote_share)*N)
x <- (mean - 5*sd):(mean + 5*sd)
norm1 <- dnorm(x,mean,sd)
p <- pnorm(x,mean,sd)

qplot() + theme_minimal() +
  geom_line(aes(x=x,y=norm1), color="red") +
  geom_area(aes(x=x,y=norm1),fill="red",alpha=.1) +
  geom_vline(xintercept = x[which.min(abs(p-0.025))], color="red") +
  geom_vline(xintercept = x[which.min(abs(p-0.975))], color="red") +
  ggtitle("Normal distribution with N = 1,000",
          subtitle= paste0("Vertical lines = 95% interval around the mean: ",
                           round(x[which.min(abs(p-0.025))]/N*100,1), "% to ", 
                           round(x[which.min(abs(p-0.975))]/N*100,1),"%"))

In the plot I have marked with the two vertical lines the borders of the so-called 95%-interval. This is the area where 95% of cases are located. Outside (left tail or right tail) are together only 5% of observations. This is important because, as is common convention, we use the 95% interval to denote the values that we still deem “probable”.

The plot shows basically what we already saw in Figure 3 (red histogram), but as a smoothed line (the probability density function). It tells us that values close to our survey estimate of 56.4% are the most plausible, that 95% of the distribution lies between roughly 53.3% and 59.5% (the two vertical lines), and that values at or below the 50% mark are far out in the left tail.

The important message from this analysis is: our confidence interval is far away from the 50% mark. Meaning, we can be very sure that Trump is indeed in the lead over Biden in Alaska. Yes, there is still some uncertainty; the 95% interval is more than six percentage points wide, so we can’t really say whether Trump in reality got 54% or rather 59% of the votes, but we can be extremely sure that Biden has not received as many or more votes than Trump.

There is of course always a small risk that our conclusion is wrong if we base it on the 95% interval. The probability of obtaining a value outside the 95% interval in Figure 4 is, well, 5%. In our example, the probability that Trump has in fact fewer than 50% is very, very low, as you can see from the small tail of the distribution to the left of the 500 mark, it’s around 0.002%. This is the infamous p-value (we’ll discuss this in detail in a minute).
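If you want to check this number yourself, you can compute the tail probability directly with pnorm(), using the same mean and standard deviation as in the code for Figure 4 above:

# Probability mass of the Figure 4 distribution to the left of 500 votes (i.e., Trump not in the lead)
pnorm(500, mean = 0.564*1000, sd = sqrt(0.564*0.436*1000))
# ~2.2e-05, i.e. around 0.002%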

Since this value is well below 5%, we can say “Trump’s lead in Alaska is statistically significant at the 5% level”. In addition to the 5% significance level (which corresponds to the 95% confidence interval), you also sometimes see the 1% or 0.1% significance levels. So, because our p-value is well below 0.1%, we could also say “the difference between Trump and Biden is statistically significant at the 0.1% level”. These three (5%, 1%, 0.1%) are the common significance levels – so even though the probability of obtaining our result if Biden were in fact on par with Trump (0.002%) is even smaller than 0.1%, we wouldn’t write something like “it’s significant at the 0.01% level”.

The effect of larger sample sizes

As posited by the Central Limit Theorem and the law of large numbers (and as you will intuitively agree), a larger sample size leads to a smaller margin of error. Let’s see how large the confidence interval is with a sample of 10,000 voters instead of 1,000:

N <- 10000
mean <- vote_share*N
sd <- sqrt(vote_share*(1-vote_share)*N)
x <- (mean - 5*sd):(mean + 5*sd)   # recompute the grid for the larger sample
norm1 <- dnorm(x,mean,sd)
p <- pnorm(x,mean,sd)

qplot() + theme_minimal() +
  geom_line(aes(x=x,y=norm1), color="red") +
  geom_area(aes(x=x,y=norm1),fill="red",alpha=.1) +
  geom_vline(xintercept = x[which.min(abs(p-0.025))], color="red") +
  geom_vline(xintercept = x[which.min(abs(p-0.975))], color="red") +
  ggtitle("Normal distribution with N = 10,000",
          subtitle= paste0("Vertical lines = 95% interval around the mean: ",
                           round(x[which.min(abs(p-0.025))]/N*100,1), "% to ", 
                           round(x[which.min(abs(p-0.975))]/N*100,1),"%"))

Now we have a normal distribution for a share of 56.4% Trump votes, but obtained among a sample of 10,000 voters:

Figure 5: Normal distribution with mean = 56.4% and N = 10,000

As you can see, the distribution is much narrower compared with Figure 4, where we had only 1,000 voters. The margin of error is much smaller: instead of 56.4% plus or minus 3.1 percentage points, we can now say that Trump’s vote share is 56.4% plus or minus about 1 percentage point. This is obviously a huge improvement. But in reality it would also be much more costly to recruit 10,000 voters, of course. Thus, there is a trade-off between getting results fast and cheap, and getting results that are as accurate as possible.

Keep this in mind when you read something along the lines of “43% of people want X whereas only 41% of people favor Y”, based on a survey of 1,000 people: now you know how big the margin of error usually is for this sample size, and so in reality, people might just as well favor Y over X.

What is a “standard error”?

By the way, you can easily see from the formula for the standard deviation of a binomial distribution mentioned earlier, sqrt(p*q*n) [where p = share of Trump votes, q = share of Biden votes, n = number of observations], why increasing the number of observations leads to a narrower distribution: if n is increased by a factor of 100, the standard deviation increases only by a factor of sqrt(100) = 10. The size of the standard deviation relative to the number of voters has therefore decreased.
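A quick check of this scaling with the numbers from our example (a small sketch, using the 56.4% Trump share from our first survey):

p <- 0.564; q <- 1 - p
sqrt(p*q*1000)              # absolute SD of the vote count: ~15.7 votes out of 1,000
sqrt(p*q*100000)            # ~157 votes out of 100,000, i.e. only 10 times larger
sqrt(p*q*1000)/1000         # relative to the sample size: ~1.6%
sqrt(p*q*100000)/100000     # ~0.16%, i.e. 10 times smaller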

The standard deviation of the distribution of all our (hypothetical) survey results is also called the standard error (SE) of our estimate. This is important because many people confuse standard deviation and standard error.

If we know the distribution of all our 1,000 surveys (because we simulated them ourselves), we can easily get the standard error just from the standard deviation of the distribution of all our repeated experiments. Let’s try this:

dat <- data.frame(t(alaska_sim))
sd(dat$Trump)

Result:

Standard error – obtained as the standard deviation from all our 1,000 simulated results of Trump vote shares

But, if we don’t have thousands of repeated experiments but only one actual survey, we have to estimate the standard error based on our existing data. The estimate for the standard error for a binomial distribution (i.e. two outcomes, Trump or Biden) from a sample is given by the formula sqrt(p*(1-p)/n), where p is the share of Trump votes and n is the sample size (see, e.g., here).

sqrt(0.564*0.436/1000)
Standard error – estimated from one single sample where the share of Trump votes was 56.4%

As you can see, the values are very similar. This is important for us to know, because it means we can infer the standard error from a single survey – without having to simulate or actually conduct thousands.

By the way: If you don’t have a binary outcome (Trump/Biden) but a continuous variable (e.g., income, body weight, etc.), the formula for the standard error of the mean is given by: sd/sqrt(n)
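For illustration, here is a minimal sketch with simulated body weights (the numbers are made up; they are not from the election example):

set.seed(1)
weights <- rnorm(1000, mean = 75, sd = 12)      # 1,000 simulated body weights in kg
se_mean <- sd(weights)/sqrt(length(weights))    # standard error of the mean: sd/sqrt(n)
se_mean
c(mean(weights) - 2*se_mean, mean(weights) + 2*se_mean)  # rule-of-thumb 95% interval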

An important rule of thumb about the normal distribution as shown in Figures 4 and 5 is that the limits of the 95% confidence interval are approximately: mean value +/- 2 * standard error. This means that if you compute the standard error of your survey estimate, you can derive the margin of error by simply multiplying it by 2! (To be a bit more precise, the exact multiplier for the normal distribution is 1.96; in practice the t-distribution is often used instead, which is virtually identical to the normal except for very small sample sizes, where its multiplier is slightly larger than 1.96.)

Try it out:

se = sqrt(0.564*0.436/1000)
c(.564-2*se, .564+2*se)

The result:

Confidence interval calculated by rule of thumb (mean value +/- 2 * standard error)

Compare this with Figure 4 (confidence interval given in the subtitle) – it is indeed a very good approximation of the distribution of sample results if we had repeated the sampling many times.

Summary of this section: If you have a random sample from some population, and you want to know how large the margin of error for your sample estimate is (i.e., if you took many other random samples, how much would they differ), you can compute the standard error in your sample, multiply it by 2, and that is the margin of error (for a 95% interval). We have seen here with simulated data that this rule of thumb is quite accurate, and in an applied use case you therefore can make use of this without having to actually repeat your experiment countless times or simulate data.
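If you find yourself doing this often, you could wrap the rule of thumb in a small convenience function (this helper is my own addition, not from the analysis above):

margin_of_error <- function(p, n) 2 * sqrt(p * (1 - p) / n)   # rule of thumb: 2 * standard error
margin_of_error(0.564, 1000)    # ~0.031, i.e. about +/- 3.1 percentage points
margin_of_error(0.564, 10000)   # ~0.010, i.e. about +/- 1 percentage point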

The null hypothesis and the p-value

You have probably heard these terms often already, so it’s time to properly explain why researchers formulate and test a “null hypothesis” and what the p-value means. We have implicitly done this above, where we asked whether we can be sure that Trump’s vote share in Alaska (56.4% according to our survey) is really above 50% (exactly 50% would mean he is tied with Biden), but we haven’t done it the technically correct way.

In Figure 3, we could nicely see where the true results for Biden vs. Trump are, and where we likely (or unlikely) end up with a survey. Our problem is of course that we don’t know the true values. Therefore, we also don’t know where the distributions shown in Figures 4 and 5 are really located on the X axis – which would be good to know because we want to know how likely it is that Trump is actually in the lead over Biden.

Therefore, we proceed the other way round: we assume, for the sake of the statistical test, that the difference between Trump and Biden among all voters is zero. We can then construct a normal distribution around this null value (a Trump share of 50%) with our known parameters (sample size, standard error), and check how probable it would be to obtain a result like ours under that assumption.

This is the essence of the statistical significance test: If the true effect is zero, what is the probability of obtaining my data from a random sample of size N?

This probability is the infamous p-value. Formally, we want to know: p(D|H0), i.e. the probability (“p-value”) of our data (D) [or a more extreme result] given the null hypothesis (H0).

It’s probably best understood looking at the following graph:

This is a visualisation of the null hypothesis: we assume for a moment that Trump and Biden won an equal number of votes in Alaska, so the difference between them is zero and Trump’s vote share is 50%. Our sample is N = 1,000. Our standard error is, as calculated above, sqrt(0.564*(1-0.564)/N) = 1.568%. The result in our survey was 56.4% (“our data”), which we will compare against the null hypothesis:

N <- 1000
null_hypothesis <- 0.5
our_result <- 0.564
se = sqrt(our_result*(1-our_result)/N)

x <- seq((null_hypothesis - 5*se),(null_hypothesis + 5*se),by=.001)
norm1 <- dnorm(x,null_hypothesis,se)

qplot() + theme_minimal() +
  geom_line(aes(x=x,y=norm1), color="black") +
  geom_area(aes(x=x,y=norm1),fill="gray",alpha=.1) +
  geom_vline(xintercept = our_result, color="red") +
  geom_label(aes(x=our_result,y=max(norm1), label = paste0("Our data:\n",our_result*100,"%"))) +
  ggtitle("Null hypothesis: True Trump votes = Biden votes",
          subtitle="vs. the result of our survey")

(Just a note here, if you are going to perform a significance test yourself, you don’t have to write that much code and calculate standard errors by hand etc. I’m doing that only for didactical reasons here so you can see how these things are related to one another (I hope some readers are still following…), but we’ll soon look at easy R functions that give us the significance values.) Here are the results:

Figure 6: Visualisation of the null hypothesis, compared with our results from our first survey

We’re assuming the true vote share of Trump = Biden’s = 50%, i.e. zero difference (“null hypothesis”). If that were the case, we could of course, by random chance, have a bit of a different result in a survey of N = 1,000 randomly picked voters. We could get 51% or 49%, or maybe 52% or 48%. In fact, in almost all cases (95%), we would end up somewhere between 46.9% and 53.1% (remember: mean +/- 2*SE). Thus the grey normal distribution around the 50% mark in the Figure above.

Our data, however, show Trump at 56.4%. How likely is that if the null hypothesis were in fact true? Not very likely, as you can see from the figure: the vertical red line is far away from the vast majority of values we would be likely to obtain under the null scenario. Let’s get the exact p-value:

p_value <- 1 - pnorm(our_result, mean=0.5, sd=se)
p_value
p-value of our survey result (Trump votes = 56.4% in Alaska) against the null hypothesis Trump = Biden = 50% [one-sided test]

This means: If there was no difference between Trump and Biden among all voters in Alaska, then the probability would be 0.002% to obtain our result of 56.4% Trump votes (or higher) in a sample of N = 1,000 voters.

I know this sentence is quite cumbersome, but this is exactly what the p-value tells us! So it’s very important that you memorize this. More often than not, you see wrong statements about the p-value (“the probability of the null hypothesis is low”, “the probability that our data are true is high”, etc.).

Thus – the probability of our data given the null hypothesis is 0.002%, so very low. Which is why we would say: we can reject the null hypothesis, and instead we are quite sure that Trump is indeed in the lead in Alaska. The difference between Trump’s and Biden’s votes is statistically significant (p < .001), which is also sometimes marked with three stars (***) (one star usually denotes the .05 level, two stars the .01 level).

How you can quickly test for significance in R

Since you have (hopefully) understood the basic idea behind the statistical significance test, there is no need for manual calculations, simulations, and lengthy code anymore in your own applications. Here are a few common examples for how you can quickly perform statistical significance tests with one line of code:

binom.test(564,1000, p=0.5, alternative="greater")

Which gives us the p-value:

Output of binomial test in R

An alternative would be prop.test(), as used in the beginning of the article.
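If your data look different, there are similar one-liners in base R. For instance (with made-up numbers for a second poll and for the two groups, just to illustrate the function calls):

# Comparing proportions from two independent samples (e.g., two separate exit polls)
prop.test(c(564, 520), c(1000, 1000))

# Comparing the mean of a continuous variable between two groups
set.seed(1)
group_a <- rnorm(100, mean = 50, sd = 10)
group_b <- rnorm(100, mean = 53, sd = 10)
t.test(group_a, group_b)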

Examples: When to perform a significance test

The punchline of this article is still to come, namely the Georgia results. Let’s first look at a few other examples where it makes sense to calculate a significance test (and where it does not).

In general, you run a significance test if you have a random sample from a total population and you want to make an inference about that total population, which is unknown to you (hence the term “inferential statistics”). In some cases you don’t literally have a sample from a population, but there is some other source of random variation in the data-generating process, which is why a significance test can also make sense there.

Examples:

  1. An exit poll, as in this article: you survey a random sample of voters to infer the result among all voters in a state.
  2. Quality control: you inspect a random sample of parts from a delivery to draw conclusions about the quality of the whole delivery.
  3. An A/B test: you show design A to one random group of customers and design B to another, and want to know which design your customers in general prefer.
  4. A clinical or epidemiological study: you observe a sample of patients to infer whether a drug or a risk factor is associated with an outcome in the population of all patients.

What statistical significance does NOT mean

There are many types of applications, on the other hand, where testing for statistical significance does not make much sense. Let’s look at common misconceptions first about what the test actually tells you: a significant result does not tell you that the effect is large or practically important (with a huge sample, even a tiny and irrelevant difference becomes “significant”); the p-value is not the probability that the null hypothesis is true (it is the probability of your data given the null hypothesis, see above); and a non-significant result does not prove that the effect is zero (it may simply mean your sample is too small, as we will see with Georgia below).

Here are a few examples where testing for statistical significance hardly makes any sense: if your data already cover the entire population you are interested in, there is no sampling error to quantify; if your sample is enormous, practically every difference will be “significant” and the test adds little beyond the effect size; and if you already know from prior knowledge that the null hypothesis of exactly zero difference cannot be true, the test answers a question you never had.

The latter point is often brought forward by proponents of Bayesian statistics against p-values and significance testing. They argue that, in many cases, we have substantial prior knowledge that suggests a null hypothesis is not very realistic. A null-hypothesis significance test (NHST), however, as you know, gives us the probability of our data under the assumption that the null hypothesis is true. If you know a priori that the null hypothesis is most likely not true, then the test is of little help to you.

However, in many cases, you simply don’t know how large the effect under study will likely be, and the information provided by the null-hypothesis significance test is thus valuable: Did Biden really receive more votes than Trump in a certain state? Does the new supplier deliver parts that are most certainly of worse quality as compared with the old supplier? Do my customers prefer design A over design B in my product? Has a certain government policy had any effect on the unemployment rate? Is taking a certain drug really associated with an increased risk of developing cancer? In all these cases, knowing that one candidate/supplier/product/policy/treatment group is most likely performing better than the other would be important information, and thus testing against a null hypothesis certainly makes sense.

Finally, a problem to avoid when testing for statistical significance is what is known as multiple comparisons: you want to discern real effects from random noise with a certain level of confidence (e.g., p < 0.05), but often your analysis involves performing more than one test, e.g., comparing men and women, different income groups, as well as Blacks, Whites, Hispanics, Asians, etc. The chances are then high that one or more of all tested effects will turn out significant at p < .05 purely by chance, even if all differences across all groups were in fact zero. [The probability of wrongly rejecting the null hypothesis in any single test is 5%, but with ten tests the probability that at least one of them comes out significant for random reasons increases to (at most) 1 - .95^10 = 40.1%, which is not that unlikely.]
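You can check the 40.1% figure with a quick simulation, and base R’s p.adjust() implements standard corrections for multiple comparisons (the p-values below are made up for illustration):

# If all 10 null hypotheses are true, how often is at least one test "significant"?
set.seed(1)
false_alarm <- replicate(10000, any(runif(10) < 0.05))   # under H0, p-values are uniformly distributed
mean(false_alarm)   # ~0.40, matching 1 - 0.95^10

# Adjusting a set of p-values for multiple comparisons
p_values <- c(0.004, 0.020, 0.049, 0.130, 0.450)
p.adjust(p_values, method = "holm")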

A survey in Georgia

If you ever wondered why for days after the 2020 US presidential election, no one knew which candidate won, here is one important reason.

Recall that with the true Alaska results, our simulated survey showed Trump in the lead with a statistically significant result. What does that mean again? If in reality, Trump and Biden had equal share of votes, it would be very very unlikely for us to find our result in our sample. So we could be very confident in concluding, after just one exit poll among as few as 1,000 voters, that Trump most certainly won Alaska. [In fact, simulating 1,000 surveys, Trump was in the lead in every single one of them.]

Now let’s do the same thing with the Georgia election results:

set.seed(2020) 

survey_g1 <- sample(georgia,1000,replace=F)

table(survey_g1) %>% as.data.frame %>%
  ggplot(aes(x=survey_g1, y = Freq, label = Freq)) +
  geom_col(fill=c("blue", "red")) + 
  geom_label(label = paste(proportions(table(survey_g1))*100,"%")) +
  theme_minimal() + 
  ggtitle("Results of our first survey in Georgia (N = 1,000)")

prop.test(table(survey_g1))

We have drawn a random sample of N = 1,000 voters and these are the results:

Figure 7: Random sample of 1,000 voters in Georgia
Significance test from the result in Georgia

Our survey sees Biden in the lead with 50.7% of the votes, but the difference between Trump and Biden is not statistically significant. The test gives us a p-value of 0.681, which means: if Biden and Trump had an equal share of votes, the probability of drawing a random sample like ours and finding one of the candidates in the lead with 50.7% (or even more) of the votes would be 68.1%. There is thus a high risk that a result like this is produced by random chance alone. We therefore can’t reject the null hypothesis.
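As a sanity check, you can reproduce this p-value by hand. Assuming the survey counted 507 Biden votes out of 1,000 (the 50.7% shown above), prop.test() performs a chi-square test with Yates’ continuity correction against the null proportion of 0.5:

chisq <- 2 * (abs(507 - 500) - 0.5)^2 / 500   # Yates-corrected chi-square statistic
1 - pchisq(chisq, df = 1)                     # ~0.68, matching the prop.test output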

Reasons for why a result is not significant

If we get an insignificant effect, there can be two reasons:

  1. The true effect is indeed zero (although in our case we would assume that both candidates will not have the exact same number of votes, so one of them almost certainly won).
  2. Our test power is too low.

To understand what the latter means, consider the two types of error a test can make: a Type I error means we reject the null hypothesis although it is actually true (a false positive), whereas a Type II error means we fail to reject the null hypothesis although there is a real effect (a false negative). The power of a test is the probability of avoiding a Type II error, i.e. of actually detecting an effect that exists; it increases with the sample size and with the size of the true effect.

Our sample size is thus insufficient for Georgia, whereas the same number of voters was sufficient to determine the winner in Alaska, simply because the result was much closer in Georgia.

In sum, these are the main factors determining whether your results will be statistically significant: The size of the effect in the population, its variance, and your sample size.

How large of a sample do we need in Georgia?

We increase the sample size and ask 100,000 voters, instead of only 1,000, at the Georgia exit polls:

set.seed(2020)
survey_g2 <- sample(georgia,100000,replace=F)

table(survey_g2) %>% as.data.frame %>%
  ggplot(aes(x=survey_g2, y = Freq, label = Freq)) +
  geom_col(fill=c("blue", "red")) + 
  geom_label(label = paste(proportions(table(survey_g2))*100,"%")) +
  theme_minimal() + 
  ggtitle("Results of our second survey in Georgia (N = 100,000)")

prop.test(table(survey_g2))

Which gives us:

Figure 8: Survey in Georgia with N = 100,000
Significance test for the survey in Georgia with N = 100,000

We are now much closer to the actual election result (Biden = 50.12% of votes) with our sample estimate of 50.07%. However, again, our result is not statistically significant. So even with this sample which is 100 times larger, our result lies within the margin of error and we can’t be sure that Biden really won.
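Instead of increasing the sample size by trial and error, we could also estimate up front how many voters we would need. A standard back-of-the-envelope power calculation (my own addition, using the usual normal approximation) suggests:

# Sample size needed to detect a true share of 50.12% against the null of 50%
# with a two-sided 5% test and 80% power
p0 <- 0.5        # null hypothesis: a tie
p1 <- 0.5012     # true Biden share in Georgia
n_needed <- ((qnorm(0.975)*sqrt(p0*(1-p0)) + qnorm(0.80)*sqrt(p1*(1-p1)))^2) / (p1 - p0)^2
ceiling(n_needed)   # roughly 1.4 million voters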

We have unlimited funds, and so we decide to ask 3 million people, i.e. well over half of all Georgia voters:

survey_g4 <- sample(georgia,3000000,replace=F)
table(survey_g4) %>% as.data.frame %>%
  ggplot(aes(x=survey_g4, y = Freq, label = Freq)) +
  geom_col(fill=c("blue", "red")) + 
  geom_label(label = paste(round(proportions(table(survey_g4))*100,4),"%")) +
  theme_minimal() + 
  ggtitle("Results of our third survey in Georgia (N = 3,000,000)")

prop.test(table(survey_g4))
Figure 9: Survey results in Georgia with huge sample size (N = 3,000,000)
Significance test for Georgia survey with N = 3,000,000

Finally, our result is statistically significant and we can be quite sure that Biden actually won Georgia! But we had to ask well over half of all voters to get this result, so our “exit poll” at this point is not much less effort than simply counting all the votes.

This is of course one important reason why, immediately after the election, exit polls in several states where the race was very close could not deliver conclusive results as to which candidate had won. The Alaska and Georgia examples thus nicely show that whether a given sample size is sufficient depends on the size of the true effect: a clear lead like Trump’s in Alaska can be established with a sample of only 1,000 voters, whereas a razor-thin margin like Biden’s in Georgia cannot be reliably distinguished from a tie even with a sample of 100,000.

The end

If you’re still with me, then thank you for your time, and I hope this article has helped you understand what statistical significance does and does not mean. If you have any suggestions about what is missing or what might be explained in a different way, let me know in the comments!
