N is for N (Sample Size) Estimation: Power Analysis in R
To estimate sample size, you need a few pieces of information, one of which is the effect size you’re hoping to detect. Later this month, I’ll show how to conduct a meta-analysis in R, which is really useful to do on a small scale to generate this effect size and help inform a power analysis. I’ll talk about that in those future posts, and I’m hoping to write an article for a methods journal going into detail about how to do a mini meta-analysis, using the one I did for my dissertation as a working example.
The other pieces of information you need, besides effect size, are:
- The statistical analysis you plan to use.
- Power – 1 minus Beta (the Type II error rate); the convention is 0.8, meaning an 80% chance that, if a true effect of the specified size exists, you’ll detect it, though 0.9 is sometimes used.
- Significance level – the alpha you’ll be using for your significance tests; 0.05 is the convention.
With those pieces in hand, the pwr package (which we’ll use below) can estimate sample size for the following analyses:
- One-Sample Proportion: comparing a proportion to a population value
- Two-Sample Proportion: comparing proportions from two samples with either equal or unequal n’s
- One-Sample T-Test: comparing a sample mean to a population mean
- Paired-Sample T-Test: comparing means at two time points in one sample or from two matched samples
- Two-Sample T-Test: comparing means from two samples with either equal or unequal n’s
- Chi-Square Test: testing goodness of fit or association between categorical variables
- One-Way ANOVA: comparing means from three or more samples
- General Linear Model Test: including regressions and Factorial ANOVAs
- Correlation
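For reference, each of those analyses maps onto its own function in the pwr package; every function solves for whichever one of sample size, effect size, power, or alpha you leave unspecified. A quick cheat sheet:

# One-sample proportion:                 pwr.p.test()
# Two-sample proportion, equal n's:      pwr.2p.test()
# Two-sample proportion, unequal n's:    pwr.2p2n.test()
# One-sample, paired, or two-sample t:   pwr.t.test()
# Two-sample t with unequal n's:         pwr.t2n.test()
# Chi-square test:                       pwr.chisq.test()
# One-way (balanced) ANOVA:              pwr.anova.test()
# General linear model:                  pwr.f2.test()
# Correlation:                           pwr.r.test()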
First, let’s do a very simple power analysis. In last year’s Blogging A to Z, I used an ongoing example of a (fictional) study on the impact of caffeine on test performance, and I demonstrated how to conduct a t-test using some data I generated. Descriptive statistics for the two groups showed:
Experimental group: M = 83.2, SD = 6.21
Control group: M = 79.3, SD = 6.40
From these values, I can compute an effect size – for comparing two means, the appropriate effect size is Cohen’s d.
caffeine <- read.delim("caffeine_study.txt", header = TRUE)
caffeine$group <- factor(caffeine$group, labels = c("Control", "Experimental"))

library(effsize)

## Warning: package 'effsize' was built under R version 3.4.4

cohen.d(caffeine$score, caffeine$group, pooled = TRUE, conf.level = 0.95)

## 
## Cohen's d
## 
## d estimate: 0.62398 (medium)
## 95 percent confidence interval:
##        inf        sup 
## 0.09471117 1.15324889
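As a sanity check, you can approximate Cohen’s d by hand from the descriptives above: it’s the difference in means divided by the pooled standard deviation. A minimal sketch, assuming equal group sizes (so the pooled SD reduces to the root mean square of the two SDs):

# Cohen's d = (M1 - M2) / SD_pooled
# With equal n's, SD_pooled simplifies to sqrt((SD1^2 + SD2^2) / 2)
sd_pooled <- sqrt((6.21^2 + 6.40^2) / 2)
(83.2 - 79.3) / sd_pooled  # roughly 0.62, matching cohen.d() above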
My effect size is about 0.624. Let's say you want to replicate my study, but rather than just picking a random number of people to have in each group, you want to determine how many you need to detect an effect of that size. For that, we'll use the pwr package.
library(pwr)

## Warning: package 'pwr' was built under R version 3.4.4

pwr.t.test(d = 0.624, sig.level = 0.05, power = 0.8,
           type = "two.sample", alternative = "two.sided")

## 
##      Two-sample t test power calculation 
## 
##               n = 41.29776
##               d = 0.624
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
To detect an effect of that size, you need about 41 people per group (42 after rounding up, since you can't recruit a fraction of a person), if you're conducting a two-sided test. If you instead want a one-sided test, you'd set the alternative argument to "less" or "greater", depending on the direction you expect the difference to go.
pwr.t.test(d = 0.624, sig.level = 0.05, power = 0.8,
           type = "two.sample", alternative = "greater")

## 
##      Two-sample t test power calculation 
## 
##               n = 32.45438
##               d = 0.624
##       sig.level = 0.05
##           power = 0.8
##     alternative = greater
## 
## NOTE: n is number in *each* group
So if you predict ahead of time that the experimental group will have a higher score, and you only want to look for a significant difference in that direction, you need about 32 people per group. This tells me that the fake study I designed (and just randomly picked sample sizes for) was slightly underpowered, since I only had 30 people in each group. (Not to mention these are fake data I set up to have a fairly large effect. Effects as large as d = 0.624 aren't all that common in the wild.)
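If you want to see just how underpowered, you can run pwr.t.test in reverse: supply the n you actually had per group and omit the power argument, and the function solves for power instead of sample size. A quick sketch using the 30 per group from the original fake study:

# Solving for power: specify n (per group) and leave power unspecified
pwr.t.test(n = 30, d = 0.624, sig.level = 0.05,
           type = "two.sample", alternative = "two.sided")
# Reports power of roughly 0.65 - short of the conventional 0.8 target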
What if you wanted to do something a bit fancier, like estimate necessary sample size for a range of potential effect sizes? This next example is based on a program I recently wrote for a problem at work. (Values and some details have been changed.) A state board requires school programs to have a pass rate on a key exam of 80%. If a program misses that pass rate one year, they go on probation. If they miss it for 3 consecutive years, they are suspended and can no longer be considered a state board-approved school. The board has called up a statistician (hi) to see if a handful of school programs have a pass rate significantly lower than this 80% cutoff. The problem is, some of these school programs are small, and as a result, their pass rates may vary a lot from year to year, just by luck of the draw.
In addition to sharing a report of the study results, the statistician wants to include a power analysis, to show how many cases are necessary before she can reliably say a pass rate is too low. She wants to show necessary sample for a range of school program pass rates (say between 50% and 79%), and for two levels of power, 0.8 (convention) and 0.9. Here's how we could go about conducting that power analysis.
First, we need to generate our effect sizes. This is a one-sample proportion test, where we're comparing a range of school pass rates (50-79%) to a population value (80%). In the pwr package, the effect size for a proportion is called h, and you can compute it with the pwr function ES.h.
p1s <- seq(.5, .79, .01)
h <- ES.h(p1 = p1s, p2 = 0.80)
nh <- length(h)
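In case you're curious what ES.h is doing under the hood: Cohen's h is the difference between the two proportions after an arcsine transformation, which keeps the effect size comparable across the 0-1 range. You can verify the first value by hand:

# Cohen's h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))
2 * asin(sqrt(0.50)) - 2 * asin(sqrt(0.80))  # about -0.64, matching h[1]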
Next, we'll create a vector of our two power levels:
p <- c(.8, .9)
np <- length(p)
You'll notice I also saved the length of each vector. Those lengths become important now, when I tell R to combine the two vectors and generate my target sample sizes, using a for loop.
# Loop over each power level (columns) and effect size (rows),
# solving for n and rounding up to a whole person
sizes <- array(numeric(nh*np), dim = c(nh, np))
for (i in 1:np){
  for (j in 1:nh){
    pow_an <- pwr.p.test(n = NULL, h = h[j], sig.level = .05,
                         power = p[i], alternative = "less")
    sizes[j,i] <- ceiling(pow_an$n)
  }
}
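As an aside, you can fill the same grid without explicit loops. Here's a loop-free sketch (the names grid and sizes2 are just for illustration) that should produce the same matrix:

# Expand all (effect size, power) combinations; h varies fastest,
# so filling the matrix column-wise reproduces the layout of sizes
grid <- expand.grid(h = h, power = p)
ns <- mapply(function(h_i, p_i)
               ceiling(pwr.p.test(h = h_i, sig.level = .05, power = p_i,
                                  alternative = "less")$n),
             grid$h, grid$power)
sizes2 <- matrix(ns, nrow = nh, ncol = np)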
Now I have data in two columns, target sample sizes for each value of h as my rows, with power = 0.8 in one column and power = 0.9 in the other. I'll rename my columns, so I can easily see which is which, and I'll add in the school pass rates I used, so that I can plot my results with that information on the x-axis.
samp <- data.frame(cbind(p1s, sizes))
colnames(samp) <- c("Pass_Rate", "Power.8", "Power.9")
summary(samp)

##    Pass_Rate        Power.8           Power.9        
##  Min.   :0.5000   Min.   :   15.0   Min.   :   21.00  
##  1st Qu.:0.5725   1st Qu.:   25.5   1st Qu.:   34.75  
##  Median :0.6450   Median :   51.0   Median :   71.00  
##  Mean   :0.6450   Mean   :  554.2   Mean   :  767.70  
##  3rd Qu.:0.7175   3rd Qu.:  167.2   3rd Qu.:  231.00  
##  Max.   :0.7900   Max.   :10075.0   Max.   :13956.00
For power of 0.8, necessary sample size ranges from 15 to over 10,000. For power of 0.9, the range is 21 to almost 14,000. Now I can start plotting my results from my samp data frame. If I only wanted to show results for one power level (0.8):
library(ggplot2)

passrates <- ggplot(samp, aes(x = p1s, y = Power.8)) +
  geom_point() +
  geom_line()

passrates +
  scale_y_continuous(breaks = seq(0, 11000, 1000)) +
  scale_x_continuous(labels = scales::percent) +
  labs(title = "Comparing Pass Rates Below 80%",
       x = "Pass Rate",
       y = "N Needed to Detect Significant Difference") +
  theme(plot.title = element_text(hjust = 0.5))
If I wanted to plot both:
bothpowers <- ggplot(samp, aes(p1s)) +
  geom_line(aes(y = Power.8, colour = "0.8")) +
  geom_line(aes(y = Power.9, colour = "0.9")) +
  scale_y_continuous(breaks = seq(0, 14000, 1000)) +
  scale_x_continuous(labels = scales::percent) +
  labs(title = "Comparing Pass Rates Below 80%",
       x = "Pass Rate",
       y = "N Needed to Detect Significant Difference",
       colour = "Power") +
  theme(plot.title = element_text(hjust = 0.5))
bothpowers
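One design note: giving each power level its own geom_line works fine with two columns, but if you wanted to compare several power levels, reshaping samp to long format lets a single geom_line draw them all. A sketch, assuming tidyr 1.0 or later is installed (samp_long is just an illustrative name):

library(tidyr)

# Pivot the two power columns into key/value form: one row per
# (pass rate, power level) combination
samp_long <- pivot_longer(samp, cols = c("Power.8", "Power.9"),
                          names_to = "Power", values_to = "N")
ggplot(samp_long, aes(x = Pass_Rate, y = N, colour = Power)) +
  geom_line() +
  scale_x_continuous(labels = scales::percent) +
  labs(title = "Comparing Pass Rates Below 80%", x = "Pass Rate",
       y = "N Needed to Detect Significant Difference", colour = "Power")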