Stratified Sampling in R: A Practical Guide with Base R and dplyr
Introduction
Stratified sampling is a technique used to ensure that different subgroups (strata) within a population are represented in a sample. This method is particularly useful when certain strata would be underrepresented in a simple random sample. In this post, we’ll explore how to perform stratified sampling in R using both base R and the dplyr package. We’ll walk through examples and explain the code, so you can try these techniques on your own data.
What is Stratified Sampling?
In stratified sampling, the population is divided into different strata based on a specific characteristic (e.g., age, gender, income level). A random sample is then taken from each stratum. This guarantees that every stratum appears in the sample, which matters most when the strata differ substantially in size or characteristics.
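To make the contrast with a simple random sample concrete, here is a minimal base R sketch. The population, the group labels "A" and "B", and the sample sizes are all made up for illustration; they are not taken from the examples in this post.

```r
# Hypothetical population: 900 rows in stratum "A", 100 rows in stratum "B"
set.seed(42)
population <- data.frame(
  id = 1:1000,
  group = rep(c("A", "B"), times = c(900, 100))
)

# Simple random sample of 20 rows: the number of "B" rows varies from draw to draw
srs <- population[sample(nrow(population), 20), ]
table(srs$group)

# Stratified sample: exactly 10 rows from each group, by construction
rows_by_group <- split(seq_len(nrow(population)), population$group)
picked <- unlist(lapply(rows_by_group, function(rows) sample(rows, 10)))
stratified <- population[picked, ]
table(stratified$group)
```

Rerunning the simple random sample with different seeds changes how many "B" rows it contains, while the stratified sample always has 10 rows per group.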
Stratified Sampling with Base R
Let’s start with an example using base R. Suppose we have a dataset with information about individuals, including their gender and income. We want to sample a specific number of individuals from each gender group.
Here’s how we can do it:
    # Sample data
    set.seed(123)  # For reproducibility
    data <- data.frame(
      ID = 1:100,
      Gender = sample(c("Male", "Female"), 100, replace = TRUE),
      Income = rnorm(100, mean = 50000, sd = 10000)
    )

    # View the first few rows of the data
    head(data)

      ID Gender   Income
    1  1   Male 52533.19
    2  2   Male 49714.53
    3  3   Male 49571.30
    4  4 Female 63686.02
    5  5   Male 47742.29
    6  6 Female 65164.71
In this dataset, we have a column for Gender and another for Income. Let’s say we want to sample 10 males and 10 females.
    # Stratified sampling function
    stratified_sample <- function(data, strat_column, size_per_stratum) {
      # Unique values of the stratification column
      strata <- unique(data[[strat_column]])
      # Sample within each stratum, then stack the pieces
      sampled_data <- do.call(rbind, lapply(strata, function(stratum) {
        subset_data <- data[data[[strat_column]] == stratum, ]
        # Errors if a stratum has fewer than size_per_stratum rows
        subset_data[sample(nrow(subset_data), size_per_stratum), ]
      }))
      return(sampled_data)
    }

    # Perform stratified sampling
    sampled_data <- stratified_sample(data, "Gender", 10)

    # View the sampled data
    table(sampled_data$Gender)

    Female   Male
        10     10

    head(sampled_data)

         ID Gender   Income
    45   45   Male 63606.52
    69   69   Male 41502.96
    83   83   Male 50412.33
    29   29   Male 51813.03
    49   49   Male 47643.00
    100 100   Male 37129.70
In this example:
- We first create a function stratified_sample that takes the data, the column to stratify by, and the number of samples per stratum.
- The function identifies unique strata, then samples the specified number of rows from each stratum.
- The result is a combined dataset with samples from each group.
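One thing to watch with a fixed size per stratum: sample() stops with an error when asked for more rows than a stratum contains. The sketch below is one possible variant that caps the request at the stratum size instead. The function name stratified_sample_safe and the toy data are hypothetical, and silently capping is a design choice you may not want; an explicit error can be the better behavior in some analyses.

```r
# Hypothetical variant: caps the per-stratum size at the number of rows available
stratified_sample_safe <- function(data, strat_column, size_per_stratum) {
  # Row indices grouped by stratum
  rows_by_stratum <- split(seq_len(nrow(data)), data[[strat_column]])
  picked <- lapply(rows_by_stratum, function(rows) {
    # Never ask for more rows than the stratum has
    n <- min(size_per_stratum, length(rows))
    # Sample positions within the stratum, so a one-row stratum works too
    rows[sample.int(length(rows), n)]
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Toy data with strata of 8, 3, and 1 rows
toy <- data.frame(
  ID = 1:12,
  Group = rep(c("a", "b", "c"), times = c(8, 3, 1))
)

set.seed(1)
result <- stratified_sample_safe(toy, "Group", 3)
table(result$Group)  # 3 rows from "a", 3 from "b", and the single row from "c"
```

Sampling positions with sample.int() rather than calling sample() on the row numbers directly avoids a base R surprise: when given a single number, sample() draws from 1 up to that number instead of returning it.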
Stratified Sampling with dplyr
Using sample_n()
The dplyr package makes data manipulation straightforward and efficient. Here’s how to do stratified sampling using dplyr:
    library(dplyr)

    # Stratified sampling with sample_n()
    sampled_data_n <- data %>%
      group_by(Gender) %>%
      sample_n(10)

    # View the sampled data
    sampled_data_n %>% count(Gender)

    # A tibble: 2 × 2
    # Groups:   Gender [2]
      Gender     n
      <chr>  <int>
    1 Female    10
    2 Male      10

    head(sampled_data_n)

    # A tibble: 6 × 3
    # Groups:   Gender [1]
         ID Gender Income
      <int> <chr>   <dbl>
    1    81 Female 64446.
    2     6 Female 65165.
    3     8 Female 55846.
    4    22 Female 26908.
    5    98 Female 56879.
    6    11 Female 53796.
In this approach:
- We use group_by() to group the data by the Gender column.
- sample_n() is used to take 10 samples from each group.
- count() helps us verify the number of samples from each group.
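As of dplyr 1.0.0, sample_n() and sample_frac() are superseded by slice_sample(); the older functions still work, but new code is generally written with the newer one. Here is a minimal sketch of the same fixed-size draw, assuming dplyr 1.0.0 or later is installed. It rebuilds the example data so it can run on its own, and the name sampled_slice is just illustrative.

```r
library(dplyr)

# Rebuild the example data so this block runs on its own
set.seed(123)
data <- data.frame(
  ID = 1:100,
  Gender = sample(c("Male", "Female"), 100, replace = TRUE),
  Income = rnorm(100, mean = 50000, sd = 10000)
)

# 10 rows per gender; ungroup() so later steps are not applied per group
sampled_slice <- data %>%
  group_by(Gender) %>%
  slice_sample(n = 10) %>%
  ungroup()

# Check the number of rows drawn from each group
count(sampled_slice, Gender)
```

Because the seed is reset before the data is rebuilt, the rows drawn here will not necessarily match the sample_n() output shown earlier.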
Using sample_frac() for Proportional Sampling
If you want to sample a proportion of each stratum, you can use the sample_frac() function. For example, if you want to sample 20% of each gender group:
    # Stratified sampling with sample_frac()
    sampled_data_frac <- data %>%
      group_by(Gender) %>%
      sample_frac(0.2)

    # View the sampled data
    sampled_data_frac %>% count(Gender)

    # A tibble: 2 × 2
    # Groups:   Gender [2]
      Gender     n
      <chr>  <int>
    1 Female     9
    2 Male      11

    head(sampled_data_frac)

    # A tibble: 6 × 3
    # Groups:   Gender [1]
         ID Gender Income
      <int> <chr>   <dbl>
    1    71 Female 51176.
    2    92 Female 47378.
    3    13 Female 46668.
    4    48 Female 65326.
    5    42 Female 55484.
    6    76 Female 43481.
In this example:
- sample_frac() is used to take 20% of the rows from each group.
- This is useful when you want the sample size to be proportional to the size of each stratum.
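The arithmetic behind proportional sampling is simple enough to check by hand. The sketch below uses made-up stratum sizes (45 and 55, for illustration only) and assumes the per-stratum count is rounded to the nearest whole row; the exact rounding rule may differ between functions and package versions.

```r
# Hypothetical stratum sizes, for illustration only
stratum_sizes <- c(Female = 45, Male = 55)
fraction <- 0.2

# Rows drawn per stratum, assuming rounding to the nearest whole row
sample_sizes <- round(stratum_sizes * fraction)
sample_sizes

# Each stratum's share of the population next to its share of the sample
rbind(
  population = stratum_sizes / sum(stratum_sizes),
  sample = sample_sizes / sum(sample_sizes)
)
```

With these sizes the two rows of shares agree, which is the point of proportional allocation. When a stratum size times the fraction is not a whole number, rounding makes the shares only approximately equal.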
Conclusion
Stratified sampling is a powerful technique to ensure representation from all subgroups in your sample. Whether you’re using base R or dplyr, the process is straightforward and lets you draw either equal-sized or proportional samples from each stratum.
Feel free to try these methods on your data! Experimenting with different sizes and strata can help you understand how stratified sampling affects your analyses. Don’t hesitate to dive into the code and see how you can adapt it to your needs.
Happy coding!