Bootstrapping clustered data

R on Abhijit Dasgupta

4 years ago

[This article was first published on R on Abhijit Dasgupta, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When evaluating the sampling variability of different statistics, I’ll often use the bootstrap procedure to resample my data, compute the statistic on each sample, and look at the distribution of the statistic over several bootstrap samples.

In principle, the bootstrap is straightforward to do. However, if you have correlated data (like repeated measures or longitudinal data or circular data), the unit of sampling no longer is the particular data point but the second-level unit within which the data are correlated; otherwise you break the correlation structure of the data by doing a naive bootstrap and distort the resultant distributions. This procedure is often called the cluster bootstrap.

Let’s fix ideas using a data analysis I’m currently doing. We’re looking at a particular measurement taken around a spinal joint every 5 degrees. These measures are correlated within person, since the measurements share the common spine. So to bootstrap our dataset, we have to bootstrap the people and not the individual measurements. A few rows of the data are below:

ID	Angle	Measure
16	-90	1
16	-85	1
16	-80	1
16	-75	1
16	-70	1
16	-65	1

The Measure variable varies from 0 to 1. The Angle variable varies between -90 and 90 by 5 degree increments.

Doing this computation is not difficult, but it becomes really straightforward using the rsample package developed by the RStudio crew, specifically Max Kuhn and Hadley Wickham. I was recently in a workshop Max taught in DC, where he introduced me to the rsample package, which, conveniently, has a bootstraps function. Now, this function has an option strata that can do stratified sampling. However, that is not the right tool, since that would sample from the individual measurements, just separately sampling by stratum. What we do need is to sample by the individuals.

Note that the bootstraps function samples rows from a data.frame or tibble object. In our situation, we need to sample groups of rows corresponding to each unique ID. However, we can utilize list-columns in tibbles to transform groups of rows into, effectively, single rows.

D <- d %>% nest(-ID)
head(D)
## # A tibble: 6 x 2
##      ID data             
##   <int> <list>           
## 1    16 <tibble [37 × 2]>
## 2    22 <tibble [37 × 2]>
## 3    38 <tibble [37 × 2]>
## 4    44 <tibble [37 × 2]>
## 5    30 <tibble [37 × 2]>
## 6    41 <tibble [37 × 2]>

Now, we can use bootstraps on this new, compact tibble to sample by ID

library(rsample)
set.seed(154234)

bs <- bootstraps(D, times = 10)
bs
## # Bootstrap sampling 
## # A tibble: 10 x 2
##    splits       id         
##    <list>       <chr>      
##  1 <S3: rsplit> Bootstrap01
##  2 <S3: rsplit> Bootstrap02
##  3 <S3: rsplit> Bootstrap03
##  4 <S3: rsplit> Bootstrap04
##  5 <S3: rsplit> Bootstrap05
##  6 <S3: rsplit> Bootstrap06
##  7 <S3: rsplit> Bootstrap07
##  8 <S3: rsplit> Bootstrap08
##  9 <S3: rsplit> Bootstrap09
## 10 <S3: rsplit> Bootstrap10

You can read up about the rsplit object and how data is stored in this object here. Let’s look at one of these bootstrap samples:

as.tibble(bs$splits[[1]]) %>% arrange(ID) %>% head()
## # A tibble: 6 x 2
##      ID data             
##   <int> <list>           
## 1     2 <tibble [37 × 2]>
## 2     7 <tibble [37 × 2]>
## 3     8 <tibble [37 × 2]>
## 4     9 <tibble [37 × 2]>
## 5     9 <tibble [37 × 2]>
## 6     9 <tibble [37 × 2]>

Notice that some ID’s are sampled multiple times, while others, not at all, which is the nature of bootstrap sampling.

If we want to assess the bootstrap distribution of the average Measure for each Angle, we can just unnest this tibble, and then assess the averages by Angle. This would give one bootstrap sample.

as.tibble(bs$splits[[1]]) %>% unnest() %>% 
  group_by(Angle) %>% summarize(AvgMeasure = mean(Measure))
## # A tibble: 37 x 2
##    Angle AvgMeasure
##    <int>      <dbl>
##  1   -90      0.596
##  2   -85      0.557
##  3   -80      0.539
##  4   -75      0.532
##  5   -70      0.595
##  6   -65      0.530
##  7   -60      0.495
##  8   -55      0.480
##  9   -50      0.439
## 10   -45      0.383
## # ... with 27 more rows

We can now use purrr functions to get the bootstrap distribution over multiple bootstrap samples, and plot the sampled summaries against Angle.

library(purrr)
library(ggplot2)
bs <- bootstraps(D, times = 100)
bs_AvgMeasure <- map(bs$splits, ~as.tibble(.) %>% unnest %>% group_by(Angle) %>% 
                       summarize(AvgMeasure = mean(Measure))) %>% 
  bind_rows(.id = 'boots')
ggplot(bs_AvgMeasure, aes(Angle, AvgMeasure, group = boots))+
  geom_line(alpha= 0.3)+
  theme_bw()

To leave a comment for the author, please follow the link and comment on their blog: R on Abhijit Dasgupta.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.