Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data on successes and failures can be summarized and analyzed as counted proportions via the binomial distribution or as long format 0/1 binary data. I most often see summarized data when there are multiple trials done within a study unit; for example, when tallying up the number of dead trees out of the total number of trees in a plot.
If these within-plot trials are all independent, analyzing data in a binary format instead of summarized binomial counts doesn’t change the statistical results. If trials are not independent, though, neither approach works correctly and we would see overdispersion/underdispersion in a binomial model. The confusing piece in this is that binary data by definition can’t be overdispersed and so the lack of fit from non-independence can’t be diagnosed with standard overdispersion checks when working with binary data.
In a future post I’ll talk more about simulating data to explore binomial overdispersion and how lack of fit can be diagnosed in binomial vs binary datasets. Today, however, my goal is show how to take binomial count data and expand it into binary data.
In the past I’ve done the data expansion with rowwise()
and do()
from package dplyr, but these days I’m using purrr::pmap_dfr()
. I’ll demonstrate the pmap_dfr()
approach as well as a nest()
/unnest()
approach using functions from tidyr.
Table of Contents
Load R packages
I’m using purrr for looping through rows with pmap_dfr()
. I also load dplyr and tidyr for a nest()
/unnest()
approach.
library(purrr) # 0.3.2 library(tidyr) # 1.0.0 library(dplyr) # 0.8.3
The dataset
I created a dataset with a total of 8 plots, 4 plots in each of two groups. The data has been summarized up to the plot level. The number of trials (total
) per plot varied. The number of successes observed is in num_dead
.
dat = structure(list(plot = structure(1:8, .Label = c("plot1", "plot2", "plot3", "plot4", "plot5", "plot6", "plot7", "plot8"), class = "factor"), group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("g1", "g2"), class = "factor"), num_dead = c(4L, 6L, 6L, 5L, 1L, 4L, 3L, 2L), total = c(5L, 7L, 9L, 7L, 8L, 10L, 10L, 7L)), class = "data.frame", row.names = c(NA, -8L)) dat # plot group num_dead total # 1 plot1 g1 4 5 # 2 plot2 g1 6 7 # 3 plot3 g1 6 9 # 4 plot4 g1 5 7 # 5 plot5 g2 1 8 # 6 plot6 g2 4 10 # 7 plot7 g2 3 10 # 8 plot8 g2 2 7
Expanding binomial to binary with pmap_dfr()
To make the binomial data into binary data, I need to make a vector with a \(1\) for every “success” listed in num_dead
and a \(0\) for every “failure” (the total number of trials minus the number of successes). Since I want to do a rowwise operation I’ll use one of the pmap
functions. I want my output to be a data.frame so I use pmap_dfr()
.
I use an anonymous function within pmap_dfr()
for creating the output I want from each row. I purposely make the names of the function arguments match the column names. You can either match on position or on names in pmap
functions, and I tend to go for name matching. You can use formula coding with the tilde in pmap
variants, but I find the code more difficult to understand when I have more than three or so columns.
Within the function I make a column for the response variable, repeating \(1\) num_dead
times and \(0\) total - num_dead
times for each row of the original data. I’m taking advantage of recycling in data.frame()
to keep the plot
and group
columns in the output, as well.
binary_dat = pmap_dfr(dat, function(group, plot, num_dead, total) { data.frame(plot = plot, group = group, dead = c( rep(1, num_dead), rep(0, total - num_dead) ) ) } )
Here are the first 6 rows of this new dataset. You can see for the first plot, plot1
, there are five rows of \(1\) and one row of \(0\).
head(binary_dat) # plot group dead # 1 plot1 g1 1 # 2 plot1 g1 1 # 3 plot1 g1 1 # 4 plot1 g1 1 # 5 plot1 g1 0 # 6 plot2 g1 1
This matches the information in the first row of the original dataset.
dat[1, ] # plot group num_dead total # 1 plot1 g1 4 5
Aside: pmap functions with more columns
My anonymous function in pmap_dfr()
works fine in its current form as long as every column is included as a function argument. If I had extra columns that I didn’t want to remove and wasn’t using in the function, however, I would get an error.
To bypass this problem you can add dots, ...
, to the anonymous function to refer to all other columns not being used.
function(group, plot, num_dead, total, ...)
Comparing analysis results
While I definitely learned that binomial data can be analyzed in binary format and returns identical results in a GLM class, for some reason I often have to re-convince myself this is true. ???? This is clear when I do an analysis with each dataset and compare results.
Binomial model
Here’s results from comparing the two groups for the binomial model.
fit = glm( cbind(num_dead, total - num_dead) ~ group, data = dat, family = binomial) summary(fit)$coefficients # Estimate Std. Error z value Pr(>|z|) # (Intercept) 1.098612 0.4364358 2.517237 0.0118279240 # groupg2 -2.014903 0.5748706 -3.504968 0.0004566621
Binary model
The binary model gives identical results for estimates and statistical tests.
fit_binary = glm( dead ~ group, data = binary_dat, family = binomial) summary(fit_binary)$coefficients # Estimate Std. Error z value Pr(>|z|) # (Intercept) 1.098612 0.4364354 2.517239 0.0118278514 # groupg2 -2.014903 0.5748701 -3.504971 0.0004566575
Expanding binomial to binary via nesting
Doing the expansion with nesting plus purrr::map()
inside mutate()
is another option, although this seems less straightforward to me for this particular case. It does keep the other variables in the dataset, though, without having to manually include them in the output data.frame like I did above.
binary_dat2 = dat %>% nest(data = c(num_dead, total) ) %>% mutate(dead = map(data, ~c( rep(1, .x$num_dead), rep(0, .x$total - .x$num_dead) ) ) ) %>% select(-data) %>% unnest(dead) head(binary_dat2) # # A tibble: 6 x 3 # plot group dead # <fct> <fct> <dbl> # 1 plot1 g1 1 # 2 plot1 g1 1 # 3 plot1 g1 1 # 4 plot1 g1 1 # 5 plot1 g1 0 # 6 plot2 g1 1
Just the code, please
Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.
library(purrr) # 0.3.2 library(tidyr) # 1.0.0 library(dplyr) # 0.8.3 dat = structure(list(plot = structure(1:8, .Label = c("plot1", "plot2", "plot3", "plot4", "plot5", "plot6", "plot7", "plot8"), class = "factor"), group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("g1", "g2"), class = "factor"), num_dead = c(4L, 6L, 6L, 5L, 1L, 4L, 3L, 2L), total = c(5L, 7L, 9L, 7L, 8L, 10L, 10L, 7L)), class = "data.frame", row.names = c(NA, -8L)) dat binary_dat = pmap_dfr(dat, function(group, plot, num_dead, total) { data.frame(plot = plot, group = group, dead = c( rep(1, num_dead), rep(0, total - num_dead) ) ) } ) head(binary_dat) dat[1, ] function(group, plot, num_dead, total, ...) fit = glm( cbind(num_dead, total - num_dead) ~ group, data = dat, family = binomial) summary(fit)$coefficients fit_binary = glm( dead ~ group, data = binary_dat, family = binomial) summary(fit_binary)$coefficients binary_dat2 = dat %>% nest(data = c(num_dead, total) ) %>% mutate(dead = map(data, ~c( rep(1, .x$num_dead), rep(0, .x$total - .x$num_dead) ) ) ) %>% select(-data) %>% unnest(dead) head(binary_dat2)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.