[R] Compute stats on grouped data using dplyr
    library(knitr)
    opts_chunk$set(echo = TRUE, warning = TRUE, fig.width = 7, fig.height = 7)
    library(dplyr)
dplyr is a great package for processing data smoothly: the output of one function can easily be passed to the next function with the pipe operator %>%. Today, I would like to share my experience of processing grouped data with functions that return multiple columns or rows.
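To make the piping idea concrete, here is a minimal sketch comparing a nested call with the equivalent piped call. The vector v is a made-up example, not data from this post.

```r
library(dplyr)  # makes the %>% operator available

v <- c(1, 4, 9)  # hypothetical example values

# Nested call: read from the inside out
nested <- sum(sqrt(v))

# Piped call: read from left to right; each result becomes the
# first argument of the next function
piped <- v %>% sqrt() %>% sum()

identical(nested, piped)
```

Both forms compute the same value; the piped form simply lists the steps in the order they run.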
Data
We will generate some fake data for this test.
    g1 <- rep(c("a", "b", "c"), each = 10)
    g2 <- rep(c("A", "B"), each = 5, length.out = 30)
    x <- rnorm(30, 0, 10)
    dat <- data.frame(g1, g2, x)
    kable(head(dat))
g1 | g2 | x |
---|---|---|
a | A | 0.9489863 |
a | A | -0.5877112 |
a | A | 11.6968621 |
a | A | 7.0672042 |
a | A | 12.6570049 |
a | B | 8.1088123 |
As you can see, this dataset includes 3 columns:
g1: a categorical variable with 3 levels, “a”, “b”, and “c”
g2: another categorical variable with 2 levels, “A” and “B”, which are evenly balanced within each value of g1
x: the value column, which will be statistically tested
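One way to check the claimed balance is to count the rows in each g1/g2 cell. This is a small sketch that rebuilds the data the same way as above; the set.seed() call is my own assumption for reproducibility (the post does not set a seed), so the x values will differ from the table shown.

```r
library(dplyr)

set.seed(1)  # assumption: arbitrary seed, not used in the original post
g1 <- rep(c("a", "b", "c"), each = 10)
g2 <- rep(c("A", "B"), each = 5, length.out = 30)
x <- rnorm(30, 0, 10)
dat <- data.frame(g1, g2, x)

# One row per (g1, g2) combination, with the number of observations in n
cell_counts <- dat %>% count(g1, g2)
cell_counts
```

Every combination of g1 and g2 should have the same count, which is what lets us compare “A” against “B” within each level of g1.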
Tests
We will run a statistical test on the value x between “A” and “B” (column g2) within each category of g1. For this, we will use dplyr’s group_by() function to split the data, then run the test on each subset, with each subset returning a data.frame. This data.frame can be single-row or multi-row, so let’s see how dplyr handles each kind of result.
First, let’s use the summarize() function to collapse each group into one row of results.
    # this function returns a single-row data.frame
    my_test <- function(v, g) {
      res <- t.test(v ~ g)
      df <- data.frame(mean1 = res$estimate[1], mean2 = res$estimate[2], P = res$p.value)
      return(df)
    }

    dat %>%
      group_by(g1) %>%
      summarize(with(.data, my_test(x, g2))) %>%
      kable()
g1 | mean1 | mean2 | P |
---|---|---|---|
a | 6.356469 | 3.681912 | 0.5029446 |
b | -1.226555 | -4.937334 | 0.4937346 |
c | 6.793084 | 2.280872 | 0.3598557 |
As you can see, this works well: summarize() returns a new data.frame with the group name and the result for each group.
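A quick sanity check for the grouped result is to run the same t-test by hand on one group and compare the p-values. The sketch below makes a few assumptions of its own: it sets a seed (so the numbers differ from the table above), it passes the columns to my_test() directly instead of through with(.data, ...), and it wraps the estimates in unname() to avoid carrying row names.

```r
library(dplyr)

set.seed(1)  # assumption: arbitrary seed, not used in the original post
dat <- data.frame(
  g1 = rep(c("a", "b", "c"), each = 10),
  g2 = rep(c("A", "B"), each = 5, length.out = 30),
  x  = rnorm(30, 0, 10)
)

# Single-row helper, as in the post, with unname() added
my_test <- function(v, g) {
  res <- t.test(v ~ g)
  data.frame(mean1 = unname(res$estimate[1]),
             mean2 = unname(res$estimate[2]),
             P = res$p.value)
}

# Grouped version: an unnamed data.frame result is unpacked into columns
grouped <- dat %>%
  group_by(g1) %>%
  summarize(my_test(x, g2))

# Same test by hand on group "a" only
sub_a <- subset(dat, g1 == "a")
by_hand <- t.test(x ~ g2, data = sub_a)

all.equal(grouped$P[grouped$g1 == "a"], by_hand$p.value)
```

If the two p-values agree, the helper really was applied to each group's rows and not to the whole data set.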
Let’s also try a function returning multiple rows.
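    # this function returns a multi-row data.frame
    my_quantiles <- function(v) {
      probs <- seq(0, 1, 0.25)
      qt <- quantile(v, probs = probs)  # use the argument v, not the global x
      data.frame(quant = qt, prob = probs)
    }

    dat %>%
      group_by(g1) %>%
      summarize(with(.data, my_quantiles(x))) %>%
      kable()

    ## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
    ## dplyr 1.1.0.
    ## ℹ Please use `reframe()` instead.
    ## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
    ## always returns an ungrouped data frame and adjust accordingly.
    ## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
    ## generated.
    ## `summarise()` has grouped output by 'g1'. You can override using the `.groups`
    ## argument.

The table below is the output of the run as originally written, where my_quantiles() called quantile(x, probs = probs) and therefore read the global x (all 30 values) instead of the group values passed in as v. That is why the three groups show identical quantiles; with quantile(v, probs = probs), as in the listing above, each group gets its own.
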
g1 | quant | prob |
---|---|---|
a | -13.250311 | 0.00 |
a | -2.206370 | 0.25 |
a | 1.124772 | 0.50 |
a | 9.544496 | 0.75 |
a | 16.158428 | 1.00 |
b | -13.250311 | 0.00 |
b | -2.206370 | 0.25 |
b | 1.124772 | 0.50 |
b | 9.544496 | 0.75 |
b | 16.158428 | 1.00 |
c | -13.250311 | 0.00 |
c | -2.206370 | 0.25 |
c | 1.124772 | 0.50 |
c | 9.544496 | 0.75 |
c | 16.158428 | 1.00 |
The call still returns a result, but with a deprecation warning that asks me to replace summarize() with reframe(), so let’s try this new function.
    dat %>%
      group_by(g1) %>%
      reframe(with(.data, my_quantiles(x))) %>%
      kable()
g1 | quant | prob |
---|---|---|
a | -13.250311 | 0.00 |
a | -2.206370 | 0.25 |
a | 1.124772 | 0.50 |
a | 9.544496 | 0.75 |
a | 16.158428 | 1.00 |
b | -13.250311 | 0.00 |
b | -2.206370 | 0.25 |
b | 1.124772 | 0.50 |
b | 9.544496 | 0.75 |
b | 16.158428 | 1.00 |
c | -13.250311 | 0.00 |
c | -2.206370 | 0.25 |
c | 1.124772 | 0.50 |
c | 9.544496 | 0.75 |
c | 16.158428 | 1.00 |
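Great: the deprecation warning is gone and the output keeps the same shape, five rows per group. Because this table comes from a run in which my_quantiles() read the global x rather than its argument v, the quantiles are identical across groups; with quantile(v, probs = probs) inside the function, each group gets its own quantiles.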
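As a sketch of the multi-row case, the block below uses reframe() with a helper that computes quantiles from its argument v, so each group is summarized from its own values. It assumes dplyr 1.1.0 or later (where reframe() was introduced), an arbitrary seed of my choosing, and a direct column reference instead of with(.data, ...).

```r
library(dplyr)

set.seed(1)  # assumption: arbitrary seed, not used in the original post
dat <- data.frame(
  g1 = rep(c("a", "b", "c"), each = 10),
  g2 = rep(c("A", "B"), each = 5, length.out = 30),
  x  = rnorm(30, 0, 10)
)

# Multi-row helper: five quantiles computed from the values passed in as v
my_quantiles <- function(v) {
  probs <- seq(0, 1, 0.25)
  data.frame(quant = unname(quantile(v, probs = probs)), prob = probs)
}

res <- dat %>%
  group_by(g1) %>%
  reframe(my_quantiles(x))

# reframe() returns an ungrouped data frame
is_grouped_df(res)

# The 100% quantile of group "a" should match that group's maximum
all.equal(res$quant[res$g1 == "a" & res$prob == 1],
          max(dat$x[dat$g1 == "a"]))
```

Comparing one quantile against a value computed directly from the subset is a cheap way to confirm that the helper sees only the rows of its own group.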
Conclusions
We can combine group_by(), summarize(), and a function that returns a data.frame to easily analyze data by group with dplyr. When the returned data.frame is multi-row, summarize() should be replaced with reframe().
References
reframe(): https://dplyr.tidyverse.org/reference/reframe.html
mutate() with multi-row results: https://stackoverflow.com/questions/73398676/dplyrmutate-when-custom-function-return-a-vector