RObservations #37: Demystifying the tapply() function and comparing it to the “tidy” approach.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
Many seasoned base R users rely on the tapply() function in many contexts and talk about how powerful it is. However, many new R users have either never seen tapply(), or have seen it and are unsure how it works. The documentation is not very helpful in explaining it either:
[tapply() applies] a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.
While I saw other programmers use this function, I was unsure of how it worked and of when I would need to use it. In this blog I attempt to change that and explain the cryptic description by showing some applications with my commentary and how it compares to using the “tidy” approach with tidyverse.
My inspiration for writing this blog came from seeing Dr. Norm Matloff’s blog, where he mentions the use of tapply() and shares his thoughts on the tidyverse. For a more thorough treatment of his critique of the tidyverse and “tidy” methods, check out his formal essay here.
Now to get into understanding and using tapply().
What is a ragged array?
After digging around on Google, I found a good definition of ragged arrays (also referred to as jagged arrays) on GeeksforGeeks:
A [r]agged array is an array of arrays such that member arrays can be of different sizes […].
The Stan user guide however offers a better definition:
Ragged arrays are arrays that are not rectangular, but have different sized entries. This kind of structure crops up when there are different numbers of observations per entry. A general approach to dealing with ragged structure is to move to a full database-like data structure […].
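To make the two definitions concrete, here is a minimal sketch in base R. The data and the names ragged, long, group and value are made up for illustration and do not come from either source. It shows a list whose entries have different lengths, and the same data moved into a “database-like” long format with one row per observation.

```r
# Hypothetical toy data: three groups with different numbers of observations
ragged <- list(a = c(1, 2, 3), b = c(4, 5), c = 6)
lengths(ragged)  # group sizes differ: 3, 2 and 1

# The same data in a "database-like" long format: one row per observation
long <- data.frame(
  group = factor(rep(names(ragged), times = lengths(ragged))),
  value = unlist(ragged, use.names = FALSE)
)

# The grouping column is enough to recover the ragged structure
split(long$value, long$group)
```

The long format is the shape that tapply() works with: a vector of values plus one or more factors saying which group each value belongs to.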
To understand this further, we can call the str() function on the warpbreaks dataset (the dataset used in the tapply() documentation):
str(warpbreaks)
## 'data.frame':    54 obs. of  3 variables:
##  $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
##  $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
While the dataset is a data frame, it can also be understood as a ragged array. Each of the 54 observations records (1) the number of warp breaks per loom, (2) the wool type (two types) and (3) the level of tension applied (three levels). In this particular dataset every combination of wool and tension happens to have the same number of observations (nine), but nothing in the approach depends on that. With the two factor variables it is possible to aggregate the number of breaks. We can do this by importing tidyverse and using the group_by() and summarize() functions, or we can use the tapply() function.
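Before comparing the two approaches, it may help to see what “applying a function to each group of values” amounts to. The sketch below is only a conceptual picture of the idea, not a description of how tapply() is implemented internally; the names groups, by_hand and by_tapply are mine.

```r
# Step 1 (split): one vector of break counts per wool type
groups <- split(warpbreaks$breaks, warpbreaks$wool)

# Step 2 (apply and combine): sum each vector and collect the results
by_hand <- sapply(groups, sum)

# The one-line tapply() call gives the same totals
by_tapply <- tapply(warpbreaks$breaks, warpbreaks$wool, sum)
all.equal(as.vector(by_hand), as.vector(by_tapply))
```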
Using tapply() vs tidyverse
Using one dimension
Suppose we are interested in determining the total number of breaks for each wool type, ignoring tension level. To do this with tapply() we would write:
tapply(X=warpbreaks$breaks, INDEX=warpbreaks$wool, FUN = sum)
##   A   B 
## 838 682
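As a small aside, the result of the call above can be stored and used like any named vector, and FUN does not have to be sum. The snippet below is a minimal sketch; the variable name totals is my own.

```r
# Store the per-wool totals and look one up by its group label
totals <- tapply(X = warpbreaks$breaks, INDEX = warpbreaks$wool, FUN = sum)
totals["A"]     # total breaks for wool type A
names(totals)   # group labels come from the factor levels

# Any summary function can be swapped in for sum; extra arguments
# such as na.rm are passed on to FUN
tapply(X = warpbreaks$breaks, INDEX = warpbreaks$wool, FUN = mean, na.rm = TRUE)
```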
To get the same totals per wool type using tidyverse we would need to do the following:
library(tidyverse)
warpbreaks %>%
  group_by(wool) %>%
  summarize(total_breaks=sum(breaks)) %>%
  ungroup() %>%
  pivot_wider(names_from=wool, values_from=total_breaks)
## # A tibble: 1 x 2
##       A     B
##   <dbl> <dbl>
## 1   838   682
Using two dimensions
Suppose we are now interested in determining the total number of breaks for each combination of wool type and tension level. To do this with tapply() we would write:
tapply(X=warpbreaks$breaks, INDEX=warpbreaks[,-1], FUN = sum)
##     tension
## wool   L   M   H
##    A 401 216 221
##    B 254 259 169
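The INDEX argument above is a data frame of the two factor columns. A more explicit way to write the same call, sketched below, is to pass a named list of factors. The result is a two-dimensional array, so it can be indexed by label and summed along either margin; the name by_wool_tension is my own.

```r
# Same aggregation, with the grouping factors spelled out as a named list
by_wool_tension <- tapply(
  X = warpbreaks$breaks,
  INDEX = list(wool = warpbreaks$wool, tension = warpbreaks$tension),
  FUN = sum
)

by_wool_tension["A", "L"]   # wool A at low tension: 401
rowSums(by_wool_tension)    # totals per wool type: A 838, B 682
colSums(by_wool_tension)    # totals per tension level
```

Note that the row sums reproduce the one-dimensional result from the previous section.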
To produce the same wool-by-tension table with tidyverse we would write:
warpbreaks %>%
  group_by(wool,tension) %>%
  summarize(total_breaks=sum(breaks)) %>%
  ungroup() %>%
  pivot_wider(names_from=tension, values_from=total_breaks)
## # A tibble: 2 x 4
##   wool      L     M     H
##   <fct> <dbl> <dbl> <dbl>
## 1 A       401   216   221
## 2 B       254   259   169
If we benchmark these two approaches over 1000 replications, we see that tapply() is the clear winner: on this dataset it is roughly 185 times faster in both cases.
library(rbenchmark)
benchmark(
  'tidyverse'= {warpbreaks %>%
      group_by(wool,tension) %>%
      summarize(total_breaks=sum(breaks)) %>%
      ungroup() %>%
      pivot_wider(names_from=tension, values_from=total_breaks)},
  'tapply()'=tapply(X=warpbreaks$breaks, INDEX=warpbreaks[,-1], FUN = sum),
  replications = 1000
)
##        test replications elapsed relative user.self sys.self user.child sys.child
## 2  tapply()         1000    0.08    1.000      0.08     0.00         NA        NA
## 1 tidyverse         1000   14.87  185.875     14.69     0.05         NA        NA

benchmark(
  'tidyverse'= {warpbreaks %>%
      group_by(wool) %>%
      summarize(total_breaks=sum(breaks)) %>%
      ungroup() %>%
      pivot_wider(names_from=wool, values_from=total_breaks)},
  'tapply()'=tapply(X=warpbreaks$breaks, INDEX=warpbreaks$wool, FUN = sum),
  replications = 1000
)
##        test replications elapsed relative user.self sys.self user.child sys.child
## 2  tapply()         1000    0.05        1      0.04     0.00         NA        NA
## 1 tidyverse         1000    9.25      185      9.22     0.01         NA        NA
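The warpbreaks dataset has only 54 rows, so the timings above mostly reflect the fixed overhead of each call. If you want to check how the comparison holds up on more data, something like the sketch below could be a starting point. Everything in it is an assumption made for illustration: the simulated data frame big, its size, and the use of system.time() instead of rbenchmark. I have not included results because they will depend on your machine and package versions.

```r
# Simulated data: the size and distributions are assumptions for illustration
set.seed(1)
n <- 1e5
big <- data.frame(
  breaks  = rpois(n, lambda = 28),
  wool    = factor(sample(c("A", "B"), n, replace = TRUE)),
  tension = factor(sample(c("L", "M", "H"), n, replace = TRUE),
                   levels = c("L", "M", "H"))
)

# Time the base R version
base_time <- system.time(
  base_result <- tapply(big$breaks, big[, c("wool", "tension")], sum)
)
print(base_time)

# Time the dplyr version only if dplyr is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  tidy_time <- system.time(
    tidy_result <- dplyr::summarize(
      dplyr::group_by(big, wool, tension),
      total_breaks = sum(breaks),
      .groups = "drop"
    )
  )
  print(tidy_time)
}
```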
Conclusion
From the benchmarks above, it's clear that tapply() is a much faster way of aggregating data than the “tidy” approach, at least on a dataset of this size.
In my experience, “tidy” methods let me write code in the same manner that I think through the solution to a problem. While they may not be ideal for computational speed or teaching, in my own work they have allowed me to approach problems systematically and to write the code as the solution develops. But, after having worked through a few examples of using tapply(), I will definitely have it as one of my go-to functions and will experiment further with it.
What do you think? Let me know in the comments!
Thank you for reading!