The advantages of using count() to get N-way frequency tables as data frames in R
Introduction
I recently introduced how to use the count() function in the “plyr” package in R to produce 1-way frequency tables. Several commenters provided alternative ways of doing so, and they are all appreciated. Today, I want to extend that tutorial by demonstrating how count() can be used to produce N-way frequency tables in the list format. This highlights the advantages of count() over other functions like table() and xtabs().
2-Way Frequencies: The Cross-Tabulated Format vs. The List-Format
A 2-way frequency table (i.e. a table of the counts in a data set for each combination of the levels of 2 categorical variables) can be displayed in a cross-tabulated format or in a list format.
In R, the xtabs() function is good for cross-tabulation. Let’s use the “mtcars” data set again; recall that it is a built-in data set in Base R.
> y = xtabs(~ cyl + gear, mtcars)
> y
   gear
cyl  3  4  5
  4  1  8  2
  6  2  4  1
  8 12  0  2
This is a nice way to visualize the counts in each of the 9 different categories as divided by the variables “gear” and “cyl”. You can use the row and column indices of this object to extract a particular value. For example, to extract the element in the third row and first column,
> y[3,1]
[1] 12
Alternatively, you can use the count() function in the “plyr” package to get the same frequencies in a list format.
> library(plyr)
> x = count(mtcars, c('cyl', 'gear'))
> x
  cyl gear freq
1   4    3    1
2   4    4    8
3   4    5    2
4   6    3    2
5   6    4    4
6   6    5    1
7   8    3   12
8   8    5    2
Notice that this object is a data frame. The column names derive naturally from its origin.
> class(x)
[1] "data.frame"
> names(x)
[1] "cyl"  "gear" "freq"
You can access any particular element in 2 ways:
- Use the row and/or column indices.
- Use particular values of “cyl” and “gear”.
For example, to find the number of cars with cyl = 8 and gear = 3, you can do
> x[7, ]$freq
[1] 12
> subset(x, cyl == 8 & gear == 3)$freq
[1] 12
I like the second method, because I don’t have to look at the values of the output table to find which row contains that particular combination of “cyl” and “gear”. This is a key advantage of the list format over the cross-tabulation format.
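If you already have a cross-tabulated object, base R can also reshape it into a list-style data frame. The following is a minimal sketch of that conversion, not part of the original tutorial; the name y_long is just an illustrative choice. Note that the result differs from the count() output above in a few details: the frequency column is named “Freq”, empty cells are kept as rows with a count of 0, and “cyl” and “gear” come back as factors.

```r
# Rebuild the 2-way cross-tabulation from the text
y <- xtabs(~ cyl + gear, data = mtcars)

# as.data.frame() on a table returns one row per cell, with the count in "Freq"
y_long <- as.data.frame(y)
print(y_long)

# All 3 x 3 cells are present, including the empty one (cyl = 8, gear = 4)
print(nrow(y_long))

# cyl and gear are factors here, so compare against their labels
n_8cyl_3gear <- subset(y_long, cyl == "8" & gear == "3")$Freq
print(n_8cyl_3gear)
```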
N-way frequencies: N > 2
Another key advantage of the list format over the cross-tabulation format is in obtaining frequency tables for 3 or more factors.
Cross-tabulations for N-way frequencies are difficult to visualize when N > 2. If N = 3, the best that you can do is to use multiple tables, one for each value of the third factor. For example,
> w = xtabs(~ cyl + gear + vs, mtcars)
> w
, , vs = 0

   gear
cyl  3  4  5
  4  0  0  1
  6  0  2  1
  8 12  0  2

, , vs = 1

   gear
cyl  3  4  5
  4  1  8  1
  6  2  2  0
  8  0  0  0
Moreover, it is now even more cumbersome to access the value of a particular combination of these 3 factors.
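To make the access pattern concrete, here is a small sketch of how a single cell of the 3-way cross-tabulation can be reached. It is an illustration rather than part of the original tutorial, and the names by_position and by_label are assumptions chosen for this example. Each lookup needs 3 indices, given either as positions or as the level labels (character strings) of “cyl”, “gear” and “vs”, in that order.

```r
# Rebuild the 3-way cross-tabulation from the text
w <- xtabs(~ cyl + gear + vs, data = mtcars)

# Positional indexing: row 3 (cyl = 8), column 1 (gear = 3), layer 1 (vs = 0)
by_position <- w[3, 1, 1]

# Indexing by level labels; the labels must be passed as character strings
by_label <- w["8", "3", "0"]

print(c(by_position = by_position, by_label = by_label))
```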
In contrast, the list format works in the same way, making it equally easy to visualize for any value of N in an N-way frequency table.
> t = count(mtcars, c('cyl', 'gear', 'vs'))
> t
   cyl gear vs freq
1    4    3  1    1
2    4    4  1    8
3    4    5  0    1
4    4    5  1    1
5    6    3  1    2
6    6    4  0    2
7    6    4  1    2
8    6    5  0    1
9    8    3  0   12
10   8    5  0    2
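Looking up a particular combination in the 3-way list format works just like the 2-way case. The sketch below assumes the “plyr” package is installed; the name t3 is used here (instead of t) only to avoid masking the base function t(), and the sorting step is an extra illustration that is not in the original tutorial.

```r
library(plyr)  # provides count(); assumes the package is installed

# 3-way frequency table in list format
t3 <- count(mtcars, c("cyl", "gear", "vs"))

# Look up one combination by value, as in the 2-way example
n <- subset(t3, cyl == 8 & gear == 3 & vs == 0)$freq
print(n)

# Because the result is a data frame, ordinary data-frame tools apply,
# e.g. sorting the combinations from most to least frequent
print(t3[order(-t3$freq), ])
```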
Filed under: Applied Statistics, Categorical Data Analysis, Data Analysis, Descriptive Statistics, R programming, Statistics, Tutorials Tagged: count, cross-tabulation, data analysis, frequency table, R, R programming, statistics, table(), xtabs()