Handling Categorical Data in R – Part 2
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This is part 2 of a series on “Handling Categorical Data in R where we are learning to read, store, summarize, visualize & manipulate categorical data..”
In part 1 of this series, we understood what categorical data is, how R stores it using factor and got an overview of the case study we will be working on throughout this series.
In this article, we will learn to summarize categorical data. In the process, we will do a deep dive on working with tables in R and explore a diverse set of packages.
Table of Contents
Resources
You can download all the data sets, R scripts, practice questions and their solutions from our GitHub repository.
Introduction
Categorical data cannot be summarized in the same way as numeric data. It does not make sense to look at range, standard deviation etc. since data consists of a few distinct values only. So how do we summarize such data? We can look at
- count/frequency
- proportion
- cumulative frequency
- cross table
- contingency table etc.
In this section, we will explore the above ways of summarizing categorical data. We will also spend some time learning about tables as you will be using them extensively while working with categorical data. R has many packages for tabulating data and we list and explore all of them in the R scripts shared in the GitHub repository.
Number of Categories
From our case study, we want to know the number of devices used to browse the website, the name of the devices and the proportion of traffic they drive to our website. Let us read the case study data set before we analyze the website traffic.
# read data data <- readRDS('analytics.rds')
Let us begin with the number of devices. To view the number of groups/categories in a categorical variable, use nlevels()
.
nlevels(data$device) ## [1] 3
There are 3 categories of devices used by the visitors to browse the website. This can also be used for data
sanitization i.e. as an analyst you know that there are only 3 valid categories of device into which any visitor can be classified into. If you see more than 3 categories, you might want to check if there are any issues in data collection or processing. Now that we know there are 3 categories of devices, let us check if they are valid. The levels()
function will return the labels of the groups.
Category Names
Knowing the number of levels is useful but not sufficient. levels()
is one of the most useful functions when it comes to dealing with categorical data.
levels(data$device) ## [1] "Desktop" "Mobile" "Tablet"
Other functions that you can use include unique()
and fct_unique()
. Both these functions will return the unique names/labels along with the levels while levels()
returns the labels of the levels.
unique(data$device) ## [1] Desktop Mobile Tablet ## Levels: Desktop Mobile Tablet fct_unique(data$device) ## [1] Desktop Mobile Tablet ## Levels: Desktop Mobile Tablet
Names & Counts
So we have checked the number of devices and their names. Let us now examine their distribution
i.e. count/frequency. table()
and summary()
will display the levels and their counts while fct_count()
will return a tibble with 2 columns (level & count). It is extremely useful for further data processing or
visualization (using ggplot2).
table(data$device) ## ## Desktop Mobile Tablet ## 177282 63482 3634 fct_count(data$device) ## # A tibble: 3 x 2 ## f n ## <fct> <int> ## 1 Desktop 177282 ## 2 Mobile 63482 ## 3 Tablet 3634 summary(data$device) ## Desktop Mobile Tablet ## 177282 63482 3634
Tables
In the previous section, we used the table()
function to tabulate categorical data. We will
recreate the tabulation for device and store it in a new variable tab.
tab <- table(data$device) tab ## ## Desktop Mobile Tablet ## 177282 63482 3634
What does this function return? It is not a vector
, list
, data.frame
or matrix
. Let us use the class()
function to check the class of the object returned by table()
. It returns an object of the class table. This is a new type of object. Let us spend some time understanding tables as they are useful for organizing and summarizing categorical data. table is also the most used object when it comes to dealing with categorical data.
The table()
function returns the counts of the categories but let us say we want to view the proportion or
percentage instead of counts i.e. the proportion or percentage of traffic driven to our website by the different devices. The proportions()
or prop.table()
function comes in handy in such cases. It takes a table object as input (tab in our case).
prop.table(tab) ## ## Desktop Mobile Tablet ## 0.72538237 0.25974844 0.01486919 proportions(tab) ## ## Desktop Mobile Tablet ## 0.72538237 0.25974844 0.01486919
To get the percentages, multiply the output by 100. Use the round()
function to round the decimal places
according to your requirements.
proportions(tab) * 100 ## ## Desktop Mobile Tablet ## 72.538237 25.974844 1.486919 round(proportions(tab) * 100, 2) ## ## Desktop Mobile Tablet ## 72.54 25.97 1.49
So far, we have used table()
to tabulate a single categorical variable. It can be used for a lot
more than just tabulating data. We can examine the relationship between two categorical variables as well
as create multidimensional tables. Let us look at the relationship between gender and device in our case study. Does gender affect the type of device used? To answer this, we will create a two way or cross table. In the table()
function, we can specify multiple variables by separating them with a comma.
tab2 <- table(data$gender, data$device) tab2 ## ## Desktop Mobile Tablet ## female 32803 7268 494 ## male 46418 14503 696 ## <NA> 98061 41711 2444
Keep in mind that the order of the variables matter. Rows represent the first variable while column represents the second.
table(data$device, data$gender) ## ## female male <NA> ## Desktop 32803 46418 98061 ## Mobile 7268 14503 41711 ## Tablet 494 696 2444
The proportions()
function works with two way tables as well.
proportions(tab2) ## ## Desktop Mobile Tablet ## female 0.134219593 0.029738378 0.002021293 ## male 0.189927904 0.059341729 0.002847814 ## <NA> 0.401234871 0.170668336 0.010000082 proportions(tab2) * 100 ## ## Desktop Mobile Tablet ## female 13.4219593 2.9738378 0.2021293 ## male 18.9927904 5.9341729 0.2847814 ## <NA> 40.1234871 17.0668336 1.0000082
We would like to introduce another function at this point of time, margin.table()
. What does this function do? It computes the marginal frequencies i.e. the sum of the rows or columns. It takes a table object as input. The margin
argument allows us to specify whether we want the sum of rows or columns. 1
indicates rows and 2
indicates columns.
margin.table(tab2, 1) # sum of rows ## ## female male <NA> ## 40565 61617 142216 margin.table(tab2, 2) # sum of columns ## ## Desktop Mobile Tablet ## 177282 63482 3634
If the margin argument is NULL (which it is by default), the function returns the sum of all cells of the table.
margin.table(tab2) ## [1] 244398
table()
does not display row or column labels. It does display the group labels though. Let us revisit the output from tab2
. You can observe that while it includes the group labels, the row and column labels are missing. The output from the dimnames()
function shows the group labels of the variables but the row & column labels are absent.
dimnames(tab2) ## [[1]] ## [1] "female" "male" NA ## ## [[2]] ## [1] "Desktop" "Mobile" "Tablet" names(tab2) ## NULL names(dimnames(tab2)) ## [1] "" ""
The output from names(dimnames(tab2))
is also empty. Let us add the variable names as the row & column
labels to tab2
.
names(dimnames(tab2)) <- c("Gender", "Device") tab2 ## Device ## Gender Desktop Mobile Tablet ## female 32803 7268 494 ## male 46418 14503 696 ## <NA> 98061 41711 2444
Now look at the output from tab2
and you can observe the difference. The same is also visible when we run
dimnames(tab2)
.
dimnames(tab2) ## $Gender ## [1] "female" "male" NA ## ## $Device ## [1] "Desktop" "Mobile" "Tablet"
To add margin totals to the table, use addmargins()
. Like proportions()
and margin.table()
, it also takes a table
object as the input.
addmargins(tab2) ## Device ## Gender Desktop Mobile Tablet Sum ## female 32803 7268 494 40565 ## male 46418 14503 696 61617 ## <NA> 98061 41711 2444 142216 ## Sum 177282 63482 3634 244398
rowSums()
returns the row total while colSums()
returns the column total. They are similar to margin.table()
.
rowSums(tab2) ## female male <NA> ## 40565 61617 142216 colSums(tab2) ## Desktop Mobile Tablet ## 177282 63482 3634
xtabs()
is another way of creating multidimensional tables in R. In comparison to table()
, it
- uses
formula
notation for input - the data argument ensures variable names are referenced instead of using $ i.e.
data$variable
- displays row & column labels by default
tabx <- xtabs(~gender+device, data = data) tabx ## device ## gender Desktop Mobile Tablet ## female 32803 7268 494 ## male 46418 14503 696 ## <NA> 98061 41711 2444
The following functions work with xtabs()
as well
proportions()
margin.table()
addmargins()
proportions(tabx) ## device ## gender Desktop Mobile Tablet ## female 0.134219593 0.029738378 0.002021293 ## male 0.189927904 0.059341729 0.002847814 ## <NA> 0.401234871 0.170668336 0.010000082 margin.table(tabx, 1) ## gender ## female male <NA> ## 40565 61617 142216 margin.table(tabx, 2) ## device ## Desktop Mobile Tablet ## 177282 63482 3634 addmargins(tabx) ## device ## gender Desktop Mobile Tablet Sum ## female 32803 7268 494 40565 ## male 46418 14503 696 61617 ## <NA> 98061 41711 2444 142216 ## Sum 177282 63482 3634 244398
So far, we have been working with one or two dimensional tables. Both the table()
and xtabs()
functions are capable of creating multidimensional tables. Keep in mind that multidimensional tables are complex and it becomes increasingly difficult to understand or interpret them.
tab3 <- xtabs(~gender+device+channel, data = data) tab3 ## , , channel = (Other) ## ## device ## gender Desktop Mobile Tablet ## female 786 258 0 ## male 1063 507 19 ## <NA> 2173 1186 81 ## ## , , channel = Affiliates ## ## device ## gender Desktop Mobile Tablet ## female 1314 60 0 ## male 1714 169 0 ## <NA> 3518 548 65 ## ## , , channel = Direct ## ## device ## gender Desktop Mobile Tablet ## female 4785 977 59 ## male 7010 2381 95 ## <NA> 15824 8292 430 ## ## , , channel = Display ## ## device ## gender Desktop Mobile Tablet ## female 123 753 104 ## male 210 491 73 ## <NA> 554 911 156 ## ## , , channel = Organic Search ## ## device ## gender Desktop Mobile Tablet ## female 17109 4480 282 ## male 25016 9563 448 ## <NA> 54071 27223 1476 ## ## , , channel = Paid Search ## ## device ## gender Desktop Mobile Tablet ## female 645 230 22 ## male 887 478 26 ## <NA> 1274 782 51 ## ## , , channel = Referral ## ## device ## gender Desktop Mobile Tablet ## female 7387 74 0 ## male 9251 185 0 ## <NA> 18052 615 51 ## ## , , channel = Social ## ## device ## gender Desktop Mobile Tablet ## female 654 436 27 ## male 1267 729 35 ## <NA> 2595 2154 134
ftable stands for flat tables and is useful for printing attractive tables. It makes it easy to read and interpret multidimensional tables. In the next example, we will use ftable()
to print the tables we have created in the previous examples and compare the outputs.
ftable(tabx) ## device Desktop Mobile Tablet ## gender ## female 32803 7268 494 ## male 46418 14503 696 ## NA 98061 41711 2444 ftable(tab2) ## Device Desktop Mobile Tablet ## Gender ## female 32803 7268 494 ## male 46418 14503 696 ## NA 98061 41711 2444 ftable(tab3) ## channel (Other) Affiliates Direct Display Organic Search Paid Search Referral Social ## gender device ## female Desktop 786 1314 4785 123 17109 645 7387 654 ## Mobile 258 60 977 753 4480 230 74 436 ## Tablet 0 0 59 104 282 22 0 27 ## male Desktop 1063 1714 7010 210 25016 887 9251 1267 ## Mobile 507 169 2381 491 9563 478 185 729 ## Tablet 19 0 95 73 448 26 0 35 ## NA Desktop 2173 3518 15824 554 54071 1274 18052 2595 ## Mobile 1186 548 8292 911 27223 782 615 2154 ## Tablet 81 65 430 156 1476 51 51 134
By default, missing values (NAs) are excluded from tables. Let us modify the gender data from our case study a bit and see how the table()
function deals with missing values. We won’t explicitly specify NA
as a level while recreating the gender data.
gen <- as.factor(as.character(data$gender)) table(gen) ## gen ## female male ## 40565 61617
As you can see, table()
excludes missing values while tabulating the data. In order to ensure that missing
values are also counted, we can use the useNA
argument. It can take two values:
- ifany
- always
In the first case, it will show NA
as a level and the count only if there are missing values in the data. In the second case, it will always show NA
as a level irrespective of whether there are missing values in the data or not.
table(gen, useNA = "ifany") ## gen ## female male <NA> ## 40565 61617 142216 table(data$device, useNA = "always") ## ## Desktop Mobile Tablet <NA> ## 177282 63482 3634 0
In this final section on tables, we will learn how to select/access the different parts of a table
. We will use [
operator to select rows and columns of a table (it is similar to selecting data from a
data.frame
). Below are a few examples:
- select first row
tab2[1, ] ## Desktop Mobile Tablet ## 32803 7268 494
- select first column
tab2[, 1] ## female male <NA> ## 32803 46418 98061
- select first two rows
tab2[1:2, ] ## Device ## Gender Desktop Mobile Tablet ## female 32803 7268 494 ## male 46418 14503 696
- select first two columns
tab2[, 1:2] ## Device ## Gender Desktop Mobile ## female 32803 7268 ## male 46418 14503 ## <NA> 98061 41711
- select nth row
tab2[2, ] ## Desktop Mobile Tablet ## 46418 14503 696
- select nth column
tab2[, 2] ## female male <NA> ## 7268 14503 41711
- select row by group label
tab2["female", ] ## Desktop Mobile Tablet ## 32803 7268 494
- select column by group label
tab2[, "Mobile"] ## female male <NA> ## 7268 14503 41711
Before we end this section, let us learn how to test if an object is of class table using is.table()
.
is.table(tab2) ## [1] TRUE
Next, we will look at different R packages for two way/contingency tables.
Contingency Table
For cross tables with output similar to SAS or SPSS, use any of the below:
gmodels::CrossTable(data$device, data$gender) ## ## ## Cell Contents ## |-------------------------| ## | N | ## | Chi-square contribution | ## | N / Row Total | ## | N / Col Total | ## | N / Table Total | ## |-------------------------| ## ## ## Total Observations in Table: 102182 ## ## ## | data$gender ## data$device | female | male | Row Total | ## -------------|-----------|-----------|-----------| ## Desktop | 32803 | 46418 | 79221 | ## | 58.228 | 38.334 | | ## | 0.414 | 0.586 | 0.775 | ## | 0.809 | 0.753 | | ## | 0.321 | 0.454 | | ## -------------|-----------|-----------|-----------| ## Mobile | 7268 | 14503 | 21771 | ## | 218.694 | 143.975 | | ## | 0.334 | 0.666 | 0.213 | ## | 0.179 | 0.235 | | ## | 0.071 | 0.142 | | ## -------------|-----------|-----------|-----------| ## Tablet | 494 | 696 | 1190 | ## | 0.986 | 0.649 | | ## | 0.415 | 0.585 | 0.012 | ## | 0.012 | 0.011 | | ## | 0.005 | 0.007 | | ## -------------|-----------|-----------|-----------| ## Column Total | 40565 | 61617 | 102182 | ## | 0.397 | 0.603 | | ## -------------|-----------|-----------|-----------| ## ## descriptr::ds_cross_table(data, device, gender) ## Cell Contents ## |---------------| ## | Frequency | ## | Percent | ## | Row Pct | ## | Col Pct | ## |---------------| ## ## Total Observations: 244398 ## ## ---------------------------------------------------------------------------- ## | | gender | ## ---------------------------------------------------------------------------- ## | device | female | male | NA | Row Total | ## ---------------------------------------------------------------------------- ## | Desktop | 32803 | 46418 | 98061 | 177282 | ## | | 0.134 | 0.19 | 0.401 | | ## | | 0.19 | 0.26 | 0.55 | 0.73 | ## | | 0.81 | 0.75 | 0.69 | | ## ---------------------------------------------------------------------------- ## | Mobile | 7268 | 14503 | 41711 | 63482 | ## | | 0.03 | 0.059 | 0.171 | | ## | | 0.11 | 0.23 | 0.66 | 0.26 | ## | | 0.18 | 0.24 | 0.29 | | ## ---------------------------------------------------------------------------- ## | Tablet | 494 | 696 | 2444 | 3634 | ## | | 0.002 | 0.003 | 0.01 | | ## | | 0.14 | 0.19 | 0.67 | 0.01 | ## | | 0.01 | 0.01 | 0.02 | | ## ---------------------------------------------------------------------------- ## | Column Total | 40565 | 61617 | 142216 | 244398 | ## | | 0.166 | 0.252 | 0.582 | | ## ----------------------------------------------------------------------------
We list and explore different R packages for summarizing categorical data in our GitHub repository.
Key Functions
Your Turn…
Display the number of levels in
browser
channel
landing_page
exit_page
Display the categories in
os
channel
browser
gender
user_type
Display the count/frequency of
channel
user_type
Examine the distribution of the following and summarize your observations:
channel
byuser_type
device
bypurchase_flag
channel
bydevice
channel
bypurchase_flag
user_type
bypurchase_flag
*As the reader of this blog, you are our most important critic and commentator. We value your opinion and want to know what we are doing right, what we could do better, what areas you would like to see us publish in, and any other words of wisdom you are willing to pass our way.
We welcome your comments. You can email to let us know what you did or did not like about our blog as well as what we can do to make our post better.*
Email: [email protected]
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.