Collapsing Categories or Values
Introduction
I have received a few queries recently that can be categorized as “How do I collapse a list of categories or values into a shorter list of categories or values?” For example, one user wanted to collapse species of fish into their respective families. Another user wanted to collapse years into decades. Data munging such as this is common in fisheries. Thus, I provide a quick demonstration here of one way to accomplish these tasks using tools from the tidyverse.
This post requires the dplyr, magrittr, and plyr packages. Note, however, that plyr is not loaded below because I am only going to use one specific function from plyr (i.e., mapvalues()) and I have found that plyr and dplyr don’t always “play well” together.[^1]
Because I am creating random example data below, I set the random number seed to make the results reproducible.
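The setup described above might look like the following minimal sketch. The seed value is an arbitrary choice of mine and is not taken from the original post; any fixed integer serves the same purpose.

```r
# Load dplyr for mutate() and case_when(), and magrittr for its pipe operators.
# plyr is deliberately NOT loaded; mapvalues() is called as plyr::mapvalues().
library(dplyr)
library(magrittr)

# A fixed seed makes the random example data reproducible.
# 12345 is an arbitrary value (assumption), not the original post's seed.
set.seed(12345)
```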
Create A Sample of Data
The following creates a very simple sample of 250 individuals on which the species (as a short abbreviation) and year of capture were recorded.
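A sketch of such a data.frame is given here. The species abbreviations, the range of years, and the seed are hypothetical values that I chose for illustration; only the sample size of 250 and the two recorded variables come from the description above.

```r
set.seed(12345)  # arbitrary seed (assumption)

# Hypothetical species abbreviations, in alphabetical order
spp <- c("BKT", "BLG", "BNT", "LMB", "WAE", "YEP")

# 250 individuals with a randomly assigned species and year of capture.
# species is made a factor explicitly; levels = spp fixes the level order.
df <- data.frame(
  species = factor(sample(spp, 250, replace = TRUE), levels = spp),
  year    = sample(1980:2019, 250, replace = TRUE)  # hypothetical year range
)
head(df)
```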
Example 1 – Recode and Collapse Categories
The mutate() function may be used to add a new variable to a data.frame. The mapvalues() function (from plyr) may be used to efficiently recode character (or factor) values in a vector. Because mapvalues() operates on a vector, it must be used within mutate() to add a new variable with the recoded values to a data.frame. When used within mutate(), the first argument to mapvalues() is the vector that contains the original data to be recoded. A vector of categories for these original data is then given in from= and a vector of new categories for these data is given in to=.
I find it simplest to first create vectors of categories for from= and to= and then use them in mapvalues(). For example, the use of levels() below extracts (and saves into short) the vector of species abbreviations found in the species variable of the example data.
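The extraction step can be sketched as follows, using hypothetical example data (the abbreviations, years, and seed are my assumptions).

```r
set.seed(12345)  # arbitrary seed (assumption)

# Hypothetical example data; species must be a factor for levels() to work
spp <- c("BKT", "BLG", "BNT", "LMB", "WAE", "YEP")
df <- data.frame(
  species = factor(sample(spp, 250, replace = TRUE), levels = spp),
  year    = sample(1980:2019, 250, replace = TRUE)
)

# levels() returns the categories of a factor as a character vector
short <- levels(df$species)
short
```

Note that levels() returns NULL for a character vector. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so if species is stored as character, sort(unique(df$species)) is one alternative way to get the same vector.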
“New categories” that correspond to each of the original categories may then be entered into a vector. For example, the long vector below contains the long-form names for each species (in the same order as the abbreviations in short) and family contains the corresponding family names.
“Column bind” these vectors together to make sure that the categories are correctly matched across the vectors.
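As an illustration, the three vectors and the column-bind check might look like this. The six species are hypothetical choices of mine; short is typed out here so that the block runs on its own, but in practice it would come from levels().

```r
# Hypothetical abbreviations (stand-in for levels() of the species variable)
short  <- c("BKT", "BLG", "BNT", "LMB", "WAE", "YEP")

# Long-form names, in the SAME order as short
long   <- c("Brook Trout", "Bluegill", "Brown Trout",
            "Largemouth Bass", "Walleye", "Yellow Perch")

# Family names, again in the same order as short
family <- c("Salmonidae", "Centrarchidae", "Salmonidae",
            "Centrarchidae", "Percidae", "Percidae")

# Each row should pair one abbreviation with its long name and family
check <- cbind(short, long, family)
check
```

Reading across each row of the resulting matrix is a quick way to catch a name that was entered out of order before the vectors are used for recoding.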
The combined use of mutate() and mapvalues() below demonstrates how these vectors may be used to change the original abbreviated names to long-form names or family names. In addition, the last use of mapvalues() shows how to change the long-form names to family names. This last example is, of course, repetitive, but it is used here to demonstrate how mutate() allows a variable that was “just created” to be immediately used.
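A sketch of the three recodings described above is given here. The data, the species vectors, and the new variable names (speciesL, fam, fam2) are hypothetical; only the use of mutate() with plyr::mapvalues() follows the text.

```r
library(dplyr)
library(magrittr)  # for the %<>% assignment pipe

set.seed(12345)  # arbitrary seed (assumption)

# Hypothetical categories, all in the same order
short  <- c("BKT", "BLG", "BNT", "LMB", "WAE", "YEP")
long   <- c("Brook Trout", "Bluegill", "Brown Trout",
            "Largemouth Bass", "Walleye", "Yellow Perch")
family <- c("Salmonidae", "Centrarchidae", "Salmonidae",
            "Centrarchidae", "Percidae", "Percidae")

# Hypothetical example data
df <- data.frame(
  species = factor(sample(short, 250, replace = TRUE), levels = short),
  year    = sample(1980:2019, 250, replace = TRUE)
)

# %<>% pipes df into mutate() and assigns the result back to df
df %<>% mutate(
  speciesL = plyr::mapvalues(species,  from = short, to = long),    # abbreviation -> long name
  fam      = plyr::mapvalues(species,  from = short, to = family),  # abbreviation -> family
  fam2     = plyr::mapvalues(speciesL, from = long,  to = family)   # long name -> family
)
head(df)
```

Because several abbreviations map to the same family, the recoded factor has fewer levels than the original, which is the collapsing that the example is meant to show.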
Note in the code above that the use of plyr:: in front of mapvalues() allows the user to use the mapvalues() function from plyr without loading the entire plyr package.[^2] As noted previously, this idiom is used here to avoid potential conflicts between plyr and dplyr.
Note that this use of mapvalues() and mutate() is described in Section 2.2.7 of my book Introductory Fisheries Analyses with R.
Example 2 – Collapse Values into Categories
The case_when() function (from dplyr) may be used to efficiently collapse discrete values into categories.[^3] This function also operates on vectors and, thus, must be used within mutate() to add a variable to a data.frame. The arguments to case_when() are a series of two-sided formulae where the left side is a condition based on the original data and the right side is the value that should appear in the new variable when that condition is TRUE. The code below creates a new variable called decade that identifies the decade that corresponds to the year-of-capture variable. For example, the first line in case_when() says “if the year variable is one of the values from 1980 to 1989, then the new category should be ‘1980s’.”[^4]
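The decade recoding might be sketched as follows. The example years, the seed, and the set of four decades are my assumptions; the form of the first condition follows the description above.

```r
library(dplyr)

set.seed(12345)  # arbitrary seed (assumption)

# Hypothetical years of capture
df <- data.frame(year = sample(1980:2019, 250, replace = TRUE))

# Each formula is condition ~ value; %in% tests membership in the year sequence
df <- df %>% mutate(decade = case_when(
  year %in% 1980:1989 ~ "1980s",
  year %in% 1990:1999 ~ "1990s",
  year %in% 2000:2009 ~ "2000s",
  year %in% 2010:2019 ~ "2010s"
))
table(df$decade)
```

Any year that matches none of the conditions would receive NA, so a quick table() or anyNA() check on the new variable is a reasonable safeguard.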
The lines in case_when() are evaluated sequentially (like a series of “if, else if” statements), with the first condition that is TRUE determining the value, such that the above operation can be more succinctly coded as below. Also note in this example the resulting variable is numeric rather than categorical (simply as an example).
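One way to write this sequential version is sketched here, again with hypothetical years and a hypothetical variable name (decade2). The final TRUE ~ line is a catch-all for any year not matched by an earlier condition.

```r
library(dplyr)

set.seed(12345)  # arbitrary seed (assumption)

# Hypothetical years of capture
df <- data.frame(year = sample(1980:2019, 250, replace = TRUE))

# Conditions are checked top to bottom; the first TRUE one sets the value,
# so each line only needs the upper bound of its decade
df <- df %>% mutate(decade2 = case_when(
  year < 1990 ~ 1980,
  year < 2000 ~ 1990,
  year < 2010 ~ 2000,
  TRUE        ~ 2010   # everything else (2010 and later)
))
table(df$decade2)
```

The order of the lines matters in this form: if year < 2010 were listed first, every earlier year would also be assigned 2000.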
Footnotes
[^1] This may not be a concern with recent versions of plyr and dplyr. However, I have been bitten by enough problems when I have both of these packages loaded that I prefer to use the cautionary approach that I take in this post.
[^2] The FSA package imports mapvalues() from plyr and then exports it. Thus, if you have loaded the FSA package then you will not need to use the plyr:: idiom.
[^3] You may also want to consider cut() for this purpose or, for collapsing continuous data into categories, lencat() from the FSA package.
[^4] The colon operator creates a sequence of all integers between the two numbers separated by the colon. The %in% operator is used in conditional statements to determine if a value is contained within a vector, returning TRUE if it is and FALSE if it is not.