Do you ever have to recode many values at once? It’s a frequent chore when preparing data. For example, suppose we had to replace state abbreviations with the full names:
abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")
You could write several ifelse() statements.
ifelse(abbs == "AL", "Alabama", ifelse(abbs == "AK", "Alaska", ifelse(abbs == "AZ", "Arizona",
Actually, never mind! That gets out of hand very quickly.
case_when() is nice, especially when the replacement rules are more complex than 1-to-1 matching.
dplyr::case_when(
  # Syntax: logical test ~ value to use when test is TRUE
  abbs == "AL" ~ "Alabama",
  abbs == "AK" ~ "Alaska",
  abbs == "AZ" ~ "Arizona",
  abbs == "WI" ~ "Wisconsin",
  # a fallback/default value
  TRUE ~ "No match"
)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
We could also use one of my very favorite R tricks: character subsetting.
We create a named vector where the names are the data we have and the values are the data we want. I use the mnemonic old_value = new_value. In this case, we make a lookup table like so:
lookup <- c(
  # Syntax: name = value
  "AL" = "Alabama",
  "AK" = "Alaska",
  "AZ" = "Arizona",
  "WI" = "Wisconsin"
)
For example, subsetting with the string "AL" will retrieve the value with the name "AL".
lookup["AL"]
#>        AL
#> "Alabama"
With a vector of names, we can look up the values all at once.
lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA
If the names and the replacement values are stored in vectors, we can construct the lookup table programmatically using setNames(). In our case, the datasets package provides vectors with state names and state abbreviations.
full_lookup <- setNames(datasets::state.name, datasets::state.abb)
head(full_lookup)
#>           AL           AK           AZ           AR           CA
#>    "Alabama"     "Alaska"    "Arizona"   "Arkansas" "California"
#>           CO
#>   "Colorado"

full_lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA
One complication is that the character subsetting yields NA when the lookup table doesn’t have a matching name. That’s what’s happening above with the illegal abbreviation "WS". We can fix this by replacing the NA values with some default value.
matches <- full_lookup[abbs]
matches[is.na(matches)] <- "No match"
matches
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"  "No match"
Finally, to clean away any traces of the matching process, we can unname() the results.
unname(matches)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
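The three steps above (subset by name, fill in the NA values, drop the names) can be bundled into one small function. This is only a sketch: recode_lookup() and its default argument are hypothetical names for illustration, not functions from any package. It also assumes that the lookup table itself contains no NA values, because those would be overwritten by the default too.

```r
# Hypothetical helper (not part of the original post or any package):
# recode `x` through a named lookup vector, using `default` for values
# that have no matching name in `lookup`.
recode_lookup <- function(x, lookup, default = "No match") {
  # Character subsetting, with the names dropped right away
  matches <- unname(lookup[x])
  # Unmatched names come back as NA; swap in the default
  matches[is.na(matches)] <- default
  matches
}

abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")
full_lookup <- setNames(datasets::state.name, datasets::state.abb)

recode_lookup(abbs, full_lookup)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
```

Keeping the default as an argument makes it easy to pass NA_character_ instead when missing values should stay missing.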
Many-to-one lookup tables
By the way, the lookup tables can be many-to-one. That is, different names can retrieve the same value. For example, many-to-one matching lets us handle data that has synonymous names and differences in capitalization.
lookup <- c(
  "python" = "Python",
  "r" = "R",
  "node" = "Javascript",
  "js" = "Javascript",
  "javascript" = "Javascript"
)

languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")

# Use tolower() to normalize the language names so
# e.g., "R" and "r" can both match R
lookup[tolower(languages)]
#>           js           js         node            r       python
#> "Javascript" "Javascript" "Javascript"          "R"     "Python"
#>            r   javascript
#>          "R" "Javascript"
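A many-to-one table like this can also be built programmatically with setNames(), as with the state names earlier. The sketch below assumes the synonyms are kept in a named list, with one entry per canonical value. The synonyms object is a hypothetical structure for illustration, not something from the original example.

```r
# Hypothetical input: each canonical value paired with its accepted spellings
synonyms <- list(
  Python = "python",
  R = "r",
  Javascript = c("node", "js", "javascript")
)

# Repeat each canonical value once per synonym, then use the synonyms as names
lookup <- setNames(
  rep(names(synonyms), lengths(synonyms)),
  unlist(synonyms, use.names = FALSE)
)

languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")
recoded <- unname(lookup[tolower(languages)])
# `recoded` holds, in order:
# "Javascript" "Javascript" "Javascript" "R" "Python" "R" "Javascript"
```

Grouping the synonyms by canonical value keeps each replacement spelled out only once, which helps when the list of synonyms grows.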
Character by character string replacement
I’m motivated to write about character subsetting today because I used it in a Stack Overflow answer. Here is my paraphrasing of the problem.
Let’s say I have a long character string, and I’d like to use stringr::str_replace_all to replace certain letters with others. According to the documentation, str_replace_all can take a named vector and replaces the name with the value. That works fine for 1 replacement, but for multiple, it seems to do the replacements iteratively, so that one replacement can replace another one.

library(tidyverse)
text_string = "developer"

# This works fine
text_string %>% str_replace_all(c(e = "X"))
#> [1] "dXvXlopXr"

# But this is not what I want
text_string %>% str_replace_all(c(e = "p", p = "e"))
#> [1] "develoeer"

# Desired result would be "dpvploepr"