Do you ever have to recode many values at once? It’s a frequent chore when preparing data. For example, suppose we had to replace state abbreviations with the full names:
abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")
You could write several ifelse() statements.
ifelse(abbs == "AL", "Alabama",
ifelse(abbs == "AK", "Alaska",
ifelse(abbs == "AZ", "Arizona",
Actually, never mind! That gets out of hand very quickly.
case_when() is nice, especially when the replacement rules are more complex
than 1-to-1 matching.
dplyr::case_when(
  # Syntax: logical test ~ value to use when test is TRUE
  abbs == "AL" ~ "Alabama",
  abbs == "AK" ~ "Alaska",
  abbs == "AZ" ~ "Arizona",
  abbs == "WI" ~ "Wisconsin",
  # a fallback/default value
  TRUE ~ "No match"
)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
We could also use one of my very favorite R tricks: character subsetting.
We create a named vector where the names are the data we have and the values are
the data we want. I use the mnemonic old_value = new_value. In this case, we
make a lookup table like so:
lookup <- c(
  # Syntax: name = value
  "AL" = "Alabama",
  "AK" = "Alaska",
  "AZ" = "Arizona",
  "WI" = "Wisconsin")
For example, subsetting with the string "AL" will retrieve the value with the
name "AL".
lookup["AL"]
#>        AL
#> "Alabama"
With a vector of names, we can look up the values all at once.
lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA
If the names and the replacement values are stored in vectors, we can construct
the lookup table programmatically using setNames(). In our case, the datasets
package provides vectors with state names and state abbreviations.
full_lookup <- setNames(datasets::state.name, datasets::state.abb)
head(full_lookup)
#>           AL           AK           AZ           AR           CA
#>    "Alabama"     "Alaska"    "Arizona"   "Arkansas" "California"
#>           CO
#>   "Colorado"
full_lookup[abbs]
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"          NA
One complication is that the character subsetting yields NA when the
lookup table doesn’t have a matching name. That’s what’s happening above with
the illegal abbreviation "WS". We can fix this by replacing the NA
values with some default value.
matches <- full_lookup[abbs]
matches[is.na(matches)] <- "No match"
matches
#>          AL          AK          AZ          AZ          WI        <NA>
#>   "Alabama"    "Alaska"   "Arizona"   "Arizona" "Wisconsin"  "No match"
Finally, to clean away any traces of the matching process, we can unname() the
results.
unname(matches)
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
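The three steps above (look up, fill in a default, drop the names) can be bundled into one small function. This is only a sketch: recode_lookup() and its default argument are hypothetical names of my own, not something from base R or dplyr.

```r
# Hypothetical helper (name and arguments are illustrative only):
# recode a character vector using a named lookup vector.
recode_lookup <- function(x, lookup, default = NA_character_) {
  # Character subsetting: yields NA wherever no name matches
  matches <- unname(lookup[x])
  # Replace the unmatched entries with the fallback value
  matches[is.na(matches)] <- default
  matches
}

abbs <- c("AL", "AK", "AZ", "AZ", "WI", "WS")
full_lookup <- setNames(datasets::state.name, datasets::state.abb)

recode_lookup(abbs, full_lookup, default = "No match")
#> [1] "Alabama"   "Alaska"    "Arizona"   "Arizona"   "Wisconsin" "No match"
```

Leaving default at NA keeps the unmatched values visible as missing data, which is often safer than a placeholder string.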
Many-to-one lookup tables
By the way, the lookup tables can be many-to-one. That is, different names can retrieve the same value. For example, many-to-one matching lets us handle data that has synonymous names and inconsistent capitalization.
lookup <- c(
  "python" = "Python", "r" = "R", "node" = "Javascript",
  "js" = "Javascript", "javascript" = "Javascript")
languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")
# Use tolower() to normalize the language names so
# e.g., "R" and "r" can both match R
lookup[tolower(languages)]
#>           js           js         node            r       python
#> "Javascript" "Javascript" "Javascript"          "R"     "Python"
#>            r   javascript
#>          "R" "Javascript"
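When there are many synonyms, typing every name = value pair gets tedious. One possible approach, sketched here under the assumption that the synonyms are kept in a list keyed by the canonical name (the synonyms object is hypothetical), is to build the many-to-one table with setNames():

```r
# Hypothetical input: each canonical name and its accepted spellings
synonyms <- list(
  Python = "python",
  R = "r",
  Javascript = c("node", "js", "javascript")
)

# Repeat each canonical name once per synonym (the values),
# and use the synonyms themselves as the names
lookup <- setNames(
  rep(names(synonyms), lengths(synonyms)),
  unlist(synonyms, use.names = FALSE)
)

languages <- c("JS", "js", "Node", "R", "Python", "r", "JAvascript")
recoded <- unname(lookup[tolower(languages)])
```

Here recoded holds the same values as the hand-written table above, but adding a new spelling only means editing the list.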
Character by character string replacement
I’m motivated to write about character subsetting today because I used it in a Stack Overflow answer. Here is my paraphrasing of the problem.
Let’s say I have a long character string, and I’d like to use stringr::str_replace_all to replace certain letters with others. According to the documentation, str_replace_all can take a named vector and replaces the name with the value. That works fine for 1 replacement, but for multiple, it seems to do the replacements iteratively, so that one replacement can replace another one.
library(tidyverse)
text_string = "developer"

# This works fine
text_string %>%
  str_replace_all(c(e ="X"))
#> [1] "dXvXlopXr"

# But this is not what I want
text_string %>%
  str_replace_all(c(e ="p", p = "e"))
#> [1] "develoeer"

# Desired result would be "dpvploepr"
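To connect this back to lookup tables, here is a minimal base-R sketch of how character subsetting could produce the desired result. It assumes every replacement is a single character, and the variable names (swap, chars, swapped) are mine, not from the question.

```r
text_string <- "developer"

# Lookup table: old character = new character
swap <- c(e = "p", p = "e")

# Split the string into individual characters
chars <- strsplit(text_string, "")[[1]]

# Look up every character at once; characters without a rule come back NA
replaced <- unname(swap[chars])

# Keep the original character wherever there was no rule
no_rule <- is.na(replaced)
replaced[no_rule] <- chars[no_rule]

swapped <- paste0(replaced, collapse = "")
swapped
#> [1] "dpvploepr"
```

Because each character is looked up exactly once, one replacement cannot be undone by another. For simple one-character swaps like this, base R's chartr("ep", "pe", text_string) gives the same answer.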
