forcats::fct_match

Jonathan Carroll

3 years ago

[This article was first published on rstats on Irregularly Scheduled Programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This journey started almost exactly a year ago, but it’s finally been sufficiently worked through and merged! Yay, I’ve officially contributed to the tidyverse (minor as it may be).

I’m at least as useful as Zoidberg

It began with a tweet, recalling a surprise I encountered that day during some routine data processing

Source of today’s mild heart-attack: I have categories W, X_Y, and Z in some data. Intending to keep only the second two:

data %>% filter(g %in% c(“X Y”, “Z”)

Did you spot that I used a space instead of an underscore? I sure as heck didn’t, and filtered excessively to just Z.
— Jonathan Carroll (@carroll_jono) March 6, 2018

For those of you not so comfortable with pipes and dplyr, I was trying to subset a data.frame ‘data’ (with a column g having values "W", "X_Y" and "Z") to only those rows for which the column g had the value "X_Y" or "Z" (not the actual values, of course, but that’s the idea). Without dplyr this might simply be

data[data$g %in% c("X Y", "Z"), ]

To make that more concrete, let’s actually show it in action

data <- data.frame(a = 1:5, g = c("X_Y", "W", "Z", "Z", "W"))
data
##   a   g
## 1 1 X_Y
## 2 2   W
## 3 3   Z
## 4 4   Z
## 5 5   W
data %>% 
   filter(g %in% c("X Y", "Z"))
##   a g
## 1 3 Z
## 2 4 Z

filter isn’t at fault here – the same issue would arise with [ – I have mis-specified the values I wish to match, so I am returned only the matching values. %in% is also performing its job – it returns a logical vector; the result of comparing the values in the column g to the vector c("X Y", "Z"). Both of these functions are behaving as they should, but the logic of what I was trying to achieve (subset to only these values) was lost.

Now, in some instances, that is exactly the behaviour you want – subset this vector to any of these values… where those values may not be present in the vector to begin with

data %>% 
   filter(values %in% all_known_values)

The problem, for me, is that there isn’t a way to say “all of these should be there”. The lack of matching happens silently. If you make a typo, you don’t get that level, and you aren’t told that it’s been skipped

simpsons_characters %>% 
   filter(first_name %in% c("Homer", "Marge", "Bert", "Lisa", "Maggie")

Technically this is a double-post because I also want to sidenote this with something I am amazed I have not known about yet (I was approximately today years old when I learned about this)… I’ve used regexmatching for a while, and have been surprised at how well I’ve been able to make it work occasionally. I’m familiar with counting patterns ((A){2} to match two occurrences of A) and ranges of counts ((A){2,4} Sto match between two and four occurrences of A) but I was not aware that you can specify a number of mistakes that can be included to still make a match…;

grep("Bart", c("Bart", "Bort", "Brat"), value = TRUE)
## [1] "Bart"
grep("(Bart){~1}", c("Bart", "Bort", "Brat"), value = TRUE)
## [1] "Bart" "Bort"

(“Are you matching to me?”… “No, my regex also matches to ‘Bort’”)

Use (pattern){~n}to allow up to nsubstitutions in the pattern matching. Refer here and here.

Back to the original problem – filterand %in%are doing their jobs, but we aren’t getting the result we want because we made a typo, and we aren’t told that we’ve done so.

Enter a new PR to forcats (originally to dplyr, but forcats does make more sense) which implements fct_match(f, lvls). This checks that all of the values in lvls are actually present in f before returning the logical vector of which entries they correspond to. With this, the pattern becomes (after loading the development version of forcats from github)

data %>% 
   filter(fct_match(g, c("X Y", "Z")))
## Error: Levels not present in factor: "X Y"

Yay! We’re notified that we’ve made an error. "X Y" isn’t actually in our column g. If we don’t make the error, we get the result we actually wanted in the first place. We can now use this successfully

data %>% 
   filter(fct_match(g, c("X_Y", "Z")))
##   a   g
## 1 1 X_Y
## 2 3   Z
## 3 4   Z

It took a while for the PR to be addressed (the tidyverse crew have plenty of backlog, no doubt) but after some minor requested changes and a very neat cleanup by Hadley himself, it’s been merged.

My original version had a few bells and whistles that the current implementation has put aside. The first was inverting the matching with fct_exclude to make it easier to negate the matching without having to create a new anonymous function, i.e. ~!fct_match(.x). I find this particularly useful since a pipe expects a call/named function, not a lambda/anonymous function, which is actually quite painful to construct

data %>%
   pull(g) %>%
   (function(x) !fct_match(x, c("X_Y", "Z")))
## [1] FALSE  TRUE FALSE FALSE  TRUE

whereas if we defined

fct_exclude <- function(f, lvls, ...) !fct_match(f, lvls, ...)

we can use

data %>%
   pull(g) %>%
   fct_exclude(c("X_Y", "Z"))
## [1] FALSE  TRUE FALSE FALSE  TRUE

The other was specifying whether or not to include missing levels when considering if lvls is a valid value in f since unique(f) and levels(f) can return different answers.

The cleanup really made me think about how much ‘fluff’ some of my code can have. Sure, it’s nice to encapsulate some logic in a small additional function, but sometimes you can actually replace all of that with a one-liner and not need all that. If you’re ever in the mood to see how compact internal code can really be, check out the source of forcats.

Hopefully this pattern of filter(fct_match(f, lvls)) is useful to others. It’s certainly going to save me overlooking some typos.

< details>

< summary> devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  dplyr       * 0.8.3   2019-07-04 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  forcats     * 0.4.0   2019-02-17 [1] CRAN (R 3.5.1)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)                   
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  pillar        1.4.2   2019-06-29 [1] CRAN (R 3.5.2)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  purrr         0.3.2   2019-03-15 [1] CRAN (R 3.5.2)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  tibble        2.1.3   2019-06-06 [1] CRAN (R 3.5.2)                   
##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

To leave a comment for the author, please follow the link and comment on their blog: rstats on Irregularly Scheduled Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.