Locating parts of a string with `stringr`
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I was wandering the realms of Stack Overflow answering some questions when I encountered one that asked how to extract parts of a string based on a regex. I thought I knew how to do this with the `stringr` package using, for example, `str_sub`, but I found it a bit difficult to map how `str_locate` complements `str_sub`.
`str_locate` and `str_locate_all` give back the locations of your regex inside the desired string as a matrix or a list of matrices, respectively. That didn’t look very intuitive to pass on to `str_sub`, which (I thought) only accepted numeric vectors with the start and end positions of the parts of the strings that you want to extract. To my surprise, `str_sub` accepts not only numeric vectors but also a matrix with two columns, which is precisely the result of `str_locate`.
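Here is a minimal sketch of that hand-off. The toy vector `x` and the digit pattern are my own illustration, not part of the original question:

```r
library(stringr)

# Toy input: two strings with digits, one without
x <- c("abc123def", "no digits", "x45")

# One row per string, with columns "start" and "end" (NA when nothing matches)
loc <- str_locate(x, "[0-9]+")

# Pass the two columns separately...
by_columns <- str_sub(x, loc[, "start"], loc[, "end"])

# ...or hand the whole two-column matrix to `start`
by_matrix <- str_sub(x, loc)
```

Both calls should return the same character vector, with `NA` for the string that has no match.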
Let’s create a set of random strings from which we want to extract the word `special*word`, where `*` represents a random number.
    library(stringr)

    test_string <- replicate(
      100,
      paste0(
        sample(c(letters, LETTERS, paste0("special", sample(1:10, 1), "word")), 15),
        collapse = ""
      )
    )

    head(test_string)
    ## [1] "pZTQHcDVObnaCFS"            "qBxfbIHjauyEmgspecial10word"
    ## [3] "TKgbmQAEFoJHOVh"            "VoBdUAuzfPrmCGX"
    ## [5] "dGgJOspecial5wordiFpbvXzUD" "WOfLjNospecial4wordEeGkyTA"
`str_locate` returns a matrix with the start and end positions of the first match in each string, and `NA` where there is no match. The function is vectorised, so we get one row per element of `test_string`.
    location_matrix <- str_locate(test_string, pattern = "special[0-9]word")

    head(location_matrix)
    ##      start end
    ## [1,]    NA  NA
    ## [2,]    NA  NA
    ## [3,]    NA  NA
    ## [4,]    NA  NA
    ## [5,]     6  17
    ## [6,]     8  19
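One detail worth noticing in the output: the second string contains `special10word`, yet its row is `NA`. That is because `[0-9]` matches exactly one digit, while `sample(1:10, 1)` can also produce `10`. The sketch below, on a two-element vector I made up for illustration, compares the original pattern with a `[0-9]+` variant (my suggestion, not what the rest of the post uses):

```r
library(stringr)

# Hypothetical inputs: a one-digit and a two-digit variant of the word
x <- c("special5word", "special10word")

# Original pattern: exactly one digit between "special" and "word"
single_digit <- str_locate(x, "special[0-9]word")

# Variant: one or more digits
any_digits <- str_locate(x, "special[0-9]+word")
```

With the single-digit pattern the second row is `NA`; with `[0-9]+` both rows have a match. The outputs in the rest of the post were produced with the single-digit pattern.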
It isn’t needed for this example, since each string contains at most one match, but I was also interested in checking how the result of `str_locate_all` would fit in this workflow. `str_locate_all` is the same as `str_locate`, but since it can find more than one match per string, it returns a list with one slot per string in `test_string`, each holding a matrix with the indices of the matches. Strings in `test_string` that don’t contain `special*word` get a matrix with zero rows, so we need to fill those slots with `NA` before binding everything together:
    location_list <-
      str_locate_all(test_string, pattern = "special[0-9]word") %>%
      # Strings without a match yield a zero-row matrix, for which
      # all(is.na(.x)) is TRUE; swap those for a single row of NAs
      lapply(function(.x) if (all(is.na(.x))) matrix(c(NA, NA), ncol = 2) else .x) %>%
      {do.call(rbind, .)}

    head(location_list)
    ##      start end
    ## [1,]    NA  NA
    ## [2,]    NA  NA
    ## [3,]    NA  NA
    ## [4,]    NA  NA
    ## [5,]     6  17
    ## [6,]     8  19
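The `rbind` step above only lines up with `test_string` because every string has at most one match. If a string matched twice, the bound matrix would have more rows than there are strings. A rough sketch of one way around that is to keep the list and call `str_sub` once per string. The vector `x` and the helper name `extract_by_location` are hypothetical, made up for this illustration:

```r
library(stringr)

# Hypothetical input: two matches, no match, one match
x <- c("special1word and special2word", "nothing here", "xspecial7wordy")
pattern <- "special[0-9]+word"

# A list with one start/end matrix per string (zero rows when nothing matches)
locs <- str_locate_all(x, pattern)

# Hypothetical helper: extract every located piece from a single string
extract_by_location <- function(string, loc) {
  # Handle the no-match case explicitly instead of relying on recycling rules
  if (nrow(loc) == 0) return(character(0))
  str_sub(string, loc[, "start"], loc[, "end"])
}

# Walk over strings and their location matrices in parallel
pieces <- unname(Map(extract_by_location, x, locs))
```

The result is a list with one character vector per input string, which should agree with `str_extract_all(x, pattern)`.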
Now that we have everything ready, `str_sub` can give us the desired results using either the numeric vectors or the entire matrix:
    # Using numeric vectors from str_locate
    str_sub(test_string, location_matrix[, 1], location_matrix[, 2])
    ##  [1] NA             NA             NA             NA             "special5word"
    ##  [6] "special4word" NA             NA             "special5word" NA
    ## [11] NA             NA             NA             NA             NA
    ## [16] NA             NA             NA             NA             NA
    ## [21] NA             NA             NA             "special5word" "special6word"
    ## [26] NA             NA             NA             NA             NA
    ## [31] "special4word" NA             NA             NA             NA
    ## [36] NA             NA             NA             "special7word" NA
    ## [41] NA             NA             NA             NA             NA
    ## [46] NA             NA             NA             NA             NA
    ## [51] NA             NA             NA             NA             NA
    ## [56] NA             NA             NA             NA             NA
    ## [61] NA             NA             "special4word" NA             NA
    ## [66] NA             NA             NA             NA             NA
    ## [71] NA             NA             NA             "special7word" "special9word"
    ## [76] NA             NA             NA             NA             NA
    ## [81] "special4word" NA             NA             "special5word" NA
    ## [86] NA             NA             NA             "special9word" "special9word"
    ## [91] NA             NA             NA             NA             NA
    ## [96] "special6word" NA             NA             "special3word" "special1word"

    # Using numeric vectors from str_locate_all
    str_sub(test_string, location_list[, 1], location_list[, 2])
    ##  [1] NA             NA             NA             NA             "special5word"
    ##  [6] "special4word" NA             NA             "special5word" NA
    ## [11] NA             NA             NA             NA             NA
    ## [16] NA             NA             NA             NA             NA
    ## [21] NA             NA             NA             "special5word" "special6word"
    ## [26] NA             NA             NA             NA             NA
    ## [31] "special4word" NA             NA             NA             NA
    ## [36] NA             NA             NA             "special7word" NA
    ## [41] NA             NA             NA             NA             NA
    ## [46] NA             NA             NA             NA             NA
    ## [51] NA             NA             NA             NA             NA
    ## [56] NA             NA             NA             NA             NA
    ## [61] NA             NA             "special4word" NA             NA
    ## [66] NA             NA             NA             NA             NA
    ## [71] NA             NA             NA             "special7word" "special9word"
    ## [76] NA             NA             NA             NA             NA
    ## [81] "special4word" NA             NA             "special5word" NA
    ## [86] NA             NA             NA             "special9word" "special9word"
    ## [91] NA             NA             NA             NA             NA
    ## [96] "special6word" NA             NA             "special3word" "special1word"

    # Using the entire matrix
    str_sub(test_string, location_matrix)
    ##  [1] NA             NA             NA             NA             "special5word"
    ##  [6] "special4word" NA             NA             "special5word" NA
    ## [11] NA             NA             NA             NA             NA
    ## [16] NA             NA             NA             NA             NA
    ## [21] NA             NA             NA             "special5word" "special6word"
    ## [26] NA             NA             NA             NA             NA
    ## [31] "special4word" NA             NA             NA             NA
    ## [36] NA             NA             NA             "special7word" NA
    ## [41] NA             NA             NA             NA             NA
    ## [46] NA             NA             NA             NA             NA
    ## [51] NA             NA             NA             NA             NA
    ## [56] NA             NA             NA             NA             NA
    ## [61] NA             NA             "special4word" NA             NA
    ## [66] NA             NA             NA             NA             NA
    ## [71] NA             NA             NA             "special7word" "special9word"
    ## [76] NA             NA             NA             NA             NA
    ## [81] "special4word" NA             NA             "special5word" NA
    ## [86] NA             NA             NA             "special9word" "special9word"
    ## [91] NA             NA             NA             NA             NA
    ## [96] "special6word" NA             NA             "special3word" "special1word"
A much easier approach to doing the above (which is cumbersome and verbose) is to use `str_extract`:
    str_extract(test_string, "special[0-9]word")
    ##  [1] NA             NA             NA             NA             "special5word"
    ##  [6] "special4word" NA             NA             "special5word" NA
    ## [11] NA             NA             NA             NA             NA
    ## [16] NA             NA             NA             NA             NA
    ## [21] NA             NA             NA             "special5word" "special6word"
    ## [26] NA             NA             NA             NA             NA
    ## [31] "special4word" NA             NA             NA             NA
    ## [36] NA             NA             NA             "special7word" NA
    ## [41] NA             NA             NA             NA             NA
    ## [46] NA             NA             NA             NA             NA
    ## [51] NA             NA             NA             NA             NA
    ## [56] NA             NA             NA             NA             NA
    ## [61] NA             NA             "special4word" NA             NA
    ## [66] NA             NA             NA             NA             NA
    ## [71] NA             NA             NA             "special7word" "special9word"
    ## [76] NA             NA             NA             NA             NA
    ## [81] "special4word" NA             NA             "special5word" NA
    ## [86] NA             NA             NA             "special9word" "special9word"
    ## [91] NA             NA             NA             NA             NA
    ## [96] "special6word" NA             NA             "special3word" "special1word"
However, the whole objective behind this exercise was to clearly map out how to connect `str_locate` to `str_sub`, and it’s much clearer if you can pass the entire matrix. Converting the result of `str_locate_all` is still a bit tricky.