Faster, easier, and more reliable character string processing with stringi 0.3-1

Marek Gągolewski

7 years ago

[This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers].

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

# install.packages("stringi") or update.packages()
library("stringi")

stringi is an R package providing (but definitely not limited to) equivalents of nearly all the character string processing functions known from base R. While developing the package, we had high performance and portability of its facilities in mind.

We implemented each string processing function from scratch. Internationalization and globalization support, as well as many string processing facilities (like regex searching), are provided by IBM’s well-known ICU4C library.

Here is a very general list of the most important features available in the current version of stringi:

- string searching:
    - with ICU (Java-like) regular expressions,
    - ICU USearch-based locale-aware string searching (quite slow, but working properly e.g. for non-Unicode normalized strings),
    - very fast, locale-independent byte-wise pattern matching;
- joining and duplicating strings;
- extracting and replacing substrings;
- string trimming, padding, and text wrapping (e.g. with Knuth’s dynamic word wrap algorithm);
- text transliteration;
- text collation (comparing, sorting);
- text boundary analysis (e.g. for extracting individual words);
- random string generation;
- Unicode normalization;
- character encoding conversion and detection;

and many more.
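
To give a feel for a few of the listed facilities, here is a minimal sketch using some basic functions for joining, duplicating, extracting, trimming, and padding. The input strings and variable names are made up for illustration, and the arguments to stri_pad_left are passed positionally so that the call does not depend on their names.

```r
library("stringi")

joined  <- stri_join("string", "i", sep = "")  # joining strings
dupped  <- stri_dup("ab", 3)                   # duplicating a string 3 times
piece   <- stri_sub("hello world", 1, 5)       # extracting code points 1 to 5
trimmed <- stri_trim_both("  padded  ")        # trimming white space on both sides
padded  <- stri_pad_left("7", 3, "0")          # left-padding with "0" up to width 3

print(c(joined, dupped, piece, trimmed, padded))
```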

Here’s a list of changes in version 0.3-1:

(IMPORTANT CHANGE) #87: %>% overlapped with the pipe operator from the magrittr package; now each operator like %>% has been renamed %s>%.
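
As a small illustration of the renamed operators, the sketch below uses %s+% for concatenation together with two of the comparison operators. The example strings and variable names are arbitrary, and the comparisons rely on the default collator.

```r
library("stringi")

greeting <- "string" %s+% "i"    # concatenation
is_less  <- "abc" %s<% "abd"     # collation-based "less than"
is_equal <- "abc" %s==% "abc"    # collation-based equality

print(greeting)
print(c(is_less, is_equal))
```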
(IMPORTANT CHANGE) #108: Now the BreakIterator (for text boundary analysis) may be better controlled via stri_opts_brkiter() (see options type and locale which aim to replace now-removed boundary and locale parameters to stri_locate_boundaries, stri_split_boundaries, stri_trans_totitle, stri_extract_words, stri_locate_words).

For example:

test <- "The\u00a0above-mentioned    features are very useful. Warm thanks to their developers. 123 456 789"
stri_split_boundaries(test, stri_opts_brkiter(type="word", skip_word_none=TRUE, skip_word_number=TRUE)) # cf. stri_extract_words
## [[1]]
##  [1] "The"        "above"      "mentioned"  "features"   "are"       
##  [6] "very"       "useful"     "Warm"       "thanks"     "to"        
## [11] "their"      "developers"
stri_split_boundaries(test, stri_opts_brkiter(type="sentence")) # extract sentences
## [[1]]
## [1] "The above-mentioned    features are very useful. "
## [2] "Warm thanks to their developers. "                
## [3] "123 456 789"
stri_split_boundaries(test, stri_opts_brkiter(type="character")) # extract characters
## [[1]]
##  [1] "T" "h" "e" " " "a" "b" "o" "v" "e" "-" "m" "e" "n" "t" "i" "o" "n"
## [18] "e" "d" " " " " " " " " "f" "e" "a" "t" "u" "r" "e" "s" " " "a" "r"
## [35] "e" " " "v" "e" "r" "y" " " "u" "s" "e" "f" "u" "l" "." " " "W" "a"
## [52] "r" "m" " " "t" "h" "a" "n" "k" "s" " " "t" "o" " " "t" "h" "e" "i"
## [69] "r" " " "d" "e" "v" "e" "l" "o" "p" "e" "r" "s" "." " " "1" "2" "3"
## [86] " " "4" "5" "6" " " "7" "8" "9"

By the way, the last call also works correctly for strings not in the Unicode Normalization Form C:

stri_split_boundaries(stri_trans_nfkd("zażółć gęślą jaźń"), stri_opts_brkiter(type="character"))
## [[1]]
##  [1] "z" "a" "ż"  "ó"  "ł" "ć"  " " "g" "ę"  "ś"  "l" "ą"  " " "j" "a" "ź"  "ń"
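
To see why this matters, the following sketch compares the number of code points with the number of character boundaries for a single decomposed letter. It assumes that the break iterator options are accepted through an argument named opts_brkiter; the variable names are ours.

```r
library("stringi")

composed   <- "\u017c"                    # "ż" as a single precomposed code point
decomposed <- stri_trans_nfkd(composed)   # "z" followed by a combining dot above

n_composed   <- stri_length(composed)     # number of code points
n_decomposed <- stri_length(decomposed)

# A character break iterator treats the base letter plus
# its combining mark as one user-perceived character.
chunks <- stri_split_boundaries(decomposed,
    opts_brkiter = stri_opts_brkiter(type = "character"))[[1]]

print(c(n_composed, n_decomposed, length(chunks)))
```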

(NEW FUNCTIONS) #109: stri_count_boundaries and stri_count_words count the number of text boundaries in a string.

stri_count_words("Have a nice day!")
## [1] 4

(NEW FUNCTIONS) #41: stri_startswith_* and stri_endswith_* determine whether a string starts or ends with a given pattern.

stri_startswith_fixed(c("a1o", "a2g", "b3a", "a4e", "c5a"), "a")
## [1]  TRUE  TRUE FALSE  TRUE FALSE
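
The stri_endswith_* counterpart works analogously. A short sketch on the same input vector (the variable names are ours):

```r
library("stringi")

words <- c("a1o", "a2g", "b3a", "a4e", "c5a")
ends_with_a <- stri_endswith_fixed(words, "a")  # does each string end with "a"?
print(ends_with_a)
```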

(NEW FEATURE) #102: stri_replace_all_* gained a vectorize_all parameter, which defaults to TRUE for backward compatibility.

stri_replace_all_fixed("The quick brown fox jumped over the lazy dog.",
     c("quick", "brown", "fox"), c("slow",  "black", "bear"), vectorize_all=FALSE)
## [1] "The slow black bear jumped over the lazy dog."
# Compare the results:
stri_replace_all_fixed("The quicker brown fox jumped over the lazy dog.",
     c("quick", "brown", "fox"), c("slow",  "black", "bear"), vectorize_all=FALSE)
## [1] "The slower black bear jumped over the lazy dog."
stri_replace_all_regex("The quicker brown fox jumped over the lazy dog.",
     "\\b"%s+%c("quick", "brown", "fox")%s+%"\\b", c("slow",  "black", "bear"), vectorize_all=FALSE)
## [1] "The quicker black bear jumped over the lazy dog."
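
For contrast, here is a minimal sketch of what the default vectorize_all=TRUE does with the same kind of input: the arguments are recycled and each pattern/replacement pair is applied to its own copy of the string, whereas vectorize_all=FALSE applies all the pairs one after another to every string. The toy strings are made up for illustration.

```r
library("stringi")

# Default: element-wise, one result per pattern/replacement pair.
elementwise <- stri_replace_all_fixed("abc", c("a", "b"), c("x", "y"))

# vectorize_all=FALSE: all replacements applied in turn to each string.
sequential <- stri_replace_all_fixed("abc", c("a", "b"), c("x", "y"),
    vectorize_all = FALSE)

print(elementwise)
print(sequential)
```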

(NEW FUNCTIONS) #91: stri_subset_*, a convenient and more efficient substitute for str[stri_detect_*(str, ...)], added.

stri_subset_regex(c("john@office.company.com", "steve1932@g00gl3.eu", "no email here"),
   "^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\\.)+[A-Za-z]{2,4}$")
## [1] "john@office.company.com" "steve1932@g00gl3.eu"
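
The equivalence with the detect-then-index idiom can be checked on a small vector. This is only a sketch with made-up data and no missing values:

```r
library("stringi")

pkgs <- c("stringi", "base R", "stringr")

via_subset <- stri_subset_fixed(pkgs, "string")
via_detect <- pkgs[stri_detect_fixed(pkgs, "string")]

print(via_subset)
print(identical(via_subset, via_detect))
```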

(NEW FEATURE) #100: stri_split_fixed, stri_split_charclass, stri_split_regex, stri_split_coll gained a tokens_only parameter, which defaults to FALSE for backward compatibility.

stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=1, tokens_only=TRUE, omit_empty=TRUE)
## [[1]]
## [1] "ab"
## 
## [[2]]
## [1] "d"
## 
## [[3]]
## [1] "h"
## 
## [[4]]
## character(0)
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=2, tokens_only=TRUE, omit_empty=TRUE)
## [[1]]
## [1] "ab" "c" 
## 
## [[2]]
## [1] "d"  "ef"
## 
## [[3]]
## [1] "h"
## 
## [[4]]
## character(0)
stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE)
## [[1]]
## [1] "ab" "c" 
## 
## [[2]]
## [1] "d"  "ef" "g" 
## 
## [[3]]
## [1] "h"
## 
## [[4]]
## character(0)
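
To make the effect of tokens_only more visible, the sketch below runs the same split with and without it. The limit on the number of tokens is passed positionally here so that the call does not rely on that argument's name; the input string is made up.

```r
library("stringi")

# Without tokens_only, the last element holds the unsplit remainder.
with_rest <- stri_split_fixed("a_b_c", "_", 2)[[1]]

# With tokens_only=TRUE, only the first 2 tokens are returned.
tokens <- stri_split_fixed("a_b_c", "_", 2, tokens_only = TRUE)[[1]]

print(with_rest)
print(tokens)
```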

(NEW FUNCTION) #105: stri_list2matrix converts lists of atomic vectors to character matrices, useful in connection with stri_split and stri_extract.

stri_list2matrix(stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE))
##      [,1] [,2] [,3] [,4]
## [1,] "ab" "d"  "h"  NA  
## [2,] "c"  "ef" NA   NA  
## [3,] NA   "g"  NA   NA
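
By default each list element becomes a column, as above. If one row per list element is preferred, the byrow argument can be used. A minimal sketch on a hand-made list:

```r
library("stringi")

# Each list element becomes one row; shorter vectors are filled with NA.
m <- stri_list2matrix(list(c("a", "b"), "c"), byrow = TRUE)
print(m)
```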

(NEW FEATURE) #107: stri_split_* now accept omit_empty=NA, in which case empty tokens are replaced with NA.

stri_split_fixed("a_b_c__d", "_", omit_empty=FALSE)
## [[1]]
## [1] "a" "b" "c" ""  "d"
stri_split_fixed("a_b_c__d", "_", omit_empty=TRUE)
## [[1]]
## [1] "a" "b" "c" "d"
stri_split_fixed("a_b_c__d", "_", omit_empty=NA)
## [[1]]
## [1] "a" "b" "c" NA  "d"

(NEW FEATURE) #106: stri_split and stri_extract_all gained a simplify argument (if TRUE, then stri_list2matrix(..., byrow=TRUE) is called on the resulting list).

stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "ab" "c"  NA  
## [2,] "d"  "ef" "g" 
## [3,] "h"  NA   NA  
## [4,] NA   NA   NA
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=FALSE, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "ab" "c"  NA  
## [2,] "d"  "ef" "g" 
## [3,] ""   "h"  NA  
## [4,] ""   NA   NA
stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=NA, simplify=TRUE)
##      [,1] [,2] [,3]
## [1,] "ab" "c"  NA  
## [2,] "d"  "ef" "g" 
## [3,] NA   "h"  NA  
## [4,] NA   NA   NA

(NEW FUNCTION) #77: stri_rand_lipsum generates (pseudo)random dummy lorem ipsum text.

cat(sapply(
   stri_wrap(stri_rand_lipsum(3), 80, simplify=FALSE),
   stri_flatten, collapse="\n"), sep="\n\n")
## Lorem ipsum dolor sit amet, eu turpis pellentesque est, lectus, vestibulum.
## Iaculis et nam ad eu morbi, ultrices enim pellentesque est fusce. Etiam
## ipsum varius, maecenas dapibus. Netus molestie non adipiscing netus,
## aptent sed malesuada, placerat suscipit. A, sed eu luctus imperdiet odio
## tempor. In velit ut vel feugiat felis eros risus. Sed sapien, facilisis
## ullamcorper, senectus efficitur sit id sociis sed purus. Ipsum, a, blandit
## faucibus. In vivamus, duis et sed sollicitudin maximus. Sodales magnis
## ac senectus facilisis, dolor faucibus a. Cursus in cum, cubilia egestas
## ut platea turpis. Maximus sit vel cursus nec in vel, eu, lacinia in ut.
## 
## Libero maximus potenti penatibus amet nisl non ut. Commodo nullam rhoncus,
## bibendum quisque sem aliquam sed, quam enim et, sed. Lacinia netus inceptos
## sapien nostra tincidunt facilisis montes nascetur non pharetra convallis
## id. Netus diam nulla montes nec tincidunt facilisis eros porttitor nisl urna
## cubilia. Aliquet egestas mus nisl, nisi vehicula, ac mauris rutrum, felis
## aenean tristique magna. Ante maecenas phasellus id class. Finibus iaculis purus
## volutpat posuere phasellus magna class blandit augue morbi torquent. Taciti
## ullamcorper venenatis at nulla eget auctor ante neque metus sed metus. Dolor,
## platea sit sed pellentesque ipsum. Dapibus sed nisi vestibulum ex integer.
## 
## Duis iaculis sapien habitasse, facilisi habitasse leo nam. Egestas,
## libero tempor purus in. Aliquam himenaeos conubia egestas cum vestibulum
## nec. Sociosqu mauris cum mus non lobortis eu et dapibus vel integer.
## Blandit quis inceptos cursus vel pellentesque lectus amet egestas.
## Pharetra ac eros nisi. Finibus nec, ac congue in molestie sed.
## Tincidunt faucibus a interdum facilisis, sed nulla, tortor, felis,
## sociis. Sem porttitor himenaeos pharetra nec eu torquent elementum.

(NEW FEATURE) #98: stri_trans_totitle gained an opts_brkiter parameter; it indicates which ICU BreakIterator should be used when performing case mapping.

stri_trans_totitle("GOOD-OLD cOOkiE mOnSTeR IS watCHinG You. Here HE comes!",
    stri_opts_brkiter(type="word")) # default boundary
## [1] "Good-Old Cookie Monster Is Watching You. Here He Comes!"
stri_trans_totitle("GOOD-OLD cOOkiE mOnSTeR IS watCHinG You. Here HE comes!",
    stri_opts_brkiter(type="sentence"))
## [1] "Good-old cookie monster is watching you. Here he comes!"

(NEW FEATURE) stri_wrap gained a new parameter: normalize.
(BUGFIX) #86: stri_*_fixed, stri_*_coll, and stri_*_regex could give incorrect results if one of the search strings was of length 0.
(BUGFIX) #99: stri_replace_all did not use the replacement arg.
(BUGFIX) #94: R CMD check should no longer fail if icudt download failed.
(BUGFIX) #112: Some of the objects were not PROTECTed from being garbage collected, which might have caused spontaneous SEGFAULTS.
(BUGFIX) Some collator’s options were not passed correctly to ICU services.
(BUGFIX) The causes of memory leaks detected by valgrind --tool=memcheck --leak-check=full have been removed.
(DOCUMENTATION) Significant extensions and cleanups in the stringi manual.

Check it out yourself. In particular, take a glimpse at stringi-search-regex, stringi-search-charclass and, more generally, at stringi-search.

Enjoy! Any comments and suggestions are welcome.
