Faster, easier, and more reliable character string processing with stringi 0.3-1
A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).
    # install.packages("stringi") or update.packages()
    library("stringi")
stringi is an R package providing (but not limited to) equivalents of nearly all the character string processing functions known from base R. While developing the package, we kept high performance and portability in mind.
We implemented each string processing function from scratch. Internationalization and globalization support, as well as many string processing facilities (like regex searching), are provided by IBM’s well-known ICU4C library.
Here is a very general list of the most important features available in the current version of stringi:
- string searching:
    - with ICU (Java-like) regular expressions,
    - ICU USearch-based locale-aware string searching (quite slow, but working properly e.g. for non-Unicode normalized strings),
    - very fast, locale-independent byte-wise pattern matching;
- joining and duplicating strings;
- extracting and replacing substrings;
- string trimming, padding, and text wrapping (e.g. with Knuth’s dynamic word wrap algorithm);
- text transliteration;
- text collation (comparing, sorting);
- text boundary analysis (e.g. for extracting individual words);
- random string generation;
- Unicode normalization;
- character encoding conversion and detection;
and many more.
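As a rough illustration of a few of the features listed above, here is a minimal sketch. The input strings are made up for this example, and only a handful of basic functions are shown; it is not meant as a tour of the whole package.

```r
library("stringi")

# Hypothetical inputs, made up for illustration
x <- c("  stringi ", "ICU", "R")

# trimming whitespace from both ends
trimmed <- stri_trim_both(x)                    # "stringi" "ICU" "R"

# joining and duplicating strings
joined <- stri_join(trimmed, collapse = "+")    # "stringi+ICU+R"
dupl <- stri_dup("ab", 3)                       # "ababab"

# extracting a substring (characters 1 to 3)
prefix <- stri_sub("stringi", 1, 3)             # "str"

# searching with an ICU regular expression
all_caps <- stri_detect_regex(trimmed, "^[A-Z]+$")   # FALSE TRUE TRUE
```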
Here’s a list of changes in version 0.3-1:
- (IMPORTANT CHANGE) #87: %>% overlapped with the pipe operator from the magrittr package; now each operator like %>% has been renamed %s>%.
- (IMPORTANT CHANGE) #108: Now the BreakIterator (for text boundary analysis) may be better controlled via stri_opts_brkiter() (see options type and locale, which aim to replace the now-removed boundary and locale parameters to stri_locate_boundaries, stri_split_boundaries, stri_trans_totitle, stri_extract_words, and stri_locate_words). For example:
    test <- "The\u00a0above-mentioned    features are very useful. Warm thanks to their developers. 123 456 789"
    stri_split_boundaries(test, stri_opts_brkiter(type="word", skip_word_none=TRUE, skip_word_number=TRUE)) # cf. stri_extract_words
    ## [[1]]
    ##  [1] "The"        "above"      "mentioned"  "features"   "are"
    ##  [6] "very"       "useful"     "Warm"       "thanks"     "to"
    ## [11] "their"      "developers"
    stri_split_boundaries(test, stri_opts_brkiter(type="sentence")) # extract sentences
    ## [[1]]
    ## [1] "The above-mentioned    features are very useful. "
    ## [2] "Warm thanks to their developers. "
    ## [3] "123 456 789"
    stri_split_boundaries(test, stri_opts_brkiter(type="character")) # extract characters
    ## [[1]]
    ##  [1] "T" "h" "e" " " "a" "b" "o" "v" "e" "-" "m" "e" "n" "t" "i" "o" "n"
    ## [18] "e" "d" " " " " " " " " "f" "e" "a" "t" "u" "r" "e" "s" " " "a" "r"
    ## [35] "e" " " "v" "e" "r" "y" " " "u" "s" "e" "f" "u" "l" "." " " "W" "a"
    ## [52] "r" "m" " " "t" "h" "a" "n" "k" "s" " " "t" "o" " " "t" "h" "e" "i"
    ## [69] "r" " " "d" "e" "v" "e" "l" "o" "p" "e" "r" "s" "." " " "1" "2" "3"
    ## [86] " " "4" "5" "6" " " "7" "8" "9"
By the way, the last call also works correctly for strings not in the Unicode Normalization Form C:
    stri_split_boundaries(stri_trans_nfkd("zażółć gęślą jaźń"), stri_opts_brkiter(type="character"))
    ## [[1]]
    ##  [1] "z" "a" "ż" "ó" "ł" "ć" " " "g" "ę" "ś" "l" "ą" " " "j" "a" "ź" "ń"
- (NEW FUNCTIONS) #109: stri_count_boundaries and stri_count_words count the number of text boundaries in a string.
    stri_count_words("Have a nice day!")
    ## [1] 4
- (NEW FUNCTIONS) #41: stri_startswith_* and stri_endswith_* determine whether a string starts or ends with a given pattern.
    stri_startswith_fixed(c("a1o", "a2g", "b3a", "a4e", "c5a"), "a")
    ## [1]  TRUE  TRUE FALSE  TRUE FALSE
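The counterpart for string endings behaves analogously. A minimal sketch, reusing the same input vector as in the example above:

```r
library("stringi")

x <- c("a1o", "a2g", "b3a", "a4e", "c5a")

# TRUE for the elements whose last character is "a"
ends_with_a <- stri_endswith_fixed(x, "a")
ends_with_a
## [1] FALSE FALSE  TRUE FALSE  TRUE
```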
- (NEW FEATURE) #102: stri_replace_all_* gained a vectorize_all parameter, which defaults to TRUE for backward compatibility.
    stri_replace_all_fixed("The quick brown fox jumped over the lazy dog.",
       c("quick", "brown", "fox"), c("slow", "black", "bear"), vectorize_all=FALSE)
    ## [1] "The slow black bear jumped over the lazy dog."
    # Compare the results:
    stri_replace_all_fixed("The quicker brown fox jumped over the lazy dog.",
       c("quick", "brown", "fox"), c("slow", "black", "bear"), vectorize_all=FALSE)
    ## [1] "The slower black bear jumped over the lazy dog."
    stri_replace_all_regex("The quicker brown fox jumped over the lazy dog.",
       "\\b"%s+%c("quick", "brown", "fox")%s+%"\\b", c("slow", "black", "bear"), vectorize_all=FALSE)
    ## [1] "The quicker black bear jumped over the lazy dog."
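To make the difference between the two settings more tangible, here is a small sketch with made-up inputs. With the default vectorize_all=TRUE, patterns and replacements are recycled against the input strings element by element; with vectorize_all=FALSE, every pattern is applied in turn to every string.

```r
library("stringi")

# Hypothetical inputs, made up for illustration
s <- c("a b", "b c")
pattern <- c("a", "b")
replacement <- c("x", "y")

# Default: the i-th pattern/replacement pair is applied to the i-th string only
pairwise <- stri_replace_all_fixed(s, pattern, replacement)
pairwise
## [1] "x b" "y c"

# vectorize_all=FALSE: all pairs are applied, one after another, to each string
sequential <- stri_replace_all_fixed(s, pattern, replacement, vectorize_all=FALSE)
sequential
## [1] "x y" "y c"
```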
- (NEW FUNCTIONS) #91: stri_subset_*, a convenient and more efficient substitute for str[stri_detect_*(str, ...)], added.
    stri_subset_regex(c("[email protected]", "[email protected]", "no email here"),
       "^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\\.)+[A-Za-z]{2,4}$")
    ## [1] "[email protected]" "[email protected]"
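The equivalence with the detect-then-index idiom can be checked directly. A minimal sketch with a made-up character vector:

```r
library("stringi")

# Hypothetical input, made up for illustration
fruits <- c("apple", "banana", "cherry")

# One call instead of detect + index
subset_direct <- stri_subset_fixed(fruits, "an")

# The idiom it is meant to replace
subset_manual <- fruits[stri_detect_fixed(fruits, "an")]

subset_direct
## [1] "banana"
identical(subset_direct, subset_manual)
## [1] TRUE
```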
- (NEW FEATURE) #100: stri_split_fixed, stri_split_charclass, stri_split_regex, stri_split_coll gained a tokens_only parameter, which defaults to FALSE for backward compatibility.
    stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=1, tokens_only=TRUE, omit_empty=TRUE)
    ## [[1]]
    ## [1] "ab"
    ##
    ## [[2]]
    ## [1] "d"
    ##
    ## [[3]]
    ## [1] "h"
    ##
    ## [[4]]
    ## character(0)
    stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=2, tokens_only=TRUE, omit_empty=TRUE)
    ## [[1]]
    ## [1] "ab" "c"
    ##
    ## [[2]]
    ## [1] "d"  "ef"
    ##
    ## [[3]]
    ## [1] "h"
    ##
    ## [[4]]
    ## character(0)
    stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE)
    ## [[1]]
    ## [1] "ab" "c"
    ##
    ## [[2]]
    ## [1] "d"  "ef" "g"
    ##
    ## [[3]]
    ## [1] "h"
    ##
    ## [[4]]
    ## character(0)
- (NEW FUNCTION) #105: stri_list2matrix converts lists of atomic vectors to character matrices, useful in connection with stri_split and stri_extract.
    stri_list2matrix(stri_split_fixed(c("ab_c", "d_ef_g", "h", ""), "_", n_max=3, tokens_only=TRUE, omit_empty=TRUE))
    ##      [,1] [,2] [,3] [,4]
    ## [1,] "ab" "d"  "h"  NA
    ## [2,] "c"  "ef" NA   NA
    ## [3,] NA   "g"  NA   NA
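By default each list element becomes a column; the byrow argument switches to one row per element. A small sketch on a hand-built list (made up for this example), which avoids depending on any splitting options:

```r
library("stringi")

# Hypothetical list of tokens, made up for illustration
tokens <- list(c("ab", "c"), c("d", "ef", "g"), "h")

# Default: one column per list element, shorter vectors padded with NA
by_col <- stri_list2matrix(tokens)

# byrow=TRUE: one row per list element
by_row <- stri_list2matrix(tokens, byrow=TRUE)
by_row
##      [,1] [,2] [,3]
## [1,] "ab" "c"  NA
## [2,] "d"  "ef" "g"
## [3,] "h"  NA   NA
```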
- (NEW FEATURE) #107: stri_split_* now allow setting an omit_empty=NA argument.
    stri_split_fixed("a_b_c__d", "_", omit_empty=FALSE)
    ## [[1]]
    ## [1] "a" "b" "c" ""  "d"
    stri_split_fixed("a_b_c__d", "_", omit_empty=TRUE)
    ## [[1]]
    ## [1] "a" "b" "c" "d"
    stri_split_fixed("a_b_c__d", "_", omit_empty=NA)
    ## [[1]]
    ## [1] "a" "b" "c" NA  "d"
- (NEW FEATURE) #106: stri_split and stri_extract_all gained a simplify argument (if TRUE, then stri_list2matrix(..., byrow=TRUE) is called on the resulting list).
    stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=TRUE, simplify=TRUE)
    ##      [,1] [,2] [,3]
    ## [1,] "ab" "c"  NA
    ## [2,] "d"  "ef" "g"
    ## [3,] "h"  NA   NA
    ## [4,] NA   NA   NA
    stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=FALSE, simplify=TRUE)
    ##      [,1] [,2] [,3]
    ## [1,] "ab" "c"  NA
    ## [2,] "d"  "ef" "g"
    ## [3,] ""   "h"  NA
    ## [4,] ""   NA   NA
    stri_split_fixed(c("ab,c", "d,ef,g", ",h", ""), ",", omit_empty=NA, simplify=TRUE)
    ##      [,1] [,2] [,3]
    ## [1,] "ab" "c"  NA
    ## [2,] "d"  "ef" "g"
    ## [3,] NA   "h"  NA
    ## [4,] NA   NA   NA
- (NEW FUNCTION) #77: stri_rand_lipsum generates (pseudo)random dummy lorem ipsum text.
    cat(sapply(
       stri_wrap(stri_rand_lipsum(3), 80, simplify=FALSE),
       stri_flatten, collapse="\n"), sep="\n\n")
    ## Lorem ipsum dolor sit amet, eu turpis pellentesque est, lectus, vestibulum.
    ## Iaculis et nam ad eu morbi, ultrices enim pellentesque est fusce. Etiam
    ## ipsum varius, maecenas dapibus. Netus molestie non adipiscing netus,
    ## aptent sed malesuada, placerat suscipit. A, sed eu luctus imperdiet odio
    ## tempor. In velit ut vel feugiat felis eros risus. Sed sapien, facilisis
    ## ullamcorper, senectus efficitur sit id sociis sed purus. Ipsum, a, blandit
    ## faucibus. In vivamus, duis et sed sollicitudin maximus. Sodales magnis
    ## ac senectus facilisis, dolor faucibus a. Cursus in cum, cubilia egestas
    ## ut platea turpis. Maximus sit vel cursus nec in vel, eu, lacinia in ut.
    ##
    ## Libero maximus potenti penatibus amet nisl non ut. Commodo nullam rhoncus,
    ## bibendum quisque sem aliquam sed, quam enim et, sed. Lacinia netus inceptos
    ## sapien nostra tincidunt facilisis montes nascetur non pharetra convallis
    ## id. Netus diam nulla montes nec tincidunt facilisis eros porttitor nisl urna
    ## cubilia. Aliquet egestas mus nisl, nisi vehicula, ac mauris rutrum, felis
    ## aenean tristique magna. Ante maecenas phasellus id class. Finibus iaculis purus
    ## volutpat posuere phasellus magna class blandit augue morbi torquent. Taciti
    ## ullamcorper venenatis at nulla eget auctor ante neque metus sed metus. Dolor,
    ## platea sit sed pellentesque ipsum. Dapibus sed nisi vestibulum ex integer.
    ##
    ## Duis iaculis sapien habitasse, facilisi habitasse leo nam. Egestas,
    ## libero tempor purus in. Aliquam himenaeos conubia egestas cum vestibulum
    ## nec. Sociosqu mauris cum mus non lobortis eu et dapibus vel integer.
    ## Blandit quis inceptos cursus vel pellentesque lectus amet egestas.
    ## Pharetra ac eros nisi. Finibus nec, ac congue in molestie sed.
    ## Tincidunt faucibus a interdum facilisis, sed nulla, tortor, felis,
    ## sociis. Sem porttitor himenaeos pharetra nec eu torquent elementum.
- (NEW FEATURE) #98: stri_trans_totitle gained an opts_brkiter parameter; it indicates which ICU BreakIterator should be used when performing case mapping.
    stri_trans_totitle("GOOD-OLD cOOkiE mOnSTeR IS watCHinG You. Here HE comes!",
       stri_opts_brkiter(type="word")) # default boundary
    ## [1] "Good-Old Cookie Monster Is Watching You. Here He Comes!"
    stri_trans_totitle("GOOD-OLD cOOkiE mOnSTeR IS watCHinG You. Here HE comes!",
       stri_opts_brkiter(type="sentence"))
    ## [1] "Good-old cookie monster is watching you. Here he comes!"
- (NEW FEATURE) stri_wrap gained a new parameter: normalize.
- (BUGFIX) #86: stri_*_fixed, stri_*_coll, and stri_*_regex could give incorrect results if one of the search strings was of length 0.
- (BUGFIX) #99: stri_replace_all did not use the replacement arg.
- (BUGFIX) #94: R CMD check should no longer fail if the icudt download fails.
- (BUGFIX) #112: Some of the objects were not PROTECTed from being garbage collected, which might have caused spontaneous SEGFAULTS.
- (BUGFIX) Some collator options were not passed correctly to ICU services.
- (BUGFIX) The causes of memory leaks detected by valgrind --tool=memcheck --leak-check=full have been removed.
- (DOCUMENTATION) Significant extensions/clean ups in the stringi manual. Check it out yourself. In particular, take a glimpse at stringi-search-regex, stringi-search-charclass and, more generally, at stringi-search.
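Coming back to the operator renaming in #87, a short sketch of how the s-prefixed operators read in practice. The operands are made up for this example; only the concatenation and comparison operators are shown.

```r
library("stringi")

# String concatenation
glued <- "abc" %s+% "def"
glued
## [1] "abcdef"

# String comparison; these no longer clash with magrittr's %>% pipe
is_less <- "a" %s<% "b"
is_greater <- "b" %s>% "a"
is_equal <- "a" %s==% "a"
c(is_less, is_greater, is_equal)
## [1] TRUE TRUE TRUE
```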
Enjoy! Any comments and suggestions are welcome.