Pull the (character) strings with stringi 0.5-2
A reliable string processing toolkit is a must-have for any data scientist.
A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds). As of now, about 850 CRAN packages depend (either directly or recursively) on stringi. Quite recently, the package was also listed among the top downloaded R extensions.
# install.packages("stringi") or update.packages()
library("stringi")
stri_info(TRUE)
## [1] "stringi_0.5.2; en_US.UTF-8; ICU4C 55.1; Unicode 7.0"
apkg <- available.packages(contriburl="http://cran.rstudio.com/src/contrib")
length(tools::dependsOnPkgs('stringi', installed=apkg, recursive=TRUE))
## [1] 845
Refer to the INSTALL file for more details if you compile stringi from sources (Linux users mostly).
Here’s a list of changes in version 0.5-2. There are many major (like date and time processing) and minor new features and enhancements, as well as bugfixes. In this release we also focused on giving users of the stringr package an even better string processing experience, since stringr has been powered by stringi as of its 1.0.0 release.
- [BACKWARD INCOMPATIBILITY] The second argument to stri_pad_*() has been renamed width.
- [GENERAL] #69: stringi is now bundled with ICU4C 55.1.
- [NEW FUNCTIONS] #137: date-time formatting/parsing (note that this is a draft API and it may change in future stringi releases; any comments are welcome):
  - stri_timezone_list() – lists all known time zone identifiers

sample(stri_timezone_list(), 10)
## [1] "Etc/GMT+12"         "Antarctica/Macquarie"
## [3] "Atlantic/Faroe"     "Antarctica/Troll"
## [5] "America/Fort_Wayne" "PLT"
## [7] "America/Goose_Bay"  "America/Argentina/Catamarca"
## [9] "Africa/Juba"        "Africa/Bissau"
  - stri_timezone_set(), stri_timezone_get() – manage current default time zone
  - stri_timezone_info() – basic information on a given time zone

str(stri_timezone_info('Europe/Warsaw'))
## List of 6
##  $ ID              : chr "Europe/Warsaw"
##  $ Name            : chr "Central European Standard Time"
##  $ Name.Daylight   : chr "Central European Summer Time"
##  $ Name.Windows    : chr "Central European Standard Time"
##  $ RawOffset       : num 1
##  $ UsesDaylightTime: logi TRUE
stri_timezone_info('Europe/Warsaw', locale='de_DE')$Name
## [1] "Mitteleuropäische Normalzeit"
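The getter and setter can be combined to switch the default time zone temporarily. The following is only a minimal sketch: it assumes that stri_timezone_set() takes a single time zone identifier as its first argument, and the variable names old_tz and current_tz are made up for illustration.

```r
library("stringi")

# Remember the current default so that it can be restored afterwards
old_tz <- stri_timezone_get()

# Switch the default time zone used by stringi's date-time functions
stri_timezone_set("Europe/Warsaw")
current_tz <- stri_timezone_get()
print(current_tz)

# Restore the previous setting
stri_timezone_set(old_tz)
```

Restoring the old value at the end keeps the change from leaking into the rest of the session.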
  - stri_datetime_symbols() – localizable date-time formatting data

stri_datetime_symbols()
## $Month
## [1] "January" "February" "March" "April" "May"
## [6] "June" "July" "August" "September" "October"
## [11] "November" "December"
##
## $Weekday
## [1] "Sunday" "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
## [7] "Saturday"
##
## $Quarter
## [1] "1st quarter" "2nd quarter" "3rd quarter" "4th quarter"
##
## $AmPm
## [1] "AM" "PM"
##
## $Era
## [1] "Before Christ" "Anno Domini"
stri_datetime_symbols("th_TH_TRADITIONAL")$Month
## [1] "มกราคม" "กุมภาพันธ์" "มีนาคม" "เมษายน" "พฤษภาคม" "มิถุนายน" "กรกฎาคม"
## [8] "สิงหาคม" "กันยายน" "ตุลาคม" "พฤศจิกายน" "ธันวาคม"
stri_datetime_symbols("he_IL@calendar=hebrew")$Month
## [1] "תשרי" "חשון" "כסלו" "טבת" "שבט" "אדר א׳" "אדר"
## [8] "ניסן" "אייר" "סיון" "תמוז" "אב" "אלול" "אדר ב׳"
  - stri_datetime_now() – return current date-time
  - stri_datetime_fstr() – convert a strptime-like format string to an ICU date/time format string
  - stri_datetime_format() – convert date/time to string

stri_datetime_format(stri_datetime_now(), "datetime_relative_medium")
## [1] "today, 6:21:45 PM"
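To illustrate how stri_datetime_fstr() and stri_datetime_format() fit together, here is a small sketch. The date is arbitrary, the variable names are made up, and the locale is passed explicitly only so that the digits do not depend on the machine's default locale.

```r
library("stringi")

# Translate a strptime-style pattern into ICU's date/time pattern syntax
icu_fmt <- stri_datetime_fstr("%Y-%m-%d")
print(icu_fmt)

# Use the translated pattern for formatting a date-time object
d <- stri_datetime_create(2015, 12, 31)
formatted <- stri_datetime_format(d, icu_fmt, locale="en_US")
print(formatted)
```

This lets existing strptime-style patterns be reused without rewriting them by hand in ICU notation.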
  - stri_datetime_parse() – convert string to date/time object

stri_datetime_parse(c("2015-02-28", "2015-02-29"), "yyyy-MM-dd")
## [1] "2015-02-28 18:21:45 CET" NA
stri_datetime_parse(c("2015-02-28", "2015-02-29"), stri_datetime_fstr("%Y-%m-%d"))
## [1] "2015-02-28 18:21:45 CET" NA
stri_datetime_parse(c("2015-02-28", "2015-02-29"), "yyyy-MM-dd", lenient=TRUE)
## [1] "2015-02-28 18:21:45 CET" "2015-03-01 18:21:45 CET"
stri_datetime_parse("19 lipca 2015", "date_long", locale="pl_PL")
## [1] "2015-07-19 18:21:45 CEST"
  - stri_datetime_create() – construct date-time objects from numeric representations

stri_datetime_create(2015, 12, 31, 23, 59, 59.999)
## [1] "2015-12-31 23:59:59 CET"
stri_datetime_create(5775, 8, 1, locale="@calendar=hebrew") # 1 Nisan 5775 -> 2015-03-21
## [1] "2015-03-21 12:00:00 CET"
stri_datetime_create(2015, 02, 29)
## [1] NA
stri_datetime_create(2015, 02, 29, lenient=TRUE)
## [1] "2015-03-01 12:00:00 CET"
  - stri_datetime_fields() – get values for date-time fields

stri_datetime_fields(stri_datetime_now())
##   Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth
## 1 2015     6  23   18     21     45          52         26           4
##   DayOfYear DayOfWeek Hour12 AmPm Era
## 1       174         3      6    2   2
stri_datetime_fields(stri_datetime_now(), locale="@calendar=hebrew")
##   Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth
## 1 5775    11   6   18     21     45          56         40           2
##   DayOfYear DayOfWeek Hour12 AmPm Era
## 1       272         3      6    2   1
stri_datetime_symbols(locale="@calendar=hebrew")$Month[
  stri_datetime_fields(stri_datetime_now(), locale="@calendar=hebrew")$Month
]
## [1] "Tamuz"
  - stri_datetime_add() – add specific number of date-time units to a date-time object

x <- stri_datetime_create(2015, 12, 31, 23, 59, 59.999)
stri_datetime_add(x, units="months") <- 2
print(x)
## [1] "2016-02-29 23:59:59 CET"
stri_datetime_add(x, -2, units="months")
## [1] "2015-12-29 23:59:59 CET"
- [NEW FUNCTIONS] stri_extract_*_boundaries() extract text between text boundaries.
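As a rough illustration of the boundary-based extractors, the sketch below splits a string into sentences. The sample text and variable names are made up, and the break iterator type is passed through stri_opts_brkiter(); note that the extracted chunks may carry trailing white space.

```r
library("stringi")

x <- "The quick brown fox. It jumps over the lazy dog!"
opts <- stri_opts_brkiter(type="sentence")

# All chunks delimited by sentence boundaries (one list element per input string)
sentences <- stri_extract_all_boundaries(x, opts_brkiter=opts)[[1]]
print(sentences)

# The "first" variant returns only one chunk per input string
first_sentence <- stri_extract_first_boundaries(x, opts_brkiter=opts)
print(first_sentence)
```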
- [NEW FUNCTION] #46: stri_trans_char() is a stringi-flavoured chartr() equivalent.

stri_trans_char("id.123", ".", "_")
## [1] "id_123"
stri_trans_char("babaab", "ab", "01")
## [1] "101001"
- [NEW FUNCTION] #8: stri_width() approximates the width of a string in a more Unicodish fashion than nchar(..., "width").

stri_width(LETTERS[1:5])
## [1] 1 1 1 1 1
nchar(stri_trans_nfkd("\u0105"), "width") # provides incorrect information
## [1] 0
stri_width(stri_trans_nfkd("\u0105")) # A and ogonek (width = 1)
## [1] 1
stri_width( # Full-width equivalents of ASCII characters:
  stri_enc_fromutf32(as.list(c(0x3000, 0xFF01:0xFF5E)))
)
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
- [NEW FEATURE] #149: stri_pad() and stri_wrap() are now by default based on code point widths instead of the number of code points. Moreover, by default stri_wrap() no longer gets rid of non-breaking, zero width, etc. spaces.

x <- stri_flatten(c(
  stri_dup(LETTERS, 2), stri_enc_fromutf32(as.list(0xFF21:0xFF3a))
), collapse=' ')
# Note that your web browser may have problems with properly aligning
# this (try it in RStudio)
cat(stri_wrap(x, 11), sep='\n')
## AA BB CC DD
## EE FF GG HH
## II JJ KK LL
## MM NN OO PP
## QQ RR SS TT
## UU VV WW XX
## YY ZZ Ａ Ｂ
## Ｃ Ｄ Ｅ Ｆ
## Ｇ Ｈ Ｉ Ｊ
## Ｋ Ｌ Ｍ Ｎ
## Ｏ Ｐ Ｑ Ｒ
## Ｓ Ｔ Ｕ Ｖ
## Ｗ Ｘ Ｙ Ｚ
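The same width-based logic applies to padding. The sketch below uses the renamed width argument of the stri_pad_*() functions; the inputs and variable names are made up for illustration, and the behavior described in the comments assumes the width-based default introduced in this release.

```r
library("stringi")

# Right-align ASCII text in a field of width 6
padded_ascii <- stri_pad_left("abc", width=6)
print(padded_ascii)

# Two full-width letters (U+FF21, U+FF22) occupy four columns,
# so only two pad characters are needed to reach width 6
wide <- stri_enc_fromutf32(list(c(0xFF21, 0xFF22)))
padded_wide <- stri_pad_right(wide, width=6, pad="*")
print(stri_width(padded_wide))
print(stri_length(padded_wide))
```

Comparing stri_width() with stri_length() on the result shows the difference between display width and the number of code points.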
- [NEW FEATURE] #133: stri_wrap() silently allows for width <= 0 (for compatibility with strwrap()).
- [NEW FEATURE] #139: stri_wrap() gained a new argument: whitespace_only.
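A brief sketch of what whitespace_only changes, using a made-up input with hyphenated words. The exact line breaks in the default mode depend on ICU's line breaking rules, so no particular output is claimed for it here.

```r
library("stringi")

x <- "well-known self-contained state-of-the-art"

# Break lines at white space only, so hyphenated words stay intact
ws_only <- stri_wrap(x, width=10, whitespace_only=TRUE)
print(ws_only)

# Default: any line break opportunity reported by ICU may be used,
# which typically includes the positions right after hyphens
icu_breaks <- stri_wrap(x, width=10, whitespace_only=FALSE)
print(icu_breaks)
```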
- [GENERAL] #144: Performance improvements in handling ASCII strings (these affect stri_sub(), stri_locate() and other string index-based operations).
- [GENERAL] #143: Searching for short fixed patterns (stri_*_fixed()) now relies on the current libC’s implementation of strchr() and strstr(). This is very fast e.g. on glibc utilizing the SSE2/3/4 instruction set.

x <- stri_rand_strings(100, 10000, "[actg]")
microbenchmark::microbenchmark(
  stri_detect_fixed(x, "acgtgaa"),
  grepl("actggact", x),
  grepl("actggact", x, perl=TRUE),
  grepl("actggact", x, fixed=TRUE)
)
## Unit: microseconds
##                                expr       min        lq       mean
##     stri_detect_fixed(x, "acgtgaa")   349.153   354.181   381.2391
##                grepl("actggact", x) 14017.923 14181.416 14457.3996
##   grepl("actggact", x, perl = TRUE)  8280.282  8367.426  8516.0124
##  grepl("actggact", x, fixed = TRUE)  3599.200  3637.373  3726.6020
##      median         uq       max neval cld
##    362.7515   391.0655   681.267   100 a
##  14292.2815 14594.4970 15736.535   100    d
##   8463.4490  8570.0080  9564.503   100   c
##   3686.6690  3753.4060  4402.397   100  b
- [GENERAL] #141: a local copy of icudt*.zip may be used on package install; see the INSTALL file for more information.
- [GENERAL] #165: the ./configure option --disable-icu-bundle forces the use of system ICU when building the package.
- [BUGFIX] locale specifiers are now normalized in a more intelligent way: e.g. @calendar=gregorian expands to DEFAULT_LOCALE@calendar=gregorian.
- [BUGFIX] #134: stri_extract_all_words() did not accept simplify=NA.
- [BUGFIX] #132: incorrect behavior in stri_locate_regex() for matches of zero lengths.
- [BUGFIX] stringr/#73: stri_wrap() returned CHARSXP instead of STRSXP on empty string input with simplify=FALSE argument.
- [BUGFIX] #164: libicu-dev usage used to fail on Ubuntu.
- [BUGFIX] #135: C++11 is now used by default (see the INSTALL file, however) to build stringi from sources. This is because ICU4C uses the long long type which is not part of the C++98 standard.
- [BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.
- [BUGFIX] #168: Build now fails if icudt is not available.
- [BUGFIX] Force ICU u_init() call on stringi dynlib load.
- [BUGFIX] #157: many overfull hboxes in the package PDF manual have been corrected.
Enjoy! Any comments and suggestions are welcome.