Properly “internationalized” regular expressions in R
We should pay special attention to writing truly portable code that works the same way under different locales and character encodings. Currently, R has two regex engines, ERE (via TRE) and PRE (via PCRE). Surprisingly, they may give different results depending on the operating system and the native character encoding in use!
UPDATE@2013-07-10: check out our stringi package to get rid of such problems forever!
PCRE often outperforms ERE and has a more powerful syntax. Moreover, it was built into R with Unicode support. As UTF-8 can represent almost all printable characters used around the world, it is a good idea to always use PRE on normalized character vectors, i.e. ones converted from the native encoding to UTF-8 via enc2utf8() and then, after regexing, converted back with enc2native().
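The round trip described above can be sketched as follows. This is a minimal illustration rather than code from the original post: the helper name mask_letters() and the sample string are hypothetical, and the sketch assumes that the PCRE library used by R was built with Unicode property support (needed for \p{L}).

```r
# Hypothetical helper: replace every Unicode letter with "_".
mask_letters <- function(x) {
  x_utf8 <- enc2utf8(x)                            # native encoding -> UTF-8
  res <- gsub("\\p{L}", "_", x_utf8, perl = TRUE)  # PCRE matching on UTF-8 input
  enc2native(res)                                  # UTF-8 -> native encoding
}

# "Zażółć 123", written with \u escapes so that the script itself stays ASCII
masked <- mask_letters("Za\u017c\u00f3\u0142\u0107 123")
print(masked)
```

Only the matching step needs UTF-8 input; the conversion back with enc2native() matters when the result still contains non-ASCII characters that the rest of the program expects in the native encoding.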
Here’s an example on matching some character classes in three different locales. The string where matches were sought consisted of all ASCII characters (codes 1–127) and Polish letters (ą, ę, ł, ś, ż, and so on).
Locale | |||
---|---|---|---|
Pattern | pl_PL.UTF-8 (GNU/Linux) | pl_PL.iso-8859-2 (GNU/Linux) | Polish_Poland.1250 (Windows) |
ERE-Native | |||
[[:alpha:]] | AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż | AB...Zab...zĆĘŃÓćęńó ||
[[:digit:]] | 0123456789 | 0123456789ął ||
[[:lower:]] | ab...ząćęłńóśźż | ab...zćęńó ||
[[:upper:]] | AB...ZĄĆĘŁŃÓŚŹŻ | AB...ZĆĘŃÓ ||
[[:punct:]] | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ĄŁŚŹŻąłśźż | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ĄŁŻąłż |
[A-Z] | AB...Z |||
[a-z] | ab...z |||
PCRE-Native | |||
[[:alpha:]] | AB...Zab...z | AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż ||
[[:digit:]] | 0123456789 |||
[[:lower:]] | ab...z | ab...ząćęłńóśźż ||
[[:upper:]] | AB...Z | AB...ZĄĆĘŁŃÓŚŹŻ ||
[[:punct:]] | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |||
[A-Z] | AB...Z |||
[a-z] | ab...z |||
ERE-UTF-8 normalized | |||
[[:alpha:]] | AB...Zab...z ĄĆĘŁŃÓŚŹŻąćęłńóśźż | AB...Zab...zÓó ||
[[:digit:]] | 0123456789 |||
[[:lower:]] | ab...ząćęłńóśźż | ab...zó ||
[[:upper:]] | AB...ZĄĆĘŁŃÓŚŹŻ | AB...ZÓ ||
[[:punct:]] | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |||
[A-Z] | AB...Z |||
[a-z] | ab...z |||
PCRE-UTF-8 normalized | |||
\p{L} | AB...Zab...zĄĆĘŁŃÓŚŹŻąćęłńóśźż |||
\p{N} | 0123456789 |||
\p{Ll} | ab...ząćęłńóśźż |||
\p{Lu} | AB...ZĄĆĘŁŃÓŚŹŻ |||
\p{P} | !"#%&'()*,-./:;?@[\]_{} |||
\p{S} | $+<=>^`|~ |||
\p{S}|\p{P} | !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |||
[A-Z] | AB...Z |||
[a-z] | ab...z |||
We see that PCRE after a “normalization” with enc2utf8() gives correct results in all the locales.
An example:
gregexpr(enc2utf8(pattern), enc2utf8(text), perl=TRUE)
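To make that call concrete, the sketch below rebuilds a test string similar to the one used for the table (ASCII codes 1–127 plus the Polish letters) and pulls out the matched characters with regmatches(). The helper extract_class() and the variable names are illustrative assumptions, not part of the original post.

```r
# Polish letters given as \u escapes, so the script does not depend on
# the encoding of the source file: ĄĆĘŁŃÓŚŹŻ followed by ąćęłńóśźż
polish_upper <- "\u0104\u0106\u0118\u0141\u0143\u00d3\u015a\u0179\u017b"
polish_lower <- "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c"
ascii <- rawToChar(as.raw(1:127))
txt <- paste0(ascii, polish_upper, polish_lower)

# Hypothetical helper: collect all characters of `txt` matching `pattern`
extract_class <- function(pattern, txt) {
  txt <- enc2utf8(txt)
  m <- gregexpr(enc2utf8(pattern), txt, perl = TRUE)  # PCRE on normalized input
  paste(regmatches(txt, m)[[1]], collapse = "")
}

upper  <- extract_class("\\p{Lu}", txt)
digits <- extract_class("\\p{N}", txt)
print(upper)
print(digits)
```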
With the stringr package you may use e.g.:
str_extract_all(enc2utf8(text), perl(enc2utf8(pattern)))
Note that regexec() (and str_match_all() from stringr) did not support PRE at the time of writing. However, you may use gregexpr() instead.
str_match_all.perl <- function(s, p) {
  require("stringr")
  # PCRE-NORMALIZED
  # note that normalization is needed only for regex-matching
  m <- gregexpr(enc2utf8(p), enc2utf8(s), perl=TRUE)
  out <- vector("list", length(s)) # vectorized over s
  for (j in seq_along(s)) {
    # no match in s[j]: leave out[[j]] as NULL
    # (the check must look at m[[j]], not at the length of the whole list m)
    if (length(m[[j]]) == 1 && m[[j]][1] == -1) next
    nmatch <- length(m[[j]])
    ncapt <- length(attr(m[[j]], "capture.names"))
    # first column: the whole match; one extra column per capture group
    out[[j]] <- matrix(str_sub(s[[j]], m[[j]], m[[j]]+attr(m[[j]], "match.length")-1),
                       nrow=nmatch, ncol=ncapt+1)
    if (ncapt > 0) {
      cs <- as.integer(attr(m[[j]], "capture.start"))
      cl <- as.integer(attr(m[[j]], "capture.length"))
      out[[j]][,-1] <- str_sub(s[j], cs, cs+cl-1)
      if (any(str_length(attr(m[[j]], "capture.names")) > 0))
        colnames(out[[j]]) <- c("", attr(m[[j]], "capture.names"))
    }
  }
  out
}

# test:
print(str_match_all.perl(c(
  "rocznik99='nie', skrzynia='auto', osób='5', foo=bar",
  "kolor='ŻÓŁTY', silnik='3.0l TURBO++', skrzynia='manual'"
), "(?<attr>\\p{L}+)='(?<val>[^']*)'")) # two named groups

print(str_match_all.perl(c(
  "kolor='czerwony', świece='nówka', skrzynia='manual'",
  "rocznik99='skądże', skrzynia='auto', osób='5', foo=bar"
), "\\p{L}+='[^']*'")) # no groups

print(str_match_all.perl("123", "\\p{L}+")) # no matches at all
UPDATE@2013-07-10: our stringi package works the same way in each platform’s locale and encoding. See the stri_locate_all_regex and stri_match_all_regex functions.
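As a rough illustration of those functions, here is a short sketch. It assumes the stringi package is installed (hence the requireNamespace() guard); the sample string and the variable names are made up for this example.

```r
if (requireNamespace("stringi", quietly = TRUE)) {
  # "kolor='ŻÓŁTY', skrzynia='manual'" with \u escapes for the Polish letters
  s <- "kolor='\u017b\u00d3\u0141TY', skrzynia='manual'"

  # all runs of Unicode letters
  words <- stringi::stri_extract_all_regex(s, "\\p{L}+")[[1]]

  # whole match plus two capture groups, one row per match
  pairs <- stringi::stri_match_all_regex(s, "(\\p{L}+)='([^']*)'")[[1]]

  # start and end positions (in characters) of each run of letters
  locs <- stringi::stri_locate_all_regex(s, "\\p{L}+")[[1]]

  print(words)
  print(pairs)
  print(locs)
}
```

No enc2utf8() calls appear here because, as stated above, stringi is meant to behave the same regardless of the platform's locale and encoding.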
–
Marek Gągolewski