RegEx: Named Capture in R
[This article was first published on Odd Hypothesis, and kindly contributed to R-bloggers].
I consider myself a decent RegEx user. References to famous quotes about RegEx aside, I find it intuitive, I like its speed, and it makes my code simpler (more so than the alternatives, anyhow). Thus, I use RegEx wherever I can in the growing grab bag of languages I consider myself proficient in:
- *nix command line / shell scripts
- Javascript
- PHP
- Matlab
- Python
- R
To get a sense of R’s named capture inadequacy, here’s a simple scenario …
The Problem:
You are given a list of files with names like:
- chA_0001
- chA_0002
- chA_0003
- chB_0001
- chB_0002
- chB_0003
The regular expression with named capture that pulls the channel letter and the four-digit id out of each name is quite simple:
ch(?<ch>[A-Z])\_(?<id>[0-9]{4})
which, given the list of file names, should return some structure with property:value pairs of the sort:
- ch : A, A, A, B, B, B
- id : 0001, 0002, 0003, 0001, 0002, 0003
The Solutions:
Here’s some Matlab code that basically does this in one line:
src = {'chA_0001', 'chA_0002', 'chA_0003', 'chB_0001', 'chB_0002', 'chB_0003'};
pat = 'ch(?<ch>[A-Z])\_(?<id>[0-9]{4})';
rex = regexp(src, pat, 'names')
rex{1}
rex{1}.id
which would result in the following console output:
>> rex = regexp(src, pat, 'names')
rex =
    [1x1 struct]    [1x1 struct]    [1x1 struct]    [1x1 struct]    [1x1 struct]    [1x1 struct]
>> rex{1}
ans =
    ch: 'A'
    id: '0001'
>> rex{1}.id
ans =
0001
Now here’s the equivalent R code:
# regular expressions with named capture in R
src = c('chA_0001', 'chA_0002', 'chA_0003', 'chB_0001', 'chB_0002', 'chB_0003')
pat = 'ch(?<ch>[A-Z])\\_(?<id>[0-9]{4})'

re.capture = function(pattern, string, ...) {
  # keep the input, the raw regexpr() result, and a list for the captured groups
  rex = list(src=string,
             result=regexpr(pattern, string, perl=TRUE, ...),
             names=list())

  # use the full attribute name rather than relying on attr()'s partial matching
  for (.name in attr(rex$result, 'capture.names')) {
    # end position = start + length - 1
    rex$names[[.name]] = substr(rex$src,
                                attr(rex$result, 'capture.start')[,.name],
                                attr(rex$result, 'capture.start')[,.name]
                                + attr(rex$result, 'capture.length')[,.name]
                                - 1)
  }

  return(rex)
}

print(re.capture(pat, src))
There is a lot of work here! To help explain what’s going on, here’s the corresponding console output:
$src
[1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"
$result
[1] 1 1 1 1 1 1
attr(,"match.length")
[1] 8 8 8 8 8 8
attr(,"useBytes")
[1] TRUE
attr(,"capture.start")
     ch id
[1,]  3  5
[2,]  3  5
[3,]  3  5
[4,]  3  5
[5,]  3  5
[6,]  3  5
attr(,"capture.length")
     ch id
[1,]  1  4
[2,]  1  4
[3,]  1  4
[4,]  1  4
[5,]  1  4
[6,]  1  4
attr(,"capture.names")
[1] "ch" "id"
$names
$names$ch
[1] "A" "A" "A" "B" "B" "B"
$names$id
[1] "0001" "0002" "0003" "0001" "0002" "0003"
Here’s what’s happening:
- regexpr(…, perl=T) is used to create a regular expression result with named capture, which is placed in the $result item of the output list:
  $result
  [1] 1 1 1 1 1 1
  attr(,"match.length")
  [1] 8 8 8 8 8 8
  attr(,"useBytes")
  [1] TRUE
  attr(,"capture.start")
       ch id
  [1,]  3  5
  [2,]  3  5
  [3,]  3  5
  [4,]  3  5
  [5,]  3  5
  [6,]  3  5
  attr(,"capture.length")
       ch id
  [1,]  1  4
  [2,]  1  4
  [3,]  1  4
  [4,]  1  4
  [5,]  1  4
  [6,]  1  4
  attr(,"capture.names")
  [1] "ch" "id"
  This result is pretty unusable since all of the important captured information is buried in attribute settings.
- To do anything with the output from regexpr(), the result from the first step has to have its attributes probed using attr() (via a for loop) to get:
  - captured group names
  - start locations within the strings of the captured groups
  - lengths of the captured groups (oddly/depressingly, end positions are not returned)
- These are then passed to substr() to extract the actual match strings from the input list:
  rex$names[[.name]] = substr(rex$src,
                              attr(rex$result, 'capture.start')[,.name],
                              attr(rex$result, 'capture.start')[,.name]
                              + attr(rex$result, 'capture.length')[,.name]
                              - 1)
- The above steps are encapsulated into a much easier to use function, re.capture(), that allows for one-line-ish extraction:
  > src
  [1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"
  > pat
  [1] "ch(?<ch>[A-Z])\\_(?<id>[0-9]{4})"
  > re.capture(pat, src)$names$ch
  [1] "A" "A" "A" "B" "B" "B"
  > re.capture(pat, src)$names$id
  [1] "0001" "0002" "0003" "0001" "0002" "0003"
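The steps above can also be sketched as a variant that returns a data frame, which may be handier for downstream work. This is only a sketch, not the author's code: the name re.capture.df is hypothetical, and it assumes every capture group in the pattern is named. Unlike re.capture(), it reports NA for input strings that do not match at all, since regexpr() marks those with -1 positions that substr() would otherwise quietly turn into empty strings.

```r
# Hypothetical helper (not from the original post):
# named capture groups returned as the columns of a data frame.
re.capture.df = function(pattern, string, ...) {
  m = regexpr(pattern, string, perl = TRUE, ...)

  # the capture details live in attributes of the regexpr() result
  starts  = attr(m, 'capture.start')
  lengths = attr(m, 'capture.length')
  groups  = attr(m, 'capture.names')

  out = lapply(groups, function(g) {
    # end position = start + length - 1
    s = substr(string, starts[, g], starts[, g] + lengths[, g] - 1)
    # regexpr() returns -1 where the pattern did not match at all
    s[as.vector(m) == -1] = NA_character_
    s
  })
  names(out) = groups

  as.data.frame(out, stringsAsFactors = FALSE)
}

# 'notes.txt' is a made-up non-matching name, added to show the NA handling
src = c('chA_0001', 'chB_0003', 'notes.txt')
pat = 'ch(?<ch>[A-Z])\\_(?<id>[0-9]{4})'
res = re.capture.df(pat, src)
print(res)
```

Each named group becomes a character column, so res$ch and res$id play the role of $names$ch and $names$id in the list returned by re.capture().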
Summary
All told, it takes three functions and a for loop to get a user-friendly named capture result! While I was able to make a one-liner function out of the ordeal, it’s a shame that someone on the R development team couldn’t build this into the return values for regexpr() and gregexpr(). Granted, I’m not the first to wish for something better. Perhaps this is something to look forward to in R 2.16?
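For readers on a newer R: as far as I know, utils::strcapture() (added in R 3.4.0, so well after this post) gets fairly close to this wish. Below is a minimal sketch under that assumption. Note that the column names come from the proto argument rather than from named groups in the pattern, so the pattern here uses plain groups; the variable names proto and caps are just for illustration.

```r
src = c('chA_0001', 'chA_0002', 'chA_0003', 'chB_0001', 'chB_0002', 'chB_0003')

# proto is an empty data frame giving the name and type of each capture group
proto = data.frame(ch = character(), id = character(), stringsAsFactors = FALSE)

# one call: each parenthesized group fills the corresponding proto column
caps = strcapture('ch([A-Z])_([0-9]{4})', src, proto)

print(caps$ch)
print(caps$id)
```

This is not true named capture, since the names are supplied separately from the pattern, but it does give the property:value structure described in the problem statement without any attr() probing.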