One of the functions I use the most is strsplit. It is quite useful if you want to separate a string by a specific character. Even if you have some complex rule for the split, most of the time you can solve this with a regular expression. However, recently I came across a problem I could not get my head around: I wanted to split the string but also keep the delimiter.
basic regular expressions
Let's start at the beginning. If you do not know what regular expressions are, I will give you a short introduction. With regular expressions you can describe patterns in a string and then use them in functions like grep, gsub or strsplit.
As the R (3.4.1) help file for regex states:
"A regular expression is a pattern that describes a set of strings. Two types of regular expressions are used in R, extended regular expressions (the default) and Perl-like regular expressions used by perl = TRUE. There is also fixed = TRUE which can be considered to use a literal regular expression."
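As a small illustration of the three modes named in the help text, the following sketch runs one toy vector (made up for this example, not from the original problem) through the default, fixed and Perl-like matching:

```r
# Toy input, chosen only to show the difference between the modes
x <- c("a.b", "acb")

# Default (extended regular expression): "." matches any single character
res_default <- grepl("a.b", x)                 # TRUE TRUE

# fixed = TRUE: the pattern is taken literally, so "." is just a dot
res_fixed <- grepl("a.b", x, fixed = TRUE)     # TRUE FALSE

# perl = TRUE: Perl-only syntax such as a lookahead becomes available
res_perl <- grepl("a(?=\\.)", x, perl = TRUE)  # TRUE FALSE
```

The last call already uses a lookahead, which is the construct that causes the trouble later in this post.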
If you are looking for a specific pattern in a string, let's say "3D", you can just use those characters:
x <- c("3D", "4D", "3a")
grep("3D", x)
[1] 1
If you instead want every element that contains a digit followed by an upper-case letter, you should use a regular expression:
x <- c("3D", "4D", "3a")
grep("[0-9][A-Z]", x)
[1] 1 2
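The same pattern can be handed to the other functions mentioned above. The following is only a sketch; the replacement text "<hit>" and the string "a1Bc2Dd" are made up for illustration:

```r
x <- c("3D", "4D", "3a")
pattern <- "[0-9][A-Z]"  # one digit followed by one upper-case letter

# grepl() returns a logical per element instead of the matching indices
hits <- grepl(pattern, x)                 # TRUE TRUE FALSE

# grep(value = TRUE) returns the matching elements themselves
values <- grep(pattern, x, value = TRUE)  # "3D" "4D"

# gsub() replaces every match of the pattern
replaced <- gsub(pattern, "<hit>", x)     # "<hit>" "<hit>" "3a"

# strsplit() cuts the string at every match and drops the match
pieces <- strsplit("a1Bc2Dd", pattern)    # [[1]] "a" "c" "d"
```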
Since regular expressions can get quite complicated really fast, I will stop here and refer you to a cheat sheet for more information. In the cheat sheet you can also find the part that gave me the trouble: lookarounds.
lookarounds
Back to my problem. I had a string like c("3D/MON&SUN") and wanted to separate it by / and &.
x <- c("3D/MON&SUN")
strsplit(x, "[/&]", perl = TRUE)
[[1]]
[1] "3D"  "MON" "SUN"
I still needed the delimiter because it contained useful information, so I used lookaround regular expressions. First up is the lookbehind, which works just fine:
strsplit(x, "(?<=[/&])", perl = TRUE)
[[1]]
[1] "3D/"  "MON&" "SUN"
However, when I used the lookahead, it did not work as I expected:
strsplit(x, "(?=[/&])", perl = TRUE)
[[1]]
[1] "3D"  "/"   "MON" "&"   "SUN"
In my search for a solution, I finally found this post on Stack Overflow, which explained the strange behaviour of strsplit. Well, after reading the post and the help file, it is not strange anymore. It is just what the algorithm says it will do, exactly as stated in the help file of strsplit:
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
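To make the loop concrete, here is a rough emulation in R for a single string. It is a sketch, not the actual implementation of strsplit: the function name split_sketch is hypothetical, and the branch for an empty match at the very start of the string (emit one character and move on) is an assumption that is consistent with the outputs shown above.

```r
# Hypothetical helper that mimics the documented loop for one string
split_sketch <- function(s, pattern) {
  out <- character(0)
  while (nchar(s) > 0) {                      # "if the string is empty break"
    m <- regexpr(pattern, s, perl = TRUE)
    if (m == -1) {                            # no match: emit the rest and stop
      out <- c(out, s)
      break
    }
    end <- m + attr(m, "match.length") - 1    # last index covered by the match
    if (end > 0) {
      out <- c(out, substr(s, 1, m - 1))      # string to the left of the match
      s <- substr(s, end + 1, nchar(s))       # remove match and all to the left
    } else {
      # Assumed special case: empty match at position 1, nothing to remove,
      # so a single character is emitted to keep the loop moving
      out <- c(out, substr(s, 1, 1))
      s <- substr(s, 2, nchar(s))
    }
  }
  out
}

x <- "3D/MON&SUN"
split_sketch(x, "[/&]")       # "3D" "MON" "SUN"
split_sketch(x, "(?<=[/&])")  # "3D/" "MON&" "SUN"
split_sketch(x, "(?=[/&])")   # "3D" "/" "MON" "&" "SUN"
```

With the lookahead, the match before "/" removes nothing, so the next search finds the same position at the start of the remainder, and only the single character "/" is emitted.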
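Lookarounds match a position rather than any characters, so the match has zero length. That messes up the removing part of the algorithm: with the lookahead, the delimiter is not removed and is found again right at the start of the remaining string. Luckily, the post also gave a solution which contains some regular expression magic: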
strsplit(x = x, "(?<=.)(?=[/&])", perl = TRUE)
[[1]]
[1] "3D"   "/MON" "&SUN"
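One way to see why the extra (?<=.) helps is to look at where each pattern first matches in the remainder "/MON&SUN", which is what is left after the first piece "3D" has been cut off. This is a small sketch using regexpr; the variable names are made up for illustration.

```r
rest <- "/MON&SUN"

# Plain lookahead: matches at position 1, directly in front of the "/"
pos_plain <- as.integer(regexpr("(?=[/&])", rest, perl = TRUE))        # 1

# With (?<=.) at least one character must precede the position,
# so the start of the string is skipped and the match moves to the "&"
pos_fixed <- as.integer(regexpr("(?<=.)(?=[/&])", rest, perl = TRUE))  # 5

# Both matches are positions only, i.e. they have length zero
len_fixed <- attr(regexpr("(?<=.)(?=[/&])", rest, perl = TRUE), "match.length")  # 0
```

Because the combined pattern can never match at the very start of the remaining string, the delimiter stays attached to the piece that follows it.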
So my problem is solved, but I would have to remember this regular expression … uurrghhh!
a new function: strsplit 2.0
If I have the chance to write a function which eases my work, I will do it! So I wrote my own strsplit with a new argument type = c("remove", "before", "after"). Basically, I just used the regular expression mentioned above and put it into an if-condition.
To sum it all up: Regular expressions are a powerful tool and you should try to learn and understand how they work!
strsplit <- function(x, split, type = "remove", perl = FALSE, ...) {
  if (type == "remove") {
    # use base::strsplit
    out <- base::strsplit(x = x, split = split, perl = perl, ...)
  } else if (type == "before") {
    # split before the delimiter and keep it
    out <- base::strsplit(x = x,
                          split = paste0("(?<=.)(?=", split, ")"),
                          perl = TRUE, ...)
  } else if (type == "after") {
    # split after the delimiter and keep it
    out <- base::strsplit(x = x,
                          split = paste0("(?<=", split, ")"),
                          perl = TRUE, ...)
  } else {
    # wrong type input
    stop("type must be remove, after or before!")
  }
  return(out)
}
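For a quick check of the three modes, here is a self-contained variant of the same idea. It is only a sketch: the name strsplit_keep is hypothetical and chosen so that base::strsplit is not masked, and match.arg plus switch replace the if-condition.

```r
# Hypothetical variant of the wrapper above; same regular expressions
strsplit_keep <- function(x, split, type = c("remove", "before", "after"),
                          perl = FALSE, ...) {
  type <- match.arg(type)  # errors on anything but the three allowed values
  pattern <- switch(type,
    remove = split,                             # plain split, delimiter dropped
    before = paste0("(?<=.)(?=", split, ")"),   # cut in front of the delimiter
    after  = paste0("(?<=", split, ")")         # cut behind the delimiter
  )
  # Lookarounds need Perl-like regular expressions
  base::strsplit(x, pattern, perl = perl || type != "remove", ...)
}

x <- "3D/MON&SUN"
strsplit_keep(x, "[/&]")                   # [[1]] "3D" "MON" "SUN"
strsplit_keep(x, "[/&]", type = "before")  # [[1]] "3D" "/MON" "&SUN"
strsplit_keep(x, "[/&]", type = "after")   # [[1]] "3D/" "MON&" "SUN"
```

Using a separate name keeps existing code that relies on base::strsplit untouched, at the price of one more function name to remember.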
About the author
Jakob Gepp
The post strsplit – but keeping the delimiter first appeared on STATWORX.