strsplit – but keeping the delimiter

[This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers.]

One of the functions I use the most is strsplit. It is quite useful if you want to separate a string by a specific character. Even if you have some complex rule for the split, most of the time you can solve this with a regular expression. However, recently I came across a problem I could not get my head around. I wanted to split the string but also keep the delimiter.

basic regular expressions

Let's start at the beginning. If you do not know what regular expressions are, I will give you a short introduction. With regular expressions you can describe patterns in a string and then use them in functions like grep, gsub or strsplit.

As the R (3.4.1) help file for regex states:

A regular expression is a pattern that describes a set of strings. Two types of regular expressions are used in R, extended regular expressions (the default) and Perl-like regular expressions used by perl = TRUE. There is also fixed = TRUE which can be considered to use a literal regular expression.
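
As a small sketch of how these modes differ, consider the dot: it is a wildcard in a regular expression but an ordinary character with fixed = TRUE. The input string and variable names below are made up for illustration.

```r
s <- "a.b.c"

# As a regular expression, "." matches any character, so every character
# is treated as a delimiter and only empty strings are left over
as_regex <- strsplit(s, ".")

# With fixed = TRUE the dot is taken literally
as_fixed <- strsplit(s, ".", fixed = TRUE)

# Escaping the dot gives the same result in regex mode
as_escaped <- strsplit(s, "\\.")

print(as_fixed)
```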

If you are looking for a specific pattern in a string – let's say "3D" – you can just use those characters:

x <- c("3D", "4D", "3a")
grep("3D", x)

[1] 1

If you instead want every string in which a digit is followed by an uppercase letter, you need a regular expression:

x <- c("3D", "4D", "3a")
grep("[0-9][A-Z]", x)

[1] 1 2
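
The same pattern can be reused in the other functions mentioned earlier. The following is only a sketch; the variable names and the replacement text "match" are chosen for illustration.

```r
x <- c("3D", "4D", "3a")
pattern <- "[0-9][A-Z]"

# grep() with value = TRUE returns the matching elements instead of their positions
matches <- grep(pattern, x, value = TRUE)

# grepl() returns one logical value per element
flags <- grepl(pattern, x)

# gsub() replaces every match; elements without a match are returned unchanged
replaced <- gsub(pattern, "match", x)

print(matches)
print(flags)
print(replaced)
```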

Since regular expressions can get complicated quickly, I will stop here and refer you to a cheat sheet for more information. The cheat sheet also covers the part that gave me trouble: lookarounds.

lookarounds

Back to my problem. I had a string like c("3D/MON&SUN") and wanted to separate it by / and &.

x <- c("3D/MON&SUN")
strsplit(x, "[/&]", perl = TRUE)

[[1]]
[1] "3D"  "MON" "SUN"

Since the delimiters carried useful information, I needed to keep them, so I turned to lookaround expressions. First up is the lookbehind, which works just fine:

strsplit(x, "(?<=[/&])", perl = TRUE)

[[1]]
[1] "3D/"  "MON&" "SUN"

However, when I used the lookahead, it did not work as I expected:

strsplit(x, "(?=[/&])", perl = TRUE)

[[1]]
[1] "3D"  "/"   "MON" "&"   "SUN"
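
To see what the two lookarounds match independently of strsplit, one option is to inspect the match positions with gregexpr. This is a sketch for illustration; the variable names ahead and behind are my own.

```r
x <- "3D/MON&SUN"

# Positions where the lookahead matches: directly before "/" and "&"
ahead <- gregexpr("(?=[/&])", x, perl = TRUE)[[1]]

# Positions where the lookbehind matches: directly after "/" and "&"
behind <- gregexpr("(?<=[/&])", x, perl = TRUE)[[1]]

print(as.integer(ahead))            # start positions of the lookahead matches
print(as.integer(behind))           # start positions of the lookbehind matches
print(attr(ahead, "match.length"))  # length of each match
```

Both patterns find the right positions, and every match has length zero because a lookaround consumes no characters. The surprising output therefore comes from how strsplit processes such matches, not from the pattern itself.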

In my search for a solution, I finally found this post on Stack Overflow, which explained the strange behaviour of strsplit. Well, after reading the post and the help file, it is not strange anymore. It is just what the algorithm said it would do, exactly as stated in the help file of strsplit:

repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
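
The pseudocode above can be traced with a small pure-R emulation. This is a sketch, not the actual implementation: split_sketch is a hypothetical helper, and the branch for a zero-length match at the very start of the string (emit one character and move on) is my assumption, chosen because it reproduces the outputs shown in this post.

```r
# Hypothetical helper that mimics the algorithm from the strsplit help file
split_sketch <- function(string, pattern) {
  out <- character(0)
  repeat {
    if (nchar(string) == 0) break
    m <- regexpr(pattern, string, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {
      # no match: the rest of the string is the last piece
      out <- c(out, string)
      break
    }
    # index of the last matched character (start - 1 for a zero-length match)
    end <- start + attr(m, "match.length") - 1L
    if (end >= 1L) {
      # add everything left of the match, then drop the match and all to its left
      out <- c(out, substr(string, 1, start - 1L))
      string <- substr(string, end + 1L, nchar(string))
    } else {
      # ASSUMPTION: a zero-length match at position 1 emits a single character
      out <- c(out, substr(string, 1, 1))
      string <- substr(string, 2, nchar(string))
    }
  }
  out
}

x <- "3D/MON&SUN"
print(split_sketch(x, "[/&]"))      # delimiter removed
print(split_sketch(x, "(?<=[/&])")) # lookbehind
print(split_sketch(x, "(?=[/&])"))  # lookahead
```

With the lookahead, the remaining string after the first split is "/MON&SUN", and the pattern matches again at its very first position, which is why the delimiter ends up on its own.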

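Since lookarounds match a position rather than characters, the match has zero length and the "remove the match" step has nothing to remove. After the first split, the delimiter therefore sits at the very start of the remaining string, matches again, and ends up as an element of its own. Luckily, the post also gave a solution which contains some regular expression magic: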

strsplit(x = x, split = "(?<=.)(?=[/&])", perl = TRUE)

[[1]]
[1] "3D"   "/MON" "&SUN"
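
The extra (?<=.) demands at least one character before the split position, so a delimiter at the very start of the remaining string is not matched a second time. A short sketch with a few made-up inputs, including one that starts with a delimiter:

```r
# Example inputs are invented for illustration
x <- c("3D/MON&SUN", "2W&FRI", "/MON&SUN")

# Split before each delimiter, but never at the very start of a string
parts <- strsplit(x, "(?<=.)(?=[/&])", perl = TRUE)

print(parts)
```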

So my problem is solved, but I would have to remember this regular expression … uurrghhh!

a new function: strsplit 2.0

If I have the chance to write a function which eases my work – I will do it! So I wrote my own strsplit with a new argument type = c("remove", "before", "after"). Basically, I just used the regular expression mentioned above and put it into an if-condition.
To sum it all up: Regular expressions are a powerful tool and you should try to learn and understand how they work!

strsplit <- function(x,
                     split,
                     type = c("remove", "before", "after"),
                     perl = FALSE,
                     ...) {
  # match.arg() uses "remove" by default and stops on any other value
  type <- match.arg(type)
  if (type == "remove") {
    # use base::strsplit, the delimiter is dropped
    out <- base::strsplit(x = x, split = split, perl = perl, ...)
  } else if (type == "before") {
    # split before the delimiter and keep it
    # (lookarounds need perl = TRUE, so the perl argument is ignored here)
    out <- base::strsplit(x = x,
                          split = paste0("(?<=.)(?=", split, ")"),
                          perl = TRUE,
                          ...)
  } else {
    # type == "after": split after the delimiter and keep it
    out <- base::strsplit(x = x,
                          split = paste0("(?<=", split, ")"),
                          perl = TRUE,
                          ...)
  }
  return(out)
}
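
For comparison, here is a more compact sketch of the same idea together with a usage example. The name strsplit_keep is hypothetical and chosen so that base::strsplit is not masked. Unlike the function above, this variant always uses perl = TRUE, which is an assumption that suits patterns like the one in this post.

```r
# Hypothetical variant: build the pattern with switch(), then split once
strsplit_keep <- function(x, split, type = c("remove", "before", "after"), ...) {
  type <- match.arg(type)
  pattern <- switch(type,
    remove = split,                            # drop the delimiter
    before = paste0("(?<=.)(?=", split, ")"),  # split before it and keep it
    after  = paste0("(?<=", split, ")")        # split after it and keep it
  )
  base::strsplit(x, pattern, perl = TRUE, ...)
}

x <- "3D/MON&SUN"
res_remove <- strsplit_keep(x, "[/&]")
res_before <- strsplit_keep(x, "[/&]", type = "before")
res_after  <- strsplit_keep(x, "[/&]", type = "after")

print(res_remove)
print(res_before)
print(res_after)
```

One caveat: Perl-like regular expressions only allow lookbehinds of fixed length, so type = "after" works with delimiters such as [/&] but not with variable-length patterns such as -+.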

About the author

Jakob Gepp

Jakob is part of the statistics team and is currently very interested in Hadoop and big data. In his free time he likes to tinker with old electrical appliances and plays hockey.

The post strsplit – but keeping the delimiter first appeared on STATWORX.
