R has a lot of string functions; many of them can be found with ls("package:base", pattern="str"). Additionally, there are add-on packages such as stringr, gsubfn and brew that enhance R's string processing capabilities. As a statistical language and environment, R has an edge over other programming languages when it comes to text mining algorithms or natural language processing. There is even a task view for this on CRAN.
I am currently playing with markdown files in R, which eventually will result in a new version of mdtools, and I collected or created some string functions that I would like to present in this blog post. The source code of the functions is at the end of the post; first I show how to use them.
Head and tail for strings
I had the idea for the first two functions earlier, and I had to learn that providing an S3 method for head and tail is not a good idea. But strhead and strtail proved handy. Here are some usage examples:
> strhead("egghead", 3)
[1] "egg"
> strhead("beagle", -1) # negative index
[1] "beagl"
> strtail(c("bowl", "snowboard"), 3) # vector-able in the first argument
[1] "owl" "ard"
These functions are only syntactic sugar, hopefully easy to memorize because of their similarity to existing R functions. For packages, they are probably not worth introducing an extra dependency. I thought about defining a replacement function as substr has, but I did not try it because head and tail do not have replacement functions.
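To make the behaviour in the examples above concrete, here is a minimal sketch built on substr and nchar. It is not the original source code; the names strhead_sketch and strtail_sketch are made up for illustration, and the sketch assumes that n is a single number.

```r
# Illustrative sketch only, not the original implementation.
# Assumption: n is a single number; s may be a character vector.
strhead_sketch <- function(s, n = 1) {
  if (n < 0) {
    substr(s, 1, nchar(s) + n)   # negative n: drop the last |n| characters
  } else {
    substr(s, 1, n)              # positive n: keep the first n characters
  }
}

strtail_sketch <- function(s, n = 1) {
  if (n < 0) {
    substr(s, 1 - n, nchar(s))             # negative n: drop the first |n| characters
  } else {
    substr(s, nchar(s) - n + 1, nchar(s))  # positive n: keep the last n characters
  }
}

strhead_sketch("egghead", 3)                   # "egg"
strtail_sketch(c("bowl", "snowboard"), 3)      # "owl" "ard"
```

Because substr and nchar are vectorised, the sketch is vectorised in its first argument without any extra work.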
Bare minimum template
With sprintf, format and pretty, R has powerful functions for formatting strings. However, sometimes I miss the named template syntax found in Python or in Makefiles. So I implemented this in R. Here are some usage examples:
> strsubst(
+   "$(WHAT) is $(HEIGHT) meters high.",
+   list(
+     WHAT="Berlin's teletower",
+     HEIGHT=348
+   )
+ )
[1] "Berlin's teletower is 348 meters high."
> d <- strptime("2012-03-18", "%Y-%m-%d")
> strsubst(c(
+   "Be careful with dates.",
+   "$(NO_CONV) shows a list.",
+   "$(CONV) is more helpful."),
+   list(
+     NO_CONV=d,
+     CONV= as.character(d)
+   )
+ )
[1] "Be careful with dates."
[2] "list(sec = 0, min = 0, hour = 0, mday = 18, mon = 2, year = 112, wday = 0, yday = 77, isdst = 0) shows a list."
[3] "2012-03-18 is more helpful."
The first argument can be a string or a vector of strings such as the output of readLines. The second argument can be any indexable object (i.e. one with a working [ operator) such as a list. Environments are not indexable and hence won’t work.
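One way such a template function could work is to loop over the names of the second argument and replace each $(NAME) placeholder literally. The following is a hedged sketch, not the original source code; strsubst_sketch is a made-up name, and the sketch assumes that every value converts to a single string.

```r
# Illustrative sketch only, not the original implementation.
# Assumption: `map` has names and supports `[`; each value becomes one string.
strsubst_sketch <- function(template, map) {
  for (key in names(map)) {
    placeholder <- paste0("$(", key, ")")
    value <- as.character(map[key])  # uses `[`, as described in the text
    # fixed = TRUE so that "$", "(" and ")" are matched literally
    template <- gsub(placeholder, value, template, fixed = TRUE)
  }
  template
}

strsubst_sketch(
  "$(WHAT) is $(HEIGHT) meters high.",
  list(WHAT = "Berlin's teletower", HEIGHT = 348)
)
```

Indexing with [ instead of [[ is what lets the same code accept both lists and named vectors. It also explains why a date-time object stored in a list gets deparsed as a list unless it is converted with as.character first.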
Parse raw text
Frequently, I need to extract parts from raw text data. For instance, a few weeks ago I had to parse an SPSS script (some variable labels were hard-coded there and not in the .sav file). The script contained lines of the form VARIABLE LABELS some_var "<some_label>". I was interested in some_var and <some_label>. The examples from the R documentation on regexpr pointed me in the right direction and led me to the strparse function, which is applied as follows:
> lines <- c(
+   'VARIABLE LABELS weight "weight".',
+   'VARIABLE LABELS altq "Year of birth".',
+   'VARIABLE LABELS hhg "Household size".',
+   'missing values all (-1).',
+   'EXECUTE.'
+ )
> pat <- 'VARIABLE LABELS (?<name>[^\\s]+) \\"(?<lbl>.*)\\".$'
> matches <- grepl(pat, lines, perl=TRUE)
> strparse(pat, lines[matches])
     name     lbl
[1,] "weight" "weight"
[2,] "altq"   "Year of birth"
[3,] "hhg"    "Household size"
The function returns a vector if one line was parsed and a matrix otherwise. It supports named groups.
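The approach from the regexpr documentation relies on the capture.start, capture.length and capture.names attributes that regexpr returns when perl=TRUE. Below is a minimal sketch of that technique. It is not the original source code; strparse_sketch is a made-up name, and the sketch assumes that the pattern contains at least one capture group and that every line matches.

```r
# Illustrative sketch only, not the original implementation.
# Assumptions: `pat` has at least one capture group; every element of x matches.
strparse_sketch <- function(pat, x) {
  m <- regexpr(pat, x, perl = TRUE)
  starts <- attr(m, "capture.start")              # one row per line, one column per group
  ends   <- starts + attr(m, "capture.length") - 1
  groups <- attr(m, "capture.names")
  # substring() recycles x over the column-major start/end positions
  parsed <- matrix(substring(x, starts, ends), nrow = length(x),
                   dimnames = list(NULL, groups))
  if (nrow(parsed) == 1) parsed[1, ] else parsed  # vector for a single line
}

pat <- 'VARIABLE LABELS (?<name>[^\\s]+) \\"(?<lbl>.*)\\".$'
strparse_sketch(pat, c('VARIABLE LABELS weight "weight".',
                       'VARIABLE LABELS altq "Year of birth".'))
```

Filtering with grepl beforehand, as in the example above, keeps non-matching lines out; in this sketch they would otherwise produce empty strings.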
Recoding with regular expressions
Sometimes I need to recode a vector of strings by finding all matches for a particular regular expression and replacing these matches with one string. Then I match all remaining strings with a second regular expression and replace the hits with a second replacement. And so on. I wrote the strrecode function to support this operation. The function can be seen as a generalisation of the gsub function. It is the only function without test code. Here is a made-up example analysing process information from the task manager:
> dat <- data.frame(
+   wtitle=c(paste(c("Inbox", "Starred", "All"), "- Google Mail"), paste("file", 1:4, "- Notepad++")),
+   pid=sample.int(9999,7),
+   exe=c(rep("chrome.exe",3), rep("notepad++.exe", 4))
+ )
> dat <- transform(
+   dat,
+   usage=strrecode(c("Google Mail$|Microsoft Outlook$", " - Notepad\\+\\+$|Microsoft Word$"), c("Mail", "Text"), dat$wtitle)
+ )
> dat
                 wtitle  pid           exe usage
1   Inbox - Google Mail 6810    chrome.exe  Mail
2 Starred - Google Mail 2488    chrome.exe  Mail
3     All - Google Mail 4086    chrome.exe  Mail
4    file 1 - Notepad++ 2946 notepad++.exe  Text
5    file 2 - Notepad++  112 notepad++.exe  Text
6    file 3 - Notepad++ 1176 notepad++.exe  Text
7    file 4 - Notepad++ 8881 notepad++.exe  Text
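The recoding procedure described above can be sketched as a loop over the patterns in which each string is assigned the replacement of the first pattern it matches. This is not the original source code; strrecode_sketch is a made-up name, and the otherwise argument, which decides what happens to strings matching no pattern, is an assumption of this sketch.

```r
# Illustrative sketch only, not the original implementation.
# Assumption: strings that match no pattern get the value of `otherwise`.
strrecode_sketch <- function(pats, repls, x, otherwise = NA_character_) {
  stopifnot(length(pats) == length(repls))
  res  <- rep(otherwise, length(x))
  done <- rep(FALSE, length(x))         # TRUE once a string has been recoded
  for (i in seq_along(pats)) {
    hit <- !done & grepl(pats[i], x)    # only strings not recoded so far
    res[hit] <- repls[i]
    done <- done | hit
  }
  res
}

strrecode_sketch(
  c("Google Mail$|Microsoft Outlook$", " - Notepad\\+\\+$|Microsoft Word$"),
  c("Mail", "Text"),
  c("Inbox - Google Mail", "file 1 - Notepad++", "Calculator")
)
```

Since earlier patterns take precedence, the order of the patterns matters whenever a string could match more than one of them.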
Interested in the source code of these helper functions? Read on.