
Canned Regular Expressions: qdapRegex 0.1.2 on CRAN

[This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers.]

We’re pleased to announce the first CRAN release of qdapRegex! You can read about qdapRegex or skip right to the examples.

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis. The package uses a dictionary system to uniformly perform extraction, removal, and replacement.  Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, person tags, phone numbers, times, and zip codes.

The qdapRegex package does not aim to compete with string manipulation packages such as stringr or stringi. Instead, it is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R's own regular expression functions, or with add-on string manipulation packages such as stringr and stringi.

You can download it from CRAN or from GitHub.

 

 

Examples

Let’s see qdapRegex in action. Functions starting with rm_ generally remove the canned regular expression pattern they are named for; with extract = TRUE they extract the matches instead. A replacement argument also allows for optional replacements.

URLs

library(qdapRegex)
x <- "I like www.talkstats.com and http://stackoverflow.com"

## Removal
rm_url(x)

## Extraction
rm_url(x, extract=TRUE)

## Replacement (the backslash must be doubled inside an R string)
rm_url(x, replacement = '<a href="\\1" target="_blank">\\1</a>')

## Removal
## [1] "I like and"
## 
## Extraction
## [[1]]
## [1] "www.talkstats.com"        "http://stackoverflow.com"
## 
## Replacement
## [1] "I like <a href=\"\" target=\"_blank\"></a> and <a href=\"http://stackoverflow.com\" target=\"_blank\">http://stackoverflow.com</a>"
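The empty href for the www address in the replacement output above follows from how backreferences work: \\1 refers only to the first capture group, and the www address is matched by a later alternative. Here is a minimal base R sketch of that behavior, assuming the three-alternative URL pattern that the pastex output in this post suggests rm_url uses. Wrapping the alternatives in one outer group is my own variation, not a qdapRegex feature.

```r
x <- "I like www.talkstats.com and http://stackoverflow.com"

# URL pattern as it appears in the pastex() output in this post
# (assumed here to be the pattern behind rm_url)
url_pattern <- "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

# \\1 is only the first group (http...), so a www match contributes nothing
first_group <- gsub(url_pattern, "<\\1>", x, perl = TRUE)

# Hypothetical variation: an outer group captures whichever alternative matched
outer_pattern <- paste0("(", url_pattern, ")")
whole_match <- gsub(outer_pattern, '<a href="\\1">\\1</a>', x, perl = TRUE)

print(first_group)
print(whole_match)
```

With the outer group in place, \\1 always holds the full URL regardless of which alternative matched.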

Twitter Hash Tags

x <- c("@hadley I like #rstats for #ggplot2 work.",
    "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats:
        http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
    "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization
        presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1"
)

rm_hash(x)
rm_hash(x, extract=TRUE)

## > rm_hash(x)
## [1] "@hadley I like for work."                                                                                                                                  
## [2] "Difference between and , both implement pipeline operators for : http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio"
## [3] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation . http://ramnathv.github.io/user2014-rcharts/#1" 

  
## > rm_hash(x, extract=TRUE)
## [[1]]
## [1] "#rstats"  "#ggplot2"
## 
## [[2]]
## [1] "#magrittr" "#pipeR"    "#rstats"  
## 
## [[3]]
## [1] "#user2014"

Emoticons

x <- c("are :-)) it 8-D he XD on =-D they :D of :-) is :> for :o) that :-/",
  "as :-D I xD with :^) a =D to =) the 8D and :3 in =3 you 8) his B^D was")
rm_emoticon(x)
rm_emoticon(x, extract=TRUE)

## > rm_emoticon(x)
## [1] "are it he on they of is for that"     
## [2] "as I with a to the and in you his was"


## > rm_emoticon(x, extract=TRUE)
## [[1]]
## [1] ":-))" "8-D"  "XD"   "=-D"  ":D"   ":-)"  ":>"   ":o)"  ":-/" 
## 
## [[2]]
##  [1] ":-D" "xD"  ":^)" "=D"  "=)"  "8D"  ":3"  "=3"  "8)"  "B^D"

Academic, APA 6 Style, Citations

x <- c("Hello World (V. Raptor, 1986) bye",
    "Narcissism is not dead (Rinker, 2014)",
    "The R Core Team (2014) has many members.",
    paste("Bunn (2005) said, \"As for elegance, R is refined, tasteful, and",
        "beautiful. When I grow up, I want to marry R.\""),
    "It is wrong to blame ANY tool for our own shortcomings (Baer, 2005).",
    "Wickham's (in press) Tidy Data should be out soon.",
    "Rinker's (n.d.) dissertation not so much.",
    "I always consult xkcd comics for guidance (Foo, 2012; Bar, 2014).",
    "Uwe Ligges (2007) says, \"RAM is cheap and thinking hurts\""
)

rm_citation(x)
rm_citation(x, extract=TRUE)

## > rm_citation(x)
## [1] "Hello World () bye"                                                                                  
## [2] "Narcissism is not dead ()"                                                                           
## [3] "has many members."                                                                                   
## [4] "said, \"As for elegance, R is refined, tasteful, and beautiful. When I grow up, I want to marry R.\""
## [5] "It is wrong to blame ANY tool for our own shortcomings ()."                                          
## [6] "Tidy Data should be out soon."                                                                       
## [7] "dissertation not so much."                                                                           
## [8] "I always consult xkcd comics for guidance (; )."                                                     
## [9] "says, \"RAM is cheap and thinking hurts\""

                                                      
## > rm_citation(x, extract=TRUE)
## [[1]]
## [1] "V. Raptor, 1986"
## 
## [[2]]
## [1] "Rinker, 2014"
## 
## [[3]]
## [1] "The R Core Team (2014)"
## 
## [[4]]
## [1] "Bunn (2005)"
## 
## [[5]]
## [1] "Baer, 2005"
## 
## [[6]]
## [1] "Wickham's (in press)"
## 
## [[7]]
## [1] "Rinker's (n.d.)"
## 
## [[8]]
## [1] "Foo, 2012" "Bar, 2014"
## 
## [[9]]
## [1] "Uwe Ligges (2007)"

Combining Regular Expressions

A user may wish to combine regular expressions. For example, one may want to extract all URLs and Twitter Short URLs. The verb pastex (paste + regex) pastes together regular expressions. It will also search the regex dictionaries for named regular expressions prefixed with an @. So…

pastex("@rm_twitter_url", "@rm_url")

yields…

## [1] "(https?://t\\.co[^ ]*)|(t\\.co[^ ]*)|(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

If we combine this ability with qdapRegex's function generator, rm_, we can make our own function that removes both standard URLs and Twitter Short URLs.

rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))

Let’s use it…

x <- c("download file from http://example.com",
         "this is the link to my website http://example.com",
         "go to http://example.com from more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_twitter_n_url(x)
rm_twitter_n_url(x, extract=TRUE)

## > rm_twitter_n_url(x)
## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"                   
## [5] "And"                            "twitter type:"                 
## [7] "still another one :-)"     

    
## > rm_twitter_n_url(x, extract=TRUE)
## [[1]]
## [1] "http://example.com"
## 
## [[2]]
## [1] "http://example.com"
## 
## [[3]]
## [1] "http://example.com"
## 
## [[4]]
## [1] "ftp://www.example.com"
## 
## [[5]]
## [1] "https://www.example.net"
## 
## [[6]]
## [1] "t.co/N1kq0F26tG"
## 
## [[7]]
## [1] "https://t.co/N1kq0F26tG"
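Because the combined pattern is an ordinary regular expression string, it can also be used with R's own regular expression functions. The sketch below is base R only and makes two assumptions: that the first two alternatives in the pastex output belong to the Twitter pattern and the last three to the URL pattern, and that joining them with "|" reproduces what pastex returns. The variable names are mine.

```r
# Sub-patterns copied from the pastex() output shown in this post
# (the split between the two dictionaries is an assumption)
twitter_url  <- "(https?://t\\.co[^ ]*)|(t\\.co[^ ]*)"
standard_url <- "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

# Joining with "|" mimics what pastex appears to do, judging by its output
combined <- paste(c(twitter_url, standard_url), collapse = "|")

x <- c("Another url ftp://www.example.com",
       "twitter type: t.co/N1kq0F26tG",
       "still another one https://t.co/N1kq0F26tG :-)")

# gregexpr() locates every match; regmatches() pulls the matched text out
urls <- regmatches(x, gregexpr(combined, x, perl = TRUE))
print(urls)
```

The result is a list with one character vector per input string, which is the same shape that extract=TRUE returns in the examples above.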

*Note that there is a binary operator version of pastex, %|%, that may be more convenient for some folks.

"@rm_twitter_url" %|% "@rm_url"

yields…

## > "@rm_twitter_url" %|% "@rm_url"
## [1] "(https?://t\\.co[^ ]*)|(t\\.co[^ ]*)|(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

Educational

Regular expressions can be extremely powerful but were difficult for me to grasp at first.

The qdapRegex package serves a dual purpose of being both functional and educational. While the canned regular expressions are useful in and of themselves, they also serve as a platform for understanding regular expressions in the context of meaningful, purposeful usage. In the same way I learned guitar while trying to mimic Eric Clapton, not by learning scales and theory, some folks may enjoy learning regular expressions through more pragmatic, experiential interaction. Users are encouraged to look at the regular expressions being used (?regex_usa and ?regex_supplement are the default regular expression dictionaries used by qdapRegex) and unpack how they work. I have found that slow, repeated exposure to information in a purposeful context results in acquired knowledge.
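As one way to practice this kind of unpacking, the sketch below builds a hash tag pattern in two steps using base R. Both patterns are my own illustrations and are not necessarily the entries stored in the qdapRegex dictionaries. The input is a fragment of the third tweet from the hash tag example, where the URL ends in "#1".

```r
tweet <- "presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1"

# Naive attempt: a "#" followed by one or more word characters.
# This also picks up the "#1" fragment at the end of the URL.
naive <- regmatches(tweet, gregexpr("#\\w+", tweet, perl = TRUE))[[1]]

# Refinement: the negative lookbehind (?<!/) rejects a "#" that
# directly follows a "/", which excludes the URL fragment.
refined <- regmatches(tweet, gregexpr("(?<!/)#\\w+", tweet, perl = TRUE))[[1]]

print(naive)
print(refined)
```

Comparing the two results shows why a canned pattern tends to carry more machinery than a first attempt would: each extra piece usually answers a specific edge case.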

The following regular expression sites were very helpful to my own regular expression education:

  1. Regular-Expression.info
  2. Rex Egg
  3. Regular Expressions as used in R

Being able to discuss and ask questions is also important to learning…in this case regular expressions. I have found the following forums extremely helpful to learning about regular expressions:

  1. Talk Stats + Posting Guidelines
  2. stackoverflow + Posting Guidelines

Acknowledgements

Thank you to the folks who have developed stringi (maintainer: Marek Gagolewski). The stringi package provides fast, consistent regular expression manipulation tools. qdapRegex uses the stringi package as a back-end in most of the rm_XXX functions.

We would also like to thank the many folks at http://www.talkstats.com and http://www.stackoverflow.com who freely give their time to answer questions on many topics, including regular expressions.

