Keeping rows containing particular strings in R

Todd Connelly

7 years ago

[This article was first published on Bearded Analytics » R, and kindly contributed to R-bloggers.]

I was recently presented with the need to keep only those rows in my dataset that contained certain strings. I needed to retain any row that contained “utm_source”, “utm_medium”, and “utm_campaign”. Each row in my dataset was a single string. The idea is to parse only the strings of interest. My approach was to use grepl and check each string against each condition that it needed to satisfy. I consulted with my co-blogger to see if he had a more intelligent way of approaching this problem. He tackled it with a single regular expression using look-aheads. You can see my ‘checker’ function below, along with Jeremy’s function, ‘checker2’. Both seem to perform the required task correctly, so now it is simply a matter of performance.

#Sample Data
querystrings <- c("skuId=34567-02-S&qty=1&continueShoppingUrl=http://www.beardedanalytics.com/?utm_source=ER&utm_medium=email&utm_content=Main&utm_campaign=ER101914G_greenlogoupper&cm_lm=foo@person.invalid&codes-processed=true&qtyAvailableWithCartContents=True&basketcode=mybasket1&OrderEventCreateDateTimeLocal=2014-10-2011:06:04.937", 
"skuId=6950K-02-S&qty=1&continueShoppingUrl=http://www.beardedanalytics.com/&utm_medium=email&utm_content=Main&utm_campaign=ER101914G_greenlogoupper&cm_lm=foo2@person.invalid&codes-processed=true&qtyAvailableWithCartContents=True&basketcode=mybasket2&OrderEventCreateDateTimeLocal=2014-10-2011:06:04.937"
)

mydf <- as.data.frame(querystrings)


# Returns TRUE for each string that satisfies all three conditions
checker <- function(foo) {
  grepl(pattern = "utm_source", x = foo) &
    grepl(pattern = "utm_medium", x = foo) &
    grepl(pattern = "utm_campaign", x = foo)
}

# Same test as a single regular expression with three look-aheads
checker2 <- function(foo) {
  grepl(pattern = "^(?=.*utm_source)(?=.*utm_medium)(?=.*utm_campaign).*$",
        x = foo, perl = TRUE)
}
# This is the loop that was run to repeatedly test each function with a much larger dataset
# Yes, I know this is not an efficient way to do this but it is easy to read.
ttime <- c()
for (i in 1:100) {
  tt <- system.time(tresult <- mydf[checker(mydf[, 1]), ])
  ttime <- rbind(ttime, tt[3])  # tt[3] is the elapsed time in seconds
}


jtime <- c()
for (i in 1:100) {
  jt <- system.time(jresult <- mydf[checker2(mydf[, 1]), ])
  jtime <- rbind(jtime, jt[3])
}

mean(ttime)
mean(jtime)
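
Since the comparison only makes sense if the two functions return the same answers, a quick agreement check is useful. The following is a minimal sketch rather than part of the original analysis: it redefines both functions so that it runs on its own, and the strings in `examples` are hypothetical, made up to cover the parameters appearing in a different order and one parameter being absent.

```r
checker <- function(foo) {
  grepl("utm_source", foo) &
    grepl("utm_medium", foo) &
    grepl("utm_campaign", foo)
}

checker2 <- function(foo) {
  grepl("^(?=.*utm_source)(?=.*utm_medium)(?=.*utm_campaign).*$",
        foo, perl = TRUE)
}

# Hypothetical test strings (not taken from the real dataset)
examples <- c(
  "a=1&utm_source=ER&utm_medium=email&utm_campaign=fall",  # all three present
  "utm_campaign=fall&utm_medium=email&utm_source=ER",      # all three, reversed order
  "a=1&utm_medium=email&utm_campaign=fall",                # utm_source missing
  "no tracking parameters here"                            # none present
)

res1 <- checker(examples)
res2 <- checker2(examples)

# TRUE if both approaches flag exactly the same strings
identical(res1, res2)
```

Agreement on a handful of strings is not proof, but it does cover the two properties that matter here: neither approach depends on the order of the parameters, and both reject a string that lacks any one of them.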

I am not able to share the full dataset that I was using, due to privacy concerns. The dataset that I tested both functions against had 26,746 rows. The ‘checker’ function that I wrote took 0.0801 seconds on average, and Jeremy’s approach took 0.1488 seconds. I decided to stick with my checker function, but not because of speed: had the times been reversed, I would have happily accepted the extra computation time, because I find mine easier to read. That means there is a chance I could come back to this code in 6 months and have a clue about what it is supposed to be doing. Regular expressions can sometimes be quite hard to come back to and say, “oh yeah, I wanted to check if all the characters that occupy prime-numbered positions in my string are vowels!” I think that my simple grepl statements will be easier to change if that becomes necessary in the future, so I will stick with the ‘checker’ approach. Do you have a better way to approach this using R? If so, make sure to post a comment.
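
One possible way to make the grepl approach even easier to change is to keep the required strings in a vector and combine one grepl call per string. This is only a sketch under stated assumptions: `contains_all`, `wanted`, and `urls` are hypothetical names that do not appear in the original code, and the strings are matched literally with `fixed = TRUE` because none of them needs regular-expression syntax. It has not been timed against the dataset described above.

```r
# Hypothetical helper: TRUE for each element of x that contains
# every string in patterns (matched literally, not as a regex).
contains_all <- function(x, patterns) {
  hits <- lapply(patterns, grepl, x = x, fixed = TRUE)
  Reduce(`&`, hits)  # element-wise AND across all patterns
}

wanted <- c("utm_source", "utm_medium", "utm_campaign")

# Hypothetical query strings for illustration
urls <- c(
  "x?utm_source=ER&utm_medium=email&utm_campaign=fall",
  "x?utm_medium=email&utm_campaign=fall"
)

keep <- contains_all(urls, wanted)
urls[keep]
```

Adding or dropping a condition then means editing `wanted` rather than the function body, which keeps the readability that motivated the choice of ‘checker’.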
