Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I recently needed to filter the rows in my dataset based on whether they contained certain strings. I needed to retain any row that had “utm_source”, “utm_medium”, and “utm_campaign”. Each row in my dataset was a single string, so the idea is to search each string for the substrings of interest. My approach was to use grepl and check each string for each condition that I needed it to satisfy. I consulted with my co-blogger to see if he had a more intelligent way of approaching this problem. He tackled it with a single regular expression using look-aheads. You can see my ‘checker’ function below, along with Jeremy’s function ‘checker2’. Both seem to perform the required task correctly, so now it is simply a matter of performance.
# Sample Data
querystrings <- c("skuId=34567-02-S&qty=1&continueShoppingUrl=http://www.beardedanalytics.com/?utm_source=ER&utm_medium=email&utm_content=Main&utm_campaign=ER101914G_greenlogoupper&cm_lm=foo@person.invalid&codes-processed=true&qtyAvailableWithCartContents=True&basketcode=mybasket1&OrderEventCreateDateTimeLocal=2014-10-2011:06:04.937",
                  "skuId=6950K-02-S&qty=1&continueShoppingUrl=http://www.beardedanalytics.com/&utm_medium=email&utm_content=Main&utm_campaign=ER101914G_greenlogoupper&cm_lm=foo2@person.invalid&codes-processed=true&qtyAvailableWithCartContents=True&basketcode=mybasket2&OrderEventCreateDateTimeLocal=2014-10-2011:06:04.937")

mydf <- as.data.frame(querystrings)

# This Should return TRUE when all conditions have been satisfied
checker <- function(foo){
  grepl(pattern="utm_source", x=foo) &
    grepl(pattern="utm_medium", x=foo) &
    grepl(pattern="utm_campaign", x=foo)
}

checker2 <- function(foo){
  grepl(pattern="^(?=.*utm_source)(?=.*utm_medium)(?=.*utm_campaign).*$", x=foo, perl=TRUE)
}

# This is the loop that was run to repeatedly test each function with a much larger dataset
# Yes, I know this is not an efficient way to do this but it is easy to read.
ttime=c()
for( i in 1:100){
  tt <- system.time( tresult <- mydf[checker(mydf[, 1]), ] )
  ttime = rbind(ttime, tt[3])
}

jtime=c()
for( i in 1:100){
  jt <- system.time( jresult <- mydf[checker2(mydf[, 1]), ] )
  jtime = rbind(jtime, jt[3])
}

mean(ttime)
mean(jtime)
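For anyone who wants to try the comparison without the original data, here is a rough sketch of the same timing idea using replicate() instead of the hand-rolled loops. The strings in fake_strings are made up for illustration (they are an assumption, not the real dataset), and the number of runs is arbitrary, so the timings will not match the figures reported in this post.

```r
# Synthetic stand-in data (an assumption): half the rows carry all three tags.
with_all <- "sku=1&utm_source=ER&utm_medium=email&utm_campaign=demo"
missing_source <- "sku=2&utm_medium=email&utm_campaign=demo"
fake_strings <- rep(c(with_all, missing_source), times = 5000)

# The same two checks as in the post.
checker <- function(foo) {
  grepl(pattern = "utm_source", x = foo) &
    grepl(pattern = "utm_medium", x = foo) &
    grepl(pattern = "utm_campaign", x = foo)
}
checker2 <- function(foo) {
  grepl(pattern = "^(?=.*utm_source)(?=.*utm_medium)(?=.*utm_campaign).*$",
        x = foo, perl = TRUE)
}

# replicate() collects the elapsed time of each run, so there is no need
# to grow a result object by hand inside a for loop.
n_runs <- 20
ttime <- replicate(n_runs, system.time(checker(fake_strings))[["elapsed"]])
jtime <- replicate(n_runs, system.time(checker2(fake_strings))[["elapsed"]])

mean(ttime)
mean(jtime)

# Both approaches should flag exactly the same rows.
identical(checker(fake_strings), checker2(fake_strings))
```

Elapsed times this small are noisy, so it is worth running the comparison a few times before reading much into the difference.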
I am not able to share the full dataset that I was using, due to privacy concerns. The dataset that I tested both functions against had 26,746 rows. The ‘checker’ function, which I wrote, took 0.0801 seconds on average, and Jeremy’s approach took 0.1488 seconds. I decided to stick with my checker function, but not because of speed. I would have happily accepted the increased computation time for mine if the times had been reversed, because I find mine easier to read. This means that there is a chance that I could come back to this code in 6 months and have a clue about what it is supposed to be doing. Regular expressions can sometimes be quite hard to come back to and say, “Oh yeah, I wanted to check if all the characters that occupy prime-numbered positions in my string are vowels!” I think that my simple grepl statement will be easier to change if that becomes needed in the future, so I will stick with the ‘checker’ approach. Do you have a better way to approach this using R? If so, make sure to post a comment.
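On the point about keeping the check easy to change, one possible variation is to pass the required terms in as a vector, so that adding or removing a condition does not mean editing the function body. This is only a sketch: has_all_terms, required, and examples are hypothetical names introduced here, not part of the original code. It also uses fixed = TRUE, which makes grepl match each term literally rather than as a regular expression.

```r
# Hypothetical helper: TRUE for each string in x that contains every term.
has_all_terms <- function(x, terms) {
  # One logical vector per term; fixed = TRUE matches the term literally
  # instead of treating it as a regular expression.
  hits <- lapply(terms, function(term) grepl(term, x, fixed = TRUE))
  # Combine the per-term results with an element-wise AND.
  Reduce(`&`, hits)
}

required <- c("utm_source", "utm_medium", "utm_campaign")

# Made-up example strings: the second one has no utm_source.
examples <- c(
  "a=1&utm_source=ER&utm_medium=email&utm_campaign=demo",
  "a=2&utm_medium=email&utm_campaign=demo"
)

has_all_terms(examples, required)
```

For these two example strings the result is TRUE for the first and FALSE for the second, the same answer the ‘checker’ function gives for strings shaped like the sample data.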