R: Stem (Pre-Processed) Text Blocks
[This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers].
Objective
I recently needed to stem every word in a block of text, i.e. reduce each word to its root form.
Problem
The stemmer I was using stemmed only the last word in each block of text, e.g.

    library(SnowballC)
    wordStem('walk walks walked walking walker walkers', language = 'en')
    # [1] "walk walks walked walking walker walk"
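This happens because wordStem() expects a character vector in which each element is a single word, so a whole block of text is treated as one long "word" and only its ending is altered. A minimal sketch, assuming the SnowballC package is installed, of the stemmer applied to words passed as separate elements:

```r
library(SnowballC)

# Each element of the vector is treated as one word by wordStem()
words <- c("walk", "walks", "walked", "walking")
stems <- wordStem(words, language = "en")
print(stems)
```

Here every element should come back as the stem "walk", which is the behaviour wanted for each word inside a block of text.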
Solution
I wrote a function which splits a block of text into individual words, stems each word, and then recombines the stemmed words into a single block of text:

    library(SnowballC)  # provides wordStem()
    library(parallel)   # provides mclapply()

    stem_text <- function(text, language = "porter", mc.cores = 1) {
      # stem each word in a block of text
      stem_string <- function(str, language) {
        str <- strsplit(x = str, split = "\\s")
        str <- wordStem(unlist(str), language = language)
        str <- paste(str, collapse = " ")
        return(str)
      }

      # stem each text block in turn
      # (mclapply() supports mc.cores > 1 only on Unix-alikes, not on Windows)
      x <- mclapply(X = text, FUN = stem_string, language = language, mc.cores = mc.cores)

      # return stemmed text blocks
      return(unlist(x))
    }
This works under the assumption that the text contains only words and whitespace (i.e. it has been appropriately pre-processed).
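What counts as appropriate pre-processing depends on the data. As a rough sketch only, the helper below (preprocess_text is a hypothetical name, not part of the original post or of SnowballC) uses base R to replace anything that is not a letter or whitespace with a space and then tidies up the spacing:

```r
# Hypothetical helper: reduce text to letters separated by single spaces.
# Assumption: punctuation and digits can simply be discarded for this use case.
preprocess_text <- function(text) {
  # replace every character that is not a letter or whitespace with a space
  text <- gsub("[^[:alpha:][:space:]]", " ", text)
  # collapse runs of whitespace into a single space
  text <- gsub("\\s+", " ", text)
  # drop leading and trailing whitespace
  trimws(text)
}

cleaned <- preprocess_text("Never ignore coincidence, unless (of course) you are busy!")
print(cleaned)
```

Note that this splits contractions such as "you're" into two tokens, so text containing apostrophes may need extra handling before stemming.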
    # Blocks of text
    sentences <- c('walk walks walked walking walker walkers',
                   'Never ignore coincidence unless of course you are busy In which case always ignore coincidence')

    # Stem blocks of text
    stem_text(sentences, language = 'en', mc.cores = 2)
    # [1] "walk walk walk walk walker walker"
    # [2] "Never ignor coincid unless of cours you are busi In which case alway ignor coincid"