R: Stem (Pre-Processed) Text Blocks
[This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers].
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Objective
I recently needed to stem every word in a block of text, i.e. reduce each word to its root form.
Problem
The stemmer I was using would only stem the last word in each block of text, e.g.:
require(SnowballC)
wordStem('walk walks walked walking walker walkers', language = 'en')
# [1] "walk walks walked walking walker walk"
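This behaviour follows from how `wordStem` reads its input: it takes a character vector and treats each element as one word, so a whole sentence passed as a single string is handled as if it were one long word. A minimal sketch of the same words passed as separate elements (the variable names `words` and `stems` are only illustrative):

```r
library(SnowballC)

# Each element of the vector is treated as one word by wordStem()
words <- c("walk", "walks", "walked", "walking", "walker", "walkers")
stems <- wordStem(words, language = "en")
print(stems)
```

Every element is stemmed independently here, which is the behaviour the solution below builds on.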
Solution
I wrote a function which splits a block of text into individual words, stems each word, and then recombines the words into a block of text:
library(parallel)   # provides mclapply
library(SnowballC)  # provides wordStem
stem_text <- function(text, language = "porter", mc.cores = 1) {
  # stem each word in a block of text
  stem_string <- function(str, language) {
    # split on whitespace (the backslash must be escaped in an R string)
    str <- strsplit(x = str, split = "\\s")
    str <- wordStem(unlist(str), language = language)
    str <- paste(str, collapse = " ")
    return(str)
  }
  # stem each text block in turn
  x <- mclapply(X = text, FUN = stem_string, language = language, mc.cores = mc.cores)
  # return stemmed text blocks
  return(unlist(x))
}
This works under the assumption that the text contains only words and whitespace (i.e. it has been appropriately pre-processed).
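What counts as "appropriately pre-processed" is not spelled out here. As a rough sketch only, one base R way to reduce raw text to words and single spaces might look like the following. The helper name `clean_text` is hypothetical, and lower-casing is an assumption rather than something the function above requires:

```r
# Hypothetical helper: reduce raw text to lower-case words separated by single spaces
clean_text <- function(text) {
  text <- tolower(text)                              # assumption: case is not needed
  text <- gsub("[^[:alpha:][:space:]]", " ", text)   # replace digits and punctuation with spaces
  text <- gsub("\\s+", " ", text)                    # collapse runs of whitespace
  trimws(text)                                       # drop leading and trailing whitespace
}

cleaned <- clean_text("  Never ignore coincidence -- unless, of course, you are busy!  ")
print(cleaned)
# [1] "never ignore coincidence unless of course you are busy"
```

Replacing punctuation with a space rather than deleting it keeps words such as "unless,of" apart, at the cost of splitting contractions like "don't" into two tokens.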
# Blocks of text
sentences <- c('walk walks walked walking walker walkers',
'Never ignore coincidence unless of course you are busy In which case always ignore coincidence')
# Stem blocks of text
stem_text(sentences, language = 'en', mc.cores = 2)
# [1] "walk walk walk walk walker walker"
# [2] "Never ignor coincid unless of cours you are busi In which case alway ignor coincid"
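One caveat: `mclapply` relies on forking, so `mc.cores` greater than 1 is not supported on Windows. A minimal sequential sketch of the same split, stem, and recombine idea is shown below. The name `stem_text_seq` is hypothetical, and it makes the same assumption about pre-processed input:

```r
library(SnowballC)

# Hypothetical sequential variant of stem_text(); no parallel backend needed
stem_text_seq <- function(text, language = "porter") {
  vapply(
    text,
    function(block) {
      words <- unlist(strsplit(block, split = "\\s"))              # split block into words
      paste(wordStem(words, language = language), collapse = " ")  # stem, then rejoin
    },
    character(1),
    USE.NAMES = FALSE
  )
}

result <- stem_text_seq("walk walks walked walking walker walkers", language = "en")
print(result)
```

Using `vapply` with `character(1)` guarantees one string per input block, so the result is a plain character vector without a separate `unlist` step.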