R: Stem (Pre-Processed) Text Blocks

[This article was first published on Consistently Infrequent » R, and kindly contributed to R-bloggers.]

Objective

I recently needed to stem every word in a block of text, i.e. reduce each word to its root form.

Problem

The stemmer I was using (wordStem from the SnowballC package) treats each string it is given as a single word, so it would only stem the last word in each block of text, e.g.

require(SnowballC)

wordStem('walk walks walked walking walker walkers', language = 'en')
# [1] "walk walks walked walking walker walk"
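
For comparison, here is a small sketch of what the stemmer does when the same words are passed as separate elements of a character vector, which is the input shape wordStem is designed for. The variable names words and stems are only illustrative.

```r
library(SnowballC)

# One word per element, so each word is stemmed on its own
words <- c("walk", "walks", "walked", "walking", "walker", "walkers")
stems <- wordStem(words, language = "en")

print(stems)
# [1] "walk"   "walk"   "walk"   "walk"   "walker" "walker"
```

So the stemmer itself is fine; the block of text just needs to be split into words first.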

Solution

I wrote a function which splits a block of text into individual words, stems each word, and then recombines the stemmed words into a block of text:

library(SnowballC)  # wordStem
library(parallel)   # mclapply

stem_text <- function(text, language = "porter", mc.cores = 1) {
  # stem each word in a single block of text
  stem_string <- function(str, language) {
    words <- strsplit(x = str, split = "\\s+")  # split on runs of whitespace
    stems <- wordStem(unlist(words), language = language)
    paste(stems, collapse = " ")
  }

  # stem each text block in turn (mc.cores > 1 is not supported on Windows)
  x <- mclapply(X = text, FUN = stem_string, language = language, mc.cores = mc.cores)

  # return stemmed text blocks
  unlist(x)
}

This works under the assumption that each block contains only words and whitespace (i.e. it has already been appropriately pre-processed).
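
If the text has not been cleaned yet, a pre-processing step along the following lines might be enough. This is only a rough sketch in base R: preprocess_text is a hypothetical helper, not part of SnowballC or of the original code, and it assumes that anything other than letters and whitespace can simply be replaced by a space.

```r
# Hypothetical helper: keep only letters and single spaces
preprocess_text <- function(text) {
  # replace runs of anything that is not a letter or whitespace with a space
  text <- gsub("[^[:alpha:][:space:]]+", " ", text)
  # collapse runs of whitespace into a single space
  text <- gsub("[[:space:]]+", " ", text)
  # drop leading and trailing whitespace
  trimws(text)
}

cleaned <- preprocess_text("Never ignore coincidence, unless (of course) you are busy!")
print(cleaned)
# [1] "Never ignore coincidence unless of course you are busy"
```

Note that this splits contractions and hyphenated words (e.g. "don't" becomes "don t"), so a real pipeline may want to treat those cases differently.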

# Blocks of text
sentences <- c('walk walks walked walking walker walkers',
               'Never ignore coincidence unless of course you are busy In which case always ignore coincidence')

# Stem blocks of text
stem_text(sentences, language = 'en', mc.cores = 2)

# [1] "walk walk walk walk walker walker"
# [2] "Never ignor coincid unless of cours you are busi In which case alway ignor coincid"
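
Because mclapply relies on forking, mc.cores greater than 1 is not supported on Windows. Where parallelism is not needed, a sequential variant can be sketched with vapply instead. The name stem_text_seq is hypothetical and the function is only an illustration of the same split, stem, recombine idea.

```r
library(SnowballC)

# Hypothetical sequential variant of stem_text (no parallel package needed)
stem_text_seq <- function(text, language = "porter") {
  vapply(text, function(block) {
    # split the block on runs of whitespace, stem each word, then rejoin
    words <- unlist(strsplit(block, "\\s+"))
    paste(wordStem(words, language = language), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

stemmed <- stem_text_seq("walk walks walked walking walker walkers", language = "en")
print(stemmed)
# [1] "walk walk walk walk walker walker"
```

vapply also guarantees that exactly one string comes back per text block, which makes the return type a little more predictable than unlist on a list.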
