Stemming and Spell Checking in R

Jeroen Ooms

6 years ago

[This article was first published on OpenCPU, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last week we introduced the new hunspell R package. This week a new version was released which adds support for additional languages and text analysis features.

Additional languages

By default hunspell uses the US English dictionary en_US but the new version allows for checking and analyzing in other languages as well. The ?hunspell help page has detailed instructions on how to install additional dictionaries.

> library(hunspell)
> hunspell_info("ru_RU")
$dict
[1] "/Users/jeroen/workspace/hunspell/tests/testdict/ru_RU.dic"

$encoding
[1] "UTF-8"

$wordchars
[1] NA

> hunspell("чёртова карова", dict = "ru_RU")[[1]]
[1] "карова"

It turned out this feature was much more difficult to implement than I expected. Much of the Hunspell library dates from before UTF-8 became popular and therefore many dictionaries use local 8 bit character encodings such as ISO-8859-1 for English and KOI8-R for Russian. To spell check in these languages, the character encoding of the document text has to match that of the dictionary. However R only supports latin and UTF-8 so we need to convert strings in C with iconv, which opens up a new can of worms. Anyway it should all work now.

@opencpu hunspell_stem could be very useful in interpretation issues of e.g. #wordclouds.
— Jelle Geertsma (@rdatasculptor) March 14, 2016

Text analysis and wordclouds

In last weeks post we showed how to parse and spell check a latex file:

# Check an entire latex document
library(hunspell)
setwd(tempdir())
download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz",  mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
bad_words <- hunspell(text, format = "latex")
sort(unique(unlist(bad_words)))

The new version also exposes the parser directly, so you can easily extract words and derive the stems to summarize some text, for example to display in a wordcloud.

# Summarize text by stems (e.g. for wordcloud)
allwords <- hunspell_parse(text, format = "latex")
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- head(sort(table(stems), decreasing = TRUE), 200)

To leave a comment for the author, please follow the link and comment on their blog: OpenCPU.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.