Hunspell: Spell Checker and Text Parser for R
Hunspell is the spell checker library used in LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, and a few more. Base R has some spell checking functionality via the aspell function, which wraps the aspell or hunspell command line program on supported systems. The new hunspell R package, on the other hand, links directly to the hunspell C++ library and works on all platforms without installing additional dependencies.
Basic tools
The hunspell_check function takes a vector of words and checks each individual word for correctness.
library(hunspell)
words <- c("beer", "wiskey", "wine")
hunspell_check(words)
## [1] TRUE FALSE TRUE
The hunspell_find function takes a character vector with text (in plain, latex or man format) and returns a list with incorrect words for each line.
bad_words <- hunspell_find("spell checkers are not neccessairy for langauge ninja's")
print(bad_words)
## [1] "neccessairy" "langauge" "ninja's"
Finally, hunspell_suggest is used to suggest correct alternatives for each (incorrect) input word.
hunspell_suggest(bad_words[[1]])
## [[1]]
## [1] "necessary" "necessarily" "necessaries" "recessionary" "accessory" "incarcerate"
##
## [[2]]
## [1] "language" "Langeland" "Lagrange" "Lange" "gaugeable" "linkage" "Langland"
##
## [[3]]
## [1] "ninjas" "Janina's" "Nina's" "ninja" "Janine's" "meninx" "nark's"
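The three functions compose naturally. As a minimal sketch (the helper name first_suggestion is made up for illustration and is not part of the package), one could check a vector of words and keep the top suggestion for each word that fails:

```r
library(hunspell)

# Hypothetical helper, not part of hunspell: return the first suggestion
# for each misspelled word, or NA when hunspell offers none.
first_suggestion <- function(words) {
  ok <- hunspell_check(words)                  # logical vector, one per word
  bad <- words[!ok]
  suggestions <- hunspell_suggest(bad)         # list, one element per bad word
  top <- vapply(
    suggestions,
    function(s) if (length(s)) s[1] else NA_character_,
    character(1)
  )
  setNames(top, bad)                           # name each suggestion by its input
}

first_suggestion(c("beer", "wiskey", "wine"))
```

Taking the first suggestion is only a convenience here; hunspell's ranking is a heuristic, so a real workflow would probably show all candidates to a human.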
Parsing text
The first challenge in spell-checking is extracting individual words from formatted text. The hunspell_find function supports three parsers via the format parameter: plain text, latex and man. For example, to check the OpenCPU paper for spelling errors we use the latex source code:
download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz", mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
words <- hunspell_find(text, format = "latex")
sort(unique(unlist(words)))
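To see what the parser choice changes without downloading anything, here is a small sketch on a made-up LaTeX fragment (the text and the citation key are invented for illustration). It assumes the hunspell_find interface described in this post and simply runs the same lines through the latex parser and the default plain-text parser so the two results can be compared:

```r
library(hunspell)

# A made-up LaTeX fragment; \section, \textbf and \cite are markup, not prose.
tex <- c(
  "\\section{Introduction}",
  "This is a \\textbf{smal} example with a citation \\cite{ooms2014}."
)

as_latex <- hunspell_find(tex, format = "latex")  # parse as LaTeX
as_plain <- hunspell_find(tex)                    # default plain-text parsing

# Each result is a list with one element per input line.
print(as_latex)
print(as_plain)
```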
Base R also has a few filters to extract words from R, Sweave or Rd code; see RdTextFilter and SweaveTeXFilter in the tools package. For example, to check your R package manual for typos (assuming you are in the package source directory):
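# Loop over every Rd file in the man directory of the package
for(file in list.files("man", full.names = TRUE)){
  cat("\nFile", file, ":\n ")
  # Strip Rd markup so that only the prose is spell checked
  txt <- tools::RdTextFilter(file, keepSpacing = FALSE)
  cat(sQuote(sort(unique(unlist(hunspell_find(txt))))), sep = ", ")
}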
Morphological analysis
A cool feature in hunspell is the morphological analysis. The hunspell_analyze function will show you how a word breaks down into a valid stem plus affix. Hunspell uses a special dictionary format that defines which stems and affixes are valid in a given language.
For example, suppose we take a few variations of the word love. To get the possible stem and affix combinations for each word:
hunspell_analyze(c("love", "loving", "lovingly", "loved", "lover", "lovely", "love"))
## [1] " st:love"
## [1] " st:loving" " st:love fl:G"
## [1] " st:lovingly"
## [1] " st:loved" " st:love fl:D"
## [1] " st:lover" " st:love fl:R"
## [1] " st:lovely" " st:love fl:Y"
## [1] " st:love"
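The analysis strings are easy to pick apart with base R. The sketch below works on two strings copied from the output above and assumes, based only on that output, that st: marks the stem and fl: marks an affix flag; it is not an official parser for the format.

```r
# Two analysis strings copied from the output for "loving" above.
analysis <- c(" st:loving", " st:love fl:G")

# Pull out whatever follows "st:" up to the next whitespace.
stem <- sub(".*st:(\\S+).*", "\\1", analysis)

# Pull out the flag after "fl:" when present, otherwise NA.
flag <- ifelse(
  grepl("fl:", analysis, fixed = TRUE),
  sub(".*fl:(\\S+).*", "\\1", analysis),
  NA_character_
)

data.frame(stem = stem, flag = flag)
```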
Alternatively, the hunspell_stem function returns only the stem. Not sure how you would use this but it’s certainly cool.
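One possible use, offered as a sketch rather than a recommendation, is collapsing word forms to a common stem before counting them. The code assumes hunspell_stem returns a list with the candidate stems for each word, in the same order as the hunspell_analyze output above. Keeping the last candidate is an arbitrary choice made for this illustration, not a hunspell rule.

```r
library(hunspell)

words <- c("love", "loving", "loved", "lover", "lovely")
stems <- hunspell_stem(words)   # list: candidate stems for each word

# Illustrative choice: keep the last candidate stem, falling back to the
# word itself when hunspell finds no stem at all.
pick <- mapply(
  function(w, s) if (length(s)) s[length(s)] else w,
  words, stems,
  USE.NAMES = FALSE
)

table(pick)   # how many input words map to each chosen stem
```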
Thanks!
Thanks to Daniel Falbel for suggesting this package on the rOpenSci forums!