Spelling 1.0: quick and effective spell checking in R

rOpenSci - open tools for open science

5 years ago

This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.

The new rOpenSci spelling package provides utilities for spell checking common document formats including LaTeX, markdown, manual pages, and DESCRIPTION files. It also includes tools aimed specifically at package authors, to automate spell checking of R documentation and vignettes.

Spell Checking Packages

The main purpose of this package is to quickly find spelling errors in R packages. The spell_check_package() function extracts all text from your package manual pages and vignettes, checks it against the dictionary for a given language (e.g. en_US or en_GB), and lists potential errors in a nice tidy format:

> spelling::spell_check_package("~/workspace/writexl")
  WORD       FOUND IN
booleans   write_xlsx.Rd:21
xlsx       write_xlsx.Rd:6,18
           title:1
           description:1

Results may contain false positives, i.e. names or technical jargon that do not appear in the English dictionary. To handle these, you can create a WORDLIST file, which serves as a package-specific dictionary of allowed words:

> spelling::update_wordlist("~/workspace/writexl")
The following words will be added to the wordlist:
 - booleans
 - xlsx
Are you sure you want to update the wordlist?
1: Yes
2: No

Words added to this file are ignored in the spell check, making it easier to catch actual spelling errors:

> spell_check_package("~/workspace/writexl")
No spelling errors found.
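
The effect of the WORDLIST can be pictured with a few lines of base R. This is only a conceptual sketch, not the package's actual implementation: it assumes the wordlist is a plain-text file with one allowed word per line, and the flagged words and the temporary file are made up for illustration.

```r
# Hypothetical words flagged by a spell checker ("sentense" is a real typo)
flagged <- c("booleans", "xlsx", "sentense")

# Write a throwaway wordlist to a temp file, one allowed word per line
wordlist_path <- tempfile("WORDLIST")
writeLines(c("booleans", "xlsx"), wordlist_path)

# Read the allowed words and drop them from the flagged set
allowed <- readLines(wordlist_path)
remaining <- setdiff(flagged, allowed)

# Only the genuine typo is left
print(remaining)
```

Once the jargon is filtered out, whatever remains is much more likely to be an actual mistake.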

The package also includes a cool function, spell_check_setup(), which adds a unit test to your package that automatically runs the spell check.

> spelling::spell_check_setup("~/workspace/writexl")
No changes required to /Users/jeroen/workspace/writexl/inst/WORDLIST
Updated /Users/jeroen/workspace/writexl/tests/spelling.R

By default this unit test will never actually fail; it merely displays potential spelling errors at the end of an R CMD check. But you can configure it to fail if you’d like, which can be useful to automatically highlight spelling errors on, e.g., Travis CI.
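
As a rough sketch of that configuration, the test file could look like the fragment below. It assumes the test is driven by spelling::spell_check_test() and its error argument; the file generated by your installed version may look slightly different, so treat this as an illustration rather than the exact template.

```r
# tests/spelling.R (sketch; the generated file may differ between versions)
if (requireNamespace("spelling", quietly = TRUE)) {
  # error = TRUE makes reported spelling errors fail the test
  # instead of only printing them at the end of R CMD check
  spelling::spell_check_test(vignettes = TRUE, error = TRUE)
}
```

The requireNamespace() guard keeps the check from breaking on machines where the spelling package is not installed.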

Under the Hood

The spelling package builds on hunspell, which provides a fully customizable spell checking engine. Most of the code in the spelling package is dedicated to parsing and extracting text from documents before feeding it to the spell checker. For example, when spell checking an rmarkdown file, we first extract words from headers and paragraphs (but not urls or R syntax).

# Spell check this post
> spelling::spell_check_files("~/workspace/roweb/_posts/2017-09-07-spelling-release.md", lang = 'en_US')
  WORD         FOUND IN
blog         2017-09-07-spelling-release.md:7
commonmark   2017-09-07-spelling-release.md:88
hunspell     2017-09-07-spelling-release.md:69
Jeroen       2017-09-07-spelling-release.md:7
knitr        2017-09-07-spelling-release.md:88
Ooms         2017-09-07-spelling-release.md:7
rmarkdown    2017-09-07-spelling-release.md:88
rOpenSci     2017-09-07-spelling-release.md:18
urls         2017-09-07-spelling-release.md:88
wordlist     2017-09-07-spelling-release.md:49
WORDLIST     2017-09-07-spelling-release.md:34

To accomplish this, we use knitr to drop code chunks, and subsequently parse markdown using commonmark and xml2, which gives us the text nodes and approximate line numbers in the source document.
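
The markdown parsing step can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's actual code: the input is a toy document, the knitr step that drops code chunks is left out, and the line number is taken from the enclosing block element, which is why positions are only approximate.

```r
library(commonmark)
library(xml2)

# Toy markdown input; "paragraf" is a deliberate typo
md <- c("# Heading", "", "A paragraf with `inline_code` in it.")

# Parse to the CommonMark XML tree, keeping source positions
xml <- commonmark::markdown_xml(md, sourcepos = TRUE)
doc <- xml2::xml_ns_strip(xml2::read_xml(xml))

# Elements that carry a source position and have direct <text> children
blocks <- xml2::xml_find_all(doc, "//*[@sourcepos][text]")

# Starting line of each block, taken from "line:col-line:col"
pos <- xml2::xml_attr(blocks, "sourcepos")
start_line <- as.integer(sub(":.*$", "", pos))

# Collect only the <text> children, so inline code is skipped
text <- vapply(seq_along(blocks), function(i) {
  parts <- xml2::xml_find_all(blocks[[i]], "./text")
  paste(xml2::xml_text(parts), collapse = " ")
}, character(1))

result <- data.frame(line = start_line, text = text, stringsAsFactors = FALSE)
print(result)
```

The extracted text, together with its line number, is what would then be handed to the spell checker.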
