stringdist 0.8: now with soundex
An update to the stringdist package was released earlier this month. Thanks to a contribution from Jan van der Laan, the package now includes a method to compute soundex codes as defined here. Briefly, soundex encoding aims to map words that sound similar (when pronounced in English) to the same code.
Soundex codes can be computed with the new phonetic function, for example:
> library(stringdist)
> phonetic(c('Euler','Gauss','Hilbert','Knuth','Lloyd','Lukasiewicz','Wachs'))
[1] "E460" "G200" "H416" "K530" "L300" "L222" "W200"
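To give a feel for what such an encoding does, here is a rough base-R sketch of the classic soundex rules as they are commonly described: keep the first letter, map the remaining consonants to digits, collapse repeats, and pad to four characters. The function name soundex_sketch is made up for this illustration; it is not the code used inside stringdist and it simply ignores anything outside A-Z.

```r
# Illustrative sketch of the classic soundex rules (hypothetical helper,
# NOT the implementation used by the stringdist package).
soundex_sketch <- function(word) {
  chars <- strsplit(toupper(word), "")[[1]]
  chars <- chars[chars %in% LETTERS]   # keep plain A-Z only
  if (length(chars) == 0) return(NA_character_)

  # Digit for each consonant class; vowels, Y, H and W get no digit ("")
  code_of <- function(ch) {
    if (ch %in% c("B", "F", "P", "V")) "1"
    else if (ch %in% c("C", "G", "J", "K", "Q", "S", "X", "Z")) "2"
    else if (ch %in% c("D", "T")) "3"
    else if (ch == "L") "4"
    else if (ch %in% c("M", "N")) "5"
    else if (ch == "R") "6"
    else ""
  }

  out  <- chars[1]             # the first letter is kept as-is
  prev <- code_of(chars[1])    # but its digit still suppresses a repeat
  for (ch in chars[-1]) {
    code <- code_of(ch)
    # Emit a digit only when it differs from the previous one
    if (code != "" && code != prev) out <- c(out, code)
    # H and W do not separate equal digits; vowels do
    if (!(ch %in% c("H", "W"))) prev <- code
  }
  # Pad with zeros and truncate to one letter plus three digits
  substr(paste0(paste(out, collapse = ""), "000"), 1, 4)
}

sketch_names <- c("Euler", "Gauss", "Hilbert", "Knuth",
                  "Lloyd", "Lukasiewicz", "Wachs")
sketch_codes <- vapply(sketch_names, soundex_sketch, character(1),
                       USE.NAMES = FALSE)
print(sketch_codes)
```

For the seven names used above, the sketch gives the same codes as phonetic. For real work, stick to the package function.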
Since two strings are considered equal when they have the same soundex code, this gives a two-valued distance function: 0 when the codes match and 1 when they differ.
> stringdist('Claire','Clare',method='soundex')
[1] 0
> stringdist('Harry','Joe',method='soundex')
[1] 1
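Put differently, the soundex distance should amount to comparing the codes returned by phonetic. The snippet below is a small sketch of that check, assuming stringdist 0.8 or later is installed; the variable names are just for illustration.

```r
library(stringdist)

a <- c("Claire", "Harry")
b <- c("Clare", "Joe")

# Distance as reported by stringdist for the soundex method
d <- stringdist(a, b, method = "soundex")

# The same idea by hand: 0 when the codes match, 1 when they differ
d_manual <- as.numeric(phonetic(a) != phonetic(b))

print(data.frame(a, b, d, d_manual))
```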
Since soundex is really only defined on the printable ASCII character set, a warning is given when non-ASCII or non-printable ASCII characters are encountered.
> phonetic("Jörgen")
[1] "J?62"
Warning message:
In phonetic("Jörgen") :
  soundex encountered 1 non-printable ASCII or non-ASCII characters. Results may be unreliable, see ?printable_ascii
The function printable_ascii, also new in this release, can help you detect such characters.
> printable_ascii(c("jörgen","jurgen"))
[1] FALSE TRUE
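One way to use this in practice is to flag the problematic entries first and only encode the rest. The following is a minimal sketch of that idea, assuming stringdist 0.8 or later; the vector names are hypothetical, and the ö is written as an escape so the snippet does not depend on the file encoding.

```r
library(stringdist)

x <- c("j\u00f6rgen", "jurgen")   # "\u00f6" is the character ö

# TRUE for entries consisting of printable ASCII only
ok <- printable_ascii(x)

# Entries that need cleaning before soundex can be trusted
needs_cleaning <- x[!ok]

# Encode only the safe entries; leave the others as NA for now
codes <- rep(NA_character_, length(x))
codes[ok] <- phonetic(x[ok])

print(data.frame(x, ok, codes))
```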
To get rid of such characters in a sensible way there are a few options. First of all, you may want to try R’s built-in iconv interface to translate accented characters to ASCII.
> iconv("jörgen",to="ASCII//TRANSLIT")
[1] "jorgen"
However, the behaviour of iconv may be system-dependent; see the iconv documentation for a thorough discussion. Another option is to install the stringi package.
> library(stringi)
> stri_trans_general("jörgen","Latin-ASCII")
[1] "jorgen"
This package should yield the same result, regardless of the OS you’re working on.
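Putting the pieces together, a cleaning step followed by encoding might look like the sketch below. It assumes both stringdist and stringi are installed; the example names and variable names are made up for illustration.

```r
library(stringdist)
library(stringi)

x <- c("J\u00f6rgen", "Jorgen")   # "\u00f6" is the character ö

# Transliterate accented characters to plain ASCII
x_ascii <- stri_trans_general(x, "Latin-ASCII")

# Check that nothing problematic is left before encoding
stopifnot(all(printable_ascii(x_ascii)))

# Both spellings are now expected to receive the same soundex code
codes <- phonetic(x_ascii)
print(codes)
```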