Approximate string matching in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have released a new version of the stringdist package.
Besides some new string distance algorithms, it now contains two convenient matching functions:
- amatch: Equivalent to R’s match function, but allowing for approximate matching.
- ain: Similar to R’s %in% operator, but allowing for approximate matching.
```r
# here's an example of amatch
> library(stringdist)
> x <- c('foo', 'bar')
> amatch('fu',x,maxDist=2)
[1] 1

# if we decrease the maximum allowed distance, we get
> amatch('fu',x,maxDist=1)
[1] NA

# just like with 'match' you can control the output of no-matches:
> amatch('fu',x,maxDist=1,nomatch=0)
[1] 0

# to see if 'fu' matches approximately with any element of x:
> ain('fu',x)
[1] FALSE

# however, if we allow for larger distances
> ain('fu',x,maxDist=2)
[1] TRUE
```
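To make the matching rule concrete, here is a minimal base-R sketch of the idea: compute the distance from each query to every table entry, take the closest one, and accept it only if it lies within maxDist. The helpers approx_match and approx_in are hypothetical names invented for this illustration. They are not part of stringdist and do not show how the package is implemented. They also rely on base R's adist (generalized Levenshtein distance), which need not agree with the package's default distance on every input.

```r
# Hypothetical helpers for illustration only; NOT part of stringdist.
# They use base R's adist() (generalized Levenshtein distance).
approx_match <- function(x, table, maxDist = 2, nomatch = NA_integer_) {
  d <- utils::adist(x, table)   # length(x) by length(table) distance matrix
  vapply(seq_along(x), function(i) {
    j <- which.min(d[i, ])      # index of the closest table entry, if any
    if (length(j) == 1L && d[i, j] <= maxDist) as.integer(j) else as.integer(nomatch)
  }, integer(1))
}

# Approximate analogue of %in%: TRUE when an acceptable match exists.
approx_in <- function(x, table, maxDist = 2) {
  !is.na(approx_match(x, table, maxDist = maxDist))
}

x <- c('foo', 'bar')
approx_match('fu', x, maxDist = 2)               # closest entry is 'foo' at distance 2
approx_match('fu', x, maxDist = 1)               # nothing within distance 1
approx_match('fu', x, maxDist = 1, nomatch = 0)  # custom no-match value
approx_in('fu', x, maxDist = 2)
```

For real work, amatch and ain are the better choice, since they support several distance algorithms and are implemented in the package itself.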
Check the help file for other options, such as how to choose the string distance algorithm.
Note: previously, stringdist and stringdistmatrix returned -1 if a distance was undefined or exceeded a predefined maximum. Now these functions return Inf in such cases, making it easier to do comparisons. This may break your code if you explicitly test the output for -1.
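The following base-R sketch illustrates why Inf is easier to compare against than -1. The distance vectors are made up for the illustration and are not actual package output. In each, the last element stands for a distance that is undefined or above the maximum.

```r
# Made-up distances for illustration; not actual stringdist output.
d_old <- c(0, 1, 3, -1)    # old convention: -1 marks "no valid distance"
d_new <- c(0, 1, 3, Inf)   # new convention: Inf marks "no valid distance"

# Old convention: -1 slips through a plain threshold test,
# so every comparison needs an extra guard.
naive_old   <- d_old <= 2
guarded_old <- d_old >= 0 & d_old <= 2

# New convention: a plain comparison already does the right thing.
plain_new <- d_new <= 2

# Code that used to test for -1 can test for Inf instead.
no_distance <- is.infinite(d_new)

print(naive_old)
print(guarded_old)
print(plain_new)
print(no_distance)
```

If your existing code contains checks such as `d == -1` or `d < 0`, those are the places to review after upgrading.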
With the latest release also arrive the latest bugs, so please drop me a line if you happen to stumble upon one.
The next release will probably not include any user-facing changes, but I’m planning to improve performance through smarter memory allocation and better maxDist handling for some of the string distance algorithms.