Approximate string matching in R

mark

9 years ago

[This article was first published on Mark van der Loo, and kindly contributed to R-bloggers.]


I have released a new version of the stringdist package.

Besides some new string distance algorithms, it now contains two convenient matching functions:

amatch: Equivalent to R’s match function but allowing for approximate matching.
ain: Similar to R’s %in% operator, but allowing for approximate matching.


> library(stringdist)
# here's an example of amatch
> x <- c('foo', 'bar')
> amatch('fu',x,maxDist=2)
[1] 1
 
# if we decrease the maximum allowed distance, we get
> amatch('fu',x,maxDist=1)
[1] NA
 
# just like with 'match' you can control the output of no-matches:
> amatch('fu',x,maxDist=1,nomatch=0)
[1] 0
 
# to see if 'fu' matches approximately with any element of x:
> ain('fu',x)
[1] FALSE
 
# however, if we allow for larger distances
> ain('fu',x,maxDist=2)
[1] TRUE
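
To make the semantics of the transcript above concrete, here is a rough base R sketch of what "closest match within a maximum distance" means. It is not how stringdist implements amatch or ain: the names amatch_sketch and ain_sketch are hypothetical, the default maxDist = 0 is an assumption, and utils::adist (generalized Levenshtein distance) stands in for the package's own distance functions, so results can differ from stringdist on other inputs.

```r
# Illustrative only: hypothetical helpers built on base R's utils::adist,
# not the stringdist implementation.
amatch_sketch <- function(x, table, maxDist = 0, nomatch = NA_integer_) {
  d <- utils::adist(x, table)  # length(x) by length(table) matrix of edit distances
  vapply(seq_along(x), function(i) {
    j <- unname(which.min(d[i, ]))  # first position with the smallest distance
    if (length(j) == 1L && d[i, j] <= maxDist) j else as.integer(nomatch)
  }, integer(1))
}

# %in%-like wrapper: TRUE when some element of table is within maxDist
ain_sketch <- function(x, table, ...) {
  amatch_sketch(x, table, nomatch = 0L, ...) > 0
}

x <- c('foo', 'bar')
amatch_sketch('fu', x, maxDist = 2)  # 'foo' is two edits away from 'fu'
amatch_sketch('fu', x, maxDist = 1)  # nothing within one edit
ain_sketch('fu', x, maxDist = 2)
```

On this small example the sketch behaves like the transcript above, because 'fu' is two edits away from 'foo' and three away from 'bar'.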

Check the help file for other options, such as how to choose the string distance algorithm.
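
As a small illustration of choosing the algorithm, the sketch below passes a method argument to amatch. It assumes the stringdist package is installed and that your version accepts method = 'lv' (Levenshtein) and method = 'jw' (Jaro-Winkler); the help file is the authority on which methods exist and how maxDist is interpreted for each. The value 0.5 is an arbitrary threshold picked for illustration.

```r
library(stringdist)

x <- c('foo', 'bar')

# Levenshtein distance: maxDist counts single-character edits
amatch('fu', x, method = 'lv', maxDist = 2)

# Jaro-Winkler distance lies between 0 and 1, so maxDist has to be
# chosen on that scale
amatch('fu', x, method = 'jw', maxDist = 0.5)
```

The point to take away is that maxDist is expressed in the units of the chosen distance, so a threshold that suits an edit-based distance will usually not suit a normalized one.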

Note: previously, stringdist and stringdistmatrix returned -1 if a distance was undefined or exceeded a predefined maximum. These functions now return Inf in such cases, which makes comparisons easier. This may break your code if you explicitly test the output for -1.
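
The following base R snippet sketches why Inf is easier to compare against than -1. The distance vectors are made up for illustration and are not output of stringdist; the second entry stands for a distance that is undefined or above the maximum.

```r
# Hypothetical distances, not produced by stringdist
d_old <- c(1, -1, 3)   # old convention: -1 marks "undefined / too large"
d_new <- c(1, Inf, 3)  # new convention: Inf marks the same cases

# Threshold test: with -1 the undefined entry passes by accident,
# with Inf it fails as intended
d_old <= 2
d_new <= 2

# Code that used to test for -1 could test for Inf instead
undefined <- is.infinite(d_new)
```

If existing code contains a check such as d == -1, replacing it with is.infinite(d) is one way to keep the same behaviour under the new convention.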

With the latest release also arrive the latest bugs, so please drop me a line if you happen to stumble upon one.

The next release will probably not include any user-facing changes, but I’m planning to improve performance by smarter memory allocation and better maxDist handling for some of the string distance algorithms.


