Fuzzy String Matching – a survival skill to tackle unstructured information
“The amount of information available on the internet grows every day.” Thank you, Captain Obvious! By now even my grandma is aware of that. Still, the internet has increasingly become the first address for data people looking for good and up-to-date data. But getting at it is not, and never has been, an easy task.
Even though the Semantic Web was pushed very hard in academic environments, and in spite of all the efforts driven by the community of internet visionaries like Sir Tim Berners-Lee, the vast majority of existing sites don’t speak RDF, don’t expose their data with microformats, and keep giving a hard time to anyone trying to consume their data programmatically.
I must admit, though, that when I got my hands on angular.js, I was surprised by how MVC JavaScript frameworks are now trying to “semantify” HTML via directives… a really cool idea with a promising future, but still lacking standardization.
And of course you see an ever increasing number of web platforms opening up APIs… but we both know there is plenty of interesting data you simply cannot query from an API. So what do you do? You end up scraping, parsing logs, applying regular expressions, and so on.
Let’s assume you got access to the information (lucky me, I’m putting that problem aside)… so we have managed to retrieve semi-structured data from different internet sources. What do you do next?
Connecting the dots without knowing exactly what a dot is
Well, we know that data is in many cases only useful if it can be combined with other data. But in your retrieved data sets there’s nothing like a matching key, so you don’t know how to connect the sources.
The only thing the two data sets you are trying to match have in common is item names… they actually look quite similar, and a human could do the matching… but there are some nasty differences.
For example, one source has a product called “Apple iPad Air 16 GB 4G silber” and the other has “iPad Air 16GB 4G LTE”… You know it is the same product… but how can you do the matching programmatically?
There are several ways of tackling this problem:
- The manual approach
The most obvious one is manually scanning the fields in both data sources that are supposed to contain the matching keys and creating a mapping table (for example, two columns in a CSV). It works well when the data sources are relatively small, but good luck with larger ones!
- The “I let others do it” approach
You’ve certainly heard of crowdsourcing and services like Amazon Mechanical Turk, where you can task a remote team of people to do the matching manually for you. It might not be very reliable, but for certain jobs it’s a really good option.
- The “regex” hell
If you have a closer look at your data, you might define regular expressions to extract parts of the potential matching keys (e.g. gsub('.*?([0-9]+GB).*', '\\1', 'Apple iPhone 16GB black') to extract the amount of memory from a device name) and try to match by several fields instead of just one, as sketched below. But there are so many special cases to consider that you may well end up in a “regex” hell.
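To make the regex idea concrete, here is a minimal, self-contained sketch; the example vectors and the helper extract.memory are made up for illustration, not part of the recipe that follows:

# Two hypothetical name vectors from two sources
names1 <- c('Apple iPad Air 16 GB 4G silber', 'Apple iPhone 5S 32GB gold')
names2 <- c('iPad Air 16GB 4G LTE', 'iPhone 5S 32GB')

# Pull out the memory size as one candidate matching field;
# if the pattern does not match, gsub returns the string unchanged
extract.memory <- function(x) gsub('.*?([0-9]+) ?GB.*', '\\1', x, ignore.case = TRUE)

extract.memory(names1)  # "16" "32"
extract.memory(names2)  # "16" "32"

And this is only one field: you would need similar expressions for the model, the color, the network variant, and so on, which is exactly where the hell begins.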
Obviously, you’d love a better, or at least a more automatic, method to accomplish this task. Let me say it upfront: there is no automatic, easy-to-implement, 100% reliable approach. Even if you are already thinking of machine learning approaches, none of them can ever guarantee 100% reliability, because of the very nature of the problem (the trade-off between overfitting and accuracy). But enough bad news! Let’s talk now about the art of the possible:
The Fuzzy String Matching approach
Fuzzy string matching basically rephrases the yes/no question “Are string A and string B the same?” as “How similar are string A and string B?”… And to compute that degree of similarity (or rather its inverse, the “distance”), the research community has been steadily suggesting new methods over the last decades. Probably the first and most popular one is the Levenshtein distance, which, by the way, is the one R natively implements in the utils package (adist).
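To get a feeling for it, here is what adist returns for the two product names from the example above, first as a plain edit distance and then in the more forgiving partial, case-insensitive variant:

# Plain generalized Levenshtein (edit) distance, natively available in R:
# the number of insertions, deletions and substitutions needed
# to turn one string into the other
adist('Apple iPad Air 16 GB 4G silber', 'iPad Air 16GB 4G LTE')

# With partial = TRUE the first argument is treated as a pattern that may
# match only a substring of the second; ignore.case removes casing noise
adist('iPad Air 16GB 4G LTE', 'Apple iPad Air 16 GB 4G silber',
      partial = TRUE, ignore.case = TRUE)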
Mark van der Loo released a package called stringdist with additional popular fuzzy string matching methods, which we are going to use in the example below.
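A quick taste of the package before the full example (note that most of its metrics are case sensitive, so lowercasing first is usually a good idea):

# install.packages('stringdist')
library(stringdist)

a <- tolower('Apple iPad Air 16 GB 4G silber')
b <- tolower('iPad Air 16GB 4G LTE')

stringdist(a, b, method = 'lv')      # Levenshtein distance
stringdist(a, b, method = 'jw')      # Jaro-Winkler distance, bounded in [0, 1]
stringdist(a, b, method = 'cosine')  # cosine distance between q-gram profiles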
These fuzzy string matching methods don’t know anything about your data, but you might. For example, you may see that in one source the matching keys are kept much shorter than in the other, where further features are included as part of the key. In this case, you want to compute the distance between the shorter key and similarly sized portions of the longer key to decide whether there’s a match. Such “semantics” usually need to be implemented on top, but may well rely on the previously mentioned stringdist methods.
Let’s have a look at the three variants in R. Basically, the process consists of three steps:
- Reading the data from both sources
- Computing the distance matrix between all elements
- Pairing the elements with the minimum distance
The first method, based on the native approximate distance function (adist), looks like this:
# Method 1: using the native R adist
source1.devices <- read.csv('[path_to_your_source1.csv]')
source2.devices <- read.csv('[path_to_your_source2.csv]')

# To make sure we are dealing with characters
source1.devices$name <- as.character(source1.devices$name)
source2.devices$name <- as.character(source2.devices$name)

# It creates a matrix with the standard Levenshtein distance
# between the name fields of both sources
dist.name <- adist(source1.devices$name, source2.devices$name,
                   partial = TRUE, ignore.case = TRUE)

# We now take the pairs with the minimum distance
# (match() returns the first minimum, so ties are resolved by position)
min.name <- apply(dist.name, 1, min)
match.s1.s2 <- NULL
for(i in 1:nrow(dist.name))
{
  s2.i <- match(min.name[i], dist.name[i, ])
  s1.i <- i
  match.s1.s2 <- rbind(data.frame(s2.i = s2.i,
                                  s1.i = s1.i,
                                  s2name = source2.devices[s2.i, ]$name,
                                  s1name = source1.devices[s1.i, ]$name,
                                  adist = min.name[i]),
                       match.s1.s2)
}
# and we can then have a look at the results
View(match.s1.s2)
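Before eyeballing every row, a quick look at the distribution of the best distances can tell you how confident the matches are overall. This is a small optional addition of mine, not part of the original recipe:

# Small values suggest confident matches, a long right tail suggests
# pairs that deserve manual review
summary(match.s1.s2$adist)
hist(match.s1.s2$adist, main = 'Best match distances (method 1)', xlab = 'adist')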
Now let’s make use of all the meaningful string distance metrics implemented in the stringdist package:
# Method 2: applying different string matching methods
# osa     Optimal string alignment (restricted Damerau-Levenshtein distance)
# lv      Levenshtein distance (as in R's native adist)
# dl      Full Damerau-Levenshtein distance
# hamming Hamming distance (a and b must have the same number of characters)
# lcs     Longest common substring distance
# qgram   q-gram distance
# cosine  cosine distance between q-gram profiles
# jaccard Jaccard distance between q-gram profiles
# jw      Jaro, or Jaro-Winkler distance

# install.packages('stringdist')
library(stringdist)

distance.methods <- c('osa', 'lv', 'dl', 'hamming', 'lcs',
                      'qgram', 'cosine', 'jaccard', 'jw')
dist.methods <- list()
for(m in 1:length(distance.methods))
{
  dist.name.enh <- matrix(NA, ncol = length(source2.devices$name),
                          nrow = length(source1.devices$name))
  for(i in 1:length(source2.devices$name))
  {
    for(j in 1:length(source1.devices$name))
    {
      dist.name.enh[j, i] <- stringdist(tolower(source2.devices[i, ]$name),
                                        tolower(source1.devices[j, ]$name),
                                        method = distance.methods[m])
    }
  }
  dist.methods[[distance.methods[m]]] <- dist.name.enh
}

match.s1.s2.enh <- NULL
for(m in 1:length(dist.methods))
{
  dist.matrix <- as.matrix(dist.methods[[distance.methods[m]]])
  min.name.enh <- apply(dist.matrix, 1, base::min)
  for(i in 1:nrow(dist.matrix))
  {
    s2.i <- match(min.name.enh[i], dist.matrix[i, ])
    s1.i <- i
    match.s1.s2.enh <- rbind(data.frame(s2.i = s2.i,
                                        s1.i = s1.i,
                                        s2name = source2.devices[s2.i, ]$name,
                                        s1name = source1.devices[s1.i, ]$name,
                                        adist = min.name.enh[i],
                                        method = distance.methods[m]),
                             match.s1.s2.enh)
  }
}

# Let's have a look at the results, one column per distance method
library(reshape2)
matched.names.matrix <- dcast(match.s1.s2.enh,
                              s2.i + s1.i + s2name + s1name ~ method,
                              value.var = "adist")
View(matched.names.matrix)
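One caveat when comparing the columns of matched.names.matrix: the metrics live on very different scales (jw and cosine are bounded in [0, 1], while lv or qgram count edit operations). If you want to eyeball them side by side, a simple rescaling to [0, 1] per method helps. Again, this is an optional addition of mine, not part of the original recipe:

# Optional: bring each method's distances to a common [0, 1] scale
dist.methods.scaled <- lapply(dist.methods, function(d) {
  f <- is.finite(d)  # hamming yields Inf for strings of unequal length
  rng <- range(d[f])
  if (rng[1] == rng[2]) return(d - rng[1])  # avoid division by zero
  (d - rng[1]) / (rng[2] - rng[1])
})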
And lastly, let’s have a look at what a custom implementation, exploiting the known semantics of the data, might look like:
# Method 3: my own method, applying some more knowledge about the data

# First a small but really helpful function
# (note the doubled backslashes: "\s" is not a valid escape in R strings)
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Then we implement our own distance function:
# take the shorter string and slide it over the longer one,
# applying the Levenshtein distance and keeping the minimum
adist.custom <- function(str1, str2, sliding = TRUE)
{
  s.str1 <- trim(unlist(strsplit(trim(str1), split = ' ')))
  s.str2 <- trim(unlist(strsplit(trim(str2), split = ' ')))
  if (length(s.str2) >= length(s.str1))
  {
    short.str <- s.str1
    long.str <- s.str2
  } else {
    short.str <- s.str2
    long.str <- s.str1
  }
  return.dist <- 0
  if (sliding == TRUE)
  {
    # sliding: move a window of length(short.str) words over the longer string
    # (+ 1 so the last valid window is included and equal lengths still work)
    min.dist <- Inf
    s1 <- trim(paste(short.str, collapse = ' '))
    for (k in 1:(length(long.str) - length(short.str) + 1))
    {
      s2 <- trim(paste(long.str[k:(length(short.str) + (k - 1))], collapse = ' '))
      ads <- adist(s1, s2, partial = TRUE, ignore.case = TRUE)
      min.dist <- ifelse(ads < min.dist, ads, min.dist)
    }
    return.dist <- min.dist
  } else {
    # string start matching: compare against the first words only
    s1 <- trim(paste(short.str, collapse = ' '))
    s2 <- trim(paste(long.str[1:length(short.str)], collapse = ' '))
    return.dist <- adist(s1, s2, partial = TRUE, ignore.case = TRUE)
  }
  return(return.dist)
}

dist.name.custom <- matrix(NA, ncol = length(source2.devices$name),
                           nrow = length(source1.devices$name))
for(i in 1:length(source2.devices$name))
{
  for(j in 1:length(source1.devices$name))
  {
    dist.name.custom[j, i] <- adist.custom(tolower(source2.devices[i, ]$name),
                                           tolower(source1.devices[j, ]$name))
  }
}

min.name.custom <- apply(dist.name.custom, 1, min)
match.s1.s2 <- NULL
for(i in 1:nrow(dist.name.custom))
{
  s2.i <- match(min.name.custom[i], dist.name.custom[i, ])
  s1.i <- i
  match.s1.s2 <- rbind(data.frame(s2.i = s2.i,
                                  s1.i = s1.i,
                                  s2name = source2.devices[s2.i, ]$name,
                                  s1name = source1.devices[s1.i, ]$name,
                                  adist = min.name.custom[i]),
                       match.s1.s2)
}
# let's have a look at the results
View(match.s1.s2)
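As a quick sanity check, you can call the adist.custom function defined above directly on the product pair from the beginning:

# Returns the edit distance of the best-matching window of words,
# so extra features in the longer name no longer inflate the distance
adist.custom('Apple iPad Air 16 GB 4G silber', 'iPad Air 16GB 4G LTE')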
I ran the code with two different lists of mobile device names that you can find here: list1, list2.
Even though, depending on the method, you get a correct matching rate of over 80%, you still need to supervise the outcome… but it’s much quicker than the non-fuzzy methods, don’t you agree?