Merging Data Sets Based on Partially Matched Data Elements
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A tweet from @coneee yesterday about merging two datasets using columns of data that don’t quite match got me wondering about a possible R recipe for handling partial matching. The data in question related to country names in a datafile that needed fusing with country names in a listing of ISO country codes.
The original data set had the form:
RANK,ECONOMY,PERCENTAGE OF INDIVIDUALS USING THE INTERNET 2011 1,Iceland,95 2,Norway,94 3,Netherlands,92.3 4,Sweden,91 5,Luxembourg,90.9 6,Denmark,90 7,Finland,89.4 8,Qatar,86.2 9,New Zealand,86 10,Switzerland,85.2 11,Liechtenstein,85 12,S. Korea,83.8 13,Germany,83 14,Canada,83 15,United Kingdom,82 16,Antigua & Barbuda,82
I’m not sure what country code listing was being used, but it probably looked something like this list of ISO Country Codes by Country:
ANDORRA;AD ANGOLA;AO ANGUILLA;AI ANTARCTICA;AQ ANTIGUA AND BARBUDA;AG ARGENTINA;AR ARMENIA;AM ARUBA;AW AUSTRALIA;AU AUSTRIA;AT AZERBAIJAN;AZ BAHAMAS;BS BAHRAIN;BH
It’s easy enough to reduce all the country names to lowercase characters so that we can try to match them exactly without worrying about any differences in capitalisation, but how do we match country names that don’t match exactly – ANTIGUA AND BARBUDA and Antigua & Barbuda, for example, or Central African Rep. and Central African Republic?
One trick is to use one of the well known partial string matching algorithms, such as the Levenshtein Distance. Here’s a recipe I hacked together that first tries to find an exact match on country names by attempting to merge the two country lists directly, and then tries to partially match any remaining unmatched names in the original list. A signature function is constructed to help out the partial matching attempt that reduces all words in the country name to lower case, sorts them alphabetically, and then concatenates them with no adjoining spaces.
(The signature idea was cribbed from the fingerprint that is available in Google Refine and that I employed in Merging Datasets with Common Columns in Google Refine.)
#Data loaded in from downloaded files as: ##PercentageUsingTheNet ##ccode ##Here's where the algorithm starts... ##I'm going to generate a signature from country names to reduce some of the minor differences between strings ##In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces. ##So for example, United Kingdom would become kingdomunited ##We might also remove stopwords such as 'the' and 'of'. signature=function(x){ sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='') return(sig) } partialMatch=function(x,y,levDist=0.1){ xx=data.frame(sig=sapply(x, signature),row.names=NULL) yy=data.frame(sig=sapply(y, signature),row.names=NULL) xx$raw=x yy$raw=y xx=subset(xx,subset=(sig!='')) xy=merge(xx,yy,by='sig',all=T) matched=subset(xy,subset=(!(is.na(raw.x)) & !(is.na(raw.y)))) matched$pass="Duplicate" todo=subset(xy,subset=(is.na(raw.y)),select=c(sig,raw.x)) colnames(todo)=c('sig','raw') todo$partials= as.character(sapply(todo$sig, agrep, yy$sig,max.distance = levDist,value=T)) todo=merge(todo,yy,by.x='partials',by.y='sig') partial.matched=subset(todo,subset=(!(is.na(raw.x)) & !(is.na(raw.y))),select=c("sig","raw.x","raw.y")) partial.matched$pass="Partial" matched=rbind(matched,partial.matched) un.matched=subset(todo,subset=(is.na(raw.x)),select=c("sig","raw.x","raw.y")) if (nrow(un.matched)>0){ un.matched$pass="Unmatched" matched=rbind(matched,un.matched) } matched=subset(matched,select=c("raw.x","raw.y","pass")) return(matched) } #A rogue character in @coneee's data file borked things for me, so let's do a character code conversion first PercentageUsingTheNet$ECONOMY=iconv(PercentageUsingTheNet$ECONOMY) matches=partialMatch(PercentageUsingTheNet$ECONOMY,ccode$Country.Name)
As ever, this code was arrived at by tinkering and searching on Stack Overflow (using search terms along the lines of “partial match R” and “R levenshtein”). If you can improve on it, please do so and paste a link to the improved code, or a code fragment itself, in the comments below:-)
When we run the script and look at the contents of the matches dataframe, this is an example of what we get:
This data frame can then be merged with the originally loaded data to give us the required country code annotations:
a=PercentageUsingTheNet b=ccode #Merge the original data set with the ISO country code country name keys aa=merge(a,matches,by.x='ECONOMY',by.y='raw.x',all.x=T) #Merge in the ISO country codes aa=merge(aa,b,by.x='raw.y',by.y='Country.Name',all.x=T) aa=subset(aa,select=c('ECONOMY','RANK','PERCENTAGE.OF.INDIVIDUALS.USING.THE.INTERNET.2011','ISO.3166.1.alpha.2.code'))
Something like this, for example:
Unfortunately, not every country in the original data set is matched:
In particular, here are the unmatched items and what they presumably should have been matched with:
Lao P.D.R. - LAO PEOPLE'S DEMOCRATIC REPUBLIC Syria - SYRIAN ARAB REPUBLIC S. Korea - KOREA, REPUBLIC OF Bolivia - BOLIVIA, PLURINATIONAL STATE OF Russia - RUSSIAN FEDERATION Guinea Bissau - GUINEA-BISSAU St. Vincent & Grenadines - SAINT VINCENT AND THE GRENADINES S. Sudan - SOUTH SUDAN Eq. Guinea - EQUATORIAL GUINEA Congo (Dem. Rep.) - CONGO, THE DEMOCRATIC REPUBLIC OF THE Congo (Kinshasa) - CONGO Slovak Republic - SLOVAKIA Iran - IRAN, ISLAMIC REPUBLIC OF TFYR Macedonia - MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF
We could try to increase the Levenshtein Distance within which a partial match is suggested, but then we run the risk of introducing false positives.
Removing stopwords as part of the signature function may help, for example in the case of St. Vincent & Grenadines. (Also, I’m not sure what went wrong with Guinea Bissau?) We could presumably also make something of the abbreviations (S., Dem., Rep., Eq., P. D. R.), using these as NGrams or stubs in a regular expression? So for example, convert Lao P. D. R. to a regular expression invoking searches for a three word phrase in which consecutive words start with P, D, R. (Of course, this would require matching across the full country name rather than the signature.)
The Google Refine documentation page Clustering In Depth (Methods and theory behind the clustering functionality in Google Refine) also identifies several other fuzzy/partial matching algorithms that could be tried here. In which case, I guess I need to rewrite the above function so that it can accommodate different matching algorithms? IF you manage to implement any of these other fuzzy matching approaches, please post a link in the comments.
PS If you use any other approaches for partial matching of strings, please feel free to share how you do it in the comments below…:-)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.