Fuzzy joining tables using string distance methods
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I recently had a problem where I had two datasets containing data which I needed to join together. The two datasets had a nice 1:1 mapping between them, but unfortunately there was not a nice coded identifier to join the two datasets. There was just a name field, and annoyingly there were subtle differences between the two.
For demonstration purposes, I’m going to show a similar problem. Imagine that we have one dataset that contains data about ICSs, and another about STPs. (For those not familiar with English NHS geographies, STPs were 42 geographic areas covering England, which in July 2022 became ICSs). This has a 1:1 mapping, but some of the names changed slightly when ICSs came into effect.
If you want to follow along, download the list of STPs and ICBs from the ONS Geoportal site.
stps <- readxl::read_excel("STP_APR_2021_EN_NC.xlsx") |> select(code = STP21CDH, description = STP21NM) |> arrange(code) stps # A tibble: 42 × 2 code description <chr> <chr> 1 QE1 Healthier Lancashire and South Cumbria 2 QF7 South Yorkshire and Bassetlaw 3 QGH Herefordshire and Worcestershire 4 QH8 Mid and South Essex 5 QHG Bedfordshire, Luton and Milton Keynes 6 QHL Birmingham and Solihull 7 QHM Cumbria and North East 8 QJ2 Joined Up Care Derbyshire 9 QJG Suffolk and North East Essex 10 QJK Devon # … with 32 more rows icbs <- readxl::read_excel("ICB_JUL_2022_EN_NC.xlsx") |> select(code = ICB22CDH, description = ICB22NM) |> arrange(code) icbs # A tibble: 42 × 2 code description <chr> <chr> 1 QE1 NHS Lancashire and South Cumbria Integrated Care Board 2 QF7 NHS South Yorkshire Integrated Care Board 3 QGH NHS Herefordshire and Worcestershire Integrated Care Board 4 QH8 NHS Mid and South Essex Integrated Care Board 5 QHG NHS Bedfordshire, Luton and Milton Keynes Integrated Care Board 6 QHL NHS Birmingham and Solihull Integrated Care Board 7 QHM NHS North East and North Cumbria Integrated Care Board 8 QJ2 NHS Derby and Derbyshire Integrated Care Board 9 QJG NHS Suffolk and North East Essex Integrated Care Board 10 QJK NHS Devon Integrated Care Board # … with 32 more rows
Obviously, here we have the “E54* ONS codes which we could join on, a luxury I did not have. I’ve left these in to test the matching does work later.
First of all, how many rows are we able to match joining on the name?
icbs |> inner_join(stps, by = "description") # A tibble: 0 × 3 # … with 3 variables: code.x <chr>, description <chr>, code.y <chr>
None! Looking at the icbs
dataset, we can see rows start with “NHS” and end with “Integrated Care Board”, which doesn’t happen in stps
. Perhaps, by just stripping these we get a perfect match?
icbs |> select(description) |> mutate(across(description, str_remove_all, "^NHS | Integrated Care Board$")) |> inner_join(stps |> select(description), by = "description") # A tibble: 20 × 1 description <chr> 1 Herefordshire and Worcestershire 2 Mid and South Essex 3 Bedfordshire, Luton and Milton Keynes 4 Birmingham and Solihull 5 Suffolk and North East Essex 6 Devon 7 Lincolnshire 8 Leicester, Leicestershire and Rutland 9 Kent and Medway 10 Hertfordshire and West Essex 11 Bath and North East Somerset, Swindon and Wiltshire 12 Northamptonshire 13 Gloucestershire 14 Somerset 15 Buckinghamshire, Oxfordshire and Berkshire West 16 Cambridgeshire and Peterborough 17 Bristol, North Somerset and South Gloucestershire 18 Dorset 19 Coventry and Warwickshire 20 Cheshire and Merseyside
Roughly half… not good enough!
String distance methods to the rescue?
Many of us will have had to compare strings at some point, perhaps using LIKE
in Sql, or Regular Expressions (regexs) in R. But there are a class of algorithms that can calculate the “distance” or “similarity” between two strings.
Consider the two words “grey” and “gray”. How similar are these two words? The Hamming Distance algorithm compares two strings of equal length, and returns a number indicating how many positions are different in the string. So for our two words above, we get a distance of 1.
A generally more useful method though is the Damerau-Levenshtein distance. This calculates the number of operations to make the first string equal the second string.
Operations in this context are single-character insertions, deletions or substitutions, or transposition of two adjacent characters.
Alternatively, we could consider the set of unique words used in two strings. We can count the intersection of words (words common to both strings) and divide by the count of all the words used to give us a value between 0 and 1. A value of 0 would indicate that the two strings are completely different, and a value of 1 would indicate that the two strings are very similar. This method is called the Jaccard Similarity.
This is a very useful method for the problem I faced, as I expect the names in both datasets to be free of spelling mistakes.
Using the Jaccard Similartiy method
First, we can use the stringsimmatrix()
function from the {stringdist}
package to calculate the Jaccard Similarity matrix, comparing the names from the first table to the names from the second table.
dist_matrix <- stringdist::stringsimmatrix( icbs$description, stps$description, "jaccard" )
However, simply calculating the string distance matrix doesn’t give us a solution to the problem. In the table below, you can see that in column y, some rows appear more than once, and eyeballing the match it’s clear it hasn’t found the correct pair.
# we can find the index of the maximum tibble( x = icbs$description |> str_remove_all("^NHS | Integrated Care Board$"), y = stps$description[apply(dist_matrix, 1, which.max)] ) |> group_by(y) |> arrange(y) |> filter(n() > 1) # A tibble: 20 × 2 # Groups: y [5] x y <chr> <chr> 1 Gloucestershire Bristol, North Somerset an… 2 Bristol, North Somerset and South Gloucestershire Bristol, North Somerset an… 3 Shropshire, Telford and Wrekin Hampshire and the Isle of … 4 Hampshire and Isle of Wight Hampshire and the Isle of … 5 Lancashire and South Cumbria Healthier Lancashire and S… 6 Leicester, Leicestershire and Rutland Healthier Lancashire and S… 7 Lincolnshire North London Partners in H… 8 North East London North London Partners in H… 9 North Central London North London Partners in H… 10 Birmingham and Solihull Nottingham and Nottinghams… 11 Derby and Derbyshire Nottingham and Nottinghams… 12 Devon Nottingham and Nottinghams… 13 Sussex Nottingham and Nottinghams… 14 Humber and North Yorkshire Nottingham and Nottinghams… 15 Northamptonshire Nottingham and Nottinghams… 16 Somerset Nottingham and Nottinghams… 17 Nottingham and Nottinghamshire Nottingham and Nottinghams… 18 Dorset Nottingham and Nottinghams… 19 Surrey Heartlands Nottingham and Nottinghams… 20 Cheshire and Merseyside Nottingham and Nottinghams…
Graph theory saves the day
There is a quick solution to this though using a Bipartite Graph. A Birpartite Graph is a type of network where we have vertices of two types, and edges only exist between nodes of the different types.
We can use the {igraph}
package to construct and manipulate graphs. First, let’s construct a table where we have names from the first table as nodes of one type, and the names from the second table as nodes of the other type.
# the column `name` is special in a named graph. it will uniquely identify each vertex. vertices <- dplyr::bind_rows( .id = "type", icbs = icbs |> mutate(name = paste0("icb_", code)), stps = stps |> mutate(name = paste0("stp_", code)) ) |> dplyr::relocate(name, .before = dplyr::everything()) |> # the "type" column needs to be a logical vector, so we use TRUE for the first type, and FALSE for the second dplyr::mutate(dplyr::across("type", ~.x == "icbs")) vertices # A tibble: 84 × 4 name type code description <chr> <lgl> <chr> <chr> 1 icb_QE1 TRUE QE1 NHS Lancashire and South Cumbria Integrated Care Board 2 icb_QF7 TRUE QF7 NHS South Yorkshire Integrated Care Board 3 icb_QGH TRUE QGH NHS Herefordshire and Worcestershire Integrated Care Boa… 4 icb_QH8 TRUE QH8 NHS Mid and South Essex Integrated Care Board 5 icb_QHG TRUE QHG NHS Bedfordshire, Luton and Milton Keynes Integrated Car… 6 icb_QHL TRUE QHL NHS Birmingham and Solihull Integrated Care Board 7 icb_QHM TRUE QHM NHS North East and North Cumbria Integrated Care Board 8 icb_QJ2 TRUE QJ2 NHS Derby and Derbyshire Integrated Care Board 9 icb_QJG TRUE QJG NHS Suffolk and North East Essex Integrated Care Board 10 icb_QJK TRUE QJK NHS Devon Integrated Care Board # … with 74 more rows
Then create weighted edges between each pair of names, using the distance matrix we calculated above.
edges <- dist_matrix |> # this will convert our matrix into a list of lists purrr::array_branch(1) |> # the lists will be in the same order as our original data # so we can use purrr to change into a dataframe purrr::set_names(icbs$code) |> purrr::map_dfr( .id = "to", \(.x) tibble::tibble(from = icbs$code, weight = .x) ) |> mutate( across(to, ~paste0("icb_", .x)), across(from, ~paste0("stp_", .x)) ) edges # A tibble: 1,764 × 3 to from weight <chr> <chr> <dbl> 1 icb_QE1 stp_QE1 0.792 2 icb_QE1 stp_QF7 0.519 3 icb_QE1 stp_QGH 0.52 4 icb_QE1 stp_QH8 0.462 5 icb_QE1 stp_QHG 0.483 6 icb_QE1 stp_QHL 0.542 7 icb_QE1 stp_QHM 0.625 8 icb_QE1 stp_QJ2 0.429 9 icb_QE1 stp_QJG 0.464 10 icb_QE1 stp_QJK 0.12 # … with 1,754 more rows
This tibble gives us the string similarities between each pair of names from our two tables.
Now that we have our edges and vertices, we can construct a graph, and find the maximum bipartite matching. This works without much effort as we constructed our vertices with a type
logical column, and we constructed our edges with a weight
numeric column. {igraph}
handles the rest for us.
g <- graph_from_data_frame(edges, TRUE, vertices) m <- maximum.bipartite.matching(g)$matching |> enframe("icb_code", "stp_code") |> # the results gives us results from icb22cdh->icb22cd and vice versa # keep just the icb22cdh->icb22cd results filter(icb_code |> str_starts("icb_")) |> mutate(across(c(icb_code, stp_code), str_remove_all, "^.{3}_")) m |> filter(icb_code == stp_code) |> rename(code = icb_code) |> select(-stp_code) |> inner_join( icbs |> rename(icb_name = description), by = "code" ) |> inner_join( stps |> rename(stp_name = description), by = "code" ) |> mutate(across(everything(), str_trunc, 30)) |> print(n = 42) # A tibble: 42 × 3 code icb_name stp_name <chr> <chr> <chr> 1 QE1 NHS Lancashire and South Cu... Healthier Lancashire and So... 2 QF7 NHS South Yorkshire Integra... South Yorkshire and Bassetlaw 3 QGH NHS Herefordshire and Worce... Herefordshire and Worcester... 4 QH8 NHS Mid and South Essex Int... Mid and South Essex 5 QHG NHS Bedfordshire, Luton and... Bedfordshire, Luton and Mil... 6 QHL NHS Birmingham and Solihull... Birmingham and Solihull 7 QHM NHS North East and North Cu... Cumbria and North East 8 QJ2 NHS Derby and Derbyshire In... Joined Up Care Derbyshire 9 QJG NHS Suffolk and North East ... Suffolk and North East Essex 10 QJK NHS Devon Integrated Care B... Devon 11 QJM NHS Lincolnshire Integrated... Lincolnshire 12 QK1 NHS Leicester, Leicestershi... Leicester, Leicestershire a... 13 QKK NHS South East London Integ... Our Healthier South East Lo... 14 QKS NHS Kent and Medway Integra... Kent and Medway 15 QM7 NHS Hertfordshire and West ... Hertfordshire and West Essex 16 QMF NHS North East London Integ... East London Health and Care... 17 QMJ NHS North Central London In... North London Partners in He... 18 QMM NHS Norfolk and Waveney Int... Norfolk and Waveney Health ... 19 QNC NHS Staffordshire and Stoke... Staffordshire and Stoke on ... 20 QNQ NHS Frimley Integrated Care... Frimley Health and Care ICS 21 QNX NHS Sussex Integrated Care ... Sussex Health and Care Part... 22 QOC NHS Shropshire, Telford and... Shropshire and Telford and ... 23 QOP NHS Greater Manchester Inte... Greater Manchester Health a... 24 QOQ NHS Humber and North Yorksh... Humber, Coast and Vale 25 QOX NHS Bath and North East Som... Bath and North East Somerse... 26 QPM NHS Northamptonshire Integr... Northamptonshire 27 QR1 NHS Gloucestershire Integra... Gloucestershire 28 QRL NHS Hampshire and Isle of W... Hampshire and the Isle of W... 29 QRV NHS North West London Integ... North West London Health an... 30 QSL NHS Somerset Integrated Car... Somerset 31 QT1 NHS Nottingham and Nottingh... Nottingham and Nottinghamsh... 32 QT6 NHS Cornwall and the Isles ... Cornwall and the Isles of S... 33 QU9 NHS Buckinghamshire, Oxford... Buckinghamshire, Oxfordshir... 34 QUA NHS Black Country Integrate... The Black Country and West ... 35 QUE NHS Cambridgeshire and Pete... Cambridgeshire and Peterbor... 36 QUY NHS Bristol, North Somerset... Bristol, North Somerset and... 37 QVV NHS Dorset Integrated Care ... Dorset 38 QWE NHS South West London Integ... South West London Health an... 39 QWO NHS West Yorkshire Integrat... West Yorkshire and Harrogat... 40 QWU NHS Coventry and Warwickshi... Coventry and Warwickshire 41 QXU NHS Surrey Heartlands Integ... Surrey Heartlands Health an... 42 QYG NHS Cheshire and Merseyside... Cheshire and Merseyside
This gives us a perfect match!
How does this work?
The way this matching algorithm works is it starts off and finds the edge with the greatest possible matching score, and pairs those two nodes together. It then removes those nodes (and edges to/from those nodes) from the graph, and repeats until all nodes are paired, or no edges remain.
This prevents the issue we initially saw, because a node can only be paired to one other node.
This algorithm works when we have a good set of weights to the edges. In fact, if you try running the string similarity function with some of the different algorithms that are available, such as the Levenshtein Distance, most give us bipartite matchings that aren’t correct.
Final thoughts
Hopefully this has been interesting to you and introduced some new and interesting techniques to play with. Both string-distance algorithms and graph theory are very powerful tools that crop up again and again in computer science, so are worth diving into if you are curious!
There is an obvious question of, are there easier approaches to this problem? In this case, we only had 42 options, which is probably quick enough to solve by hand in Excel by starting with two sorted columns and manually lining up the correct rows.
However, if you had a similar problem with more options then the manual approach would quickly becoming tiring. It is worth noting that you should not blindly trust the results; In my original problem I scanned the results and confirmed that I got the results I was after. In this example we had the codes which allowed us to confirm the correct results
I also came into this problem expecting there to be a perfect 1:1 mapping between both sets. If it isn’t guaranteed that constraint holds in your problem then you may need to treat the results more cautiously.
This post is also available as a quarto document.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.