Site icon R-bloggers

Fuzzy joining tables using string distance methods

[This article was first published on R tips – NHS-R Community, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Image from Unsplash

I recently had a problem where I had two datasets containing data which I needed to join together. The two datasets had a nice 1:1 mapping between them, but unfortunately there was not a nice coded identifier to join the two datasets. There was just a name field, and annoyingly there were subtle differences between the two.

For demonstration purposes, I’m going to show a similar problem. Imagine that we have one dataset that contains data about ICSs, and another about STPs. (For those not familiar with English NHS geographies, STPs were 42 geographic areas covering England, which in July 2022 became ICSs). This has a 1:1 mapping, but some of the names changed slightly when ICSs came into effect.

If you want to follow along, download the list of STPs and ICBs from the ONS Geoportal site.

stps <- readxl::read_excel("STP_APR_2021_EN_NC.xlsx") |>
  select(code = STP21CDH, description = STP21NM) |>
  arrange(code)

stps
# A tibble: 42 × 2
   code  description                           
   <chr> <chr>                                 
 1 QE1   Healthier Lancashire and South Cumbria
 2 QF7   South Yorkshire and Bassetlaw         
 3 QGH   Herefordshire and Worcestershire      
 4 QH8   Mid and South Essex                   
 5 QHG   Bedfordshire, Luton and Milton Keynes 
 6 QHL   Birmingham and Solihull               
 7 QHM   Cumbria and North East                
 8 QJ2   Joined Up Care Derbyshire             
 9 QJG   Suffolk and North East Essex          
10 QJK   Devon                                 
# … with 32 more rows
icbs <- readxl::read_excel("ICB_JUL_2022_EN_NC.xlsx") |>
  select(code = ICB22CDH, description = ICB22NM) |>
  arrange(code)

icbs
# A tibble: 42 × 2
   code  description                                                    
   <chr> <chr>                                                          
 1 QE1   NHS Lancashire and South Cumbria Integrated Care Board         
 2 QF7   NHS South Yorkshire Integrated Care Board                      
 3 QGH   NHS Herefordshire and Worcestershire Integrated Care Board     
 4 QH8   NHS Mid and South Essex Integrated Care Board                  
 5 QHG   NHS Bedfordshire, Luton and Milton Keynes Integrated Care Board
 6 QHL   NHS Birmingham and Solihull Integrated Care Board              
 7 QHM   NHS North East and North Cumbria Integrated Care Board         
 8 QJ2   NHS Derby and Derbyshire Integrated Care Board                 
 9 QJG   NHS Suffolk and North East Essex Integrated Care Board         
10 QJK   NHS Devon Integrated Care Board                                
# … with 32 more rows

Obviously, here we have the “E54* ONS codes which we could join on, a luxury I did not have. I’ve left these in to test the matching does work later.

First of all, how many rows are we able to match joining on the name?

icbs |>
  inner_join(stps, by = "description")
# A tibble: 0 × 3
# … with 3 variables: code.x <chr>, description <chr>, code.y <chr>

None! Looking at the icbs dataset, we can see rows start with “NHS” and end with “Integrated Care Board”, which doesn’t happen in stps. Perhaps, by just stripping these we get a perfect match?

icbs |>
  select(description) |>
  mutate(across(description, str_remove_all, "^NHS | Integrated Care Board$")) |>
  inner_join(stps |> select(description), by = "description")
# A tibble: 20 × 1
   description                                        
   <chr>                                              
 1 Herefordshire and Worcestershire                   
 2 Mid and South Essex                                
 3 Bedfordshire, Luton and Milton Keynes              
 4 Birmingham and Solihull                            
 5 Suffolk and North East Essex                       
 6 Devon                                              
 7 Lincolnshire                                       
 8 Leicester, Leicestershire and Rutland              
 9 Kent and Medway                                    
10 Hertfordshire and West Essex                       
11 Bath and North East Somerset, Swindon and Wiltshire
12 Northamptonshire                                   
13 Gloucestershire                                    
14 Somerset                                           
15 Buckinghamshire, Oxfordshire and Berkshire West    
16 Cambridgeshire and Peterborough                    
17 Bristol, North Somerset and South Gloucestershire  
18 Dorset                                             
19 Coventry and Warwickshire                          
20 Cheshire and Merseyside                            

Roughly half… not good enough!

String distance methods to the rescue?

Many of us will have had to compare strings at some point, perhaps using LIKE in Sql, or Regular Expressions (regexs) in R. But there are a class of algorithms that can calculate the “distance” or “similarity” between two strings.

Consider the two words “grey” and “gray”. How similar are these two words? The Hamming Distance algorithm compares two strings of equal length, and returns a number indicating how many positions are different in the string. So for our two words above, we get a distance of 1.

A generally more useful method though is the Damerau-Levenshtein distance. This calculates the number of operations to make the first string equal the second string.

Operations in this context are single-character insertions, deletions or substitutions, or transposition of two adjacent characters.

Alternatively, we could consider the set of unique words used in two strings. We can count the intersection of words (words common to both strings) and divide by the count of all the words used to give us a value between 0 and 1. A value of 0 would indicate that the two strings are completely different, and a value of 1 would indicate that the two strings are very similar. This method is called the Jaccard Similarity.

This is a very useful method for the problem I faced, as I expect the names in both datasets to be free of spelling mistakes.

Using the Jaccard Similartiy method

First, we can use the stringsimmatrix() function from the {stringdist} package to calculate the Jaccard Similarity matrix, comparing the names from the first table to the names from the second table.

dist_matrix <- stringdist::stringsimmatrix(
  icbs$description,
  stps$description,
  "jaccard"
)

However, simply calculating the string distance matrix doesn’t give us a solution to the problem. In the table below, you can see that in column y, some rows appear more than once, and eyeballing the match it’s clear it hasn’t found the correct pair.

# we can find the index of the maximum 
tibble(
   x = icbs$description |> str_remove_all("^NHS | Integrated Care Board$"),
  y = stps$description[apply(dist_matrix, 1, which.max)]
) |>
  group_by(y) |>
  arrange(y) |>
  filter(n() > 1)
# A tibble: 20 × 2
# Groups:   y [5]
   x                                                 y                          
   <chr>                                             <chr>                      
 1 Gloucestershire                                   Bristol, North Somerset an…
 2 Bristol, North Somerset and South Gloucestershire Bristol, North Somerset an…
 3 Shropshire, Telford and Wrekin                    Hampshire and the Isle of …
 4 Hampshire and Isle of Wight                       Hampshire and the Isle of …
 5 Lancashire and South Cumbria                      Healthier Lancashire and S…
 6 Leicester, Leicestershire and Rutland             Healthier Lancashire and S…
 7 Lincolnshire                                      North London Partners in H…
 8 North East London                                 North London Partners in H…
 9 North Central London                              North London Partners in H…
10 Birmingham and Solihull                           Nottingham and Nottinghams…
11 Derby and Derbyshire                              Nottingham and Nottinghams…
12 Devon                                             Nottingham and Nottinghams…
13 Sussex                                            Nottingham and Nottinghams…
14 Humber and North Yorkshire                        Nottingham and Nottinghams…
15 Northamptonshire                                  Nottingham and Nottinghams…
16 Somerset                                          Nottingham and Nottinghams…
17 Nottingham and Nottinghamshire                    Nottingham and Nottinghams…
18 Dorset                                            Nottingham and Nottinghams…
19 Surrey Heartlands                                 Nottingham and Nottinghams…
20 Cheshire and Merseyside                           Nottingham and Nottinghams…

Graph theory saves the day

There is a quick solution to this though using a Bipartite Graph. A Birpartite Graph is a type of network where we have vertices of two types, and edges only exist between nodes of the different types.

We can use the {igraph} package to construct and manipulate graphs. First, let’s construct a table where we have names from the first table as nodes of one type, and the names from the second table as nodes of the other type.

# the column `name` is special in a named graph. it will uniquely identify each vertex.
vertices <- dplyr::bind_rows(
  .id = "type",
  icbs = icbs |> mutate(name = paste0("icb_", code)),
  stps = stps |> mutate(name = paste0("stp_", code))
) |>
  dplyr::relocate(name, .before = dplyr::everything()) |>
  # the "type" column needs to be a logical vector, so we use TRUE for the first type, and FALSE for the second
  dplyr::mutate(dplyr::across("type", ~.x == "icbs"))

vertices
# A tibble: 84 × 4
   name    type  code  description                                              
   <chr>   <lgl> <chr> <chr>                                                    
 1 icb_QE1 TRUE  QE1   NHS Lancashire and South Cumbria Integrated Care Board   
 2 icb_QF7 TRUE  QF7   NHS South Yorkshire Integrated Care Board                
 3 icb_QGH TRUE  QGH   NHS Herefordshire and Worcestershire Integrated Care Boa…
 4 icb_QH8 TRUE  QH8   NHS Mid and South Essex Integrated Care Board            
 5 icb_QHG TRUE  QHG   NHS Bedfordshire, Luton and Milton Keynes Integrated Car…
 6 icb_QHL TRUE  QHL   NHS Birmingham and Solihull Integrated Care Board        
 7 icb_QHM TRUE  QHM   NHS North East and North Cumbria Integrated Care Board   
 8 icb_QJ2 TRUE  QJ2   NHS Derby and Derbyshire Integrated Care Board           
 9 icb_QJG TRUE  QJG   NHS Suffolk and North East Essex Integrated Care Board   
10 icb_QJK TRUE  QJK   NHS Devon Integrated Care Board                          
# … with 74 more rows

Then create weighted edges between each pair of names, using the distance matrix we calculated above.

edges <- dist_matrix |>
  # this will convert our matrix into a list of lists
  purrr::array_branch(1) |>
  # the lists will be in the same order as our original data
  # so we can use purrr to change into a dataframe
  purrr::set_names(icbs$code) |>
  purrr::map_dfr(
    .id = "to",
    \(.x) tibble::tibble(from = icbs$code, weight = .x)
  ) |>
  mutate(
    across(to, ~paste0("icb_", .x)),
    across(from, ~paste0("stp_", .x))
  )

edges
# A tibble: 1,764 × 3
   to      from    weight
   <chr>   <chr>    <dbl>
 1 icb_QE1 stp_QE1  0.792
 2 icb_QE1 stp_QF7  0.519
 3 icb_QE1 stp_QGH  0.52 
 4 icb_QE1 stp_QH8  0.462
 5 icb_QE1 stp_QHG  0.483
 6 icb_QE1 stp_QHL  0.542
 7 icb_QE1 stp_QHM  0.625
 8 icb_QE1 stp_QJ2  0.429
 9 icb_QE1 stp_QJG  0.464
10 icb_QE1 stp_QJK  0.12 
# … with 1,754 more rows

This tibble gives us the string similarities between each pair of names from our two tables.

Now that we have our edges and vertices, we can construct a graph, and find the maximum bipartite matching. This works without much effort as we constructed our vertices with a type logical column, and we constructed our edges with a weight numeric column. {igraph} handles the rest for us.

g <- graph_from_data_frame(edges, TRUE, vertices)

m <- maximum.bipartite.matching(g)$matching |>
  enframe("icb_code", "stp_code") |>
  # the results gives us results from icb22cdh->icb22cd and vice versa
  # keep just the icb22cdh->icb22cd results
  filter(icb_code |> str_starts("icb_")) |>
  mutate(across(c(icb_code, stp_code), str_remove_all, "^.{3}_"))

m |>
  filter(icb_code == stp_code) |>
  rename(code = icb_code) |>
  select(-stp_code) |>
  inner_join(
    icbs |> rename(icb_name = description),
    by = "code"
  ) |>
  inner_join(
    stps |> rename(stp_name = description),
    by = "code"
  ) |>
  mutate(across(everything(), str_trunc, 30)) |>
  print(n = 42)
# A tibble: 42 × 3
   code  icb_name                       stp_name                      
   <chr> <chr>                          <chr>                         
 1 QE1   NHS Lancashire and South Cu... Healthier Lancashire and So...
 2 QF7   NHS South Yorkshire Integra... South Yorkshire and Bassetlaw 
 3 QGH   NHS Herefordshire and Worce... Herefordshire and Worcester...
 4 QH8   NHS Mid and South Essex Int... Mid and South Essex           
 5 QHG   NHS Bedfordshire, Luton and... Bedfordshire, Luton and Mil...
 6 QHL   NHS Birmingham and Solihull... Birmingham and Solihull       
 7 QHM   NHS North East and North Cu... Cumbria and North East        
 8 QJ2   NHS Derby and Derbyshire In... Joined Up Care Derbyshire     
 9 QJG   NHS Suffolk and North East ... Suffolk and North East Essex  
10 QJK   NHS Devon Integrated Care B... Devon                         
11 QJM   NHS Lincolnshire Integrated... Lincolnshire                  
12 QK1   NHS Leicester, Leicestershi... Leicester, Leicestershire a...
13 QKK   NHS South East London Integ... Our Healthier South East Lo...
14 QKS   NHS Kent and Medway Integra... Kent and Medway               
15 QM7   NHS Hertfordshire and West ... Hertfordshire and West Essex  
16 QMF   NHS North East London Integ... East London Health and Care...
17 QMJ   NHS North Central London In... North London Partners in He...
18 QMM   NHS Norfolk and Waveney Int... Norfolk and Waveney Health ...
19 QNC   NHS Staffordshire and Stoke... Staffordshire and Stoke on ...
20 QNQ   NHS Frimley Integrated Care... Frimley Health and Care ICS   
21 QNX   NHS Sussex Integrated Care ... Sussex Health and Care Part...
22 QOC   NHS Shropshire, Telford and... Shropshire and Telford and ...
23 QOP   NHS Greater Manchester Inte... Greater Manchester Health a...
24 QOQ   NHS Humber and North Yorksh... Humber, Coast and Vale        
25 QOX   NHS Bath and North East Som... Bath and North East Somerse...
26 QPM   NHS Northamptonshire Integr... Northamptonshire              
27 QR1   NHS Gloucestershire Integra... Gloucestershire               
28 QRL   NHS Hampshire and Isle of W... Hampshire and the Isle of W...
29 QRV   NHS North West London Integ... North West London Health an...
30 QSL   NHS Somerset Integrated Car... Somerset                      
31 QT1   NHS Nottingham and Nottingh... Nottingham and Nottinghamsh...
32 QT6   NHS Cornwall and the Isles ... Cornwall and the Isles of S...
33 QU9   NHS Buckinghamshire, Oxford... Buckinghamshire, Oxfordshir...
34 QUA   NHS Black Country Integrate... The Black Country and West ...
35 QUE   NHS Cambridgeshire and Pete... Cambridgeshire and Peterbor...
36 QUY   NHS Bristol, North Somerset... Bristol, North Somerset and...
37 QVV   NHS Dorset Integrated Care ... Dorset                        
38 QWE   NHS South West London Integ... South West London Health an...
39 QWO   NHS West Yorkshire Integrat... West Yorkshire and Harrogat...
40 QWU   NHS Coventry and Warwickshi... Coventry and Warwickshire     
41 QXU   NHS Surrey Heartlands Integ... Surrey Heartlands Health an...
42 QYG   NHS Cheshire and Merseyside... Cheshire and Merseyside       

This gives us a perfect match!

How does this work?

The way this matching algorithm works is it starts off and finds the edge with the greatest possible matching score, and pairs those two nodes together. It then removes those nodes (and edges to/from those nodes) from the graph, and repeats until all nodes are paired, or no edges remain.

This prevents the issue we initially saw, because a node can only be paired to one other node.

This algorithm works when we have a good set of weights to the edges. In fact, if you try running the string similarity function with some of the different algorithms that are available, such as the Levenshtein Distance, most give us bipartite matchings that aren’t correct.

Final thoughts

Hopefully this has been interesting to you and introduced some new and interesting techniques to play with. Both string-distance algorithms and graph theory are very powerful tools that crop up again and again in computer science, so are worth diving into if you are curious!

There is an obvious question of, are there easier approaches to this problem? In this case, we only had 42 options, which is probably quick enough to solve by hand in Excel by starting with two sorted columns and manually lining up the correct rows.

However, if you had a similar problem with more options then the manual approach would quickly becoming tiring. It is worth noting that you should not blindly trust the results; In my original problem I scanned the results and confirmed that I got the results I was after. In this example we had the codes which allowed us to confirm the correct results

I also came into this problem expecting there to be a perfect 1:1 mapping between both sets. If it isn’t guaranteed that constraint holds in your problem then you may need to treat the results more cautiously.

This post is also available as a quarto document.

To leave a comment for the author, please follow the link and comment on their blog: R tips – NHS-R Community.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.