Introduction
As a follow-up to a previous
post about
correlations between Texas high school academic UIL competition scores
and SAT/ACT scores, I wanted to explore some of the “alternatives” to
joining the two data sets—which come from different sources. In that
post, I simply performed an inner_join()
using the school
and city
names as keys. While this decision ensures that the data integrity is
“high”, there are potentially many unmatched schools that could have
been included in the analysis with some sound “fuzzy
matching”.
For my exploration here, I’ll leverage David
Robinson’s excellent {fuzzyjoin}
package.
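Under the hood, the {fuzzyjoin} stringdist_*_join() functions pair up rows whose key strings are within a maximum edit distance of one another. The core idea can be sketched in base R with adist() (the school names below are toy examples, not the real data):

```r
# Sketch of the idea behind fuzzyjoin::stringdist_inner_join(): pair up
# keys whose Levenshtein (edit) distance is at most `max_dist`.
tea <- c("ABILENE", "ALBA-GOLDEN", "A M CONS")
uil <- c("ABILENE HS", "ALBA GOLDEN", "AM CONS")
max_dist <- 1

d <- adist(tea, uil)  # matrix of pairwise edit distances
matches <- which(d <= max_dist, arr.ind = TRUE)

# "ALBA-GOLDEN" ~ "ALBA GOLDEN" and "A M CONS" ~ "AM CONS" both match
# within a distance of 1; "ABILENE" vs. "ABILENE HS" (distance 3) does not.
data.frame(
  tea  = tea[matches[, "row"]],
  uil  = uil[matches[, "col"]],
  dist = d[matches]
)
```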
My dissatisfaction with the results of my various fuzzy-matching attempts ultimately led me to “settle” upon the exact join that I used in the original post.
Fuzzy Matching
In its raw form, the UIL data—scraped from https://www.hpscience.net/ (the only site that I found to have the academic competition scores)—is much “dirtier” than the SAT and ACT data from the TEA website. In particular, many of the city, school, and individual participant names are inconsistent across different webpages. 1 Even after eliminating much of the “self-inconsistency” of the UIL data and creating a suitably “clean” data set to use as the basis for my series of posts exploring the UIL data exclusively, there are more than a few differences in the names of schools between the UIL and TEA data.
To aid in my experimentation with “fuzzy joining”, I use a function from
my {tetidy}
package—join_fuzzily()
, which is essentially a wrapper
for the stringdist_*_join()
functions provided by the {fuzzyjoin}
package. With this function, I was able to evaluate different values for
max_dist
and different join columns to help me make a judgement
regarding the quality of joins. I primarily considered counts of joined
and unjoined rows computed with the summarise_join_stats()
function—also from my {tetidy}
package—to make a decision on how I
should join the UIL and SAT/ACT school data. 2
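The kinds of counts that summarise_join_stats() reports (n_joined, n_x_unjoined, etc.) can be illustrated with a hypothetical sketch. (This is just the gist, not the package's actual implementation.) After a full join, an NA in a key column marks a row unmatched on that side:

```r
# Hypothetical sketch (not tetidy's actual code) of the join counts:
# after a full join, an NA in a key column marks a row unmatched on that side.
key_uil <- c("A", "B", "C", NA)   # key from the x (UIL) side
key_tea <- c("A", "B", NA,  "D")  # key from the y (TEA) side

n_joined     <- sum(!is.na(key_uil) & !is.na(key_tea))
n_x_unjoined <- sum(!is.na(key_uil) &  is.na(key_tea))
n_y_unjoined <- sum( is.na(key_uil) & !is.na(key_tea))

c(n_joined = n_joined, n_x_unjoined = n_x_unjoined, n_y_unjoined = n_y_unjoined)
```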
What follows are the results of some of my experimentation. Of course, let’s first have a look at the data sets that we’re working with. (See my other posts to see how I created these data sets.)
schools_tea
test | year | school | district | county | city | math | reading | writing | english | science | total
---|---|---|---|---|---|---|---|---|---|---|---
ACT | 2011 | A C JONES | BEEVILLE ISD | BEE | CORPUS CHRISTI | 19 | 18 | NA | 17 | 19 | 18
ACT | 2011 | A J MOORE ACAD | WACO ISD | MCLENNAN | WACO | 19 | 18 | NA | 16 | 18 | 18
ACT | 2011 | A M CONS | COLLEGE STATION ISD | BRAZOS | HUNTSVILLE | 26 | 24 | NA | 23 | 24 | 24
ACT | 2011 | A MACEO SMITH HIGH SCHOOL | DALLAS ISD | DALLAS | RICHARDSON | 16 | 14 | NA | 13 | 15 | 14
ACT | 2011 | ABBOTT SCHOOL | ABBOTT ISD | HILL | WACO | 20 | 20 | NA | 19 | 21 | 20
ACT | 2011 | ABERNATHY | ABERNATHY ISD | HALE | LUBBOCK | 22 | 20 | NA | 19 | 21 | 21
ACT | 2011 | ABILENE | ABILENE ISD | TAYLOR | ABILENE | 21 | 21 | NA | 20 | 21 | 21
ACT | 2011 | ACADEMY | ACADEMY ISD | BELL | WACO | 24 | 23 | NA | 21 | 24 | 23
ACT | 2011 | ACADEMY HIGH SCHOOL | HAYS CISD | HAYS | AUSTIN | NA | NA | NA | NA | NA | NA
ACT | 2011 | ACADEMY OF CAREERS AND TECHNOLOGIE | ACADEMY OF CAREERS AND TECHNOLOGIE | BEXAR | SAN ANTONIO | 15 | 14 | NA | 12 | 14 | 14
ACT | 2011 | ACADEMY OF CREATIVE ED | NORTH EAST ISD | BEXAR | SAN ANTONIO | NA | NA | NA | NA | NA | NA
ACT | 2011 | ADRIAN SCHOOL | ADRIAN ISD | OLDHAM | AMARILLO | 19 | 18 | NA | 20 | 19 | 19
ACT | 2011 | ADVANTAGE ACADEMY | ADVANTAGE ACADEMY | DALLAS | RICHARDSON | 18 | 20 | NA | 19 | 16 | 18
ACT | 2011 | AGUA DULCE | AGUA DULCE ISD | NUECES | CORPUS CHRISTI | 21 | 19 | NA | 18 | 20 | 19
ACT | 2011 | AIM CENTER | VIDOR ISD | ORANGE | BEAUMONT | NA | NA | NA | NA | NA | NA
ACT | 2011 | AKINS | AUSTIN ISD | TRAVIS | AUSTIN | 19 | 17 | NA | 16 | 17 | 17
ACT | 2011 | ALAMO HEIGHTS | ALAMO HEIGHTS ISD | BEXAR | SAN ANTONIO | 25 | 24 | NA | 24 | 24 | 24
ACT | 2011 | ALBA-GOLDEN | ALBA-GOLDEN ISD | WOOD | KILGORE | 20 | 19 | NA | 18 | 20 | 19
ACT | 2011 | ALBANY JR-SR | ALBANY ISD | SHACKELFORD | ABILENE | 24 | 22 | NA | 21 | 22 | 22
ACT | 2011 | ALDINE | ALDINE ISD | HARRIS | HOUSTON | 19 | 17 | NA | 16 | 18 | 18
# of total rows: 15,073
schools_uil

school | city | complvl_num | score | year | conf | complvl | comp | advanced | n_state | n_bycomp | prnk | n_defeat | w
---|---|---|---|---|---|---|---|---|---|---|---|---|---
HASKELL | HASKELL | 13 | 616 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 1.00 | 7 | TRUE |
POOLVILLE | POOLVILLE | 13 | 609 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 0.86 | 6 | FALSE |
LINDSAY | LINDSAY | 17 | 553 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
PLAINS | PLAINS | 3 | 537 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
SAN ISIDRO | SAN ISIDRO | 32 | 534 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 4 | 1.00 | 3 | TRUE |
CANADIAN | CANADIAN | 7 | 527 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
GARDEN CITY | GARDEN CITY | 10 | 518 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 1.00 | 7 | TRUE |
WATER VALLEY | WATER VALLEY | 10 | 478 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 8 | 0.86 | 6 | FALSE |
GRUVER | GRUVER | 7 | 464 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.83 | 5 | FALSE |
YANTIS | YANTIS | 19 | 451 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
SHINER | SHINER | 27 | 450 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 9 | 1.00 | 8 | TRUE |
WEST TEXAS | STINNETT | 7 | 443 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.67 | 4 | FALSE |
HONEY GROVE | HONEY GROVE | 17 | 440 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 0.83 | 5 | FALSE |
LATEXO | LATEXO | 23 | 439 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
MUENSTER | MUENSTER | 17 | 436 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.67 | 4 | FALSE |
VAN HORN | VAN HORN | 1 | 436 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
SLOCUM | ELKHART | 23 | 415 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 10 | 0.89 | 8 | FALSE |
ERA | ERA | 17 | 415 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.50 | 3 | FALSE |
GOLDTHWAITE | GOLDTHWAITE | 15 | 413 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
NEWCASTLE | NEWCASTLE | 12 | 408 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
# of total rows: 27,359
For my first attempt at fuzzy matching, I tried joining using only
school
as a key column and setting max_dist = 1
. (Note that I do a
“full” join because my tetidy::summarise_join_stats()
function works
best with this kind of input.)
```r
library("tidyverse")

schools_uil_distinct <-
  schools_uil %>%
  distinct(school, city)

summ_schools_joinfuzzy_1 <-
  schools_tea %>%
  tetidy::join_fuzzily(
    schools_uil_distinct,
    mode = "full",
    max_dist = 1,
    cols_join = c("school"),
    suffix_x = "_tea",
    suffix_y = "_uil"
  ) %>%
  tetidy::summarise_join_stats(school_uil, school_tea) %>%
  select(-x, -y) %>%
  gather(metric, value)
summ_schools_joinfuzzy_1
```
metric | value |
---|---|
n_x | 12,367 |
n_y | 17,542 |
n_joined | 12,119 |
n_x_unjoined | 5,423 |
n_y_unjoined | 248 |
x_in_y_pct | 98 |
y_in_x_pct | 69 |
This kind of join represents a very “optimistic” or “over-zealous” implementation that likely results in matches that are not “true”. Nonetheless, it’s hard to really conclude anything about the quality of the join without comparing it to the results of another attempt.
Next, I tried the same join, only setting max_dist = 0
this time. Note
that this is really like “exact” string matching, meaning that there is
not really any fuzzy matching going on. (Nonetheless, it was worthwhile
to evaluate the counts of matches and mis-matches with a full join.)
```r
summ_schools_joinfuzzy_2 <-
  schools_tea %>%
  tetidy::join_fuzzily(
    schools_uil_distinct,
    mode = "full",
    max_dist = 0,
    cols_join = c("school"),
    suffix_x = "_tea",
    suffix_y = "_uil"
  ) %>%
  tetidy::summarise_join_stats(school_uil, school_tea) %>%
  select(-x, -y) %>%
  gather(metric, value)
summ_schools_joinfuzzy_2
```
metric | value |
---|---|
n_x | 11,666 |
n_y | 16,898 |
n_joined | 11,399 |
n_x_unjoined | 5,499 |
n_y_unjoined | 267 |
x_in_y_pct | 98 |
y_in_x_pct | 67 |
As we should expect, this results in a lower number of joins, but my
feeling is that the join is still too naive as a result of using only
one key column for joining—school
.
What about changing the key columns to school
and city
and setting
max_dist = 1
? (Note that I unite()
the two columns only for my
tetidy::summarise_join_stats()
function, which can only work with a
single “key” column. If only joining the data, the two join columns
could be kept separate.)
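For illustration, here is what that composite key looks like; tidyr::unite() joins the columns with "_" by default (toy data, shown in base R):

```r
# Toy illustration of the composite "school_city" key that unite() builds
# (tidyr::unite() separates the joined columns with "_" by default).
df <- data.frame(
  school = c("ABILENE", "ACADEMY"),
  city   = c("ABILENE", "WACO")
)
df$school_city <- paste(df$school, df$city, sep = "_")
df$school_city  # "ABILENE_ABILENE" "ACADEMY_WACO"
```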
```r
summ_schools_joinfuzzy_3 <-
  schools_tea %>%
  unite(school_city, school, city, remove = FALSE) %>%
  tetidy::join_fuzzily(
    schools_uil_distinct %>%
      unite(school_city, school, city, remove = FALSE),
    mode = "full",
    max_dist = 1,
    cols_join = c("school_city"),
    suffix_x = "_tea",
    suffix_y = "_uil"
  ) %>%
  tetidy::summarise_join_stats(school_city_uil, school_city_tea) %>%
  select(-x, -y) %>%
  gather(metric, value)
summ_schools_joinfuzzy_3
```
metric | value |
---|---|
n_x | 2,842 |
n_y | 15,076 |
n_joined | 1,697 |
n_x_unjoined | 13,379 |
n_y_unjoined | 1,145 |
x_in_y_pct | 60 |
y_in_x_pct | 11 |
This is probably the best option of the many alternatives that I
tested—including many not shown here that try different columns for
joining and increasing the value of max_dist
. In fact, the number of
rows that would be returned with an inner join and the same settings
(i.e. max_dist = 1
and the key columns)—as indicated by the value
n_joined
—is only slightly greater than that which you get when you
combine the two data sets with an inner_join()
on school
and city
,
as I ended up doing. 3
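For reference, that exact join amounts to the following (toy data; base R's merge() is the analogue of dplyr::inner_join(x, y, by = c("school", "city"))):

```r
# Base-R sketch of the exact inner join on school AND city (toy data);
# only rows agreeing on both keys survive.
tea <- data.frame(
  school = c("ABILENE", "AKINS"),
  city   = c("ABILENE", "AUSTIN"),
  math   = c(21, 19)
)
uil <- data.frame(
  school = c("ABILENE", "HASKELL"),
  city   = c("ABILENE", "HASKELL"),
  score  = c(553, 616)
)
joined <- merge(tea, uil, by = c("school", "city"))  # inner join
joined  # only the ABILENE row matches on both keys
```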
Conclusion
In some ways, I think this is not too unlike the classic “bias vs. variance” trade-off in machine learning. To what extent do you try to minimize error with your training data—and potentially overfit your model? Or do you try to make a simpler model that generalizes better to data outside the training set—potentially creating a model that is “underfit” because it is too simple?

In this case, the question is: to what extent do you try to maximize the number of observations joined—perhaps creating some false matches that compromise the data and the results? And, from the other perspective, if you exclude data solely because of poor matching, how can you guarantee that your data is “representative” and that your results are legitimate? It’s a tricky trade-off, as with many things in the realm of data analysis. And, as with most things, the best solution depends heavily on the context. In this case, I think erring on the side of caution with a “conservative” joining policy is a logical choice because lots of rows are matched anyway, and one can only assume that there is no underlying “trend” or “bias” in the schools that are mismatched. (That is, the unmatched schools are just as likely to be superior or inferior academically, relative to all other schools, as those that are matched.)
Something that I did not consider here but is certainly worthy of exploration is neural network text classification. With this approach, I could quantify the probability that one string is “equivalent” to another and choose the matches with the highest probabilities for any given string. I have a feeling that this kind of approach would be more successful, although it does not seem as easy to implement. (I may come back and visit this idea in a future post.)
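Short of a neural network, the same “best match” idea can be approximated by converting edit distance into a 0-to-1 similarity score and keeping the top-scoring candidate for each string (a base-R sketch with toy names):

```r
# Convert edit distance into a crude 0-1 similarity and keep the best match.
candidates <- c("ABILENE", "ALBA-GOLDEN", "ALDINE")
query <- "ALBA GOLDEN"

d <- drop(adist(query, candidates))
sim <- 1 - d / pmax(nchar(query), nchar(candidates))
best <- candidates[which.max(sim)]
best  # "ALBA-GOLDEN"
```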
- In fact, the rigor involved in cleaning the UIL data obliged me to completely hide it from the reader in my write-ups on the topic. ^
- Check out my post on the “DRY” principle and its application to R packages. Creating packages for actions that you perform across projects is a real time-saver! ^
- Actually, I also joined on year in the previous post (in order to make sure that school scores across years were “aligned” properly). ^