Introduction
As a follow-up to a previous
post about
correlations between Texas high school academic UIL competition scores
and SAT/ACT scores, I wanted to explore some of the “alternatives” to
joining the two data sets—which come from different sources. In that
post, I simply performed an inner_join()
using the school
and city
names as keys. While this decision ensures that the data integrity is
“high”, there are potentially many unmatched schools that could have
been included in the analysis with some sound “fuzzy
matching”.
For my exploration here, I’ll leverage David
Robinson’s excellent {fuzzyjoin}
package.
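Under the hood, the {fuzzyjoin} stringdist_*_join() functions pair up rows whose key strings are within a maximum edit distance of one another. The core idea can be sketched in base R with adist() (the school names below are toy examples, not the real data):

```r
# Sketch of the idea behind fuzzyjoin::stringdist_inner_join(): pair up
# keys whose Levenshtein (edit) distance is at most `max_dist`.
tea <- c("ABILENE", "ALBA-GOLDEN", "A M CONS")
uil <- c("ABILENE HS", "ALBA GOLDEN", "AM CONS")
max_dist <- 1

d <- adist(tea, uil)  # matrix of pairwise edit distances
matches <- which(d <= max_dist, arr.ind = TRUE)

# "ALBA-GOLDEN" ~ "ALBA GOLDEN" and "A M CONS" ~ "AM CONS" both match
# within a distance of 1; "ABILENE" vs. "ABILENE HS" (distance 3) does not.
data.frame(
  tea  = tea[matches[, "row"]],
  uil  = uil[matches[, "col"]],
  dist = d[matches]
)
```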
My dissatisfaction with the results of my various fuzzy-matching attempts ultimately led me to “settle” upon the exact join that I used in the original post.
Fuzzy Matching
In its raw form, the UIL data—scraped from https://www.hpscience.net/ (the only site that I found to have the academic competition scores)—is much “dirtier” than the SAT and ACT data from the TEA website. In particular, many of the city, school, and individual participant names are inconsistent across different webpages. 1 Even after eliminating much of the “self-inconsistency” of the UIL data and creating a suitably “clean” data set to use as the basis for my series of posts exploring the UIL data exclusively, there are more than a few differences in the names of schools between the UIL and TEA data.
To aid in my experimentation with “fuzzy joining”, I use a function from
my {tetidy}
package—join_fuzzily()
, which is essentially a wrapper
for the stringdist_*_join()
functions provided by the {fuzzyjoin}
package. With this function, I was able to evaluate different values for
max_dist
and different join columns to help me make a judgement
regarding the quality of joins. I primarily considered counts of joined
and unjoined rows computed with the summarise_join_stats()
function—also from my {tetidy}
package—to make a decision on how I
should join the UIL and SAT/ACT school data. 2
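The kinds of counts that summarise_join_stats() reports (n_joined, n_x_unjoined, etc.) can be illustrated with a hypothetical sketch. (This is just the gist, not the package's actual implementation.) After a full join, an NA in a key column marks a row unmatched on that side:

```r
# Hypothetical sketch (not tetidy's actual code) of the join counts:
# after a full join, an NA in a key column marks a row unmatched on that side.
key_uil <- c("A", "B", "C", NA)   # key from the x (UIL) side
key_tea <- c("A", "B", NA,  "D")  # key from the y (TEA) side

n_joined     <- sum(!is.na(key_uil) & !is.na(key_tea))
n_x_unjoined <- sum(!is.na(key_uil) &  is.na(key_tea))
n_y_unjoined <- sum( is.na(key_uil) & !is.na(key_tea))

c(n_joined = n_joined, n_x_unjoined = n_x_unjoined, n_y_unjoined = n_y_unjoined)
```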
What follows are the results of some of my experimentation. Of course, let’s first have a look at the data sets that we’re working with. (See my other posts to see how I created these data sets.)
schools_tea
test | year | school | district | county | city | math | reading | writing | english | science | total
---|---|---|---|---|---|---|---|---|---|---|---
ACT | 2011 | A C JONES | BEEVILLE ISD | BEE | CORPUS CHRISTI | 19 | 18 | NA | 17 | 19 | 18
ACT | 2011 | A J MOORE ACAD | WACO ISD | MCLENNAN | WACO | 19 | 18 | NA | 16 | 18 | 18
ACT | 2011 | A M CONS | COLLEGE STATION ISD | BRAZOS | HUNTSVILLE | 26 | 24 | NA | 23 | 24 | 24
ACT | 2011 | A MACEO SMITH HIGH SCHOOL | DALLAS ISD | DALLAS | RICHARDSON | 16 | 14 | NA | 13 | 15 | 14
ACT | 2011 | ABBOTT SCHOOL | ABBOTT ISD | HILL | WACO | 20 | 20 | NA | 19 | 21 | 20
ACT | 2011 | ABERNATHY | ABERNATHY ISD | HALE | LUBBOCK | 22 | 20 | NA | 19 | 21 | 21
ACT | 2011 | ABILENE | ABILENE ISD | TAYLOR | ABILENE | 21 | 21 | NA | 20 | 21 | 21
ACT | 2011 | ACADEMY | ACADEMY ISD | BELL | WACO | 24 | 23 | NA | 21 | 24 | 23
ACT | 2011 | ACADEMY HIGH SCHOOL | HAYS CISD | HAYS | AUSTIN | NA | NA | NA | NA | NA | NA
ACT | 2011 | ACADEMY OF CAREERS AND TECHNOLOGIE | ACADEMY OF CAREERS AND TECHNOLOGIE | BEXAR | SAN ANTONIO | 15 | 14 | NA | 12 | 14 | 14
ACT | 2011 | ACADEMY OF CREATIVE ED | NORTH EAST ISD | BEXAR | SAN ANTONIO | NA | NA | NA | NA | NA | NA
ACT | 2011 | ADRIAN SCHOOL | ADRIAN ISD | OLDHAM | AMARILLO | 19 | 18 | NA | 20 | 19 | 19
ACT | 2011 | ADVANTAGE ACADEMY | ADVANTAGE ACADEMY | DALLAS | RICHARDSON | 18 | 20 | NA | 19 | 16 | 18
ACT | 2011 | AGUA DULCE | AGUA DULCE ISD | NUECES | CORPUS CHRISTI | 21 | 19 | NA | 18 | 20 | 19
ACT | 2011 | AIM CENTER | VIDOR ISD | ORANGE | BEAUMONT | NA | NA | NA | NA | NA | NA
ACT | 2011 | AKINS | AUSTIN ISD | TRAVIS | AUSTIN | 19 | 17 | NA | 16 | 17 | 17
ACT | 2011 | ALAMO HEIGHTS | ALAMO HEIGHTS ISD | BEXAR | SAN ANTONIO | 25 | 24 | NA | 24 | 24 | 24
ACT | 2011 | ALBA-GOLDEN | ALBA-GOLDEN ISD | WOOD | KILGORE | 20 | 19 | NA | 18 | 20 | 19
ACT | 2011 | ALBANY JR-SR | ALBANY ISD | SHACKELFORD | ABILENE | 24 | 22 | NA | 21 | 22 | 22
ACT | 2011 | ALDINE | ALDINE ISD | HARRIS | HOUSTON | 19 | 17 | NA | 16 | 18 | 18
# of total rows: 15,073
schools_uil

school | city | complvl_num | score | year | conf | complvl | comp | advanced | n_state | n_bycomp | prnk | n_defeat | w
---|---|---|---|---|---|---|---|---|---|---|---|---|---
HASKELL | HASKELL | 13 | 616 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 1.00 | 7 | TRUE |
POOLVILLE | POOLVILLE | 13 | 609 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 0.86 | 6 | FALSE |
LINDSAY | LINDSAY | 17 | 553 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
PLAINS | PLAINS | 3 | 537 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
SAN ISIDRO | SAN ISIDRO | 32 | 534 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 4 | 1.00 | 3 | TRUE |
CANADIAN | CANADIAN | 7 | 527 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
GARDEN CITY | GARDEN CITY | 10 | 518 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 1.00 | 7 | TRUE |
WATER VALLEY | WATER VALLEY | 10 | 478 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 8 | 0.86 | 6 | FALSE |
GRUVER | GRUVER | 7 | 464 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.83 | 5 | FALSE |
YANTIS | YANTIS | 19 | 451 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
SHINER | SHINER | 27 | 450 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 9 | 1.00 | 8 | TRUE |
WEST TEXAS | STINNETT | 7 | 443 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.67 | 4 | FALSE |
HONEY GROVE | HONEY GROVE | 17 | 440 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 0.83 | 5 | FALSE |
LATEXO | LATEXO | 23 | 439 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
MUENSTER | MUENSTER | 17 | 436 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.67 | 4 | FALSE |
VAN HORN | VAN HORN | 1 | 436 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
SLOCUM | ELKHART | 23 | 415 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 10 | 0.89 | 8 | FALSE |
ERA | ERA | 17 | 415 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.50 | 3 | FALSE |
GOLDTHWAITE | GOLDTHWAITE | 15 | 413 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
NEWCASTLE | NEWCASTLE | 12 | 408 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
# of total rows: 27,359
For my first attempt at fuzzy matching, I tried joining using only
school
as a key column and setting max_dist = 1
. (Note that I do a
“full” join because my tetidy::summarise_join_stats()
function works
best with this kind of input.)
```r
library("tidyverse")

schools_uil_distinct <-
  schools_uil %>%
  distinct(school, city)

summ_schools_joinfuzzy_1 <-
  schools_tea %>%
  tetidy::join_fuzzily(
    schools_uil_distinct,
    mode = "full",
    max_dist = 1,
    cols_join = c("school"),
    suffix_x = "_tea",
    suffix_y = "_uil"
  ) %>%
  tetidy::summarise_join_stats(school_uil, school_tea) %>%
  select(-x, -y) %>%
  gather(metric, value)
summ_schools_joinfuzzy_1
```
metric | value |
---|---|
n_x | 12,367 |
n_y | 17,542 |
n_joined | 12,119 |
n_x_unjoined | 5,423 |
n_y_unjoined | 248 |
x_in_y_pct | 98 |
y_in_x_pct | 69 |
This kind of join represents a very “optimistic” or “over-zealous” implementation that likely results in matches that are not “true”. Nonetheless, it’s hard to really conclude anything about the quality of the join without comparing it to the results of another attempt.
Next, I tried the same join, only setting max_dist = 0
this time. Note
that this is really like “exact” string matching, meaning that there is
not really any fuzzy matching going on. (Nonetheless, it was worthwhile
to evaluate the counts of matches and mis-matches with a full join.)
```r
summ_schools_joinfuzzy_2 <-
  schools_tea %>%
  tetidy::join_fuzzily(
    schools_uil_distinct,
    mode = "full",
    max_dist = 0,
    cols_join = c("school"),
    suffix_x = "_tea",
    suffix_y = "_uil"
  ) %>%
  tetidy::summarise_join_stats(school_uil, school_tea) %>%
  select(-x, -y) %>%
  gather(metric, value)
summ_schools_joinfuzzy_2
```
metric | value |
---|---|
n_x | 11,666 |
n_y | 16,898 |
n_joined | 11,399 |
n_x_unjoined | 5,499 |
n_y_unjoined | 267 |
x_in_y_pct | 98 |
y_in_x_pct | 67 |
As we should expect, this results in a lower number of joins, but my
feeling is that the join is still too naive as a result of using only
one key column for joining—school
.
What about changing the key columns to school
and city
and setting
max_dist = 1
? (Note that I unite()
the two columns only for my
tetidy::summarise_join_stats()
function, which can only work with a
single “key” column. If only joining the data, the two join columns
could be kept separate.)
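For illustration, here is what that composite key looks like; tidyr::unite() joins the columns with "_" by default (toy data, shown in base R):

```r
# Toy illustration of the composite "school_city" key that unite() builds
# (tidyr::unite() separates the joined columns with "_" by default).
df <- data.frame(
  school = c("ABILENE", "ACADEMY"),
  city   = c("ABILENE", "WACO")
)
df$school_city <- paste(df$school, df$city, sep = "_")
df$school_city  # "ABILENE_ABILENE" "ACADEMY_WACO"
```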
```r
summ_schools_joinfuzzy_3 <-
  schools_tea %>%
  unite(school_city, school, city, remove = FALSE) %>%
  tetidy::join_fuzzily(
    schools_uil_distinct %>%
      unite(school_city, school, city, remove = FALSE),
    mode = "full",
    max_dist = 1,
    cols_join = c("school_city"),
    suffix_x = "_tea",
    suffix_y = "_uil"
  ) %>%
  tetidy::summarise_join_stats(school_city_uil, school_city_tea) %>%
  select(-x, -y) %>%
  gather(metric, value)
summ_schools_joinfuzzy_3
```
metric | value |
---|---|
n_x | 2,842 |
n_y | 15,076 |
n_joined | 1,697 |
n_x_unjoined | 13,379 |
n_y_unjoined | 1,145 |
x_in_y_pct | 60 |
y_in_x_pct | 11 |
This is probably the best option of the many alternatives that I
tested—including many not shown here that try different columns for
joining and increasing the value of max_dist
. In fact, the number of
rows that would be returned with an inner join and the same settings
(i.e. max_dist = 1
and the key columns)—as indicated by the value
n_joined
—is only slightly greater than that which you get when you
combine the two data sets with an inner_join()
on school
and city
,
as I ended up doing. 3
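For reference, that exact join amounts to the following (toy data; base R's merge() is the analogue of dplyr::inner_join(x, y, by = c("school", "city"))):

```r
# Base-R sketch of the exact inner join on school AND city (toy data);
# only rows agreeing on both keys survive.
tea <- data.frame(
  school = c("ABILENE", "AKINS"),
  city   = c("ABILENE", "AUSTIN"),
  math   = c(21, 19)
)
uil <- data.frame(
  school = c("ABILENE", "HASKELL"),
  city   = c("ABILENE", "HASKELL"),
  score  = c(553, 616)
)
joined <- merge(tea, uil, by = c("school", "city"))  # inner join
joined  # only the ABILENE row matches on both keys
```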
Conclusion
In some ways, I think this is not too unlike the classic “bias vs. variance” trade-off in machine learning. To what extent do you try to minimize error with your training data—and potentially overfit your model? Or do you try to make a simpler model that generalizes better to data outside the training set—potentially creating a model that is “underfit” because it is too simple?

In this case, the question is: to what extent do you try to maximize the number of observations joined—perhaps creating some false matches that compromise the data and the results? And, from the other perspective, if you exclude data solely because of poor matching, how can you guarantee that your data is “representative” and that your results are legitimate? It’s a tricky trade-off, as with many things in the realm of data analysis. And, as with most things, the best solution depends heavily on the context. In this case, I think erring on the side of caution with a “conservative” joining policy is a logical choice because lots of rows are matched anyway, and one can only assume that there is no underlying “trend” or “bias” in the schools that are mismatched. (That is, the unmatched schools are just as likely to be superior or inferior academically, relative to all other schools, as those that are matched.)
Something that I did not consider here but is certainly worthy of exploration is neural network text classification. With this approach, I could quantify the probability that one string is “equivalent” to another and choose the matches with the highest probabilities for any given string. I have a feeling that this kind of approach would be more successful, although it does not seem as easy to implement. (I may come back and visit this idea in a future post.)
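Short of a neural network, the same “best match” idea can be approximated by converting edit distance into a 0-to-1 similarity score and keeping the top-scoring candidate for each string (a base-R sketch with toy names):

```r
# Convert edit distance into a crude 0-1 similarity and keep the best match.
candidates <- c("ABILENE", "ALBA-GOLDEN", "ALDINE")
query <- "ALBA GOLDEN"

d <- drop(adist(query, candidates))
sim <- 1 - d / pmax(nchar(query), nchar(candidates))
best <- candidates[which.max(sim)]
best  # "ALBA-GOLDEN"
```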
- In fact, the rigor involved in cleaning the UIL data obliged me to completely hide it from the reader in my write-ups on the topic. ^
- Check out my post on the “DRY” principle and its application to R packages. Creating packages for actions that you perform across projects is a real time-saver! ^
- Actually, I also joined on year in the previous post (in order to make sure that school scores across years were “aligned” properly). ^