Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
I wanted to do a follow-up on my series of posts about Texas high school University Interscholastic League (UIL) academic competitions to more closely evaluate the relationship between the school performance in those competitions with school-wide SAT) and ACT scores. For those who may not be familiar with these tests, these are the two most popular standardized tests used for college admission in the United States.
In my introduction to that series, I stated the following: School-wide … scores on state- and national-standardized tests (e.g. the SAT) certainly are the most common measure of academic strength, but I think rankings by academic competitions may be more indicative.
Essentially, I was implying that the academic UIL scores may not correspond well, or at all, with standardized test scores. However, I did not attempt to prove this hypothesis, which is what I set out to do here. While I’m at it, I’ll show the code and provide some commentary to explain my process.
Data Collection
While I already have collected and cleaned the UIL data that I’ll need
by virtue of my work for my series of posts analyzing the UIL
competitions, I did
not retrieve data for standardized test scores. To my delight, the
Texas Education Agency’s website publishes
Texas high school SAT and ACT scores for the years 2011 through 2015.
The task of scraping from this source is a perfect use-case for the
super-handy {xml2}
and {rvest}
packages, as well the well-known
awesome {stringr}
and {purrr}
packages in the {tidyverse}
.
library("tidyverse") library("rlang") library("teplot") # Personal package. urls_tea <- "https://tea.texas.gov/acctres/sat_act_index.html" %>% xml2::read_html() %>% rvest::html_nodes(xpath = "//tr //td //a") %>% rvest::html_attr("href") %>% str_subset("^\\/acctres\\/[:alpha:]{3}_[Cc]ampus_[Dd]ata") urls_tea create_path_tea <- function(url_suffix, dir = "data-raw", ext = "csv") { if(!dir.exists(dir)) { dir.create(dir) } url_suffix %>% str_remove_all("acctres|\\/") %>% paste0(".", ext) %>% file.path(dir, .) } # NOTE(s): # + `urls_tea_dl` is actually the same as `url_tea` because `purrr::walk()` returns its first argument. # + `mode = "wb"` is important! Otherwise, the downloaded files have empty lines every other line # (due to the way that CR and LFs are handled. urls_tea_dl <- urls_tea %>% walk( ~download.file( url = paste0("https://tea.texas.gov/", .x), destfile = create_path_tea(url_suffix = .x), mode = "wb" ) )
Data Cleaning
Next, I bind the data from all of the downloaded files together and do some cleaning. I put these actions in function(s) because I plan on re-using them in future posts where I explore this data set in other ways.
One relatively significant choice that I make here is to only include
the data for the school-wide level (via the "All Students"
filtering
criteria), although data for different demographics within each school
is provided. The other data set that I am evaluating—the academic UIL
data— does not have demographci-specific information, so I want to treat
the two set as “equally” as possible.
Additionally, in order to better understand the resulting data set, the
reader should be made aware of some of the details of the tests. The SAT
has math
, reading
, and writing
sections, each having minimum and
maximum scores of 200 and 800, meaning that the total
can range from
600 to 2400. The ACT has math
, reading
, english
, and science
sections, each having a minimum and maximum score of 1 and 36, combined
for a single compos
score also ranging from 1 to 36. To eliminate
duplicate columns representing the same underlying “thing”. I don’t
distinguish the math
and reading
section scores for each test in
separate columns, I rename the ACT’s compos
score to total
,
following the convention used for the SAT’s cumulative score. The other
sections—writing
for the SAT and english
and science
for the ACT—
are not really analogous to sections in the other test, so they are
filled with NA
s appropriately.
Finally, for the interEsted reader, there are some details regarding the code implementation that I document in comments (both for explaining actions for myself and for the reader).
import_tea_data <- function(path, rgx_grp) { res <- path %>% readr::read_csv() %>% rename_all(funs(tolower)) if(!is.null(rgx_grp)) { res <- res %>% filter(group %>% str_detect(rgx_grp)) } res <- res %>% select( matches( "^group$|name$|math|reading|writing|total|english|science|compos" ) ) res } import_tea_data_cleanly <- function(urls, rgx_grp, ...) { res <- urls %>% create_path_tea(...) %>% tibble(path = .) %>% mutate( test = stringr::str_extract(path, "([Ss][Aa][Tt])|([Aa][Cc][Tt])") %>% toupper(), year = stringr::str_extract(path, "[:digit:]+") %>% as.integer() ) %>% mutate(contents = purrr::map(path, ~import_tea_data(.x, rgx_grp = rgx_grp))) %>% unnest() %>% # NOTE: No longer need this columns(s) any more. select(-path) %>% mutate_at(vars(total), funs(ifelse(test == "ACT", compos, .))) %>% # NOTE: No longer need this column(s) any more. select(-compos) %>% # NOTE: Rearranging score columns in a more logical fashion. select(-total, everything(), total) %>% # NOTE: Renaming "important" columns. rename(school = campname, district = distname, county = cntyname, city = regnname) %>% mutate_if(is.character, funs(str_replace_all(., "=|\"", ""))) %>% mutate_at(vars(school, district, county, city), funs(toupper)) %>% # NOTE: Some county names are truncated and end with COUN or COUNT. # (The max seems to be 18 characters). # Fortunately, ther are no county names with COUN in their names, so the following # regular expression is sufficient. mutate_at(vars(county), funs(str_remove_all(., "\\s+COUN.*$"))) %>% # NOTE: Remove all HS/H S at the end of school names, as well as ampersands. # This seems to improve join percentages with other data sets. mutate_at(vars(school), funs(str_remove_all(., "([H]\\s*[S]$)|(\\s+\\&)") %>% str_trim())) %>% # NOTE: This is (try to) to resolve duplicates in raw data. # group_by_at(vars(matches("test|year|school|district|county|city"))) %>% # summarise_all(funs(max(., na.rm = TRUE))) %>% # ungroup() %>% arrange(test, year, school) res } schools_tea <- urls_tea %>% import_tea_data_cleanly(rgx_grp = "All Students") schools_tea
test | year | school | district | county | city | math | reading | writing | english | science | total |
---|---|---|---|---|---|---|---|---|---|---|---|
ACT | 2011 | A C JONES | BEEVILLE ISD | BEE | CORPUS CHRISTI | 19 | 18 | NA | 17 | 19 | 18 |
ACT | 2011 | A J MOORE ACAD | WACO ISD | MCLENNAN | WACO | 19 | 18 | NA | 16 | 18 | 18 |
ACT | 2011 | A M CONS | COLLEGE STATION ISD | BRAZOS | HUNTSVILLE | 26 | 24 | NA | 23 | 24 | 24 |
ACT | 2011 | A MACEO SMITH HIGH SCHOOL | DALLAS ISD | DALLAS | RICHARDSON | 16 | 14 | NA | 13 | 15 | 14 |
ACT | 2011 | ABBOTT SCHOOL | ABBOTT ISD | HILL | WACO | 20 | 20 | NA | 19 | 21 | 20 |
ACT | 2011 | ABERNATHY | ABERNATHY ISD | HALE | LUBBOCK | 22 | 20 | NA | 19 | 21 | 21 |
ACT | 2011 | ABILENE | ABILENE ISD | TAYLOR | ABILENE | 21 | 21 | NA | 20 | 21 | 21 |
ACT | 2011 | ACADEMY | ACADEMY ISD | BELL | WACO | 24 | 23 | NA | 21 | 24 | 23 |
ACT | 2011 | ACADEMY HIGH SCHOOL | HAYS CISD | HAYS | AUSTIN | NA | NA | NA | NA | NA | NA |
ACT | 2011 | ACADEMY OF CAREERS AND TECHNOLOGIE | ACADEMY OF CAREERS AND TECHNOLOGIE | BEXAR | SAN ANTONIO | 15 | 14 | NA | 12 | 14 | 14 |
ACT | 2011 | ACADEMY OF CREATIVE ED | NORTH EAST ISD | BEXAR | SAN ANTONIO | NA | NA | NA | NA | NA | NA |
ACT | 2011 | ADRIAN SCHOOL | ADRIAN ISD | OLDHAM | AMARILLO | 19 | 18 | NA | 20 | 19 | 19 |
ACT | 2011 | ADVANTAGE ACADEMY | ADVANTAGE ACADEMY | DALLAS | RICHARDSON | 18 | 20 | NA | 19 | 16 | 18 |
ACT | 2011 | AGUA DULCE | AGUA DULCE ISD | NUECES | CORPUS CHRISTI | 21 | 19 | NA | 18 | 20 | 19 |
ACT | 2011 | AIM CENTER | VIDOR ISD | ORANGE | BEAUMONT | NA | NA | NA | NA | NA | NA |
ACT | 2011 | AKINS | AUSTIN ISD | TRAVIS | AUSTIN | 19 | 17 | NA | 16 | 17 | 17 |
ACT | 2011 | ALAMO HEIGHTS | ALAMO HEIGHTS ISD | BEXAR | SAN ANTONIO | 25 | 24 | NA | 24 | 24 | 24 |
ACT | 2011 | ALBA-GOLDEN | ALBA-GOLDEN ISD | WOOD | KILGORE | 20 | 19 | NA | 18 | 20 | 19 |
ACT | 2011 | ALBANY JR-SR | ALBANY ISD | SHACKELFORD | ABILENE | 24 | 22 | NA | 21 | 22 | 22 |
ACT | 2011 | ALDINE | ALDINE ISD | HARRIS | HOUSTON | 19 | 17 | NA | 16 | 18 | 18 |
1 # of total rows: 15,073 |
EDA: Year-to-Year Correlations
First, before evaluating the primary concern at hand—the relationship
between the academic UIL scores and the SAT/ACT scores (available in the
schools_tea
data created above)—I want to verify that there is some
non-trivial relationship among the scores for a given school on a given
test across years. (I would be surprised if this were not shown to be
true.)
schools_tea_cors_byyear <- schools_tea %>% distinct(test, year, school, .keep_all = TRUE) %>% filter(!is.na(total)) %>% unite(test_school, test, school) %>% widyr::pairwise_cor( feature = test_school, item = year, value = total ) %>% rename(year1 = item1, year2 = item2, cor = correlation) schools_tea_cors_byyear %>% filter(year1 <= year2)
year1 | year2 | cor |
---|---|---|
2011 | 2012 | 0.80 |
2011 | 2013 | 0.76 |
2012 | 2013 | 0.86 |
2011 | 2014 | 0.69 |
2012 | 2014 | 0.78 |
2013 | 2014 | 0.83 |
2011 | 2015 | 0.64 |
2012 | 2015 | 0.74 |
2013 | 2015 | 0.78 |
2014 | 2015 | 0.86 |
school | city | complvl_num | score | year | conf | complvl | comp | advanced | n_state | n_bycomp | prnk | n_defeat | w |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HASKELL | HASKELL | 13 | 616 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 1.00 | 7 | TRUE |
POOLVILLE | POOLVILLE | 13 | 609 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 0.86 | 6 | FALSE |
LINDSAY | LINDSAY | 17 | 553 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
PLAINS | PLAINS | 3 | 537 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
SAN ISIDRO | SAN ISIDRO | 32 | 534 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 4 | 1.00 | 3 | TRUE |
CANADIAN | CANADIAN | 7 | 527 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
GARDEN CITY | GARDEN CITY | 10 | 518 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 8 | 1.00 | 7 | TRUE |
WATER VALLEY | WATER VALLEY | 10 | 478 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 8 | 0.86 | 6 | FALSE |
GRUVER | GRUVER | 7 | 464 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.83 | 5 | FALSE |
YANTIS | YANTIS | 19 | 451 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
SHINER | SHINER | 27 | 450 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 9 | 1.00 | 8 | TRUE |
WEST TEXAS | STINNETT | 7 | 443 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.67 | 4 | FALSE |
HONEY GROVE | HONEY GROVE | 17 | 440 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 0.83 | 5 | FALSE |
LATEXO | LATEXO | 23 | 439 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
MUENSTER | MUENSTER | 17 | 436 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.67 | 4 | FALSE |
VAN HORN | VAN HORN | 1 | 436 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
SLOCUM | ELKHART | 23 | 415 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 10 | 0.89 | 8 | FALSE |
ERA | ERA | 17 | 415 | 2011 | 1 | District | Calculator Applications | 0 | 0 | 7 | 0.50 | 3 | FALSE |
GOLDTHWAITE | GOLDTHWAITE | 15 | 413 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 7 | 1.00 | 6 | TRUE |
NEWCASTLE | NEWCASTLE | 12 | 408 | 2011 | 1 | District | Calculator Applications | 1 | 0 | 10 | 1.00 | 9 | TRUE |
1 # of total rows: 27,359 |
Now let’s try to evaluate whether or not year-to-year correlations also exist with this data set.
Importantly, some choice about how to quantify performance needs to be
made. As I discussed in my long-form series of posts exploring the UIL
academic data, the
evaluation of performance is somewhat subjective. Should we use number
of times a school advanced to the next level of competition in a given
year? (Note that there are three competition levels—District
,
Region
, and State
.) What about the number the number of other
schools it “defeated” in head-to-head competitions? In that separate
analysis, I made the choice to use the percentile rank (prnk
) of the
school’s placings across all competition levels for a given competition
type (comp
). I believe this measure bests represent a school’s quality
of performance (where a higher value indicates better performance). As I
stated there when explaining my choice to use percent rank for
identifying “dominant” individual“, ”I choose to use percent rank—which
is a always a value between 0 and 1—because it inherently accounts for
the wide range of number of competitors across all competitions. (For
this context, a percent rank of 1 corresponds to the highest score in a
given competition, and, conversely, a value of 0 corresponds to the
lowest score.)”
Aside from this decision regarding performance evaluation in academic
UIL competitions, note that I treat the competition type (comp
) in
schools_uil
as analogous to the test
variable indicating SAT or ACT
score in the schools_tea
data set. For those who have not read through
my UIL analysis, note that scores for five different competition types
was collected—Calculator Applications
, Computer Science
,
Mathematics
, Number Sense
, and Science
.
schools_uil_cors_byyear <- schools_uil %>% select(year, school, city, comp, prnk) %>% group_by(year, school, city, comp) %>% summarise(prnk_sum = sum(prnk, na.rm = TRUE)) %>% ungroup() %>% unite(comp_school, comp, school) %>% widyr::pairwise_cor( feature = comp_school, item = year, value = prnk_sum ) %>% rename(year1 = item1, year2 = item2, cor = correlation) schools_uil_cors_byyear %>% filter(year1 <= year2)
table class=“table” style=“width: auto !important; margin-left: auto; margin-right: auto;“>
year1
year2
cor
We can see that correlations among years do exist, as we would expect. The strength of the correlations decrease for years that are farther apart, which is also what we might expect.
“Final” Correlation Analysis
So, at this point, I have set myself up to do that which I set out to do—evaluate the relationship between the academic UIL competition scores and the national SAT/ACT scores.
In order to put the two sets of data on “equal grounds”, I only evaluate
math scores. In particular, I filter comp
in the UIL data to just the
mathematically-based competitions—Calculator Applications
,
Mathematics
, and Number Sense
—excluding Science
and
Computer Science
. And, for the SAT/ACT data, I select only the math
score, which is available fore both tests, excluding the total
and
reading
scores also available for each and the writing
, english
,
and science
scores available for one or the other. (Perhaps the ACT’s
science
score could be compared to the Science
UIL scores, but I
choose not to do so here.)
schools_uil_math <- schools_uil %>% filter(str_detect(comp, "Calculator|Math|Number")) %>% group_by(year, school, city) %>% summarise(prnk_sum = sum(prnk, na.rm = TRUE)) %>% ungroup() %>% # NOTE: "Renormalize" `prnk_sum`. mutate(math_prnk = percent_rank(prnk_sum)) %>% select(-prnk_sum) schools_uil_math
year | school | city | math_prnk |
---|---|---|---|
2011 | ABBOTT | ABBOTT | 0.82 |
2011 | ABERNATHY | ABERNATHY | 0.59 |
2011 | ABILENE | ABILENE | 0.00 |
2011 | ACADEMY | KINGSVILLE | 0.70 |
2011 | ACADEMY | LITTLE RIVER | 0.81 |
2011 | ACADEMY OF FINE ARTS | FORT WORTH | 0.55 |
2011 | ADAMS | DALLAS | 0.47 |
2011 | ADAMSON | DALLAS | 0.42 |
2011 | ADRIAN | ADRIAN | 0.36 |
2011 | ADVANTAGE ACADEMY | WAXAHACHE | 0.22 |
2011 | AGUA DULCE | AGUA DULCE | 0.57 |
2011 | ALAMO HEIGHTS | SAN ANTONIO | 0.70 |
2011 | ALBA-GOLDEN | ALBA | 0.72 |
2011 | ALBANY | ALBANY | 0.95 |
2011 | ALEDO | ALEDO | 0.89 |
2011 | ALEXANDER | LAREDO | 0.56 |
2011 | ALICE | ALICE | 0.10 |
2011 | ALLEN | ALLEN | 0.85 |
2011 | ALPINE | ALPINE | 0.57 |
2011 | ALTO | ALTO | 0.19 |
1 # of total rows: 5,596 |
schools_tea_math <- schools_tea %>% select(test, year, school, city, math) %>% filter(!is.na(math)) %>% group_by(test) %>% mutate(math_prnk = percent_rank(math)) %>% ungroup() %>% group_by(year, school, city) %>% summarise_at(vars(math_prnk), funs(mean(., na.rm = TRUE))) %>% ungroup() schools_tea_math
year | school | city | math_prnk |
---|---|---|---|
2011 | A C JONES | CORPUS CHRISTI | 0.51 |
2011 | A J MOORE ACAD | WACO | 0.24 |
2011 | A M CONS | HUNTSVILLE | 0.97 |
2011 | A MACEO SMITH HIGH SCHOOL | RICHARDSON | 0.03 |
2011 | A+ ACADEMY | RICHARDSON | 0.14 |
2011 | ABBOTT SCHOOL | WACO | 0.72 |
2011 | ABERNATHY | LUBBOCK | 0.63 |
2011 | ABILENE | ABILENE | 0.60 |
2011 | ACADEMY | WACO | 0.91 |
2011 | ACADEMY HIGH SCHOOL | AUSTIN | 0.32 |
2011 | ACADEMY OF CAREERS AND TECHNOLOGIE | SAN ANTONIO | 0.03 |
2011 | ACADEMY OF CREATIVE ED | SAN ANTONIO | 0.48 |
2011 | ADRIAN SCHOOL | AMARILLO | 0.27 |
2011 | ADVANTAGE ACADEMY | RICHARDSON | 0.25 |
2011 | AGUA DULCE | CORPUS CHRISTI | 0.66 |
2011 | AKINS | AUSTIN | 0.25 |
2011 | ALAMO HEIGHTS | SAN ANTONIO | 0.95 |
2011 | ALBA-GOLDEN | KILGORE | 0.52 |
2011 | ALBANY JR-SR | ABILENE | 0.83 |
2011 | ALDINE | HOUSTON | 0.23 |
1 # of total rows: 7,730 |
schools_join_math <- schools_tea_math %>% rename_at(vars(matches("^math")), funs(paste0("tea_", .))) %>% inner_join(schools_uil_math %>% rename_at(vars(matches("^math")), funs(paste0("uil_", .))), by = c("year", "school", "city")) %>% select(year, school, city, matches("math")) schools_join_math
year | school | city | tea_math_prnk | uil_math_prnk |
---|---|---|---|---|
2011 | ABILENE | ABILENE | 0.60 | 0.00 |
2011 | ALAMO HEIGHTS | SAN ANTONIO | 0.95 | 0.70 |
2011 | AMERICAS | EL PASO | 0.31 | 0.69 |
2011 | ANDERSON | AUSTIN | 0.98 | 0.64 |
2011 | ANDRESS | EL PASO | 0.15 | 0.56 |
2011 | ARLINGTON HEIGHTS | FORT WORTH | 0.63 | 0.49 |
2011 | AUSTIN | AUSTIN | 0.89 | 0.22 |
2011 | AUSTIN | EL PASO | 0.22 | 0.68 |
2011 | AUSTIN | HOUSTON | 0.09 | 0.85 |
2011 | BEL AIR | EL PASO | 0.17 | 0.49 |
2011 | BERKNER | RICHARDSON | 0.80 | 0.50 |
2011 | BOSWELL | FORT WORTH | 0.83 | 0.22 |
2011 | BOWIE | AUSTIN | 0.97 | 0.70 |
2011 | BOWIE | EL PASO | 0.15 | 0.15 |
2011 | BRANDEIS | SAN ANTONIO | 0.79 | 0.39 |
2011 | BREWER | FORT WORTH | 0.48 | 0.19 |
2011 | BURBANK | SAN ANTONIO | 0.22 | 0.76 |
2011 | BURGES | EL PASO | 0.57 | 0.70 |
2011 | CALALLEN | CORPUS CHRISTI | 0.47 | 0.93 |
2011 | CANUTILLO | EL PASO | 0.18 | 0.81 |
1 # of total rows: 699 |
schools_join_math_cors <- schools_join_math %>% select(-year) %>% select_if(is.numeric) %>% corrr::correlate() schools_join_math_cors
rowname | tea_math_prnk | uil_math_prnk |
---|---|---|
tea_math_prnk | NA | 0.36 |
uil_math_prnk | 0.36 | NA |
So, this correlation value—0.36—seems fairly low. At face value, it certainly does not provide any basis to claim that schools that do well in the academic UIL competitions also do well with SAT/ACT tests. However, perhaps if I used a different methodology, the result would be different. Other metrics used to quantify academic UIL performance could be tested in some kind of sensitivity analysis.
EDA: Year-to-Year Correlations, Cont.
While I won’t do any kind of rigorous second evaluation here, I do want
to try to quantify the impact of the “missing” data dropped due to
mismatched school names. If all possible data had been used, would the
final correlation value have increased (or decreased) with more (or
less) data? Although finding direct answer to this question is
impossible, we can evaluate the difference in the year-to-year
correlations of scores from the schools that are joined with the
correlations calculated for all in “unjoined” schools_tea
and
schools_uil
data sets. If we find that there are large discrepancies
(one way or the other), then we may have some reason to believe that the
0.36 number found above is misleading.
To perform this task, I create a couple of intermediary data sets, as well as some functions.
schools_postjoin_math_tidy <- schools_join_math %>% unite(school_city, school, city) %>% gather(metric, value, matches("prnk")) pairwise_cor_f1 <- function(data, which = c("tea", "uil")) { which <- match.arg(which) data %>% filter(metric %>% str_detect(which)) %>% # filter_at(vars(value), all_vars(!is.nan(.))) %>% widyr::pairwise_cor( feature = school_city, item = year, value = value ) %>% rename(year1 = item1, year2 = item2, cor = correlation) %>% mutate(source = which %>% toupper()) } pairwise_cor_f2 <- function(data, which = c("tea", "uil")) { which <- match.arg(which) col <- data %>% names() %>% str_subset("math") data %>% unite(school_city, school, city) %>% rename(value = !!rlang::sym(col)) %>% mutate(source = which %>% toupper()) %>% widyr::pairwise_cor( feature = school_city, item = year, value = value ) %>% rename(year1 = item1, year2 = item2, cor = correlation) %>% mutate(source = which %>% toupper()) } schools_postjoin_math_cors_byyear <- bind_rows( schools_postjoin_math_tidy %>% pairwise_cor_f1("tea"), schools_postjoin_math_tidy %>% pairwise_cor_f1("uil") ) schools_prejoin_math_cors_byyear <- bind_rows( schools_tea_math %>% pairwise_cor_f2("tea"), schools_uil_math %>% pairwise_cor_f2("uil") ) schools_math_cors_byyear_diffs <- schools_postjoin_math_cors_byyear %>% inner_join(schools_prejoin_math_cors_byyear, by = c("year1", "year2", "source"), suffix = c("_join", "_unjoin")) %>% mutate(cor_diff = cor_join - cor_unjoin)
Ok, enough of the data munging—let’s review the results!
schools_math_cors_byyear_diffs_wide <- schools_math_cors_byyear_diffs %>% filter(year1 <= year2) %>% select(-matches("join$")) %>% unite(year_pair, year1, year2) %>% spread(source, cor_diff) schools_math_cors_byyear_diffs_wide
year_pair | TEA | UIL |
---|---|---|
2011_2012 | 0.09 | -0.02 |
2011_2013 | 0.05 | -0.04 |
2011_2014 | 0.11 | -0.06 |
2011_2015 | 0.24 | -0.03 |
2012_2013 | 0.01 | 0.00 |
2012_2014 | 0.06 | 0.04 |
2012_2015 | 0.15 | 0.00 |
2013_2014 | -0.01 | -0.03 |
2013_2015 | 0.08 | 0.00 |
2014_2015 | 0.05 | -0.01 |
Note that the correlations in the joined data are a bit “stronger”—in the sense that they are more positive—among the TEA SAT/ACT data, although not in any kind of magnificent way. Additionally, the differences for the UIL data are trivial. Thus, we might say that the additional data that could have possibly increased (or decreased) the singular correlation value found—0.36—would not have changed much at all.
Conclusion
So, my initial inclination in my analysis of academic UIL competitions) seems correct—there is no significant relationship between Texas high school academic competition scores and standardized test scores (for math, between 2011 and 2015). And, with that question answered, I intend to explore this rich data set in other ways in future blog posts.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.