Introduction
During a recent negotiation of an informed consent form for use in a clinical trial, the opposing lawyer and I skirmished over the applicability of the Genetic Information Nondiscrimination Act of 2008, commonly known as GINA. Specifically, the opposing lawyer thought that guidance issued by the U.S. Office for Human Research Protections in 2009 was now outdated, in part because enforcement efforts were erratic. The argument was primarily driven by policy, rather than data.
Being a data-driven guy, I wanted to see whether the data supported the argument advanced by the other lawyer. Fortunately, the U.S. Equal Employment Opportunity Commission (EEOC), which is responsible for administering GINA complaints, maintains statistics regarding GINA claims and resolutions. I’m not great at making sense of numbers in a table, so I thought this presented the perfect opportunity to rvest some data!
libraries <- c("tidyverse", "rvest", "magrittr")
lapply(libraries, require, character.only = TRUE)
Data Scraping
In standard rvest fashion, we’ll read a URL, extract the table containing the GINA enforcement statistics, and then do some data cleaning. Once we read the table and gather all of the annual results into key/value pairs of year and value, we get the following results:
url <- "https://www.eeoc.gov/eeoc/statistics/enforcement/genetic.cfm"

GINA.resolutions <- read_html(url) %>%
  html_nodes("table") %>%
  extract2(1) %>%
  html_table(trim = TRUE, fill = TRUE, header = TRUE)

names(GINA.resolutions)[1] <- "metric" # Top left table cell is blank, will throw errors
names(GINA.resolutions) <- gsub("FY (.+)", "\\1", names(GINA.resolutions)) # Remove FY from year so we can convert to numeric

GINA.resolutions <- GINA.resolutions %>%
  filter(metric != "") %>% # Remove percentage rows
  filter(metric != "Resolutions By Type") %>% # Remove blank line
  gather(year, value, 2:9) %>% # short and wide data to tall and skinny
  mutate(
    year = as.integer(year),
    value = gsub("[\\$\\%]", "", value)
  ) %>%
  mutate(
    value = as.numeric(value)
  ) %>%
  as_tibble() # as.tibble() is deprecated in current versions of tibble

GINA.resolutions

## # A tibble: 88 x 3
##    metric                      year value
##    <chr>                      <int> <dbl>
##  1 Receipts                    2010   201
##  2 Resolutions                 2010    56
##  3 Settlements                 2010     3
##  4 Withdrawals w/Benefits      2010     2
##  5 Administrative Closures     2010    11
##  6 No Reasonable Cause         2010    38
##  7 Reasonable Cause            2010     2
##  8 Successful Conciliations    2010     1
##  9 Unsuccessful Conciliations  2010     1
## 10 Merit Resolutions           2010     7
## # ... with 78 more rows
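To make the reshaping step easier to follow without hitting the EEOC site, here is a minimal sketch on a toy table. Every value in `toy_wide` is invented for illustration, and the object names are my own placeholders; only the `filter()`, `gather()`, and `gsub()` steps mirror the cleaning described above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the scraped table. All values are invented for illustration.
toy_wide <- data.frame(
  metric = c("Receipts", "", "Monetary Benefits (Millions)*"),
  `2010` = c("10", "25.0%", "$1.5"),
  `2011` = c("20", "30.5%", "$2.0"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

toy_long <- toy_wide %>%
  filter(metric != "") %>%          # drop the unlabeled percentage row
  gather(year, value, -metric) %>%  # one row per metric/year pair
  mutate(
    year  = as.integer(year),
    value = as.numeric(gsub("[$%]", "", value))  # strip "$" and "%" before converting
  )

toy_long
```

The two labeled rows and two year columns become four rows, one per metric and year, with `value` stored as a number rather than a string.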
Claim Numbers over Time
Now that we have the data in a format we can use, we’ll look at the volume of claims and resolutions over time:
GINA.resolutions %>%
  filter(metric == "Receipts" | metric == "Resolutions") %>%
  ggplot(aes(year, value, color = metric)) +
  geom_line() +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Claims and Resolutions, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "",
    y = ""
  ) +
  scale_color_brewer("", palette = "Paired")
GINA Claim Resolutions
One of the arguments made by the opposing lawyer was that the Obama administration was pushing GINA enforcement, and that the Trump administration hates the law and won’t enforce it. We can look at the resolution types to test this hypothesis:
GINA.resolutions %>%
  filter(metric != "Receipts" & metric != "Resolutions") %>%
  ggplot(aes(year, value)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Resolutions by Type, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "",
    y = ""
  )
The resolution type that jumped most markedly in 2017 was “unsuccessful conciliation.” A conciliation is where the EEOC “attempt[s] to achieve a just resolution of all violations found and to obtain agreement that the respondent will eliminate the unlawful employment practice and provide appropriate affirmative relief.” 29 C.F.R. § 1601.24. The summary statistics provided by the EEOC do not explain why this jump occurred.
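Reading a jump off a faceted plot is a judgment call, so one way to put a number on it is a year-over-year percentage change per metric. The sketch below is my own illustration rather than part of the original analysis: the counts are hypothetical, and the metric labels "A" and "B" are placeholders. It assumes the same metric/year/value layout as GINA.resolutions.

```r
library(dplyr)

# Hypothetical counts in the metric/year/value layout; these are not EEOC figures.
toy <- data.frame(
  metric = rep(c("A", "B"), each = 3),
  year   = rep(2015:2017, times = 2),
  value  = c(10, 10, 12, 4, 5, 10),
  stringsAsFactors = FALSE
)

yoy <- toy %>%
  group_by(metric) %>%
  arrange(year, .by_group = TRUE) %>%   # make sure lag() sees years in order
  mutate(pct_change = 100 * (value - lag(value)) / lag(value)) %>%
  ungroup()

# Rank the metrics by how much they changed in the final year.
yoy %>%
  filter(year == 2017) %>%
  arrange(desc(pct_change))
```

Metrics with very small counts will show large percentage swings from a change of one or two charges, so a ranking like this is best read alongside the raw numbers.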
Finally, I thought it was useful to plot all the resolution types together to show relative numbers:
GINA.resolutions %>%
  filter(metric != "Receipts" & metric != "Resolutions" & metric != "Monetary Benefits (Millions)*") %>%
  ggplot(aes(year, value, color = metric)) +
  geom_line() +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Resolutions by Type, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "",
    y = ""
  ) +
  # scale_y_sqrt() +
  scale_color_brewer("Resolution Type", palette = "Paired")
Conclusion
In all, I didn’t find the opposing lawyer’s argument particularly compelling in light of the data from President Trump’s first year in office. However, the first month of 2017 was President Obama’s last in office, and there was a flurry of activity by many regulatory agencies. It wouldn’t surprise me if EEOC also participated in a high volume of lame-duck activity, and a lot of activity in January 2017 could have skewed the annual results. Monthly statistics would be nice but didn’t appear to be readily available. The goal with any R project is for it to be repeatable with additional data, so it will be interesting to see what the data from FY2018 shows.
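On the repeatability point, the cleaning code selects the year columns by position, which assumes the table always has exactly eight fiscal years. A minimal sketch of one way around that is below. The helper name `tidy_resolutions()` is hypothetical, the toy tables are invented, and it swaps `gather()` for tidyr's newer `pivot_longer()`, so treat it as an illustration rather than the method used above.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not from the original analysis): reshape a scraped
# table without hardcoding how many fiscal-year columns it contains.
tidy_resolutions <- function(raw) {
  names(raw)[1] <- "metric"
  names(raw) <- gsub("FY (.+)", "\\1", names(raw))
  raw %>%
    filter(!metric %in% c("", "Resolutions By Type")) %>%
    pivot_longer(-metric, names_to = "year", values_to = "value") %>%
    mutate(
      year  = as.integer(year),
      value = as.numeric(gsub("[$%]", "", value))
    )
}

# Invented tables: the second one adds a fiscal year, as a future update might.
two_years <- data.frame(
  V1        = c("Receipts", "Resolutions"),
  `FY 2010` = c("10", "5"),
  `FY 2011` = c("20", "15"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
three_years <- two_years
three_years[["FY 2012"]] <- c("30", "25")

nrow(tidy_resolutions(two_years))
nrow(tidy_resolutions(three_years))
```

Because every column except `metric` is pivoted, an extra fiscal-year column is picked up without editing the code.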
This wasn’t a particularly complicated coding project. In fact, this writeup took me longer to produce than writing the actual code and coming to conclusions about whether GINA is on its last legs or not. Even so, I thought it was a good example of how data science can be used to inform solutions to simple as well as complex problems.