Site icon R-bloggers

Displaying increasing U.S. eligible voter diversity with a slopegraph in R

[This article was first published on Posts on R Lover ! a programmer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
  • I maintain small package CGPfunctions on Github as well as CRAN. I know a few people use it because occasionally I get feedback (usually accompanied by feature requests 🙂 ). I am absolutely sure that it’s in no danger of “going viral” but the discipline of maintaining does me some good at least, and it’s nice to know a few people find it helpful.

    Over the summer I managed to make a few tweaks to various functions and add some new features and functionality. This post documents a few of those as well as touching on an interesting election topic (no worries no political lobbying will be found here).

    The genesis of the data is The Pew Research Center a “reliable source” if you will. The data and the report titled “The Changing Racial and Ethnic Composition of the U.S. Electorate” were published September 23rd. The site has lots of great graphics and analysis but I was drawn to one set of tabular data that cried out for a visual. The table tracks by state changes in the composition of eligible voters from 2000 through 2010 to 2018.

    I wanted to use my newslopegraph function to make a more informative visual of the data, while showcasing some of the recent tweaks and existing functionality.

    Setup and data

    First step is to grab the latest version of the package from GitHub. I’ll push the changes to CRAN in the not too distant future.

    devtools::install_github("ibecav/CGPfunctions",
                             build_vignettes = TRUE,
                             upgrade = "ask",
                             force = TRUE)

    Next load a few libraries we need.

    library(dplyr)
    library(tidyr)
    library(readr)
    library(CGPfunctions)

    I was too lazy to do a lot of web scraping so I simply cut and paste from the web page into a plain text file then eliminated the commas and percent signs and cleaned up the header line. I actually did the work in a text editor but a series of gsub would have worked. It’s a wide table with one row per state plus D.C.. After reading it in with readr::read_tsv I checked the math (2552000 / 3713000 = 69%) so since we are interested in using percents the columns that contain the word “share” are what we want.

    data <- readr::read_tsv(file="vote_eligible.txt")
    ## 
    ## ── Column specification ────────────────────────────────────────────────────────
    ## cols(
    ##   State = col_character(),
    ##   Total_eligible_voter_2018 = col_double(),
    ##   White_eligible_voter_pop_2018 = col_double(),
    ##   White_share_eligible_voters_2018 = col_double(),
    ##   White_eligible_voter_pop_2010 = col_double(),
    ##   White_share_eligible_voters_2010 = col_double(),
    ##   White_eligible_voter_pop_2000 = col_double(),
    ##   White_share_eligible_voters_2000 = col_double(),
    ##   change_00_18 = col_double()
    ## )
    data
    ## # A tibble: 51 x 9
    ##    State Total_eligible_… White_eligible_… White_share_eli… White_eligible_…
    ##    <chr>            <dbl>            <dbl>            <dbl>            <dbl>
    ##  1 Alab…          3713000          2552000               69          2522000
    ##  2 Alas…           535000           351000               66           354000
    ##  3 Ariz…          5042000          3192000               63          2968000
    ##  4 Arka…          2219000          1724000               78          1704000
    ##  5 Cali…         25869000         11750000               45         11950000
    ##  6 Colo…          4147000          3110000               75          2777000
    ##  7 Conn…          2614000          1917000               73          1982000
    ##  8 Dela…           721000           496000               69           476000
    ##  9 Dist…           527000           220000               42           181000
    ## 10 Flor…         15342000          9325000               61          8799000
    ## # … with 41 more rows, and 4 more variables:
    ## #   White_share_eligible_voters_2010 <dbl>,
    ## #   White_eligible_voter_pop_2000 <dbl>,
    ## #   White_share_eligible_voters_2000 <dbl>, change_00_18 <dbl>

    So we’ll select(State, contains("share")) to eliminate what we don’t need and pivot from wide to long with tidyr::pivot_longer for each of the columns that starts with “White_share” we’ll pivot and call the values “percent” (which they are). Since “White share of eligible voters 2018” is more than a little ungainly we’ll use names_pattern = “([0-9]{4})” to extract just the four digit year.

    data <- data %>% 
       select(State, contains("share")) %>% 
       tidyr::pivot_longer(cols = starts_with("White_share"), 
                           names_to = "year", 
                           values_to = "percent",
                           names_pattern = "([0-9]{4})"
                           )
    
    data
    ## # A tibble: 153 x 3
    ##    State    year  percent
    ##    <chr>    <chr>   <dbl>
    ##  1 Alabama  2018       69
    ##  2 Alabama  2010       71
    ##  3 Alabama  2000       73
    ##  4 Alaska   2018       66
    ##  5 Alaska   2010       70
    ##  6 Alaska   2000       73
    ##  7 Arizona  2018       63
    ##  8 Arizona  2010       69
    ##  9 Arizona  2000       75
    ## 10 Arkansas 2018       78
    ## # … with 143 more rows

    Since we’re interested in diversity, not the percent white eligible voters we’ll flip things around.

    data <- data %>% mutate(percent = as.integer(100 - percent))
    
    data
    ## # A tibble: 153 x 3
    ##    State    year  percent
    ##    <chr>    <chr>   <int>
    ##  1 Alabama  2018       31
    ##  2 Alabama  2010       29
    ##  3 Alabama  2000       27
    ##  4 Alaska   2018       34
    ##  5 Alaska   2010       30
    ##  6 Alaska   2000       27
    ##  7 Arizona  2018       37
    ##  8 Arizona  2010       31
    ##  9 Arizona  2000       25
    ## 10 Arkansas 2018       22
    ## # … with 143 more rows

    First plot

    Now that the data is in long format, we can visualize it. We’ll add an informative Title, SubTitle and Caption and let the defaults take over.

    newggslopegraph(dataframe = data, 
                    Times = year, 
                    Measurement = percent, 
                    Grouping = State,
                    Title = "Eligible non-white voters per state 2000- 2018",
                    SubTitle = "Expressed as a percent of total eligible voters",
                    Caption = "Retrieved from https://www.pewresearch.org/2020/09/23/the-changing-racial-and-ethnic-composition-of-the-u-s-electorate/ on Oct 23, 2020")

    Not bad. The function’s defaults work very hard to deconflict labeling, while making sure things are proportional, and properly aligned. The gaps you see between lines are proportional to the data, and the ordering of the left and right axis is correct. New Hampshire surpassed Maine and Vermont in terms of diversity. We can do a lot better though.

    One easy improvement is to make use of the Data.label argument. Normally we simply use the Measurement as a string. But Data.label let’s you make it any character value you like. As a simple example let’s add a percent sign.

    To draw our viewers eyes to big changes in slope (percent difference between) 2000 and 2018 let’s make the lines reflect that. Increases in diversity greater than 10% will get one color from the viridis scale and those less than -5% the other end of the spectrum. States with no strong trend get a shade very close to white. I avoid red and green because of colorblindeness.

    data$datalabel <- paste0(data$percent, "%")
    
    custom_colors <- 
       tidyr::pivot_wider(data,
                       id_cols = State,
                       names_from = year,
                       values_from = percent) %>%
       mutate(difference = `2018` - `2000`) %>%
       mutate(trend = case_when(
          difference >= 10 ~ "#404788ff",
          difference <= -5 ~ "#fde725ff",
          TRUE ~ "snow2"
       )
       ) %>%
       select(State, trend) %>%
       tibble::deframe()
    
    newggslopegraph(dataframe = data, 
                    Times = year, 
                    Measurement = percent, 
                    Grouping = State,
                    Data.label = datalabel,
                    Title = "Eligible non-white voters per state 2000- 2018",
                    SubTitle = "Expressed as a percent of total eligible voters",
                    Caption = "Retrieved from https://www.pewresearch.org/2020/09/23/the-changing-racial-and-ethnic-composition-of-the-u-s-electorate/ on Oct 23, 2020",
                    LineThickness = .5,
                    LineColor = custom_colors
    )

    Just because you can doesn’t mean you should

    We could keep tweaking but quite frankly 51 lines is probably too many let’s back up and focus on a more select number. I’m going to proceed much faster this time. Reread the original tsv file. Add a mutate to find the overall change from 2000 to 2018 call it changes. Filter for just the big change states changes >= 5 | changes <= -10. Then the same old steps to pivot mutate and plot.

    data <- readr::read_tsv(file="vote_eligible.txt")
    data <- data %>% 
       mutate(changes = White_share_eligible_voters_2018 - White_share_eligible_voters_2000) %>%
       filter(changes >= 5 | changes <= -10) %>%
       select(State, contains("share")) %>% 
       tidyr::pivot_longer(cols = starts_with("White_share"), 
                           names_to = "year", 
                           values_to = "percent",
                           names_pattern = "([0-9]{4})"
                           ) %>% 
       mutate(percent = as.integer(100 - percent))
    
    
    data$datalabel <- paste0(data$percent, "%")
    
    custom_colors <- 
       tidyr::pivot_wider(data,
                       id_cols = State,
                       names_from = year,
                       values_from = percent) %>%
       mutate(difference = `2018` - `2000`) %>%
       mutate(trend = case_when(
          difference >= 10 ~ "#404788ff",
          difference <= -5 ~ "#fde725ff",
          TRUE ~ "snow2"
       )
       ) %>%
       select(State, trend) %>%
       tibble::deframe()
    
    newggslopegraph(dataframe = data, 
                    Times = year, 
                    Measurement = percent, 
                    Grouping = State,
                    Data.label = datalabel,
                    Title = "Eligible non-white voters per state 2000- 2018",
                    SubTitle = "Expressed as a percent of total eligible voters",
                    Caption = "Retrieved from https://www.pewresearch.org/2020/09/23/the-changing-racial-and-ethnic-composition-of-the-u-s-electorate/ on Oct 23, 2020",
                    LineThickness = .5,
                    LineColor = custom_colors
    )

    Done

    Hope you enjoyed the post. Comments always welcomed. Especially please let me know if you actually use the tools and find them useful.

    Extra credit for me for not expressing a political view at any point. Let the data speak.

    Chuck

    This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

    To leave a comment for the author, please follow the link and comment on their blog: Posts on R Lover ! a programmer.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.