
Dr Drang and the Electoral College

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers.]

The other week, the Internet’s most beloved creepy snowman wrote a blog post where he showed how to use a little Python to group states by their number of electoral college votes to make a table like this:

| Electors |           States           | PopPct|  ECPct|
|:--------:|:--------------------------:|------:|------:|
|    3     | AK, DE, DC, ND, SD, VT, WY |  1.61%|  3.90%|
|    4     | HI, ID, ME, MT, NH, RI, WV |  3.04%|  5.20%|
|    5     |           NE, NM           |  1.22%|  1.86%|
|    6     |   AR, IA, KS, MS, NV, UT   |  5.60%|  6.69%|
|    7     |           CT, OK           |  2.29%|  2.60%|
|    8     |         KY, LA, OR         |  3.98%|  4.46%|
|    9     |           AL, SC           |  3.13%|  3.35%|
|    10    |     CO, MD, MN, MO, WI     |  8.93%|  9.29%|
|    11    |       AZ, IN, MA, TN       |  8.49%|  8.18%|
|    12    |             WA             |  2.33%|  2.23%|
|    13    |             VA             |  2.60%|  2.42%|
|    14    |             NJ             |  2.77%|  2.60%|
|    15    |             MI             |  3.00%|  2.79%|
|    16    |           GA, NC           |  6.53%|  5.95%|
|    17    |             OH             |  3.52%|  3.16%|
|    19    |           IL, PA           |  7.62%|  7.06%|
|    28    |             NY             |  5.84%|  5.20%|
|    30    |             FL             |  6.75%|  5.58%|
|    40    |             TX             |  9.11%|  7.43%|
|    54    |             CA             | 11.63%| 10.04%|

In fact, he wrote two versions of the post. The first one used Python’s Pandas library to get the data in shape, but some of the code Dr Drang wrote was still built around an explicit loop:

```python
# Print out the summary table
print('State EC   States   Population   Pop Pct   Electors   EC Pct')
for k, v in dfg:
  print(f'    {k:2d}       {v.State.count():2d}    {v.Population.sum():11,d}\
    {v.PopPct.sum():6.2%}     {v.Electors.sum():3d}     {v.ECPct.sum():6.2%}')
```

Subsequently, after a conversation with ondaiwai on Mastodon, he rewrote the code to take advantage of Pandas’ agg function to abstract that loop away. The new code is shown in his follow-up post. What agg does is manage the group-and-calculate process in a more elegant and readable manner, one that also generalizes more easily to larger numbers of groups and additional kinds of calculation. To be clear, there is nothing wrong with loops as a construction. And, indeed, somewhere in the underlying implementation you will find the loops are still there, beavering away. Rather, the idea is that by introducing a bit of abstraction we end up with a somewhat more expressive and composable way of working with our data.
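To see that shift in miniature on the R side, here is a toy sketch of my own (using R’s built-in mtcars data, nothing to do with Dr Drang’s table): first an explicit loop over groups, then the same calculation with base R’s aggregate() managing the grouping for us.

```r
# Loop version: find each group by hand and fill in a result vector
cyl_values <- sort(unique(mtcars$cyl))
means <- numeric(length(cyl_values))
for (i in seq_along(cyl_values)) {
  means[i] <- mean(mtcars$mpg[mtcars$cyl == cyl_values[i]])
}

# Abstracted version: aggregate() handles the group-and-calculate step
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```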

At the moment I’m teaching a course that introduces some related ideas to students. I teach it in R, so here I’m going to present essentially the same solution in R that Dr Drang wrote in Python. I’ll use the Tidyverse package because that’s what I’m teaching with. My goal here, to forestall any tedious remarks from people less intrinsically reasonable than Dr Drang, is not to show a better or faster way to do things, or to get into some boring Python vs R vs Why Don’t You Just Rewrite It In Rust argument. Rather, the point is that in both cases we use the same abstractions, often with the same names. In R’s terms, we want to move from working with loops to doing things in, loosely speaking, a vectorized manner that’s more amenable to being expressed as the application of a sequence of functions.

The first and most important abstraction that Pandas and R in general share here is a kind of data structure: the dataframe. R was originally written to do data analysis so it comes with that concept built-in. Pandas (and also Polars) add it to Python. A dataframe is just a rectangular table where the columns can be of different types. That is, unlike matrices or arrays or whatever, one column might be made of integers, but the next of character strings, and the next of logical values or date-times, and so on. This is the way tables of data present themselves to us “naturally”, as it were. Having something built-in to your programming language that directly represents data in that way is terrifically useful.
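For instance, here is a tiny hand-made dataframe (my own toy example) with a character column, an integer column, and a date column sitting happily side by side:

```r
# Base R's data.frame holds columns of different types in one table
data.frame(
  state    = c("Alaska", "Delaware"),                # character
  electors = c(3L, 3L),                              # integer
  admitted = as.Date(c("1959-01-03", "1787-12-07"))  # date
)
```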

The second kind of abstraction, the one that led Dr Drang to rewrite his original post, is some sort of concept of grouping and then acting on pieces of data in order to get something new out the other side. The general idea is sometimes called a split-apply-combine strategy for analysis. We have a table of data; we split it into pieces, e.g. by one of the categorical variables in the table, or according to some other criterion; we apply a function or operation to each of the pieces, which produces some result for each piece; and then we combine the results back into a new table.
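You can see the three steps most clearly if you do them by hand. Here is a minimal base-R sketch, again using the built-in mtcars data, that makes each step explicit:

```r
# Split-apply-combine by hand: mean mpg within each cylinder count
pieces  <- split(mtcars$mpg, mtcars$cyl)  # split: one vector per group
results <- sapply(pieces, mean)           # apply mean() to each piece...
results                                   # ...and combine into a named vector
#>        4        6        8 
#> 26.66364 19.74286 15.10000
```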

In this case we start with a table of state-level information about how many electors each state has and what its population is. We want to end up with an “elector-level” table where the rows are numbers of electors and the values (total population, etc.) are aggregated from all the states with that many electors. So we split our original state-level table into groups based on the number of electors. Seven states have three electors each. Two states have sixteen electors. And so on. Then we calculate some new quantities within these groupings: for example, the total population living in “3 Elector” states, and the share of the overall population that represents. Then we combine the results into a new table.

Dr Drang did this in Python. Here’s one way to do it in R. For ease of comparison, I’ll mostly replicate Dr Drang’s formatting and labeling.

First we read in the data, and create the PopPct and ECPct columns:

```r
library(tidyverse)

# Import the CSV and create PopPct and ECPct
df <- read_csv("states.csv", show_col_types = FALSE) |>
  mutate(PopPct = Population/sum(Population),
         ECPct = Electors/sum(Electors))

df
#> # A tibble: 51 × 6
#>    State                Abbrev Population Electors  PopPct   ECPct
#>    <chr>                <chr>       <dbl>    <dbl>   <dbl>   <dbl>
#>  1 Alabama              AL        5108468        9 0.0153  0.0167 
#>  2 Alaska               AK         733406        3 0.00219 0.00558
#>  3 Arizona              AZ        7431344       11 0.0222  0.0204 
#>  4 Arkansas             AR        3067732        6 0.00916 0.0112 
#>  5 California           CA       38965193       54 0.116   0.100  
#>  6 Colorado             CO        5877610       10 0.0175  0.0186 
#>  7 Connecticut          CT        3617176        7 0.0108  0.0130 
#>  8 Delaware             DE        1031890        3 0.00308 0.00558
#>  9 District of Columbia DC         678972        3 0.00203 0.00558
#> 10 Florida              FL       22610726       30 0.0675  0.0558 
#> # ℹ 41 more rows
```

We don’t really need the whole Tidyverse to do what we want here, but we’ll just import it for convenience. We put the data in an object called df. It’s a dataframe. (A “tibble” is the tidyverse’s name for a dataframe with some nice printing methods and other features added.) In this way of writing R we use the pipe, |>, to move from step to step. Conceptually it’s just like a Unix pipe, which is to say you read it as “and then”. You do a thing, and then pass that on down the pipeline to the next function, which does another thing.
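For example, with nothing but base R:

```r
# Take a vector, and then take square roots, and then add them up
c(4, 9, 16) |> sqrt() |> sum()
#> [1] 9
```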

One feature of R in this step is that we can write PopPct = Population/sum(Population) to get the population proportion for each row. That’s because the calculation is vectorized: the Population in the numerator will be each element of the Population vector (i.e. each row of the table). The sum(Population) in the denominator will be the sum of all the elements of that vector. This way of writing things is very handy and, again, is a “natural” way to express many kinds of calculation that arise when we are doing arithmetic on tables.
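Here is the same pattern in a toy example:

```r
# Each element of pop is divided by the single number sum(pop)
pop <- c(10, 20, 70)
pop / sum(pop)
#> [1] 0.1 0.2 0.7
```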

Next, we’ll do the initial aggregation to group by Electors and calculate the summary measures. We could have done this all in one breath above, just by adding more steps to the pipeline, but I’ve broken it out for ease of reading. This is the “Quick and Dirty Table” from Dr Drang’s post. We start with our df, group by Electors and then, within each group, count up how many states there are, sum how many electors there are, and also sum the remaining proportion columns. The across(where()) bit establishes a test for each column (is it numeric?), and sum is the function that’s applied to each column that passes the test. These calculations are done within the groups we set in the previous step of the pipeline.

```r
df |>
  group_by(Electors) |>
  summarize(State_n = n(),
            Electors_n = sum(Electors),
            across(where(is.numeric), sum))

#> # A tibble: 20 × 6
#>    Electors State_n Electors_n Population PopPct  ECPct
#>       <dbl>   <int>      <dbl>      <dbl>  <dbl>  <dbl>
#>  1        3       7         21    5379033 0.0161 0.0390
#>  2        4       7         28   10196485 0.0304 0.0520
#>  3        5       2         10    4092750 0.0122 0.0186
#>  4        6       6         36   18766882 0.0560 0.0669
#>  5        7       2         14    7671000 0.0229 0.0260
#>  6        8       3         24   13333261 0.0398 0.0446
#>  7        9       2         18   10482023 0.0313 0.0335
#>  8       10       5         50   29902889 0.0893 0.0929
#>  9       11       4         44   28421431 0.0849 0.0818
#> 10       12       1         12    7812880 0.0233 0.0223
#> 11       13       1         13    8715698 0.0260 0.0242
#> 12       14       1         14    9290841 0.0277 0.0260
#> 13       15       1         15   10037261 0.0300 0.0279
#> 14       16       2         32   21864718 0.0653 0.0595
#> 15       17       1         17   11785935 0.0352 0.0316
#> 16       19       2         38   25511372 0.0762 0.0706
#> 17       28       1         28   19571216 0.0584 0.0520
#> 18       30       1         30   22610726 0.0675 0.0558
#> 19       40       1         40   30503301 0.0911 0.0743
#> 20       54       1         54   38965193 0.116  0.100
```

You might have noticed that the Abbrev column disappeared in the summarize() step. This is because we didn’t explicitly use it to create any column of the summary table, so there’s no place for it. This would also be true of any other columns the df might have that we weren’t interested in summarizing just now.
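Here is a tiny illustration of the same behavior (a made-up example, with the tidyverse already loaded from above):

```r
# The note column isn't used to create any summary, so it doesn't survive
tibble(g = c("a", "a", "b"), x = 1:3, note = c("p", "q", "r")) |>
  group_by(g) |>
  summarize(total = sum(x))
#> # A tibble: 2 × 2
#>   g     total
#>   <chr> <int>
#> 1 a         3
#> 2 b         3
```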

Finally, we want to make the nice Markdown version of the table. It’s much the same as before. The difference is that we’ll use the state abbreviations (this time it’s the full State names that disappear). During the summary step, when the rows are grouped by Electors, the paste0 function collapses them into a single string whose elements are separated by “, ”. We also calculate the grouped proportions as before in PopPct and ECPct. Once we’ve summarized the table, we need to format those proportions as percentages rounded to two decimal places. We apply the label_percent() function to those two columns, which does the job for us. We have to write that using R’s anonymous-function notation, \(x) label_percent(accuracy = 0.01)(x), because label_percent() is a function factory: it’s a function that can be given arguments, but whose output is itself a function that accepts an argument (in this case x, a vector of numbers to format as percentages). This is why it’s written in the form label_percent(accuracy = 0.01)(x). Finally, we grab the kable() function from the knitr package to take the cleaned-up dataframe as input and output the actual Markdown.
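If the function-factory idea is unfamiliar, it may help to see the two calls pulled apart:

```r
# Calling label_percent() formats nothing by itself; it returns a
# formatter function, which we then call on a vector of proportions
fmt <- scales::label_percent(accuracy = 0.01)
fmt(c(0.0161, 0.0390))
#> [1] "1.61%" "3.90%"
```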

```r
df |>
  group_by(Electors) |>
  summarize(States = paste0(Abbrev, collapse = ", "), 
            across(c("PopPct", "ECPct"), sum)) |>
  mutate(across(
    c("PopPct", "ECPct"),
    \(x) scales::label_percent(accuracy = 0.01)(x)
  )) |>
  knitr::kable(align = "ccrr")
```


| Electors |           States           | PopPct|  ECPct|
|:--------:|:--------------------------:|------:|------:|
|    3     | AK, DE, DC, ND, SD, VT, WY |  1.61%|  3.90%|
|    4     | HI, ID, ME, MT, NH, RI, WV |  3.04%|  5.20%|
|    5     |           NE, NM           |  1.22%|  1.86%|
|    6     |   AR, IA, KS, MS, NV, UT   |  5.60%|  6.69%|
|    7     |           CT, OK           |  2.29%|  2.60%|
|    8     |         KY, LA, OR         |  3.98%|  4.46%|
|    9     |           AL, SC           |  3.13%|  3.35%|
|    10    |     CO, MD, MN, MO, WI     |  8.93%|  9.29%|
|    11    |       AZ, IN, MA, TN       |  8.49%|  8.18%|
|    12    |             WA             |  2.33%|  2.23%|
|    13    |             VA             |  2.60%|  2.42%|
|    14    |             NJ             |  2.77%|  2.60%|
|    15    |             MI             |  3.00%|  2.79%|
|    16    |           GA, NC           |  6.53%|  5.95%|
|    17    |             OH             |  3.52%|  3.16%|
|    19    |           IL, PA           |  7.62%|  7.06%|
|    28    |             NY             |  5.84%|  5.20%|
|    30    |             FL             |  6.75%|  5.58%|
|    40    |             TX             |  9.11%|  7.43%|
|    54    |             CA             | 11.63%| 10.04%|

And we’re done. To reiterate, all I wanted to do here was show how basically the same abstractions—specifically the idea of a dataframe, and the idea of splitting it up into subunits and applying functions to those units—are useful when working with data, which is why they’re present in the data-focused toolkits of different programming languages.
