Site icon R-bloggers

K-Means Redistricting

[This article was first published on David B. Sparks » r, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

U.S. Congressional districts are today drawn with the aim of maximizing the electoral advantage of the state’s majority party, subject to some constraints, including compactness (which can be measured in numerous ways) and a “one person, one vote” standard. What if, instead of minimizing population variance across districts, we aimed to minimize the mean distance between each resident and their district center?

To do so would be to employ something very much like k-means clustering, and produces some interesting results.

Using the population and latitude and longitude coordinates of the centroid of each (2000) census tract (a block-level reproduction was deemed too computationally intensive for the present purposes), I produced a geospatial k-means clustering for several states. Each tract was represented by its centroid as a point, weighted by population (which required a custom function, as the default kmeans() function in R does not appear to permit weighted points.

Since each run of the k-means algorithm begins with a random set of points, I replicated the function several thousand times, attempting to find a maximum inverse Herfindahl-Hirschman index of district population — the “effective number of districts,” as it were. For North Carolina, as shown below, I was able to find a maximum END of 12.17 for thirteen districts, which is a fairly even distribution of population.

Click to enlarge

Interestingly, there is still substantially wider variation in population than would be permitted under the current system. The least populous district houses fewer than 400,000 individuals, and the most populous, nearly a million. These figures are much more extreme than the extant least- (Wyoming) and most- (Montana) populous districts.

Population by district:

#  Population
1  398492
4  398896
8  423710
10 525860
2  533812
13 537417
3  618040
6  662092
11 676221
12 767249
7  785668
5  786448
9  935408

However, the district boundaries (here hastily drawn by use of chull()) are not characterized by the ragged edges and elongated shapes often seen in the existing plans.

I was interested in what the k-means-based plan would do to district partisanship, and decided to use population density as a rough proxy for local party affiliation. The distribution of population per square mile for each North Carolina census tract is shown below, with a vertical line indicating the median.

I decided to characterize any tract with greater-than-median population density as Democratic, and less-dense tracts as Republican. This resulted in the following proportion of Democrats residing in each district as plotted above:

#  % Dem.
1  0.253
4  0.265
10 0.336
8  0.350
6  0.383
7  0.474
13 0.510
3  0.589
11 0.615
9  0.628
12 0.671
2  0.673
5  0.837

As the table indicates, full turnout under such a plan would result in the election of 6 Republicans and 7 Democrats. Below, I plot “Democratic” tracts in blue and “Republican” tracts in red, scaled according to their population. Urban centers are easily identifiable. Note the difference between this plan and the current actual plan, which draws a single elongated district (the twelfth) parallel to Interstate 85.

Click to enlarge

Below, I replicate the same process for the state of Texas, generating 32 districts. One problem with the k-means algorithm is that larger states, or those with greater variance in population density, tend to generate districts with wide variations in population and inequalities of representation. The Texas plan below depicts a district with fewer than 200,000 residents and one with over 2 million. The Effective Number of Districts (maximum after 100 attempts) is a mere 21.58. Interestingly, the the district “partisanship” split is 22/10 majority Republican/Democrat — not far from the current 20/12 split. In this simulated redistricting, there are 10 districts in which the majority of residents live in higher-than-the-state-median density areas: four each in Houston and Dallas-Fort Worth, one each around San Antonio and Austin.

Click to enlarge

The slideshow below depicts the incremental steps of the weighted k-means algorithm toward convergence around alternate districts for Ohio, beginning with set of random centers, and eventually minimizing collective distances from local centroids.

View this document on Scribd

Finally, I used the same algorithm to investigate what a the continental United States would look like if states were partitioned according to the k-means rule. Clicking on the image below will bring you to an interactive, scalable map of the U.S. with 48 alternate states and inferred partisanship. Instead of initializing with random centers, I started the k-means algorithm with the population centroids of the actual states, and allowed the algorithm to converge to a minimizing partition. Many of these alternative states are more compact but familiar versions of the originals, although this new plan does realize Plunkitt’s Fondest Dream.

Click to enlarge


To leave a comment for the author, please follow the link and comment on their blog: David B. Sparks » r.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.