Advent 8

Brian A. Fannin

8 years ago

[This article was first published on PirateGrunt, and kindly contributed to R-bloggers].

I saved the data from the last post, which shows the percentage of Republican voters in each county. Alongside that column, I also have figures from the 2010 census, covering things like age, ethnicity, urbanization and home ownership. The census figures are raw population counts, so they’ll need to be converted to proportions before they can be used in any statistical inference. That means reading through the obscure column names in the data frame. The UScensus2010 package documents them well.

I’ll note two things about the ethnic categories. 1) In pretty much every society on earth, race is a very sensitive, divisive issue with a great deal of history. I’ll add the hopefully needless caveat that although ethnicity may be used in a statistical model, that shouldn’t suggest that it places any constraints on a person’s behavior or ability. 2) Perhaps in connection with point 1, the US Census collects very detailed data on race. I’m not going to try to sort through all of the nuance captured in the data. Instead, I’ll create a single data element: the percentage of the population which identifies as white in one of the several categories where it is possible to do so.

Everybody cool? Good, let’s do some math.

dfNC = read.csv("./Data/NC2012andCensus.csv", stringsAsFactors = FALSE)

dfNC$PctUrban = dfNC$P0020002 / dfNC$P0020001
dfNC$PctWhite = dfNC$P0060002 / dfNC$P0060001
dfNC$PctVacantHousing = dfNC$H0030003 / dfNC$H0030001
dfNC$PctRent = dfNC$H0110004 / dfNC$H0110001
dfNC$PctLargeFamily = (dfNC$H0130007 + dfNC$H0130008) / dfNC$H0130001
dfNC$PctOver65 = (dfNC$H0170009 + dfNC$H0170010 + dfNC$H0170011 + dfNC$H0170019 + dfNC$H0170020 + dfNC$H0170021) / dfNC$H0170001
dfNC$PctWithChildren = (dfNC$H0190003 + dfNC$H0190006) / dfNC$H0190001

keepCols = c("NAME10", "PctRed", "PctUrban", "PctWhite", "PctVacantHousing", "PctRent", "PctLargeFamily", "PctOver65", "PctWithChildren")
keepCols %in% colnames(dfNC)

## [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
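The FALSE in the second position means that PctRed is not among the column names of dfNC, so any later reference to it will fail. A minimal sketch of how the missing names could be listed explicitly follows. It uses a toy data frame rather than the real file, and dfToy, wantCols and missingCols are hypothetical names introduced only for illustration.

```r
# Toy stand-in for dfNC; the real CSV is not reproduced here (assumption)
dfToy = data.frame(NAME10 = c("A", "B"),
                   PctWhite = c(0.5, 0.7),
                   stringsAsFactors = FALSE)

# Columns we would like to keep
wantCols = c("NAME10", "PctRed", "PctWhite")

# setdiff() returns the requested names that are absent from the data frame
missingCols = setdiff(wantCols, colnames(dfToy))

if (length(missingCols) > 0) {
  message("Missing columns: ", paste(missingCols, collapse = ", "))
}
```

Unlike the logical vector from `%in%`, this names the absent columns directly, which is easier to read when the list is long.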

dfTest = dfNC[, colnames(dfNC) %in% keepCols]
colnames(dfTest)[1] = "CountyName"
head(dfTest)

##   CountyName PctUrban  PctWhite PctVacantHousing   PctRent PctLargeFamily
## 1   Franklin        0 0.5382669       0.11021760 0.2762110     0.03615098
## 2   Carteret        0 0.9812092       0.38433735 0.2101313     0.01369863
## 3   Davidson        0 0.7584716       0.09285993 0.3788390     0.02279570
## 4    Sampson        0 0.5320380       0.12406271 0.2814177     0.06575875
## 5   Johnston        0 0.7830243       0.10009878 0.3261308     0.03841932
## 6   Caldwell        0 0.5942054       0.15770609 0.5231678     0.03617021
##   PctOver65 PctWithChildren
## 1 0.2509304       0.3200425
## 2 0.3894325       0.2093933
## 3 0.1380645       0.3849462
## 4 0.2408560       0.3809339
## 5 0.2510062       0.3369923
## 6 0.2897872       0.3229787

Unfortunately, the urbanization column isn’t available for this data. That’s a shame, as I would imagine that it’s very predictive. Later, I’ll try to find it elsewhere or create a proxy variable by computing population density.
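As a rough sketch of what a density proxy might look like, the snippet below divides population by land area. The data frame and the columns TotalPop and LandAreaSqMi are made up for illustration and are not taken from the census extract used in this post.

```r
# Hypothetical toy data: TotalPop in persons, LandAreaSqMi in square miles
dfDensity = data.frame(
  CountyName = c("A", "B", "C"),
  TotalPop = c(100000, 20000, 500000),
  LandAreaSqMi = c(500, 400, 250),
  stringsAsFactors = FALSE
)

# Persons per square mile as a stand-in for urbanization
dfDensity$PopDensity = dfDensity$TotalPop / dfDensity$LandAreaSqMi

# Density tends to be right-skewed, so a log scale is one common choice
# before using it as a regression predictor
dfDensity$LogDensity = log(dfDensity$PopDensity)
```

Whether the raw or the logged value works better as a predictor would have to be checked against the actual county data.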

plot(dfNC$PctWhite, dfNC$PctRed, pch = 19, ylab = "% Red", xlab = "% White")

plot(dfNC$PctVacantHousing, dfNC$PctRed, pch = 19, ylab = "% Red", xlab = "% Vacant")

plot(dfNC$PctRent, dfNC$PctRed, pch = 19, ylab = "% Red", xlab = "% Rent")

plot(dfNC$PctOver65, dfNC$PctRed, pch = 19, ylab = "% Red", xlab = "% Over 65")

plot(dfNC$PctWithChildren, dfNC$PctRed, pch = 19, ylab = "% Red", xlab = "% w/Children")

plot(dfNC$PctLargeFamily, dfNC$PctRed, pch = 19, ylab = "% Red", xlab = "% Large family")

fitAll = lm(PctRed ~ PctWhite + PctVacantHousing + PctRent + PctOver65 + PctWithChildren + PctLargeFamily, data = dfNC)

## Error in eval(expr, envir, enclos): object 'PctRed' not found

summary(fitAll)

## Error in summary(fitAll): object 'fitAll' not found

The plots would suggest that counties with a large share of rentals are less apt to vote Republican. However, neither the sign of the relationship nor its significance is what we’d expect when we include all variables. I’m going to change the column so that it’s the percentage of owned homes, drop a few of the insignificant variables, and try the fit again.

dfNC$PctOwn = 1 - dfNC$PctRent

fitTwo = lm(PctRed ~ PctWhite + PctOver65 + PctWithChildren + PctOwn, data = dfNC)

## Error in eval(expr, envir, enclos): object 'PctRed' not found

summary(fitTwo)

## Error in summary(fitTwo): object 'fitTwo' not found

Ownership continues to show up as insignificant, which is just odd. One final fit with only that variable.

fitOwnership = lm(PctRed ~ PctOwn, data = dfNC)

## Error in eval(expr, envir, enclos): object 'PctRed' not found

summary(fitOwnership)

## Error in summary(fitOwnership): object 'fitOwnership' not found

OK. On its own it’s fine, but it gets lost when mixed with the other variables.
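One possible reason a predictor can look fine alone and then fade in a larger model is correlation with the other predictors. The sketch below illustrates that general effect on simulated data only. It is not the North Carolina data, and nothing here confirms that this is what happens in the fits above. The names x1, x2 and y are arbitrary.

```r
set.seed(42)
n = 100

# x2 is nearly a copy of x1, so the two predictors are highly correlated
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.1)

# The response depends on x1 only
y = x1 + rnorm(n)

# x2 alone picks up the signal through its correlation with x1
fitAlone = lm(y ~ x2)

# With both in the model, the standard error of x2 is inflated
fitBoth = lm(y ~ x1 + x2)

seAlone = summary(fitAlone)$coefficients["x2", "Std. Error"]
seBoth = summary(fitBoth)$coefficients["x2", "Std. Error"]

summary(fitAlone)$coefficients
summary(fitBoth)$coefficients
```

Comparing seAlone with seBoth shows how much less precisely the x2 coefficient is estimated once its near-duplicate is in the model.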

What does all of this mean? It means that, for this set of explanatory variables and construction of data, and absent any significant demographic shifts, we can probably expect North Carolina to remain red. An influx of non-white or younger residents could alter that. I’ll emphasize that this is a very superficial treatment of complex phenomena. In a later post, I’ll augment the basic census data with other data elements. Further, I’ll try to fetch data for other states to see how the relationships observed here play out elsewhere in the country.

This is also the part where I point out that Nate Silver and Andrew Gelman, two people who are reliably smarter than I am, have written about political forecasting in a way that I can’t hope to replicate. I’ve read their stuff and it’s tremendous. You should do the same.

citation("UScensus2010county")

## 
## To cite UScensus2000 in publications use:
## 
##   Zack W. Almquist (2010). US Census Spatial and Demographic Data
##   in R: The UScensus2000 Suite of Packages. Journal of Statistical
##   Software, 37(6), 1-31. URL http://www.jstatsoft.org/v37/i06/.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {US Census Spatial and Demographic Data in {R}: The {UScensus2000} Suite of Packages},
##     author = {Zack W. Almquist},
##     journal = {Journal of Statistical Software},
##     year = {2010},
##     volume = {37},
##     number = {6},
##     pages = {1--31},
##     url = {http://www.jstatsoft.org/v37/i06/},
##   }

sessionInfo()

## R version 3.2.3 (2015-12-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  base     
## 
## loaded via a namespace (and not attached):
## [1] magrittr_1.5  formatR_1.2.1 tools_3.2.3   stringi_1.0-1 knitr_1.12   
## [6] methods_3.2.3 stringr_1.0.0 evaluate_0.8
