[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
on election days in the united states, the news media peppers its coverage with quick, dirty exit polls that allow them to make coarse statements like, “x% of demographic group y voted for candidate z.” the american national election studies are the scientific community’s response to those haphazard polls, for those of us who care more about having the number right than having the number right away. for every presidential election since dewey defeated truman and every off-year congressional election since eisenhower’s first term, the anes has released a data set so that professional researchers, political junkies, and partisan hacks could seriously figure out who voted for whom. and if any of you out there are personally running for office, consider this your best source of information to view the demographics and behavior of split-ticket voters.
although it might lag behind the published microdata, berkeley’s sda (survey documentation and analysis) online query tool has a few of the anes data files hot and ready for crosstabulation and simple regression. before diving into either sda or the r code, perhaps review the available topics – with weighted proportions over time – posted on the main electionstudies.org website. you won’t be able to access any demographic breakouts there, but it’s the quickest way to view the ross perot anomaly.
choose which microdata file to work with after carefully reading your four study choices. you could review the frequently asked questions as well, but only if you promise me you won’t read anything into spss. most american national election studies generalize to all eligible voters in the united states, but confirm the sample universe on the `weights summary` section of your selection. and have fun. have fun.
this new github repository contains four scripts:
download and import.R
- slip on some sheep’s clothing and log in to electionstudies.org as if you were a real person
- download, import, save every data set shown here onto your local computer
- well really, that’s it. what more did you want?
analysis examples.R
- load the 2012 point-in-time survey file into working memory
- create a weight and primary sampling unit variable based on this exchange with the stanford folks
- construct the complex sample survey object
- run enough political analyses to make cnn jealous
replicate table one.R
- conjure up the 2004 time-series and 2006 pilot data sets
- merge ’em together then recode a few variables to match stanford’s published categorizations
- reproduce the official method column in the analysis examples on pdf page twenty-five
replicate table two.R
- pull the 2008 time-series file into working memory
- recode a bunch of variables into a bunch of other variables
- produce two incorrect and one correct logistic regression
- say hip-hip-hooray, but quietly so as not to disturb those around you
click here to view these four scripts
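for a feel of why a logistic regression on survey data can come out "incorrect," here is a minimal sketch on made-up data. it fits the same model three ways: unweighted, weighted but ignoring the sample design, and design-based with `svyglm` from the survey package. whether these are the same three regressions that `replicate table two.R` runs is an assumption on my part, and every column name and value below is hypothetical rather than taken from the 2008 file.

```r
library(survey)

# made-up data: three strata, two clusters each, four respondents per cluster
x <- data.frame(
  stratum = rep( c( "01" , "02" , "03" ) , each = 8 ) ,
  secu = rep( rep( c( "1" , "2" ) , each = 4 ) , times = 3 ) ,
  weight = rep( c( 0.5 , 1.0 , 1.5 , 2.0 ) , times = 6 ) ,
  age = c( 22 , 35 , 47 , 63 , 29 , 41 , 58 , 70 , 25 , 38 , 52 , 66 ,
           31 , 44 , 60 , 74 , 19 , 33 , 49 , 68 , 27 , 40 , 55 , 72 ) ,
  voted = c( 0 , 1 , 0 , 1 , 0 , 0 , 1 , 1 , 1 , 0 , 1 , 1 ,
             0 , 1 , 1 , 0 , 0 , 0 , 1 , 1 , 1 , 0 , 0 , 1 )
)

# 1) ignores the weights entirely
unweighted <- glm( voted ~ age , data = x , family = binomial() )

# 2) uses the weights but treats the data as a simple random sample,
#    so the standard errors do not reflect the clustering and stratification
weighted_only <- glm( voted ~ age , data = x , family = quasibinomial() , weights = weight )

# 3) weights plus strata and clusters, with design-based standard errors
anes_design <- svydesign( id = ~ secu , strata = ~ stratum , weights = ~ weight , data = x , nest = TRUE )
design_based <- svyglm( voted ~ age , anes_design , family = quasibinomial() )

# compare the point estimates across the three fits
cbind( unweighted = coef( unweighted ) , weighted_only = coef( weighted_only ) , design_based = coef( design_based ) )

# compare the standard errors across the three fits
cbind( unweighted = sqrt( diag( vcov( unweighted ) ) ) , weighted_only = sqrt( diag( vcov( weighted_only ) ) ) , design_based = SE( design_based ) )
```

the second and third fits should agree on the coefficients and differ only in their standard errors, which is the part that the design information changes.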
for more detail about the american national election studies (anes), visit:
- the time-saving variable search feature
- the anes bibliography, to browse what others have done with this data
- the wikipedia entry will never let you down
notes:
as you’d expect with any survey dating back to 1948, some of the weighting and confidence interval calculations have changed over time. with five notable exceptions (see table one), the main anes data sets did not start including a sampling weight until 1992 – when it became the norm. to further complicate your life, the more recent data sets include both a pre- and post-election weight. if no weight variable exists, just add a column of all ones and make that your weighting variable – matching what they’ve done in the multi-year cumulative file.
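the column-of-ones trick can be sketched in a few lines of base r. the data frame and its column names below are hypothetical stand-ins, not actual anes variable names.

```r
# toy stand-in for an older file that ships without a sampling weight
# (hypothetical column names and values)
anes_old <- data.frame(
  respondent_id = 1:5 ,
  voted = c( 1 , 0 , 1 , 1 , 0 )
)

# no weight variable in the file?  add a column of all ones
if ( !( "weight" %in% names( anes_old ) ) ) anes_old$weight <- 1

# with every weight equal to one, the weighted mean matches the plain mean
weighted.mean( anes_old$voted , anes_old$weight )
mean( anes_old$voted )
```

the payoff is that the same weighted code path works for every year, whether or not the original file came with a weight.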
if you only care about specific points-in-time (one of the cross-sectional time series studies), then simply find four variables to construct a taylor-series design: the strata variable, the primary sampling unit (also called the psu or cluster) variable, the pre-election weight, and the post-election weight. as stated at the bottom of this page, if your analysis only involves questions asked during the pre-election portion, use the pre-election weight (the unweighted sample size will be larger) – but if you’re looking at any variables collected during the post-election interview, use the post-election weight instead. next, look for the cluster and strata variables. sometimes they’re mushed together into a single variable and will need to be extracted with a simple recode like `stratum = substr( v040103 , 1 , 2 )` and `secu = substr( v040103 , 3 , 3 )`. for some of the older studies, these variables are not available – and your standard errors may be misleadingly small.
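put together, the four-variable recipe might look like the sketch below, using the survey package. `v040103` is the combined variable named above. the weight and outcome column names, and every value in the toy data, are made up for illustration, so check the codebook of your own file for the real ones.

```r
library(survey)

# toy data: first two characters of v040103 are the stratum, third is the cluster
x <- data.frame(
  v040103 = c( "011" , "011" , "012" , "012" , "021" , "021" , "022" , "022" ) ,
  pre_weight = c( 1.2 , 0.8 , 1.0 , 1.1 , 0.9 , 1.3 , 0.7 , 1.0 ) ,
  post_weight = c( 1.3 , 0.9 , 1.1 , 1.0 , 0.8 , 1.4 , 0.6 , 0.9 ) ,
  voted = c( 1 , 0 , 1 , 1 , 0 , 1 , 0 , 1 )
)

# pull the mushed-together variable apart
x$stratum <- substr( x$v040103 , 1 , 2 )
x$secu <- substr( x$v040103 , 3 , 3 )

# `voted` is treated here as a post-election question, so use the post-election weight.
# nest = TRUE because the cluster codes repeat from one stratum to the next
anes_design <- svydesign(
  id = ~ secu ,
  strata = ~ stratum ,
  weights = ~ post_weight ,
  data = x ,
  nest = TRUE
)

# weighted share who voted, with a taylor-series linearized standard error
svymean( ~ voted , anes_design )
```

for an analysis limited to pre-election questions, the only change would be swapping in `weights = ~ pre_weight`.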
if you’re analyzing the cumulative file, they’ve prepared a few multi-year columns of all weights. e-mail anes@electionstudies.org and ask for cluster and strata variable advice. there’s also a weighting anomaly back in the 1970 file that’s outlined in the main how-to guide, but in order to understand the three weight options, you actually gotta read the middle paragraph on the 1970 study design page.
confidential to sas, spss, stata, and sudaan users: and saber-toothed tigers probably laughed when they saw the first humans crossing the bering strait. don’t be a saber-toothed tiger. time to transition to r. 😀