analyze the united states decennial census public use microdata sample (pums) with r and monetdb
[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers.]
during his tenure as secretary of state, thomas jefferson oversaw the first american census way back in 1790. some of my countrymen express pride that we’re the oldest democracy, but my heart swells with the knowledge that we’ve got the world’s oldest ongoing census. you’ll find the terms ‘census’ and ‘enumeration’ scattered throughout article one, section two of our constitution. long story short: the united states census bureau has been a pioneer in the field of data collection and dissemination since george washington’s first term. tis oft i wonder how he would have felt betwixt r and monetdb.
for the past few decades, the bureau has compiled and released public use microdata samples (pums) from each big decennial census. these are simply one- and five-percent samples of the entire united states population. although a microdata file containing five percent of the american population sounds better, the one-percent files might be more valuable for your analysis because fewer fields have to be suppressed or top-coded for respondent privacy (in compliance with title 13 of the u.s. code).
if you’re not sure what kind of census data you want, read the missouri census data center’s description of what’s available. these public use microdata samples are most useful for very specific, customized analyses, but keep in mind that the smallest identifiable geography is the public use microdata area (puma), so don’t expect estimates for your neighborhood. it’d be wise to review the bureau’s at a glance page to decide whether you can just use one of the summary files that they’ve already constructed – why re-invent the table?
my syntax below only loads the one- and five-percent files from 1990 and 2000. earlier releases can be obtained from the national archives, the university of minnesota’s magical ipums, or the missouri state library’s informative missouri census data center.
here’s some bad news: it looks like sequestration means the 2010 census data release might not include a pums. this isn’t as much of a loss as it might sound: the 2010 census dropped the long form – historically, one-sixth of american households were instructed to answer a lot of questions while the other five-sixths of us just had to answer a few. starting in 2010, everyone just had to answer a few, and the more detailed questions are now asked of roughly one percent of the united states population on an annual (as opposed to decennial) basis with the spanking new american community survey. read this for more detail. kinda awesome.
this new github repository contains three scripts:
download and import.R
- create the batch (.bat) file needed to initiate the monet database in the future
- figure out the data structures for the 1990 and 2000 pums for both household and person files
- download, unzip, and import each file for every year and size specified by the user into monetdb
- create and save a merged/person-level design object to make weighted analysis commands a breeze
- create a well-documented block of code to re-initiate the monetdb server in the future
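just to give the flavor of that import step, here’s a minimal sketch. the file name, fixed-width layout, table name, weight variable, and connection string below are all placeholders rather than the real 2000 pums record layout, and svydesign() from the survey package stands in for the monetdb-backed sqlsurvey design the actual script builds:

# illustrative sketch only: names, widths, and connection details are assumptions
library(DBI)        # generic database interface
library(MonetDB.R)  # monetdb driver -- assumes the server started by the .bat file is running
library(survey)     # svydesign() used as a simple stand-in for a sqlsurvey design

# connect to the locally-running monetdb server (url, user, and password are assumptions)
con <- dbConnect( MonetDB.R() , "monetdb://localhost:50000/pums" , user = "monetdb" , password = "monetdb" )

# read a hypothetical person-level fixed-width extract into r
person <- read.fwf( "p2000_1pct_extract.txt" , widths = c( 7 , 3 , 1 , 8 ) , col.names = c( "serialno" , "age" , "sex" , "pweight" ) )

# push the data frame into monetdb as a table
dbWriteTable( con , "person_2000_1pct" , person )

# build a person-level design object using the (placeholder) person weight
pums.design <- svydesign( id = ~ 1 , weights = ~ pweight , data = person )

# save the design so future sessions can skip the import
save( pums.design , file = "pums_2000_1pct_design.rda" )

dbDisconnect( con )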
2000 analysis examples.R
- run the well-documented block of code to re-initiate the monetdb server
- load the r data file (.rda) containing the weighted design object for the one-percent and five-percent files
- perform the standard repertoire of analysis examples, this time using sqlsurvey functions – sorry no standard errors
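for a taste of what those examples look like, here’s a hedged sketch that runs the same sorts of commands with the plain survey package on the design object saved above. the real script uses the equivalent sqlsurvey functions against monetdb, and the variable names below are guesses, not the file’s actual field names:

library(survey)

# load the design object saved by the download-and-import sketch (placeholder file name)
load( "pums_2000_1pct_design.rda" )

# weighted total of the population represented by the file
svytotal( ~ one , update( pums.design , one = 1 ) )

# weighted mean age
svymean( ~ age , pums.design )

# weighted mean age broken out by sex
svyby( ~ age , ~ sex , pums.design , svymean )

# weighted median age
svyquantile( ~ age , pums.design , 0.5 )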
replicate control counts table.R
- run the well-documented block of code to re-initiate the monetdb server
- query the y2k household and merged tables to match the census bureau’s published control counts
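that check boils down to summing the weights by geography in sql and comparing the result against the bureau’s published totals. a bare-bones version might look like this, with table and column names as placeholders rather than the names the script actually creates:

library(DBI)
library(MonetDB.R)

# re-connect to the monetdb server (connection string is an assumption)
con <- dbConnect( MonetDB.R() , "monetdb://localhost:50000/pums" , user = "monetdb" , password = "monetdb" )

# household weights summed by state should line up with the published household control counts
dbGetQuery( con , "SELECT state , SUM( hweight ) AS weighted_households FROM household_2000_5pct GROUP BY state ORDER BY state" )

# same idea at the person level, using the merged person-level table
dbGetQuery( con , "SELECT state , SUM( pweight ) AS weighted_persons FROM merged_2000_5pct GROUP BY state ORDER BY state" )

dbDisconnect( con )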
click here to view these three scripts
for more detail about the united states decennial census public use microdata sample, visit:
- the us census bureau’s 1990, 2000, and 2010 census homepages.
- the american factfinder homepage, for all your online table creation needs
- the national archives, with identifiable data releases up to 1940. grandpa’s confidentiality be damned!
notes:
analyzing trends between historical decennial censuses (would that be censii?) and the american community survey is legit. not only legit. encouraged. instead of waiting ten years to analyze long-form respondents, now you and i have access to a new data set every year. if you like this new design, thank a re-engineer.
so how might one calculate standard errors and confidence intervals in the pums? there isn’t a good solution. ipums (again, whom i love dearly) has waved its wand and created this impressive strata variable for each of the historical pums data sets. in a previous post, i advocated simply doubling the standard errors, then calculating any critically important standard errors by hand with the official formula (1990 here and 2000 there). starting with the 2005 american community survey, replicate weights have been added and the survey data world has been at (relative) peace.
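in r, that doubling trick takes about three lines once you have an estimate in hand. here’s a quick sketch with the survey package; the variable name is a placeholder and the factor of two is the conservative rule of thumb described above, not an official adjustment:

library(survey)

# weighted mean of some analysis variable from the saved design object
est <- svymean( ~ age , pums.design , na.rm = TRUE )

# conservatively double the standard error, per the rule of thumb above
se.doubled <- 2 * SE( est )

# approximate 95% confidence interval built from the doubled standard error
coef( est ) + c( -1.96 , 1.96 ) * se.doubled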
confidential to sas, spss, stata, and sudaan users: fred flintstone thinks you are old-fashioned. time to transition to r. 😀