plenty of nationwide surveys collect information at the household level, but only the american housing survey (ahs) focuses on the physical structure rather than the inhabitants. when asked to pick their favorite public-use file, urban planners, realty researchers, even data-driven squatters choose this one. in action since (and with available microdata dating back to) 1973, this survey is a production of the united states department of housing and urban development (hud), which contracts with our census bureau to collect information about a panel of both nationally- and metropolitan-area-representative homes so that scientists (like you) can boldly answer questions about america’s residential housing supply.
from 1973 until 1996, the survey administrators mushed all of the content from this survey into a single one-record-per-housing-unit consolidated table that they call a “flat file” – simple. beginning in 1997, you have access to much more detailed information. if you feel confused rather than empowered, walk through the various 2011 files with me. background first: the `control` column in the microdata is just the unique identifier for the household. in the 2011 release, it’s appropriate to think of `tnewhouse` and `trepwgt` as the main files – those are the only files that have weights. there are no person-level weights in this microdata. you can make statements like, “the average american housing unit has x bathrooms” but not, “the average american lives in a household with x bathrooms.” you cannot make a statement about average american anythings without sampling weights. catch my drift? alright, here’s my description of each file using this structure:
- tablename (number of records in 2011) [unique `control` numbers in 2011] – structure. notes/description.
files, structures, descriptions of individual tables in the 2011 ahs public use file:
- tnewhouse (186,448) [186,448] – one record per household. household characteristics. the main file.
- trepwgt (186,448) [186,448] – one record per household. weight file. needs to be merged onto the main file (see the merge sketch after this list).
- towner (60,572) [60,572] – one record per owner of a rented unit. not all homes have an outside owner, but the ones that do will merge onto `tnewhouse` by `control` one-to-one.
- thomimp (147,329) [50,532] – one record per home improvement. to uniquely identify each home improvement, use `control` plus `ras`. some homes have multiple home improvements, others have none.
- tmortg (56,507) [56,507] – one record per mortgage. not all homes have a mortgage (renters never do), but the ones that do will merge onto `tnewhouse` by `control` one-to-one.
- tperson (339,453) [134,918] – one record per person. to uniquely identify each person, use `control` plus `pline`. some homes have multiple persons, others have none.
- tratiov (8,166) [8,166] – one record per household. verification that the renter pays x amount when their reported income makes it seem implausible.
- trmov (43,968) [39,464] – one record per movement group. to uniquely identify each group of movers, use `control` plus `mvg`. some homes have multiple movement groups, others have none.
- ttypec (71,672) [71,672] – one record per household that was available in prior years but not in the current year.
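if those one-to-one and one-to-many merges sound abstract, here’s a minimal sketch in r. the file names and lowercase column names are assumptions about how you imported things, not gospel:

```r
# minimal sketch of the merges described above.
# file names and lowercase column names are assumptions.
load( "tnewhouse.rda" )	# household characteristics, one record per `control`
load( "trepwgt.rda" )	# replicate weights, one record per `control`
load( "tperson.rda" )	# persons, zero or more records per `control`

# one-to-one: attach the weights to the main file
x <- merge( tnewhouse , trepwgt , by = "control" )
stopifnot( nrow( x ) == nrow( tnewhouse ) )	# no records gained or lost

# one-to-many: count the persons in each household..
ppl <- aggregate( pline ~ control , data = tperson , FUN = length )
names( ppl )[ 2 ] <- "person_count"

# ..then attach that count to the main file
x <- merge( x , ppl , by = "control" , all.x = TRUE )

# households without any person records get a zero, not a missing
x$person_count[ is.na( x$person_count ) ] <- 0
```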
don’t say i didn’t warn you that this survey kicks ass. ahh yes and if you are still perplexed by something, pdf page eleven of the census bureau’s documentation outlines what i’ve tried to summarize above in much more detail, using a mix of both capital and lowercase letters. this new github repository contains four scripts:
download all microdata.R
- download, import, save each and every american housing survey file onto your local computer
- when a household file and replicate weights file are both available, merge them. you’ll have to do it eventually, so why not automate it from the start?
- store all successfully-imported r data files (.rda) into a big fat sqlite database in case your computer isn’t the newest edition – sketched below
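to give you the flavor, here’s a stripped-down sketch of that download-import-store pattern. the url below is a placeholder, not a real census bureau address – the actual script knows where everything lives:

```r
# stripped-down sketch of the download-import-store pattern.
# the url is a placeholder, not a real census bureau address.
library( DBI )
library( RSQLite )

tf <- tempfile() ; td <- tempdir()
download.file( "http://example.com/ahs2011.zip" , tf , mode = "wb" )
csv_file <- unzip( tf , exdir = td )[ 1 ]

tnewhouse <- read.csv( csv_file , stringsAsFactors = FALSE )
names( tnewhouse ) <- tolower( names( tnewhouse ) )

# the .rda copy for fast loading..
save( tnewhouse , file = "tnewhouse.rda" )

# ..and the sqlite copy for computers without much ram
db <- dbConnect( SQLite() , "ahs.db" )
dbWriteTable( db , "tnewhouse" , tnewhouse , overwrite = TRUE )
dbDisconnect( db )
```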
analysis examples.R
- load a single household-level data file, either into working memory or as a database-backed (ram-free) object
- construct the complex sample survey object post-stratifying according to census bureau specifications
- run example analyses that calculate perfect means, medians, quantiles, totals (sketched below)
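here’s roughly what that looks like with the survey package. the weight column, replicate-weight pattern, and replication type below are assumptions (and this sketch skips the post-stratification step) – confirm everything against the census bureau documentation:

```r
library( survey )

load( "tnewhouse.rda" )	# assumes the replicate weights were merged on already

# the column names and replication parameters here are assumptions --
# confirm them against the census bureau's documentation
ahs_design <-
	svrepdesign(
		weights = ~wgt ,	# assumed main weight column
		repweights = "repwgt[0-9]+" ,	# regex matching the replicate weight columns
		type = "Fay" , rho = 0.5 ,	# assumed replication method
		mse = TRUE ,
		data = tnewhouse
	)

# `rooms` stands in for any numeric column in the file
svymean( ~ rooms , ahs_design , na.rm = TRUE )	# mean rooms per housing unit
svyquantile( ~ rooms , ahs_design , 0.5 , na.rm = TRUE )	# median
svytotal( ~ rooms , ahs_design , na.rm = TRUE )	# nationwide total
```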
merge and recode examples.R
- recode some columns in the person-level table into other columns, inside the sql database (see the sketch after this list)
- aggregate some person-level statistics into household-level information just like hud’s file flattener sas program
- merge these aggregated person-level results onto the main household-level file
- re-construct a legitimate replicate-weighted database-backed survey design object, using the new person-level results you just created
- repeat the four previous steps, but all in working memory rather than with a sql database – for more powerful computers only
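if the sql side of that workflow is new to you, here’s a hedged sketch of the recode-aggregate-merge pattern using the DBI package. the `age` and `senior` columns are stand-ins for whatever person-level variables you actually recode:

```r
library( DBI )
library( RSQLite )

db <- dbConnect( SQLite() , "ahs.db" )

# recode inside the database: add a new column to the person-level table.
# `age` is a stand-in for a real person-level variable
dbExecute( db , "ALTER TABLE tperson ADD COLUMN senior INTEGER" )
dbExecute( db , "UPDATE tperson SET senior = ( age >= 65 )" )

# aggregate the person-level records up to one row per household
seniors <-
	dbGetQuery(
		db ,
		"SELECT control , SUM( senior ) AS senior_count
		 FROM tperson GROUP BY control"
	)

# merge the aggregate onto the main household-level file
load( "tnewhouse.rda" )
tnewhouse <- merge( tnewhouse , seniors , by = "control" , all.x = TRUE )
tnewhouse$senior_count[ is.na( tnewhouse$senior_count ) ] <- 0

dbDisconnect( db )
```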
replication.R
- fire up a sqlite-backed replicate-weighted survey design (sketched after this list)
- match two separate statistics and standard errors in this census bureau publication
- fire up the same design, sans sqlite-backing
- repeat step two
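the sqlite-backed construction differs from the in-memory one only in how the data get passed: a table name plus connection details instead of a data frame. assuming your version of the survey package supports database-backed replicate designs, and with the same caveats about weight columns and replication parameters as above:

```r
library( survey )
library( RSQLite )

# minimal sketch of a database-backed replicate design.
# table name, weight column, and replication parameters are assumptions.
ahs_db_design <-
	svrepdesign(
		weights = ~wgt ,
		repweights = "repwgt[0-9]+" ,
		type = "Fay" , rho = 0.5 ,
		mse = TRUE ,
		data = "tnewhouse_trepwgt" ,	# a table inside the sqlite file
		dbtype = "SQLite" ,
		dbname = "ahs.db"
	)

# the same call works on either design, so matching a published
# statistic twice is just a matter of swapping objects
svymean( ~ rooms , ahs_db_design , na.rm = TRUE )
```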
click here to view these four scripts
for more detail about the american housing survey (ahs), visit:
- the census bureau’s ahs homepage (they conduct the survey)
- the housing and urban development (hud) ahs user page (they administer and pay for the survey)
- the census bureau’s housing topics page (which points to..)
- the american community survey (acs), for a far larger sample but many fewer interview questions. the current population survey (cps), for better economic detail about each household’s inhabitants. the nyc housing and vacancy survey (nychvs) for manhattanites.
notes:
it might not be perfectly clear from the documentation (and they’ve yet to publish a core set of longitudinal weights for the various national periods and metropolitan samples), but the american housing survey is drawn from the same panel of housing units every other year. when comparing the 2009 and 2011 unique identifiers (the `control` column), i found 55,065 matches. you’d be smart to contact the (superhumanly responsive) quants who create this survey via their userlist to confirm that your panel-based analysis strategy makes sense.
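that 55,065 figure is easy to reproduce once both years’ main files are in memory – assuming `tnewhouse2009` and `tnewhouse2011` hold the two household-level tables:

```r
# assumes tnewhouse2009 and tnewhouse2011 are the 2009 and 2011 main files
length( intersect( tnewhouse2009$control , tnewhouse2011$control ) )
# [1] 55065
```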
when you think of a housing unit, you might informally refer to it as a place where you would expect people to have their own bathroom and kitchen for their excluuuusive use. the american housing survey includes some assisted living settings, but excludes group quarters like dormitories, hospitals, military barracks, and most nursing homes. for more detailed explanations, take a look at the methodology document and especially appendix b.
confidential to sas, spss, stata, and sudaan users: knock knock. who’s there? r. r who? aren’t you glad you transitioned to r?