[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers].
the national health interview survey (nhis) is a household survey about health status and utilization. each annual data set can be used to examine the disease burden and access to care that individuals and families are currently experiencing across the country. check out the wikipedia article (ohh hayy i wrote that) for more detail about its current and potential uses. if you’re cooking up a health-related analysis that doesn’t need medical expenditures or monthly health insurance coverage, look at nhis before the medical expenditure panel survey (its sample is twice as big). the centers for disease control and prevention (cdc) has been keeping nhis real since 1957, and the scripts below automate the download, importation, and analysis of every file back to 1963.
what happened in 1997, you ask? scientists cloned dolly the sheep, clinton started his second term, and the national health interview survey underwent its most recent major questionnaire re-design. here’s how all the moving parts work:
- a person-level file (personsx) that merges onto other files using unique household (hhx), family (fmx), and person (fpx) identifiers. [note to data historians: prior to 2004, person number was (px) and unique within each household.] this file includes the complex sample survey variables needed to construct a taylor-series linearization design, and should be used if your analysis doesn’t require variables from the sample adult or sample child files. this survey setup generalizes to the noninstitutional, non-active duty military population.
- a family-level file that merges onto other files using unique household (hhx) and family (fmx) identifiers.
- a household-level file that merges onto other files using the unique household (hhx) identifier.
- a sample adult file that includes questions asked of only one adult within each household (selected at random) – a subset of the main person-level file. hhx, fmx, and fpx identifiers will merge with each of the files above, but since not every adult gets asked these questions, this file contains its own set of weights: wtfa_sa instead of wtfa. you can merge on whatever other variables you need from the three files above, but if your analysis requires any variables from the sample adult questionnaire, you can’t use records in the person-level file that aren’t also in the sample adult file (a big sample size cut). this survey setup generalizes to the noninstitutional, non-active duty military adult population.
- a sample child file that includes questions asked of only one child within each household (if available, and also selected at random) – another subset of the main person-level file. same deal as the sample adult description, except use wtfa_sc instead of wtfa. oh yeah, and this one generalizes to the child population.
- five imputed income files. if you want income and/or poverty variables incorporated into any part of your analysis, you’ll need these puppies. the replication example below uses these, but if that’s impenetrable, post in the comments describing where you get stuck.
- some injury stuff and other miscellanea that varies by year. if anyone uses this, please share your experience.
if you use anything more than the personsx file alone, you’ll need to merge some tables together. make sure you understand the difference between setting the parameter all = TRUE versus all = FALSE — not everyone in the personsx file has a record in the samadult and samchild files.
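here’s a minimal sketch of that difference, using two made-up data frames in place of the real personsx and samadult files. the identifier and weight column names come from the descriptions above; the values are invented for illustration.

```r
# toy stand-ins for the person-level and sample adult files (invented values)
personsx <- data.frame(
  hhx  = c("000001", "000001", "000002", "000003"),
  fmx  = c("01", "01", "01", "01"),
  fpx  = c("01", "02", "01", "01"),
  wtfa = c(1500, 1500, 2300, 1800)
)

samadult <- data.frame(
  hhx     = c("000001", "000003"),
  fmx     = c("01", "01"),
  fpx     = c("01", "01"),
  wtfa_sa = c(3100, 2900)
)

merge_keys <- c("hhx", "fmx", "fpx")

# all = FALSE keeps only people found in BOTH files (sample adults only)
sample_adults_only <- merge(personsx, samadult, by = merge_keys, all = FALSE)

# all = TRUE keeps everyone; people who were not sample adults get NA
everyone <- merge(personsx, samadult, by = merge_keys, all = TRUE)

nrow(sample_adults_only)      # 2
nrow(everyone)                # 4
sum(is.na(everyone$wtfa_sa))  # 2
```

with all = TRUE the non-sample-adult rows stick around with missing sample adult variables, so any analysis of those variables still needs to be restricted to the records that actually have them.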
this new github repository contains four scripts:
1963-2011 – download all microdata.R
- loop through every year and download every file hosted on the cdc’s nhis ftp site
- import each file into r with SAScii
- save each file as an r data file (.rda)
- download all the documentation into the year-specific directory
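the save-as-.rda step, and the matching load step in the analysis scripts, is plain base r. a tiny sketch of the round trip, assuming a made-up data frame and a temporary file path rather than the names the actual scripts use:

```r
# toy data frame standing in for an imported nhis file (invented values)
personsx <- data.frame(
  hhx  = c("000001", "000002"),
  wtfa = c(1500, 2300)
)

# hypothetical location; the real scripts pick their own year-specific paths
rda_path <- file.path(tempdir(), "personsx.rda")

# write the object to disk, then drop it from memory
save(personsx, file = rda_path)
rm(personsx)

# load() restores the object under its original name
# and returns the names of everything it restored
loaded_names <- load(rda_path)
loaded_names  # "personsx"
```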
2011 personsx – analyze.R
- load the r data file (.rda) created by the download script (above)
- set up a taylor-series linearization survey design outlined on page 6 of this survey document
- perform a smattering of analysis examples
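a rough sketch of what that design setup can look like with the survey package, run on invented toy data instead of the real file. the column names strat_p, psu_p, and age_p are assumptions about the 2011 personsx layout, so check them against the documentation for your year before borrowing this.

```r
library(survey)

# toy stand-in for the 2011 personsx file (invented values)
set.seed(1)
personsx <- data.frame(
  strat_p = rep(1:3, each = 8),                    # assumed stratum variable
  psu_p   = rep(rep(1:2, each = 4), times = 3),    # assumed psu variable
  wtfa    = runif(24, 1000, 5000),                 # person-level weight
  age_p   = sample(0:85, 24, replace = TRUE)       # assumed age variable
)

# taylor-series linearization design:
# psus are numbered within strata, so nest = TRUE
nhis_design <- svydesign(
  id      = ~psu_p,
  strata  = ~strat_p,
  nest    = TRUE,
  weights = ~wtfa,
  data    = personsx
)

# weighted mean age with a design-based standard error
mean_age <- svymean(~age_p, nhis_design)
mean_age
```

the point estimate matches a plain weighted mean; the design object earns its keep on the standard error, which accounts for the strata and clusters.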
2011 personsx plus samadult with multiple imputation – analyze.R
- load the personsx and samadult r data files (.rda) created by the download script (above)
- merge the personsx and samadult files, highlighting how to conduct analyses that need both
- create tandem survey designs for both personsx-only and merged personsx-samadult files
- perform just a touch of analysis examples
- load and loop through the five imputed income files, tack them onto the personsx-samadult file
- conduct a poverty recode or two
- analyze the multiply-imputed survey design object, just like mom used to analyze
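for the multiply-imputed part, here’s a hedged sketch of the general pattern with the survey and mitools packages: merge each imputed file onto the main file, wrap the five results in an imputation list, and combine the five sets of estimates. everything here is toy data, pov_ratio is a hypothetical column name, and the design variable names are assumptions, so treat it as an illustration of the technique rather than a copy of the script.

```r
library(survey)
library(mitools)

# toy stand-in for the merged person-level file (invented values)
set.seed(2)
main_file <- data.frame(
  hhx     = sprintf("%06d", 1:24),
  fmx     = "01",
  fpx     = "01",
  strat_p = rep(1:3, each = 8),
  psu_p   = rep(rep(1:2, each = 4), times = 3),
  wtfa    = runif(24, 1000, 5000)
)

merge_keys <- c("hhx", "fmx", "fpx")

# five toy imputed income files: same people, different imputed values
imputed_files <- lapply(1:5, function(i) {
  data.frame(main_file[merge_keys], pov_ratio = runif(24, 0.5, 5))
})

# tack each imputed file onto the main file
merged_files <- lapply(imputed_files, function(imp) {
  merge(main_file, imp, by = merge_keys)
})

# one survey design object holding all five imputations
mi_design <- svydesign(
  id      = ~psu_p,
  strata  = ~strat_p,
  nest    = TRUE,
  weights = ~wtfa,
  data    = imputationList(merged_files)
)

# run the same estimate on each imputation, then pool the results
combined <- MIcombine(with(mi_design, svymean(~pov_ratio)))
combined
```

the pooled point estimate is the average of the five per-imputation estimates, while the pooled variance also picks up the disagreement between imputations.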
replicate cdc tecdoc – 2000 multiple imputation.R
- download and import the nhis 2000 personsx and imputed income files, using SAScii and this imputed income sas importation script (no longer hosted on the cdc’s nhis ftp site).
- loop through each of the five imputed income files, merging each to the personsx file and performing the same set of variable recodes
- construct a multiply-imputed survey design object
- analyze the multiply-imputed survey design object to generate pdf page 60 of this technical document
click here to view these four scripts
for more detail about the national health interview survey (nhis), visit:
- the centers for disease control and prevention’s national health interview survey homepage
- the national health interview survey according to wikipedia
- the minnesota population data center’s harmonized national health interview survey homepage
notes:
the national health interview survey is the first and only us government survey data set to include any r syntax examples (page 6). an inspiration.
the cdc often includes supplemental survey questions in nhis. check ’em out.
unless specified by the question’s phrasing, most nhis variables should be treated as point-in-time, as opposed to either annualized or ever during the year. this distinction is particularly important for health insurance coverage. think about these three statistics —
- the number of americans who will be without health insurance at some point during this year
- the number of americans without health insurance right now
- the number of americans who won’t ever have health insurance during this year
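those three numbers can be very different. a toy illustration with four imaginary people and a made-up month-by-month coverage grid (nhis itself doesn’t give you monthly coverage, so this is purely to show the definitions):

```r
# rows = people, columns = months; TRUE means insured that month (invented data)
coverage <- rbind(
  a = rep(TRUE, 12),                      # insured all year
  b = c(rep(TRUE, 6), rep(FALSE, 6)),     # lost coverage mid-year
  c = c(rep(FALSE, 3), rep(TRUE, 9)),     # gained coverage in the spring
  d = rep(FALSE, 12)                      # uninsured all year
)

# pretend the interview happens in month 12
interview_month <- 12

# uninsured at some point during the year
uninsured_at_some_point <- sum(rowSums(!coverage) > 0)

# uninsured right now (point-in-time)
uninsured_right_now <- sum(!coverage[, interview_month])

# uninsured for the entire year
uninsured_all_year <- sum(rowSums(coverage) == 0)

c(uninsured_at_some_point, uninsured_right_now, uninsured_all_year)  # 3 2 1
```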
confidential to sas, spss, stata, and sudaan users: why are you still rocking out on that cassette tape after we’ve designed the ipod? time to transition to r. 😀