[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
the surveillance epidemiology and end results program is the aggregation of all cancer registry statistics in the united states. created by congressional decree, seer has captured a nationally-representative quarter of american cancer incidence since 1973. when acs, cdc, nci, and naaccr publish their collaborative annual report, they use seer. when the aacr predicts that america will have 18 million cancer survivors by 2022, they use seer too. you can use seer three.
the national cancer institute blessedly provides a bouquet of free statistical software to import and analyze this microdata. obviously, my code won’t compete with the legions of epidemiological software programmers at the largest of the nih institutes. but plenty of other r users have written packages to work with this stuff, so maybe, just maybe, someone will find value in my automated importation syntax. plus, the seer microdata include a sas import script – which triggers my fight or fight harder reflex. list of things i hate, descending sort order: mosquitoes, cancer, then sas a very distant third. but still.
aside from easing the importation of this data into the r language, i suppose i have contributed one tangible improvement to the seer-analyst community: these download and import scripts will put all eight million records into wickedly-fast monetdb. so long as you can perform your analysis using sql, you can perform your analysis (on all eight million records) in basically one second. haa-cha! i’ve said it before, i’ll say it again: the import takes forrrrrever (leave it overnight). but once it’s loaded, it’ll outrun lightning.
this new github repository contains four scripts:
download.R
- after setting your username and password, download and unzip the seer text data file to some working directory
import all tables into rda.R
- grep through the unzipped seer text folders to find individual- and population-level tables
- import each individual-level table into an r data.frame with sascii, then save to disk for fast loading later.
- import each population-level table into an r data.frame with sascii, then save to disk for fast loading later.
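a minimal sketch of the find-import-save loop described above, in base r only. the actual script relies on sascii to parse the sas import script shipped with the microdata; here the column layout is typed in by hand and `read.fwf` stands in for that step. the folder name, file names, and the three columns (`patient_id`, `year_dx`, `sex`) are all made up for illustration.

```r
# set up a throwaway folder with two tiny fake fixed-width files
# (hypothetical layout: 8-char id, 4-char diagnosis year, 1-char sex)
work_dir <- file.path(tempdir(), "seer_sketch")
dir.create(file.path(work_dir, "incidence"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("0000000119731", "0000000219802"),
           file.path(work_dir, "incidence", "BREAST.TXT"))
writeLines("0000000319992",
           file.path(work_dir, "incidence", "COLRECT.TXT"))

# hand-written column layout; the real one comes from the sas import script
layout <- data.frame(
  varname = c("patient_id", "year_dx", "sex"),
  width = c(8, 4, 1),
  stringsAsFactors = FALSE
)

# search the unzipped folders for every text table
txt_files <- list.files(work_dir, pattern = "\\.TXT$",
                        recursive = TRUE, full.names = TRUE)

# import each table into a data.frame, then save it to disk as .rda
rda_files <- character(0)
for (fn in txt_files) {
  x <- read.fwf(fn,
                widths = layout$width,
                col.names = layout$varname,
                colClasses = c("character", "integer", "integer"))
  rda_fn <- sub("\\.TXT$", ".rda", fn)
  save(x, file = rda_fn)
  rda_files <- c(rda_files, rda_fn)
}
```

calling `load()` on any of the saved files brings the data.frame `x` back into memory without re-parsing the text file.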
import individual-level tables into monetdb.R
- grep through the unzipped seer text folders to find individual-level tables
- initiate a monetdb server on the local disk, then import each individual-level table with read.sascii.monetdb
- stack all of the imported individual-level tables into one, thereby replicating the total case count
- create a well-documented block of code to re-initiate the monetdb server in the future
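the stacking step can be sketched with plain sql. this example assumes the dbi and rsqlite packages are installed, and uses an in-memory sqlite database purely as a stand-in so it runs without a server; the actual script loads the tables into monetdb with read.sascii.monetdb. the table names and columns below are hypothetical.

```r
library(DBI)

# in-memory sqlite is only a stand-in for the monetdb server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# two tiny made-up individual-level tables
tables <- list(
  breast = data.frame(patient_id = c("00000001", "00000002"),
                      year_dx = c(1973L, 1980L),
                      stringsAsFactors = FALSE),
  colrect = data.frame(patient_id = "00000003",
                       year_dx = 1999L,
                       stringsAsFactors = FALSE)
)

# load each table into the database under its own name
for (tn in names(tables)) dbWriteTable(con, tn, tables[[tn]])

# stack every table into one with UNION ALL
stack_sql <- paste0(
  "CREATE TABLE seer_all AS ",
  paste0("SELECT * FROM ", names(tables), collapse = " UNION ALL ")
)
dbExecute(con, stack_sql)

# the stacked table should hold exactly as many rows as the pieces combined
stacked_n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM seer_all")$n
separate_n <- sum(sapply(tables, nrow))
stopifnot(stacked_n == separate_n)

dbDisconnect(con)
```

`UNION ALL` keeps duplicate rows, which matters here because the goal is to reproduce a total case count rather than a count of distinct records.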
replicate case counts table.R
- connect to the seer microdata stored in monetdb
- replicate the count statistics shown on the nci’s seer data page with sql
- shut the whole thing down
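a rough sketch of what connect, count, and shut down can look like through dbi. as an assumption for the sake of a runnable example, an in-memory sqlite database stands in for monetdb (so the dbi and rsqlite packages are needed), and the table `seer_all` with its `registry` and `site` columns is invented; the real microdata use different names and codes.

```r
library(DBI)

# stand-in connection; the actual script connects to the monetdb server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# hypothetical stacked table of cases
seer <- data.frame(
  registry = c("seer9", "seer9", "seer13", "seer18", "seer18", "seer18"),
  site = c("breast", "colrect", "breast", "breast", "lung", "lung"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "seer_all", seer)

# total case count across the whole table
total <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM seer_all")

# case counts broken out by one grouping column
by_site <- dbGetQuery(
  con,
  "SELECT site, COUNT(*) AS n FROM seer_all GROUP BY site ORDER BY site"
)

# shut the whole thing down
dbDisconnect(con)
```

the counting happens inside the database, so only the small result tables ever travel back into r.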
click here to view these four scripts
for more detail about surveillance epidemiology and end results microdata, visit:
- the seer datasets and software tab and brochure, both good starting points
- seer recodes you’ll need to implement in r if you want to match nci-created (free) software
notes:
seer is publicly-available, you just gotta sign and e-mail in this form, then wait two business days for them to send you the login and password needed for the box that pops up when you click this download link.
confidential to sas, spss, stata, and sudaan users: it’s black tie dinner night at the governor’s mansion and you’re still wearing a t-shirt. ready to change into your tuxedo? time to transition to r. 😀
To leave a comment for the author, please follow the link and comment on their blog: asdfree by anthony damico.