Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.
One answer: we begin with exploratory data analysis and never get around to cleaning it up.
An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and use the GEOquery package to grab the data:
library(GEOquery) gse <- getGEO("GSE22255")
Bob is interested in the covariates and metadata associated with the experiment, which he can access using pData().
pd <- pData(gse$GSE22255_series_matrix.txt.gz) names(pd) # [1] "title" "geo_accession" # [3] "status" "submission_date" # [5] "last_update_date" "type" # [7] "channel_count" "source_name_ch1" # [9] "organism_ch1" "characteristics_ch1" # [11] "characteristics_ch1.1" "characteristics_ch1.2" # [13] "characteristics_ch1.3" "characteristics_ch1.4" # [15] "characteristics_ch1.5" "characteristics_ch1.6" # [17] "treatment_protocol_ch1" "molecule_ch1" # [19] "extract_protocol_ch1" "label_ch1" # [21] "label_protocol_ch1" "taxid_ch1" # [23] "hyb_protocol" "scan_protocol" # [25] "description" "data_processing" # [27] "platform_id" "contact_name" # [29] "contact_email" "contact_phone" # [31] "contact_fax" "contact_laboratory" # [33] "contact_department" "contact_institute" # [35] "contact_address" "contact_city" # [37] "contact_state" "contact_zip/postal_code" # [39] "contact_country" "supplementary_file" # [41] "data_row_count"
Bob discovers that pd$characteristics_ch1.2 is “age at examination”, but it’s stored as a factor. He’d like to use it as a numeric variable. So he sets about figuring out how to do the conversion.
# first you need factor to character as.character(pd$characteristics_ch1.2[1:5]) # [1] "age-at-examination: 68" "age-at-examination: 68" "age-at-examination: 73" # [4] "age-at-examination: 65" "age-at-examination: 73" # now you want to get rid of everything but the age value # so you wrap it in a gsub() gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5])) # [1] "68" "68" "73" "65" "73" # and the last step is character to numeric as.numeric(gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5]))) # [1] 68 68 73 65 73
Three levels of nested methods. Ugly. However, it works, so Bob moves to the next task (using those ages to do some science).
He thinks: “at some point, I should rewrite that as a function to convert ages.”
But he never does. We stop when it works.
Filed under: programming, R, research diary, statistics
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.