Censoring on one end, “outliers” on the other, what can we do with the middle?
This post was written by Phil.
A medical company is testing a cancer drug. They get 16 genetically identical (or nearly identical) rats that all have the same kind of tumor, give 8 of them the drug and leave 8 untreated…or maybe they give them a placebo, I don’t know; is there a placebo effect in rats? Anyway, after a while the rats are killed and examined. If the tumors in the treated rats are smaller than the tumors in the untreated rats, then all of the rats have their blood tested for dozens of different proteins that are known to be associated with tumor growth or suppression. If there is a “significant” difference in one of the protein levels, then the working assumption is that the drug increases or decreases levels of that protein, and that this may be the mechanism by which the drug affects cancer. All of the above is done on many different cancer types and possibly several different types of rats. It’s just the initial screening: if things look promising, many more tests and different tests are done, potentially culminating (years later) in human tests.
So the initial task is to determine, from 8 control and 8 treated rats, which proteins look different. There are some complications: (1) the data are left-censored, i.e. below some level a protein is simply reported as “low”; (2) even above the censoring threshold the data are very uncertain (50% or 30% uncertainty for concentrations up to maybe double the censoring threshold); (3) some proteins are reported only in discrete levels (e.g. measurements might be 3.7 or 7.4, but never in between); (4) sometimes instrument problems, chemistry problems, or abnormalities in one or more rats lead to very high measurements of one or more proteins.
For instance:
(“low” means < 0.10):
Protein A, cases: 0.31, 0.14, low, 0.24, low, low, 0.14, low
Protein A, controls: low, low, low, low, 0.24, low, low, low
Protein B, cases: 160, 122, 99, 145, 377, 133, 123, 140
Protein B, controls: 94, 107, 139, 135, 152, 120, 111, 118
Note the very high value of Protein B in case rat 5. The drug company would not want to flag Protein B as being affected by their drug just because they got that one big number.
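To see how much that one rat matters, here is a minimal Python sketch using only the Protein B numbers above. The variable names are just for illustration. It compares the case-minus-control difference in means, the difference in medians, and the difference in means with rat 5 dropped.

```python
from statistics import mean, median

# Protein B values copied from the example above
cases_b = [160, 122, 99, 145, 377, 133, 123, 140]
controls_b = [94, 107, 139, 135, 152, 120, 111, 118]

# Difference in means is pulled up by the single 377 reading
mean_diff = mean(cases_b) - mean(controls_b)        # 40.375

# Difference in medians barely notices it
median_diff = median(cases_b) - median(controls_b)  # 17.5

# Difference in means with case rat 5 removed (roughly 9.7)
cases_without_rat5 = [v for v in cases_b if v != 377]
mean_diff_without_rat5 = mean(cases_without_rat5) - mean(controls_b)

print(mean_diff, median_diff, round(mean_diff_without_rat5, 1))
```

The gap in means shrinks by about a factor of four when one rat is removed, which is the sense in which that measurement could drive a spurious flag.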
Finally, the question: what’s a good algorithm to recognize if the cases tend to have higher levels of a given protein than the controls? A few possibilities that come to mind: (1) generate bootstrap samples from the cases and from the controls, and see how often the medians differ by more than the observed medians do; if it’s a small fraction of the time, then the observed difference is “statistically significant.” (2) Use the Mann-Whitney U test. (3) Discard outliers, then use censored maximum likelihood (or similar) on the rest of the data, thus generating a mean (or geometric mean) and uncertainty for the cases and for the controls.
Which of those is the best approach, and if the answer is “none of them” then what do you recommend?
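For concreteness, here is a rough Python sketch of what option (2) might look like on the example data. It is not a recommendation, and it rests on assumptions: every “low” reading is replaced by a single placeholder value below the threshold (the constant LOW is hypothetical), so all censored values tie at the bottom rank; the p-value comes from enumerating all 12,870 ways to pick 8 “case” rats out of 16, not from a normal approximation; and the test is one-sided (“cases tend higher”). The function names are made up for this example.

```python
from itertools import combinations

LOW = 0.0  # hypothetical placeholder for "low" (< 0.10); only its rank matters


def parse(values):
    """Turn a list of readings into floats, mapping "low" to LOW."""
    return [LOW if v == "low" else float(v) for v in values]


def midranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def u_test_one_sided(cases, controls):
    """Return (U for the cases, exact one-sided permutation p-value)."""
    n1 = len(cases)
    ranks = midranks(cases + controls)
    observed = sum(ranks[:n1])
    # Count relabelings whose rank sum is at least as large as observed
    hits = total = 0
    for idx in combinations(range(len(ranks)), n1):
        total += 1
        if sum(ranks[i] for i in idx) >= observed:
            hits += 1
    return observed - n1 * (n1 + 1) / 2, hits / total


cases_a = parse([0.31, 0.14, "low", 0.24, "low", "low", 0.14, "low"])
controls_a = parse(["low", "low", "low", "low", 0.24, "low", "low", "low"])
cases_b = parse([160, 122, 99, 145, 377, 133, 123, 140])
controls_b = parse([94, 107, 139, 135, 152, 120, 111, 118])

u_a, p_a = u_test_one_sided(cases_a, controls_a)
u_b, p_b = u_test_one_sided(cases_b, controls_b)
print("Protein A:", u_a, p_a)
print("Protein B:", u_b, p_b)
```

Because only ranks enter the calculation, the 377 in Protein B counts simply as “the largest value”: replacing it with 161 would give the same U and the same p-value. The censored values are handled the same way, as a block of ties at the bottom. What this sketch ignores is complication (2), the large measurement uncertainty just above the censoring threshold.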