The CDK/Metabolomics/Chemometrics Unconference results
[This article was first published on chem-bla-ics, and kindly contributed to R-bloggers.]
As announced earlier, Miguel, Velitchka, Christoph and I held a small CDK/Metabolomics/Chemometrics unconference. We started late and did not have an evening program, so we did not get many results. However, we did do some molecular chemometrics.
We used the R statistics software together with Rajarshi’s rcdk package (an R wrapper around the CDK library) and Ron’s (my PhD supervisor) PLS package (see this paper), to predict retention indices for a number of metabolites.
We ended up with this R script:
    library("rJava")
    library("rcdk")
    library("pls")

    # Read the molecules and list all available molecular descriptors
    mols = load.molecules("data_cdk.sdf")
    selection = get.desc.names()
    # Skip the descriptor that throws a NullPointerException
    selection = selection[selection != "org.openscience.cdk.qsar.descriptors.molecular.AminoAcidCountDescriptor"]
    x = eval.desc(mols, selection, verbose=TRUE)
    # Keep only the descriptor columns without NAs
    x2 = x[,apply(x, 2, function(a) {all(!is.na(a))})]

    # Retention indices; read.table names an unnamed first column V1
    y = read.table("data_cdk_RI")
    input = data.frame(x2, y)

    # PLS regression with up to 50 components, cross-validated
    pls.model = plsr(V1 ~ ., 50, data=input, validation="CV")
    summary(pls.model)
    plot(RMSEP(pls.model))
    plot(pls.model, ncomp=20)
    abline(0,1, col="red")
    plot(pls.model, "loadings", comps=1:2)
    savehistory("finalHistory.R")

The AminoAcidCountDescriptor threw a NullPointerException, so the script excludes it. There were also a few NAs in the resulting matrix, so the script keeps only the columns without them. The CV results were not as good as Velitchka’s best models, but still a good start:
No variable selection; 200 objects, 190 variables.
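One way to read cross-validation results like these is to look for the number of components with the lowest cross-validated RMSEP. The sketch below is not part of our script. It assumes the yarn example data that ships with the pls package as a stand-in for our descriptor matrix and retention indices, and the names model, cv_rmsep and best_ncomp are made up for illustration.

```r
library(pls)

# Example data shipped with pls (NIR spectra and yarn density). It stands in
# for the descriptor matrix and retention indices, which are not included here.
data(yarn)

set.seed(1)  # the cross-validation segments are drawn at random
model <- plsr(density ~ NIR, ncomp = 10, data = yarn, validation = "CV")

# RMSEP()$val is an array: estimate x response x model size.
# The first model size is the intercept-only model (0 components).
cv_rmsep <- RMSEP(model, estimate = "CV")$val[1, 1, ]
best_ncomp <- unname(which.min(cv_rmsep)) - 1

print(round(cv_rmsep, 3))
cat("Components at minimum CV RMSEP:", best_ncomp, "\n")
```

The minimum is only a rough guide. With nearly as many variables as objects, a smaller model with a similar error may well be the safer choice.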
Questions:
- Can we do this in Bioclipse2 too?
- Can we improve the default CDK descriptor parameters to maximize the column count?
- Rajarshi, what would be involved in writing wrapper code for atomic descriptors in rcdk?
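For the question about the column count, a first step could be to check which descriptor columns contain NAs before they are dropped. This is a minimal sketch in base R on a made-up toy matrix. The column names and values are hypothetical and do not come from the CDK.

```r
# Toy descriptor matrix: rows are molecules, columns are descriptors.
# All names and values are made up for illustration.
x <- data.frame(
  descA = c(1.2, 3.4, 2.2),
  descB = c(NA, 0.5, 0.7),
  descC = c(10, 12, 11),
  descD = c(NA, NA, NA)
)

# Number of missing values per descriptor column.
na_per_column <- colSums(is.na(x))

# Keep only the columns without any NA (same effect as the apply() filter
# in the script).
x_complete <- x[, na_per_column == 0, drop = FALSE]

# Columns that would be lost, with their NA counts.
print(na_per_column[na_per_column > 0])
cat(ncol(x_complete), "of", ncol(x), "columns kept\n")
```

A column that is NA for every molecule, like descD here, might point at a descriptor that fails outright, while a column with only a few NAs might point at individual molecules the descriptor cannot handle.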