Converting a spreadsheet of SMILES: my first OSM contribution
I’ve long admired the work of the Open Source Malaria Project. Unfortunately, time and “day job” constraints prevent me from being as involved as I’d like.
So: I was happy to make a small contribution recently in response to this request for help:
Can anyone help @O_S_M to convert this spreadsheet (malaria.ourexperiment.org/biological_dat…) into chemical structures with data? #openscience #realtimechem
Alice Williamson (@all_isee), June 24, 2014
Note – this all works fine under Linux; there seem to be some issues with Open Babel library files under OSX.
First step: make that data usable by rescuing it from the spreadsheet 😉 We’ll clean up a column name too.
library(XLConnect)
mmv <- readWorksheetFromFile("TP compounds with solid amounts 14_3_14.xlsx", sheet = "Sheet1")
colnames(mmv)[5] <- "EC50"
head(mmv)

  COMPOUND_ID                                                      Smiles     MW QUANTITY.REMAINING..mg.
1   MMV668822 c1[n+](cc2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)[O-] 434.35                     0.0
2   MMV668823      c1nc(c2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)Cl 452.79                     0.0
3   MMV668824                        c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F 306.27                    29.6
4   MMV668955                        C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F 310.30                    18.5
5   MMV668956    C1(CN(C1)c1cc(c(cc1)F)F)Oc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F 445.38                   124.2
6   MMV668957          c1ncc2n(c1N1CCC(C1)c1ccccc1)c(nn2)c1ccc(cc1)OC(F)F 407.42                    68.5
   EC50 New.quantity.remaining
1  4.01                      0
2  0.16                      0
3 10.00                     29
4  8.37                     18
5  0.43                    124
6  2.00                     62
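Before converting anything, it may be worth checking that the imported table has the shape the later steps expect. The sketch below is not part of the original script. It uses a small hand-typed data frame with three rows copied from the output above; the names `mmv_demo` and `check_compounds` are hypothetical. It checks that no SMILES string is missing or empty, that the compound IDs are unique (they are used as identifiers later), and that MW and EC50 came through as numbers.

```r
# Toy stand-in for the imported sheet: three rows copied from head(mmv) above.
mmv_demo <- data.frame(
  COMPOUND_ID = c("MMV668822", "MMV668823", "MMV668824"),
  Smiles = c(
    "c1[n+](cc2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)[O-]",
    "c1nc(c2n(c1OCCc1cc(c(cc1)F)F)c(nn2)c1ccc(cc1)OC(F)F)Cl",
    "c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F"
  ),
  MW = c(434.35, 452.79, 306.27),
  EC50 = c(4.01, 0.16, 10.00),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a character vector of problems (empty if none).
check_compounds <- function(df) {
  problems <- character(0)
  smiles <- df$Smiles
  if (any(is.na(smiles) | !nzchar(trimws(smiles)))) {
    problems <- c(problems, "missing or empty SMILES")
  }
  if (anyDuplicated(df$COMPOUND_ID) > 0) {
    problems <- c(problems, "duplicated COMPOUND_ID")
  }
  if (!is.numeric(df$MW) || !is.numeric(df$EC50)) {
    problems <- c(problems, "MW or EC50 is not numeric")
  }
  problems
}

problems <- check_compounds(mmv_demo)
```

With the real data, the same helper could be called as `check_compounds(mmv)` right after the column rename.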
What OSM would like: an output file in Chemical Markup Language, containing the Compound ID and properties (MW and EC50).
The ChemmineR package makes conversion of SMILES strings to other formats pretty straightforward; we start by converting to Structure Data Format (SDF):
library(ChemmineR)
library(ChemmineOB)
mmv.sdf <- smiles2sdf(mmv$Smiles)
That will throw a warning, since every molecule in the SDF object has the same CID: at this point, an empty string. We set the CIDs from the compound IDs, then use datablock() to attach the properties:
cid(mmv.sdf) <- mmv$COMPOUND_ID
datablock(mmv.sdf) <- data.frame(MW = mmv$MW, EC50 = mmv$EC50)
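To make it clearer what those properties become in the file: in the SDF format, each data item is a `> <TAG>` header line followed by its value and a blank line, and each record ends with `$$$$`. The base R sketch below only illustrates that layout for the first compound's MW and EC50. The helper `sdf_data_items` is hypothetical and is not how ChemmineR does it internally, and the molecule block that precedes the data items in a real record is left out.

```r
# Hypothetical helper: format a named list of properties as SDF-style data items.
# Illustration only; the atom/bond (molecule) block of a real record is omitted.
sdf_data_items <- function(props) {
  lines <- unlist(lapply(names(props), function(tag) {
    # Header line, value line, then the blank line that closes the item.
    c(paste0("> <", tag, ">"), as.character(props[[tag]]), "")
  }))
  # "$$$$" marks the end of one record in an SDF file.
  c(lines, "$$$$")
}

items <- sdf_data_items(list(MW = 434.35, EC50 = 4.01))
cat(items, sep = "\n")
```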
Now we can write out to an SDF file. We could also use a loop or an apply function to write individual files per molecule.
write.SDF(mmv.sdf, "mmv-all.sdf", cid = TRUE)
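The per-molecule variant mentioned above might look something like the following. This is a sketch under assumptions rather than tested code from the original script: `sdf_filename` and `write_each_sdf` are hypothetical helpers, and the second one assumes ChemmineR is installed and that it is given an SDFset whose CIDs have already been set, as with `mmv.sdf`. Defining the functions does not require ChemmineR; only calling `write_each_sdf()` does.

```r
# Hypothetical helper: build a file name from a compound ID, replacing any
# character that is awkward in file names with an underscore.
sdf_filename <- function(id, dir = ".") {
  safe <- gsub("[^A-Za-z0-9_.-]", "_", id)
  file.path(dir, paste0(safe, ".sdf"))
}

# Sketch only: assumes ChemmineR is installed and `sdfset` is an SDFset
# with CIDs set. Writes one SDF file per molecule into `dir`.
write_each_sdf <- function(sdfset, dir = ".") {
  ids <- ChemmineR::cid(sdfset)
  for (i in seq_along(ids)) {
    ChemmineR::write.SDF(sdfset[i], file = sdf_filename(ids[i], dir), cid = TRUE)
  }
  invisible(ids)
}

first_file <- sdf_filename("MMV668822")
```

Usage would then be `write_each_sdf(mmv.sdf)`, producing one file per compound ID in the working directory.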
It would be nice to stay in a single R script for the conversion to CML too, but for now I just run Open Babel from the command line. Note that the -xp flag is required to include the properties in CML:
babel -xp mmv-all.sdf mmv-all.cml
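One way to keep everything in a single R script would be to call the same command through base R's `system2()`. The sketch below is an assumption about how that could be wired up, not something from the original workflow. `babel_args` and `run_babel` are hypothetical helpers, and the arguments are exactly those of the command above. The call is skipped with a warning if no `babel` executable is found on the PATH.

```r
# Hypothetical helper: assemble the arguments used in the command above.
babel_args <- function(infile, outfile) {
  c("-xp", infile, outfile)
}

# Hypothetical helper: run babel only if the executable is on the PATH.
run_babel <- function(infile, outfile, exe = "babel") {
  path <- Sys.which(exe)
  if (!nzchar(path)) {
    warning("'", exe, "' not found on PATH; skipping conversion")
    return(invisible(NA_integer_))
  }
  # system2() returns the exit status of the command (0 on success).
  invisible(system2(path, args = shQuote(babel_args(infile, outfile))))
}

cml_args <- babel_args("mmv-all.sdf", "mmv-all.cml")
```

After `write.SDF()` has produced `mmv-all.sdf`, the conversion would be `run_babel("mmv-all.sdf", "mmv-all.cml")`.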
That’s it; here’s my OSMinformatics GitHub repository, and here’s the output.