Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In an ideal world, all bioinformatics tools would be made available via the Web as a web service with an API, as well as a standalone package to download for local use. This is rarely the case and sometimes, even where one or the other is available, factors such as cost come into play. So we resort to web scraping; writing code to interact with the code that lies behind a web server so as to submit queries, retrieve and parse results.
Normally, I’d use something like Ruby’s Mechanize library for this purpose. However, where the purpose is to retrieve delimited data for analysis using R, I figured it was time to try and achieve the entire process within R. So here’s how I used the RCurl and XML packages to interact with the WHAT IF server, which provides tools for the analysis of protein structure.
1. A note of caution
The administrators of some servers don’t like robots; particularly those which fire off thousands of queries per second to a server designed to cope only with a small number of requests. So use common sense; check whether the administrators have made their policy public and respect the limits.
2. Know your HTML
Step 1 in automating form submission is understanding how the web form works. So: head to the WHAT IF server, click “Atomic contacts” in the left menu, then right-click “Salt bridges” and open that link in a new tab. Examination of the HTML shows this:
<FORM METHOD="POST" ACTION="/wiw-cgi/GenericCGI.py" ENCTYPE="multipart/form-data"> <INPUT TYPE="hidden" NAME="request" VALUE="shosbr" > <TABLE border=0 cellpadding=4 cellspacing=1 width="100%"> <TR><TD Align=left >Either Choose a pdb-file,</TD> <TD Align=left ><INPUT TYPE="TEXT" NAME="&PDB1" SIZE=8 MAXLENGTH=4 > </TD> </TR> <TR><TD Align=left >Or Choose your own file</TD> <TD Align=left ><INPUT TYPE="file" NAME="&FIL1" > </TD> </TR> </TABLE><P> <INPUT TYPE="submit" NAME="SubmitButton" VALUE="Send" > <INPUT TYPE="reset" NAME="Reset" VALUE="Clear Form" > </FORM>
What this tells you is that the server runs a Python script, named GenericCGI.py, to which you can submit a set of name=value parameters: request=shosbr, &PDB1=NNNN (where NNNN is a PDB identifier) or &FIL1=FILE, where FILE is a PDB file on your local machine.
3. Submission of query
Let’s choose a small protein – 3N2J, a bacterial azurin and submit it to WHAT IF using R:
library(RCurl) salt <- postForm("http://swift.cmbi.ru.nl/wiw-cgi/GenericCGI.py", "&PDB1" = "3N2J", "request" = "shosbr", "submitButton" = "Send")
That takes a little while to run and when finished, the variable salt contains a bunch of HTML which looks like this:
<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2//EN\">\n<HTML>\n\n<!-- This file generated using Python HTMLgen module. -->\n<HEAD>\n <META NAME=\"GENERATOR\" CONTENT=\"HTMLgen 2.0.6\">\n <TITLE>Salt bridges.</TITLE>\n </HEAD>\n<BODY BGCOLOR=\"WHITE\">\n<H1>Salt bridges.</H1>\n\n\n Your job has been passed to WHAT IF and is processed.\n Your request might take a while, please wait.\n \n\n</BODY> </HTML>\n<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2//EN\">\n<HTML>\n\n<!-- This file generated using Python HTMLgen module. -->\n<HEAD>\n <META NAME=\"GENERATOR\" CONTENT=\"HTMLgen 2.0.6\">\n <TITLE>Salt bridges.</TITLE>\n </HEAD>\n<BODY BGCOLOR=\"WHITE\">\n\n<P>\n\nAccess your result.\n\n<FORM METHOD=\"POST\" ACTION=\"/wiw-cgi//GenericCGI.py\">\n<INPUT TYPE=\"hidden\" NAME=\"ID\" VALUE=\"/tmprM7rDq/\" >\n<INPUT TYPE=\"hidden\" NAME=\"refresh\" VALUE=\"shosbr\" >\n<INPUT TYPE=\"submit\" NAME=\"submit\" VALUE=\"Results\" >\n\n</FORM>\n\n <HR> \n If you have detected any error, or have any question or suggestion,\n please send an Email to Gert Vriend.<BR>\n Roland Krause, Jens Erik Nielsen, \n <a href=\"mailto: Vriend@CMBI.ru.nl\">Gert Vriend</a>.<P>\n <HR>\n Last modified Thu Sep 8 12:07:53 2011\n \n\n</BODY> </HTML>\n
4. Retrieval of results
The WHAT IF server is a bit of a tease. Examination of the HTML in variable salt, above, reveals that instead of returning results, we have an intermediate page containing another form, which we have to submit to retrieve the final data. The key form parameters are: ID=/tmprQzd2o/, submit=Results and refresh=shosbr. The parameter ID refers to a temporary location on the server which holds the results and will change with each submitted query.
We can submit this form in the same way:
library(RCurl) salt <- postForm("http://swift.cmbi.ru.nl/wiw-cgi/GenericCGI.py", "ID" = "/tmprQzd2o/", "request" = "shosbr", "submitButton" = "Send")
That returns a bunch more HTML – just looking at the first part:
<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2//EN\">\n<HTML>\n\n<!-- This file generated using Python HTMLgen module. -->\n<HEAD>\n <META NAME=\"GENERATOR\" CONTENT=\"HTMLgen 2.0.6\">\n <TITLE>Salt bridges.</TITLE>\n </HEAD>\n<BODY BGCOLOR=\"white\">\n<H1>Salt bridges.</H1>\n\nThe list of salt bridges:\n<PRE>Date= 2011-09-08 12:07:52\nYou are prompted for the first range\nYou are prompted for the second range\n 1 6 ASP ( 6 ) A OD1 - 1 ALA ( 1 ) A N 6.78\n 2 6 ASP ( 6 ) A OD2 - 1 ALA ( 1 ) A N 5.98\n 3 11 ASP ( 11 ) A OD1 - 35 HIS ( 35 ) A ND1 4.35\n 4 11 ASP ( 11 ) A OD1 - 35 HIS ( 35 ) A NE2 4.23\n ...
Success. We have results and if you squint, you might be able to see that they are wrapped up in PRE tags. Now, how to get them out?
5. Parsing of results
Here, we bring the XML package into play. First, parse the returned HTML and pull out the text between the PRE tags:
library(XML) sb <- htmlTreeParse(salt, useInternalNodes = T) sb1 <- xpathApply(sb, "//pre", xmlValue) class(sb1) # "list"
That gives us a list, containing one element which is a character vector containing the data as preformatted text.
6. Conversion to a data frame
Now for the tricky part. It looks as though each line of the data is space-delimited. So, you might think that we could read it into a data frame by treating the text as a connection. Question is: how consistent is the data output? Do all lines contain the same number of fields, for example? What about fields that themselves contain spaces?
Well, all we can do is give it a try. First, a quick look at the start and end of the data:
head(readLines(textConnection(sb1[[1]]))) # [1] "Date= 2011-09-08 12:07:52" # [2] "You are prompted for the first range" # [3] "You are prompted for the second range" # [4] " 1 6 ASP ( 6 ) A OD1 - 1 ALA ( 1 ) A N 6.78" # [5] " 2 6 ASP ( 6 ) A OD2 - 1 ALA ( 1 ) A N 5.98" # [6] " 3 11 ASP ( 11 ) A OD1 - 35 HIS ( 35 ) A ND1 4.35" tail(readLines(textConnection(sb1[[1]]))) # [1] " 5871506 ASP ( 98 ) L OD2 - 1509 LYS ( 101 ) L NZ 6.02" # [2] " 5881512 GLU ( 104 ) L OE1 - 1432 LYS ( 24 ) L NZ 2.84" # [3] " 5891512 GLU ( 104 ) L OE2 - 1432 LYS ( 24 ) L NZ 4.42" # [4] " 5901514 GLU ( 106 ) L OE2 - 1511 LYS ( 103 ) L NZ 5.43" # [5] " 5911572 LYS ( 128 ) L O'' - 1432 LYS ( 24 ) L NZ 6.43" # [6] ""
OK, looks like we can lose the first 3 lines and perhaps split on whitespace?
sbl <- readLines(textConnection(sb1[[1]])) sbl <- sbl[4:length(sbl)] sbt <- read.table(textConnection(sbl), fill = T) head(sbt) # V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 # 1 1 6 ASP ( 6 ) A OD1 - 1 ALA ( 1 ) A N 6.78 # 2 2 6 ASP ( 6 ) A OD2 - 1 ALA ( 1 ) A N 5.98 # 3 3 11 ASP ( 11 ) A OD1 - 35 HIS ( 35 ) A ND1 4.35 # 4 4 11 ASP ( 11 ) A OD1 - 35 HIS ( 35 ) A NE2 4.23 # 5 5 11 ASP ( 11 ) A OD2 - 35 HIS ( 35 ) A ND1 6.01 # 6 6 11 ASP ( 11 ) A OD2 - 35 HIS ( 35 ) A NE2 6.33 tail(sbt) # V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 # 586 5861506 ASP ( 98 ) L OD2 - 1435 LYS ( 27 ) L NZ 5.09 NA # 587 5871506 ASP ( 98 ) L OD2 - 1509 LYS ( 101 ) L NZ 6.02 NA # 588 5881512 GLU ( 104 ) L OE1 - 1432 LYS ( 24 ) L NZ 2.84 NA # 589 5891512 GLU ( 104 ) L OE2 - 1432 LYS ( 24 ) L NZ 4.42 NA # 590 5901514 GLU ( 106 ) L OE2 - 1511 LYS ( 103 ) L NZ 5.43 NA # 591 5911572 LYS ( 128 ) L O'' - 1432 LYS ( 24 ) L NZ 6.43 NA
That almost worked – except…
7. …we have a glitch
Rows in the data frame sbt which look like this:
5911572 LYS ( 128 ) L O'' - 1432 LYS ( 24 ) L NZ 6.43 NA
should actually look like this:
591 1572 LYS ( 128 ) L O'' - 1432 LYS ( 24 ) L NZ 6.43
The problem lies with the original output from WHAT IF. There needs to be a space between column 1 (contact number) and column 2 (residue number of first interacting molecule). In fact, I noticed this problem for the first time today, in the course of this R investigation.
Well you win some, you lose some. At least I learned a thing or two about RCurl and XML.
Filed under: bioinformatics, programming, R, statistics Tagged: how to, protein structure, what if
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.