Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Background
From time to time I have wanted to bring an online table into an R dataframe. While in principle, the data can be cut and paste into Excel, sometimes the table is very large and sometimes the columns get goofed up in the process. Fortunately, there are a number of R tools for accomplishing this. I am just going to show one approach using the rvest package. The rvest package also makes it possible to interact with forms on webpages to request specific material which can then be scraped. I think you will see the potential if you look here.
In our (simple) case, we will apply this process to Westgard's desirable assay specifications as shown on his website. The goal is to parse out the biological variation tables, get them into a dataframe and the write to csv or xlsx.
Reading in the Data
The first thing to do is to load the rvest
and httr
packages and define an html session with the html_session()
function.
library(rvest) library(httr) wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("LabRtorian"))
Now looking at the webpage, you can see that there are 8 columns in the tables of interest. So, we will define an empty dataframe with 8 columns.
#define empty table to hold all the content biotable = data.frame(matrix(NA,0, 8))
We need to know which part of the document to scrape. This is a little obscure, but following the instructions in this post, we can determine that the xpaths we need are:
/html/body/div[1]/div[3]/div/main/article/div/table[1]
/html/body/div[1]/div[3]/div/main/article/div/table[2]
/html/body/div[1]/div[3]/div/main/article/div/table[3]
…
etc.
There are 8 such tables in the whole webpage. We can define a character vector for these as such:
xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")
Now we make a loop to scrape the 8 tables and with each iteration of the loop, append the scraped subtable
to the main dataframe called biotable
using the rbind()
function. We have to use the parameter fill = TRUE
in the html_table()
function because the table does not happen to always a uniform number of columns.
for (j in 1:8){ subtable <- wg %>% read_html() %>% html_nodes(xpath = xpaths[j]) %>% html_table(., fill = TRUE) subtable <- subtable[[1]] biotable <- rbind(biotable,subtable) }
Clean Up
Now that we have the raw data out, we can have a quick look at it:
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
---|---|---|---|---|---|---|---|
Analyte | Number of Papers | Biological Variation | Biological Variation | Desirable specification | Desirable specification | Desirable specification | |
Analyte | Number of Papers | CVI | CVg | I(%) | B(%) | TE(%) | |
S- | 11-Desoxycortisol | 2 | 21.3 | 31.5 | 10.7 | 9.5 | 27.1 |
S- | 17-Hydroxyprogesterone | 2 | 19.6 | 50.4 | 9.8 | 13.5 | 29.7 |
U- | 4-hydroxy-3-methoximandelate (VMA) | 1 | 22.2 | 47.0 | 11.1 | 13.0 | 31.3 |
S- | 5' Nucleotidase | 2 | 23.2 | 19.9 | 11.6 | 7.6 | 26.8 |
U- | 5'-Hydroxyindolacetate, concentration | 1 | 20.3 | 33.2 | 10.2 | 9.7 | 26.5 |
S- | α1-Acid Glycoprotein | 3 | 11.3 | 24.9 | 5.7 | 6.8 | 16.2 |
S- | α1-Antichymotrypsin | 1 | 13.5 | 18.3 | 6.8 | 5.7 | 16.8 |
S- | α1-Antitrypsin | 3 | 5.9 | 16.3 | 3.0 | 4.3 | 9.2 |
We can see that we need define column names and we need to get rid of some rows containing extraneous column header information. There are actually 8 such sets of headers to remove.
table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE") names(biotable) <- table.header
Let's now find rows we don't want and remove them.
for.removal <- grep("Analyte", biotable$Analyte) biotable <- biotable[-for.removal,]
You will find that the table has missing data which is written as “- – -”. This should be now replaced by NA
and the column names should be assigned to sequential integers. Also, we will remove all the minus signs after the specimen type. I'm not sure what they add.
biotable[biotable == "---"] <- NA row.names(biotable) <- 1:nrow(biotable) biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)
Check it Out
Just having another look at the first 10 rows:
Sample | Analyte | NumPapers | CVI | CVG | I | B | TE |
---|---|---|---|---|---|---|---|
S | 11-Desoxycortisol | 2 | 21.3 | 31.5 | 10.7 | 9.5 | 27.1 |
S | 17-Hydroxyprogesterone | 2 | 19.6 | 50.4 | 9.8 | 13.5 | 29.7 |
U | 4-hydroxy-3-methoximandelate (VMA) | 1 | 22.2 | 47.0 | 11.1 | 13.0 | 31.3 |
S | 5' Nucleotidase | 2 | 23.2 | 19.9 | 11.6 | 7.6 | 26.8 |
U | 5'-Hydroxyindolacetate, concentration | 1 | 20.3 | 33.2 | 10.2 | 9.7 | 26.5 |
S | α1-Acid Glycoprotein | 3 | 11.3 | 24.9 | 5.7 | 6.8 | 16.2 |
S | α1-Antichymotrypsin | 1 | 13.5 | 18.3 | 6.8 | 5.7 | 16.8 |
S | α1-Antitrypsin | 3 | 5.9 | 16.3 | 3.0 | 4.3 | 9.2 |
S | α1-Globulins | 2 | 11.4 | 22.6 | 5.7 | 6.3 | 15.7 |
U | α1-Microglobulin, concentration, first morning | 1 | 33.0 | 58.0 | 16.5 | 16.7 | 43.9 |
Now examining the structure:
str(biotable)
## 'data.frame': 370 obs. of 8 variables: ## $ Sample : chr "S" "S" "U" "S" ... ## $ Analyte : chr "11-Desoxycortisol" "17-Hydroxyprogesterone" "4-hydroxy-3-methoximandelate (VMA)" "5' Nucleotidase" ... ## $ NumPapers: chr "2" "2" "1" "2" ... ## $ CVI : chr "21.3" "19.6" "22.2" "23.2" ... ## $ CVG : chr "31.5" "50.4" "47.0" "19.9" ... ## $ I : chr "10.7" "9.8" "11.1" "11.6" ... ## $ B : chr "9.5" "13.5" "13.0" "7.6" ... ## $ TE : chr "27.1" "29.7" "31.3" "26.8" ...
It's kind-of undesirable to have numbers as characters so…
#convert appropriate columns to numeric biotable[,3:8] <- lapply(biotable[3:8], as.numeric)
Write the Data
Using the xlsx
package, you can output the table to an Excel file in the current working directory.
library(xlsx) write.xlsx(biotable, file = "Westgard_Biological_Variation.xlsx", row.names = FALSE)
If you are having trouble getting xlsx
to install, then just write as csv.
write.csv(biotable, file = "Westgard_Biological_Variation.csv", row.names = FALSE)
Conclusion
You can now use the same general approach to parse any table you have web access to, no mater how small or big it is. Here is a complete script in one place:
library(httr) library(rvest) library(xlsx) wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("yournamehere")) xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]") #define empty dataframe biotable = data.frame(matrix(NA,0, 8)) #loop over the 8 html tables for (j in 1:8){ subtable <- wg %>% read_html() %>% html_nodes(xpath = xpaths[j] ) %>% html_table(., fill = TRUE) subtable <- subtable[[1]] biotable <- rbind(biotable,subtable) } table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE") names(biotable) <- table.header #remove extraneous rows for.removal <- grep("Analyte", biotable$Analyte) biotable <- biotable[-for.removal,] #make missing data into NA biotable[ biotable == "---" ] <- NA row.names(biotable) <- 1:nrow(biotable) #convert appropriate columns to numeric biotable[,3:8] <- lapply(biotable[3:8], as.numeric) #get rid of minus signs in column 1 biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE) write.xlsx(biotable, file = "Westgard_Biological_Variation.xlsx", row.names = FALSE) write.csv(biotable, file = "Westgard_Biological_Variation.csv", row.names = FALSE)
Parting Thought on Tables
You prepare a table before me in the presence of my enemies. You anoint my head with oil; my cup overflows.
(Prov 23:5)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.