Parse an Online Table into an R Dataframe – Westgard’s Biological Variation Database

dtholmes@mail.ubc.ca

5 years ago

[This article was first published on The Lab-R-torian, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Background

From time to time I have wanted to bring an online table into an R dataframe. While in principle, the data can be cut and paste into Excel, sometimes the table is very large and sometimes the columns get goofed up in the process. Fortunately, there are a number of R tools for accomplishing this. I am just going to show one approach using the rvest package. The rvest package also makes it possible to interact with forms on webpages to request specific material which can then be scraped. I think you will see the potential if you look here.

In our (simple) case, we will apply this process to Westgard's desirable assay specifications as shown on his website. The goal is to parse out the biological variation tables, get them into a dataframe and the write to csv or xlsx.

Reading in the Data

The first thing to do is to load the rvest and httr packages and define an html session with the html_session() function.

library(rvest)
library(httr)
wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("LabRtorian"))

Now looking at the webpage, you can see that there are 8 columns in the tables of interest. So, we will define an empty dataframe with 8 columns.

#define empty table to hold all the content
biotable = data.frame(matrix(NA,0, 8))

We need to know which part of the document to scrape. This is a little obscure, but following the instructions in this post, we can determine that the xpaths we need are:

/html/body/div[1]/div[3]/div/main/article/div/table[1]

/html/body/div[1]/div[3]/div/main/article/div/table[2]

/html/body/div[1]/div[3]/div/main/article/div/table[3]

…

etc.

There are 8 such tables in the whole webpage. We can define a character vector for these as such:

xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

Now we make a loop to scrape the 8 tables and with each iteration of the loop, append the scraped subtable to the main dataframe called biotable using the rbind() function. We have to use the parameter fill = TRUE in the html_table() function because the table does not happen to always a uniform number of columns.

for (j in 1:8){                
  subtable <- wg %>%
  read_html() %>%
  html_nodes(xpath =  xpaths[j]) %>%
  html_table(., fill = TRUE) 
  subtable <- subtable[[1]]
  biotable <- rbind(biotable,subtable)
}

Clean Up

Now that we have the raw data out, we can have a quick look at it:

X1	X2	X3	X4	X5	X6	X7	X8
	Analyte	Number of Papers	Biological Variation	Biological Variation	Desirable specification	Desirable specification	Desirable specification
	Analyte	Number of Papers	CVI	CVg	I(%)	B(%)	TE(%)
S-	11-Desoxycortisol	2	21.3	31.5	10.7	9.5	27.1
S-	17-Hydroxyprogesterone	2	19.6	50.4	9.8	13.5	29.7
U-	4-hydroxy-3-methoximandelate (VMA)	1	22.2	47.0	11.1	13.0	31.3
S-	5' Nucleotidase	2	23.2	19.9	11.6	7.6	26.8
U-	5'-Hydroxyindolacetate, concentration	1	20.3	33.2	10.2	9.7	26.5
S-	α1-Acid Glycoprotein	3	11.3	24.9	5.7	6.8	16.2
S-	α1-Antichymotrypsin	1	13.5	18.3	6.8	5.7	16.8
S-	α1-Antitrypsin	3	5.9	16.3	3.0	4.3	9.2

We can see that we need define column names and we need to get rid of some rows containing extraneous column header information. There are actually 8 such sets of headers to remove.

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")
names(biotable) <- table.header

Let's now find rows we don't want and remove them.

for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal,]

You will find that the table has missing data which is written as “- – -”. This should be now replaced by NA and the column names should be assigned to sequential integers. Also, we will remove all the minus signs after the specimen type. I'm not sure what they add.

biotable[biotable == "---"] <- NA
row.names(biotable) <- 1:nrow(biotable)
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

Check it Out

Just having another look at the first 10 rows:

Sample	Analyte	NumPapers	CVI	CVG	I	B	TE
S	11-Desoxycortisol	2	21.3	31.5	10.7	9.5	27.1
S	17-Hydroxyprogesterone	2	19.6	50.4	9.8	13.5	29.7
U	4-hydroxy-3-methoximandelate (VMA)	1	22.2	47.0	11.1	13.0	31.3
S	5' Nucleotidase	2	23.2	19.9	11.6	7.6	26.8
U	5'-Hydroxyindolacetate, concentration	1	20.3	33.2	10.2	9.7	26.5
S	α1-Acid Glycoprotein	3	11.3	24.9	5.7	6.8	16.2
S	α1-Antichymotrypsin	1	13.5	18.3	6.8	5.7	16.8
S	α1-Antitrypsin	3	5.9	16.3	3.0	4.3	9.2
S	α1-Globulins	2	11.4	22.6	5.7	6.3	15.7
U	α1-Microglobulin, concentration, first morning	1	33.0	58.0	16.5	16.7	43.9

Now examining the structure:

str(biotable)

## 'data.frame':    370 obs. of  8 variables:
##  $ Sample   : chr  "S" "S" "U" "S" ...
##  $ Analyte  : chr  "11-Desoxycortisol" "17-Hydroxyprogesterone" "4-hydroxy-3-methoximandelate (VMA)" "5' Nucleotidase" ...
##  $ NumPapers: chr  "2" "2" "1" "2" ...
##  $ CVI      : chr  "21.3" "19.6" "22.2" "23.2" ...
##  $ CVG      : chr  "31.5" "50.4" "47.0" "19.9" ...
##  $ I        : chr  "10.7" "9.8" "11.1" "11.6" ...
##  $ B        : chr  "9.5" "13.5" "13.0" "7.6" ...
##  $ TE       : chr  "27.1" "29.7" "31.3" "26.8" ...

It's kind-of undesirable to have numbers as characters so…

#convert appropriate columns to numeric
biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

Write the Data

Using the xlsx package, you can output the table to an Excel file in the current working directory.

library(xlsx)
write.xlsx(biotable,
            file = "Westgard_Biological_Variation.xlsx",
            row.names = FALSE)

If you are having trouble getting xlsx to install, then just write as csv.

write.csv(biotable,
            file = "Westgard_Biological_Variation.csv",
            row.names = FALSE)

Conclusion

You can now use the same general approach to parse any table you have web access to, no mater how small or big it is. Here is a complete script in one place:

library(httr)
library(rvest)
library(xlsx)

wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("yournamehere"))
xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

#define empty dataframe
biotable = data.frame(matrix(NA,0, 8))

#loop over the 8 html tables
for (j in 1:8){                
  subtable <- wg %>%
  read_html() %>%
  html_nodes(xpath =  xpaths[j] ) %>%
  html_table(., fill = TRUE) 
  subtable <- subtable[[1]]
  biotable <- rbind(biotable,subtable)
}

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")
names(biotable) <- table.header

#remove extraneous rows
for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal,]

#make missing data into NA
biotable[ biotable == "---" ] <- NA
row.names(biotable) <- 1:nrow(biotable)

#convert appropriate columns to numeric
biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

#get rid of minus signs in column 1
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

write.xlsx(biotable,
            file = "Westgard_Biological_Variation.xlsx",
            row.names = FALSE)

write.csv(biotable,
            file = "Westgard_Biological_Variation.csv",
            row.names = FALSE)

Parting Thought on Tables

You prepare a table before me in the presence of my enemies. You anoint my head with oil; my cup overflows.

(Prov 23:5)

To leave a comment for the author, please follow the link and comment on their blog: The Lab-R-torian.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.