Pull Yahoo Finance Key-Statistics Instantaneously Using XML and XPath in R
[This article was first published on All Things R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This two-part blog post I published a day ago required key-stats from Yahoo Finance for all the companies in the control group I created for my research. I wanted all the key-stats pulled, arranged in a data-frame and then present them side-by-side to form my opinions.
Quantmod package has “getQuote” method which should return the desired metrics. The number and names of the metrics can be controlled via “yahooQF” (see the script below.) Unfortunately, this method seems to be broken as the resulting data-frame had large amount of null values for some metrics. Here is the script nonetheless if one wishes to experiment:
#######################################################################
# Script to download key metrics for a set of stock tickers using the quantmod package
#######################################################################
require(quantmod)
require(“plyr”)
what_metrics <- yahooQF(c("Price/Sales",
“P/E Ratio”,
“Price/EPS Estimate Next Year”,
“PEG Ratio”,
“Dividend Yield”,
“Market Capitalization”))
tickers <- c("AAPL", "FB", "GOOG", "HPQ", "IBM", "MSFT", "ORCL", "SAP")
# Not all the metrics are returned by Yahoo.
metrics <- getQuote(paste(tickers, sep="", collapse=";"), what=what_metrics)
#Add tickers as the first column and remove the first column which had date stamps
metrics <- data.frame(Symbol=tickers, metrics[,2:length(metrics)])
#Change colnames
colnames(metrics) <- c("Symbol", "Revenue Multiple", "Earnings Multiple",
“Earnings Multiple (Forward)”, “Price-to-Earnings-Growth”, “Div Yield”, “Market Cap”)
#Persist this to the csv file
write.csv(metrics, “FinancialMetrics.csv”, row.names=FALSE)
#######################################################################
After some digging around and staring at the raw HTML for Yahoo’s KeyStats page for sometime, I decided to use the XML package and the XPath operators to get all the nodes which host key stats (name and values). This turned out to be lot simpler. Let’s walk through this using three easy steps:
1) The CSS class name for the HTML nodes which host the name of the metric such as Market Cap or Enterprise value is “yfnc_tablehead1”. This made it quite easy to grab all the elements from the HTML tree with this class name:
nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
2) Now all I needed to do was get the value of this node using xmlValue function to get the name of the metric (Enterprise Value as an example):
measures <- sapply(nodes, xmlValue)
3) Next, to get the value of any metric, I used the getSibling function to get the adjacent node (i.e. sibling) and used xmlValue function to get the value. Here is how it was done:
values <- sapply(nodes, function(x) xmlValue(getSibling(x)))
#######################################################################
##Alternate method to download all key stats using XML and x_path – PREFERRED WAY
#######################################################################
setwd(“C:/Users/i827456/Pictures/Blog/Oct-25”)
require(XML)
require(plyr)
getKeyStats_xpath <- function(symbol) {
yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding="UTF-8")
#search for
nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
if(length(nodes) > 0 ) {
measures <- sapply(nodes, xmlValue)
#Clean up the column name
measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:","", measures))
#Remove dups
dups <- which(duplicated(measures))
#print(dups)
for(i in 1:length(dups))
measures[dups[i]] = paste(measures[dups[i]], i, sep=” “)
#use siblings function to get value
values <- sapply(nodes, function(x) xmlValue(getSibling(x)))
df <- data.frame(t(values))
colnames(df) <- measures
return(df)
} else {
break
}
}
tickers <- c("AAPL")
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), “FinancialStats_updated.csv”,row.names=TRUE)
#######################################################################
Happy Analyzing!
All Things R &
All Things Analytics (http://goo.gl/CBYQI)
To leave a comment for the author, please follow the link and comment on their blog: All Things R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Copyright © 2022 | MH Corporate basic by MH Themes