Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Click to see R set-up code

# Libraries
if (!require("pacman")) {
  install.packages("pacman")
}
pacman::p_load(data.table)

# Set knitr params
knitr::opts_chunk$set(
  comment = NA,
  fig.width = 12,
  fig.height = 8,
  out.width = '100%'
)
Introduction
As we discussed in our last post, Introducing the Redwall ‘Red Flag’ Explorer with New Constructs Data, we were able to test the response of 125,000 quarterly and annual financial statements to the incidence of “red flag” ratios, but some of the most interesting results may have been hidden in de-listed tickers, which are often not available from the default R {quantmod} pricing sources (Yahoo Finance and Alpha Vantage). As a result, 15-25% of financial statements in the first five years of New Constructs data could not be matched to return history, exposing our analysis to “survivor bias”. Redwall’s mission is multifaceted: to conduct research on topics of interest in available financial and government data, but also to explore and share how best to solve problems encountered while using R.
We would love to see market sponsors make weekly or monthly prices available to the open source community, as has become almost expected in so many other areas where open data is used. Even though every closing price was once publicly disclosed in the newspaper, that doesn’t mean those prices are available for analysis, and it takes a lot of work to collect them. One of the only options suggested online was to go to the local library and look through the stacks, but this was not feasible for a personal project involving over 900 securities. It turns out that there are several providers of this data, for a price. Since many other R users will likely face this same problem, one objective of this post is to lay out the options for others looking for older and de-listed securities.
Financial Data APIs for De-Listed Securities
To be clear, this is not the first article about what to do when Yahoo Finance fails. In our case, we looked at four providers, the first being Sharadar, as mentioned above. Sharadar had a substantial listing of 16,512 tickers, going back to 1986 and including ADRs and Canadian and US companies. We discovered that we could match the large majority of our still-missing tickers, but the level of access allowing the full history we needed would cost a full year’s subscription at their recently reduced price of $299 (through the Quandl API). We also looked at Algoseek, but their data only went back to January 2007, so that wouldn’t solve the problem. Norgate offered access to de-listed securities back to 1990 for $346 under their 6-month “Platinum” plan. Norgate also offered a free 3-week trial, but that only included 2 years of data, so we would have spent time figuring out the API without solving the problem unless we went for the subscription. Lastly, there was EOD Historical Data, which offered world prices going back 30+ years for $20 per month.
Finding Matching Prices
Sharadar’s comprehensive spreadsheet offered certainty that the needed prices would actually be available without having to download through the API (a helpful feature while deciding how to proceed). Given that this is only a personal project, we thought we would try EOD first to see if we could access the data we needed for only $20, though this forced us to invest time getting the API working from R without knowing if what we were looking for would even be there.
EOD gave a code example for accessing the API, but we struggled at first to use it with the suggested default .csv format. The error messages were confusing, and it was sometimes unclear whether the data we needed was not available (404) or whether a problem with the API was preventing the download. Customer support was very responsive, considering they are not charging a lot, and after some wheel spinning, helped us to understand that using JSON would work better. Adding “&fmt=json” to the set-up string and parsing with {jsonlite}, as in the code example below, worked perfectly, and we were able to collect over 800 of our 930 missing tickers.
As we were doing this, we also discovered other problems in our Yahoo Finance prices, so we decided to re-collect the tickers we had previously planned to take from Yahoo. The All-World package allows up to 100,000 calls per day, which was more than enough, and we would be able to expand beyond the US to 60 overseas exchanges. EOD felt quite a bit faster than the traditional {quantmod}, even when we used {BatchGetSymbols} to collect a lot of tickers. Unlike with Alpha Vantage, these were adjusted prices. All in all, EOD seemed perfect for our use case.
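For comparison, here is a minimal sketch of the traditional {quantmod} route for one still-listed ticker. This is our own illustration, not EOD's or the post's original code; it assumes the quantmod package is installed and Yahoo Finance is reachable, and uses xts::to.weekly() to match the weekly bars we requested from EOD:

```r
# Traditional route: daily prices from Yahoo via {quantmod}
library(quantmod)

aapl <- getSymbols(
  "AAPL",
  src = "yahoo",
  from = "1997-01-01",
  auto.assign = FALSE
)

# Collapse daily OHLCV to weekly bars for comparison with the EOD feed
aapl_weekly <- xts::to.weekly(aapl)
head(Ad(aapl_weekly))  # adjusted close column
```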
Click to see sample R code for API call
# Tickers and API token
api.token <- "OeAFFmMliFG5orCUuwAKQ8l4WWFQ67YX"
tickers <- c("AAPL.US")
ticker_list <- list()

# Loop to collect prices
for (ticker in tickers) {

  # Set up API string
  ticker.link <- paste(
    "http://nonsecure.eodhistoricaldata.com/api/eod/",
    ticker,
    "?api_token=",
    api.token,
    "&period=w&order=d&fmt=json&from=1997-01-01",
    sep = ""
  )

  # Call to API
  ticker_prices <- try(jsonlite::fromJSON(ticker.link))

  # Wait to avoid overloading API
  Sys.sleep(2)

  # Append new ticker data to list
  ticker_list <- append(ticker_list, list(ticker_prices))
}

# Name list by ticker
names(ticker_list) <- tickers
ticker_list$AAPL.US[1:10,]

         date    open     high    low  close adjusted_close    volume
1  2021-08-23 148.310 150.8600 147.80 148.36       148.3600 166528933
2  2021-08-16 148.535 151.6800 144.50 148.19       148.1900 429022231
3  2021-08-09 146.200 149.4444 145.30 149.10       149.1000 299579344
4  2021-08-02 146.360 148.0450 145.18 146.14       146.1400 284559336
5  2021-07-26 148.270 149.8300 142.54 145.86       145.6418 423324004
6  2021-07-19 143.750 148.7177 141.67 148.56       148.3378 441563672
7  2021-07-12 146.210 150.0000 143.63 146.39       146.1710 504249353
8  2021-07-06 140.070 145.6500 140.07 145.11       144.8929 418559704
9  2021-06-28 133.410 140.0000 133.35 139.96       139.7506 321360121
10 2021-06-21 130.300 134.6400 129.21 133.11       132.9109 354155886
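Once collected, the per-ticker list can be stacked into a single long table with data.table::rbindlist(). The sketch below uses simulated rows in place of the API response (the tickers and values are illustrative, not real API output):

```r
library(data.table)

# Simulated stand-ins for two elements of ticker_list, with the same
# date and adjusted_close columns the JSON response parses into
ticker_list <- list(
  AAPL.US = data.frame(date = c("2021-08-23", "2021-08-16"),
                       adjusted_close = c(148.36, 148.19)),
  MSFT.US = data.frame(date = c("2021-08-23", "2021-08-16"),
                       adjusted_close = c(304.65, 303.39))
)

# Stack into one long table, keeping the list names as a ticker column
prices <- rbindlist(ticker_list, idcol = "ticker")
prices[, date := as.IDate(date)]
prices
```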
Some thoughts on the collected data
We collected pricing data for about 400 more tickers than with our original sources, so we are still missing price histories for about 500. Of these, only about 200 had 10 or more filings, so many were short-lived listings, and possibly not as relevant for our analysis. We also learned that in cases where a ticker had been used more than once, the EOD price history would generally cover the most recent entity, but not the previous one. For example, “DELL” went private in 2014 and was re-listed in 2016, so only data for the latter entity was available from EOD. In these cases, we were often able to use the Yahoo data.
In addition, we learned that Yahoo sometimes gave surprising results. In the case of ticker “EDO” below (not to be mistaken with EOD, who provided the data), the Yahoo data is shown to be very unreliable over the same dates during the early period: it shows prices when there should be none, very small and unchanging prices in some periods, and then prices jumping all over the place after the history should have ended.
Click to see R function used to generate output
get_ticker <- function(ticker) {

  # Load prices for ticker from EOD and Yahoo
  path <- "~/Desktop/David/Projects/new_constructs_targets/data/"
  eod <- fread(cmd = paste0("grep ", ticker, " ", path, "eod_weekly_prices.csv"))
  setnames(eod, c("ticker", "date", "adjusted.close"))
  yahoo <- fread(cmd = paste0("grep ", ticker, " ", path, "historical_prices/nc_complete_prices2.csv"))
  yahoo <- yahoo[, c(1:2, 8)]
  setnames(yahoo, c("ticker", "date", "adjusted.close"))

  # Rbind, order by date and dcast for comparison
  prices <- list(eod, yahoo)
  names(prices) <- c("eod", "yahoo")
  prices <- rbindlist(prices, idcol = "source")[order(date)]

  # Return
  prices[, dcast(.SD, date ~ source, fun = identity, fill = NA_real_)]
}
get_ticker("EDO")

            date   eod yahoo
   1: 1997-12-31 8.688    NA
   2: 1998-01-02    NA  0.51
   3: 1998-01-05 8.438  0.51
   4: 1998-01-12 8.625  0.51
   5: 1998-01-20 8.500  0.51
  ---
1217: 2021-04-12    NA  1.82
1218: 2021-04-19    NA  1.82
1219: 2021-04-26    NA  1.82
1220: 2021-05-03    NA  1.82
1221: 2021-05-10    NA  1.82
Here is another example, “SKP”, illustrating the potential problems, given our desire to find accurate return data after a given date. These price sequences would produce many cases of zero returns, and others with spuriously large positive or negative returns.
get_ticker("SKP")

            date eod yahoo
   1: 1998-01-02  NA 20018
   2: 1998-01-05  NA 20018
   3: 1998-01-12  NA 20018
   4: 1998-01-20  NA 20018
   5: 1998-01-26  NA 20018
  ---
1216: 2021-04-12  NA 52380
1217: 2021-04-19  NA 52380
1218: 2021-04-26  NA 52380
1219: 2021-05-03  NA 52380
1220: 2021-05-10  NA 52380
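One way to screen for sequences like these is to flag runs of unchanged prices, which are what produce the zero returns just mentioned. This is a sketch of ours on a toy series, not part of the original analysis code:

```r
library(data.table)

# Toy weekly series mimicking a stale price feed
px <- data.table(
  date  = as.IDate("2021-04-12") + 7L * 0:4,
  close = c(1.82, 1.82, 1.82, 1.82, 1.82)
)

# Week-over-week return; a run of zeros suggests a stale feed
px[, ret := close / shift(close) - 1]
px[, stale := !is.na(ret) & ret == 0]
px[stale == TRUE]
```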
For this reason, we favored EOD prices, followed by Alpha Vantage when EOD was not available, and finally Yahoo Finance as the last resort, which meant that the large majority came from EOD, as shown in the table below.
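The source-priority idea can be sketched as follows: for each ticker/date, keep the row from the most preferred source that has one. This is an illustrative reconstruction with made-up rows, not the exact code behind the table below:

```r
library(data.table)

# Made-up overlapping observations from two sources
prices <- data.table(
  ticker = "XYZ",
  date   = as.IDate(c("2020-01-06", "2020-01-06", "2020-01-13")),
  source = c("eod", "yahoo", "yahoo"),
  close  = c(10.1, 10.0, 10.4)
)

# Encode the preference order eod > av > yahoo, then keep the first
# (most preferred) row per ticker/date
prices[, source := factor(source, levels = c("eod", "av", "yahoo"))]
best <- prices[order(ticker, date, source), .SD[1], by = .(ticker, date)]
best
```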
path <- "~/Desktop/David/Projects/redwall-analytics/content/post/2021-08-19-when-yahoo-finance-doesn-t-have-de-listed-tickers-needed/"
source(paste0(path, "prices_source_table.R"))
prices_source_table()

   source       N
1:    eod 3680589
2:  yahoo  267814
3:     av  406068
Now we find that we are missing about 200 tickers out of the 5,500 we set out to match. Many of these end in “Q” (so were already on the pink sheets), are foreign, or may simply be bad tickers. In any case, we have probably covered the bulk of the meaningful companies with our analysis.
  [1] "AABA"       "AAMRQ"      "AANI"       "ABCWQ"      "ABII"
  [6] "ABKFQ"      "ABLSQ"      "ACHI"       "ACME_33987" "ACPIQ"
 [11] "ACTT"       "AEMI"       "AFR"        "AGCCQ"      "AHMMQ"
 [16] "AKRXQ"      "ALTV"       "AMOA"       "ANCCQ"      "ANLD"
 [21] "ANRZQ"      "ANSVQ"      "ANVGQ"      "APAC"       "APXSQ"
 [26] "ARDI"       "ARMP_1725"  "ASCL"       "ASYTQ"      "ATNY"
 [31] "ATPAQ"      "AWA"        "BBAO"       "BCKDY"      "BFCF"
 [36] "BHEL.BO"    "BKUNQ"      "BLGM"       "BOW"        "BRLC"
 [41] "BSBN"       "CAMD"       "CBCG"       "CBSS"       "CCOWQ"
 [46] "CDWC"       "CEMJQ"      "CHMT"       "CHZS_8434"  "CLK"
 [51] "CNVX"       "COHM"       "CPGVY"      "CRGIY"      "CRNM"
 [56] "CSKEQ"      "CTCI"       "CTGI"       "CTRA"       "CUNO"
 [61] "CZZ"        "DCGNQ"      "DDIC"       "DEU.F"      "DHI.KS"
 [66] "DIVX"       "DOLE"       "DRTE"       "EBHIQ"      "EGLSQ"
 [71] "EKDKQ"      "EMRG"       "ENMC"       "EPEXQ"      "ERPLQ"
 [76] "EZEM"       "FBNIQ"      "FCE.A"      "FCHDQ"      "FCSE"
 [81] "FFKY"       "FILE"       "FNBP"       "FSNM"       "FWMHQ"
 [86] "GCORE"      "GDPMQ"      "GDYS"       "GISX"       "GNKOQ"
 [91] "GVHR"       "HECL.F"     "HEVV"       "HOFF"       "HRVE"
 [96] "HSTN"       "HYDP"       "IBCPD"      "IION"       "IMNR"
[101] "INFS"       "ISOOE"      "ITWO"       "JHTXQ"      "JRCO"
[106] "JWLR"       "KNGGY"      "KVPHQ"      "LAWE"       "LCAV"
[111] "LEHMQ"      "LFGRQ"      "LGFTY"      "LINEQ"      "LNX"
[116] "LPHIQ"      "LPS"        "LQI"        "MCEL"       "MDRIQ_5781"
[121] "MECAQ"      "MHRCQ"      "MILLQ"      "MKTS"       "MODT"
[126] "MSNW"       "MSSN"       "MTLQQ"      "MUSE"       "MYG"
[131] "NCOC"       "NHR"        "NMGA"       "NNDS"       "NRVHQ"
[136] "NUI"        "NVTP"       "NWACQ"      "NWEC"       "OLAB"
[141] "OO"         "OSCIQ"      "PENX"       "PILLQ"      "PMRY"
[146] "PRFS"       "PRLI"       "PRTLQ"      "PRXZ"       "QDHC"
[151] "QRCP"       "RCOCQ"      "RCRC"       "REVUQ"      "RGAA"
[156] "RGFC"       "RITA"       "RSTO"       "SBIT"       "SBLKE"
[161] "SEN.ETR"    "SEPR"       "SGGHU"      "SHRPQ"      "SHZ"
[166] "SKRRF"      "SMF"        "SPCBQ"      "SPNVD"      "SQAA"
[171] "SRNAE"      "SUZ.DEU"    "SVNT"       "SZ"         "TARRQ"
[176] "TCPTF"      "THQIQ"      "TLCR"       "TMG"        "TNSIQ"
[181] "TOBC"       "TORCQ"      "TRMS"       "TWLL"       "TWTRQ"
[186] "UNISQ"      "UPFC"       "USFC"       "USPI"       "UTA"
[191] "UVSLQ"      "VARI"       "VIONQ"      "VLCY"       "VRAIE"
[196] "VRLKQ"      "VRST"       "VSE"        "VTAI"       "VTRO"
[201] "WAVXQ"      "WC"         "WFTIQ_8616" "WGATQ"      "WG.L"
[206] "WGRD"       "WLS"        "WMANQ"      "WNNB"       "WPSL"
[211] "WWCA"       "XYBRQ"      "YAKC"
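A quick way to break down a list like this is pattern matching, since the “Q” suffix and exchange suffixes are visible in the ticker itself. A small sketch of ours on a handful of the tickers above:

```r
# A few of the still-missing tickers shown above
missing <- c("AABA", "AAMRQ", "AANI", "ABCWQ", "BHEL.BO", "WG.L")

# A trailing "Q" marks issues already moved to the pink sheets;
# a "." usually marks a foreign exchange listing
ends_q  <- grepl("Q$", missing)
foreign <- grepl("\\.", missing)

data.frame(missing, ends_q, foreign)
```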
Conclusion
In the end, our return data is still likely not perfect, but it is pretty comprehensive, and the data visualization reflected in our Red Flag App should be close to an unbiased reflection. The four million stock returns we collected did not cost an unreasonable amount (as long as they are not used for commercial purposes). In our next post, we will explore the life and death of companies since 1986 using some of the data we collected in this project.