Evaluating Quandl Data Quality – part II
This post is a more in-depth analysis of Quandl futures data vs. Bloomberg data. Since my last post, Quandl has expanded its futures database from the original 68 contracts to more than 200. For practical reasons, I limit myself here to the initial list of 60+ contracts. I’m still comparing the “Front Month” contract between the two sources. When evaluating the differences, I want the following:
- Evaluate the scale of the differences
- Evaluate the time localization of the differences (if any)
- A single number that captures both features above
- A measure that is comparable across instruments
After a bit of thinking, I came up with the below metric:
As an example, below is the chart of the above formula over time for the E-mini S&P 500 contract.
I plotted the same chart for each of the 60 contracts listed in my previous post. Interested readers can find all the charts here.
From my perspective, there are essentially two main sources of differences: first, plainly wrong data points that are far off from reality, and second, a difference in the data-building process (i.e. the construction methodology for the front month contract). A mix of both is very likely here. To quantify this, I defined one additional metric: the Mean Absolute Difference (MAD).
$$\mathrm{MAD} = \frac{1}{n}\sum_{t=1}^{n} \lvert D_t \rvert \quad \text{for } D_t \neq 0$$
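To make the MAD definition concrete, here is a minimal Python sketch. It rests on assumptions that the post does not spell out: D_t is taken to be the plain Quandl-minus-Bloomberg price difference on dates present in both sources, and n is taken to be the number of non-zero differences. The function name, the sample dates, and the sample prices are hypothetical and used for illustration only.

```python
from typing import Dict, Optional


def mean_absolute_difference(quandl: Dict[str, float],
                             bloomberg: Dict[str, float]) -> Optional[float]:
    """Average of |D_t| over the dates where the two sources disagree.

    Assumption: D_t = quandl[t] - bloomberg[t], computed only on dates
    that appear in both sources. Returns None when the series never differ.
    """
    common_dates = sorted(set(quandl) & set(bloomberg))
    diffs = [quandl[d] - bloomberg[d] for d in common_dates]
    # Keep only non-zero differences, per the "for D_t != 0" condition.
    nonzero = [abs(x) for x in diffs if x != 0]
    if not nonzero:
        return None
    return sum(nonzero) / len(nonzero)


# Hypothetical front month prices keyed by date (not real market data).
quandl_prices = {
    "2013-01-02": 100.0,
    "2013-01-03": 101.0,
    "2013-01-04": 102.5,
    "2013-01-07": 99.0,   # no Bloomberg value on this date, so it is ignored
}
bloomberg_prices = {
    "2013-01-02": 100.0,
    "2013-01-03": 100.5,
    "2013-01-04": 104.0,
}

mad = mean_absolute_difference(quandl_prices, bloomberg_prices)
print(mad)
```

Because MAD is computed on raw differences under these assumptions, instruments quoted on very different price scales are not directly comparable; a relative difference could be substituted for D_t if that matters for the analysis.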
Instrument | Quandl Symbol | Bloomberg Ticker | MAD |
---|---|---|---|
Soybean Oil | OFDP/FUTURE_BO1 | BO1 Comdty | 12254897 |
Russian Ruble | OFDP/FUTURE_RU1 | RU1 Curncy | 29653 |
DJ-UBS Commodity Index | OFDP/FUTURE_AW1 | DNA Index | 3041 |
S&P500 Volatility Index | OFDP/FUTURE_VX1 | UX1 Index | 2453 |
Cocoa | OFDP/FUTURE_CC1 | CC1 Comdty | 1552 |
Lean Hogs | OFDP/FUTURE_LN1 | LH1 Comdty | 391 |
Ranking the 60+ contracts by MAD immediately identifies the largest differences: Soybean Oil, Russian Ruble, DJ-UBS Commodity Index, S&P500 Volatility Index, Cocoa, and Lean Hogs. Those are the obvious candidates for immediate checking.
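The ranking step can be sketched in a few lines of Python. The MAD values are the six reported in the table; the `top_candidates` helper and its `top_n` parameter are hypothetical names introduced for this illustration, and the cutoff of three is arbitrary.

```python
from typing import Dict, List, Tuple

# MAD values as reported in the table, in no particular order.
mad_by_instrument: Dict[str, float] = {
    "Lean Hogs": 391,
    "Russian Ruble": 29653,
    "Cocoa": 1552,
    "Soybean Oil": 12254897,
    "S&P500 Volatility Index": 2453,
    "DJ-UBS Commodity Index": 3041,
}


def top_candidates(mad_values: Dict[str, float],
                   top_n: int) -> List[Tuple[str, float]]:
    """Return the top_n instruments with the largest MAD, largest first."""
    ranked = sorted(mad_values.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Print the three instruments to check first (cutoff chosen arbitrarily).
for name, value in top_candidates(mad_by_instrument, top_n=3):
    print(f"{name}: {value}")
```

A fixed `top_n` is only one way to pick candidates; a threshold on MAD, or a weighting by how relevant each contract is to the analyst, would work just as well.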
I put together what I think is the basis for a systematic data checking approach. It can obviously be refined in many ways, but those refinements depend largely on what one wants to do with the data and which contracts are relevant to the analyst. For example, I assume that accurate data for the E-mini S&P 500 contract matters more to most people than accurate data for the Milk contract.
As usual, any comments are welcome.