The case for data snooping
When backtesting automated trading systems, accidental data snooping, or look-ahead error, is an easy mistake to make. In this context the error amounts to making our predictions using the very data we are trying to predict, and it typically creeps in through a mistake in our time-offset calculations somewhere.
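To make the failure mode concrete, here is a minimal sketch (plain R vectors on toy data, not the code from this post) of how an off-by-one in a time offset quietly hands the model its own target:

```r
set.seed(1)
price <- 100 * exp(cumsum(rnorm(500, 0.0005, 0.01)))  # toy price series
ret   <- diff(log(price))                              # daily log returns
n     <- length(ret)

# Intended setup: use today's return to predict tomorrow's.
feature <- ret[1:(n - 1)]   # known at time t
target  <- ret[2:n]         # what we are trying to predict (t + 1)

# The snooping mistake: an offset slip the other way feeds the model
# the very value it is being asked to predict.
feature_snooped <- ret[2:n]

cor(feature, target)          # roughly zero on random-walk data
cor(feature_snooped, target)  # exactly 1 -- a red flag
```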
However, it can be a useful tool. If we give our system perfect forward knowledge:
1) We establish an upper bound for performance.
2) We can get a quick read on whether something is worth pursuing further, and
3) It can help highlight other coding errors.
The first two are pretty closely related. If our wonderful model is built using the values it is trying to predict, and still performs no better than random guessing, it’s probably not worth the effort trying to salvage it.
The flip side is that when it does perform well, that is as good as it will ever get.
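A quick way to see that upper bound in action, continuing the toy series above (again an illustrative sketch rather than anything from the original post):

```r
# Perfect-foresight benchmark: trade each day knowing tomorrow's return sign.
signal_perfect <- sign(ret[2:n])         # tomorrow's sign, impossibly known today
signal_naive   <- sign(ret[1:(n - 1)])   # honest momentum rule: today's sign

sum(signal_perfect * ret[2:n])   # upper bound for this class of rule
sum(signal_naive   * ret[2:n])   # what the honest version actually earns
```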
There are two main ways it can help identify errors. Firstly, if our subsequent testing on non-snooped data gives comparable performance, we probably have another look-ahead bug lurking somewhere.
Secondly, amazing accuracy paired with poor trading performance is another sign of a bug lurking somewhere.
Example
I wanted to compare SVM models trained on actual prices versus a series of log returns, using the rolling model code I put up earlier. As a baseline, I also added a 200-day simple moving average model.
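I haven't reproduced the full rolling-model code here, but the sketch below captures the shape of the comparison. The window size, the use of e1071::svm, and the single-feature setup are my assumptions, and it runs on the toy series from above rather than real prices:

```r
library(e1071)   # svm
library(TTR)     # SMA

N      <- length(price)
dir    <- ifelse(ret > 0, 1, -1)   # direction of each day's return
target <- dir[-1]                  # direction of the return over day t -> t+1

feat_price <- price[2:(N - 1)]          # close on day t
feat_ret   <- ret[1:(length(ret) - 1)]  # previous day's return, known on day t
feat_snoop <- ret[2:length(ret)]        # (S) variant: the return being predicted

# Refit an SVM on a rolling window and predict the next day's direction.
roll_svm <- function(feat, target, win = 100) {
  preds <- rep(NA_real_, length(target))
  for (t in (win + 1):length(target)) {
    idx <- (t - win):(t - 1)
    fit <- svm(x = matrix(feat[idx]), y = factor(target[idx]))
    preds[t] <- as.numeric(as.character(predict(fit, matrix(feat[t]))))
  }
  preds
}

acc <- function(p, y) mean(p == y, na.rm = TRUE)

pred_price <- roll_svm(feat_price, target)
pred_ret   <- roll_svm(feat_ret,   target)
pred_snoop <- roll_svm(feat_snoop, target)

# 200-day SMA baseline: long when the close is above its 200-day average.
sma200   <- SMA(price, 200)[2:(N - 1)]
pred_sma <- ifelse(feat_price > sma200, 1, -1)

c(price = acc(pred_price, target), returns = acc(pred_ret, target),
  snooped = acc(pred_snoop, target), sma = acc(pred_sma, target))
```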
[Results table not reproduced here. (S) indicates models given snooped data.]
A few things strike me about this.
For the SMA system, peeking ahead by a day only provides a small increase in accuracy. Given the longer-term nature of the 200-day SMA, this is probably to be expected.
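For example, a one-day-snooped SMA signal could look like this (again my guess at how the (S) variant is constructed, built on the toy series above):

```r
# Snooped SMA: the signal for the return over day t -> t+1 is computed from
# day t+1's close, i.e. shifted forward one day relative to pred_sma above.
sma_full    <- SMA(price, 200)
sig_snooped <- ifelse(price > sma_full, 1, -1)[3:N]

acc(sig_snooped, target)  # usually only a touch above acc(pred_sma, target),
                          # since one extra day barely moves a 200-day average
```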
For the SVM trained systems, the results are somewhat contradictory.
For the look-forward models, training on price data gave much lower accuracy than training on log returns, and the log-return model also performed much better. Note that both could have achieved 100% accuracy simply by predicting the first column of their training data.
However, when not snooping, the models trained on closing prices did much better than those trained on returns. I'm not 100% sure there isn't still a bug lurking somewhere, but hey, if the code were off it would have shown up in the forward-tested results, no?
Feel free to take a look and have a play around with the code, which is up here.