Help! My model fits too well!

This is sort of related to my sidelined study of graph algebra. I was thinking about data I could apply a first-order linear difference model to, and the stock market came to mind. After all, despite some black-swan-sized shocks, what better predicts a day’s closing than the previous day’s closing? So I hunted down the data and graphed exactly that:

Isn’t that just lovely? The tight clustering around the line indicates that we have found a very good linear fit. How good? Well, let’s take a peek at our summary(model):

Call:
lm(formula = close ~ open)

Residuals:
      Min        1Q    Median        3Q       Max
-774.9914   -3.4477   -0.3122    3.2318  924.4627

Coefficients:
             Estimate  Std. Error   t value  Pr(>|t|)
(Intercept) 0.4299070   0.4206543     1.022     0.307
open        1.0000351   0.0000954   10482.9    <2e-16 ***

Residual standard error: 49.74 on 20593 degrees of freedom
Multiple R-squared: 0.9998,     Adjusted R-squared: 0.9998
F-statistic: 1.099e+08 on 1 and 20593 DF,  p-value: < 2.2e-16

Whoa! An R-squared of 0.9998. In other words, my very simple model explains 99.98% of all the variation seen in the Dow Jones Industrial Index end-of-day prices. Show this to any statistician and they’d say that’s nearly impossible. You’ve got to have some tautology in the model, some independent variable that is basically the same as the dependent variable. And they’d be right. However, the linear model is not my goal. I don’t want to predict the progress of the Dow over a day. I want to do it over a much longer term. For that reason, I can look past their complaints and build the first-order linear difference model.
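To see why a statistician would be suspicious, here is a minimal sketch on made-up data. It regresses a simulated random walk on its own one-step lag, which is a stand-in for the "tautology" complaint and not the Dow data itself. The names walk, sim and fit, the seed, and the series length are all assumptions chosen for illustration.

```r
# Simulated data only: a random walk standing in for a price series.
set.seed(1)
n <- 5000
walk <- 100 + cumsum(rnorm(n))

# Regress "today" on "yesterday": the predictor is just the response
# shifted by one step, so the two columns are nearly identical.
sim <- data.frame(yesterday = walk[-n], today = walk[-1])
fit <- lm(today ~ yesterday, data = sim)

r_squared <- summary(fit)$r.squared
slope <- coef(fit)[["yesterday"]]
print(c(r_squared = r_squared, slope = slope))
```

Even though each step of the walk is pure noise, the fit should report an R-squared very close to 1 and a slope very close to 1, simply because consecutive values of a slowly wandering series barely differ.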

If we plot the function y(x) = 1.0000351(y(x-1)) + 0.4299070, the output is a little less than satisfying.  Here is that function over a scatterplot of Dow scores:

That looks pretty underwhelming.  In fact, it almost looks…linear.  Gross.  What happened?

First off, I assure you it is not the problem the aforementioned statisticians pointed out. The real problem is that, though our slope was really convincing, it was also really close to 1. Which means that it basically fell out of our equation, leaving y(x) = y(x-1) + 0.43. If all we’re doing is adding 0.43 every iteration, we have more or less generated the linear equation y = 0.43x + y(0). That doesn’t tell me anything about the stock market!
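As a rough check on that argument, the sketch below iterates the recurrence with the coefficients from the summary above, then compares it with the standard closed form for a first-order linear difference equation, y(n) = a^n y(0) + b(a^n - 1)/(a - 1), and with the straight line you get by treating the slope as exactly 1. The starting value 0.3 is the one used in the code at the end of this post. The step count is an assumption, inferred from the 20593 residual degrees of freedom plus the two fitted coefficients.

```r
a  <- 1.0000351   # slope from the model summary
b  <- 0.4299070   # intercept from the model summary
y0 <- 0.3         # starting value (same as y1 in the code at the end of the post)
n  <- 20595       # assumed number of steps: 20593 residual df + 2 coefficients

# Iterate y(x) = a * y(x-1) + b, one step per trading day.
y <- numeric(n)
prev <- y0
for (i in seq_len(n)) {
  y[i] <- a * prev + b
  prev <- y[i]
}

steps <- seq_len(n)
closed_form <- a^steps * y0 + b * (a^steps - 1) / (a - 1)  # exact solution for a != 1
straight    <- y0 + b * steps                              # what you get if a were exactly 1

print(max(abs(y - closed_form)))  # iteration and closed form should agree closely
print(c(iterated = y[n], straight = straight[n]))
```

The tiny excess of the slope over 1 compounds across twenty thousand steps, so the iterated curve ends up somewhat above the straight line and bends gently upward. It is still a smooth, featureless trend with none of the market’s ups and downs.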

Take home points:

If you’re interested in running this yourself, the R code is here:

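# Read the CSV of daily prices (fill in the path to your own copy).
df <- read.csv(file="", header=TRUE, sep=",")
# Regress close on open; both are assumed to be columns of df.
model <- lm(close ~ open, data=df)
plot(df$open, df$close, xlab="", ylab="", pch=19)
title(xlab="X", ylab="X(t+1)", main="Plot of the first differences", cex=1.5, col="black")
abline(model, lwd=2)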

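# Iterate the fitted difference equation y2[i] = a*y1[i] + b.
a <- model$coefficients[[2]]       # slope
b <- model$coefficients[[1]]       # intercept
timeserieslength <- nrow(df)
y1 <- numeric(timeserieslength)    # input to each step
y2 <- numeric(timeserieslength)    # output of each step
t <- seq_len(timeserieslength)     # time index
y1[1] <- .3                        # starting value
for (i in 1:timeserieslength) {
  y2[i] <- (a*y1[i])+b
  # Feed this step's output into the next step.
  if (i < timeserieslength) y1[i+1] <- y2[i]
}
plot(t, df$close, xlab="time", ylab="Dow Jones Industrial Index", main="DJIA over time, 1928-2010", pch=19)
lines(t, y2, lwd=2)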
