Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
(This is the second of a series of ongoing posts on using Graph Algebra in the Social Sciences.)
First-order linear difference equations are powerful, yet simple modeling tools. They can provide access to useful substantive insights to real-world phenomena. They can have powerful predictive ability when used appropriately. Additionally, they may be classified in any number of ways in accordance with the parameters by which they are defined. And though they are not immune to any of a host of issues, a thoughtful approach to their application can always yield meaningful information, if not for discussion then for further refinement of the model.
Let’s look at that example from the last post:
df <- read.csv(file="http://nortalktoowise.com/code/datasets/Electricity.csv", head=TRUE, sep=",")
attach(df)
lagvar <- function(x,y){return(c(rep(NA,y),x[-((length(x)-y+1):length(x))]))}
lagvar1 <- lagvar(electricalusage,1)
model <- lm(electricalusage ~ lagvar1)
y1 <- 290
y2 <- 0
t <- 0
a <- model$coefficients[[2]]
b <- model$coefficients[[1]]
timeserieslength <- nrow(df)
for (i in 1:timeserieslength) {
y2[i] <- (a*y1[i])+b
t[i] <- i
if (i < timeserieslength) y1[i+1]=y2[i]}
plot(year, electricalusage, xlab="Year", ylab="Electricity Usage (in per capita KWh)", main="Figure 1: Annual Electricity Usage, 1920-70", pch=19)
lines(year, y2, lwd=2)
Alright, this probably needs a little bit of explaining. This first few lines are just getting the data from the URL, and attaching the set so we can call the variable names directly. I’ll have to justify that weird function. Instead of breaking it down character for character, I’m just going to get away with explaining what it does. lagvar is function that takes a table, looks at the variable you tell it to (in this case, electricalusage), and copies it into a new array. Why would you want that? Well, it has the added flexibility of delaying the variable by a number of rows (which you can readily specify. This one, for instance, takes our dataset:
> df
year electricalusage
1 1920 339
2 1921 347
3 1922 359
4 1923 368
5 1924 378
...
47 1966 5265
48 1967 5577
49 1968 6057
50 1969 6571
51 1970 7066
and generates lagvar1, which “lags” electricalusage by one row:
> newdata
year electricalusage lagvar1
1 1920 339 NA
2 1921 347 339
3 1922 359 347
4 1923 368 359
5 1924 378 368
...
47 1966 5265 4933
48 1967 5577 5265
49 1968 6057 5577
50 1969 6571 6057
51 1970 7066 6571
This is extremely useful for calculating differences over time. Then we run a simple linear regression: electricalusage as explained by lagvar1. However, the regression is not the end-all of this process (and a million freshman statistics students gasp in horror). We just run the regression to estimate our slope and intercept (a and b respectively). The rest of the code if effectively identical to the example from my last post, though I must make one small confession: you might have noticed that y1 is a constant, while the other parameter variables are called from the dataset or linear model. The truth is that I don’t know an easy mathematical way to pick a perfect y1. It should be pretty close to electricalusage at the beginning of the graph, which means 339-ish. It’s actually a bit lower (290) because you must pick a value which represents the electricalusage in a place where the line doesn’t go, just the tiniest bit behind the graph. In other words, just estimate it. If your curve is off-target, just fiddle around with the y1 value until it looks right. But if your y1 is crappy, your prediction curve is still going to beat the hell out of the linear model:
So, what’s Y* in this graph? Mathematically it is b/(1-a) = 64, but the substantive implication is silly. The close fit seen in figure 1 is compelling for predictive accuracy (and this may indeed be a fine graph for future predictions), but taking this model at face value still leaves us with a completely incorrect substantive conclusion: Human beings back to the beginning of time have had about 64 kWhrs at their disposal annually. While this may not play a meaningful part in any model which incorporates these data, it is an example of the need to at least be aware of the substantive implications of a model.
Next time: When a first-order linear difference equation doesn’t cut it.
References:
Code adapted from http://www.courtneybrown.com/classes/ModelingSocialPhenomena/Assignments/Assignment2CourtneyBrownMathModeling.htm
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.