A little over a month ago Rob Hyndman finished the 2nd edition of his open-source book Forecasting: Principles and Practice. Take a look; it's a fantastic introduction and companion to applied time series modeling using R.
It made me rediscover the tslm() function of the excellent forecast package, which provides a convenient wrapper for linear models with time series data. The function offers two shorthands, trend and season, that can be added as regressors, which is quite useful to avoid problems such as "spurious regression". Let's look at an example.
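As a minimal sketch of those shorthands (on a built-in monthly series, not the data used below):

library(forecast)
# tslm() builds the "trend" and "season" regressors from the time index of the series
fit <- tslm(AirPassengers ~ trend + season)
summary(fit)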
Query Google Trends
# devtools::install_github("PMassicotte/gtrendsR")
library(gtrendsR)
trends <- gtrends(c("frühling", "spring"), geo = c("DE"))
Google Trends can often serve as useful demo data for time series modelling, and the gtrendsR package by Philippe Massicotte and Dirk Eddelbuettel makes it easy to access this data source. Here we ask for the (relative) weekly search volume in Germany over the past five years for the keywords "frühling" and "spring", an English word and its German equivalent.
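The object returned by gtrends() is a list of data frames; the weekly numbers used below live in its interest_over_time element. A quick way to peek at it (the column names are the ones used later in this post):

# one row per week and keyword; "hits" holds the relative search volume
str(trends$interest_over_time)
head(trends$interest_over_time[, c("date", "hits", "keyword")])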
Explore and Decompose the Series
To explore the data we can transform it into time series objects and look at their (classical) decomposition:
library(tidyverse)
library(forecast)
library(ggseas)  # install from https://github.com/ellisp/ggseas

# turn dataframe into timeseries
create_ts <- function(df, kw){
  df %>%
    pluck("interest_over_time") %>%
    filter(date >= as.Date("2014-01-01"), keyword == kw) %>%
    with(ts(hits, start = c(2014, 1), frequency = 52))
}

fruehling <- create_ts(trends, "frühling")
spring <- create_ts(trends, "spring")

# compare decomposition for both time series
cbind(fruehling, spring) %>%
  ggseas::tsdf() %>%
  gather("series", "hits", -x) %>%
  ggseas::ggsdc(aes(x, hits, colour = series), method = "decompose") +
  geom_line() +
  labs(x = "time", y = "hits", title = "Decomposition of Search Traffic")
The overall search volume is higher for "frühling" than for "spring". Both series show a seasonal pattern, while the search volume for "frühling" also shows a slight downward trend, which may be influenced by a few bad springs we have had in Germany! 🙂
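For a quick cross-check of a single series, the classical decomposition can also be run directly in base R (a small sketch, not part of the original post):

# classical decomposition of the German series alone
plot(decompose(fruehling))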
Model the Linear Relationship
Even though this is clearly seasonal data, it may be useful to quantify the linear relationship between the two series using linear regression: how strongly do the two search terms "swing together"?
But because the data is time-dependent, at least one series shows a trend, and both show seasonality, we should consider modeling these characteristics as well.
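As a naive baseline that ignores all of this time structure, one could simply compute the contemporaneous correlation of the two series (a quick sketch; both series cover the same weeks, since they come from the same query):

# raw correlation, ignoring trend, seasonality and autocorrelation
cor(as.numeric(fruehling), as.numeric(spring))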
# customize model output
show_estimates <- function(model) {
  model %>%
    broom::tidy() %>%
    dplyr::select(term, estimate) %>%
    modify_if(is.numeric, round, 2) %>%
    # + some preparation to combine lm and arima output later
    modify_at("term", as.character) %>%
    modify_at("term", stringr::str_replace,
              pattern = "\\(Intercept\\)", replacement = "intercept") %>%
    modify_at("term", stringr::str_replace,
              pattern = "(trend|drift)",  ## improve pattern
              replacement = "trend/drift")
}

# customize html-tables
hux_table <- function(df, caption) {
  library(huxtable)
  content_rows <- 1:nrow(df) + 1  # +1 for the header
  content_cols <- 2:ncol(df)

  df %>%
    hux(add_colnames = TRUE) %>%
    set_bold(row = 1, col = everywhere, value = TRUE) %>%
    set_bottom_border(row = 1, col = everywhere, value = TRUE) %>%
    set_pad_decimal(content_rows, everywhere, ".") %>%
    set_align(c(1, content_rows), content_cols, "right") %>%
    set_number_format(content_rows, everywhere, "%5.4g") %>%
    set_caption(caption)
}
The above code just produces nicer output: show_estimates() serves as a customized version of broom::tidy(), and hux_table() produces nice HTML tables from a data frame.
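Applied to a single model, show_estimates() returns a small two-column table of terms and rounded estimates, for example:

# tidied coefficients (term, estimate) for a simple model
tslm(fruehling ~ spring) %>% show_estimates()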
# fit models
ts_models <- list()
ts_models$lm1 <- tslm(fruehling ~ spring)
ts_models$lm1_trend <- tslm(fruehling ~ trend + spring)

# inspect parameters
ts_models %>%
  map(show_estimates) %>%
  reduce(full_join, by = "term") %>%
  set_names(c("term", names(ts_models))) %>%
  filter(!str_detect(term, "season")) %>%
  hux_table("Coefficients for Linear Models")
term        | lm1    | lm1_trend
intercept   | -61.35 | -59.26
spring      |   2.52 |   2.58
trend/drift |        |  -0.04
We can see that the model with the trend produces similar estimates of the relationship between the two series. In this case the model choice doesn't seem to make that much of a difference, though the model with the trend would probably be preferred. If the trend in the variables were stronger, the differences between the models would be bigger as well.
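Before moving on, it is worth looking at the residuals of the linear model; they are unlikely to be white noise here. A quick sketch using the checkresiduals() helper from the forecast package (not part of the original post):

# residual diagnostics for the linear model: ACF plot and Ljung-Box test
checkresiduals(ts_models$lm1_trend)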
Accounting for Autocorrelation
Time series are often characterised by the presence of trend and/or seasonality, but there may be additional autocorrelation in the data, which can be accounted for. The forecast package makes it easy to combine regression modeling with a model for the time-dependent variation of (the residuals of) a time series, using the Arima() or auto.arima() functions. (For the implementation details please see https://robjhyndman.com/hyndsight/arimax/.)
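auto.arima() is only mentioned here and not used below; as a hedged sketch, the automatic variant would look like this (same xreg usage as in the following code):

# let auto.arima() pick the error structure for the regression on "spring"
fit_auto <- auto.arima(fruehling, xreg = spring)
show_estimates(fit_auto)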
In this case, we will add an autoregressive error term of order 1 (AR(1)) to the regression model.
# fit regression with autocorrelated models
ts_models$ar1 <- Arima(fruehling, xreg = spring, order = c(1, 0, 0))
ts_models$ar1_trend <- Arima(fruehling, xreg = spring, order = c(1, 0, 0),
                             include.drift = TRUE)

# inspect parameters
ts_models %>%
  map(show_estimates) %>%
  reduce(full_join, by = "term") %>%
  set_names(c("term", names(ts_models))) %>%
  filter(!str_detect(term, "season")) %>%
  hux_table("Coefficients including Autocorrelated Models")
term        | lm1    | lm1_trend | ar1   | ar1_trend
intercept   | -61.35 | -59.26    | 17.83 | 0.8
spring      |   2.52 |   2.58    |  0.39 | 0.39
trend/drift |        |  -0.04    |       | 0.15
ar1         |        |           |  0.97 | 0.96
Because our data shows a strong seasonal pattern that hasn't been accounted for in the regression model, the parameters of the AR models differ strongly from those of the plain regression: the AR parameter picks up a lot of the seasonal variation in the data.
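If one also wanted to capture the seasonality explicitly (not done in this post), one hedged option would be to add Fourier terms as extra regressors via forecast::fourier(); a sketch:

# add Fourier terms for the yearly seasonality next to the "spring" regressor
xreg_seasonal <- cbind(spring = as.numeric(spring), fourier(fruehling, K = 2))
fit_seasonal  <- Arima(fruehling, xreg = xreg_seasonal, order = c(1, 0, 0))
show_estimates(fit_seasonal)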
Summary
# compare models
ts_models %>%
  map_df(BIC) %>%
  gather("model", "fit") %>%
  arrange(fit) %>%
  hux_table("Information Criteria")
model     | fit
ar1       | 1510
ar1_trend | 1514
lm1       | 1884
lm1_trend | 1884
The visual inspection of the data and the corresponding BIC values indicate that the ar1 model may be the model with the best fit, and hence its parameters should be preferred over the others.
Overall I wanted to showcase some of the tools one can use to analyze the relationship between two time series and the implications of certain model choices. If you would like to read a little more about this topic, please take a look at the regression chapters of the fpp2 book. For comments and feedback on this post you can use the comments section or reach me on Twitter via @henningsway.