Salaries and time series regression

[This article was first published on R Analystatistics Sweden , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, I will examine Dynamic Linear Models and Time-Series Regression. I will return to data for Engineers from Statistics Sweden. Since the salaries for each year, each stratum (age group) is strongly correlated with the salary for the previous year it does not seem too distant to use a time series to represent the change of salaries throughout the period.

I will let the age represent the season in the time series. This violates the properties of a regular time series and has to be considered for the rest of the analysis. First, let´s decompose the series into its trend and seasonal patterns.

First, define libraries and functions.

library (tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.3
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.3
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(imputeTS)
## Warning: package 'imputeTS' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(TSstudio)
## Warning: package 'TSstudio' was built under R version 4.0.5
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
library(dynlm)
## Warning: package 'dynlm' was built under R version 4.0.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.5
## 
## Attaching package: 'zoo'
## The following object is masked from 'package:imputeTS':
## 
##     na.locf
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(lmtest) 
## Warning: package 'lmtest' was built under R version 4.0.4
library(sandwich)
## Warning: package 'sandwich' was built under R version 4.0.3
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.0.5

readfile <- function (file1){
  read_csv (file1, col_types = cols(), locale = readr::locale (encoding = "latin1"), na = c("..", "NA")) %>%
    gather (starts_with("19"), starts_with("20"), key = "year", value = salary) %>%
    mutate (year_n = parse_number (year))
}

# Thanks to Grant, Stack Overflow
predNeweyWest <- function (model){
  pred_df <- data.frame(fit = predict(model))

  X_mat <- model.matrix(model)

  v_hac <- NeweyWest(model, prewhite = FALSE, adjust = TRUE)

  var_fit_hac <- rowSums((X_mat %*% v_hac) * X_mat)
  se_fit_hac <- sqrt(var_fit_hac)

  pred_df <-
    pred_df %>%
    mutate(se_fit_hac = se_fit_hac) %>%
    mutate(
      lwr_hac = fit - qt(0.975, df = model$df.residual) * se_fit_hac,
      upr_hac = fit + qt(0.975, df = model$df.residual) * se_fit_hac
    )
}

plotmodel <- function(data, pred_df, no_n = FALSE){ 
  if(no_n){
  bind_cols(
    data,
    pred_df
    ) %>%
      ggplot(aes(x = year_dec, y = salary, ymin = lwr_hac, ymax = upr_hac)) + 
      geom_point() + 
      geom_ribbon(fill = "#E41A1C", alpha = 0.3, col = NA) +
      labs(
        x = "Year",
        y = "Salary (SEK/month)",
        caption = 'Shaded region indicates HAC 95% CI.'
    )
  }
  else{
  bind_cols(
    data,
    pred_df
    ) %>%
      ggplot(aes(x = year_dec, y = salary, color = n, ymin = lwr_hac, ymax = upr_hac)) + 
      geom_point() + 
      geom_ribbon(fill = "#E41A1C", alpha = 0.3, col = NA) +
      labs(
        x = "Year",
        y = "Salary (SEK/month)",
        caption = 'Shaded region indicates HAC 95% CI.'
    )    
  }
}

assess_model <- function(model, timeseries, data, no_n = FALSE, doexp = FALSE){
  print(summary (model))

  print(coeftest(model, vcov = NeweyWest, prewhite = F, adjust = T))

  print(checkresiduals(model))
  
  if(doexp){
    pred_df <- exp(predNeweyWest(model))
  } else {
    pred_df <- predNeweyWest(model)
  }

  pred_df$year_dec <- timeseries

  plotmodel(data, pred_df, no_n)
}

The data table is downloaded from Statistics Sweden. It is saved as a comma-delimited file without heading, 000000D2_20210506-201343.csv, http://www.statistikdatabasen.scb.se/pxweb/en/ssd/.

The table: Average monthly pay (total pay), non-manual workers all sectors (SLP), SEK by occupational group (SSYK), age, sex and year. SSYK 2012 214, Year 2014 - 2019

The average age within each age group is used as a numeric value for graphical presentation and the linear model.

The number of Engineers in each stratum is downloaded separately in the file 000000CZ_20210506-201420.csv.

tb <- readfile("000000D2_20210506-201343.csv") %>%
  rowwise() %>%
  mutate(age_l = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[1]) %>%
  rowwise() %>%
  mutate(age_h = unlist(lapply(strsplit(substr(age, 1, 5), "-"), strtoi))[2]) %>%
  mutate(age_n = (age_l + age_h) / 2)

tbcount <- readfile("000000CZ_20210506-201420.csv")
tbcount$salary <- replace(tbcount$salary, is.na(tbcount$salary), 0)

tb$n <- tbcount$salary

Let’s have a look at the age distribution for the different years for men and women.

tb %>%
  ggplot () +  
    geom_point (mapping = aes(x = age_n,y = n, colour = year_n, shape = sex)) + 
  labs(
    x = "Age",
    y = "Number of Engineers"
  )

Create a time series for each gender. Time series can not have missing values, Impute missing values in time series with arima model. Women don’t have any data for the age group 65-66 year, that group is filtered away.

tb_men <- filter(tb, sex == "men")

tb_women <- filter(tb, sex == "women") %>% filter(age_n != 65.5)

summary(tb_men$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   27300   36550   47100   43340   49500   52800       1

summary(tb_women$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   26600   35000   44700   41293   45900   52100       1

tbts_men <- ts(tb_men$salary, start = 2014, freq = 6) %>% na_kalman("auto.arima")

tbts_women <- ts(tb_women$salary, start = 2014, freq = 5) %>% na_kalman("auto.arima")

tb_men$salary <- as.numeric(tbts_men)

tb_women$salary <- as.numeric(tbts_women)

Let’s use the decompose function from the stats package to view the trend, seasonal and random component of the time series.

decompose(tbts_men) %>% plot()

decompose(tbts_women) %>% plot()

Let´s have a look at the autocorrelation for the series. As expected the series for men shows a strong correlation with its sixth lag, i.e. the same age category the year before. The series for women shows a strong correlation with its fifth lag.

acf(tbts_men, 36)

acf(tbts_women, 30)

The partial autocorrelation function gives the partial correlation of a stationary time series with its own lagged values, regressed the values of the time series at all shorter lags.

pacf(tbts_men, 36)

pacf(tbts_women, 36)

The following plot shows the correlation between the salary and its yearly lag for three years.

ts_lags(tbts_men, c(6, 12, 18))
ts_lags(tbts_women, c(5, 10, 15))

Now, let’s fit an arima model to the time series with the auto.arima from the forecast library. The summary shows that the auto.arima has identified a SAR(2) process with drift and additionally an element of random walk. The checkresiduals function plots the residuals from the arima model, the autocorrelation of the residuals and a histogram of the residual distribution. The Ljung-Box test suggests that only white noise remains in the residual. The ggseasonplot plots the salary distribution on age for the years 2014-2019, remember that we used age as a season in this approach.

arimamodel_men <- auto.arima(tbts_men)

summary(arimamodel_men)
## Series: tbts_men 
## ARIMA(0,0,0)(2,1,0)[6] with drift 
## 
## Coefficients:
##         sar1     sar2     drift
##       -0.761  -0.6098  128.8569
## s.e.   0.158   0.1435   16.1691
## 
## sigma^2 estimated as 1163963:  log likelihood=-254.04
## AIC=516.08   AICc=517.68   BIC=521.69
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE    MAPE     MASE      ACF1
## Training set -25.68999 934.3299 573.1137 -0.1985635 1.37638 0.434177 0.2314468

checkresiduals(arimamodel_men)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(2,1,0)[6] with drift
## Q* = 3.9755, df = 4, p-value = 0.4093
## 
## Model df: 3.   Total lags used: 7

ggseasonplot(tbts_men)

For women, the auto.arima is not able to pick up any SAR. The best fit is according to auto.arima is a constant drift.

arimamodel_women <- auto.arima(tbts_women)

summary(arimamodel_women)
## Series: tbts_women 
## ARIMA(0,0,0)(0,1,0)[5] with drift 
## 
## Coefficients:
##          drift
##       188.0000
## s.e.   43.5542
## 
## sigma^2 estimated as 1235313:  log likelihood=-210.3
## AIC=424.59   AICc=425.14   BIC=427.03
## 
## Training set error measures:
##                    ME    RMSE     MAE         MPE     MAPE      MASE
## Training set 6.362663 994.108 699.696 -0.07692376 1.730501 0.6551461
##                     ACF1
## Training set -0.07256362

checkresiduals(arimamodel_women)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(0,1,0)[5] with drift
## Q* = 5.4793, df = 5, p-value = 0.3602
## 
## Model df: 1.   Total lags used: 6

ggseasonplot(tbts_women)

An AR(p) model assumes that a time series Yt can be modelled by a linear function of the first p of its lagged values. Let’s first start to model a seasonal SAR(1) model with the dynlm package. Each year the salaries increase by a fixed amount and a part that is relative to the salary size. I will use the NeweyWest function from the Sandwich package throughout this post to get heteroskedasticity- and autocorrelation-consistent error estimates.

dynmodel_men <- dynlm(ts(salary) ~ L(ts(salary), 6), data = tb_men)

assess_model(dynmodel_men, time(tbts_men)[7:36], tb_men[7:36,], no_n = TRUE) 
## 
## Time series regression with "ts" data:
## Start = 7, End = 36
## 
## Call:
## dynlm(formula = ts(salary) ~ L(ts(salary), 6), data = tb_men)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5517.7  -371.7    48.9   418.0  2600.3 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.339e+02  1.575e+03   0.275    0.785    
## L(ts(salary), 6) 1.009e+00  3.608e-02  27.980   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1522 on 28 degrees of freedom
## Multiple R-squared:  0.9655, Adjusted R-squared:  0.9642 
## F-statistic: 782.9 on 1 and 28 DF,  p-value: < 2.2e-16
## 
## 
## t test of coefficients:
## 
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.3389e+02 1.8030e+03  0.2406   0.8116    
## L(ts(salary), 6) 1.0094e+00 4.3076e-02 23.4340   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals
## LM test = 9.8199, df = 6, p-value = 0.1324

dynmodel_women <- dynlm(ts(salary) ~ L(ts(salary), 5), data = tb_women)

assess_model(dynmodel_women, time(tbts_women)[6:30], tb_women[6:30,], no_n = TRUE)
## 
## Time series regression with "ts" data:
## Start = 6, End = 30
## 
## Call:
## dynlm(formula = ts(salary) ~ L(ts(salary), 5), data = tb_women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2266.8  -683.0  -148.9   501.2  3157.1 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.323e+02  1.316e+03   0.101    0.921    
## L(ts(salary), 5) 1.020e+00  3.206e-02  31.816   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1126 on 23 degrees of freedom
## Multiple R-squared:  0.9778, Adjusted R-squared:  0.9768 
## F-statistic:  1012 on 1 and 23 DF,  p-value: < 2.2e-16
## 
## 
## t test of coefficients:
## 
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.3234e+02 1.1458e+03  0.1155   0.9091    
## L(ts(salary), 5) 1.0200e+00 2.9112e-02 35.0356   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  Breusch-Godfrey test for serial correlation of order up to 5
## 
## data:  Residuals
## LM test = 4.8683, df = 5, p-value = 0.4322

Now also add weights according to the number of engineers in the different strata. Note that the dynamic approach uses the information from the first year to predict the second. Weights from the first year have to be excluded. The fixed amount has decreased from 434 to 134 SEK and the relative part has increased from 0.94 % to 1.6 %. The fixed part is not statistically significant in either of these two models.

dynmodel_men <- dynlm(ts(salary) ~ L(ts(salary), 6), data = tb_men, weights = n[7:36])

assess_model(dynmodel_men, time(tbts_men)[7:36], tb_men[7:36,])
## 
## Time series regression with "ts" data:
## Start = 7, End = 36
## 
## Call:
## dynlm(formula = ts(salary) ~ L(ts(salary), 6), data = tb_men, 
##     weights = n[7:36])
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -164527  -12301       0   48694  129992 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.343e+02  1.083e+03   0.124    0.902    
## L(ts(salary), 6) 1.016e+00  2.398e-02  42.392   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75460 on 21 degrees of freedom
## Multiple R-squared:  0.9884, Adjusted R-squared:  0.9879 
## F-statistic:  1797 on 1 and 21 DF,  p-value: < 2.2e-16
## 
## 
## t test of coefficients:
## 
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      134.30571  856.80707  0.1568   0.8769    
## L(ts(salary), 6)   1.01643    0.01883 53.9780   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals
## LM test = 9.8199, df = 6, p-value = 0.1324

dynmodel_women <- dynlm(ts(salary) ~ L(ts(salary), 5), data = tb_women, weights = n[6:30])

assess_model(dynmodel_women, time(tbts_women)[6:30], tb_women[6:30,])
## 
## Time series regression with "ts" data:
## Start = 6, End = 30
## 
## Call:
## dynlm(formula = ts(salary) ~ L(ts(salary), 5), data = tb_women, 
##     weights = n[6:30])
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -106359  -30874       0   33627  144325 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -735.01210 1575.94230  -0.466    0.646    
## L(ts(salary), 5)    1.03745    0.03693  28.089   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60090 on 21 degrees of freedom
## Multiple R-squared:  0.9741, Adjusted R-squared:  0.9728 
## F-statistic:   789 on 1 and 21 DF,  p-value: < 2.2e-16
## 
## 
## t test of coefficients:
## 
##                     Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)      -735.012099  708.461714 -1.0375   0.3113    
## L(ts(salary), 5)    1.037452    0.014893 69.6602   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  Breusch-Godfrey test for serial correlation of order up to 5
## 
## data:  Residuals
## LM test = 4.8683, df = 5, p-value = 0.4322

Let’s drop the non-significant intercept. The relative salary raise increases to 1,94 % per year for men and 2,03 % for women.

dynmodel_men <- dynlm(ts(salary) ~ L(ts(salary), 6) - 1, data = tb_men, weights = n[7:36])

assess_model(dynmodel_men, time(tbts_men)[7:36], tb_men[7:36,])
## 
## Time series regression with "ts" data:
## Start = 7, End = 36
## 
## Call:
## dynlm(formula = ts(salary) ~ L(ts(salary), 6) - 1, data = tb_men, 
##     weights = n[7:36])
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -165299   -8902       0   47573  133181 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## L(ts(salary), 6) 1.019380   0.002739   372.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73750 on 22 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998 
## F-statistic: 1.385e+05 on 1 and 22 DF,  p-value: < 2.2e-16
## 
## 
## t test of coefficients:
## 
##                   Estimate Std. Error t value  Pr(>|t|)    
## L(ts(salary), 6) 1.0193798  0.0020172  505.35 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals
## LM test = 9.9529, df = 6, p-value = 0.1266

dynmodel_women <- dynlm(ts(salary) ~ L(ts(salary),5) - 1, data = tb_women, weights = n[6:30])

assess_model(dynmodel_women, time(tbts_women)[6:30], tb_women[6:30,])
## 
## Time series regression with "ts" data:
## Start = 6, End = 30
## 
## Call:
## dynlm(formula = ts(salary) ~ L(ts(salary), 5) - 1, data = tb_women, 
##     weights = n[6:30])
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -103209  -32580       0   36088  146345 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## L(ts(salary), 5) 1.020343   0.004217   241.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59020 on 22 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 5.853e+04 on 1 and 22 DF,  p-value: < 2.2e-16
## 
## 
## t test of coefficients:
## 
##                   Estimate Std. Error t value  Pr(>|t|)    
## L(ts(salary), 5) 1.0203429  0.0035569  286.86 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  Breusch-Godfrey test for serial correlation of order up to 5
## 
## data:  Residuals
## LM test = 4.9459, df = 5, p-value = 0.4225

Now, let’s compare with a linear model. The relative salary raise increases to 1,92 % per year for men and 2.06 % for women.

model_men <- lm(log(salary) ~ year_n + age_n + I(age_n^2), data = tb_men, weights = n)

assess_model(model_men, time(tbts_men), tb_men, doexp = TRUE)
## 
## Call:
## lm(formula = log(salary) ~ year_n + age_n + I(age_n^2), data = tb_men, 
##     weights = n)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5386 -0.1234  0.0000  0.7387  2.7634 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.925e+01  3.086e+00  -9.479 2.08e-09 ***
## year_n       1.917e-02  1.530e-03  12.534 9.23e-12 ***
## age_n        5.160e-02  2.286e-03  22.569  < 2e-16 ***
## I(age_n^2)  -4.670e-04  2.581e-05 -18.094 4.24e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.694 on 23 degrees of freedom
## Multiple R-squared:  0.9898, Adjusted R-squared:  0.9885 
## F-statistic: 746.4 on 3 and 23 DF,  p-value: < 2.2e-16
## 
## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept) -2.9252e+01  1.8119e+00 -16.145 4.852e-14 ***
## year_n       1.9171e-02  9.0239e-04  21.244 < 2.2e-16 ***
## age_n        5.1595e-02  1.6460e-03  31.347 < 2.2e-16 ***
## I(age_n^2)  -4.6705e-04  1.7897e-05 -26.097 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  Breusch-Godfrey test for serial correlation of order up to 7
## 
## data:  Residuals
## LM test = 6.7232, df = 7, p-value = 0.4583
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

model_women <- lm(log(salary) ~ year_n + age_n + I(age_n^2), data = tb_women, weights = n)

assess_model(model_women, time(tbts_women), tb_women, doexp = TRUE)
## 
## Call:
## lm(formula = log(salary) ~ year_n + age_n + I(age_n^2), data = tb_women, 
##     weights = n)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.31829 -0.34720  0.05136  0.89951  2.33846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.228e+01  4.347e+00  -7.427 1.50e-07 ***
## year_n       2.063e-02  2.154e-03   9.580 1.71e-09 ***
## age_n        5.521e-02  3.330e-03  16.577 2.77e-14 ***
## I(age_n^2)  -5.202e-04  3.912e-05 -13.298 2.77e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.288 on 23 degrees of freedom
## Multiple R-squared:  0.9798, Adjusted R-squared:  0.9771 
## F-statistic: 371.4 on 3 and 23 DF,  p-value: < 2.2e-16
## 
## 
## t test of coefficients:
## 
##                Estimate  Std. Error  t value  Pr(>|t|)    
## (Intercept) -3.2285e+01  3.9562e+00  -8.1605 3.045e-08 ***
## year_n       2.0631e-02  1.9667e-03  10.4903 3.075e-10 ***
## age_n        5.5206e-02  2.7439e-03  20.1196 4.246e-16 ***
## I(age_n^2)  -5.2025e-04  3.3832e-05 -15.3777 1.357e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  Breusch-Godfrey test for serial correlation of order up to 7
## 
## data:  Residuals
## LM test = 11.678, df = 7, p-value = 0.1117
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

Now also add the SAR(2). The summary shows that the R-squared bumps up a few notches for men, although it does not show that the second year lag is significant. However, the sandwich package assures us that the SAR(2) process is significant at the 95 % level. For women, the second year lag is not significant in the summary nor the HAC error estimate.

dynmodel_men <- dynlm(ts(salary) ~ L(ts(salary), 6) + L(ts(salary), 12) - 1, data = tb_men, weights = n[13:36])

assess_model(dynmodel_men, time(tbts_men)[13:36], tb_men[13:36,]) 
## 
## Time series regression with "ts" data:
## Start = 13, End = 36
## 
## Call:
## dynlm(formula = ts(salary) ~ L(ts(salary), 6) + L(ts(salary), 
##     12) - 1, data = tb_men, weights = n[13:36])
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -127942  -17989       0   42788  123721 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## L(ts(salary), 6)    0.7282     0.2371   3.071  0.00692 **
## L(ts(salary), 12)   0.2985     0.2417   1.235  0.23356   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69840 on 17 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 6.4e+04 on 2 and 17 DF,  p-value: < 2.2e-16
## 
## 
## t test of coefficients:
## 
##                   Estimate Std. Error t value  Pr(>|t|)    
## L(ts(salary), 6)   0.72822    0.11497  6.3338 7.484e-06 ***
## L(ts(salary), 12)  0.29855    0.11741  2.5427   0.02102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##  Breusch-Godfrey test for serial correlation of order up to 5
## 
## data:  Residuals
## LM test = 3.9255, df = 5, p-value = 0.5602

dynmodel_women <- dynlm(ts(salary) ~ L(ts(salary), 5) + L(ts(salary), 10) - 1, data = tb_women, weights = n[11:30])

summary (dynmodel_women)
## 
## Time series regression with "ts" data:
## Start = 11, End = 30
## 
## Call:
## dynlm(formula = ts(salary) ~ L(ts(salary), 5) + L(ts(salary), 
##     10) - 1, data = tb_women, weights = n[11:30])
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -110096  -37862   -1787   33549  134747 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## L(ts(salary), 5)   0.96703    0.23193   4.169 0.000723 ***
## L(ts(salary), 10)  0.05749    0.23683   0.243 0.811298    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63340 on 16 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 2.14e+04 on 2 and 16 DF,  p-value: < 2.2e-16

coeftest(dynmodel_women, vcov = NeweyWest, prewhite = F, adjust = T)
## 
## t test of coefficients:
## 
##                   Estimate Std. Error t value  Pr(>|t|)    
## L(ts(salary), 5)  0.967032   0.122156  7.9164 6.354e-07 ***
## L(ts(salary), 10) 0.057485   0.125308  0.4588    0.6526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To leave a comment for the author, please follow the link and comment on their blog: R Analystatistics Sweden .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)