
Time Series Forecasting Lab (Part 2) – Feature Engineering with Recipes

[This article was first published on Foundations, and kindly contributed to R-bloggers.]

Cover photo by Kristine Tumanyan on Unsplash


This is the second of a series of 6 articles about time series forecasting with panel data and ensemble stacking with R. Click here for Part 1.

Through these articles I will be putting into practice what I have learned from the Business Science University training course "DS4B 203-R: High-Performance Time Series Forecasting" [1], delivered by Matt Dancho. If you are looking to gain a high level of expertise in time series with R, I strongly recommend this course.

The objective of this article is to explain how to perform time series preprocessing and feature engineering with timetk, which extends recipes for time series.

I will explain what can and cannot be done with recipes for time series panel data.

Introduction

Since the ultimate goal of this series of articles is to perform time series forecasting with machine learning, maximizing model performance will require a lot of experimentation with different feature-engineered datasets.

Feature engineering is the most critical part of time series analysis and with recipes you can “use dplyr-like pipeable sequences of feature engineering steps to get your data ready for modeling”.

Prerequisites

I assume you are already familiar with the following topics, packages and terms:

Packages

The following packages must be loaded:

# PART 1 - FEATURE ENGINEERING
library(tidyverse)  # loading dplyr, tibble, ggplot2, .. dependencies
library(timetk)  # using timetk plotting, diagnostics and augment operations
library(tsibble)  # for month to Date conversion
library(tsibbledata)  # for aus_retail dataset
library(fastDummies)  # for dummyfying categorical variables
library(skimr) # for quick statistics

# PART 2 -  FEATURE ENGINEERING WITH RECIPES
library(tidymodels)

End-to-end FE process with recipe

Below is a summary diagram of the end-to-end FE process that will be implemented in this article. Notice that the recipe creation will be the last step of the process. The whole process is described hereafter:

  1. Perform measure transformations (here, on Turnover).

  2. Add a future frame with the prediction length.

  3. Add Fourier, lag, and rolling lag features. Add a rowid column.

  4. Perform a first split into prepared and future datasets.

  5. Apply the training and test split on the prepared dataset.

  6. Create the recipe based upon the training dataset of the split.

  7. Add steps to the recipe, such as calendar-based features and other transformations and cleaning.

Recipe limitation for panel data

Generally, steps 1., 2., and 3. could be performed as preprocessing within a recipe by simply adding the corresponding step to it.

The problem is that we are working with panel data, so it is not possible to include 1., 2., and 3. as steps within a recipe. We must first perform these "steps" per group (by Industry), and then bind all groups into a single tibble again in order to do 4., 5., 6. and 7.

Note: the NA values in the prepared dataset correspond to the (unknown) Turnover predictions within the future frame.
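The per-group-then-bind pattern can be sketched in base R on a toy panel (the group and value columns below are illustrative, not from the dataset; the real code later uses filter()/lapply()/bind_rows() for the same purpose):

```r
# Toy panel: two groups, each its own time series
panel <- data.frame(
  group = rep(c("A", "B"), each = 3),
  value = c(1, 2, 3, 10, 20, 30)
)

# Apply a per-group transformation (here: centering), then bind back together
groups <- lapply(split(panel, panel$group), function(g) {
  g$value_centered <- g$value - mean(g$value)
  g
})
panel_fe <- do.call(rbind, c(groups, make.row.names = FALSE))

# Each group is centered on its own mean, as required for panel data
print(panel_fe)
```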

Our data

Let us recall our time series dataset from Part 1: the Australian retail trade turnover dataset tsibble::aus_retail filtered by the "Australian Capital Territory" State value.

> aus_retail_tbl <- aus_retail %>%
     tk_tbl()

> monthly_retail_tbl <- aus_retail_tbl %>%
     filter(State == "Australian Capital Territory") %>%
     mutate(Month = as.Date(Month)) %>%
     mutate(Industry = as_factor(Industry)) %>%
     select(Month, Industry, Turnover)

> monthly_retail_tbl %>% glimpse()
Rows: 8,820
Columns: 3
$ Month    <date> 1982-04-01, 1982-05-01, 1982-06-01, 1982-07-01, 1982-08-01, 1982-09-01, 1982-10-01, 1982-11-01, 1982-12-01, 1983-01-01, 1983-02-01, 1983-03-01, 1983-04-01, 1983-05-01,~
$ Industry <fct> "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, restaurants and catering ser~
$ Turnover <dbl> 4.4, 3.4, 3.6, 4.0, 3.6, 4.2, 4.8, 5.4, 6.9, 3.8, 4.2, 4.0, 4.4, 4.3, 4.3, 4.6, 4.9, 4.5, 5.5, 6.7, 9.6, 3.8, 3.8, 4.4, 3.8, 3.9, 4.1, 4.3, 4.8, 5.0, 5.3, 5.6, 6.8, 4.6~

The code snippet below performs steps (not recipe steps!) 1., 2., and 3. as explained above. You can always refer to Part 1 for explanations of the function arguments.

We will generate a list (groups) of 20 tibbles, each corresponding to a specific industry. In addition to steps 1., 2. and 3., I have also arranged the timestamp (Month) in ascending order for each Industry group, as required by the lag, rolling lag and Fourier computations.

We add a future frame spanning 2019-01-01 to 2019-12-01 (a 12-month length), which holds NA values for the measure (Turnover) but to which the Fourier, lag and rolling lag features will also be added.

Finally, we bind them together to produce the first feature engineered tibble groups_fe_tbl.

# Industries holds the 20 Industry values (as defined in Part 1)
Industries <- unique(aus_retail_tbl$Industry)

groups <- lapply(X = 1:length(Industries), FUN = function(x){

  monthly_retail_tbl %>%
    filter(Industry == Industries[x]) %>%
    arrange(Month) %>%
    mutate(Turnover =  log1p(x = Turnover)) %>%
    mutate(Turnover =  standardize_vec(Turnover)) %>%
    future_frame(Month, .length_out = "12 months", .bind_data = TRUE) %>%
    mutate(Industry = Industries[x]) %>%
    tk_augment_fourier(.date_var = Month, .periods = 12, .K = 1) %>%
    tk_augment_lags(.value = Turnover, .lags = 12) %>%
    tk_augment_slidify(.value   = Turnover_lag12,
                       .f       = ~ mean(.x, na.rm = TRUE), 
                       .period  = c(3, 6, 9, 12),
                       .partial = TRUE,
                       .align   = "center")
})

groups_fe_tbl <- bind_rows(groups) %>%
  rowid_to_column(var = "rowid")
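To make the mechanics above concrete, here is a base-R sketch of what future_frame(), tk_augment_lags() and tk_augment_slidify() compute, on a toy 6-month series with a 3-month horizon and lag (a simplification: the rolling mean below is trailing, whereas .align = "center" centers the window):

```r
# Toy monthly series
months   <- seq(as.Date("2018-01-01"), by = "month", length.out = 6)
turnover <- c(4.4, 3.4, 3.6, 4.0, 3.6, 4.2)

# Future frame: extend the calendar 3 months ahead with NA turnover
future_months <- seq(max(months), by = "month", length.out = 4)[-1]
all_months    <- c(months, future_months)
all_turnover  <- c(turnover, rep(NA_real_, 3))

# Lag feature: shift the series down by the lag length (like .lags = 12)
lag3 <- c(rep(NA_real_, 3), head(all_turnover, -3))

# Rolling lag: a 3-month mean over the lagged series (like .period = 3)
roll3 <- sapply(seq_along(lag3), function(i) {
  mean(lag3[max(1, i - 2):i], na.rm = TRUE)
})

# The lag and rolling lag are fully known inside the future frame,
# because they only reach back into observed history
data.frame(Month = all_months, Turnover = all_turnover, lag3, roll3)
```

This is why the future frame rows carry non-NA feature values in the glimpse below even though Turnover itself is NA.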

Below I display the last 13 months of groups_fe_tbl: you should see the last historical row (2018-12-01) and the future frame values for the last industry (the situation is, obviously, the same for the other industries within the same tibble):

> groups_fe_tbl %>% tail(n=13) %>% glimpse()
Rows: 13
Columns: 16
$ rowid                  <int> 9048, 9049, 9050, 9051, 9052, 9053, 9054, 9055, 9056, 9057, 9058, 9059, 9060
$ Month                  <date> 2018-12-01, 2019-01-01, 2019-02-01, 2019-03-01, 2019-04-01, 2019-05-01, 2019-06-01, 2019-07-01, 20~
$ Industry               <chr> "Takeaway food services", "Takeaway food services", "Takeaway food services", "Takeaway food serv~
$ Turnover               <dbl> 2.078015, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ Month_sin12_K1         <dbl> 0.16810089, 0.63846454, 0.93775213, 0.99301874, 0.80100109, 0.40982032, -0.10116832, -0.57126822, ~
$ Month_cos12_K1         <dbl> 0.98576980, 0.76965124, 0.34730525, -0.11795672, -0.59866288, -0.91216627, -0.99486932, -0.8207634~
$ Turnover_lag12         <dbl> 1.668977, 1.397316, 1.542003, 1.894938, 1.894938, 1.894938, 1.878972, 1.886969, 1.957669, 1.926527~
$ Turnover_lag13         <dbl> 1.755391, 1.668977, 1.397316, 1.542003, 1.894938, 1.894938, 1.894938, 1.878972, 1.886969, 1.957669~
$ Turnover_lag12_roll_3  <dbl> 1.607228, 1.536098, 1.611419, 1.777293, 1.894938, 1.889616, 1.886960, 1.907870, 1.923722, 1.972616~
$ Turnover_lag13_roll_3  <dbl> 1.723756, 1.607228, 1.536098, 1.611419, 1.777293, 1.894938, 1.889616, 1.886960, 1.907870, 1.923722~
$ Turnover_lag12_roll_6  <dbl> 1.667587, 1.692260, 1.715518, 1.750517, 1.832126, 1.901404, 1.906669, 1.929788, 1.941529, 1.974703~
$ Turnover_lag13_roll_6  <dbl> 1.668911, 1.667587, 1.692260, 1.715518, 1.750517, 1.832126, 1.901404, 1.906669, 1.929788, 1.941529~
$ Turnover_lag12_roll_9  <dbl> 1.750363, 1.744253, 1.741597, 1.757160, 1.779636, 1.808252, 1.878956, 1.925999, 1.946341, 1.952766~
$ Turnover_lag13_roll_9  <dbl> 1.748589, 1.750363, 1.744253, 1.741597, 1.757160, 1.779636, 1.808252, 1.878956, 1.925999, 1.929881~
$ Turnover_lag12_roll_12 <dbl> 1.783846, 1.784512, 1.785157, 1.787128, 1.811024, 1.828524, 1.862610, 1.904910, 1.941200, 1.946341~
$ Turnover_lag13_roll_12 <dbl> 1.766346, 1.783846, 1.784512, 1.785157, 1.787128, 1.811024, 1.828524, 1.843028, 1.887599, 1.925999~

Save standardization parameters

Since we are working with panel data (there are 20 time series, one for each Industry), there will be 20 sets of standardization parameters (mean, standard deviation) generated by standardize_vec(Turnover).

The values of these parameters must be saved, and so far the only way to do this is manually, which is indeed a limitation. Consequently, I had to save the means and standard deviations in the vectors std_mean and std_sd respectively.
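A way around the copy-paste is to recompute the parameters yourself: the values printed by standardize_vec() are simply the per-group mean and standard deviation of log1p(Turnover). A minimal base-R sketch on a toy panel (column names are illustrative):

```r
# Toy panel standing in for monthly_retail_tbl
toy <- data.frame(
  Industry = rep(c("A", "B"), each = 4),
  Turnover = c(4.4, 3.4, 3.6, 4.0, 10, 12, 11, 13)
)

# Per-group mean and sd of the log-transformed measure, in level order
log_t    <- log1p(toy$Turnover)
std_mean <- as.numeric(tapply(log_t, toy$Industry, mean))
std_sd   <- as.numeric(tapply(log_t, toy$Industry, sd))

# Standardizing with these parameters gives each group mean 0 and sd 1
grp <- as.integer(factor(toy$Industry))
z   <- (log_t - std_mean[grp]) / std_sd[grp]
```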

# 20 standardization parameters
Standardization Parameters
mean: 2.88109110996285
standard deviation: 0.594036003334063
Standardization Parameters
mean: 3.35376372924979
standard deviation: 0.547454581447732
Standardization Parameters
mean: 2.45800504638277
standard deviation: 0.531250546670929
Standardization Parameters
...
...
Standardization Parameters
mean: 2.46138105124506
standard deviation: 0.45386783117798

# Measure standardization parameters for inversion
std_mean <- c(2.88109110996285, 3.35376372924979, 2.45800504638277, 2.90694243034087,
               3.18507858440263, 2.83400545514364, 4.35243537687952, 2.03664732406185,
               2.64840690567097, 2.41927121148872, 3.70089063463501, 1.81454274237816,
               2.00305556153117, 1.65085776430947, 3.30159231490985, 2.09573352259677,
               1.97057625538218, 2.20889632513084, 4.19733019025565, 2.46138105124506)

std_sd <- c(0.594036003334063, 0.547454581447732, 0.531250546670929, 0.515550971472541,
             0.371902717442964, 0.683006197969896, 0.744552472892707, 0.429122786330909,
             0.550558802233277, 0.675900224536041, 0.657978689289912, 0.630475051501078,
             0.379777529493197, 0.516772299837197, 0.515586913143017, 0.464773051795071,
             0.58058847542517,  0.584886288114511, 0.753013481976128, 0.45386783117798)
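These vectors will be needed in Part 6 to invert the predictions back to the original scale. timetk provides standardize_inv_vec() for the standardization, and expm1() undoes the log1p(); the arithmetic, sketched here with the first industry's saved parameters:

```r
# Forward transformations, as applied during feature engineering
turnover <- c(4.4, 3.4, 3.6)
m <- 2.88109110996285   # std_mean[1]
s <- 0.594036003334063  # std_sd[1]
z <- (log1p(turnover) - m) / s

# Inverse: undo the standardization, then undo the log1p
turnover_back <- expm1(z * s + m)

all.equal(turnover_back, turnover)  # TRUE
```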

Prepared and future datasets

We perform the split of groups_fe_tbl into prepared and future datasets .

The prepared dataset holds the historical values of the measure (here, Turnover) and must not contain NA values. Since the future frame introduced NA values into Turnover, we apply filter(!is.na(Turnover)) to keep only the historical rows. We also drop all rows with NA values elsewhere, e.g., those introduced by the lag and rolling lag features.

Finally, we obtain an 8,560-row tibble (data_prepared_tbl) from a 9,060-row one (groups_fe_tbl). You can run groups_fe_tbl %>% group_by(Industry) %>% skim() to check the total number of missing values (NA and NaN) that were removed.

The prepared dataset starts from 1983-05-01.

> data_prepared_tbl <- groups_fe_tbl %>%
    filter(!is.na(Turnover)) %>%
    drop_na()

> data_prepared_tbl %>% glimpse()
Rows: 8,560
Columns: 16
$ rowid                  <int> 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38~
$ Month                  <date> 1983-05-01, 1983-06-01, 1983-07-01, 1983-08-01, 1983-09-01, 1983-10-01, 1983-11-01, 1983-12-01, 1~
$ Industry               <chr> "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, re~
$ Turnover               <dbl> -2.0426107, -2.0426107, -1.9499231, -1.8620736, -1.9802554, -1.6990366, -1.4138382, -0.8757670, -2~
$ Month_sin12_K1         <dbl> 0.51455540, 0.87434662, 1.00000000, 0.86602540, 0.50000000, 0.01688948, -0.48530196, -0.84864426, ~
$ Month_cos12_K1         <dbl> 8.574571e-01, 4.853020e-01, 1.567970e-14, -5.000000e-01, -8.660254e-01, -9.998574e-01, -8.743466e-~
$ Turnover_lag12         <dbl> -2.355895, -2.281065, -2.140701, -2.281065, -2.074676, -1.890850, -1.725136, -1.370672, -2.209420,~
$ Turnover_lag13         <dbl> -2.011144, -2.355895, -2.281065, -2.140701, -2.281065, -2.074676, -1.890850, -1.725136, -1.370672,~
$ Turnover_lag12_roll_3  <dbl> -2.216035, -2.259220, -2.234277, -2.165481, -2.082197, -1.896888, -1.662219, -1.768409, -1.884923,~
$ Turnover_lag13_roll_3  <dbl> -2.183520, -2.216035, -2.259220, -2.234277, -2.165481, -2.082197, -1.896888, -1.662219, -1.768409,~
$ Turnover_lag12_roll_6  <dbl> -2.213974, -2.190758, -2.170709, -2.065582, -1.913850, -1.925303, -1.890905, -1.901909, -1.921958,~
$ Turnover_lag13_roll_6  <dbl> -2.197201, -2.213974, -2.190758, -2.170709, -2.065582, -1.913850, -1.925303, -1.890905, -1.901909,~
$ Turnover_lag12_roll_9  <dbl> -2.190758, -2.147914, -2.095067, -2.014578, -2.036609, -2.005362, -1.989766, -1.975371, -1.948876,~
$ Turnover_lag13_roll_9  <dbl> -2.213974, -2.190758, -2.147914, -2.095067, -2.014578, -2.036609, -2.005362, -1.989766, -1.975371,~
$ Turnover_lag12_roll_12 <dbl> -2.095067, -2.014578, -2.034063, -2.037755, -2.046334, -2.046334, -2.020226, -2.000355, -1.984457,~
$ Turnover_lag13_roll_12 <dbl> -2.147914, -2.095067, -2.014578, -2.034063, -2.037755, -2.046334, -2.046334, -2.020226, -2.000355,~

The future dataset holds the 12 future (and unknown) values of Turnover per industry, i.e., the NA values introduced by the future frame. We will predict and fill in these values in Part 6 with the final stacking ensemble model.

> future_tbl <- groups_fe_tbl %>%
    filter(is.na(Turnover))

> future_tbl %>% head(n=12) %>% glimpse()
Rows: 12
Columns: 16
$ rowid                  <int> 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453
$ Month                  <date> 2019-01-01, 2019-02-01, 2019-03-01, 2019-04-01, 2019-05-01, 2019-06-01, 2019-07-01, 2019-08-01, 20~
$ Industry               <chr> "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, r~
$ Turnover               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ Month_sin12_K1         <dbl> 0.63846454, 0.93775213, 0.99301874, 0.80100109, 0.40982032, -0.10116832, -0.57126822, -0.90511451,~
$ Month_cos12_K1         <dbl> 0.76965124, 0.34730525, -0.11795672, -0.59866288, -0.91216627, -0.99486932, -0.82076344, -0.425167~
$ Turnover_lag12         <dbl> 1.125397, 1.282320, 1.569292, 1.425855, 1.437951, 1.429897, 1.550608, 1.469789, 1.457920, 1.469789~
$ Turnover_lag13         <dbl> 1.384894, 1.125397, 1.282320, 1.569292, 1.425855, 1.437951, 1.429897, 1.550608, 1.469789, 1.457920~
$ Turnover_lag12_roll_3  <dbl> 1.264204, 1.325670, 1.425822, 1.477699, 1.431234, 1.472819, 1.483431, 1.492773, 1.465833, 1.495272~
$ Turnover_lag13_roll_3  <dbl> 1.316081, 1.264204, 1.325670, 1.425822, 1.477699, 1.431234, 1.472819, 1.483431, 1.492773, 1.465833~
$ Turnover_lag12_roll_6  <dbl> 1.370952, 1.370952, 1.378452, 1.449320, 1.480565, 1.462003, 1.469326, 1.489352, 1.490694, 1.478711~
$ Turnover_lag13_roll_6  <dbl> 1.378274, 1.370952, 1.370952, 1.378452, 1.449320, 1.480565, 1.462003, 1.469326, 1.489352, 1.501243~
$ Turnover_lag12_roll_9  <dbl> 1.411416, 1.395927, 1.404907, 1.408445, 1.416559, 1.454825, 1.485468, 1.470874, 1.476502, 1.482009~
$ Turnover_lag13_roll_9  <dbl> 1.420563, 1.411416, 1.395927, 1.404907, 1.408445, 1.416559, 1.454825, 1.485468, 1.474990, 1.482009~
$ Turnover_lag12_roll_12 <dbl> 1.433627, 1.429420, 1.420139, 1.420139, 1.430152, 1.434573, 1.462680, 1.480716, 1.470874, 1.476502~
$ Turnover_lag13_roll_12 <dbl> 1.429496, 1.433627, 1.429420, 1.420139, 1.420139, 1.430152, 1.434266, 1.465153, 1.485468, 1.474990~

Train and test datasets split

We perform the split with the timetk::time_series_split() function, which returns a split object (an R list) that can be further used for plotting both the training and the test datasets.

One important argument is assess: it determines the length of the test dataset. As per [2], "The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required." Since there are 428 observations (months) per industry, a 50-month test set is roughly 12% of the sample and well beyond our 12-month forecast horizon.

cumulative = TRUE means that the training dataset extends back to the beginning (first row) of the prepared dataset.
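What time_series_split() does with these arguments can be sketched in base R: pick a cutoff `assess` months before the end of the series, put everything up to the cutoff in training (that is cumulative = TRUE) and the remainder in the test set (a toy 6-month series with a 2-month assess period stands in for our 428 months and "50 months"):

```r
months <- seq(as.Date("2018-01-01"), by = "month", length.out = 6)
assess <- 2                          # stand-in for assess = "50 months"
cutoff <- months[length(months) - assess]

train_idx <- months <= cutoff        # cumulative = TRUE: all earlier history
test_idx  <- months >  cutoff        # the last `assess` months

sum(train_idx)  # 4
sum(test_idx)   # 2
```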

> splits <- data_prepared_tbl %>%
    time_series_split(Month, 
                      assess = "50 months", 
                      cumulative = TRUE)

# split object
> splits
<Analysis/Assess/Total>
<7560/1000/8560>

This means 7,560 rows for the training (analysis) dataset and 1,000 rows for the test (assessment) dataset.

You can confirm the size of the training and test datasets with the following code snippet, using the tk_time_series_cv_plan() function, which creates a new tibble with two new columns, .id and .key:

> splits %>%
   tk_time_series_cv_plan() %>% glimpse()
Rows: 8,560
Columns: 18
$ .id                    <chr> "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1"~
$ .key                   <fct> training, training, training, training, training, training, training, training, training, training~
$ rowid                  <int> 14, 467, 920, 1373, 1826, 2279, 2732, 3185, 3638, 4091, 4544, 4997, 5450, 5903, 6356, 6809, 7262, ~
$ Month                  <date> 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1~
$ Industry               <chr> "Cafes, restaurants and catering services", "Cafes, restaurants and takeaway food services", "Clot~
$ Turnover               <dbl> -2.042611, -1.866127, -1.383977, -1.533674, -1.772910, -1.575604, -1.941481, -1.635071, -1.999504,~
$ Month_sin12_K1         <dbl> 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554,~
$ Month_cos12_K1         <dbl> 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, ~
$ Turnover_lag12         <dbl> -2.355895, -2.397538, -1.674143, -1.891997, -1.973832, -1.550576, -2.113506, -2.109525, -2.292421,~
$ Turnover_lag13         <dbl> -2.011144, -2.195619, -1.713772, -1.891997, -2.044287, -1.680229, -2.080545, -2.035540, -2.434025,~
$ Turnover_lag12_roll_3  <dbl> -2.216035, -2.299396, -1.771137, -1.971674, -2.053104, -1.602136, -2.091532, -2.110334, -2.420468,~
$ Turnover_lag13_roll_3  <dbl> -2.183520, -2.296579, -1.693957, -1.891997, -2.009059, -1.615402, -2.097026, -2.072532, -2.363223,~
$ Turnover_lag12_roll_6  <dbl> -2.213974, -2.275130, -1.824439, -2.023204, -2.202758, -1.587032, -2.072703, -2.125292, -2.406516,~
$ Turnover_lag13_roll_6  <dbl> -2.197201, -2.278793, -1.787835, -1.988232, -2.146638, -1.577044, -2.076792, -2.110131, -2.411748,~
$ Turnover_lag12_roll_9  <dbl> -2.190758, -2.244673, -1.826687, -2.015706, -2.222243, -1.607126, -2.067327, -2.110334, -2.419395,~
$ Turnover_lag13_roll_9  <dbl> -2.213974, -2.275130, -1.824439, -2.023204, -2.202758, -1.587032, -2.072703, -2.125292, -2.406516,~
$ Turnover_lag12_roll_12 <dbl> -2.095067, -2.152525, -1.819038, -1.978000, -2.167591, -1.593199, -2.043333, -2.048362, -2.304961,~
$ Turnover_lag13_roll_12 <dbl> -2.147914, -2.200931, -1.828293, -2.002079, -2.217778, -1.606260, -2.056832, -2.089405, -2.359411,~

> splits %>%
   tk_time_series_cv_plan() %>% select(.id) %>% table()

Slice1 
  8560 

> splits %>%
   tk_time_series_cv_plan() %>% select(.key) %>% table()

training  testing 
    7560     1000

Let us plot the training and test datasets for one Industry value with the plot_time_series_cv_plan() function:

> Industries <- unique(aus_retail_tbl$Industry)
> splits %>%
  tk_time_series_cv_plan() %>%
  filter(Industry == Industries[1]) %>%
  plot_time_series_cv_plan(.date_var = Month, 
                           .value = Turnover, 
  .title = paste0("Split for ", Industries[1]))

You can extract the training and test datasets by applying the training() and testing() functions respectively on the splits object.

> training(splits) %>% filter(Industry == Industries[1]) 
# A tibble: 378 x 16
   rowid Month      Industry        Turnover Month_sin12_K1 Month_cos12_K1 Turnover_lag12 Turnover_lag13 Turnover_lag12_r~ Turnover_lag13_r~ Turnover_lag12_r~ Turnover_lag13_r~
   <int> <date>     <chr>              <dbl>          <dbl>          <dbl>          <dbl>          <dbl>             <dbl>             <dbl>             <dbl>             <dbl>
 1    14 1983-05-01 Cafes, restaur~   -2.04          0.515        8.57e- 1          -2.36          -2.01             -2.22             -2.18             -2.21             -2.20
 2    15 1983-06-01 Cafes, restaur~   -2.04          0.874        4.85e- 1          -2.28          -2.36             -2.26             -2.22             -2.19             -2.21
 3    16 1983-07-01 Cafes, restaur~   -1.95          1            1.57e-14          -2.14          -2.28             -2.23             -2.26             -2.17             -2.19
 4    17 1983-08-01 Cafes, restaur~   -1.86          0.866       -5.00e- 1          -2.28          -2.14             -2.17             -2.23             -2.07             -2.17
 5    18 1983-09-01 Cafes, restaur~   -1.98          0.500       -8.66e- 1          -2.07          -2.28             -2.08             -2.17             -1.91             -2.07
 6    19 1983-10-01 Cafes, restaur~   -1.70          0.0169      -1.00e+ 0          -1.89          -2.07             -1.90             -2.08             -1.93             -1.91
 7    20 1983-11-01 Cafes, restaur~   -1.41         -0.485       -8.74e- 1          -1.73          -1.89             -1.66             -1.90             -1.89             -1.93
 8    21 1983-12-01 Cafes, restaur~   -0.876        -0.849       -5.29e- 1          -1.37          -1.73             -1.77             -1.66             -1.90             -1.89
 9    22 1984-01-01 Cafes, restaur~   -2.21         -0.999       -3.38e- 2          -2.21          -1.37             -1.88             -1.77             -1.92             -1.90
10    23 1984-02-01 Cafes, restaur~   -2.21         -0.882        4.70e- 1          -2.07          -2.21             -2.14             -1.88             -1.97             -1.92
# ... with 368 more rows, and 4 more variables: Turnover_lag12_roll_9 <dbl>, Turnover_lag13_roll_9 <dbl>, Turnover_lag12_roll_12 <dbl>, Turnover_lag13_roll_12 <dbl>

It is important to understand that the actual training dataset used to fit the model will be the output of the recipe, as explained hereafter.

Create recipe

As stated in Get Started, before training the model, we can use a recipe to create (new) predictors and conduct the preprocessing required by the model.

recipe(Turnover ~ ., data = training(splits))

The recipe() function as used here has two arguments: a formula (Turnover ~ . declares Turnover as the outcome and all remaining variables as predictors) and the data (here, the training dataset of the split).

For each preprocessing and feature engineering step explained in Part 1 there is a specific step_<name>() function:

recipe_spec <- recipe(Turnover ~ ., data = training(splits)) %>%
  update_role(rowid, new_role = "indicator") %>%  
  step_other(Industry) %>%
  step_timeseries_signature(Month) %>%
  step_rm(matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)")) %>%
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  step_normalize(Month_index.num, Month_year)

Notice the two very useful functions explained hereafter:

update_role(rowid, new_role = "indicator"): lets recipes know that the rowid variable has a custom role called "indicator" (a role can be any character value). This tells the recipe to keep the rowid variable without using it as either an outcome or a predictor.

step_other(Industry): in case the training dataset does not include all values of the categorical variable Industry, this step pools the infrequently occurring values into an "other" category.
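The pooling behaviour can be sketched in base R: any level whose relative frequency falls below a threshold is collapsed into "other" (0.05 mirrors step_other()'s default threshold; the toy level names are illustrative):

```r
# Toy categorical variable with one infrequent level
industry <- c(rep("Food retailing", 30), rep("Liquor retailing", 9),
              "Rare industry")

# Pool levels whose relative frequency is below the threshold
threshold <- 0.05
freq   <- table(industry) / length(industry)
pooled <- ifelse(freq[industry] < threshold, "other", industry)

table(pooled)
# Food retailing: 30, Liquor retailing: 9, other: 1
```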

Below we display the recipe’s spec, its summary, and finally the (real) training dataset:

> recipe_spec
Recipe

Inputs:

      role #variables
 indicator          1
   outcome          1
 predictor         14

Operations:

Collapsing factor levels for Industry
Timeseries signature features from Month
Delete terms matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)")
Dummy variables from all_nominal()
Centering and scaling for Month_index.num, Month_year

# Recipe summary
> recipe_spec %>% 
+   prep() %>% 
+   summary()
# A tibble: 52 x 4
   variable              type    role      source  
   <chr>                 <chr>   <chr>     <chr>   
 1 rowid                 numeric indicator original
 2 Month                 date    predictor original
 3 Month_sin12_K1        numeric predictor original
 4 Month_cos12_K1        numeric predictor original
 5 Turnover_lag12        numeric predictor original
 6 Turnover_lag13        numeric predictor original
 7 Turnover_lag12_roll_3 numeric predictor original
 8 Turnover_lag13_roll_3 numeric predictor original
 9 Turnover_lag12_roll_6 numeric predictor original
10 Turnover_lag13_roll_6 numeric predictor original
# ... with 42 more rows

# Extracted training dataset
> recipe_spec %>% 
+   prep() %>%
+   juice() %>% 
+   glimpse()
Rows: 7,560
Columns: 52
$ rowid                                                                      <int> 14, 467, 920, 1373, 1826, 2279, 2732, 3185, 3638, 4091, 4544, 4997, 5450, 5903, 6356, 6809,~
$ Month                                                                      <date> 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-0~
$ Month_sin12_K1                                                             <dbl> 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5~
$ Month_cos12_K1                                                             <dbl> 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.57457~
$ Turnover_lag12                                                             <dbl> -2.355895, -2.397538, -1.674143, -1.891997, -1.973832, -1.550576, -2.113506, -2.109525, -2.~
$ Turnover_lag13                                                             <dbl> -2.011144, -2.195619, -1.713772, -1.891997, -2.044287, -1.680229, -2.080545, -2.035540, -2.~
$ Turnover_lag12_roll_3                                                      <dbl> -2.216035, -2.299396, -1.771137, -1.971674, -2.053104, -1.602136, -2.091532, -2.110334, -2.~
$ Turnover_lag13_roll_3                                                      <dbl> -2.183520, -2.296579, -1.693957, -1.891997, -2.009059, -1.615402, -2.097026, -2.072532, -2.~
$ Turnover_lag12_roll_6                                                      <dbl> -2.213974, -2.275130, -1.824439, -2.023204, -2.202758, -1.587032, -2.072703, -2.125292, -2.~
$ Turnover_lag13_roll_6                                                      <dbl> -2.197201, -2.278793, -1.787835, -1.988232, -2.146638, -1.577044, -2.076792, -2.110131, -2.~
$ Turnover_lag12_roll_9                                                      <dbl> -2.190758, -2.244673, -1.826687, -2.015706, -2.222243, -1.607126, -2.067327, -2.110334, -2.~
$ Turnover_lag13_roll_9                                                      <dbl> -2.213974, -2.275130, -1.824439, -2.023204, -2.202758, -1.587032, -2.072703, -2.125292, -2.~
$ Turnover_lag12_roll_12                                                     <dbl> -2.095067, -2.152525, -1.819038, -1.978000, -2.167591, -1.593199, -2.043333, -2.048362, -2.~
$ Turnover_lag13_roll_12                                                     <dbl> -2.147914, -2.200931, -1.828293, -2.002079, -2.217778, -1.606260, -2.056832, -2.089405, -2.~
$ Turnover                                                                   <dbl> -2.042611, -1.866127, -1.383977, -1.533674, -1.772910, -1.575604, -1.941481, -1.635071, -1.~
$ Month_index.num                                                            <dbl> -1.727712, -1.727712, -1.727712, -1.727712, -1.727712, -1.727712, -1.727712, -1.727712, -1.~
$ Month_year                                                                 <dbl> -1.712475, -1.712475, -1.712475, -1.712475, -1.712475, -1.712475, -1.712475, -1.712475, -1.~
$ Month_half                                                                 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ Month_quarter                                                              <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
$ Month_month                                                                <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6~
$ Industry_Cafes..restaurants.and.catering.services                          <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Cafes..restaurants.and.takeaway.food.services                     <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Clothing.retailing                                                <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Clothing..footwear.and.personal.accessory.retailing               <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Department.stores                                                 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
$ Industry_Electrical.and.electronic.goods.retailing                         <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0~
$ Industry_Food.retailing                                                    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
$ Industry_Footwear.and.other.personal.accessory.retailing                   <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
$ Industry_Furniture..floor.coverings..houseware.and.textile.goods.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
$ Industry_Hardware..building.and.garden.supplies.retailing                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
$ Industry_Household.goods.retailing                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1~
$ Industry_Liquor.retailing                                                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Newspaper.and.book.retailing                                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Other.recreational.goods.retailing                                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Other.retailing                                                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Other.retailing.n.e.c.                                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Other.specialised.food.retailing                                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Pharmaceutical..cosmetic.and.toiletry.goods.retailing             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Supermarket.and.grocery.stores                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Industry_Takeaway.food.services                                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_01                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_02                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_03                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_04                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_05                                                         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_06                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ Month_month.lbl_07                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_08                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_09                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_10                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_11                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Month_month.lbl_12                                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~

As with the standardization parameters, we save the normalization parameters generated by the step_normalize() function applied to the Month_index.num and Month_year columns. Min-max normalization was used for these columns.

To get the original min and max values prior to normalization, customize the skim() function as I did in Part 1 with my myskim() function and apply it to training(splits).
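As a reminder, such a customized skimmer can be built with skimr::skim_with(). Here is a minimal sketch; the actual myskim() definition is shown in Part 1 and may differ:

```r
library(skimr)

# Sketch of myskim(): extend skim()'s numeric summaries with the raw
# max and min, which we need as normalization limits.
myskim <- skim_with(
  numeric = sfl(max = ~ max(.x, na.rm = TRUE),
                min = ~ min(.x, na.rm = TRUE)),
  append = TRUE
)
```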

recipe(Turnover ~ ., data = training(splits)) %>%
  step_timeseries_signature(Month) %>%
  prep() %>%
  juice() %>% 
  myskim()

# Extract of the output for the sake of clarity:
... ...

-- Variable type: numeric ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# A tibble: 48 x 13
   skim_variable                                                              n_missing complete_rate      mean       sd    p0      p25       p50      p75    p100 hist      max   min
 * <chr>                                                                          <int>         <dbl>     <dbl>    <dbl> <dbl>    <dbl>     <dbl>    <dbl>   <dbl> <chr>   <dbl> <dbl>

... ...

15 Month_index.num                0             1  9.16e+8 286980149.    420595200    667785600      9.16e+8 1.16e+9 1412121600    ▇▇▇▇▇ 1412121600    420595200   
16 Month_year                     0             1  2.00e+3         9.10       1983         1991      2.00e+3 2.01e+3       2014    ▇▇▇▇▇       2014         1983 

... ...

# Month_index.num normalization Parameters
Month_index.num_limit_lower = 420595200
Month_index.num_limit_upper = 1412121600

# Month_year normalization Parameters
Month_year_limit_lower = 1983
Month_year_limit_upper = 2014
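These saved limits let us invert the normalization later, e.g. after forecasting. A sketch using timetk's inversion helpers, with the Month_index.num limits taken from the skim output above:

```r
library(timetk)

# Invert min-max normalization: x * (max - min) + min
month_index_num <- normalize_inv_vec(
  x   = c(0, 0.5, 1),
  min = 420595200,
  max = 1412121600
)

# The target would be inverted the same way with standardize_inv_vec(),
# using the saved std_mean and std_sd parameters.
```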

Save your work

We will keep all data, recipes, splits, parameters and any other useful stuff in a list and save it as an RDS file.

feature_engineering_artifacts_list <- list(
  # Data
  data = list(
    data_prepared_tbl = data_prepared_tbl,
    future_tbl        = future_tbl,
    industries        = Industries
  ),

  # Recipes
  recipes = list(
    recipe_spec = recipe_spec
  ),

  # Splits
  splits = splits,

  # Inversion Parameters
  standardize = list(
    std_mean = std_mean,
    std_sd   = std_sd
  ),

  normalize = list(
    Month_index.num_limit_lower = Month_index.num_limit_lower, 
    Month_index.num_limit_upper = Month_index.num_limit_upper,
    Month_year_limit_lower = Month_year_limit_lower,
    Month_year_limit_upper = Month_year_limit_upper
  )  
)

feature_engineering_artifacts_list %>% 
  write_rds("feature_engineering_artifacts_list.rds")
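In the next parts we can then reload everything in one step. A quick sketch (the path is assumed to be the working directory):

```r
library(readr)

# Reload the artifacts list saved above
artifacts <- read_rds("feature_engineering_artifacts_list.rds")

# Individual objects are then available by name, e.g.:
recipe_spec <- artifacts$recipes$recipe_spec
splits      <- artifacts$splits
```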

Conclusion

In this article I have explained how to perform time series feature engineering with recipes thanks to the timetk package.

The recipe performed preprocessing, transformations and feature engineering applicable to all 20 time series (panel data). For operations such as target standardization, adding lags, rolling averages of lags and Fourier terms, we had to code the feature engineering by group (here, by Industry).

The recipe is an important step in the data science process with modeltime since it will be added to a workflow along with a model; this will be explained in my next article (Part 3).
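To give an idea of what is coming, here is a minimal sketch of how a recipe and a model are combined in a workflow, using a plain linear model as a stand-in (the actual model specifications are introduced in Part 3):

```r
library(workflows)
library(parsnip)
library(rsample)

# Sketch: combine the feature engineering recipe built in this article
# with a model specification, then fit on the training split.
model_spec <- linear_reg() %>%
  set_engine("lm")

workflow_fit <- workflow() %>%
  add_recipe(recipe_spec) %>%
  add_model(model_spec) %>%
  fit(training(splits))
```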

Note: I haven’t shown all the power of recipes, e.g., model-specific recipes can be built from a “base” recipe which is then modified to fit each model’s requirements. Remember, it is important to experiment with different datasets, and creating interdependent recipes can be very useful.

References

1 Dancho, Matt, “DS4B 203-R: High-Performance Time Series Forecasting”, Business Science University

2 Hyndman R., Athanasopoulos G., “Forecasting: Principles and Practice“, 3rd edition
