Time Series Forecasting Lab (Part 2) – Feature Engineering with Recipes
Cover photo by Kristine Tumanyan on Unsplash
This is the second of a series of 6 articles about time series forecasting with panel data and ensemble stacking with R. Click here for Part 1.
Through these articles I will be putting into practice what I have learned from the Business Science University training course [1], "DS4B 203-R: High-Performance Time Series Forecasting", delivered by Matt Dancho. If you are looking to gain a high level of expertise in time series with R, I strongly recommend this course.
The objective of this article is to explain how to perform time series preprocessing and feature engineering with timetk, which extends recipes for time series.
I will explain what can and cannot be done with recipes for time series panel data.
Introduction
Since the ultimate goal of this series of articles is to perform time series forecasting with machine learning, maximizing model performance will require a lot of experimentation with differently feature-engineered datasets.
Feature engineering is the most critical part of time series analysis and with recipes you can “use dplyr-like pipeable sequences of feature engineering steps to get your data ready for modeling”.
Prerequisites
I assume you are already familiar with the following topics, packages and terms:
time series data format: timestamp, measure, key(s), features
time series lags, rolling lags, Fourier terms
dplyr or tidyverse R packages
normalization and standardization
dummy variables (one-hot encoding)
dataset split into training and test for modeling
Packages
The following packages must be loaded:
# PART 1 - FEATURE ENGINEERING
library(tidyverse)   # loading dplyr, tibble, ggplot2, ... dependencies
library(timetk)      # using timetk plotting, diagnostics and augment operations
library(tsibble)     # for month to Date conversion
library(tsibbledata) # for the aus_retail dataset
library(fastDummies) # for dummifying categorical variables
library(skimr)       # for quick statistics

# PART 2 - FEATURE ENGINEERING WITH RECIPES
library(tidymodels)
End-to-end FE process with recipe
Below is a summary diagram of the end-to-end FE process that will be implemented in this article. Notice that the recipe creation will be the last step of the process. The whole process is described hereafter:
1. Perform measure transformations (here, on Turnover).
2. Add a future frame with the prediction length.
3. Add Fourier, lag, and rolling lag features, plus a rowid column.
4. Perform a first split into prepared and future datasets.
5. Apply the training and test split on the prepared dataset.
6. Create the recipe based upon the training dataset of the split.
7. Add steps to the recipe, such as calendar-based features and other transformations and cleaning.
Recipe limitation for panel data
Generally, we could perform 1., 2., and 3. (and the rest of the preprocessing) within a recipe by simply adding the corresponding steps to it.
The problem is that we are working with panel data, so it is not possible to include 1., 2., and 3. as steps within a single recipe. We must first perform these "steps" per group (here, by Industry), and then bind all groups back into a single tibble to be able to do 4., 5., 6., and 7.
Note: the NA values in the prepared dataset correspond to the (unknown) Turnover predictions within the future frame.
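This split-apply-combine pattern exists for a reason: a lag computed over the stacked tibble would leak values across industries. A minimal base-R sketch with a hypothetical two-industry toy panel (no timetk involved) shows why the group-dependent features must be computed per industry before any recipe step:

```r
# Toy panel: two industries, four months each.
panel <- data.frame(
  Month    = rep(seq(as.Date("2020-01-01"), by = "month", length.out = 4), 2),
  Industry = rep(c("A", "B"), each = 4),
  Turnover = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# A naive lag over the whole stacked table bleeds values across groups:
naive_lag <- c(NA, head(panel$Turnover, -1))
naive_lag[5]  # 4 -- industry A's last value leaks into industry B's first row

# Split-apply-combine keeps each industry's lag self-contained:
groups <- lapply(split(panel, panel$Industry), function(g) {
  g$Turnover_lag1 <- c(NA, head(g$Turnover, -1))
  g
})
panel_fe <- do.call(rbind, groups)
panel_fe$Turnover_lag1[panel_fe$Industry == "B"][1]  # NA, as it should be
```

This is exactly what the per-industry lapply() below does with tk_augment_lags(), tk_augment_slidify(), and the standardization, before bind_rows() reassembles the panel.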
Our data
Let us recall our time series dataset from Part 1: the Australian retail trade turnover dataset tsibble::aus_retail, filtered on the "Australian Capital Territory" State value.
> aus_retail_tbl <- aus_retail %>% tk_tbl()
> monthly_retail_tbl <- aus_retail_tbl %>%
    filter(State == "Australian Capital Territory") %>%
    mutate(Month = as.Date(Month)) %>%
    mutate(Industry = as_factor(Industry)) %>%
    select(Month, Industry, Turnover)
> monthly_retail_tbl %>% glimpse()
Rows: 8,820
Columns: 3
$ Month    <date> 1982-04-01, 1982-05-01, 1982-06-01, 1982-07-01, 1982-08-01, 1982-09-01, 1982-10-01, 1982-11-01, 1982-12-01, 1983-01-01, 1983-02-01, 1983-03-01, 1983-04-01, 1983-05-01,~
$ Industry <fct> "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, restaurants and catering ser~
$ Turnover <dbl> 4.4, 3.4, 3.6, 4.0, 3.6, 4.2, 4.8, 5.4, 6.9, 3.8, 4.2, 4.0, 4.4, 4.3, 4.3, 4.6, 4.9, 4.5, 5.5, 6.7, 9.6, 3.8, 3.8, 4.4, 3.8, 3.9, 4.1, 4.3, 4.8, 5.0, 5.3, 5.6, 6.8, 4.6~
The code snippet below performs steps (not recipe steps!) 1., 2., and 3. as explained above. You can always refer to Part 1 for explanations of the function arguments.
We will generate a list (groups) of 20 tibbles, each corresponding to a specific industry. In addition to steps 1., 2., and 3., I have also arranged the timestamp (Month) in ascending order within each Industry group, as required by the lags, rolling lags, and Fourier terms.
We add a future frame running from 2019-01-01 to 2019-12-01 (a 12-month horizon), which holds NA values for the measure (Turnover) but to which the Fourier, lag, and rolling lag features will also be added.
Finally, we bind everything together to produce the first feature-engineered tibble, groups_fe_tbl.
groups <- lapply(X = 1:length(Industries), FUN = function(x) {

  monthly_retail_tbl %>%
    filter(Industry == Industries[x]) %>%
    arrange(Month) %>%
    mutate(Turnover = log1p(x = Turnover)) %>%
    mutate(Turnover = standardize_vec(Turnover)) %>%
    future_frame(Month, .length_out = "12 months", .bind_data = TRUE) %>%
    mutate(Industry = Industries[x]) %>%
    tk_augment_fourier(.date_var = Month, .periods = 12, .K = 1) %>%
    tk_augment_lags(.value = Turnover, .lags = c(12, 13)) %>%
    tk_augment_slidify(.value   = contains("Turnover_lag"),
                       .f       = ~ mean(.x, na.rm = TRUE),
                       .period  = c(3, 6, 9, 12),
                       .partial = TRUE,
                       .align   = "center")
})

groups_fe_tbl <- bind_rows(groups) %>%
  rowid_to_column(var = "rowid")
Below I display the last 13 months of groups_fe_tbl: you should see the last historical row (2018-12-01) followed by the future-frame values for the last industry (obviously, the situation is the same for the other industries within the same tibble):
> groups_fe_tbl %>% tail(n=13) %>% glimpse() Rows: 13 Columns: 16 $ rowid <int> 9048, 9049, 9050, 9051, 9052, 9053, 9054, 9055, 9056, 9057, 9058, 9059, 9060 $ Month <date> 2018-12-01, 2019-01-01, 2019-02-01, 2019-03-01, 2019-04-01, 2019-05-01, 2019-06-01, 2019-07-01, 20~ $ Industry <chr> "Takeaway food services", "Takeaway food services", "Takeaway food services", "Takeaway food serv~ $ Turnover <dbl> 2.078015, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA $ Month_sin12_K1 <dbl> 0.16810089, 0.63846454, 0.93775213, 0.99301874, 0.80100109, 0.40982032, -0.10116832, -0.57126822, ~ $ Month_cos12_K1 <dbl> 0.98576980, 0.76965124, 0.34730525, -0.11795672, -0.59866288, -0.91216627, -0.99486932, -0.8207634~ $ Turnover_lag12 <dbl> 1.668977, 1.397316, 1.542003, 1.894938, 1.894938, 1.894938, 1.878972, 1.886969, 1.957669, 1.926527~ $ Turnover_lag13 <dbl> 1.755391, 1.668977, 1.397316, 1.542003, 1.894938, 1.894938, 1.894938, 1.878972, 1.886969, 1.957669~ $ Turnover_lag12_roll_3 <dbl> 1.607228, 1.536098, 1.611419, 1.777293, 1.894938, 1.889616, 1.886960, 1.907870, 1.923722, 1.972616~ $ Turnover_lag13_roll_3 <dbl> 1.723756, 1.607228, 1.536098, 1.611419, 1.777293, 1.894938, 1.889616, 1.886960, 1.907870, 1.923722~ $ Turnover_lag12_roll_6 <dbl> 1.667587, 1.692260, 1.715518, 1.750517, 1.832126, 1.901404, 1.906669, 1.929788, 1.941529, 1.974703~ $ Turnover_lag13_roll_6 <dbl> 1.668911, 1.667587, 1.692260, 1.715518, 1.750517, 1.832126, 1.901404, 1.906669, 1.929788, 1.941529~ $ Turnover_lag12_roll_9 <dbl> 1.750363, 1.744253, 1.741597, 1.757160, 1.779636, 1.808252, 1.878956, 1.925999, 1.946341, 1.952766~ $ Turnover_lag13_roll_9 <dbl> 1.748589, 1.750363, 1.744253, 1.741597, 1.757160, 1.779636, 1.808252, 1.878956, 1.925999, 1.929881~ $ Turnover_lag12_roll_12 <dbl> 1.783846, 1.784512, 1.785157, 1.787128, 1.811024, 1.828524, 1.862610, 1.904910, 1.941200, 1.946341~ $ Turnover_lag13_roll_12 <dbl> 1.766346, 1.783846, 1.784512, 1.785157, 1.787128, 1.811024, 1.828524, 1.843028, 1.887599, 1.925999~
Save standardization parameters
Since we are working with panel data (there are 20 time series, one per Industry), there are 20 pairs of standardization parameters (mean, standard deviation) generated by standardize_vec(Turnover).
The values of these parameters must be saved in order to invert the transformation later, and so far the only way to do this is manually, which is indeed a limitation. Consequently, I saved the means and standard deviations in the vectors std_mean and std_sd respectively.
# 20 standardization parameters printed by standardize_vec()
Standardization Parameters
mean: 2.88109110996285
standard deviation: 0.594036003334063
Standardization Parameters
mean: 3.35376372924979
standard deviation: 0.547454581447732
Standardization Parameters
mean: 2.45800504638277
standard deviation: 0.531250546670929
Standardization Parameters
...
...
Standardization Parameters
mean: 2.46138105124506
standard deviation: 0.45386783117798

# Measure standardization parameters for inversion
std_mean <- c(2.88109110996285, 3.35376372924979, 2.45800504638277, 2.90694243034087,
              3.18507858440263, 2.83400545514364, 4.35243537687952, 2.03664732406185,
              2.64840690567097, 2.41927121148872, 3.70089063463501, 1.81454274237816,
              2.00305556153117, 1.65085776430947, 3.30159231490985, 2.09573352259677,
              1.97057625538218, 2.20889632513084, 4.19733019025565, 2.46138105124506)

std_sd <- c(0.594036003334063, 0.547454581447732, 0.531250546670929, 0.515550971472541,
            0.371902717442964, 0.683006197969896, 0.744552472892707, 0.429122786330909,
            0.550558802233277, 0.675900224536041, 0.657978689289912, 0.630475051501078,
            0.379777529493197, 0.516772299837197, 0.515586913143017, 0.464773051795071,
            0.58058847542517, 0.584886288114511, 0.753013481976128, 0.45386783117798)
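For completeness, here is how the inversion will work in Part 6, and why saving (mean, sd) before standardizing is enough to make the whole transformation reversible. This is a base-R sketch with made-up turnover values, not the actual aus_retail figures:

```r
turnover <- c(4.4, 3.4, 3.6, 4.0, 3.6)    # toy raw values for one industry
x        <- log1p(turnover)               # 1. variance-stabilizing transform
mu       <- mean(x)                       # 2. capture the parameters first...
sigma    <- sd(x)
z        <- (x - mu) / sigma              # ...then standardize

# Inversion: undo the steps in reverse order.
turnover_back <- expm1(z * sigma + mu)
all.equal(turnover_back, turnover)        # TRUE (up to floating-point tolerance)
```

Computing mu and sigma yourself like this, per group, is one way to avoid copy-pasting the parameters that standardize_vec() prints to the console.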
Prepared and future datasets
We now split groups_fe_tbl into the prepared and future datasets.
The prepared dataset holds the historical values of the measure (here, Turnover) and must not contain NA values: we apply filter(!is.na(Turnover)) to remove the future-frame rows, and drop_na() to drop the rows with NA values elsewhere, e.g., those introduced by the lag and rolling lag features.
Finally, we obtain an 8,560-row tibble (data_prepared_tbl) from the 9,060-row groups_fe_tbl. You can run groups_fe_tbl %>% group_by(Industry) %>% skim() to check the total number of missing values (NA and NaN) that were removed.
The prepared dataset starts on 1983-05-01, the first month for which the 13-month lag is defined.
> data_prepared_tbl <- groups_fe_tbl %>% filter(!is.na(Turnover)) %>% drop_na() > data_prepared_tbl %>% glimpse() Rows: 8,560 Columns: 16 $ rowid <int> 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38~ $ Month <date> 1983-05-01, 1983-06-01, 1983-07-01, 1983-08-01, 1983-09-01, 1983-10-01, 1983-11-01, 1983-12-01, 1~ $ Industry <chr> "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, re~ $ Turnover <dbl> -2.0426107, -2.0426107, -1.9499231, -1.8620736, -1.9802554, -1.6990366, -1.4138382, -0.8757670, -2~ $ Month_sin12_K1 <dbl> 0.51455540, 0.87434662, 1.00000000, 0.86602540, 0.50000000, 0.01688948, -0.48530196, -0.84864426, ~ $ Month_cos12_K1 <dbl> 8.574571e-01, 4.853020e-01, 1.567970e-14, -5.000000e-01, -8.660254e-01, -9.998574e-01, -8.743466e-~ $ Turnover_lag12 <dbl> -2.355895, -2.281065, -2.140701, -2.281065, -2.074676, -1.890850, -1.725136, -1.370672, -2.209420,~ $ Turnover_lag13 <dbl> -2.011144, -2.355895, -2.281065, -2.140701, -2.281065, -2.074676, -1.890850, -1.725136, -1.370672,~ $ Turnover_lag12_roll_3 <dbl> -2.216035, -2.259220, -2.234277, -2.165481, -2.082197, -1.896888, -1.662219, -1.768409, -1.884923,~ $ Turnover_lag13_roll_3 <dbl> -2.183520, -2.216035, -2.259220, -2.234277, -2.165481, -2.082197, -1.896888, -1.662219, -1.768409,~ $ Turnover_lag12_roll_6 <dbl> -2.213974, -2.190758, -2.170709, -2.065582, -1.913850, -1.925303, -1.890905, -1.901909, -1.921958,~ $ Turnover_lag13_roll_6 <dbl> -2.197201, -2.213974, -2.190758, -2.170709, -2.065582, -1.913850, -1.925303, -1.890905, -1.901909,~ $ Turnover_lag12_roll_9 <dbl> -2.190758, -2.147914, -2.095067, -2.014578, -2.036609, -2.005362, -1.989766, -1.975371, -1.948876,~ $ Turnover_lag13_roll_9 <dbl> -2.213974, -2.190758, -2.147914, -2.095067, -2.014578, -2.036609, -2.005362, -1.989766, -1.975371,~ $ Turnover_lag12_roll_12 <dbl> -2.095067, -2.014578, -2.034063, -2.037755, -2.046334, -2.046334, -2.020226, -2.000355, 
-1.984457,~ $ Turnover_lag13_roll_12 <dbl> -2.147914, -2.095067, -2.014578, -2.034063, -2.037755, -2.046334, -2.046334, -2.020226, -2.000355,~
The future dataset holds the 12 future (and unknown) values of Turnover per industry, i.e., the NA values introduced by the future frame. We will predict and fill in these values in Part 6 with the final stacking ensemble model.
> future_tbl <- groups_fe_tbl %>% filter(is.na(Turnover)) > future_tbl %>% head(n=12) %>% glimpse() Rows: 12 Columns: 16 $ rowid <int> 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453 $ Month <date> 2019-01-01, 2019-02-01, 2019-03-01, 2019-04-01, 2019-05-01, 2019-06-01, 2019-07-01, 2019-08-01, 20~ $ Industry <chr> "Cafes, restaurants and catering services", "Cafes, restaurants and catering services", "Cafes, r~ $ Turnover <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA $ Month_sin12_K1 <dbl> 0.63846454, 0.93775213, 0.99301874, 0.80100109, 0.40982032, -0.10116832, -0.57126822, -0.90511451,~ $ Month_cos12_K1 <dbl> 0.76965124, 0.34730525, -0.11795672, -0.59866288, -0.91216627, -0.99486932, -0.82076344, -0.425167~ $ Turnover_lag12 <dbl> 1.125397, 1.282320, 1.569292, 1.425855, 1.437951, 1.429897, 1.550608, 1.469789, 1.457920, 1.469789~ $ Turnover_lag13 <dbl> 1.384894, 1.125397, 1.282320, 1.569292, 1.425855, 1.437951, 1.429897, 1.550608, 1.469789, 1.457920~ $ Turnover_lag12_roll_3 <dbl> 1.264204, 1.325670, 1.425822, 1.477699, 1.431234, 1.472819, 1.483431, 1.492773, 1.465833, 1.495272~ $ Turnover_lag13_roll_3 <dbl> 1.316081, 1.264204, 1.325670, 1.425822, 1.477699, 1.431234, 1.472819, 1.483431, 1.492773, 1.465833~ $ Turnover_lag12_roll_6 <dbl> 1.370952, 1.370952, 1.378452, 1.449320, 1.480565, 1.462003, 1.469326, 1.489352, 1.490694, 1.478711~ $ Turnover_lag13_roll_6 <dbl> 1.378274, 1.370952, 1.370952, 1.378452, 1.449320, 1.480565, 1.462003, 1.469326, 1.489352, 1.501243~ $ Turnover_lag12_roll_9 <dbl> 1.411416, 1.395927, 1.404907, 1.408445, 1.416559, 1.454825, 1.485468, 1.470874, 1.476502, 1.482009~ $ Turnover_lag13_roll_9 <dbl> 1.420563, 1.411416, 1.395927, 1.404907, 1.408445, 1.416559, 1.454825, 1.485468, 1.474990, 1.482009~ $ Turnover_lag12_roll_12 <dbl> 1.433627, 1.429420, 1.420139, 1.420139, 1.430152, 1.434573, 1.462680, 1.480716, 1.470874, 1.476502~ $ Turnover_lag13_roll_12 <dbl> 1.429496, 1.433627, 1.429420, 1.420139, 1.420139, 1.430152, 1.434266, 
1.465153, 1.485468, 1.474990~
Train and test datasets split
We perform the split with the timetk::time_series_split() function, which returns a split object (an R list) that can be further used to plot both the training and the test datasets.
One important argument is assess: it determines the length of the test dataset. As per [2], "The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required." With 428 observations (months) per industry, I chose a 50-month test set: roughly 12% of the sample, and more than four times the 12-month forecast horizon.
cumulative = TRUE means that the training dataset extends back to the beginning (first row) of the prepared dataset.
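In effect, with cumulative = TRUE the split is just a date cutoff: the last assess months go to the test set and everything earlier goes to training. A base-R sketch of the same logic on one industry's 428 months (multiplied by 20 industries, these counts match the tk_time_series_cv_plan() table further below: 7,560 training and 1,000 testing rows):

```r
# One industry's timeline: 1983-05-01 .. 2018-12-01 (428 monthly observations)
months <- seq(as.Date("1983-05-01"), as.Date("2018-12-01"), by = "month")

# assess = "50 months": the cutoff sits 50 months before the last observation
cutoff   <- seq(max(months), by = "-50 months", length.out = 2)[2]   # 2014-10-01
training <- months[months <= cutoff]   # cumulative = TRUE: back to the first row
testing  <- months[months >  cutoff]

c(train = length(training), test = length(testing))   # 378 / 50 per industry
```

The 378 training months per industry also match the 378-row tibble returned by training(splits) for a single industry below.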
> splits <- data_prepared_tbl %>%
    time_series_split(Month, assess = "50 months", cumulative = TRUE)

# split object
> splits
<Analysis/Assess/Total>
<7560/1000/8560>
Which means:
7,560 rows for training (all 20 industries). Here we only care about the number of rows, since the number of columns will be updated by the recipe (explained soon).
1,000 rows for the test dataset (50 months × 20 industries).
8,560 rows of data_prepared_tbl in total (all 20 industries).
You can confirm the sizes of the training and test datasets with the following code snippet. The tk_time_series_cv_plan() function creates a new tibble with two additional columns, .id and .key:
> splits %>% tk_time_series_cv_plan() %>% glimpse() Rows: 8,560 Columns: 18 $ .id <chr> "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1", "Slice1"~ $ .key <fct> training, training, training, training, training, training, training, training, training, training~ $ rowid <int> 14, 467, 920, 1373, 1826, 2279, 2732, 3185, 3638, 4091, 4544, 4997, 5450, 5903, 6356, 6809, 7262, ~ $ Month <date> 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1~ $ Industry <chr> "Cafes, restaurants and catering services", "Cafes, restaurants and takeaway food services", "Clot~ $ Turnover <dbl> -2.042611, -1.866127, -1.383977, -1.533674, -1.772910, -1.575604, -1.941481, -1.635071, -1.999504,~ $ Month_sin12_K1 <dbl> 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554,~ $ Month_cos12_K1 <dbl> 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, ~ $ Turnover_lag12 <dbl> -2.355895, -2.397538, -1.674143, -1.891997, -1.973832, -1.550576, -2.113506, -2.109525, -2.292421,~ $ Turnover_lag13 <dbl> -2.011144, -2.195619, -1.713772, -1.891997, -2.044287, -1.680229, -2.080545, -2.035540, -2.434025,~ $ Turnover_lag12_roll_3 <dbl> -2.216035, -2.299396, -1.771137, -1.971674, -2.053104, -1.602136, -2.091532, -2.110334, -2.420468,~ $ Turnover_lag13_roll_3 <dbl> -2.183520, -2.296579, -1.693957, -1.891997, -2.009059, -1.615402, -2.097026, -2.072532, -2.363223,~ $ Turnover_lag12_roll_6 <dbl> -2.213974, -2.275130, -1.824439, -2.023204, -2.202758, -1.587032, -2.072703, -2.125292, -2.406516,~ $ Turnover_lag13_roll_6 <dbl> -2.197201, -2.278793, -1.787835, -1.988232, -2.146638, -1.577044, -2.076792, -2.110131, -2.411748,~ $ Turnover_lag12_roll_9 <dbl> -2.190758, -2.244673, -1.826687, -2.015706, -2.222243, -1.607126, -2.067327, -2.110334, -2.419395,~ $ Turnover_lag13_roll_9 <dbl> -2.213974, -2.275130, -1.824439, -2.023204, -2.202758, 
-1.587032, -2.072703, -2.125292, -2.406516,~ $ Turnover_lag12_roll_12 <dbl> -2.095067, -2.152525, -1.819038, -1.978000, -2.167591, -1.593199, -2.043333, -2.048362, -2.304961,~ $ Turnover_lag13_roll_12 <dbl> -2.147914, -2.200931, -1.828293, -2.002079, -2.217778, -1.606260, -2.056832, -2.089405, -2.359411,~ > splits %>% tk_time_series_cv_plan() %>% select(.id) %>% table() Slice1 8560 > splits %>% tk_time_series_cv_plan() %>% select(.key) %>% table() training testing 7560 1000
Let us plot the training and test datasets for one Industry value with the plot_time_series_cv_plan()
function:
> Industries <- unique(aus_retail_tbl$Industry)
> splits %>%
    tk_time_series_cv_plan() %>%
    filter(Industry == Industries[1]) %>%
    plot_time_series_cv_plan(.date_var = Month,
                             .value    = Turnover,
                             .title    = paste0("Split for ", Industries[1]))
You can extract the training and test datasets by applying the training()
and testing()
functions respectively on the splits
object.
> training(splits) %>% filter(Industry == Industries[1]) # A tibble: 378 x 16 rowid Month Industry Turnover Month_sin12_K1 Month_cos12_K1 Turnover_lag12 Turnover_lag13 Turnover_lag12_r~ Turnover_lag13_r~ Turnover_lag12_r~ Turnover_lag13_r~ <int> <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 14 1983-05-01 Cafes, restaur~ -2.04 0.515 8.57e- 1 -2.36 -2.01 -2.22 -2.18 -2.21 -2.20 2 15 1983-06-01 Cafes, restaur~ -2.04 0.874 4.85e- 1 -2.28 -2.36 -2.26 -2.22 -2.19 -2.21 3 16 1983-07-01 Cafes, restaur~ -1.95 1 1.57e-14 -2.14 -2.28 -2.23 -2.26 -2.17 -2.19 4 17 1983-08-01 Cafes, restaur~ -1.86 0.866 -5.00e- 1 -2.28 -2.14 -2.17 -2.23 -2.07 -2.17 5 18 1983-09-01 Cafes, restaur~ -1.98 0.500 -8.66e- 1 -2.07 -2.28 -2.08 -2.17 -1.91 -2.07 6 19 1983-10-01 Cafes, restaur~ -1.70 0.0169 -1.00e+ 0 -1.89 -2.07 -1.90 -2.08 -1.93 -1.91 7 20 1983-11-01 Cafes, restaur~ -1.41 -0.485 -8.74e- 1 -1.73 -1.89 -1.66 -1.90 -1.89 -1.93 8 21 1983-12-01 Cafes, restaur~ -0.876 -0.849 -5.29e- 1 -1.37 -1.73 -1.77 -1.66 -1.90 -1.89 9 22 1984-01-01 Cafes, restaur~ -2.21 -0.999 -3.38e- 2 -2.21 -1.37 -1.88 -1.77 -1.92 -1.90 10 23 1984-02-01 Cafes, restaur~ -2.21 -0.882 4.70e- 1 -2.07 -2.21 -2.14 -1.88 -1.97 -1.92 # ... with 368 more rows, and 4 more variables: Turnover_lag12_roll_9 <dbl>, Turnover_lag13_roll_9 <dbl>, Turnover_lag12_roll_12 <dbl>, Turnover_lag13_roll_12 <dbl>
It is important to understand that the actual training dataset to be used to learn the model will be the output of the recipe , as explained hereafter.
Create recipe
As stated in Get Started, before training the model, we can use a recipe to create (new) predictors and conduct the preprocessing required by the model.
recipe(Turnover ~ ., data = training(splits))
"The recipe() function as we used it here has two arguments:

A formula. Any variable on the left-hand side of the tilde (~) is considered the model outcome (here, Turnover). On the right-hand side of the tilde are the predictors. Variables may be listed by name, or you can use the dot (.) to indicate all other variables as predictors.

The data. A recipe is associated with the data set used to create the model. This will typically be the training set, so data = training(splits) here. Naming a data set doesn't actually change the data itself; it is only used to catalog the names of the variables and their types, like factors, integers, dates, etc."
For each of the preprocessing and feature engineering steps explained in Part 1 there is a specific step_<>() function:

add calendar-based features: step_timeseries_signature()
remove columns: step_rm()
perform one-hot encoding on categorical variables: step_dummy()
normalize numeric variables: step_normalize()
recipe_spec <- recipe(Turnover ~ ., data = training(splits)) %>%
  update_role(rowid, new_role = "indicator") %>%
  step_other(Industry) %>%
  step_timeseries_signature(Month) %>%
  step_rm(matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)")) %>%
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  step_normalize(Month_index.num, Month_year)
Notice the two very useful functions explained hereafter:

update_role(rowid, new_role = "indicator"): it lets recipes know that the rowid variable has a custom role called "indicator" (a role can have any character value). This tells the recipe to keep the rowid variable without using it as either an outcome or a predictor.

step_other(Industry): in case the training dataset does not include all values of the categorical variable Industry, this step pools the infrequently occurring values into an "other" category.
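The pooling that step_other() performs can be mimicked in a few lines of base R, which makes its effect concrete. The toy vector below is illustrative, and the 0.05 cutoff mirrors what I understand to be step_other()'s default threshold argument:

```r
# Hypothetical industry column: two frequent levels, one rare level
industry <- c(rep("Food retailing", 48), rep("Clothing retailing", 48),
              rep("Niche retailing", 4))

threshold <- 0.05                        # pool levels under 5% of the rows
freq   <- prop.table(table(industry))    # relative frequency of each level
rare   <- names(freq)[freq < threshold]  # "Niche retailing" (4%)
pooled <- ifelse(industry %in% rare, "other", industry)

table(pooled)   # Clothing retailing: 48, Food retailing: 48, other: 4
```

In our case all 20 industries are frequent and present in the training set, so step_other() is mostly a safety net.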
Below we display the recipe's spec, its summary, and finally the (real) training dataset:
> recipe_spec Recipe Inputs: role #variables indicator 1 outcome 1 predictor 14 Operations: Collapsing factor levels for Industry Timeseries signature features from Month Delete terms matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(day)|(week)|(am.pm)") Dummy variables from all_nominal() Centering and scaling for Month_index.num, Month_year # Recipe summary > recipe_spec %>% + prep() %>% + summary() # A tibble: 52 x 4 variable type role source <chr> <chr> <chr> <chr> 1 rowid numeric indicator original 2 Month date predictor original 3 Month_sin12_K1 numeric predictor original 4 Month_cos12_K1 numeric predictor original 5 Turnover_lag12 numeric predictor original 6 Turnover_lag13 numeric predictor original 7 Turnover_lag12_roll_3 numeric predictor original 8 Turnover_lag13_roll_3 numeric predictor original 9 Turnover_lag12_roll_6 numeric predictor original 10 Turnover_lag13_roll_6 numeric predictor original # ... with 42 more rows # Extracted training dataset > recipe_spec %>% + prep() %>% + juice() %>% + glimpse() Rows: 7,560 Columns: 52 $ rowid <int> 14, 467, 920, 1373, 1826, 2279, 2732, 3185, 3638, 4091, 4544, 4997, 5450, 5903, 6356, 6809,~ $ Month <date> 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-05-01, 1983-0~ $ Month_sin12_K1 <dbl> 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5145554, 0.5~ $ Month_cos12_K1 <dbl> 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.574571e-01, 8.57457~ $ Turnover_lag12 <dbl> -2.355895, -2.397538, -1.674143, -1.891997, -1.973832, -1.550576, -2.113506, -2.109525, -2.~ $ Turnover_lag13 <dbl> -2.011144, -2.195619, -1.713772, -1.891997, -2.044287, -1.680229, -2.080545, -2.035540, -2.~ $ Turnover_lag12_roll_3 <dbl> -2.216035, -2.299396, -1.771137, -1.971674, -2.053104, -1.602136, -2.091532, -2.110334, -2.~ $ Turnover_lag13_roll_3 <dbl> -2.183520, -2.296579, -1.693957, -1.891997, -2.009059, -1.615402, -2.097026, -2.072532, -2.~ $ 
Turnover_lag12_roll_6 <dbl> -2.213974, -2.275130, -1.824439, -2.023204, -2.202758, -1.587032, -2.072703, -2.125292, -2.~ $ Turnover_lag13_roll_6 <dbl> -2.197201, -2.278793, -1.787835, -1.988232, -2.146638, -1.577044, -2.076792, -2.110131, -2.~ $ Turnover_lag12_roll_9 <dbl> -2.190758, -2.244673, -1.826687, -2.015706, -2.222243, -1.607126, -2.067327, -2.110334, -2.~ $ Turnover_lag13_roll_9 <dbl> -2.213974, -2.275130, -1.824439, -2.023204, -2.202758, -1.587032, -2.072703, -2.125292, -2.~ $ Turnover_lag12_roll_12 <dbl> -2.095067, -2.152525, -1.819038, -1.978000, -2.167591, -1.593199, -2.043333, -2.048362, -2.~ $ Turnover_lag13_roll_12 <dbl> -2.147914, -2.200931, -1.828293, -2.002079, -2.217778, -1.606260, -2.056832, -2.089405, -2.~ $ Turnover <dbl> -2.042611, -1.866127, -1.383977, -1.533674, -1.772910, -1.575604, -1.941481, -1.635071, -1.~ $ Month_index.num <dbl> -1.727712, -1.727712, -1.727712, -1.727712, -1.727712, -1.727712, -1.727712, -1.727712, -1.~ $ Month_year <dbl> -1.712475, -1.712475, -1.712475, -1.712475, -1.712475, -1.712475, -1.712475, -1.712475, -1.~ $ Month_half <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~ $ Month_quarter <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~ $ Month_month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6~ $ Industry_Cafes..restaurants.and.catering.services <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Cafes..restaurants.and.takeaway.food.services <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Clothing.retailing <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Clothing..footwear.and.personal.accessory.retailing <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 
0, 0, 0, 0, 0~ $ Industry_Department.stores <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~ $ Industry_Electrical.and.electronic.goods.retailing <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0~ $ Industry_Food.retailing <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~ $ Industry_Footwear.and.other.personal.accessory.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~ $ Industry_Furniture..floor.coverings..houseware.and.textile.goods.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~ $ Industry_Hardware..building.and.garden.supplies.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~ $ Industry_Household.goods.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1~ $ Industry_Liquor.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Newspaper.and.book.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Other.recreational.goods.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Other.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Other.retailing.n.e.c. 
<dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Other.specialised.food.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Pharmaceutical..cosmetic.and.toiletry.goods.retailing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Supermarket.and.grocery.stores <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Industry_Takeaway.food.services <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Month_month.lbl_01 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Month_month.lbl_02 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Month_month.lbl_03 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Month_month.lbl_04 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Month_month.lbl_05 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Month_month.lbl_06 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~ $ Month_month.lbl_07 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Month_month.lbl_08 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Month_month.lbl_09 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Month_month.lbl_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ $ Month_month.lbl_11 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~ 
$ Month_month.lbl_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
As with the measure's standardization parameters, we save the normalization parameters for the Month_index.num and Month_year columns processed by the step_normalize() function, i.e., their lower and upper limits prior to normalization.
To get the original min and max values prior to normalization, you must customize the skim() function as I did in Part 1 with my myskim() function, and apply it to training(splits).
recipe(Turnover ~ ., data = training(splits)) %>%
  step_timeseries_signature(Month) %>%
  prep() %>%
  juice() %>%
  myskim()

# Extract of the output for the sake of clarity:
...
...
-- Variable type: numeric ------------------------------------------------------------------------
# A tibble: 48 x 13
   skim_variable   n_missing complete_rate    mean         sd        p0       p25     p50     p75       p100 hist         max        min
 * <chr>               <int>         <dbl>   <dbl>      <dbl>     <dbl>     <dbl>   <dbl>   <dbl>      <dbl> <chr>      <dbl>      <dbl>
...
...
15 Month_index.num         0             1 9.16e+8 286980149. 420595200 667785600 9.16e+8 1.16e+9 1412121600 ▇▇▇▇▇ 1412121600  420595200
16 Month_year              0             1 2.00e+3       9.10      1983      1991 2.00e+3 2.01e+3       2014 ▇▇▇▇▇       2014       1983
...
...

# Month_index.num normalization parameters
Month_index.num_limit_lower = 420595200
Month_index.num_limit_upper = 1412121600

# Month_year normalization parameters
Month_year_limit_lower = 1983
Month_year_limit_upper = 2014
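As a sanity check on those limits: the Month_index.num feature generated by step_timeseries_signature() is simply the timestamp expressed as seconds since the Unix epoch (1970-01-01 UTC), so both values can be reproduced by hand. The index_num() helper below is mine, not timetk's:

```r
# index.num = seconds elapsed since 1970-01-01 00:00:00 UTC
index_num <- function(d) as.numeric(as.POSIXct(as.character(d), tz = "UTC"))

index_num(as.Date("1983-05-01"))   # 420595200  = Month_index.num_limit_lower
index_num(as.Date("2014-10-01"))   # 1412121600 = Month_index.num_limit_upper
```

Note that the upper limit corresponds to 2014-10-01, the last training month: the recipe was prepped on the training split only, so the test set's 50 months never leak into the normalization parameters.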
Save your work
We will keep all data, recipes, splits, parameters and any other useful stuff in a list and save it as an RDS file.
feature_engineering_artifacts_list <- list(
  # Data
  data = list(
    data_prepared_tbl = data_prepared_tbl,
    future_tbl        = future_tbl,
    industries        = Industries
  ),
  # Recipes
  recipes = list(
    recipe_spec = recipe_spec
  ),
  # Splits
  splits = splits,
  # Inversion Parameters
  standardize = list(
    std_mean = std_mean,
    std_sd   = std_sd
  ),
  normalize = list(
    Month_index.num_limit_lower = Month_index.num_limit_lower,
    Month_index.num_limit_upper = Month_index.num_limit_upper,
    Month_year_limit_lower      = Month_year_limit_lower,
    Month_year_limit_upper      = Month_year_limit_upper
  )
)

feature_engineering_artifacts_list %>%
  write_rds("feature_engineering_artifacts_list.rds")
Conclusion
In this article I have explained how to perform time series feature engineering with recipes, thanks to the timetk package which extends them for time series.
The recipe performs the preprocessing, transformations, and feature engineering applicable to all 20 time series at once (panel data). For operations such as measure standardization and the addition of lags, rolling lags, and Fourier terms, we had to code group-wise (here, by Industry) feature engineering ourselves.
The recipe is an important step in the data science process with modeltime, since it will be added to a workflow along with a model; this will be explained in my next article (Part 3).
Note: I haven't shown all the power of recipes; for example, model-specific recipes can be built from a "base" recipe and modified to adapt to each model's requirements. Remember, it is important to experiment with different datasets, and creating interdependent recipes can be very useful for that.
References
[1] Dancho, Matt, "DS4B 203-R: High-Performance Time Series Forecasting", Business Science University.
[2] Hyndman, R.J., Athanasopoulos, G., "Forecasting: Principles and Practice", 3rd edition.