Time Series in 5-Minutes, Part 1: Data Wrangling and Rolling Calculations
Have 5-minutes? Then let’s learn time series. In this short article series, I highlight how you can get up to speed quickly on important aspects of time series analysis. Today we are focusing on preparing data for time series analysis and on rolling calculations.
Updates
This article has been updated. View the updated Time Series in 5-Minutes article at Business Science.
Time Series in 5-Minutes
Articles in this Series
I just released timetk 2.0.0 (read the release announcement). A ton of new functionality has been added. We’ll discuss some of the key pieces in this article series:
- Part 1, Data Wrangling and Rolling Calculations
- Part 2, The Time Plot
- Part 3, Autocorrelation
- Part 4, Seasonality
- Part 5, Anomalies and Anomaly Detection
- Part 6, Dealing with Missing Time Series Data
Have 5-Minutes?
Then let’s learn Rolling Calculations
Time series data wrangling is an essential skill for any forecaster. timetk includes the essential data wrangling tools. In this tutorial:
- Summarise by Time – For time-based aggregations
- Filter by Time – For complex time-based filtering
- Pad by Time – For filling in gaps and going from low to high frequency
- Slidify – For turning any function into a sliding (rolling) function
Additional concepts covered:
- Imputation – Needed for Padding (See Low to High Frequency)
- Advanced Filtering – Using the new add time %+time% infix operation (See Padding Data: Low to High Frequency)
- Visualization – plot_time_series() for all visualizations
Advanced Time Series Course
Become the time series domain expert in your organization.
Make sure you’re notified when my new Advanced Time Series Forecasting in R course comes out. You’ll learn timetk and modeltime plus the most powerful time series forecasting techniques available.
Get notified here: Advanced Time Series Course.
You will learn:
- Time Series Preprocessing, Noise Reduction, & Anomaly Detection
- Feature engineering using lagged variables & external regressors
- Hyperparameter tuning
- Time series cross-validation
- Ensembling Multiple Machine Learning & Univariate Modeling Techniques (Competition Winner)
- NEW – Deep Learning with RNNs (Competition Winner)
- and more.
Signup for the Time Series Course waitlist
Let’s Get Started
Data
This tutorial will use the FANG dataset:
- Daily
- Irregular (missing business holidays and weekends)
- 4 groups (FB, AMZN, NFLX, and GOOG).
The adjusted column contains the adjusted closing prices for each day.
The volume column contains the trade volume (number of shares traded) for the day.
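As a starting point, here is a minimal sketch of loading the data and drawing the time plot. It assumes the FANG dataset bundled with timetk (tidyquant ships a dataset of the same name) and uses dplyr for the pipe and grouping; the object names fang and fang_plot are only illustrative.

```r
library(dplyr)
library(timetk)

# FANG is assumed to be the example dataset bundled with timetk
fang <- FANG

# One row per symbol per trading day
glimpse(fang)

# Faceted time plot of adjusted prices, one panel per symbol
fang_plot <- fang %>%
  group_by(symbol) %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)

fang_plot
```

Setting .interactive = FALSE returns a static ggplot2 chart instead of an interactive plotly one.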
Summarize by Time
summarise_by_time() aggregates by a period. It’s great for:
- Period Aggregation – SUM()
- Period Smoothing – AVERAGE(), FIRST(), LAST()
Period Summarization
Objective: Get the total trade volume by quarter
- Use SUM()
- Aggregate using .by = "quarter"
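The quarterly aggregation could be sketched as follows. To keep the example down to dplyr and timetk, base sum() stands in for the Excel-style SUM() named above (assumed here to come from tidyquant); quarterly_volume is an illustrative name.

```r
library(dplyr)
library(timetk)

# Total traded volume per symbol per calendar quarter.
# sum() is used in place of SUM() so only dplyr and timetk are required.
quarterly_volume <- FANG %>%
  group_by(symbol) %>%
  summarise_by_time(
    date,
    .by    = "quarter",
    volume = sum(volume)
  )

quarterly_volume
```

Piping the result into plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE) gives one panel of quarterly totals per symbol.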
Period Smoothing
Objective: Get the first value in each month
- We can use FIRST() to get the first value, which has the effect of reducing the data (i.e. smoothing). We could use AVERAGE() or MEDIAN().
- Use the summarization by time: .by = "month" to aggregate by month.
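A sketch of the monthly smoothing step, under the same assumptions as before: dplyr's first() stands in for FIRST(), and monthly_first is an illustrative name.

```r
library(dplyr)
library(timetk)

# Keep only the first adjusted price observed in each month, per symbol.
# Swapping first() for mean() or median() gives a different kind of smoothing.
monthly_first <- FANG %>%
  group_by(symbol) %>%
  summarise_by_time(
    date,
    .by      = "month",
    adjusted = first(adjusted)
  )

monthly_first
```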
Filter By Time
Used to quickly filter a continuous time range.
Time Range Filtering
Objective: Get the adjusted stock prices from September through December of 2013.
- .start_date = "2013-09": Converts to "2013-09-01"
- .end_date = "2013": Converts to "2013-12-31"
- A more advanced example of filtering using %+time% and %-time% is shown in “Padding Data: Low to High Frequency”.
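The time range filter described above might look like this; fang_late_2013 is an illustrative name.

```r
library(dplyr)
library(timetk)

# Partial dates are expanded: "2013-09" becomes the first day of that month,
# and "2013" as an end date becomes the last day of that year.
fang_late_2013 <- FANG %>%
  group_by(symbol) %>%
  filter_by_time(date, .start_date = "2013-09", .end_date = "2013")

# Check the date range that was kept
range(fang_late_2013$date)
```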
Padding Data
Used to fill in (pad) gaps and to go from low frequency to high frequency. This function uses the awesome padr library for filling and expanding timestamps.
Fill in Gaps
Objective: Make an irregular series regular.
- We will leave padded values as NA.
- We can add a value using .pad_value or we can impute using a function like ts_impute_vec() (shown next).
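A minimal sketch of gap filling, assuming a daily target frequency is stated explicitly with .by = "day" (.by = "auto" lets the function guess the interval); fang_daily is an illustrative name.

```r
library(dplyr)
library(timetk)

# Insert the missing weekend and holiday dates for every symbol.
# Padded rows get NA in all value columns because .pad_value defaults to NA.
fang_daily <- FANG %>%
  group_by(symbol) %>%
  pad_by_time(date, .by = "day")

# Count how many rows were added by padding
fang_daily %>%
  ungroup() %>%
  summarise(padded_rows = sum(is.na(adjusted)))
```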
Low to High Frequency
Objective: Go from Daily to Hourly timestamp intervals for 1 month from the start date. Impute the missing values.
- .by = "hour" pads from daily to hourly
- Imputation of hourly data is accomplished with ts_impute_vec(), which performs linear interpolation when period = 1.
- Filtering is accomplished using:
  - “start”: A special keyword that signals the start of a series
  - FIRST(date) %+time% "1 month": Selecting the first date in the sequence then using a special infix operation, %+time%, called “add time”. In this case I add “1 month”.
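Putting the three steps together, a sketch under a few assumptions: dplyr's first() stands in for FIRST(), across() replaces whatever column-wise helper you prefer, and fang_hourly is an illustrative name.

```r
library(dplyr)
library(timetk)

fang_hourly <- FANG %>%
  group_by(symbol) %>%
  # 1. Expand each daily series to hourly timestamps (new rows are NA)
  pad_by_time(date, .by = "hour") %>%
  # 2. Fill the NA values; period = 1 requests linear interpolation
  mutate(across(open:adjusted, ~ ts_impute_vec(.x, period = 1))) %>%
  # 3. Keep the first month of each series
  filter_by_time(date, "start", first(date) %+time% "1 month")

fang_hourly
```

Imputing before filtering means the interpolation sees the full series, so the last hours of the kept month are interpolated toward the next real observation rather than left as NA.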
Sliding (Rolling) Calculations
We have a new function, slidify(), that turns any function into a sliding (rolling) window function. It takes concepts from tibbletime::rollify() and improves them with the R package slider.
Rolling Mean
Objective: Calculate a “centered” simple rolling average with partial window rolling at the start and end windows.
slidify() turns the AVERAGE() function into a rolling average.
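A sketch of the rolling average, assuming a 30-observation window (the window length is a choice made for illustration) and base mean() in place of AVERAGE(); roll_avg_30 and fang_rolled are illustrative names.

```r
library(dplyr)
library(timetk)

# Build the rolling function once: centered window, partial windows allowed
roll_avg_30 <- slidify(.f = mean, .period = 30, .align = "center", .partial = TRUE)

# Apply it per symbol like any other vectorized function
fang_rolled <- FANG %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  mutate(rolling_avg_30 = roll_avg_30(adjusted))

fang_rolled
```

With .partial = TRUE the windows at the start and end of each series are computed from whatever observations are available, so no NA values appear at the edges.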
For simple rolling calculations (rolling average), we can accomplish this operation faster with slidify_vec() – A vectorized rolling function for simple summary rolls (e.g. mean(), sd(), sum(), etc.)
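The same rolling average written with slidify_vec(), again assuming a 30-observation centered window and mean() in place of AVERAGE(); fang_rolled_vec is an illustrative name.

```r
library(dplyr)
library(timetk)

# No separate rolling function is created; the roll happens inside mutate()
fang_rolled_vec <- FANG %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  mutate(
    rolling_avg_30 = slidify_vec(
      adjusted, ~ mean(.x),
      .period = 30, .align = "center", .partial = TRUE
    )
  )

fang_rolled_vec
```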
Rolling Regression
Objective: Calculate a rolling regression.
- This is a complex sliding (rolling) calculation that requires multiple columns to be involved. slidify() is built for this.
- Use the multi-variable purrr ..1, ..2, ..3, etc. notation to set up a function
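A sketch of a rolling regression along these lines. The 90-observation window, the choice of regressors (volume and a numeric version of the date), and the names lm_roll and fang_lm are assumptions made for illustration.

```r
library(dplyr)
library(timetk)

# ..1, ..2, ..3 refer to the first, second, and third vectors passed in.
# .unlist = FALSE keeps each fitted model as a list element.
lm_roll <- slidify(
  ~ lm(..1 ~ ..2 + ..3),
  .period = 90, .unlist = FALSE, .align = "right"
)

fang_lm <- FANG %>%
  select(symbol, date, adjusted, volume) %>%
  group_by(symbol) %>%
  mutate(numeric_date = as.numeric(date)) %>%
  # One lm fit per trailing 90-observation window
  mutate(rolling_lm = lm_roll(adjusted, volume, numeric_date)) %>%
  # Incomplete windows at the start of each series come back as NA
  filter(!is.na(rolling_lm))

fang_lm
```

Each element of the rolling_lm list column is a fitted model, so coefficients can be pulled out afterwards with, for example, coef() applied element by element.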
And, if you plan on using timetk for your business, it’s a no-brainer – Join my Time Series Course Waitlist (It’s coming, it’s really insane).