Introduction
Using Machine Learning (ML) and past price data to predict the next period's price or direction in the stock market is nothing new, and it rarely produces meaningful predictions. In this post I collapse a set of asset time series down into a simple classification problem and see whether a Machine Learning model can do a better job of predicting the next period's direction. I apply a method similar to the one in my earlier post, Time Series Classification Synthetic vs Real Financial Time Series. The objective of the strategy is to invest in a single asset each day: the asset which the Machine Learning model is most confident will go up in share value in the next period \(t+1\). In other words, we invest in the asset with the highest predicted probability of going up in value tomorrow. That is, if on day \(t\) the model predicts that GOOG will close above its previous close with a predicted probability of 0.78 and that AMZN will go up with probability 0.53, then we invest in GOOG today (we only invest in one asset each day). The model could be extended to short selling, multi-asset portfolios and multi-period horizons.
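To make the daily selection rule concrete, here is a tiny base R sketch with made-up predicted probabilities (the tickers and numbers are purely illustrative):

# Toy illustration of the daily pick: choose the asset with the highest
# predicted probability of closing up (probabilities are made up)
preds_today <- c(GOOG = 0.78, AMZN = 0.53, MSFT = 0.61)
names(which.max(preds_today)) # "GOOG" - the asset we would hold today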
Load in the packages.
require(PerformanceAnalytics)
library(data.table)
library(dplyr)
library(tibble)
library(TTR)
library(tidyr)
library(tidyquant)
library(tsfeatures)
library(rsample)
library(purrr)
library(stringr)
library(tibbletime) # tsibble clashes with the base R index() function
library(xgboost)
library(rvest)
library(knitr)      # used for the kable() tables later in the post
library(kableExtra) # used for kable_styling() later in the post
Pre-define a few initialisation objects and set the ticker symbols of the companies we want to download. For this task I am not really interested in which companies I apply the strategy to, so I scrape the Wikipedia page for the S&P 500 and take a random sample of 30 tickers.
set.seed(1234)

###################### Pre-define functions for later ##########################
Scale_Me <- function(x){
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
} # Note: I don't actually use this function but I leave it in here.
#################################################################################

start_date <- "2017-01-01"
end_date <- "2020-01-01"

url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

symbols <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="constituents"]') %>%
  html_table() %>%
  .[[1]] %>%
  filter(!str_detect(Security, "Class A|Class B|Class C")) %>% # Removes firms with Class A, B & C shares
  sample_n(30) %>%
  pull(Symbol)

#symbols <- c(
#  'GOOG', 'MSFT', 'HOG', 'AAPL', 'FB'
#  'AMZN', 'EBAY', 'IBM', 'NFLX', 'NVDA',
#  'TWTR', 'WMT', 'XRX', 'INTC', 'HPE'
#  )
The data
Download the data and store it into a new environment.
dataEnv <- new.env()

getSymbols(symbols,
           from = start_date,
           to = end_date,
           #src = "yahoo",
           #adjust = TRUE,
           env = dataEnv)

##  [1] "LEG"  "NLSN" "SLB"  "CHTR" "C"    "REGN" "CCI"  "SYK"  "ROP"  "RL"
## [11] "CERN" "CMG"  "GS"   "CAT"  "MSI"  "BR"   "VRSK" "PNC"  "KEYS" "PHM"
## [21] "FB"   "BKR"  "ABMD" "WYNN" "DG"   "ADI"  "GL"   "TSCO" "FLS"  "CDW"
Once the data has been downloaded and stored in a new environment, I clean it up a little, combine the individual lists into a single data frame, compute the daily returns for each asset and create the up/down direction variable which the classification model will try to predict.
df <- eapply(dataEnv, function(x){
  as.data.frame(x) %>%
    rename_all(function(n){
      gsub("^(\\w+)\\.", "", n, perl = TRUE)
    }
    ) %>%
    rownames_to_column("date")
}) %>%
  rbindlist(idcol = TRUE) %>%
  mutate(date = as.Date(date)) %>%
  group_by(.id) %>%
  tq_mutate(
    select = Adjusted,
    mutate_fun = periodReturn,
    period = "daily",
    type = "arithmetic"
  ) %>%
  mutate(
    Adj_lag = lag(Adjusted),
    chng_Adj = ifelse(Adjusted > Adj_lag, 1, 0) # more simply we could have just done if ret were pos/neg
  ) %>%
  select("date", ".id", "Adjusted", "daily.returns", "chng_Adj",
         "Open", "High", "Low", "Close") %>%
  as_tibble() %>%
  as_tbl_time(index = date) %>%
  setNames(c("date", "ID", "prc", "ret", "chng", "open", "high", "low", "close")) %>%
  drop_na(chng)
The first few observations of the data look like:
date | ID | prc | ret | chng | open | high | low | close |
---|---|---|---|---|---|---|---|---|
2017-01-04 | CDW | 50.63446 | 0.0162981 | 1 | 51.79 | 52.60 | 51.79 | 52.38 |
2017-01-05 | CDW | 50.14147 | -0.0097364 | 0 | 52.10 | 52.66 | 51.84 | 51.87 |
2017-01-06 | CDW | 49.96746 | -0.0034704 | 0 | 51.89 | 51.95 | 51.46 | 51.69 |
We can use the nest() function to put the data into convenient nested tibbles that we can simply map() over, applying the rolling_origin() function from the rsample package. This way each asset gets its own rolling_origin() splits without any overlap or mixing between assets. I do this in order to create the time series features for each period.
nested_df <- df %>% mutate(duplicate_ID = ID) %>% nest(-ID)
I split the time series data into a number of splits such that each analysis() set contains 100 observations and the corresponding assessment() set contains 1 observation. Usually analysis() becomes our training data and assessment() our testing data; here, however, I am using rolling_origin() purely to help create the time series features.
# First we set the number of days we want to construct the ts features
rolled_df <- map(nested_df$data, ~ rolling_origin(.,
                                                  initial    = 100,
                                                  assess     = 1,
                                                  cumulative = FALSE,
                                                  skip       = 0))
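As a quick sanity check, each split should contain 100 in-sample rows and a single out-of-sample row; the snippet below (purely illustrative, inspecting the first asset's first split) confirms this:

# Inspect the first split of the first asset (sanity check only)
first_split <- rolled_df[[1]]$splits[[1]]
dim(analysis(first_split))   # 100 rows of in-sample data
dim(assessment(first_split)) #   1 out-of-sample row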
Time-Series Functions
In order to create the time series variables I use the tsfeatures package, although the feasts package offers similar functionality. For this model I simply select a few functions of interest from the tsfeatures package.
functions <- c(
  "entropy",          # Measures the "forecastability" of a series - low values = high sig-to-noise, large vals = difficult to forecast
  "stability",        # Means/variances are computed for all tiled windows - stability is the variance of the means
  "lumpiness",        # Lumpiness is the variance of the variances
  "max_level_shift",  # Finds the largest mean shift between two consecutive windows (returns two values, size of shift and time index of shift)
  "max_var_shift",    # Finds the max variance shift between two consecutive windows (returns two values, size of shift and time index of shift)
  "max_kl_shift",     # Finds the largest shift in the Kullback-Leibler divergence between two consecutive windows (returns two values, size of shift and time index of shift)
  "crossing_points"   # Number of times a series crosses the mean line
)
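To get a feel for what these functions return, we can apply them to a single simulated return series. Each of the max_*_shift functions contributes two values, so the seven names above collapse a 100-observation window down to ten numbers (which later become X1 to X10). A minimal illustration on random data:

# Illustrative only: apply the selected tsfeatures functions to simulated returns
tsfeatures(rnorm(100), functions)
# Returns a 1-row tibble with 10 columns: entropy, stability, lumpiness,
# the three *_shift pairs and crossing_points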
I wrote this code a little while ago and at the time I wrapped the model into a function; today I think it would be more fun to stick exclusively to map() functions instead of function(SYMB). The function does the following for each asset in our data:

First, it takes the out-of-sample t+1 (assessment) data and binds these lists together into a single data frame. Next it uses the functions character string to call the corresponding functions from the tsfeatures package and applies them to the in-sample (analysis) data (which consists of 100 observations per split), so that each window collapses down to a single observation we can bind together. Finally, the two data sets are joined column-wise using bind_cols(). After this I rename the chng variable to Y and rename the time series feature variables to something more dynamic using ~str_c("X", seq_along(.)), so we can add functions to the functions character string without having to rename the variables individually for the model to work.
Once this is done, I create the Machine Learning data set, again using the rolling_origin() function. The first rolling_origin() call was used to collapse the time series data down on a rolling basis, taking the previous 100 days of data and computing the tsfeatures functions on them (a similar idea to computing a rolling mean/sd with the rollapply() function from the zoo package, as sketched below).
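To make the rollapply() analogy explicit, here is a small sketch (simulated data, not part of the strategy) of the same rolling-window idea, with a rolling standard deviation standing in for the vector of tsfeatures:

# Rough analogy: a rolling 100-day statistic computed with zoo::rollapply()
library(zoo) # already loaded as a dependency of the packages above
x <- rnorm(250)
roll_sd <- rollapply(x, width = 100, FUN = sd, align = "right")
length(roll_sd) # 151 values - one per 100-day window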
I next split the data into the X variables, X_train and X_test, and the corresponding Y variable, Y_train and Y_test. The xgboost package expects its own xgb.DMatrix() format, which is what dtrain and dtest provide.
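For reference, building an xgb.DMatrix just pairs a numeric feature matrix with a label vector; a minimal sketch with simulated data (not the strategy data) looks like this:

# Minimal xgb.DMatrix construction with simulated data (illustration only)
X <- matrix(rnorm(200 * 10), ncol = 10)
y <- rbinom(200, size = 1, prob = 0.5)
d_example <- xgb.DMatrix(data = X, label = y)
dim(d_example) # 200 rows, 10 features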
Then I set the XGBoost parameters and apply the XGBoost model. (Suitable cross validation should be performed at this point; however, I will leave this for another post since time series cross validation is quite tricky and I have not found an R function which helps with this type of cross validation as of 2020-02-02.)
Once the model has been trained, I make the predictions.
Apply the model
The function to compute all this is the following:
Prediction_Model <- function(SYMB){

  data <- bind_cols(
    map(rolled_df[[SYMB]]$splits, ~ assessment(.x)) %>%
      bind_rows(),
    map(rolled_df[[SYMB]]$splits, ~ analysis(.x)) %>%
      map(., ~ tsfeatures(.x[["ret"]], functions)) %>% # Compute the TSFeatures
      bind_rows()
  ) %>%
    rename(Y = chng) %>%
    rename_at(vars(-c(1:9)), ~ str_c("X", seq_along(.)))

  ml_data <- data %>%
    as_tibble() %>%
    rolling_origin(
      initial    = 200,
      assess     = 1,
      cumulative = FALSE,
      skip       = 0)

  X_train <- map(
    ml_data$splits,
    ~ analysis(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("X")) %>%
      as.matrix()
  )

  Y_train <- map(
    ml_data$splits,
    ~ analysis(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("Y")) %>%
      as.matrix()
  )

  X_test <- map(
    ml_data$splits,
    ~ assessment(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("X")) %>%
      as.matrix()
  )

  Y_test <- map(
    ml_data$splits,
    ~ assessment(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("Y")) %>%
      as.matrix()
  )

  #############################################################

  dtrain <- map2(
    X_train, Y_train,
    ~ xgb.DMatrix(data = .x, label = .y, missing = "NaN")
  )

  dtest <- map(
    X_test,
    ~ xgb.DMatrix(data = .x, missing = "NaN")
  )

  # Parameters:
  watchlist <- list("train" = dtrain)
  params <- list("eta" = 0.1,
                 "max_depth" = 5,
                 "colsample_bytree" = 1,
                 "min_child_weight" = 1,
                 "subsample" = 1,
                 "objective" = "binary:logistic",
                 "gamma" = 1,
                 "lambda" = 1,
                 "alpha" = 0,
                 "max_delta_step" = 0,
                 "colsample_bylevel" = 1,
                 "eval_metric" = "auc",
                 "set.seed" = 176)

  # Train the XGBoost model
  xgb.model <- map(
    dtrain,
    ~ xgboost(params = params, data = .x, nrounds = 10, watchlist)
  )

  xgb.pred <- map2(
    .x = xgb.model,
    .y = dtest,
    .f = ~ predict(.x, newdata = .y, type = 'prob')
  )

  preds <- cbind(plyr::ldply(xgb.pred, data.frame),
                 plyr::ldply(Y_test, data.frame)) %>%
    setNames(c("pred_probs", "actual")) %>%
    bind_cols(plyr::ldply(map(ml_data$splits, ~ assessment(.x)))) %>%
    rename(ID = duplicate_ID) %>%
    #select(pred_probs, actual, date, ID, prc, ret) %>%
    as_tibble()

  return(preds)
}
We can apply the above function to create the time series features, then train and test on each of our assets by running the following.
Sys_t_start <- Sys.time()
Resultados <- lapply(seq(1:length(rolled_df)), Prediction_Model)
Sys_t_end <- Sys.time()
round(Sys_t_end - Sys_t_start, 2)
The Resultados output gives us a list with one element per asset in our data. The first few observations of the first asset in the list look like:
pred_probs | actual | date | prc | ret | Y | open | high | low | close | ID | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.7304490 | 0 | 2018-03-15 | 73.17622 | -0.0106061 | 0 | 75.36 | 75.51 | 74.18 | 74.63 | CDW | 0.9870653 | 0.1149955 | 0.7047308 | 1.277064 | 71 | 3.161538 | 63 | 3.245055 | 76 | 56 |
0.5149571 | 1 | 2018-03-16 | 74.35286 | 0.0160795 | 1 | 74.78 | 75.99 | 73.04 | 75.83 | CDW | 0.9886519 | 0.0745409 | 0.8408280 | 1.273320 | 70 | 3.143027 | 62 | 3.227452 | 75 | 55 |
0.6207952 | 0 | 2018-03-19 | 72.53889 | -0.0243967 | 0 | 75.35 | 75.35 | 73.45 | 73.98 | CDW | 0.9902178 | 0.0901013 | 0.7192391 | 1.275344 | 69 | 3.153024 | 61 | 3.227452 | 74 | 55 |
These consist of the XGBoost predicted probabilities, the actual observed result, the date of the result (of the out-of-sample testing data), the observed share price, the calculated daily returns, a duplicate of the observed result, the OHLC data we collected from Yahoo and finally the time series features we constructed, renamed to \(X_{n}\).
The objective of this strategy is to invest every day in the asset with the highest predicted probability that the market is going to go up. That is, if on day \(t\) the model predicts that GOOG will close above its previous close with a predicted probability of 0.78 and that AMZN will go up with probability 0.53, then we invest in GOOG today; we only ever invest in the single asset with the highest predicted probability of going up.
Therefore, I create a new data frame called top_assets which keeps, for each day, the asset with the highest predicted probability across all assets.
top_assets <- plyr::ldply(Resultados) %>%
  #select(pred_probs, actual, date, open, high, low, close, prc, ret) %>%
  group_by(date) %>%
  arrange(desc(pred_probs)) %>%
  dplyr::slice(1) %>%
  ungroup() %>%
  select(date, everything()) %>%
  rename(score = pred_probs) %>%
  select(-actual)
Strategy Assessment
The first 10 days of the strategy investments look like:
date | score | prc | ret | Y | open | high | low | close | ID | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-03-15 | 0.7304490 | 73.17622 | -0.0106061 | 0 | 75.36 | 75.51 | 74.18 | 74.63 | CDW | 0.9870653 | 0.1149955 | 0.7047308 | 1.2770644 | 71 | 3.161538 | 63 | 3.245055 | 76 | 56 |
2018-03-16 | 0.6899720 | 293.04999 | -0.0051601 | 0 | 295.13 | 297.61 | 289.27 | 293.05 | ABMD | 0.9918187 | 0.0769474 | 2.4417643 | 1.0584676 | 69 | 4.917599 | 62 | 7.861935 | 80 | 39 |
2018-03-19 | 0.7299674 | 101.46883 | -0.0081591 | 0 | 108.98 | 109.06 | 107.40 | 108.19 | CCI | 0.9894902 | 0.0705445 | 0.3788407 | 0.9924987 | 68 | 1.786445 | 5 | 1.402126 | 89 | 46 |
2018-03-20 | 0.7370850 | 60.44966 | 0.0132920 | 1 | 65.13 | 66.10 | 65.13 | 65.56 | SLB | 0.9830999 | 0.1717591 | 0.1725298 | 1.1379607 | 53 | 1.853699 | 9 | 1.739650 | 10 | 47 |
2018-03-21 | 0.7003193 | 334.51999 | 0.0448199 | 1 | 320.94 | 339.20 | 320.32 | 334.52 | CMG | 0.9532525 | 0.0669860 | 2.4899030 | 1.6249110 | 67 | 4.775679 | 73 | 12.454513 | 10 | 57 |
2018-03-22 | 0.7438304 | 87.40306 | -0.0243534 | 0 | 91.53 | 92.43 | 90.48 | 90.54 | ADI | 0.9796797 | 0.0855250 | 0.8606480 | 1.4206984 | 65 | 2.718067 | 64 | 8.589078 | 72 | 49 |
2018-03-23 | 0.6494384 | 237.70613 | -0.0290578 | 0 | 253.63 | 253.99 | 244.93 | 245.26 | GS | 0.9685330 | 0.0615987 | 0.7610257 | 1.1137175 | 63 | 3.588928 | 58 | 4.667287 | 71 | 50 |
2018-03-26 | 0.6868502 | 70.18565 | 0.0238877 | 1 | 70.74 | 71.71 | 69.71 | 71.58 | CDW | 0.9885363 | 0.1109430 | 0.5278402 | 1.1776141 | 64 | 2.688307 | 56 | 3.038961 | 69 | 56 |
2018-03-27 | 0.7274348 | 57.21453 | -0.0020772 | 0 | 58.13 | 58.41 | 57.16 | 57.65 | CERN | 0.9787359 | 0.1971365 | 0.2600627 | 1.0608221 | 62 | 2.325379 | 56 | 2.115355 | 69 | 52 |
2018-03-28 | 0.7031060 | 68.56778 | -0.0039882 | 0 | 70.36 | 70.39 | 69.26 | 69.93 | CDW | 0.9869206 | 0.1328377 | 0.5033828 | 1.1567020 | 62 | 2.593677 | 54 | 3.010604 | 67 | 56 |
The score column is the predicted probability of the chosen asset, i.e. the highest probability across all assets that a price would close above its previous close, and the ID column gives us the ticker of the asset we invest in that day.
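Before converting top_assets to xts below, a quick illustrative check is to count how often each ticker gets picked:

# How concentrated are the picks? (illustrative check)
top_assets %>%
  count(ID, sort = TRUE) %>%
  head()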
Next I want to analyse the strategy of picking the predicted winners against the S&P 500 benchmark, so I also download the S&P 500 (via the SPY ticker).
top_assets <- xts(top_assets[, c(2:ncol(top_assets))],
                  order.by = top_assets$date) # put top_assets into xts format

# Analyse strategy
getSymbols("SPY",
           from = start_date,
           to = end_date,
           src = "yahoo")

## [1] "SPY"

#detach("package:tsibble", unload = TRUE) # tsibble clashes with the base R index() function

SPY$ret_Rb <- Delt(SPY$SPY.Adjusted)
SPY <- SPY[index(SPY) >= min(index(top_assets))]

RaRb <- cbind(top_assets, SPY)
From here we can see how the strategy compares with the S&P 500. I show a number of statistics for analysing asset returns from the PerformanceAnalytics package. I have not extended the model to include short selling or to construct multi-asset portfolios of the top \(N\) assets.
We can plot the performance of our strategy:
charts.PerformanceSummary(RaRb[, c("ret", "ret_Rb")],
                          geometric = FALSE,
                          wealth.index = FALSE,
                          main = "Strategy vs. Market")
Take a look at the drawdown and risk metrics.
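The chunk that produced the table below is not shown; one plausible way to assemble these metrics with PerformanceAnalytics (a sketch, the original call may have differed) is to stack table.DrawdownsRatio() and table.DownsideRiskRatio():

# Sketch: drawdown ratios stacked on top of downside risk measures
rbind(
  table.DrawdownsRatio(RaRb[, c("ret", "ret_Rb")]),
  table.DownsideRiskRatio(RaRb[, c("ret", "ret_Rb")])
)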
##                             ret ret_Rb
## Sterling ratio           0.1870 0.3879
## Calmar ratio             0.2551 0.5884
## Burke ratio              0.2251 0.5344
## Pain index               0.0955 0.0283
## Ulcer index              0.1189 0.0455
## Pain ratio               0.7337 4.0290
## Martin ratio             0.5891 2.5027
## daily downside risk      0.0111 0.0066
## Annualised downside risk 0.1768 0.1044
## Downside potential       0.0054 0.0029
## Omega                    1.0722 1.1601
## Sortino ratio            0.0351 0.0714
## Upside potential         0.0058 0.0034
## Upside potential ratio   0.7027 0.6124
## Omega-sharpe ratio       0.0722 0.1601
Take a closer look at the drawdown information.
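Again the original chunk is not shown; a sketch that produces this kind of per-series drawdown table is table.Drawdowns() applied to each return column:

# Sketch: top 5 drawdowns for the strategy (ret) and the benchmark (ret_Rb)
lapply(RaRb[, c("ret", "ret_Rb")], function(x) table.Drawdowns(x, top = 5))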
## $ret
##         From     Trough         To   Depth Length To Trough Recovery
## 1 2018-08-31 2019-01-03 2019-09-16 -0.2746    261        85      176
## 2 2019-11-06 2019-12-03       <NA> -0.1300     39        19       NA
## 3 2019-10-01 2019-10-18 2019-10-29 -0.0810     21        14        7
## 4 2018-03-22 2018-04-20 2018-05-09 -0.0773     34        21       13
## 5 2018-08-10 2018-08-15 2018-08-20 -0.0474      7         4        3
##
## $ret_Rb
##         From     Trough         To   Depth Length To Trough Recovery
## 1 2018-09-21 2018-12-24 2019-04-12 -0.1935    140        65       75
## 2 2019-05-06 2019-06-03 2019-06-20 -0.0662     33        20       13
## 3 2018-03-15 2018-04-02 2018-06-04 -0.0610     56        12       44
## 4 2019-07-29 2019-08-05 2019-10-25 -0.0602     64         6       58
## 5 2018-06-13 2018-06-27 2018-07-09 -0.0300     18        11        7
Compare the returns.
chart.Boxplot(RaRb[,c("ret", "ret_Rb")], main = "Returns")
Compare return statistics.
table.Stats(RaRb[, c("ret", "ret_Rb")]) %>%
  t() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
 | Observations | NAs | Minimum | Quartile 1 | Median | Arithmetic Mean | Geometric Mean | Quartile 3 | Maximum | SE Mean | LCL Mean (0.95) | UCL Mean (0.95) | Variance | Stdev | Skewness | Kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ret | 453 | 0 | -0.0669 | -0.0068 | 0.0006 | 0.0004 | 0.0003 | 0.0087 | 0.0642 | 0.0007 | -0.0011 | 0.0018 | 0.0002 | 0.0156 | -0.2542 | 2.8842 |
ret_Rb | 453 | 0 | -0.0324 | -0.0030 | 0.0006 | 0.0005 | 0.0004 | 0.0054 | 0.0505 | 0.0004 | -0.0004 | 0.0013 | 0.0001 | 0.0091 | -0.2949 | 3.6264 |
Compare Sharpe Information.
lapply(RaRb[, c("ret", "ret_Rb")], function(x){SharpeRatio(x)})

## $ret
##                                       ret
## StdDev Sharpe (Rf=0%, p=95%): 0.025027498
## VaR Sharpe (Rf=0%, p=95%):    0.015346462
## ES Sharpe (Rf=0%, p=95%):     0.009618405
##
## $ret_Rb
##                                   ret_Rb
## StdDev Sharpe (Rf=0%, p=95%): 0.05152014
## VaR Sharpe (Rf=0%, p=95%):    0.03218952
## ES Sharpe (Rf=0%, p=95%):     0.01913213
Plot the risk-return scatter plot.
chart.RiskReturnScatter(RaRb[, c("ret", "ret_Rb")], # check this plot a little more
                        Rf = .03/252,
                        scale = 252, # for daily data
                        main = "Risk - Return over the period")
Plot the rolling return, risk and Sharpe performance.
charts.RollingPerformance(RaRb[, c("ret", "ret_Rb")],
                          Rf = .03/12,
                          colorset = c("red", rep("darkgray", 5), "orange", "green"),
                          lwd = 2)
Compute the yearly returns.
lapply(RaRb[, c("ret")], function(x){periodReturn(
  x, period = 'yearly', type = 'arithmetic')}) # change type to log for continuous

## $ret
##            yearly.returns
## 2018-12-31      -1.855083
## 2019-12-31      -1.475181

lapply(RaRb[, c("ret_Rb")], function(x){periodReturn(
  x, period = 'yearly', type = 'arithmetic')})

## $ret_Rb
##            yearly.returns
## 2018-12-31     -9.0376638
## 2019-12-31     -0.7226497