Introduction
Using Machine Learning (ML) and past price data to predict the next period's price or direction in the stock market is nothing new, and it rarely produces meaningful predictions. In this post I collapse a set of asset time series down into a simple classification problem and see whether a Machine Learning model can do a better job of predicting the next period's direction. I apply a method similar to the one in my earlier post, Time Series Classification Synthetic vs Real Financial Time Series. The objective of the strategy is to invest in a single asset each day: the asset which the Machine Learning model is most confident will go up in share value in the next period \(t+1\). In other words, we invest in the asset with the highest predicted probability of going up in value tomorrow. That is, if on day \(t\) the model predicts that GOOG will close above its previous close with a predicted probability of 0.78 and that AMZN will go up with probability 0.53, then we invest in GOOG today (we only invest in one asset each day). The model could be extended to short selling, multi-asset portfolios and multi-period horizons.
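To make the daily selection rule concrete, here is a tiny base R sketch with made-up predicted probabilities (the tickers and numbers are purely illustrative):

# Toy illustration of the daily pick: choose the asset with the highest
# predicted probability of closing up (probabilities are made up)
preds_today <- c(GOOG = 0.78, AMZN = 0.53, MSFT = 0.61)
names(which.max(preds_today)) # "GOOG" - the asset we would hold today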
Load in the packages.
require(PerformanceAnalytics)
library(data.table)
library(dplyr)
library(tibble)
library(TTR)
library(tidyr)
library(tidyquant)
library(tsfeatures)
library(rsample)
library(purrr)
library(stringr)
library(tibbletime) # tsibble clashes with the base R index() function
library(xgboost)
library(rvest)
library(knitr)      # used for the kable() tables later in the post
library(kableExtra) # used for kable_styling() later in the post
Pre-define a few initialisation objects and set the ticker symbols of the companies we want to download. For this task I am not really interested in which companies I apply the strategy to, so I scrape the Wikipedia page for the S&P 500 and take a random sample of 30 tickers.
set.seed(1234)

###################### Pre-define functions for later ##########################
Scale_Me <- function(x){
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
} # Note: I don't actually use this function but I leave it in here.
#################################################################################

start_date <- "2017-01-01"
end_date <- "2020-01-01"

url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

symbols <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="constituents"]') %>%
  html_table() %>%
  .[[1]] %>%
  filter(!str_detect(Security, "Class A|Class B|Class C")) %>% # Removes firms with Class A, B & C shares
  sample_n(30) %>%
  pull(Symbol)

#symbols <- c(
#  'GOOG', 'MSFT', 'HOG', 'AAPL', 'FB'
#  'AMZN', 'EBAY', 'IBM', 'NFLX', 'NVDA',
#  'TWTR', 'WMT', 'XRX', 'INTC', 'HPE'
#  )
The data
Download the data and store it into a new environment.
dataEnv <- new.env()

getSymbols(symbols,
           from = start_date,
           to = end_date,
           #src = "yahoo",
           #adjust = TRUE,
           env = dataEnv)

##  [1] "LEG"  "NLSN" "SLB"  "CHTR" "C"    "REGN" "CCI"  "SYK"  "ROP"  "RL"
## [11] "CERN" "CMG"  "GS"   "CAT"  "MSI"  "BR"   "VRSK" "PNC"  "KEYS" "PHM"
## [21] "FB"   "BKR"  "ABMD" "WYNN" "DG"   "ADI"  "GL"   "TSCO" "FLS"  "CDW"
Once the data has been downloaded and stored in a new environment, I clean it up a little, combine the individual lists into a single data frame, compute the daily returns for each asset and create the up/down direction variable which the classification model will try to predict.
df <- eapply(dataEnv, function(x){
  as.data.frame(x) %>%
    rename_all(function(n){
      gsub("^(\\w+)\\.", "", n, perl = TRUE)
    }
    ) %>%
    rownames_to_column("date")
}) %>%
  rbindlist(idcol = TRUE) %>%
  mutate(date = as.Date(date)) %>%
  group_by(.id) %>%
  tq_mutate(
    select = Adjusted,
    mutate_fun = periodReturn,
    period = "daily",
    type = "arithmetic"
  ) %>%
  mutate(
    Adj_lag = lag(Adjusted),
    chng_Adj = ifelse(Adjusted > Adj_lag, 1, 0) # more simply we could have just done if ret were pos/neg
  ) %>%
  select("date", ".id", "Adjusted", "daily.returns", "chng_Adj",
         "Open", "High", "Low", "Close") %>%
  as_tibble() %>%
  as_tbl_time(index = date) %>%
  setNames(c("date", "ID", "prc", "ret", "chng", "open", "high", "low", "close")) %>%
  drop_na(chng)
The first few observations of the data look like:
date | ID | prc | ret | chng | open | high | low | close |
---|---|---|---|---|---|---|---|---|
2017-01-04 | CDW | 50.63446 | 0.0162981 | 1 | 51.79 | 52.60 | 51.79 | 52.38 |
2017-01-05 | CDW | 50.14147 | -0.0097364 | 0 | 52.10 | 52.66 | 51.84 | 51.87 |
2017-01-06 | CDW | 49.96746 | -0.0034704 | 0 | 51.89 | 51.95 | 51.46 | 51.69 |
We can use the nest() function to put the data into convenient nested tibbles that we can simply map() over, applying the rolling_origin() function from the rsample package. This way each asset gets its own rolling_origin() splits without any overlap or mixing between assets. I do this in order to create the time series features for each period.
nested_df <- df %>% mutate(duplicate_ID = ID) %>% nest(-ID)
I split the time series data into a number of splits such that each analysis() set contains 100 observations and the corresponding assessment() set contains 1 observation. Usually analysis() becomes our training data and assessment() our testing data; here, however, I am using rolling_origin() purely to help create the time series features.
# First we set the number of days we want to construct the ts features
rolled_df <- map(nested_df$data, ~ rolling_origin(.,
                                                  initial    = 100,
                                                  assess     = 1,
                                                  cumulative = FALSE,
                                                  skip       = 0))
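As a quick sanity check, each split should contain 100 in-sample rows and a single out-of-sample row; the snippet below (purely illustrative, inspecting the first asset's first split) confirms this:

# Inspect the first split of the first asset (sanity check only)
first_split <- rolled_df[[1]]$splits[[1]]
dim(analysis(first_split))   # 100 rows of in-sample data
dim(assessment(first_split)) #   1 out-of-sample row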
Time-Series Functions
In order to create the time series variables I use the tsfeatures package, although the feasts package offers similar functionality. For this model I simply select a few functions of interest from the tsfeatures package.
functions <- c(
  "entropy",          # Measures the "forecastability" of a series - low values = high sig-to-noise, large vals = difficult to forecast
  "stability",        # Means/variances are computed for all tiled windows - stability is the variance of the means
  "lumpiness",        # Lumpiness is the variance of the variances
  "max_level_shift",  # Finds the largest mean shift between two consecutive windows (returns two values, size of shift and time index of shift)
  "max_var_shift",    # Finds the max variance shift between two consecutive windows (returns two values, size of shift and time index of shift)
  "max_kl_shift",     # Finds the largest shift in the Kullback-Leibler divergence between two consecutive windows (returns two values, size of shift and time index of shift)
  "crossing_points"   # Number of times a series crosses the mean line
)
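To get a feel for what these functions return, we can apply them to a single simulated return series. Each of the max_*_shift functions contributes two values, so the seven names above collapse a 100-observation window down to ten numbers (which later become X1 to X10). A minimal illustration on random data:

# Illustrative only: apply the selected tsfeatures functions to simulated returns
tsfeatures(rnorm(100), functions)
# Returns a 1-row tibble with 10 columns: entropy, stability, lumpiness,
# the three *_shift pairs and crossing_points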
I wrote this code a little while ago and at the time I wrapped the model into a function; today I think it would be more fun to stick exclusively to map() functions instead of function(SYMB). The function does the following for each asset in our data:

First, it takes the out-of-sample t+1 (assessment) data and binds these lists together into a single data frame. Next it uses the functions character string to call the corresponding functions from the tsfeatures package and applies them to the in-sample (analysis) data (which consists of 100 observations per split), so that each window collapses down to a single observation we can bind together. Finally, the two data sets are joined column-wise using bind_cols(). After this I rename the chng variable to Y and rename the time series feature variables to something more dynamic using ~str_c("X", seq_along(.)), so we can add functions to the functions character string without having to rename the variables individually for the model to work.
Once this is done, I create the Machine Learning data set, again using the rolling_origin() function. The first rolling_origin() call was used to collapse the time series data down on a rolling basis, taking the previous 100 days of data and computing the tsfeatures functions on them (a similar idea to computing a rolling mean/sd with the rollapply() function from the zoo package, as sketched below).
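To make the rollapply() analogy explicit, here is a small sketch (simulated data, not part of the strategy) of the same rolling-window idea, with a rolling standard deviation standing in for the vector of tsfeatures:

# Rough analogy: a rolling 100-day statistic computed with zoo::rollapply()
library(zoo) # already loaded as a dependency of the packages above
x <- rnorm(250)
roll_sd <- rollapply(x, width = 100, FUN = sd, align = "right")
length(roll_sd) # 151 values - one per 100-day window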
I next split the data into the X variables, X_train and X_test, and the corresponding Y variable, Y_train and Y_test. The xgboost package expects its own xgb.DMatrix() format, which is what dtrain and dtest provide.
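For reference, building an xgb.DMatrix just pairs a numeric feature matrix with a label vector; a minimal sketch with simulated data (not the strategy data) looks like this:

# Minimal xgb.DMatrix construction with simulated data (illustration only)
X <- matrix(rnorm(200 * 10), ncol = 10)
y <- rbinom(200, size = 1, prob = 0.5)
d_example <- xgb.DMatrix(data = X, label = y)
dim(d_example) # 200 rows, 10 features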
Then I set the XGBoost parameters and apply the XGBoost model. (Suitable cross validation should be performed at this point; however, I will leave this for another post since time series cross validation is quite tricky and I have not found an R function which helps with this type of cross validation as of 2020-02-02.)
Once the model has been trained, I make the predictions.
Apply the model
The function to compute all this is the following:
Prediction_Model <- function(SYMB){

  data <- bind_cols(
    map(rolled_df[[SYMB]]$splits, ~ assessment(.x)) %>%
      bind_rows(),
    map(rolled_df[[SYMB]]$splits, ~ analysis(.x)) %>%
      map(., ~ tsfeatures(.x[["ret"]], functions)) %>% # Compute the TSFeatures
      bind_rows()
  ) %>%
    rename(Y = chng) %>%
    rename_at(vars(-c(1:9)), ~ str_c("X", seq_along(.)))

  ml_data <- data %>%
    as_tibble() %>%
    rolling_origin(
      initial    = 200,
      assess     = 1,
      cumulative = FALSE,
      skip       = 0)

  X_train <- map(
    ml_data$splits,
    ~ analysis(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("X")) %>%
      as.matrix()
  )

  Y_train <- map(
    ml_data$splits,
    ~ analysis(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("Y")) %>%
      as.matrix()
  )

  X_test <- map(
    ml_data$splits,
    ~ assessment(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("X")) %>%
      as.matrix()
  )

  Y_test <- map(
    ml_data$splits,
    ~ assessment(.x) %>%
      as_tibble(., .name_repair = "universal") %>%
      select(starts_with("Y")) %>%
      as.matrix()
  )

  #############################################################

  dtrain <- map2(
    X_train, Y_train,
    ~ xgb.DMatrix(data = .x, label = .y, missing = "NaN")
  )

  dtest <- map(
    X_test,
    ~ xgb.DMatrix(data = .x, missing = "NaN")
  )

  # Parameters:
  watchlist <- list("train" = dtrain)
  params <- list("eta" = 0.1,
                 "max_depth" = 5,
                 "colsample_bytree" = 1,
                 "min_child_weight" = 1,
                 "subsample" = 1,
                 "objective" = "binary:logistic",
                 "gamma" = 1,
                 "lambda" = 1,
                 "alpha" = 0,
                 "max_delta_step" = 0,
                 "colsample_bylevel" = 1,
                 "eval_metric" = "auc",
                 "set.seed" = 176)

  # Train the XGBoost model
  xgb.model <- map(
    dtrain,
    ~ xgboost(params = params, data = .x, nrounds = 10, watchlist)
  )

  xgb.pred <- map2(
    .x = xgb.model,
    .y = dtest,
    .f = ~ predict(.x, newdata = .y, type = 'prob')
  )

  preds <- cbind(plyr::ldply(xgb.pred, data.frame),
                 plyr::ldply(Y_test, data.frame)) %>%
    setNames(c("pred_probs", "actual")) %>%
    bind_cols(plyr::ldply(map(ml_data$splits, ~ assessment(.x)))) %>%
    rename(ID = duplicate_ID) %>%
    #select(pred_probs, actual, date, ID, prc, ret) %>%
    as_tibble()

  return(preds)
}
We can apply the above function to create the time series features, then train and test on each of our assets by running the following.
Sys_t_start <- Sys.time()
Resultados <- lapply(seq(1:length(rolled_df)), Prediction_Model)
Sys_t_end <- Sys.time()
round(Sys_t_end - Sys_t_start, 2)
The Resultados output gives us a list with one element per asset in our data. The first few observations of the first asset in the list look like:
pred_probs | actual | date | prc | ret | Y | open | high | low | close | ID | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.7304490 | 0 | 2018-03-15 | 73.17622 | -0.0106061 | 0 | 75.36 | 75.51 | 74.18 | 74.63 | CDW | 0.9870653 | 0.1149955 | 0.7047308 | 1.277064 | 71 | 3.161538 | 63 | 3.245055 | 76 | 56 |
0.5149571 | 1 | 2018-03-16 | 74.35286 | 0.0160795 | 1 | 74.78 | 75.99 | 73.04 | 75.83 | CDW | 0.9886519 | 0.0745409 | 0.8408280 | 1.273320 | 70 | 3.143027 | 62 | 3.227452 | 75 | 55 |
0.6207952 | 0 | 2018-03-19 | 72.53889 | -0.0243967 | 0 | 75.35 | 75.35 | 73.45 | 73.98 | CDW | 0.9902178 | 0.0901013 | 0.7192391 | 1.275344 | 69 | 3.153024 | 61 | 3.227452 | 74 | 55 |
These consist of the XGBoost predicted probabilities, the actual observed result, the date of the result (of the out-of-sample testing data), the observed share price, the calculated daily returns, a duplicate of the observed result, the OHLC data we collected from Yahoo and finally the time series features we constructed, renamed to \(X_{n}\).
The objective of this strategy is to invest every day in the asset with the highest predicted probability that the market is going to go up. That is, if on day \(t\) the model predicts that GOOG will close above its previous close with a predicted probability of 0.78 and that AMZN will go up with probability 0.53, then we invest in GOOG today; we only ever invest in the single asset with the highest predicted probability of going up.
Therefore, I create a new data frame called top_assets which keeps, for each day, the asset with the highest predicted probability across all assets.
top_assets <- plyr::ldply(Resultados) %>%
  #select(pred_probs, actual, date, open, high, low, close, prc, ret) %>%
  group_by(date) %>%
  arrange(desc(pred_probs)) %>%
  dplyr::slice(1) %>%
  ungroup() %>%
  select(date, everything()) %>%
  rename(score = pred_probs) %>%
  select(-actual)
Strategy Assessment
The first 10 days of the strategy investments look like:
date | score | prc | ret | Y | open | high | low | close | ID | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-03-15 | 0.7304490 | 73.17622 | -0.0106061 | 0 | 75.36 | 75.51 | 74.18 | 74.63 | CDW | 0.9870653 | 0.1149955 | 0.7047308 | 1.2770644 | 71 | 3.161538 | 63 | 3.245055 | 76 | 56 |
2018-03-16 | 0.6899720 | 293.04999 | -0.0051601 | 0 | 295.13 | 297.61 | 289.27 | 293.05 | ABMD | 0.9918187 | 0.0769474 | 2.4417643 | 1.0584676 | 69 | 4.917599 | 62 | 7.861935 | 80 | 39 |
2018-03-19 | 0.7299674 | 101.46883 | -0.0081591 | 0 | 108.98 | 109.06 | 107.40 | 108.19 | CCI | 0.9894902 | 0.0705445 | 0.3788407 | 0.9924987 | 68 | 1.786445 | 5 | 1.402126 | 89 | 46 |
2018-03-20 | 0.7370850 | 60.44966 | 0.0132920 | 1 | 65.13 | 66.10 | 65.13 | 65.56 | SLB | 0.9830999 | 0.1717591 | 0.1725298 | 1.1379607 | 53 | 1.853699 | 9 | 1.739650 | 10 | 47 |
2018-03-21 | 0.7003193 | 334.51999 | 0.0448199 | 1 | 320.94 | 339.20 | 320.32 | 334.52 | CMG | 0.9532525 | 0.0669860 | 2.4899030 | 1.6249110 | 67 | 4.775679 | 73 | 12.454513 | 10 | 57 |
2018-03-22 | 0.7438304 | 87.40306 | -0.0243534 | 0 | 91.53 | 92.43 | 90.48 | 90.54 | ADI | 0.9796797 | 0.0855250 | 0.8606480 | 1.4206984 | 65 | 2.718067 | 64 | 8.589078 | 72 | 49 |
2018-03-23 | 0.6494384 | 237.70613 | -0.0290578 | 0 | 253.63 | 253.99 | 244.93 | 245.26 | GS | 0.9685330 | 0.0615987 | 0.7610257 | 1.1137175 | 63 | 3.588928 | 58 | 4.667287 | 71 | 50 |
2018-03-26 | 0.6868502 | 70.18565 | 0.0238877 | 1 | 70.74 | 71.71 | 69.71 | 71.58 | CDW | 0.9885363 | 0.1109430 | 0.5278402 | 1.1776141 | 64 | 2.688307 | 56 | 3.038961 | 69 | 56 |
2018-03-27 | 0.7274348 | 57.21453 | -0.0020772 | 0 | 58.13 | 58.41 | 57.16 | 57.65 | CERN | 0.9787359 | 0.1971365 | 0.2600627 | 1.0608221 | 62 | 2.325379 | 56 | 2.115355 | 69 | 52 |
2018-03-28 | 0.7031060 | 68.56778 | -0.0039882 | 0 | 70.36 | 70.39 | 69.26 | 69.93 | CDW | 0.9869206 | 0.1328377 | 0.5033828 | 1.1567020 | 62 | 2.593677 | 54 | 3.010604 | 67 | 56 |
The score column is the predicted probability of the chosen asset, i.e. the highest probability across all assets that a price would close above its previous close, and the ID column gives us the ticker of the asset we invest in that day.
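Before converting top_assets to xts below, a quick illustrative check is to count how often each ticker gets picked:

# How concentrated are the picks? (illustrative check)
top_assets %>%
  count(ID, sort = TRUE) %>%
  head()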
Next I want to analyse the strategy of picking the predicted winners against the S&P 500 benchmark, so I also download the S&P 500 (via the SPY ticker).
top_assets <- xts(top_assets[, c(2:ncol(top_assets))],
                  order.by = top_assets$date) # put top_assets into xts format

# Analyse strategy
getSymbols("SPY",
           from = start_date,
           to = end_date,
           src = "yahoo")

## [1] "SPY"

#detach("package:tsibble", unload = TRUE) # tsibble clashes with the base R index() function

SPY$ret_Rb <- Delt(SPY$SPY.Adjusted)
SPY <- SPY[index(SPY) >= min(index(top_assets))]

RaRb <- cbind(top_assets, SPY)
From here we can see how the strategy compares with the S&P 500. I show a number of statistics for analysing asset returns from the PerformanceAnalytics package. I have not extended the model to include short selling or to construct multi-asset portfolios of the top \(N\) assets.
We can plot the performance of our strategy:
charts.PerformanceSummary(RaRb[, c("ret", "ret_Rb")],
                          geometric = FALSE,
                          wealth.index = FALSE,
                          main = "Strategy vs. Market")
Take a look at the drawdown and risk metrics.
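The chunk that produced the table below is not shown; one plausible way to assemble these metrics with PerformanceAnalytics (a sketch, the original call may have differed) is to stack table.DrawdownsRatio() and table.DownsideRiskRatio():

# Sketch: drawdown ratios stacked on top of downside risk measures
rbind(
  table.DrawdownsRatio(RaRb[, c("ret", "ret_Rb")]),
  table.DownsideRiskRatio(RaRb[, c("ret", "ret_Rb")])
)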
##                             ret ret_Rb
## Sterling ratio           0.1870 0.3879
## Calmar ratio             0.2551 0.5884
## Burke ratio              0.2251 0.5344
## Pain index               0.0955 0.0283
## Ulcer index              0.1189 0.0455
## Pain ratio               0.7337 4.0290
## Martin ratio             0.5891 2.5027
## daily downside risk      0.0111 0.0066
## Annualised downside risk 0.1768 0.1044
## Downside potential       0.0054 0.0029
## Omega                    1.0722 1.1601
## Sortino ratio            0.0351 0.0714
## Upside potential         0.0058 0.0034
## Upside potential ratio   0.7027 0.6124
## Omega-sharpe ratio       0.0722 0.1601
Take a closer look at the drawdown information.
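Again the original chunk is not shown; a sketch that produces this kind of per-series drawdown table is table.Drawdowns() applied to each return column:

# Sketch: top 5 drawdowns for the strategy (ret) and the benchmark (ret_Rb)
lapply(RaRb[, c("ret", "ret_Rb")], function(x) table.Drawdowns(x, top = 5))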
## $ret
##         From     Trough         To   Depth Length To Trough Recovery
## 1 2018-08-31 2019-01-03 2019-09-16 -0.2746    261        85      176
## 2 2019-11-06 2019-12-03       <NA> -0.1300     39        19       NA
## 3 2019-10-01 2019-10-18 2019-10-29 -0.0810     21        14        7
## 4 2018-03-22 2018-04-20 2018-05-09 -0.0773     34        21       13
## 5 2018-08-10 2018-08-15 2018-08-20 -0.0474      7         4        3
##
## $ret_Rb
##         From     Trough         To   Depth Length To Trough Recovery
## 1 2018-09-21 2018-12-24 2019-04-12 -0.1935    140        65       75
## 2 2019-05-06 2019-06-03 2019-06-20 -0.0662     33        20       13
## 3 2018-03-15 2018-04-02 2018-06-04 -0.0610     56        12       44
## 4 2019-07-29 2019-08-05 2019-10-25 -0.0602     64         6       58
## 5 2018-06-13 2018-06-27 2018-07-09 -0.0300     18        11        7
Compare the returns.
chart.Boxplot(RaRb[,c("ret", "ret_Rb")], main = "Returns")
Compare return statistics.
table.Stats(RaRb[, c("ret", "ret_Rb")]) %>%
  t() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
 | Observations | NAs | Minimum | Quartile 1 | Median | Arithmetic Mean | Geometric Mean | Quartile 3 | Maximum | SE Mean | LCL Mean (0.95) | UCL Mean (0.95) | Variance | Stdev | Skewness | Kurtosis |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ret | 453 | 0 | -0.0669 | -0.0068 | 0.0006 | 0.0004 | 0.0003 | 0.0087 | 0.0642 | 0.0007 | -0.0011 | 0.0018 | 0.0002 | 0.0156 | -0.2542 | 2.8842 |
ret_Rb | 453 | 0 | -0.0324 | -0.0030 | 0.0006 | 0.0005 | 0.0004 | 0.0054 | 0.0505 | 0.0004 | -0.0004 | 0.0013 | 0.0001 | 0.0091 | -0.2949 | 3.6264 |
Compare Sharpe Information.
lapply(RaRb[, c("ret", "ret_Rb")], function(x){SharpeRatio(x)})

## $ret
##                                       ret
## StdDev Sharpe (Rf=0%, p=95%): 0.025027498
## VaR Sharpe (Rf=0%, p=95%):    0.015346462
## ES Sharpe (Rf=0%, p=95%):     0.009618405
##
## $ret_Rb
##                                   ret_Rb
## StdDev Sharpe (Rf=0%, p=95%): 0.05152014
## VaR Sharpe (Rf=0%, p=95%):    0.03218952
## ES Sharpe (Rf=0%, p=95%):     0.01913213
Plot the risk-return scatter plot.
chart.RiskReturnScatter(RaRb[, c("ret", "ret_Rb")], # check this plot a little more
                        Rf = .03/252,
                        scale = 252, # for daily data
                        main = "Risk - Return over the period")
Plot the rolling return, risk and Sharpe performance.
charts.RollingPerformance(RaRb[, c("ret", "ret_Rb")],
                          Rf = .03/12,
                          colorset = c("red", rep("darkgray", 5), "orange", "green"),
                          lwd = 2)
Compute the yearly returns.
lapply(RaRb[, c("ret")], function(x){periodReturn(
  x, period = 'yearly', type = 'arithmetic')}) # change type to log for continuous

## $ret
##            yearly.returns
## 2018-12-31      -1.855083
## 2019-12-31      -1.475181

lapply(RaRb[, c("ret_Rb")], function(x){periodReturn(
  x, period = 'yearly', type = 'arithmetic')})

## $ret_Rb
##            yearly.returns
## 2018-12-31     -9.0376638
## 2019-12-31     -0.7226497