Outlier Days with R and Python
Welcome to another installment of Reproducible Finance. Today’s post will be topical as we look at the historical behavior of the stock market after days of extreme returns and it will also explore one of my favorite coding themes of 2020 – the power of RMarkdown as an R/Python collaboration tool.
This post originated when Rishi Singh, the founder of tiingo and one of the nicest people I have encountered in this crazy world, sent over a note about recent market volatility along with some Python code for analyzing that volatility. We thought it would be a nice project to post that Python code along with the equivalent R code for reproducing the same results. For me, it’s a great opportunity to use RMarkdown’s R and Python interoperability superpowers, fueled by the reticulate package. If you are an R coder and someone sends you Python code as part of a project, RMarkdown + reticulate makes it quite smooth to incorporate that Python code into your work. It was interesting to learn how a very experienced Python coder might tackle a problem and then think about how to tackle that problem with R. Unsurprisingly, I couldn’t resist adding a few elements of data visualization.
Before we get started, if you’re unfamiliar with using R and Python chunks throughout an RMarkdown file, have a quick look at the reticulate documentation here.
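As a quick flavor of how the two languages talk to each other, here is a minimal sketch (not from the original post) of running a line of Python and then reading the result from R; the object name x is purely illustrative.

# R chunk
library(reticulate)

# run one line of Python, which creates a Python object called x
py_run_string("x = [1, 2, 3]")

# access that Python object from an R chunk via the py object
py$x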
Let’s get to it. Since we’ll be working with R and Python, we start with our usual R setup code chunk to load R packages, but we’ll also load the reticulate package and source a Python script. Here’s what that looks like.
library(tidyverse)
library(tidyquant)
library(riingo)
library(timetk)
library(plotly)
library(roll)
library(slider)
library(reticulate)

riingo_set_token("your tiingo token here")

# Python file that holds my tiingo token
reticulate::source_python("credentials.py")

knitr::opts_chunk$set(message = FALSE, warning = FALSE, comment = NA)
Note that I set my tiingo token twice: first using riingo_set_token() so I can use the riingo package in R chunks, and then by sourcing the credentials.py file, where I have put tiingoToken = 'my token'. Now I can use the tiingoToken variable in my Python chunks. This is necessary because we will use both R and Python to pull in data from Tiingo.
Next we will use a Python chunk to load the necessary Python libraries. If you haven’t installed these yet, you can open the RStudio terminal and run pip install. Since we’ll be interspersing R and Python code chunks throughout, I will add a # Python chunk to each Python chunk and, um, # R chunk to each R chunk.
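If you prefer to stay inside R for the installation step, reticulate can also install Python packages. Here is a minimal sketch, assuming reticulate is already pointed at the Python environment you want to use.

# R chunk
# install the Python libraries used below into the active Python environment
reticulate::py_install(c("pandas", "numpy", "tiingo"))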
# Python chunk
import pandas as pd
import numpy as np
import tiingo
Let’s get to the substance. The goal today is to look back at the last 43 years of S&P 500 price history and analyze how the market has performed following a day that sees an extreme return. We will also take care with how we define an extreme return, using rolling volatility to normalize percentage moves.
We will use the mutual fund VFINX as a tradeable proxy for the S&P 500 because it has a much longer history than other funds like SPY or VOO.
Let’s start by passing a URL string from tiingo to the pandas function read_csv, along with our tiingoToken.
# Python chunk
pricesDF = pd.read_csv("https://api.tiingo.com/tiingo/daily/vfinx/prices?startDate=1976-1-1&format=csv&token=" + tiingoToken)
We just created a Python object called pricesDF. We can look at that object in an R chunk by calling py$pricesDF.
# R chunk
py$pricesDF %>% head()
Next, let’s reformat pricesDF so that the date column becomes the index, in datetime format.
# Python chunk
pricesDF = pricesDF.set_index(['date'])
pricesDF.index = pd.DatetimeIndex(pricesDF.index)
Heading back to R for viewing, we see that the date column is no longer a column; it is now the index of the data frame, and in pandas the index is more like a label than a new column. In fact, here’s what happens when we call the row names of this data frame.
# R chunk
py$pricesDF %>% head() %>% rownames()
We now have our prices, indexed by date. Let’s convert adjusted closing prices to log returns and save the results in a new column called returns. Note the use of the shift(1) operator here. That is analogous to the lag(..., 1) function in dplyr.
# Python chunk
pricesDF['returns'] = np.log(pricesDF['adjClose']/pricesDF['adjClose'].shift(1))
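If the shift()/lag() analogy is new, here is a minimal sketch on a made-up three-price series (the numbers are purely illustrative) showing the same log return calculation in dplyr.

# R chunk
# toy prices, purely for illustration
tibble(adj_close = c(100, 101, 103)) %>%
  mutate(lagged_close = lag(adj_close, 1),               # the dplyr equivalent of pandas shift(1)
         log_return   = log(adj_close / lagged_close))   # first row is NA, just as in pandas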
Next, we want to calculate the 3-month (63 trading day) rolling standard deviation of these daily log returns, and then divide each daily return by the previous day’s rolling 3-month volatility in order to avoid look-ahead bias. We can think of this as normalizing today’s return by the previous 3 months’ rolling volatility, and we will label it as stdDevMove.
# Python chunk
pricesDF['rollingVol'] = pricesDF['returns'].rolling(63).std()
pricesDF['stdDevMove'] = pricesDF['returns'] / pricesDF['rollingVol'].shift(1)
Eventually we want to calculate how the market has performed on the day following a large negative move. To prepare for that, let’s create a column of next-day returns using shift(-1).
# Python chunk
pricesDF['nextDayReturns'] = pricesDF.returns.shift(-1)
Now we can filter by the size of the stdDevMove column and the returns column to isolate days where the standard deviation move was below -3 (a drop of more than three standard deviations) and the return was less than -3%. We use mean() to find the mean next-day return following such large events.
# Python chunk
nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < -.03), ['nextDayReturns']].mean()
Finally, let’s loop through and see how the mean next-day return changes as we filter on different extreme negative returns, which we will call drop tolerances. We will label the drop tolerance as i, set it at -.03, and then run a while loop that decrements i by .0025 at each pass. In this way we can look at the mean next-day return following different levels of negative returns.
# Python chunk
i = -.03
while i >= -.0525:
    nextDayPerformanceSeries = pricesDF.loc[(pricesDF['stdDevMove'] < -3) & (pricesDF['returns'] < i), ['nextDayReturns']]
    print(str(round(i, 5)) + ': ' + str(round(nextDayPerformanceSeries['nextDayReturns'].mean(), 6)))
    i -= .0025
It appears that as the size of the drop gets larger and more negative, the mean bounce back tends to get larger.
Let’s reproduce these results in R.
First, we import prices using the riingo_prices() function from the riingo package.
# R chunk
sp_500_prices <-
  "VFINX" %>%
  riingo_prices(start_date = "1976-01-01", end_date = today())
We can use mutate() to add columns of daily log returns, rolling volatility, the standard deviation move, and next-day returns.
# R chunk
sp_500_returns <-
  sp_500_prices %>%
  select(date, adjClose) %>%
  mutate(daily_returns_log = log(adjClose/lag(adjClose)),
         rolling_vol = roll_sd(as.matrix(daily_returns_log), 63),
         sd_move = daily_returns_log/lag(rolling_vol),
         next_day_returns = lead(daily_returns_log))
Now let’s filter() on an sd_move below -3 and a daily_returns_log less than a drop tolerance of -.03.
# R chunk
sp_500_returns %>%
  na.omit() %>%
  filter(sd_move < -3 & daily_returns_log < -.03) %>%
  select(date, daily_returns_log, sd_move, next_day_returns) %>%
  summarise(mean_return = mean(next_day_returns)) %>%
  add_column(drop_tolerance = scales::percent(.03), .before = 1)

# A tibble: 1 x 2
  drop_tolerance mean_return
  <chr>                <dbl>
1 3%                 0.00625
We used a while() loop to iterate across different drop tolerances in Python; let’s see how to implement that using map_dfr() from the purrr package. First, we will define a sequence of drop tolerances using the seq() function.
# R chunk
drop_tolerance <- seq(.03, .05, .0025)

drop_tolerance
[1] 0.0300 0.0325 0.0350 0.0375 0.0400 0.0425 0.0450 0.0475 0.0500
Next, we will create a function called outlier_mov_fun that takes a data frame of returns, filters on a drop tolerance, and gives us the mean return following large negative moves.
# R chunk
outlier_mov_fun <- function(drop_tolerance, returns) {
  returns %>%
    na.omit() %>%
    filter(sd_move < -3 & daily_returns_log < -drop_tolerance) %>%
    select(date, daily_returns_log, sd_move, next_day_returns) %>%
    summarise(mean_return = mean(next_day_returns) %>% round(6)) %>%
    add_column(drop_tolerance = scales::percent(drop_tolerance), .before = 1) %>%
    add_column(drop_tolerance_raw = drop_tolerance, .before = 1)
}
Notice how that function takes two arguments: a drop tolerance and a data frame of returns.
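Before mapping over the whole sequence, we can sanity check the function with a single call; this is just a quick usage sketch at the original -3% drop tolerance.

# R chunk
# one drop tolerance, one data frame of returns
outlier_mov_fun(.03, sp_500_returns)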
Next, we pass our sequence of drop tolerances, stored in a variable called drop_tolerance, to map_dfr(), along with our function and our sp_500_returns object. map_dfr() will iterate through our sequence of drops and apply our function to each one.
# R chunk
map_dfr(drop_tolerance, outlier_mov_fun, sp_500_returns) %>%
  select(-drop_tolerance_raw)

# A tibble: 9 x 2
  drop_tolerance mean_return
  <chr>                <dbl>
1 3%                 0.00625
2 3%                 0.00700
3 3%                 0.00967
4 4%                  0.0109
5 4%                  0.0122
6 4%                  0.0132
7 4%                  0.0149
8 5%                  0.0149
9 5%                  0.0162
Have a quick glance up at the results of our Python while() loop and we should see that the results are consistent.
Alright, let’s have some fun and get to visualizing these results with ggplot and plotly.
# R chunk
(
  sp_500_returns %>%
  map_dfr(drop_tolerance, outlier_mov_fun, .) %>%
  ggplot(aes(x = drop_tolerance_raw, y = mean_return,
             text = str_glue("drop tolerance: {drop_tolerance} mean next day return: {mean_return * 100}%"))) +
  geom_point(color = "cornflowerblue") +
  labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal()
) %>%
  ggplotly(tooltip = "text")
Here’s what happens when we expand the upper bound to a drop tolerance of -2% and make our intervals smaller, moving from .25% increments to .125% increments.
# R chunk
drop_tolerance_2 <- seq(.02, .05, .00125)

(
  sp_500_returns %>%
  map_dfr(drop_tolerance_2, outlier_mov_fun, .) %>%
  ggplot(aes(x = drop_tolerance_raw, y = mean_return,
             text = str_glue("drop tolerance: {drop_tolerance} mean next day return: {mean_return * 100}%"))) +
  geom_point(color = "cornflowerblue") +
  labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal()
) %>%
  ggplotly(tooltip = "text")
Check out what happens when we expand the lower bound, to a -6% drop tolerance.
# R chunk
drop_tolerance_3 <- seq(.02, .06, .00125)

(
  sp_500_returns %>%
  map_dfr(drop_tolerance_3, outlier_mov_fun, .) %>%
  ggplot(aes(x = drop_tolerance_raw, y = mean_return,
             text = str_glue("drop tolerance: {drop_tolerance} mean next day return: {mean_return * 100}%"))) +
  geom_point(color = "cornflowerblue") +
  labs(title = "Mean Return after Large Daily Drop", y = "mean return", x = "daily drop") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal()
) %>%
  ggplotly(tooltip = "text")
I did not expect that gap upward when the daily drop passes 5.25%.
A quick addendum that I would not have included if I had gotten my act together and finished this 4 days ago, but I’m curious how this last week has compared with other weeks in terms of volatility. I have in mind to visualize weekly return dispersion, and that seemed a mighty tall task until the brand new slider package came to the rescue! slider has a function called slide_period() that, among other things, allows us to break up time series according to different periodicities.
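To get a feel for what that means, here is a minimal sketch on ten made-up dates (the dates and the Monday origin are purely illustrative), showing how slide_period_dfr() splits a data frame into weekly pieces.

# R chunk
# ten consecutive days, starting on a Monday
toy_dates <- as.Date("2020-03-02") + 0:9

slide_period_dfr(tibble(date = toy_dates),
                 toy_dates,
                 "week",
                 ~ .x,
                 .origin = as.Date("2020-03-02"),
                 .names_to = "week")
# returns the same rows, with a week column holding 1 for the first seven days and 2 for the rest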
To break up our returns by week, we call slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week"), where first_monday_december is a date that falls on a Monday. We could use our eyeballs to check a calendar and find a date that’s a Monday, or we could use some good ol’ code. Let’s assume we want to find the first Monday in December of 2016.
We first filter our data with filter(between(date, as_date("2016-12-01"), as_date("2016-12-31"))). Then we create a column of weekday names with wday(date, label = TRUE, abbr = FALSE) and filter to our first value of “Monday”.
# R chunk
first_monday_december <-
  sp_500_returns %>%
  mutate(date = ymd(date)) %>%
  filter(between(date, as_date("2016-12-01"), as_date("2016-12-31"))) %>%
  mutate(day_week = wday(date, label = TRUE, abbr = FALSE)) %>%
  filter(day_week == "Monday") %>%
  slice(1) %>%
  pull(date)
Now we run our slide_period_dfr() code and it will start on the first Monday in December of 2016 and break our returns into weeks. Since we set .names_to = "week", the function will create a new column called week and give a unique number to each of our weeks.
# R chunk
sp_500_returns %>%
  select(date, daily_returns_log) %>%
  filter(date >= first_monday_december) %>%
  slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week") %>%
  head(10)

# A tibble: 10 x 3
    week date                daily_returns_log
   <int> <dttm>                          <dbl>
 1     1 2016-12-05 00:00:00           0.00589
 2     1 2016-12-06 00:00:00           0.00342
 3     1 2016-12-07 00:00:00           0.0133
 4     1 2016-12-08 00:00:00           0.00226
 5     1 2016-12-09 00:00:00           0.00589
 6     2 2016-12-12 00:00:00          -0.00105
 7     2 2016-12-13 00:00:00           0.00667
 8     2 2016-12-14 00:00:00          -0.00810
 9     2 2016-12-15 00:00:00           0.00392
10     2 2016-12-16 00:00:00          -0.00172
From here, we can group_by that week column and treat each week as a discrete time period. Let’s use ggplotly to plot each week on the x-axis and that week’s daily returns on the y-axis, so that the vertical spread shows us the dispersion of each week’s daily returns. Hover on a point to see the exact date of the return.
# R chunk
(
  sp_500_returns %>%
  select(date, daily_returns_log) %>%
  filter(date >= first_monday_december) %>%
  slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week") %>%
  group_by(week) %>%
  mutate(start_week = ymd(min(date))) %>%
  ggplot(aes(x = start_week, y = daily_returns_log, text = str_glue("date: {date}"))) +
  geom_point(color = "cornflowerblue", alpha = .5) +
  scale_y_continuous(labels = scales::percent, breaks = scales::pretty_breaks(n = 8)) +
  scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
  labs(y = "", x = "", title = "Weekly Daily Returns") +
  theme_minimal()
) %>%
  ggplotly(tooltip = "text")
We can also plot the standard deviation of returns for each week.
# R chunk
(
  sp_500_returns %>%
  select(date, daily_returns_log) %>%
  filter(date >= first_monday_december) %>%
  slide_period_dfr(., .$date, "week", ~ .x, .origin = first_monday_december, .names_to = "week") %>%
  group_by(week) %>%
  summarise(first_of_week = first(date),
            sd = sd(daily_returns_log)) %>%
  ggplot(aes(x = first_of_week, y = sd, text = str_glue("week: {first_of_week}"))) +
  geom_point(aes(color = sd)) +
  labs(x = "", title = "Weekly Standard Dev of Returns", y = "") +
  theme_minimal()
) %>%
  ggplotly(tooltip = "text")
That’s all for today! Thanks for reading and stay safe out there.