Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’m going to be writing a series of posts which will look at some applications of R (and perhaps Python) to financial modelling. We’ll start here by pulling some stock data into R, calculating the daily returns and then looking at correlations and simple volatility estimates.
Data
I was looking for a chunky set of stock data, preferably something that was available at higher than daily time resolution. These data from Kaggle for NIFTY 100 stocks on the Indian Stock Market seemed a decent fit.
📌 Although I wanted intraday data I won’t be using it for the moment and will start by aggregatig the data up to daily resolution.
List of Data Frames
Data for each of the stocks is in a separate CSV file. Start by iterating over these using map_dfr()
and loading them into a single data frame.
FILES <- list.files("data", pattern=".csv$", full.names = TRUE) load <- function(path) { read_csv(path, show_col_types = FALSE) %>% mutate( symbol = sub(".csv$", "", basename(path)), time = as.POSIXct(date), date = as.Date(time) ) %>% group_by(symbol, date) %>% arrange(time) %>% # Summarise the daily closing price. summarise( close = last(close), .groups = "drop" ) } stocks <- map_dfr(FILES, load)
How many records do we have and what does the data look like?
nrow(stocks) [1] 151932 head(stocks) # A tibble: 6 × 3 symbol date close <chr> <date> <dbl> 1 ACC 2015-02-02 1515. 2 ACC 2015-02-03 1512. 3 ACC 2015-02-04 1490 4 ACC 2015-02-05 1500. 5 ACC 2015-02-06 1513. 6 ACC 2015-02-09 1504.
We could work with them en masse, but it’s going to be easier to split into a list of data frames, one for each stock.
stocks <- split(stocks, stocks$symbol) length(stocks) [1] 87 names(stocks) [1] "ACC" "ADANIENT" "ADANIPORTS" "AMBUJACEM" "APOLLOHOSP" [6] "ASIANPAINT" "AUROPHARMA" "AXISBANK" "BAJAJ-AUTO" "BAJAJFINSV" [11] "BAJFINANCE" "BANKBARODA" "BERGEPAINT" "BHARTIARTL" "BIOCON" [16] "BOSCHLTD" "BPCL" "BRITANNIA" "CADILAHC" "CHOLAFIN" [21] "CIPLA" "COALINDIA" "COLPAL" "DABUR" "DIVISLAB" [26] "DLF" "DRREDDY" "EICHERMOT" "GAIL" "GODREJCP" [31] "GRASIM" "HAVELLS" "HCLTECH" "HDFC" "HDFCBANK" [36] "HEROMOTOCO" "HINDALCO" "HINDPETRO" "HINDUNILVR" "ICICIBANK" [41] "IGL" "INDUSINDBK" "INDUSTOWER" "INFY" "IOC" [46] "ITC" "JINDALSTEL" "JSWSTEEL" "JUBLFOOD" "KOTAKBANK" [51] "LT" "LUPIN" "M_M" "MARICO" "MARUTI" [56] "MCDOWELL-N" "MUTHOOTFIN" "NAUKRI" "NESTLEIND" "NIFTY50" [61] "NIFTYBANK" "NMDC" "NTPC" "ONGC" "PEL" [66] "PIDILITIND" "PIIND" "PNB" "POWERGRID" "RELIANCE" [71] "SAIL" "SBIN" "SHREECEM" "SIEMENS" "SUNPHARMA" [76] "TATACONSUM" "TATAMOTORS" "TATASTEEL" "TCS" "TECHM" [81] "TITAN" "TORNTPHARM" "ULTRACEMCO" "UPL" "VEDL" [86] "WIPRO" "YESBANK"
There are 87 distinct stocks and a number of names there that you might recognise.
Time Series
Now convert them into individual xts
objects and concatenate to form a single xts
object with the closing time series for each of the stocks.
stocks <- stocks %>% map(~{ ts <- xts(x = .[["close"]], order.by = .[["date"]]) colnames(ts) <- unique(.$symbol) ts }) stocks <- do.call(merge, stocks) %>% na.omit() stocks <- stocks["2016-01-01/2022-01-01"]
We now have a multivariate time series. How many individual series and how many observations?
dim(stocks) [1] 1485 87
So that’s 87 time series and 1485 observations for each of them. Each time series looks like this:
head(stocks$TATASTEEL) TATASTEEL 2016-01-04 259.00 2016-01-05 272.95 2016-01-06 268.75 2016-01-07 251.00 2016-01-08 254.55 2016-01-11 252.90
We can print the closing prices but it’s more useful to have a plot.
Returns
For analytical purposes the absolute daily closing price is not as interesting as the daily return, which can be easily calculated but we’ll use CalculateReturns()
from {PerformanceAnalytics}
for convenience.
returns <- CalculateReturns(stocks) %>% na.omit()
Plot the corresponding return time series.
Variability & Volatility
There’s a lot of diversity in the return time series across the collection of stocks. To illustrate, here are the series for a collection of them.
ACC has a roughly consistent low level of return variability over the six years. By contrast, Yes Bank has periods of increased variability starting in the latter part of 2018.
Daily Volatility
It would be useful if we had a number (or numbers) to describe the relative level of variability. Volatility is a statistical measure of the dispersion of returns and is calculated as either the standard deviation or variance of the returns. In general, the higher the volatility, the riskier the security.
# Pull out the returns for two specific stocks. # ACC <- returns$ACC YESBANK <- returns$YESBANK
Let’s start by calculating the standard deviations of those stocks over the entire data period.
# Daily volatility. # sd(ACC) [1] 0.0184169 sd(YESBANK) [1] 0.04682746
As expected the volatility of Yes Bank is higher than that of ACC.
Annualised Volatility
These values can be thought of as the average daily variation in the stock’s return. However, what’s more commonly used is the annualised value, which takes into account the fact that there are on average 252 trading days in each year and reflects the average variation in the stock’s return over the course of a year.
# Annualised volatility. # sqrt(252) * sd(ACC) [1] 0.2923593 sqrt(252) * sd(YESBANK) [1] 0.7433629
Varying Volatility
It’s instructive to look at the volatility for shorter periods of time. Let’s compare the volatility of those two stocks in 2017 and 2020.
sqrt(252) * sd(ACC["2017"]) [1] 0.2344968 sqrt(252) * sd(YESBANK["2017"]) [1] 0.2665718
In 2017 the volatilities of ACC and Yes Bank are comparable.
sqrt(252) * sd(ACC["2020"]) [1] 0.3900236 sqrt(252) * sd(YESBANK["2020"]) [1] 1.283764
However, in 2020 Yes Bank is substantially more volatile than ACC.
Of course you can make these volatility calculations progressively more granular. However, as time intervals get shorter you have less data, so each of the individual measurements becomes less reliable. We’ll be modelling the time variation of volatility in upcoming posts.
Outliers
The returns might include significant outliers that would likely bias any derived results. We can use winsorisation to reduce the effect of extreme outliers in the data. The Return.clean()
function from {PerformanceAnalytics}
can be used to apply clean.boudt()
to the data.
yesbank <- Return.clean(YESBANK, method = "boudt")
In the plot below the raw data are in black and the cleaned data are in blue. The extrema are reduced in magnitude in the cleaned data.
This can be appreciated by looking at the range (minimum and maximum values) of the rase and cleaned data.
range(YESBANK) [1] -0.5537634 0.5964912 range(yesbank) [1] -0.1621622 0.1601474
Importance of Volatility
Why is volatility important? It is an fundamental consideration in predicting future returns. Stocks with lower volatility are less likely to have substantial changes in return from day to day. However, stocks with higher volatility are literally more “volatile”: their returns can vary substantially from day to day.
As we have seen, volatility varies from stock to stock (some companies are more volatile than others) and from period to period (a company may have times of low and high volatility).
Volatility needs to be factored into future predictions. Of course, we also don’t know what future volatilities will be, so in addition to predicting future returns we will probably also want to predict future volatilities. We’ll see how this can be done in an upcoming post on GARCH models.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.