The Russell 2000 Small-Cap Index, ticker symbol ^RUT, is the hottest index of 2016 with YTD gains of over 18%. The index components are interesting not only because of recent performance, but because the top performers either grow to become mid-cap stocks or are bought by large-cap companies at premium prices. This means selecting the best components can result in large gains. In this post, I'll perform a quantitative stock analysis on the entire list of Russell 2000 stock components using the R programming language. Building on the methodology from my S&P Analysis Post, I develop screening and ranking metrics to identify the top stocks with amazing growth and the most consistency. I use R for the analysis, including the `rvest` library for web scraping the list of Russell 2000 stocks, `quantmod` to collect historical prices for all 2000+ stock components, `purrr` to map modeling functions, and various other `tidyverse` libraries such as `ggplot2`, `dplyr`, and `tidyr` to visualize and manage the data workflow. Last, I use `plotly` to create an interactive visualization used in the screening process. Whether you are familiar with quantitative stock analysis, just beginning, or simply interested in the R programming language, you'll gain both knowledge of data science in R and immediate insights into the best Russell 2000 stocks, quantitatively selected for future returns!
In part 1 of the analysis, we screen the entire stock list using a reward-to-risk metric. Here's a sneak peek at the `plotly` interactive visualization, which aids in screening the stocks. The best stocks from the algorithm are those with the highest `reward.metric`. The color and size vary with the value of the `reward.metric`. You can pan, zoom in, and hover over the points to gain information about the stocks.
In part 2 of the analysis, we review the top 15 stocks from part 1, developing a new growth-to-consistency metric to programmatically select the best of the best. Here's a sneak peek at the top six stocks from the Russell 2000 index, with performance adjusted to remove stock splits.
Table of Contents
- Overview
- Prerequisites
- Russell 2K Analysis: Part 1
- Russell 2K Analysis: Part 2
- Questions About the Analysis
- Download the .R File
- Conclusion
- Recap
- Further Reading
Overview
The S&P500 Analysis Post covered the fundamentals of quantitative stock analysis. I'll spare you the details, but if interested I strongly recommend going through that post to get up to speed. The methodology leverages the fact that stock returns are approximately normally distributed and uncorrelated. Because of this, we can model the behavior of stock prices within a confidence interval using the mean and standard deviation of the stock returns. The general process is to collect the historical stock prices, calculate the daily log returns (we use log returns because they are additive over time and approximately normally distributed), then calculate the mean and standard deviation of the log returns. The mean characterizes the growth rate (reward) and the standard deviation characterizes the volatility (risk).
In this post, we build on where we left off in the S&P500 Analysis Post, this time taking the analysis to a new level using a new stock index: the Russell 2000 Small Cap Index. The Russell 2000 index is a perfect candidate because it's 4X the size of the S&P500 index, it contains only small-cap stocks (median market cap of $528M), and it's not as well known, meaning it's full of hidden gems and takeover targets. Plus, it's up over 18% this year!
In part 1 of this analysis, we analyze the Russell 2000 stock list, homing in on the relationship between the mean and standard deviation of daily log returns. From there, we develop a reward-to-risk metric based on how the market tends to treat stocks. The end result is a `plotly` interactive graph that enables visualizing the attributes of the best and worst stocks.
In part 2, we switch focus to the top 15 stocks from part 1, this time evaluating how consistently each stock performs. We develop a new metric, growth-to-consistency, which enables us to programmatically select the best stocks. We end by selecting the top six stocks with the unique combination of amazing growth, low volatility, and consistent returns.
The full code for the tutorial can be downloaded as a `.R` file here.
Prerequisites
For those following along in R, you’ll need to load the following packages:
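Here's a minimal sketch of the library calls, based on the packages referenced throughout the post:

```r
# Packages used throughout the post
library(rvest)     # web scraping the Russell 2000 stock list
library(quantmod)  # historical stock prices and period returns
library(purrr)     # mapping functions over nested data frames
library(dplyr)     # data manipulation
library(tidyr)     # nesting / unnesting list-columns
library(ggplot2)   # static visualizations
library(plotly)    # interactive screening visualization
```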
If you don't have these installed, run `install.packages(pkg_names)` with the package names as a character vector (`pkg_names <- c("rvest", "quantmod", ...)`). I also recommend the open-source RStudio IDE, which makes R programming easy and efficient.
Russell 2K Analysis: Part 1
In part 1, the goal is to gain an overall understanding of the Russell 2000 Index. We’ll perform the following:
- Get the Russell 2K Stocks: Web Scraping with rvest
- Get Historical Prices and Log Returns: Function Mapping with quantmod and purrr
- Visualize the Relationship between Std Dev and Mean
- Develop a Screening Metric: Reward-to-Risk Metric
- Visually Screen with Plotly
Get the Russell 2K Stocks: Web Scraping with rvest
It turns out that it is rather difficult to find the list of Russell 2000 stocks. The best website I found was www.marketvolume.com. The list is spread across tables on nine HTML pages, each containing roughly 250 stock components. We'll collect the components using the `rvest` package. To start, we get the base path and row numbers for each of the nine webpages.
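A sketch of this step, assuming the site exposes each page through a "row=" query parameter (the exact URL is an assumption; adjust it to the current layout of marketvolume.com):

```r
# Base path plus the starting row for each of the nine pages (assumed URL)
base_path <- "http://www.marketvolume.com/indexes_exchanges/r2000_components.asp?s=RUT&row="
row_num   <- seq(from = 0, by = 250, length.out = 9)   # 0, 250, 500, ..., 2000
```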
Next, we create a function that we can `map()` using the `purrr` package. The function, `get_stocklist()`, takes the `base_path` and the `row_num`, and uses `rvest` functions to produce a table of stocks.
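A minimal sketch of how `get_stocklist()` might look, assuming the component list is the first HTML table on each page:

```r
get_stocklist <- function(base_path, row_num) {
    path <- paste0(base_path, row_num)   # e.g. "...&row=0" for the first page
    read_html(path) %>%
        html_node("table") %>%           # first table on the page (assumption)
        html_table() %>%
        as_tibble()
}
```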
As an example, we can apply `get_stocklist()` to the first page of nine, which is "row=0" in the HTML path.
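For example (output omitted):

```r
get_stocklist(base_path, row_num = 0)   # stocks 1-250, the first of nine pages
```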
Finally, we create a data frame of row numbers using the `row_num` vector. Using the `purrr::map()` function, we iterate the `get_stocklist()` function across each of the row numbers. The result is a nested data frame with two levels. Using `tidyr::unnest()`, we get the full list of Russell 2000 stocks on one level. The rest of the pipe (`%>%`) operations after `unnest()` just tidy the data.
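A sketch of the mapping and unnesting step (the tidying at the end, and the resulting `symbol` column name used later, are assumptions about the scraped table):

```r
stocklist <- tibble(row_num = row_num) %>%
    mutate(stock.table = map(row_num, ~ get_stocklist(base_path, .x))) %>%  # nested: one table per page
    unnest(cols = c(stock.table)) %>%   # flatten to one row per stock
    rename_with(tolower) %>%            # assumed to yield "symbol" and "company" columns
    select(-row_num) %>%
    distinct()                          # tidy up: drop any duplicate rows
```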
The end result is a data frame of the Russell 2000 stocks:
Get Historical Prices and Log Returns: Function Mapping with quantmod and purrr
Now that we have a list of the Russell 2000 stocks, we can collect some information. We need:
- Historical Stock Prices: The daily stock prices are used to calculate the daily log returns. The function `quantmod::getSymbols()` returns the stock prices. We use a wrapper function, `get_stock_prices()`, to return the stock prices as a data frame in the consistent format needed for the unnesting process.
- Daily Log Returns: Log returns are the basis for quantitative stock analysis, which enables statistical prediction of future stock prices using the mean and standard deviation of the log returns. The mean drives the growth rate, and the standard deviation drives the stock volatility. The function `quantmod::periodReturn()` returns the daily log returns when we set `period = "daily"` and `type = "log"`. We use a wrapper function, `get_log_returns()`, to return the log returns from the historical stock prices as a data frame in the consistent format needed for the mapping process.
The code for the wrapper functions is provided below:
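Sketches of the two wrappers follow. The column handling is an assumption; the key idea is returning plain data frames so the results can be stored as list-columns and unnested consistently.

```r
get_stock_prices <- function(symbol, ...) {
    # Download OHLCV data as xts, then flatten to a data frame with a Date column
    prices_xts <- getSymbols(Symbols = symbol, auto.assign = FALSE, ...)
    names(prices_xts) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")
    prices_df <- as.data.frame(prices_xts)
    prices_df$Date <- as.Date(rownames(prices_df))
    as_tibble(prices_df) %>% select(Date, everything())
}

get_log_returns <- function(stock_prices, period = "daily", ...) {
    # Compute log returns of the Adjusted close, again returning a plain data frame
    adjusted_xts <- xts(stock_prices$Adjusted, order.by = stock_prices$Date)
    log_returns_xts <- periodReturn(adjusted_xts, period = period, type = "log", ...)
    tibble(
        Date        = index(log_returns_xts),
        Log.Returns = as.numeric(log_returns_xts)
    )
}
```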
An example usage of `get_stock_prices()`:
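For instance (the date range is illustrative):

```r
FLWS_prices <- get_stock_prices("FLWS", from = "2007-01-01", to = "2016-10-31")
FLWS_prices
```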
And, an example usage of `get_log_returns()`:
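Continuing with FLWS:

```r
FLWS_log_returns <- get_log_returns(FLWS_prices)
FLWS_log_returns
```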
We can now get the `mean()` and `sd()` of the log returns. We can also get the number of trade days using the `nrow()` or `length()` functions.
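For example, for FLWS:

```r
mean(FLWS_log_returns$Log.Returns)   # mean daily log return (reward / growth rate)
sd(FLWS_log_returns$Log.Returns)     # standard deviation (risk / volatility)
nrow(FLWS_log_returns)               # number of trade days in the sample
```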
As we'll see in the next section, these features are important to the risk-reward trade-off. For now, we need to collect these values for the list of stocks. We do this using `purrr::map()` to apply functions to lists stored inside data frames. The next code chunk is the most complex of the post. Basically, we use the functions created previously to iteratively download the stock prices and compute the log returns. I added the `proc.time()` functions to time the code.

Warning: The following script stores the stock prices and log returns for the entire list of 2000+ Russell 2000 stock components. It takes my laptop about 15 minutes to run the script.
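A sketch of the chunk follows. The `possibly()` wrappers and the date range are assumptions added to make the long-running download fault tolerant; the column names match those used in the rest of the post.

```r
start <- proc.time()

stocklist <- stocklist %>%
    mutate(
        stock.prices = map(symbol,
                           possibly(~ get_stock_prices(.x, from = "2007-01-01", to = "2016-10-31"),
                                    otherwise = NA)),
        log.returns  = map(stock.prices, possibly(get_log_returns, otherwise = NA))
    ) %>%
    filter(!is.na(stock.prices), !is.na(log.returns)) %>%   # drop tickers that failed to download
    mutate(
        mean.log.returns = map_dbl(log.returns, ~ mean(.x$Log.Returns)),
        sd.log.returns   = map_dbl(log.returns, ~ sd(.x$Log.Returns)),
        n.trade.days     = map_dbl(log.returns, nrow)
    )

proc.time() - start   # roughly 15 minutes on a laptop
```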
And, a peek at the contents of `stocklist`:
What is `stocklist`? It's the historical stock prices and log returns for every stock in the Russell 2000 index. The stock prices and log returns are stored as nested lists inside the top-level data frame. We can access them like a list. Here are the stock prices for the first observation in the list, 1-800-FLOWERS.COM, ticker symbol FLWS:
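For example:

```r
stocklist$stock.prices[[1]]   # prices for the first stock, FLWS
```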
Visualize the Relationship between Std Dev and Mean
Now that we have the mean daily log returns (MDLR) and the standard deviation of daily log returns (SDDLR), we can start to visualize the data. The next plot shows an important trend: the relationship between SDDLR and MDLR. We first filter out stocks with fewer than 2494 trade days (`n.trade.days`) so each stock retained has the same large number of samples. Each year has approximately 250 trade days, so this filters out stocks with less than ten years of data to trend. Next, we limit stocks to those with an SDDLR below 0.075, which allows us to zoom in on the vast majority of stocks. Plotting the trend using `ggplot2` shows an interesting phenomenon: stocks with a high SDDLR tend to perform worse than those with a low SDDLR.
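A sketch of the plot, with the filter thresholds taken from the text:

```r
stocklist %>%
    filter(n.trade.days >= 2494,          # roughly ten or more years of data
           sd.log.returns < 0.075) %>%    # zoom in on the vast majority of stocks
    ggplot(aes(x = sd.log.returns, y = mean.log.returns)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "lm") +          # downward trend: higher volatility, lower mean return
    labs(title = "Russell 2000: Mean vs Standard Deviation of Daily Log Returns",
         x     = "Standard Deviation of Daily Log Returns (SDDLR)",
         y     = "Mean Daily Log Returns (MDLR)")
```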
The important point is that, while volatile stocks may have one or two good years, over the long haul the less volatile stocks are where you want to put your money. We can develop a screening metric using this rationale.
Develop a Screening Metric: Reward-to-Risk Metric
The screening metric we will use is a reward-to-risk metric. We want to reward stocks with a high MDLR (growth rate). We want to penalize stocks with a high SDDLR (volatility), since these stocks tend to perform worse over time. The constant 2500 is multiplied in to scale the values, which generally fall in the range of -100 to 100. The equation then becomes:

R = 2500 * (mu / sigma)
Where,
- R is the reward-to-risk metric
- mu is the MDLR
- sigma is the SDDLR
Now we can add the reward-to-risk metric (`reward.metric`) to our data frame. We remove stocks with less than ten years of trading data, then add the reward-to-risk metric.
Visually Screen with Plotly
Let's use the `reward.metric` to generate a visualization we can use for screening the stocks and understanding the index. Similar to the S&P500 post, we generate an interactive visualization using `plotly`. However, this time we use the `reward.metric` to drive the `color` and `size` of the markers, which lets us visually see which stocks score high on reward-to-risk. The best stocks have a green color, and the worst stocks have a brown color. We can pan, zoom, and hover over stocks to gain additional insights.
The code chunk to generate this visualization:
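A sketch of the `plotly` call; the hover text and the brown-to-green palette are assumptions:

```r
plot_ly(data   = stocklist,
        x      = ~sd.log.returns,
        y      = ~mean.log.returns,
        type   = "scatter",
        mode   = "markers",
        color  = ~reward.metric,
        colors = "BrBG",                         # brown (worst) to green (best)
        size   = ~reward.metric,
        text   = ~paste0("Symbol: ", symbol,
                         "<br>Reward metric: ", round(reward.metric, 1))) %>%
    layout(title = "Russell 2000 Screening: Reward-to-Risk",
           xaxis = list(title = "Risk: StDev of Daily Log Returns (SDDLR)"),
           yaxis = list(title = "Reward: Mean Daily Log Returns (MDLR)"))
```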
Russell 2K Analysis: Part 2
The end goal is to find the best stocks, and we don’t want to simply trust the reward-to-risk metric. Rather, we want to review the characteristics of the top stocks so we can select those with the most consistent growth. In this section, we perform the following:
- Visualize Top 15 Stocks to Understand Consistent Growth
- Compute the Three Attributes of High Performing Stocks
- Develop a Ranking Metric: Growth-to-Consistency
- Visualize Performance of Top Six Stocks
Visualize Top 15 Stocks to Understand Consistent Growth
We begin by filtering the `stocklist`, first ranking by the `reward.metric` and then selecting the top 15.
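A sketch (the `high_performers` name is a placeholder):

```r
high_performers <- stocklist %>%
    mutate(rank = min_rank(desc(reward.metric))) %>%   # rank 1 = highest reward.metric
    filter(rank <= 15)
```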
Next, we create a `means_by_year()` function to take a data frame of `log.returns` and return a data frame of MDLRs by year. We then `map()` the `means_by_year()` function to iterate over the full data frame of log returns.
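A sketch of the function and the mapping step; extracting the year with `format()` is one reasonable way to do it:

```r
means_by_year <- function(log_returns) {
    log_returns %>%
        mutate(year = as.numeric(format(Date, "%Y"))) %>%
        group_by(year) %>%
        summarise(mean.log.returns = mean(Log.Returns))
}

high_performers <- high_performers %>%
    mutate(means.by.year = map(log.returns, means_by_year))
```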
Then, we `unnest()` the high performers to get a one-level data frame. Voila, we have `mean.log.returns` by year for each stock.
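For example:

```r
hp_by_year <- high_performers %>%
    select(symbol, means.by.year) %>%
    unnest(cols = c(means.by.year))
hp_by_year
```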
Finally, we can visualize the results in `ggplot2` using a facet plot.
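A sketch of the facet plot:

```r
hp_by_year %>%
    ggplot(aes(x = year, y = mean.log.returns)) +
    geom_hline(yintercept = 0, color = "red") +   # below the line = a losing year
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +      # per-stock growth-rate trend
    facet_wrap(~ symbol, ncol = 3)
```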
When reviewing the facet plot, we want to select stocks with the following attributes that the market tends to reward:
- Above-Zero MDLR: Every time a stock's MDLR drops below zero, the stock loses money for that year. All of the stocks drop below zero at least once. Those that drop below zero multiple times become bad investments in those respective years. We want good investments over the long haul, or in other words stocks with consistent, above-zero MDLR.
- Flat or Upward Growth Trends: Remember, we are viewing MDLR, which is the growth rate. A flat trend means the stock is consistently growing. An upward trend means the stock's growth rate is accelerating. A downward trend means the stock's growth rate is slowing. We want flat or upward growth.
- Low Standard Deviation of MDLR by Year: Again, the market loves consistency. Less volatility makes for a more profitable investment.
We have two options now:
- We can manually review each chart to decide which stocks we want to invest in, or
- We can develop a method to programmatically rank the stocks.
Always opt for programmatic review! Programmatic review is less prone to errors and can be applied to a much larger set of stocks (while 15 stocks may be easy to review manually, 1,500 becomes very difficult).
Compute Three Desired Attributes of High Performing Stocks
For the programmatic review, we need to compute the three desired attributes of high performing stocks:
- The number of times the stock’s MDLR by year drops below zero (bad)
- The slope of the trend line of MDLR by year (positive is good)
- The standard deviation of MDLR by year
Attribute 1: Number of Times MDLR by Year Drops Below Zero
First, we count the number of times the stock drops below zero. We create a function `means_below_zero()` that takes a data frame of `means.by.year` for one stock and returns the number of MDLRs by year that are less than zero. We then map the function using `map_dbl()`. Note that we use the `map_dbl()` version of `map()` because `map()` returns a list while `map_dbl()` returns a number. We want a number, not a list with the number in it.
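A minimal sketch:

```r
means_below_zero <- function(means_by_year) {
    sum(means_by_year$mean.log.returns < 0)   # number of losing years
}

high_performers <- high_performers %>%
    mutate(means.below.zero = map_dbl(means.by.year, means_below_zero))
```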
Attribute 2: Slope of MDLR by Year
Next, we need to get the slope of the linear trend. The method we use is slightly more complex because we need to get the second coefficient of the linear model, but it is extremely powerful because we are applying models to data frames. We’ll follow the process outlined in R for Data Science, Chapter 25: Many Models.
We create a `means_by_year_model()` function to apply a linear model to a single stock. The function takes a data frame of `means.by.year` for the stock and returns the model from the `lm()` function.
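A minimal sketch:

```r
means_by_year_model <- function(means_by_year) {
    lm(mean.log.returns ~ year, data = means_by_year)   # linear trend of MDLR over the years
}
```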
Let’s test it out on the first stock, MAXIMUS, ticker symbol MMS:
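Assuming MMS sits in the first row of the high-performer data frame, as in the text:

```r
means_by_year_model(high_performers$means.by.year[[1]])
```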
We are interested in the coefficient for year, so let’s make one more function to extract the slope coefficient.
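A minimal sketch:

```r
means_by_year_model_slope <- function(model) {
    coef(model)[["year"]]   # second coefficient = slope of the yearly trend
}
```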
Again, let’s test it out on MMS to validate the workflow:
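Chaining the two functions on the first stock:

```r
high_performers$means.by.year[[1]] %>%
    means_by_year_model() %>%
    means_by_year_model_slope()
```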
Now we are ready to apply the modeling and slope functions to the data frame:
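A sketch of the mapping step:

```r
high_performers <- high_performers %>%
    mutate(
        model = map(means.by.year, means_by_year_model),
        slope = map_dbl(model, means_by_year_model_slope)
    )
```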
Great! We now have the linear models and the slope of the linear trend line.
Attribute 3: Standard Deviation of MDLR by Year
Finally, to drive home consistency, we need the standard deviation of the MDLR by year. To do this we create a `sd_of_means_by_year()` function that simply computes the `sd()` of the MDLR by year. The function is then mapped to the data frame using `map_dbl()`, which returns the numeric value.
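A minimal sketch:

```r
sd_of_means_by_year <- function(means_by_year) {
    sd(means_by_year$mean.log.returns)
}

high_performers <- high_performers %>%
    mutate(sd.of.means.by.year = map_dbl(means.by.year, sd_of_means_by_year))
```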
Develop a Ranking Metric: Growth-to-Consistency
To assist in the final ranking process we'll use a growth-to-consistency ranking metric, one that incorporates `means.below.zero`, the `slope` of the linear trend line, and `sd.of.means.by.year` for the MDLR by year. We develop the following measure, which rewards stocks with a positive growth rate and penalizes stocks with high year-to-year volatility and multiple years of negative returns:

G = m / ((n + 1) * s)
Where,
- m is the slope of the linear trend line
- n is the number of times the MDLR goes below zero
- s is the standard deviation of the MDLR by year
Now we add the new growth-to-consistency metric (`growth.metric`) to our data frame, and view the results.
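A sketch, using the growth-to-consistency formula above:

```r
high_performers <- high_performers %>%
    mutate(growth.metric = slope / ((means.below.zero + 1) * sd.of.means.by.year)) %>%
    arrange(desc(growth.metric))

high_performers %>%
    select(symbol, growth.metric, slope, means.below.zero, sd.of.means.by.year)
```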
symbol | growth.metric | slope | means.below.zero | sd.of.means.by.year |
---|---|---|---|---|
CMN | 0.1085148 | 0.0001688 | 1 | 0.0007776 |
MGEE | 0.1006291 | 0.0000827 | 1 | 0.0004111 |
HIFS | 0.0714181 | 0.0001605 | 2 | 0.0007489 |
PLUS | 0.0502491 | 0.0000655 | 1 | 0.0006521 |
LANC | 0.0500958 | 0.0000949 | 2 | 0.0006315 |
NATH | 0.0399066 | 0.0001133 | 2 | 0.0009466 |
VGR | 0.0205184 | 0.0000469 | 2 | 0.0007623 |
BOFI | 0.0159775 | 0.0000802 | 2 | 0.0016740 |
AGX | 0.0147843 | 0.0000920 | 3 | 0.0015549 |
EXPO | 0.0102393 | 0.0000108 | 1 | 0.0005276 |
ATRI | -0.0055678 | -0.0000163 | 2 | 0.0009788 |
EBIX | -0.0068658 | -0.0000601 | 4 | 0.0017505 |
NEOG | -0.0271193 | -0.0000891 | 2 | 0.0010958 |
MMS | -0.0328055 | -0.0000443 | 1 | 0.0006747 |
CALM | -0.0528240 | -0.0002213 | 2 | 0.0013963 |
Visualize Performance of Top Six Stocks
Finally, we are ready to visualize the performance of the top six stocks. The code chunk below ranks the high-performance stocks by the growth-to-consistency metric (`growth.metric`), then filters to the top six. From that point, the `symbol` and `stock.prices` are selected and unnested to return the historical stock prices for each symbol in the top six performers. Last, a facet plot is made using the historical stock prices adjusted for stock splits. These are the top performers!
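A sketch of the final chart:

```r
high_performers %>%
    top_n(n = 6, wt = growth.metric) %>%        # six best growth-to-consistency scores
    select(symbol, stock.prices) %>%
    unnest(cols = c(stock.prices)) %>%
    ggplot(aes(x = Date, y = Adjusted)) +       # Adjusted close removes the effect of stock splits
    geom_line() +
    facet_wrap(~ symbol, ncol = 3, scales = "free_y")
```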
Questions About the Analysis
- Should you simply invest in the stocks screened? Why or why not? (Hint: What else about the stocks and/or companies should you investigate?)
- Are there other factors not considered in the metrics that should be included? (Hint: What characteristics do some of the top 15 performers have that are not included in the final metric, G?)
- Can you combine the two metrics, R and G, into one metric that can be computed for the entire data frame of all Russell 2000 stocks?
Download the .R File
The full code for the tutorial can be downloaded as a `.R` file here. The code will take approximately 15 minutes to run.
Conclusion
As shown in the previous S&P500 Analysis Post, quantitative analysis is a powerful tool. By applying data science techniques in R with packages such as the `tidyverse`, `quantmod`, and `purrr`, we can evaluate massive data sets and quickly screen stocks using reward-to-risk and growth-to-consistency metrics. However, a word of caution before jumping into any investments: selecting investments on statistical analysis alone is never a good idea. The statistical analysis allows us to screen stocks as potential investments, but a thorough analysis of the company and the stock should be performed. Evaluation of stock and company fundamentals such as asset valuation (forward and trailing P/E ratios), industry analysis, and diversification should be pursued prior to making any investment decision. With that said, the screening process described herein is an excellent first step. Once you have investments worth investigating, I recommend reading articles on a website such as Seeking Alpha from experts with experience covering the stocks of interest.
Recap
Well done if you made it this far! This post covered a vast array of R programming functions, data management workflows, and data science techniques that can be used regardless of the application. The Russell 2000 Post extended the investment screening analysis from the previous S&P500 Analysis Post by using a variety of data science and modeling techniques.
To summarize:
- We built custom functions to retrieve stock symbols, historical stock prices, and daily log returns. We used a custom function to web scrape stock symbols, leveraging the `rvest` package. We also used custom functions as wrappers for the `getSymbols()` and `periodReturn()` functions from the `quantmod` library. We mapped the custom functions to data frames using the `purrr::map()` function. The result was a data frame of stock prices and daily log returns for every stock in the Russell 2000 index.
- We visualized the relationship between MDLR and SDDLR using `ggplot2`, which gave us insight into how stocks function within the market. We used this information to develop a reward-to-risk metric that served as a useful way to screen stocks when applied to an interactive screening visualization built with `plotly`.
- We pared down the list to the top 15 stocks, manipulating the data to visualize the level of consistency of our high performers using `ggplot2`. We then developed three desirable attributes of high performers. The attributes were added using the data modeling workflow, which consists of creating functions that return data frames, lists, or values and mapping the functions using `purrr`.
- We developed a final metric that measures the growth-to-consistency of our high performers. This measure enabled us to rank the high performers and select the top six for further review. The share price performance of the top six was visualized using `ggplot2`.
Great work! If you understand this post, you now have many tools at your disposal that apply to much more than just investment analysis.
Further Reading
- R for Data Science, Chapter 25: Many Models: Chapter 25 covers the workflow for modeling many models using the `tidyverse`, `purrr`, and `modelr` packages. This is a very powerful resource that helps you extend a single model to many models. The entire R for Data Science book is free and online.
- Seeking Alpha: A website for the investing community focused on providing investment analysis and insight. Once you have screened stocks, a next logical step is to collect information. The analysis provided on SA can help you finalize your investment decisions.