
Generating Ticker Symbols with Markov Chains

[This article was first published on quantitate, and kindly contributed to R-bloggers.]
Stock ticker symbols are short character string codes (AAPL for Apple, WMT for Walmart) that uniquely identify stock listings. They originate from the days when stock quotes were transmitted via telegraph. These symbols are often used as tools for branding and companies choose them to be memorable and easily recognized.

As a fun exercise in text processing with R, we'll build a simple model that generates random character strings mimicking the roughly 6,000 ticker symbols listed on the AMEX, NASDAQ, and NYSE. A basic tool for generating random text patterned on a source text is the Markov chain. Markov chains are the engine behind gibberish generators, a briefly popular internet meme in the 1990s.

We start with the base of all real ticker symbols. Using the TTR (“Technical Trading Rules”) library, it’s easy to get the full list for the AMEX, NASDAQ, and NYSE.
library(TTR)
listings <- stockSymbols() 

On November 17, 2013, TTR found 6,462 ticker symbols. The 10 largest symbols by market cap at that time were:
listings <- listings[order(listings$MarketCap,decreasing=TRUE),]
head(listings[,c("Symbol","Name","MarketCap","IPOyear","Exchange")],10)
 Symbol                           Name    MarketCap IPOyear Exchange
   AAPL                     Apple Inc. 472354352358    1980   NASDAQ
    XOM        Exxon Mobil Corporation 416188308487      NA     NYSE
   GOOG                    Google Inc. 345299382446    2004   NASDAQ
   MSFT          Microsoft Corporation 315895470257    1986   NASDAQ
     GE       General Electric Company 275192436800      NA     NYSE
    JNJ              Johnson & Johnson 266315572747      NA     NYSE
    WMT          Wal-Mart Stores, Inc. 256994921591    1970     NYSE
    CVX            Chevron Corporation 230896151941      NA     NYSE
     PG Procter & Gamble Company (The) 230614695048      NA     NYSE
    WFC          Wells Fargo & Company 229351244873      NA     NYSE

Ticker symbols pulled down from TTR range from 1 to 8 characters in length and use the capital letters A through Z and the hyphen "-". In ticker symbols that contain hyphens, the characters following the hyphen appear to be the "behind the dot" codes that provide additional information on the asset type. To keep things simple, we'll just strip off all this additional info:
  symbols <- unique(gsub("-.*","",listings$Symbol))
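To see what this cleanup does, here is a minimal sketch on a handful of made-up raw symbols. The vector `raw` and its hyphenated suffixes are assumptions for illustration only and are not taken from the TTR download.

```r
# Hypothetical sample of raw symbols; the suffixes after "-" are illustrative
raw <- c("AAPL", "BRK-A", "BRK-B", "WFC-PL", "WMT")

# "-.*" matches the first hyphen and everything after it
stripped <- gsub("-.*", "", raw)

# unique() collapses listings that now map to the same base symbol
symbolsToy <- unique(stripped)
print(symbolsToy)
```

Because several listings can share a base symbol, the `unique()` call is what shrinks the list once the suffixes are gone.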

This leaves us with a list of 5,594 symbols. Let's get the distribution of symbol lengths for our symbol list:
lengths <- nchar(symbols)
pLength <- prop.table(table(lengths))
print(round(pLength,4))
lengths
1      2      3      4      5 
0.0039 0.0387 0.4572 0.4698 0.0305 
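One possible use of this distribution is to pick a symbol length at random before generating the symbol itself. The sketch below is not part of the original analysis: it copies the rounded probabilities from the printed table, and `pLengthToy` and `sampleLength` are hypothetical names introduced here.

```r
# Length distribution copied from the printed table above (rounded values)
pLengthToy <- c("1" = 0.0039, "2" = 0.0387, "3" = 0.4572,
                "4" = 0.4698, "5" = 0.0305)

# Hypothetical helper: draw k symbol lengths in proportion to their frequency.
# sample() rescales prob internally, so the rounding error in the sum is harmless.
sampleLength <- function(k = 1, p = pLengthToy) {
  as.integer(sample(names(p), size = k, replace = TRUE, prob = p))
}

set.seed(1)
drawn <- sampleLength(1000)
print(table(drawn))
```

With the real data, `pLength` could be passed in place of `pLengthToy`, since it carries the lengths as names in the same way.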

Let's now explain a very simple Markov chain model for text strings. A word $w$ of length $n$ is a sequence of letters $c_1, c_2,...,c_n$, which we naturally write as $w=c_1c_2...c_n$. If the first $n-1$ letters of $w$ are known, we write the probability that the $n$-th letter is $c_n$ as $p(c_n|c_1c_2...c_{n-1})$.

A Markov chain model of $p(c_n|c_1c_2...c_{n-1})$ allows us to assume that $$p(c_n|c_1c_2...c_{n-1})=p(c_n | c_{n-1}).$$ That is, we can assume that the $n$-th letter of the word only depends on the letter that directly precedes it. The Markov assumption often allows us to make huge simplifications to probability models and computations.
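To see where the simplification comes from, it may help to write out the probability of a whole word. The first equality below is the ordinary chain rule of probability and holds for any model; the second applies the Markov assumption to each factor. This is a standard derivation added for illustration, using the same notation as above.

$$p(w) = p(c_1)\,p(c_2|c_1)\,p(c_3|c_1c_2)\cdots p(c_n|c_1c_2...c_{n-1}) = p(c_1)\prod_{k=2}^{n} p(c_k|c_{k-1}).$$

Instead of one conditional distribution for every possible prefix, we only need a distribution for the first letter and one for each pair of adjacent letters.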

As an example, suppose that ticker symbols are generated by a probabilistic process. If we have a three-letter ticker symbol whose first two letters are "WM", we can meaningfully quantify the probability that the third letter is a "T". Under a Markov model, $p(T|WM)=p(T|M)$. Now there is a little bit of ambiguity in the expression $p(T|M)$. Is this the probability that $T$ directly follows an $M$ anywhere in a symbol, or is it the probability that $T$ is the third letter given that $M$ is the second? Here we'll assume that the position information is relevant and we'll compute $$p(c_3=T|WM) = p(c_3=T | c_2 = M).$$
With our sample of symbols, here is a way to approximate $p(T|WM)$ in R using the Markov model:
sum(grepl("^.MT",symbols[lengths==3]))/sum(grepl("^.M",symbols[lengths==3]))
[1] 0.08391608
Unpacking the syntax, grepl performs regular expression matching and returns TRUE when a match is found. The expression "^.MT" matches all the symbols whose first letter is anything and whose second and third letters are "MT", while "^.M" matches anything whose second letter is "M". The sums count the number of TRUE evaluations.
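The same estimate can be read off a table of position-specific transition probabilities. The sketch below uses a small made-up vector `toy` of three-letter symbols (an assumption, so that the example runs without the TTR download) and checks that the regular expression count and the table agree.

```r
# Hypothetical three-letter symbols, used only to keep the example self-contained
toy <- c("WMT", "AMT", "EMC", "IBM", "CMS", "XOM")

# Same estimate as above: share of symbols with second letter "M" that end in "T"
pRegex <- sum(grepl("^.MT", toy)) / sum(grepl("^.M", toy))

# Equivalent view: cross-tabulate second vs. third letters, then
# normalise each row so that it becomes p(c_3 | c_2)
counts <- table(second = substr(toy, 2, 2), third = substr(toy, 3, 3))
trans  <- prop.table(counts, margin = 1)

print(pRegex)
print(trans["M", "T"])
```

The table has the advantage of holding every $p(c_3 | c_2)$ at once, whereas the regular expression approach computes one entry per call.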

With the preliminaries out of the way, let's demonstrate a function which generates a random symbol based on the Markov chain model:
# Create a random symbol of length n based on a first-order Markov chain
newSymbol <- function(n=3){
  symbolSet <- symbols[lengths==n]
  s <- ""
  for(i in seq_len(n)){
    # Flag symbols whose character i-1 equals character i-1 of s
    # (for i = 1 the pattern is "^", which matches every symbol)
    pattern         <- paste0("^",substr(s,i-1,i-1))
    match           <- grepl(pattern,substr(symbolSet,i-1,i-1))
    
    # Distribution of next letter among the matching symbols
    nxtLetters      <- substr(symbolSet,i,i)
    nxtLetterDist   <- nxtLetters[nxtLetters!="" & match]
    
    # Sample the next character from the distribution
    nextChar        <- sample(nxtLetterDist,1)
    
    # Append to the string
    s <- paste0(s,nextChar)
  }
  return(s)
}

Here is an example of the generator in action:
replicate(10, newSymbol(3))
 [1] "EMS" "ZTT" "NBS" "BTG" "MAG" "HAM" "PLK" "ALX" "CFM" "FLS"

How can we test that the symbol generator is really making symbols that are similar to the actual ticker symbols? One way is to compare the generator against a completely random symbol generator. Since there are 26 letters available in a valid ticker symbol, a completely random generator would use any given letter with probability 1/26. So, for example, the second character of a completely randomly generated symbol should follow a uniform distribution on the letters A-Z, each with probability 1/26. For the Markov chain model, we would hope that the second character of a generated symbol has a probability distribution similar to that of the second character in the set of actual ticker symbols.
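For reference, a completely random generator of this kind can be sketched in a few lines. The function name `randomSymbol` is hypothetical and introduced here only for illustration.

```r
# Hypothetical baseline: every position is drawn uniformly from A-Z,
# independently of the other positions
randomSymbol <- function(n = 3) {
  paste(sample(LETTERS, n, replace = TRUE), collapse = "")
}

set.seed(42)
baseline <- replicate(10000, randomSymbol(3))

# Empirical second-character frequencies; each should be close to 1/26
secondFreq <- prop.table(table(substr(baseline, 2, 2)))
print(round(range(secondFreq), 4))
print(round(1/26, 4))
```

Since the theoretical distribution of this baseline is known exactly, it can also be represented by a single reference level at 1/26 rather than by simulated symbols.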
The following snippet inspects the distribution of the second character of symbols of length 3 for natural symbols, Markov generated symbols, and completely random symbols.
# Generate a bunch of symbols with the Markov chain
symbolsA <- unique(replicate(10000, newSymbol(3)))

# Build data frame for second character distributions
require(ggplot2)
require(reshape2)
Natural <- prop.table(table(substr(symbols[lengths==3],2,2)))
Markov <- prop.table(table(substr(symbolsA,2,2)))
secondChar<-as.data.frame(t(rbind(Natural,Markov)))
secondChar$character <- rownames(secondChar)
secondChar <- melt(secondChar, id=c("character"))

# Build the distribution plot
ggplot(secondChar, aes(x=character,y=value)) + 
    geom_point(aes(color=variable), alpha=.75,size=5)+
    ylab("Probability")+xlab("Character")+
    ggtitle("Ticker Symbols:  Second Character Distribution")+
    geom_hline(yintercept=1/26, size=1,color="black")

Here is the plot:
In this figure, the black line represents the uniform probability distribution corresponding to a completely random generation of length 3 ticker symbols. The blue and red circles compare the Markov generator model to the true distribution of the second characters in length 3 ticker symbols.
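A visual comparison can be complemented with a single number. One common choice, which is our suggestion and not something used in the original analysis, is the total variation distance between two distributions. The sketch below defines a hypothetical helper `tvDistance` and runs it on toy distributions that are not taken from the real data.

```r
# Hypothetical helper: total variation distance between two named
# probability vectors, treating letters missing from one as probability 0
tvDistance <- function(p, q) {
  keys  <- union(names(p), names(q))
  pFull <- setNames(rep(0, length(keys)), keys)
  qFull <- pFull
  pFull[names(p)] <- p
  qFull[names(q)] <- q
  0.5 * sum(abs(pFull - qFull))
}

# Toy distributions, not taken from the real data
natural <- c(A = 0.5, B = 0.3, C = 0.2)
markov  <- c(A = 0.4, B = 0.3, D = 0.3)
uniform <- setNames(rep(1/26, 26), LETTERS)

print(tvDistance(natural, markov))
print(tvDistance(natural, uniform))
```

A distance of 0 means the two distributions are identical and 1 means they never overlap, so a good generator should sit much closer to the natural distribution than the uniform baseline does.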
