Text Mining 40 Years of Warren Buffett’s Letters to Shareholders
Susan Li
[This article was first published on Susan Li | Data Ninja, and kindly contributed to R-bloggers].
Warren Buffett released the most recent version of his annual letter to Berkshire Hathaway shareholders a couple of months ago. I had just read a post about a sentiment analysis of Mr. Buffett’s annual shareholder letters, and I am also learning text mining with R, so I thought it would be a great opportunity to put my latest skills into practice: text mining 40 years of Warren Buffett’s letters to shareholders.
The code I used here to download all the letters was borrowed from Michael Toth.
Now I am ready to use “unnest_tokens” to split the dataset (all the letters) into tokens and remove stop words.
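A minimal sketch of this tokenizing step, assuming the letters sit in a data frame with one row per letter. The `letters_df` object and its two sentences are hypothetical stand-ins, not text from the real letters; `unnest_tokens()` and the `stop_words` dataset come from the tidytext package.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for the downloaded letters: one row per letter.
letters_df <- tibble(
  year = c(1977, 1978),
  text = c("Our insurance business had an outstanding year.",
           "The textile business recorded a loss this year.")
)

tidy_letters <- letters_df %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word")   # drop common stop words

# Most common remaining words across all letters
word_counts <- tidy_letters %>%
  count(word, sort = TRUE)

# Most common remaining words within each year
word_counts_by_year <- tidy_letters %>%
  count(year, word, sort = TRUE)
```

Because `unnest_tokens()` keeps the other columns (here `year`), the same tidy table can feed both the overall and the per-year word counts.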
The most common words throughout 40 years of letters
The most common words each year
Sentiment by Year
Examine how often positive and negative words occurred in these letters. Which years were the most positive or negative overall?
The AFINN lexicon provides a positivity score for each word, from -5 (most negative) to 5 (most positive). Here I calculate the average sentiment score for each year.
Warren Buffett is known for his long-term, optimistic economic outlook. Only 1 out of 40 letters appeared negative. Berkshire’s loss in net worth during 2001 was $3.77 billion; in addition, the 9/11 terrorist attacks contributed to the negative sentiment score in that year’s letter.
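The yearly average can be sketched as a join followed by a grouped mean. To keep the example self-contained, it uses a hypothetical four-word lexicon (`mini_afinn`) and made-up tokens rather than the real AFINN data. With the full lexicon, the join table would instead come from `get_sentiments("afinn")`, which in recent tidytext versions relies on the textdata package and names the score column `value` (older versions used `score`).

```r
library(dplyr)
library(tibble)

# Hypothetical mini-lexicon in the shape of AFINN (word + numeric score).
# The scores are illustrative only.
mini_afinn <- tibble(
  word  = c("outstanding", "superb", "loss", "abandon"),
  value = c(5, 5, -3, -2)
)

# Hypothetical tokenized letters: one row per (year, word).
tidy_letters <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  word = c("outstanding", "loss", "business", "loss", "abandon", "insurance")
)

sentiment_by_year <- tidy_letters %>%
  inner_join(mini_afinn, by = "word") %>%   # keep only words that have a score
  group_by(year) %>%
  summarise(avg_score = mean(value), .groups = "drop")
```

Words without a lexicon entry (such as "business" here) are dropped by the inner join, so the average is taken over scored words only.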
Sentiment Analysis by Words
Examine the total positive and negative contributions of each word.
For example, the word “abandon” appeared 4 times and contributed a total score of -8.
The word “outstanding” made the most positive contribution, and the word “loss” made the most negative contribution.
Next, we look for the word with the highest positive score in each letter: “outstanding” came out on top in eight out of ten letters.
Unsurprisingly, in seven out of ten letters the word “loss” had the most negative score.
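A word's total contribution is simply its count multiplied by its lexicon score. The sketch below reproduces the “abandon” arithmetic (4 occurrences at -2 each) on hypothetical data; the counts for the other words and the `mini_afinn` table are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical mini-lexicon in the shape of AFINN; scores are illustrative.
mini_afinn <- tibble(
  word  = c("outstanding", "loss", "abandon"),
  value = c(5, -3, -2)
)

# Hypothetical tokens: "abandon" appears 4 times, as in the example above.
tidy_letters <- tibble(
  word = c(rep("abandon", 4), rep("outstanding", 3), rep("loss", 6))
)

contributions <- tidy_letters %>%
  count(word) %>%                            # occurrences per word
  inner_join(mini_afinn, by = "word") %>%
  mutate(contribution = n * value) %>%       # total score contributed
  arrange(contribution)                      # most negative first
```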
While text mining Google Finance articles a few days ago, I learned about another sentiment lexicon, “loughran”, which was developed based on analyses of financial reports. The Loughran dictionary divides words into six sentiments: “positive”, “negative”, “litigious”, “uncertainty”, “constraining”, and “superfluous”. I can’t wait to apply this dictionary to Buffett’s letters.
The assignments of words to sentiments look reasonable. However, “outstanding” and “superb” no longer appear under the positive sentiment.
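Applying a categorical lexicon like this follows the same join pattern, except that words are counted per sentiment label instead of being scored. The sketch uses a hypothetical four-word table (`mini_loughran`) and made-up tokens; with the real dictionary, the table would come from `get_sentiments("loughran")`.

```r
library(dplyr)
library(tibble)

# Hypothetical mini-lexicon in the shape of the Loughran dictionary
# (word + sentiment label). Entries are illustrative only.
mini_loughran <- tibble(
  word      = c("loss", "litigation", "uncertain", "gain"),
  sentiment = c("negative", "litigious", "uncertainty", "positive")
)

# Hypothetical tokens from the letters.
tidy_letters <- tibble(
  word = c("loss", "loss", "gain", "outstanding",
           "uncertain", "litigation", "loss")
)

loughran_counts <- tidy_letters %>%
  inner_join(mini_loughran, by = "word") %>%   # words missing from the lexicon drop out
  count(sentiment, word, sort = TRUE)
```

In this toy lexicon “outstanding” has no entry, so it disappears after the join, which mirrors the behaviour described above.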
Relationship Between Words
Now comes the most interesting part. By tokenizing text into consecutive sequences of words, we can examine how often one word is followed by another. We can then study the relationship between words.
In this case, I define a list of six negation words (“don’t”, “not”, “no”, “can’t”, “won’t” and “without”) and visualize the sentiment-associated words that most often follow them.
It looks like the largest sources of misidentifying a word as positive are “no matter”, “no better”, “not worth” and “not good”, and the largest sources of incorrectly classified negative sentiment are “no debt”, “no problem” and “not charged”.
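A rough sketch of the bigram approach, assuming one row of raw text per letter. The sentence in `letters_df` and the `mini_afinn` scores are hypothetical; `unnest_tokens()` with `token = "ngrams"` is from tidytext and `separate()` is from tidyr.

```r
library(dplyr)
library(tibble)
library(tidyr)
library(tidytext)

# Hypothetical letter text and mini-lexicon; scores are illustrative.
letters_df <- tibble(
  year = 2001,
  text = "We have no debt and no problem. This is not good."
)
mini_afinn <- tibble(
  word  = c("debt", "problem", "good"),
  value = c(-2, -2, 3)
)

negation_words <- c("don't", "not", "no", "can't", "won't", "without")

negated <- letters_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%     # consecutive word pairs
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% negation_words) %>%                        # pairs that start with a negation
  inner_join(mini_afinn, by = c("word2" = "word")) %>%         # score the word that follows
  count(word1, word2, value, sort = TRUE) %>%
  mutate(contribution = n * value)                             # size of the misclassification
```

Each row of `negated` is a scored word whose sentiment is probably reversed by the negation in front of it, so `contribution` indicates how much that pair pushed the overall score in the wrong direction.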
Source code that created this post can be found here. I am happy to hear any feedback or questions.