Analyze party platforms in R the tidy way
Let’s face it: when it comes to politics, this country is exceedingly polarized. What I would like to do is quantify that polarization. To do this, I will use the platform documents that each of the political parties creates during presidential election years. I’ll be using the tidy packages (i.e., dplyr, tidyr, tidytext), and the data comes from Comparative Agendas, which collects and organizes data from archived sources to track policy outcomes across countries.
library(ggplot2)
library(tidytext)
library(topicmodels)
library(dplyr)
library(wordcloud)
library(RColorBrewer)
library(tidyr)
library(scales)
library(stringr)

platforms <- read.csv("US-parties-platforms.csv", header = TRUE, stringsAsFactors = FALSE)
dems <- read.csv("democratic-platform.csv", header = TRUE, stringsAsFactors = FALSE)

# create index for platform by year and unnest tokens
platform_words <- platforms %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, description)

# get rid of "__" in platform data
platform_words <- platform_words %>%
  mutate(word = str_extract(word, "[a-z]+"))

dem_words <- dems %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, description)

# remove stopwords
data(stop_words)

platform_words <- platform_words %>%
  anti_join(stop_words, by = "word")

dem_words <- dem_words %>%
  anti_join(stop_words, by = "word")
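To make the tokenizing and stop-word steps concrete, here is a minimal sketch on a toy data frame. The two sentences are made up for illustration and are not taken from the platform data; only the column names (year, description) mirror the code above.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the platform data (made-up text, not from the dataset)
toy <- tibble(
  year = c(1948L, 1952L),
  description = c("We pledge to protect the freedom of the people.",
                  "The party supports a strong and growing economy.")
)

toy_words <- toy %>%
  mutate(linenumber = row_number()) %>%   # one index per source row
  unnest_tokens(word, description) %>%    # one lowercased word per row
  anti_join(stop_words, by = "word")      # drop common words such as "the" and "of"

toy_words
```

Each remaining row keeps its year and linenumber, which is what makes the later grouping by year and by index possible.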
Now that the dataset is tokenized, with stop words removed, we can begin to analyze it. First, I want to look at the sentiment within each party’s platforms and how it progresses over time. For this, we will use the Bing sentiment dictionary from the tidytext package.
# get and plot sentiment over time
# get_sentiments("bing") returns the word and sentiment columns of the Bing lexicon
bing <- get_sentiments("bing")

republicansent <- platform_words %>%
  inner_join(bing, by = "word") %>%
  count(year, index = linenumber %/% 45, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

demsent <- dem_words %>%
  inner_join(bing, by = "word") %>%
  count(year, index = linenumber %/% 45, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
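To see what the bucketing by linenumber %/% 45 and the spread() step produce, here is a small sketch. The words and line numbers are invented for illustration; only the pipeline mirrors the code above.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up words with line numbers, standing in for platform_words
toy_words <- tibble(
  year = 1948L,
  linenumber = c(1L, 2L, 44L, 45L, 46L, 90L),
  word = c("good", "bad", "strong", "fail", "success", "corrupt")
)

toy_sent <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%      # keep words found in the lexicon
  count(year, index = linenumber %/% 45, sentiment) %>%    # lines 0-44 -> 0, 45-89 -> 1, ...
  spread(sentiment, n, fill = 0) %>%                       # one column per sentiment label
  mutate(sentiment = positive - negative)                  # net sentiment per bucket

toy_sent
```

Integer division groups every 45 consecutive rows into one bucket, so each bar in the plots is the net count of positive minus negative words for that bucket.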
And we can use ggplot2 to visualize the results and compare Democratic sentiment to Republican sentiment.
ggplot(republicansent, aes(index, sentiment)) +
  geom_col(fill = "darkgoldenrod2") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment of Republican Party Platforms, 1948 - 2016",
       y = "Sentiment", x = "")

ggplot(demsent, aes(index, sentiment)) +
  geom_col(fill = "darkgoldenrod2") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment of Democratic Party Platforms, 1948 - 2016",
       y = "Sentiment", x = "")
It appears that Democratic platforms are more likely to contain negative sentiments, especially in that dip in the center. So let’s find the Democratic platform with the highest proportion of negative sentiments. First, we get the list of negative words from the Bing dictionary. Then, we isolate the platforms by year and count the number of words in each. This will allow us to normalize our result based on the length of each platform. Finally, we count the negative words in each platform and divide by the total number of words in that platform.
# find the most negative platform by year
# obtain the negative words from bing
bingneg <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

# isolate each year's platform and count words for each
wordtotal <- dem_words %>%
  group_by(year) %>%
  summarise(words = n())

# count negative words in each platform and divide by total per platform
dem_neg <- dem_words %>%
  semi_join(bingneg, by = "word") %>%
  group_by(year) %>%
  summarise(negativewords = n()) %>%
  left_join(wordtotal, by = "year") %>%
  mutate(ratio = negativewords / words)

dem_neg
# A tibble: 18 x 4
    year negativewords words  ratio
   <int>         <int> <int>  <dbl>
 1  1948            98  2041 0.0480
 2  1952           173  4255 0.0407
 3  1956           404  6277 0.0644
 4  1960           429  7765 0.0552
 5  1964           354  9737 0.0364
 6  1968           399  8125 0.0491
 7  1972           825 12688 0.0650
 8  1976           597 10174 0.0587
 9  1980           893 18964 0.0471
10  1984          1169 18041 0.0648
11  1988           133  2370 0.0561
12  1992           298  4141 0.0720
13  1996           503  9131 0.0551
14  2000           621 10931 0.0568
15  2004           459  8127 0.0565
16  2008           720 12501 0.0576
17  2012           612 13213 0.0463
18  2016           813 13380 0.0608
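Indeed, the 1992 Democratic platform has the highest share of negative words (7.2%), and 1984 is close behind at 6.5%, although 1972 (6.5%) and 1956 (6.4%) score about as high and 1988 (5.6%) sits nearer the middle of the range. The 1984, 1988, and 1992 platforms were written during the Reagan-Bush years.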
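To rank the platforms rather than scan the table by eye, the ratios can be sorted. The sketch below rebuilds the data frame from the figures printed above so that it runs on its own; with the real data you would sort dem_neg directly.

```r
library(dplyr)

# Figures copied from the printed tibble above
dem_neg <- tibble(
  year = seq(1948, 2016, by = 4),
  negativewords = c(98, 173, 404, 429, 354, 399, 825, 597, 893,
                    1169, 133, 298, 503, 621, 459, 720, 612, 813),
  words = c(2041, 4255, 6277, 7765, 9737, 8125, 12688, 10174, 18964,
            18041, 2370, 4141, 9131, 10931, 8127, 12501, 13213, 13380)
) %>%
  mutate(ratio = negativewords / words)

# Sort from most to least negative
ranked <- dem_neg %>%
  arrange(desc(ratio))

head(ranked, 5)
```

The 1992 platform sorts first, which matches the largest value in the ratio column.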
Word Frequencies
Now I want to look at the words embedded in the respective platforms to see if there is a different emphasis from one to the other. We can do this with the wordcloud package.
# get the most frequent words
rep_frequencies <- platform_words %>%
  count(word, sort = TRUE)

dem_frequencies <- dem_words %>%
  count(word, sort = TRUE)

# generate wordclouds based on frequencies
wordcloud(words = rep_frequencies$word, freq = rep_frequencies$n,
          scale = c(3, .1), min.freq = 50, max.words = 200,
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Word frequencies in Republican Party platforms
wordcloud(words = dem_frequencies$word, freq = dem_frequencies$n,
          scale = c(3, .1), min.freq = 1, max.words = 200,
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Word frequencies in Democratic Party platforms
These seem to make sense. Republicans tend to be much more focused than Democrats on the size of government, while Democrats generally focus more on things like education, healthcare and civil rights.
The Bing lexicon works fine for determining positive and negative sentiment, but there are others that we can use. The AFINN lexicon, rather than labelling terms as positive or negative, assigns each term a score that ranges from -5 to 5. Alternatively, we can use the NRC lexicon, which labels terms as positive or negative and also classifies them according to 8 moods, or emotions (i.e., anger, anticipation, disgust, fear, joy, sadness, surprise, trust).
So let’s use the NRC dictionary and then visualize the results with the chordDiagram() function from the circlize package. In order to create a chord diagram that isn’t too busy, we can group the platforms by decade.
# circlize provides circos.par(), chordDiagram() and circos.clear()
library(circlize)

# attach nrc lexicon to party platforms and group them
dem_nrc <- dem_words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate(decade = ifelse(year %in% 1948, "1940s",
                  ifelse(year %in% 1950:1959, "1950s",
                  ifelse(year %in% 1960:1969, "1960s",
                  ifelse(year %in% 1970:1979, "1970s",
                  ifelse(year %in% 1980:1989, "1980s",
                  ifelse(year %in% 1990:1999, "1990s",
                  ifelse(year %in% 2000:2009, "2000s", "2010s"))))))))

rep_nrc <- platform_words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate(decade = ifelse(year %in% 1948, "1940s",
                  ifelse(year %in% 1950:1959, "1950s",
                  ifelse(year %in% 1960:1969, "1960s",
                  ifelse(year %in% 1970:1979, "1970s",
                  ifelse(year %in% 1980:1989, "1980s",
                  ifelse(year %in% 1990:1999, "1990s",
                  ifelse(year %in% 2000:2009, "2000s", "2010s"))))))))

# set proportionality of moods
dem_decade <- dem_nrc %>%
  count(sentiment, decade) %>%
  group_by(decade, sentiment) %>%
  summarise(sentiment_sum = sum(n)) %>%
  ungroup()

rep_decade <- rep_nrc %>%
  count(sentiment, decade) %>%
  group_by(decade, sentiment) %>%
  summarise(sentiment_sum = sum(n)) %>%
  ungroup()

# visualize platforms using nrc lexicon
cols <- brewer.pal(8, "Dark2")
grid.col <- c("1940s" = cols[1], "1950s" = cols[2], "1960s" = cols[3],
              "1970s" = cols[4], "1980s" = cols[5], "1990s" = cols[6],
              "2000s" = cols[7], "2010s" = cols[8],
              "anger" = "grey", "anticipation" = "grey", "disgust" = "grey",
              "fear" = "grey", "joy" = "grey", "sadness" = "grey",
              "surprise" = "grey", "trust" = "grey")

# set gaps for dems' platforms
circos.par(gap.after = c(rep(5, length(unique(dem_decade[[1]])) - 1), 15,
                         rep(5, length(unique(dem_decade[[2]])) - 1), 15))
chordDiagram(dem_decade, grid.col = grid.col, transparency = .2)
title("Mood of Democratic Platforms by Decade")
# clear and reset gaps for reps' platforms
circos.clear()
circos.par(gap.after = c(rep(5, length(unique(rep_decade[[1]])) - 1), 15,
                         rep(5, length(unique(rep_decade[[2]])) - 1), 15))
# plot the Republican data (rep_decade) under the Republican title
chordDiagram(rep_decade, grid.col = grid.col, transparency = .2)
title("Mood of Republican Platforms by Decade")
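As an aside, the decade labels could probably be derived arithmetically instead of with nested ifelse() calls. This is only a sketch of an alternative, assuming year is always a four-digit number; it is not what the code above does, but it should give the same labels for the election years 1948 through 2016.

```r
# Presidential election years covered by the platforms
years <- seq(1948, 2016, by = 4)

# Integer division drops the last digit; multiplying by 10 restores the decade
decade <- paste0(years %/% 10 * 10, "s")

# Number of platforms that fall in each decade
table(decade)
```

Inside the pipelines above, this would amount to mutate(decade = paste0(year %/% 10 * 10, "s")).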
Trust is very clearly the most dominant mood in the party platforms. That’s not all that surprising, since all politicians want people to trust them.
The post Analyze party platforms in R the tidy way appeared first on my (mis)adventures in R programming.