
Analyze party platforms in R the tidy way


Let’s face it: when it comes to politics, the United States is exceedingly polarized. What I would like to do is quantify that polarization. To do this, I will use the platform documents that each of the major political parties produces during presidential election years. I’ll be using the tidy packages (i.e., dplyr, tidyr, tidytext), and the data comes from Comparative Agendas, which collects and organizes data from archived sources to track policy outcomes across countries.

library(ggplot2)
library(tidytext)
library(dplyr)
library(wordcloud)
library(RColorBrewer)
library(tidyr)
library(scales)
library(stringr)
library(circlize)   # for chordDiagram() and circos.par() used later

platforms <- read.csv("US-parties-platforms.csv", header = TRUE, stringsAsFactors = FALSE)

dems <- read.csv("democratic-platform.csv", header = TRUE, stringsAsFactors = FALSE)

# create index for platform by year and unnest tokens
platform_words <- platforms %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, description)

# strip formatting artifacts such as "__" from the platform data by
# keeping only the alphabetic portion of each token
platform_words <- platform_words %>%
  mutate(word = str_extract(word, "[a-z]+"))

dem_words <- dems %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, description)

# remove stopwords
data(stop_words)

platform_words <- platform_words %>%
  anti_join(stop_words, by = "word")

dem_words <- dem_words %>%
  anti_join(stop_words, by = "word")

Now that the dataset is tokenized, with stop words removed, we can begin to analyze it. First, I want to look at the sentiment within each party’s platforms and how it progresses over time. For this, we will use the Bing sentiment lexicon from the tidytext package.

# get the Bing lexicon (in current tidytext, get_sentiments() is the
# recommended accessor) and compute sentiment over time
bing <- get_sentiments("bing")

# net sentiment per 45-line chunk of text, so each bar covers a
# comparable amount of platform prose
republicansent <- platform_words %>%
  inner_join(bing, by = "word") %>%
  count(year, index = linenumber %/% 45, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

demsent <- dem_words %>%
  inner_join(bing, by = "word") %>%
  count(year, index = linenumber %/% 45, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

We can then use ggplot2 to visualize the results and compare Democratic sentiment to Republican sentiment.

ggplot(republicansent, aes(index, sentiment)) +
  geom_col(fill="darkgoldenrod2") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment of Republican Party Platforms, 1948 - 2016",
       y = "Sentiment", x = "")

ggplot(demsent, aes(index, sentiment)) +
  geom_col(fill="darkgoldenrod2") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment of Democratic Party Platforms, 1948 - 2016",
       y = "Sentiment", x = "")

It appears that Democratic platforms are more likely to contain negative sentiments, especially in that dip in the center. So let’s find the Democratic platform with the highest proportion of negative sentiments. First, we get the list of negative words from the Bing lexicon. Then, we isolate the platforms by year and count the number of words in each. This will allow us to normalize our result based on the length of each platform. Finally, we count the negative words in each platform and divide by that platform’s total word count.

# find the most negative platform by year
# obtain the negative words from the Bing lexicon
bingneg <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

# isolate each year's platform and count words for each
wordtotal <- dem_words %>%
  group_by(year) %>%
  summarise(words = n())

# count negative words in each platform and divide by that platform's total
dem_neg <- dem_words %>%
  semi_join(bingneg, by = "word") %>%
  group_by(year) %>%
  summarise(negativewords = n()) %>%
  left_join(wordtotal, by = "year") %>%
  mutate(ratio = negativewords / words) 
dem_neg

# A tibble: 18 x 4
    year negativewords words  ratio
   <int>         <int> <int>  <dbl>
 1  1948            98  2041 0.0480
 2  1952           173  4255 0.0407
 3  1956           404  6277 0.0644
 4  1960           429  7765 0.0552
 5  1964           354  9737 0.0364
 6  1968           399  8125 0.0491
 7  1972           825 12688 0.0650
 8  1976           597 10174 0.0587
 9  1980           893 18964 0.0471
10  1984          1169 18041 0.0648
11  1988           133  2370 0.0561
12  1992           298  4141 0.0720
13  1996           503  9131 0.0551
14  2000           621 10931 0.0568
15  2004           459  8127 0.0565
16  2008           720 12501 0.0576
17  2012           612 13213 0.0463
18  2016           813 13380 0.0608

Indeed, the 1992 platform has the highest proportion of negative words, and the 1984 platform is among the most negative as well. Both fall within the Reagan-Bush years.
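We can confirm this by sorting the table (a quick check, not shown in the original output):

# rank the platforms by their share of negative words
dem_neg %>%
  arrange(desc(ratio)) %>%
  head(3)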

Word Frequencies

Now I want to look at the words embedded in the respective platforms to see if there is a different emphasis from one to the other. We can do this with the wordcloud package.

# get the most frequent words
rep_frequencies <- platform_words %>%
  count(word, sort = TRUE)

dem_frequencies <- dem_words %>%
  count(word, sort = TRUE)

# generate wordclouds based on frequencies
wordcloud(words = rep_frequencies$word, freq = rep_frequencies$n, 
          scale = c(3,.1), min.freq = 50,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

Word frequencies in Republican Party platforms

wordcloud(words = dem_frequencies$word, freq = dem_frequencies$n, 
          scale = c(3, .1), min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35, 
          colors = brewer.pal(8, "Dark2"))

Word frequencies in Democratic Party platforms

These seem to make sense. Republicans tend to be much more focused than Democrats on the size of government, while Democrats generally focus more on things like education, healthcare and civil rights.
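The word clouds show each party in isolation, though. As a further step (a sketch that goes beyond the original post, reusing the scales package loaded earlier), we can convert the two frequency tables to proportions and plot them against each other. Words near the dashed line are used at similar rates by both parties:

# combine the frequency tables and convert raw counts to proportions
frequency <- bind_rows(mutate(rep_frequencies, party = "Republican"),
                       mutate(dem_frequencies, party = "Democratic")) %>%
  group_by(party) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(party, proportion) %>%
  drop_na()   # keep only words that appear in both corpora

ggplot(frequency, aes(Democratic, Republican)) +
  geom_abline(lty = 2) +
  geom_text(aes(label = word), check_overlap = TRUE, size = 2.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  theme_minimal(base_size = 13)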

The Bing lexicon works fine for classifying terms as positive or negative, but there are others that we can use. The AFINN lexicon goes further and assigns each term an integer score from -5 (most negative) to 5 (most positive). Alternatively, we can use the NRC lexicon, which assigns terms to positive or negative sentiment and also to eight moods, or emotions (i.e., anger, anticipation, disgust, fear, joy, sadness, surprise, trust).
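For example, here is how the AFINN scores could be rolled up into a net score per platform (a sketch, not part of the original analysis; recent tidytext versions name the score column value and may prompt you to download the lexicon via the textdata package):

# net AFINN score per Democratic platform (higher = more positive language)
dem_afinn <- dem_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(year) %>%
  summarise(net_score = sum(value))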

So let’s use the NRC dictionary and then visualize the results with the chordDiagram() function from the circlize package. In order to create a chord diagram that isn’t too busy, we can group the platforms by decade.

# attach the NRC lexicon to the party platforms and group them by decade
# (integer division maps, e.g., 1948 to "1940s")
dem_nrc <- dem_words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate(decade = paste0(year %/% 10 * 10, "s"))

rep_nrc <- platform_words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate(decade = paste0(year %/% 10 * 10, "s"))

# count mood words per decade; chordDiagram() expects from, to, value columns
dem_decade <- dem_nrc %>%
  count(decade, sentiment, name = "sentiment_sum")

rep_decade <- rep_nrc %>%
  count(decade, sentiment, name = "sentiment_sum")

# visualize platforms using nrc lexicon
cols <- brewer.pal(8, "Dark2")
grid.col = c("1940s" = cols[1], "1950s" = cols[2], "1960s" = cols[3], 
             "1970s" = cols[4], "1980s" = cols[5], "1990s" = cols[6],
             "2000s" = cols[7], "2010s" = cols[8], "anger" = "grey", 
             "anticipation" = "grey", "disgust" = "grey", "fear" = "grey", 
             "joy" = "grey", "sadness" = "grey", "surprise" = "grey", 
             "trust" = "grey")

# set gaps for the Democrats' platforms: 5 degrees between sectors within
# each group (decades, then moods), 15 degrees between the two groups
circos.par(gap.after = c(rep(5, length(unique(dem_decade$decade)) - 1), 15,
                         rep(5, length(unique(dem_decade$sentiment)) - 1), 15))

chordDiagram(dem_decade, grid.col = grid.col, transparency = .2)
title("Mood of Democratic Platforms by Decade")

# clear and reset gaps for the Republicans' platforms
circos.clear()
circos.par(gap.after = c(rep(5, length(unique(rep_decade$decade)) - 1), 15,
                         rep(5, length(unique(rep_decade$sentiment)) - 1), 15))

chordDiagram(rep_decade, grid.col = grid.col, transparency = .2)
title("Mood of Republican Platforms by Decade")

Trust is very clearly the dominant mood in both parties’ platforms. That’s not all that surprising, since all politicians want people to trust them.
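We can verify that impression with a quick tally (not in the original post):

# total mentions per mood across all decades, Democratic platforms
dem_decade %>%
  group_by(sentiment) %>%
  summarise(total = sum(sentiment_sum)) %>%
  arrange(desc(total))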
