
Mining Sent Email for Self-Knowledge


How can we use data analytics to increase our self-knowledge? Along with biofeedback from digital devices like FitBit, less structured sources such as sent emails can provide insights.

For example, here it seems my communication took a sudden, more positive turn in 2013. Let’s see what else shakes out of my sent email corpus.


In Snakes in a Package: combining Python and R with reticulate, Adnan Fiaz uses a download of personal Gmail from Google Takeout to extract R-bloggers post counts from subject lines. To handle Gmail’s choice of the mbox file format, rather than write a new R package to parse mbox files, he uses reticulate to import a Python package, mailbox. His approach seems a great use case for reticulate: when you want to take advantage of a highly developed Python package from R.

Loading Email Corpus into R

I wanted to mine my own emails for sentiment and see if I could learn anything about myself. Has my sent mail shown signs of mood trends over time? I started by following his example:

library(tidyverse)
library(stringr)
library(tidytext)
library(lubridate)
library(reticulate)
mailbox <- import("mailbox")

sent <- mailbox$mbox("Sent-001.mbox")
message <- sent$get_message(11L)
message$get("Date")
# [1] "Mon, 23 Jul 2018 20:01:33 -0700"
message$get("Subject")
# [1] "Re: Ptfc schedules"

Loading in email #11, I can see it’s about Portland Football Club’s schedule. I wanted to see the body of the email, but found the normal built-in documentation doesn’t exist for Python modules:

?get_message
# No documentation for ‘get_message’ in specified packages and libraries:
# you could try ‘??get_message’
?mailbox
# No documentation for ‘mailbox’ in specified packages and libraries:
# you could try ‘??mailbox’
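As an aside, reticulate can surface the Python docstrings directly with py_help(), which is probably the easier route here (a quick sketch):

# pull up the Python docstring for the mailbox module's mbox class
py_help(mailbox$mbox)

# or for the bound method on the opened mailbox
py_help(sent$get_message)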

Returning message prints the whole thing, but with a lot of additional, unneeded formatting. So I worked around it with nested sub() and gsub() calls on specific example emails to get down to only the text I wrote and sent.

It starts with this already-difficult-to-understand call:

sub(".*Content-Transfer-Encoding: quoted-printable", "", 
  gsub("=E2=80=99", "'", 
  gsub(">", "", 
  sub("On [A-Z][a-z]{2}.*", "", 
  gsub("\n|\t", " ", 
  message)))))

And, after much guess-try-see-what’s-left-and-add-another-sub(), I ended up with this ugly function that does a semi-reasonable job for my goal of sentiment analysis:

parse_sent_message <- function(email){
  substr(
    gsub("-top:|-bottom:|break-word","",
    sub("Content-Type: application/pdf|Mime-Version: 1.0.*","",
    sub(".*charset ISO|charset  UTF-8|charset us-ascii","",
    sub(".*Content-Transfer-Encoding: 7bit", "", 
    sub("orwarded message.*", "", 
    gsub("=|\"", " ", 
    gsub("  ", " ", 
    gsub("= ", "", 
    sub(".*Content-Transfer-Encoding: quoted-printable", "", 
    sub(".*charset=UTF-8", "", 
    gsub("=E2=80=99|&#39;", "'", 
    gsub(">|<", "", 
    sub("On [A-Z][a-z]{2}.*", "",
    gsub("\n|\t|<div|</div>|<br>", " ", 
    email))))))))))))))), 
  1, 10000)
}

parse_sent_message(message)
# [1] " Hey aren't you planning to go to Seattle the 16th? Trying to figure out my days off schedule    "

That original version is good to go. I tried using the R mailman wrapper, but ran into issues, so I went back to the imported mailbox module. Importing and parsing took a few minutes:

message$get("From") # check this email index 11 if from my email address
myemail <- message$get("From") # since it is, save as myemail to check the rest

keys <- sent$keys()
# keys <- keys[1:3000] # uncomment if you want to run the below on a subset to see if it works
number_of_messages <- length(keys)

pb <- utils::txtProgressBar(max=number_of_messages)
sent_messages <- data_frame(sent_date = rep(NA_character_, number_of_messages),
                            text = rep(NA_character_, number_of_messages))

for(i in seq_along(keys)){
  message <- sent$get_message(keys[i])
  if(is.character(message$get("From"))){
    if (message$get("From") %in% myemail){
      sent_messages[i, 1] <- message$get("Date")
      sent_messages[i, 2] <- parse_sent_message(message)
    }
  }
  utils::setTxtProgressBar(pb, i)
}
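As an aside, roughly the same extraction could be sketched with purrr::map_dfr() in place of the preallocated loop, assuming every matching message has a Date header (the progress bar is dropped):

# one row per key; NAs when the message isn't from my address
sent_messages_alt <- purrr::map_dfr(keys, function(k){
  m <- sent$get_message(k)
  from <- m$get("From")
  if (is.character(from) && from %in% myemail) {
    data_frame(sent_date = m$get("Date"), text = parse_sent_message(m))
  } else {
    data_frame(sent_date = NA_character_, text = NA_character_)
  }
})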

If the message is not from me, it is saved as NA. What percent of mail flagged “sent” was not from myemail?

sum(is.na(sent_messages$text)) / number_of_messages
# [1] 0.6664132

67%. Removing them and doing some additional processing, I can see these 11,093 remaining sent emails range from November of 2004 to September of 2018, with a median date of October of 2013.

sent_messages <- 
  sent_messages %>%
  filter(!is.na(text))

sent_messages <- 
  sent_messages %>% 
  mutate(sent_date = dmy_hms(sent_date))

# remove duplicates per month
sent_messages <- 
  sent_messages %>%
  mutate(year_sent = year(sent_date),
         month_sent = month(sent_date)) %>% 
  group_by(year_sent, month_sent, text) %>% 
  top_n(1, wt = sent_date) %>% 
  ungroup()
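A near-equivalent way to sketch that deduplication is distinct(), which keeps the first copy in each month rather than the latest:

# one row per unique (year, month, text) combination, keeping the first copy
sent_messages %>%
  distinct(year_sent, month_sent, text, .keep_all = TRUE)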

sent_messages %>% 
  summary()
#   sent_date                       text             year_sent      month_sent    
# Min.   :2004-11-10 01:42:04   Length:11093       Min.   :2004   Min.   : 1.000  
# 1st Qu.:2010-07-17 20:39:10   Class :character   1st Qu.:2010   1st Qu.: 3.000  
# Median :2013-10-01 22:12:08   Mode  :character   Median :2013   Median : 6.000  
# Mean   :2013-03-24 10:55:30                      Mean   :2013   Mean   : 6.416  
# 3rd Qu.:2015-09-18 19:45:21                      3rd Qu.:2015   3rd Qu.: 9.000  
# Max.   :2018-09-30 01:35:02                      Max.   :2018   Max.   :12.000    

While the median date coming a bit later than the chronological midpoint seemingly implies slightly more emails in later years, judging from the chart above it’s probably due more to missing years of data.
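A quick count by year makes those gaps easier to see (a rough sketch):

# sent emails per year, to show which years are thin or missing entirely
sent_messages %>%
  count(year_sent) %>%
  ggplot(aes(x = factor(year_sent), y = n)) +
  geom_col() +
  labs(x = NULL, y = "sent emails")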

Sentiment Analysis

Julia Silge and David Robinson have put together an excellent online reference on text mining, Text Mining with R, so with some slight work I can follow their analyses with my email data. Using their tidytext package, I quickly see a lot of HTML formatting tags still made it past my gsub() gauntlet.

tidy_emails <- 
  sent_messages %>%
  unnest_tokens(word, text)
  
tidy_emails
# # A tibble: 886,870 x 4
#    sent_date           year_sent month_sent word     
#    <dttm>                  <dbl>      <dbl> <chr>    
#  1 2018-09-27 16:30:19      2018          9 htmlbodyp
#  2 2018-09-27 16:30:19      2018          9 style    
#  3 2018-09-27 16:30:19      2018          9 margin   
#  4 2018-09-27 16:30:19      2018          9 0px      
#  5 2018-09-27 16:30:19      2018          9      
#  6 2018-09-27 16:30:19      2018          9 stretch  
#  7 2018-09-27 16:30:19      2018          9 normal   
#  8 2018-09-27 16:30:19      2018          9      
#  9 2018-09-27 16:30:19      2018          9 size     
# 10 2018-09-27 16:30:19      2018          9 12px     
# # ... with 886,860 more rows

In fact, after common stop words are removed, I can see a need to add a few more:

data(stop_words)

tidy_emails <- 
  tidy_emails %>%
  anti_join(stop_words)

tidy_emails %>%
  count(word, sort = TRUE) 
# # A tibble: 129,528 x 2
#    word                                                                             n
#    <chr>                                                                        <int>
#  1 3d                                                                            8433
#  2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  7620
#  3 content                                                                       4086
#  4 dan                                                                           3487
#  5 1                                                                             3451
#  6                                                                           2735
#  7 type                                                                          2695
#  8 style                                                                         2535
#  9 nbsp                                                                          2495
# 10 class                                                                         2451
# # ... with 129,518 more rows  

Maybe the

nchar("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
# [1] 76

76 a’s in a row come from <a href= links getting consolidated by something in the gsub()s.
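Rather than hard-coding that exact string, any token that is just one character repeated could be dropped with a short regex (a sketch):

# drop tokens that are a single character repeated, e.g. "aaaa..." or "zz"
tidy_emails %>%
  filter(!str_detect(word, "^(.)\\1+$"))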

Adding these less useful terms to create an email stop words dictionary:

email_stop_words <- 
  stop_words %>% 
  rbind(
    data_frame("word" = c(seq(0,9), "3d", "8a", "mail.gmail.com", "wa", "aa", "content", "dir",
                          "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
                          "ad", "af", "", "type", "auto", "zz", "ae", "zx", "id", "ai",
                          "style", "nbsp", "class", "span", "http", "text", "gmail.com", 
                          "plain", "0px", "size", "color", "quot", "8859", "href", "margin", "ltr", 
                          "left", "disposition", "attachment", "padding", "rgba", "webkit", "https"),
               "lexicon" = "sent_email")
  )  

# just remove all words with fewer than 3 letters
tidy_emails <- 
  tidy_emails %>%
  anti_join(email_stop_words) %>% 
  filter(nchar(word) >= 3)

tidy_emails %>%
  count(word, sort = TRUE) %>%
  top_n(n = 10, wt = n) %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

I can see some unsurprising name-related common terms as well as “lol” and “hey”. But surprisingly, “time”, “meeting”, “week”, and “people” also show up a lot. I wonder if those are unusual. (I would need another sent mail corpus to compare.)

What are my top joy words in email?

nrc_joy <- 
  get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_emails %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
# # A tibble: 373 x 2
#    word        n
#    <chr>   <int>
#  1 art       531
#  2 feeling   389
#  3 hope      387
#  4 found     318
#  5 pretty    286
#  6 true      267
#  7 pay       229
#  8 money     218
#  9 friend    209
# 10 love      203
# # ... with 363 more rows

Hm, I only partially agree with this list. “Art” is a friend I email frequently. “Feeling” is a slight positive, but more neutral than a joy word per se. “Hope” seems to be the most common word here I’d actually agree with, across 2004 to 2018.
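One fix would be to treat names as stop words too and re-run the join (a quick sketch; I’ll leave the lexicon alone for the rest of this post):

# treat first names as stop words before recounting joy words
name_stop_words <- data_frame(word = c("art", "dan"), lexicon = "names")

tidy_emails %>%
  anti_join(name_stop_words, by = "word") %>%
  inner_join(nrc_joy, by = "word") %>%
  count(word, sort = TRUE)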

How does sentiment look over time? Grouping by month:

email_sentiment <- 
  tidy_emails %>%
  mutate(year_sent = year(sent_date),
         month_sent = month(sent_date)) %>% 
  inner_join(get_sentiments("bing")) %>%
  count(year_sent, month_sent, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(email_sentiment, aes(month_sent, sentiment, fill = as.factor(year_sent))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~year_sent, ncol = 2) 

2005, 2013, 2015, and 2016 look like more positive-sentiment sent mail years. 2009 and 2011 look more negative overall. Weirdly, much of 2006, 2007, and 2008 is missing.

I also see an apparently highly negative month in August of 2009.

# whoa, what happened in August of 2009?
sent_messages %>% 
  filter(sent_date >= "2009-08-01", sent_date <= "2009-08-31") %>% 
  write.csv("temp.csv")

tidy_emails %>%
  mutate(year_sent = year(sent_date),
         month_sent = month(sent_date)) %>% 
  inner_join(get_sentiments("bing")) %>%
  filter(year_sent == 2009, month_sent == 08) %>% 
  count(word, sentiment, sort = TRUE) 
# # A tibble: 237 x 3
#    word       sentiment     n
#    <chr>      <chr>     <int>
#  1 pain       negative     35
#  2 happiness  positive     21
#  3 sting      negative     21
#  4 happy      positive     12
#  5 stinging   negative     12
#  6 depression negative     11
#  7 free       positive     11
#  8 bad        negative      9
#  9 damage     negative      9
# 10 venom      negative      9
# # ... with 227 more rows

Was it a bad breakup? Digging into my emails, I can find a New York Times Magazine article copy-and-pasted and sent to several people. The article, “Oh, Sting, Where Is Thy Death?” by Richard Conniff, mentions the pain of stinging insects and its relevance to happiness research. Note most of those ns are divisible by 3.
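One way to guard against that kind of multiple counting would be to keep just one copy of any identical email body, regardless of recipient or date (a sketch; I haven’t re-run the analysis this way):

# keep a single copy of identical email text, so an article pasted and sent
# to three people only counts once
sent_messages %>%
  distinct(text, .keep_all = TRUE)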

Most Common Charged Words

If taking all the emotionally charged words and seeing what comes out most often, both surprises and expected outcomes show up:

bing_word_counts <- 
  tidy_emails %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
# # A tibble: 2,143 x 3
#    word    sentiment     n
#    <chr>   <chr>     <int>
#  1 cool    positive    481
#  2 nice    positive    456
#  3 free    positive    445
#  4 bad     negative    308
#  5 pretty  positive    286
#  6 retreat negative    239
#  7 solid   positive    230
#  8 fine    positive    222
#  9 hard    negative    219
# 10 worth   positive    207
# # ... with 2,133 more rows

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

I’m surprised to see how much more often positive words show up than negative words; the Bing lexicon’s composition could play a role there. “Bad” as the top negative word seems like a bad top word. “Issue” is definitely a word I have an issue with using a bad amount of the time. But it’s cool to see how much I use “cool” (or is it bad? this is causing anxiety). Anyway, I think this is a solid view, worth the time to get a nice feeling for the top words I love to use in email.
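The lexicon’s balance is easy to check directly:

# how many positive vs. negative entries the Bing lexicon actually contains
get_sentiments("bing") %>%
  count(sentiment)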

Obligatory Wordcloud

Is it easier to read than the above? Nah, but it must be included in any text mining blog post, so…

library(wordcloud)

tidy_emails %>%
  anti_join(email_stop_words) %>%
  filter(nchar(word) >= 3) %>% 
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

library(reshape2)

tidy_emails %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

Hope that was cool 🙂
