A Rather Nosy Topic Model Analysis of the Enron Email Corpus
Having only ever played with Latent Dirichlet Allocation using gensim in Python, I was very interested to see a nice example of this kind of topic modelling in R. Whenever I see a really cool analysis done, I get the urge to do it myself. What better corpus to do topic modelling on than the Enron email dataset? Let me tell you, this thing is a monster! According to the website I got it from, it contains about 500k messages from 151 users, mostly senior management, and is organized into user folders. I didn’t want to feed everything into my analysis, so I decided to look only at messages contained within the “sent” or “sent items” folders.
Being a big advocate of R, I really tried to do all of the processing and analysis in R, but it was just too difficult and was taking up more time than I wanted. So I dusted off my Python skills (thank you grad school!) and did the bulk of the data processing/preparation in Python, and the text mining in R. Following is the code (hopefully well enough commented) that I used to process the corpus in Python:
After having seen Python’s performance in rifling through these Enron emails, I was very impressed! It quickly created a directory with the largest number of files I’d ever seen on my computer!
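As a rough illustration of this kind of preprocessing (a minimal sketch, not the gist referred to above), the following walks a mail directory, keeps only messages in sent folders, and writes each message body to its own text file. The directory layout `<root>/<user>/<folder>/<message file>`, the folder names in `SENT_FOLDERS`, and the function names are all assumptions made for the example.

```python
from email import message_from_string
from pathlib import Path

# Assumed on-disk names for the "sent" and "sent items" folders.
SENT_FOLDERS = {"sent", "sent_items"}


def iter_sent_bodies(maildir_root):
    """Yield the plain-text body of each message stored in a sent folder."""
    root = Path(maildir_root)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Assumed layout: <root>/<user>/<folder>/<message file>
        parts = path.relative_to(root).parts
        if len(parts) < 3 or parts[1].lower() not in SENT_FOLDERS:
            continue
        # latin-1 can decode any byte sequence, so reading never fails.
        raw = path.read_text(encoding="latin-1")
        body = message_from_string(raw).get_payload()
        if isinstance(body, str):  # skip multipart messages in this sketch
            yield body


def export_corpus(maildir_root, out_dir):
    """Write each sent-message body to <out_dir>/<n>.txt and return the count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for count, body in enumerate(iter_sent_bodies(maildir_root), start=1):
        (out / f"{count}.txt").write_text(body, encoding="utf-8")
    return count
```

Calling `export_corpus("maildir", "enron_sent")` (both paths hypothetical) would leave one numbered text file per sent message, with the headers already stripped by the `email` parser.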
Okay, so now I had a directory filled with a whole lot of text files. The next step was to bring them into R so that I could submit them to the LDA. Following is the R code that I used:
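To make the modelling step a little more concrete, here is a toy collapsed Gibbs sampler for LDA in plain Python. It is not the R code referred to above, and it would be far too slow for a corpus this size; it only shows what "submitting documents to the LDA" amounts to. The function name `fit_lda` and the default hyperparameters `alpha` and `beta` are assumptions chosen for the example.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab),
    where doc_topic[d][k] is the estimated share of topic k in document d
    and topic_word[k][v] is the estimated weight of vocab[v] in topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic_counts = [[0] * n_topics for _ in docs]
    topic_word_counts = [[0] * n_words for _ in range(n_topics)]
    topic_totals = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assignments.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[w]] += 1
            topic_totals[k] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][v] -= 1
                topic_totals[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][v] + beta)
                    / (topic_totals[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][v] += 1
                topic_totals[k] += 1

    doc_topic = [
        [(doc_topic_counts[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(topic_word_counts[t][v] + beta) / (topic_totals[t] + n_words * beta)
         for v in range(n_words)]
        for t in range(n_topics)
    ]
    return doc_topic, topic_word, vocab
```

Each row of `doc_topic` sums to 1, which is what makes a cutoff such as "more than 95% of this email belongs to one topic" meaningful later on.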
Phew, that took a lot of computing power! Now that it’s done, let’s look at the results of the command on line 48 from the above gist:
      Topic 1   Topic 2     Topic 3      Topic 4
 [1,] "time"    "thank"     "market"     "email"
 [2,] "vinc"    "pleas"     "enron"      "pleas"
 [3,] "week"    "deal"      "power"      "messag"
 [4,] "thank"   "enron"     "compani"    "inform"
 [5,] "look"    "attach"    "energi"     "receiv"
 [6,] "day"     "chang"     "price"      "intend"
 [7,] "dont"    "call"      "gas"        "copi"
 [8,] "call"    "agreement" "busi"       "attach"
 [9,] "meet"    "question"  "manag"      "recipi"
[10,] "hope"    "fax"       "servic"     "enron"
[11,] "talk"    "america"   "rate"       "confidenti"
[12,] "ill"     "meet"      "trade"      "file"
[13,] "tri"     "mark"      "provid"     "agreement"
[14,] "night"   "kay"       "issu"       "thank"
[15,] "friday"  "corp"      "custom"     "contain"
[16,] "peopl"   "trade"     "california" "address"
[17,] "bit"     "ena"       "oper"       "contact"
[18,] "guy"     "north"     "cost"       "review"
[19,] "love"    "discuss"   "electr"     "parti"
[20,] "houston" "regard"    "report"     "contract"
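A table like this is simply the highest-weighted terms for each topic. As a hedged sketch of that lookup in Python (the function name `top_terms` and the toy weights are made up for illustration and are not taken from the model above):

```python
def top_terms(topic_word, vocab, k=20):
    """Return, for each topic, the k vocabulary terms with the largest weights."""
    result = []
    for weights in topic_word:
        # Rank vocabulary indices by this topic's weight, largest first.
        ranked = sorted(range(len(vocab)), key=lambda v: weights[v], reverse=True)
        result.append([vocab[v] for v in ranked[:k]])
    return result


# Toy example: two topics over a four-word vocabulary (invented weights).
vocab = ["deal", "meet", "price", "week"]
topic_word = [
    [0.10, 0.45, 0.10, 0.35],  # a "meeting"-like topic
    [0.40, 0.05, 0.50, 0.05],  # a "trading"-like topic
]
print(top_terms(topic_word, vocab, k=2))
```

Note that the terms are stems ("pleas", "messag", "compani") because the text was stemmed before modelling, so reading the table takes a little mental un-stemming.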
Here’s where some really subjective interpretation is required, just like in PCA. Let’s try to interpret the topics, one at a time:
- Topic 1: I see a lot of words related to time in this topic, and then I see the word ‘meet’. I’ll call this the meeting (business or otherwise) topic!
- Topic 2: I’m not sure how to interpret this second topic, so perhaps I’ll chalk it up to noise in my analysis!
- Topic 3: This topic contains a lot of ‘business content’ words, so it appears to be a kind of ‘talking shop’ topic.
- Topic 4: This topic, while still pretty ‘businessy’, appears to be less about the content of the business and more about the processes, or perhaps legalities of the business.
For each of the sensible topics (1,3,4), let’s bring up some emails that scored highly on these topics to see if the analysis makes sense:
sample(which(df.emails.topics$"1" > .95), 10)
[1] 53749 32102 16478 36204 29296 29243 47654 38733 28515 53254

enron[[32102]]
I will be out of the office next week on Spring Break. Can you participate on this call? Please report what is said to Christian Yoder 503-464-7845 or Steve Hall 503-4647795

03/09/2001 05:48 PM
I don't know, but I will check with our client.

Our client Avista Energy has received the communication, below, from the ISO regarding withholding of payments to creditors of monies the ISO has received from PG&E. We are interested in whether any of your clients have received this communication, are interested in this issue and, if so, whether you have any thoughts about how to proceed. You are invited to participate in a conference call to discuss this issue on Monday, March 12, at 10:00 a.m. Call-in number: (888) 320-6636 Host: Pritchard Confirmation number: 1827-1922

Diane Pritchard
Morrison & Foerster LLP
425 Market Street
San Francisco, California 94105
(415) 268-7188
So this one isn’t a business meeting in the physical sense, but is a conference call, which still falls under the general category of meetings.
enron[[29243]]
Hey Fritz. I am going to send you an email that attaches a referral form to your job postings. In addition, I will also personally tell the hiring manager that I have done this and I can also give him an extra copy of youe resume. Hopefully we can get something going here....

Tori, I received your name from Diane Hoyuela. You and I spoke back in 1999 about the gas industry. I tried briefly back in 1999 and found few opportunities during de-regulations first few steps. Well,...I'm trying again. I've been applying for a few job openings at Enron and was wondering if you could give me an internal referral. Also, any advice on landing a position at Enron or in general as a scheduler or analyst. Last week I applied for these positions at Enron; gas scheduler 110360, gas analyst 110247, and book admin. 110129. I have a pretty good understanding of the gas market. I've attached my resume for you. Congrats. on the baby! I'll give you a call this afternoon to follow-up, I know mornings are your time.

Regards, Fritz Hiser

__________________________________________________
Do You Yahoo!? Get email alerts & NEW webcam video instant messaging with Yahoo! Messenger. http://im.yahoo.com
That one obviously shows someone who was trying to get a job at Enron and wanted to call “this afternoon to follow-up”. Again, a ‘call’ rather than a physical meeting.
Finally,
enron[[29296]]
Susan,

Well you have either had a week from hell so far or its just taking you time to come up with some good bs. Without being too forward I will be in town next Friday and wanted to know if you would like to go to dinner or something. At least that will give us a chance to talk face to face. If your busy don't worry about it I thought I would just throw it out there. I'll keep this one short and sweet since the last one was rather lengthy. Hope this Thursday is a little better then last week.

Kyle

_________________________________________________________________________
Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com. Share information about yourself, create your own public profile at http://profiles.msn.com.
Ahh, here’s a particularly juicy one. Kyle here wants to go to dinner, “or something” (heh heh heh) with Susan to get a chance to talk face to face with her. Finally, a physical meeting (maybe very physical…) lumped into a category with other business meetings in person or on the phone.
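For anyone following along in Python rather than R, the selection step used above, `sample(which(df.emails.topics$"1" > .95), 10)`, picks ten random emails whose topic 1 proportion exceeds 0.95. A rough equivalent is sketched below, assuming `doc_topic` is a list of per-document rows of topic proportions; that name and the function name are hypothetical. Python indices start at 0, whereas the R indices shown above start at 1.

```python
import random


def sample_high_scoring(doc_topic, topic, threshold=0.95, n=10, seed=None):
    """Pick up to n random document indices whose share of `topic` exceeds threshold."""
    rng = random.Random(seed)
    # Equivalent of R's which(): indices of the rows that pass the cutoff.
    hits = [i for i, row in enumerate(doc_topic) if row[topic] > threshold]
    # Unlike R's sample(), quietly return fewer than n if there are not enough hits.
    return rng.sample(hits, min(n, len(hits)))


# Toy example with three documents and two topics (invented numbers).
doc_topic = [[0.97, 0.03], [0.50, 0.50], [0.99, 0.01]]
print(sorted(sample_high_scoring(doc_topic, topic=0, seed=1)))
```

A cutoff as strict as 0.95 only surfaces emails the model considers almost purely one topic, which is worth remembering when judging how representative the examples are.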
Okay, now let’s switch to topic 3, the “business content” topic.
sample(which(df.emails.topics$"3" > .95), 10)
[1] 40671 26644 5398 52918 37708 5548 15167 56149 47215 26683

enron[[40671]]
Please change the counterparty on deal 806589 from TP2 to TP3 (sorry about that).
Okay, that seems fairly in the realm of business content, but I don’t know what the heck it means. Let’s try another one:
enron[[5548]]
Phillip, Scott, Hunter, Tom and John -

Just to reiterate the new trading guidelines on PG&E Energy Trading:

1. Both financial and physical trading are approved, with a maximum tenor of 18 months
2. Approved entities are:
   PG&E Energy Trading - Gas Corporation
   PG&E Energy Trading - Canada Corporation
   NO OTHER PG&E ENTITIES ARE APPROVED FOR TRADING
3. Both EOL and OTC transactions are OK
4. Please call Credit (ext. 31803) with details on every OTC transaction.

We need to track all new positions with PG&E Energy Trading on an ongoing basis. Please ask the traders and originators on your desks to notify us with the details on any new transactions immediately upon execution. For large transactions (greater than 2 contracts/day or 5 BCF total), please call for approval before transacting.

Thanks for your assistance; please call me (ext. 53923) or Russell Diamond (ext. 57095) if you have any questions.

Jay
That one is definitely oozing with business content. Note the terms such as “Energy Trading”, and “Gas Corporation”, etc. Finally, one more:
enron[[26683]]
Hi Kathleen, Randy, Chris, and Trish,

Attached is the text of the August issue of The Islander. The headings will be lined up when Trish adds the art and ads. A calendar, also, which is in the next e-mail. I'll appreciate your comments by the end of tomorrow, Monday.

There are open issues which I sure hope get resolved before printing:

1. I'm waiting for a reply from Mike Bass regarding tenses on the Home Depot article. Don't know if there's one developer or more and what the name(s) is/are.
2. Didn't hear back from Ted Weir regarding minutes for July's water board meeting. I think there are 2 meetings minutes missed, 6/22 and July.
3. Waiting to hear back from Cheryl Hanks about the 7/6 City Council and 6/7 BOA meetings minutes.
4. Don't know the name of the folks who were honored with Yard of the Month. They're at 509 Narcissus.

I'm not feeling very good about the missing parts but need to move on schedule! I'm also looking for a good dictionary to check the spellings of ettouffe, tree-house and orneryness. (Makes me feel kind of ornery, come to think about it!)

Please let me know if you have revisions. Hope your week is starting out well.

'Nita
Alright, this one seems to be a mix of business content and process. So I can see how it was lumped into this topic, but it isn’t as clean a fit as I would like.
Finally, let’s move on to topic 4, which appeared to be a ‘business process’ topic to me. I’m suspicious of this topic, as I don’t think I successfully filtered out everything that I wanted to:
sample(which(df.emails.topics$"4" > .95), 10)
[1] 51205 5129 48826 51214 55337 15843 52543 11978 48337 2609

enron[[5129]]
very funny today...during the free fall, couldn't price jv and xh low enough on eol, just kept getting cracked. when we stabilized, customers came in to buy and couldnt price it high enough. winter versus apr went from +23 cents when we were at the bottom to +27 when april rallied at the end even though it should have tightened theoretically. however, april is being supported just off the strip. getting word a lot of utilities are going in front of the puc trying to get approval for hedging programs this year.

hey johnny. hope all is well. what u think hrere? utuilites buying this break down? charts look awful but 4.86 ish is next big level. jut back from skiing in co, fun but took 17 hrs to get home and a 1.5 days to get there cuz of twa and weather.
Hrm, this one appears to be some ‘shop talk’, and isn’t too general. I’m not sure how this applies to the topic 4 words. Let’s try another one:
enron[[55337]]
Fran, do you have an updated org chart that I could send to the Measurement group? Thanks. Lynn

Cc: Estalee Russi

Lynn, Attached are the org charts for ETS Gas Logistics: Have a great weekend. Thanks! Miranda
Here we go. This one seems to fall much more into the ‘business process’ realm. Let’s see if I can find another good example:
enron[[11978]]
Bill,

As per our conversation today, I am sending you an outline of what we intend to be doing in Ercot and in particular on the real-time desk. For 2002 Ercot is split into 4 zones with TCRs between 3 of the zones. The zones are fairly diverse from a supply/demand perspective. Ercot has an average load of 38,000 MW, a peak of 57,000 MW with a breakdown of 30% industrial, 30% commercial and 40% residential. There are already several successful aggregators that are looking to pass on their wholesale risk to a credit-worthy QSE (Qualified Scheduling Entity). Our expectation is that we will be a fully qualified QSE by mid-March with the APX covering us up to that point. Our initial on-line products will include a bal day and next day financial product. (There is no day ahead settlement in this market). There are more than 10 industrial loads with greater than 150 MW concentrated at single meters offering good opportunities for real-time optimization. Our intent is to secure one of these within the next 2 months. I have included some price history to show the hourly volatility and a business plan to show the scope of the opportunity. In addition, we have very solid analytics that use power flow simulations to map out expected outcomes in the real-time market. The initial job opportunity will involve an analysis of the real-time market as it stands today with a view to trading around our information. This will also drive which specific assets we approach to manage. As we are loosely combining our Texas gas and Ercot power desks our information flow will be superior and I believe we will have all the tools needed for a successful real-time operation.

Let me know if you have any further questions.

Thanks, Doug
Again, I seem to have found an email that straddles the boundary between business process and business content. Okay, I guess this topic isn’t the clearest in describing each of the examples that I found!
Overall, I probably could have done a bit more to filter out the useless stuff and so construct topics that were better at describing the examples they represent. Also, I’m not sure whether I should be surprised that I didn’t pick up some sort of ‘social banter’ topic, where people were emailing about non-business topics. I suppose that social banter emails might be less predictable in their content, but maybe somebody much smarter than I am can tell me the answer.
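One filtering step that might help, offered here only as a guess: several of the sample emails above end in webmail footers ("Do You Yahoo!?", the MSN Hotmail line), and topic 4 stems such as "confidenti", "intend" and "recipi" look a lot like legal-disclaimer boilerplate. A minimal sketch of cutting such footers before modelling follows. The first two markers are copied from the emails quoted above; the confidentiality pattern is a hypothetical wording, not something verified against the corpus.

```python
import re

# Markers after which everything is treated as footer boilerplate.
FOOTER_MARKERS = [
    re.compile(r"Do You Yahoo!\?"),
    re.compile(r"Get Your Private, Free E-mail from MSN Hotmail"),
    # Hypothetical disclaimer wording; adjust after inspecting real messages.
    re.compile(
        r"This (?:e-?mail|message) (?:is|may be|contains?) [^.]{0,80}confidential",
        re.IGNORECASE,
    ),
]


def strip_footers(text):
    """Cut the text at the earliest footer marker and tidy trailing rules."""
    cut = len(text)
    for pattern in FOOTER_MARKERS:
        match = pattern.search(text)
        if match:
            cut = min(cut, match.start())
    # Drop the underscore rule and whitespace that usually precede a footer.
    return text[:cut].rstrip(" _\n\t")
```

For example, `strip_footers("Regards, Fritz Hiser ____ Do You Yahoo!? Get email alerts")` returns `"Regards, Fritz Hiser"`. Whether this would actually sharpen topic 4 is untested.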
If you know how I can significantly ramp up the quality of this analysis, feel free to contribute your comments!