Digging into mbox details: A tale of tm & reticulate
I had to process a bunch of emails for a $DAYJOB task this week, and my “default setting” is to use R for pretty much everything (this should come as no surprise). Treating mail as data is not an uncommon task, and many R packages exist that can reach out and grab mail from servers or work directly with local mail archives.
Mbox’in off the rails on a crazy tm [1]
This particular mail corpus is in mbox format since it was saved via Apple Mail. It’s one big text file with each message appearing one after the other. The format has been around for decades, and R’s tm package (via the tm.plugin.mail plugin package) can process these mbox files.
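To make the "one big text file" idea concrete, here is a minimal base R sketch that writes a tiny, made-up mbox file and splits it back into messages. The addresses and message text are invented for illustration, and the splitting rule (a line starting with "From " begins a new message) is a simplification: real mbox variants differ in how they escape body lines that start with "From ", so treat this as a sketch and not a parser.

```r
# A tiny, made-up mbox: every message starts with a "From " separator line
mbox_lines <- c(
  "From alice@example.com Wed Aug 01 15:02:23 2018",
  "From: Alice <alice@example.com>",
  "Subject: first message",
  "",
  "Hello there.",
  "",
  "From bob@example.com Wed Aug 01 16:10:00 2018",
  "From: Bob <bob@example.com>",
  "Subject: second message",
  "",
  "Hi back.",
  ""
)
path <- tempfile(fileext = ".mbox")
writeLines(mbox_lines, path)

lines <- readLines(path)

# "From " (trailing space, no colon) marks the start of a message;
# the "From:" header has a colon and does not match
starts <- grepl("^From ", lines)

# cumsum() turns the separator flags into a message number for every line
messages <- split(lines, cumsum(starts))

length(messages)
sapply(messages, function(m) grep("^Subject:", m, value = TRUE))
```

Packages such as tm.plugin.mail and Python's mailbox module handle the edge cases this sketch skips, which is why the rest of the post leans on them.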
To demonstrate, we’ll use an Apple Mail archive excerpt from a set of R mailing list messages as they are not private/sensitive:
library(tm)
library(tm.plugin.mail)

# point the tm corpus machinery to the mbox file and let it know the timestamp format since it varies
VCorpus(
  MBoxSource("~/Data/test.mbox/mbox"),
  readerControl = list(
    reader = readMail(DateFormat = "%a, %e %b %Y %H:%M:%S %z")
  )
) -> mbox

str(unclass(mbox), 1)
## List of 3
## $ content:List of 198
## $ meta : list()
## ..- attr(*, "class")= chr "CorpusMeta"
## $ dmeta :'data.frame': 198 obs. of 0 variables

str(unclass(mbox[[1]]), 1)
## List of 2
## $ content: chr [1:476] "Try this:" "" "> library(lubridate)" "> library(tidyverse)" ...
## $ meta :List of 9
## ..- attr(*, "class")= chr "TextDocumentMeta"

str(unclass(mbox[[1]]$meta), 1)
## List of 9
## $ author : chr "jim holtman "
## $ datetimestamp: POSIXlt[1:1], format: "2018-08-01 15:01:17"
## $ description : chr(0)
## $ heading : chr "Re: [R] read txt file - date - no space"
## $ id : chr ""
## $ language : chr "en"
## $ origin : chr(0)
## $ header : chr [1:145] "Delivered-To: [email protected]" "Received: by 2002:ac0:e681:0:0:0:0:0 with SMTP id b1-v6csp950182imq;" " Wed, 1 Aug 2018 08:02:23 -0700 (PDT)" "X-Google-Smtp-Source: AAOMgpcdgBD4sDApBiF2DpKRfFZ9zi/4Ao32Igz9n8vT7EgE6InRoa7VZelMIik7OVmrFCRPDBde" ...
## $ : NULL
We’re using unclass() since the str() output gets a bit crowded with all of the tm class attributes stuck in the output display.
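The DateFormat string handed to readMail() is an ordinary strptime() format, so it can be tried on its own before pointing it at a whole archive. The sketch below uses a timestamp that appears in the header output above; it assumes an English locale (the %a and %b fields are locale-dependent) and forces UTC for display so the result does not depend on the local time zone.

```r
# A header-style timestamp (taken from the Received: line in the output above)
raw_date <- "Wed, 1 Aug 2018 08:02:23 -0700"

# same format string as the one passed to readMail(DateFormat = ...)
fmt <- "%a, %e %b %Y %H:%M:%S %z"

parsed <- as.POSIXct(raw_date, format = fmt, tz = "UTC")
format(parsed, "%Y-%m-%d %H:%M:%S", tz = "UTC")

# a string that does not match the format quietly becomes NA instead of failing
is.na(as.POSIXct("2018-08-01 08:02:23", format = fmt, tz = "UTC"))
```

Since mismatches turn into NA without an error, a quick count of NA values after parsing is a cheap sanity check when a mailbox mixes date styles.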
The tm suite is designed for text mining. My task had nothing to do with text mining and I really just needed some header fields and body content in a data frame. If you’ve been working with R for a while, some things in the str() output will no doubt cause a bit of angst. For instance:
- datetimestamp: POSIXlt[1:1], : POSIXlt and data frames really don’t mix well
- description : chr(0) / origin : chr(0) : zero-length character vectors
- $ : NULL : Blank element name with a NULL value…I Don’t Even [2]
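Two of those complaints are easy to reproduce in base R. The following is a small sketch with made-up metadata values; or_na() is a hypothetical helper invented for this illustration, not something from tm.

```r
# zero-length fields break data.frame(): a length-0 column cannot be recycled
meta <- list(author = "jim holtman", description = character(0))

res <- try(
  data.frame(author = meta$author, description = meta$description),
  silent = TRUE
)
inherits(res, "try-error")

# swapping in an explicit NA keeps the row intact
or_na <- function(x) if (length(x) == 0) NA_character_ else x  # hypothetical helper
df <- data.frame(
  author = meta$author,
  description = or_na(meta$description),
  stringsAsFactors = FALSE
)

# POSIXlt is a list of date-time components under the hood, while POSIXct is
# one number per timestamp, which is what a data frame column wants
lt <- as.POSIXlt("2018-08-01 15:01:17", tz = "UTC")
is.list(unclass(lt))
ct <- as.POSIXct(lt)
is.numeric(unclass(ct))

df$datetimestamp <- ct
str(df)
```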
The tm suite is also super opinionated and “helpfully” left out a ton of headers (though it did keep the source for the complete headers around). Still, we can roll up our sleeves and turn that into a data frame:
library(dplyr) # for %>% and glimpse()

# helper function for cleaner/shorter code
`%|0|%` <- function(x, y) { if (length(x) == 0) y else x }

# might as well stay old-school since we're using tm
do.call(
  rbind.data.frame,
  lapply(mbox, function(.x) {

    # we have a few choices, but this one is pretty explicit abt what it does
    # so we'll likely be able to decipher it quickly in 2 years when/if we come
    # back to it

    data.frame(
      author = .x$meta$author %|0|% NA_character_,
      datetimestamp = as.POSIXct(.x$meta$datetimestamp %|0|% NA),
      description = .x$meta$description %|0|% NA_character_,
      heading = .x$meta$heading %|0|% NA_character_,
      id = .x$meta$id %|0|% NA_character_,
      language = .x$meta$language %|0|% NA_character_,
      origin = .x$meta$origin %|0|% NA_character_,
      header = I(list(.x$meta$header %|0|% NA_character_)),
      body = I(list(.x$content %|0|% NA_character_)),
      stringsAsFactors = FALSE
    )

  })
) %>%
  glimpse()
## Observations: 198
## Variables: 9
## $ author <chr> "jim holtman ", "PIKAL Petr ...
## $ datetimestamp 2018-08-01 15:01:17, 2018-08-01 13:09:18, 2018-...
## $ description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ heading <chr> "Re: [R] read txt file - date - no space", "Re: ...
## $ id <chr> " "en", "en", "en", "en", "en", "en", "en", "en", ...
## $ origin <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ header Delivere...., Delivere...., Delivere...., De...
## $ body Try this...., SGkNCg0K...., Dear Pik...., De...
That wasn’t a huge effort, but we would now have to re-process the headers and/or write a custom version of tm.plugin.mail::readMail() (the function source is very readable and extendable) to get any extra data out. Here’s what that might look like:
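# Custom msg reader
read_mail <- function(elem, language, id) {

  # extract header val
  hdr_val <- function(src, pat) {
    gsub(
      sprintf("%s: ", pat), "",
      grep(sprintf("^%s:", pat), src, value = TRUE, useBytes = TRUE)
    ) %|0|% NA
  }

  mail <- elem$content

  index <- which(mail == "")[1] # the first blank line separates headers from body
  header <- mail[1:index]
  mid <- hdr_val(header, "Message-ID")

  PlainTextDocument(
    x = mail[(index + 1):length(mail)],
    author = hdr_val(header, "From"),
    spam_score = hdr_val(header, "X-Spam-Score"), ### an extra header
    # remaining fields: an assumed, minimal completion modeled on
    # tm.plugin.mail::readMail()
    heading = hdr_val(header, "Subject"),
    id = if (!is.na(mid[1])) mid[1] else id,
    language = language
  )

}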
If we wanted all the headers, there are even more succinct ways to solve for that use case.
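One possible shape for an "all the headers" approach, sketched in base R: unfold continuation lines, split each header at its first colon, and collect the values in a named list. The header block below is made up (modeled on the output earlier in the post), and parse_headers() is a hypothetical name, not a tm or tm.plugin.mail function.

```r
# made-up header block; continuation lines start with whitespace
header <- c(
  "Received: by 2002:ac0:e681:0:0:0:0:0 with SMTP id example;",
  "        Wed, 1 Aug 2018 08:02:23 -0700 (PDT)",
  "From: Some Sender <sender@example.com>",
  "Subject: Re: [R] read txt file - date - no space",
  "X-Spam-Score: -3.631",
  ""
)

parse_headers <- function(header) {
  header <- header[nzchar(header)]      # drop the blank separator line
  is_cont <- grepl("^[ \t]", header)    # folded (continuation) lines
  grp <- cumsum(!is_cont)               # one group per logical header
  unfolded <- vapply(
    split(header, grp),
    function(x) paste(trimws(x), collapse = " "),
    character(1)
  )
  # split each "Name: value" at the first colon only
  nm <- sub("^([^:]+):.*$", "\\1", unfolded)
  val <- trimws(sub("^[^:]+:", "", unfolded))
  # repeated names (e.g. several Received lines) end up in the same element
  split(unname(val), factor(nm, levels = unique(nm)))
}

hdrs <- parse_headers(header)
names(hdrs)
hdrs[["X-Spam-Score"]]
```

A list like this drops straight into a list-column, much like the header and body columns built with I(list(...)) above.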
Packaging up emails with a reticulated message.mbox
Since the default functionality of tm.plugin.mail::readMail() forced us to work a bit to get what we needed, there’s some justification in seeking out an alternative path. I’ve written about reticulate before and am including it in this post as the Python standard library module mailbox can also make quick work of mbox files.
Two pieces of advice I generally reiterate when I talk about reticulate are that I highly recommend using Python 3 (remember, it’s a fragmented ecosystem) and that I prefer specifying the specific target Python to use via the RETICULATE_PYTHON environment variable that I have in ~/.Renviron as RETICULATE_PYTHON=/usr/local/bin/python3.
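For anyone who has not used ~/.Renviron before, the file is just NAME=value lines that R reads at startup. The sketch below uses a throwaway file instead of the real one and restores the previous setting afterwards. The interpreter path is the one quoted above and may well not exist on your machine, hence the file.exists() check.

```r
# a throwaway file in .Renviron format (the real one lives at ~/.Renviron)
renviron <- tempfile()
writeLines("RETICULATE_PYTHON=/usr/local/bin/python3", renviron)

old <- Sys.getenv("RETICULATE_PYTHON", unset = NA)  # remember the current setting

readRenviron(renviron)          # parses NAME=value lines into the environment
py <- Sys.getenv("RETICULATE_PYTHON")
py
file.exists(py)                 # worth checking before reticulate starts that interpreter

# put things back the way they were
if (is.na(old)) Sys.unsetenv("RETICULATE_PYTHON") else Sys.setenv(RETICULATE_PYTHON = old)
```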
Let’s bring the mailbox module into R:
library(reticulate)
library(tidyverse)

mailbox <- import("mailbox")
If you're unfamiliar with a Python module or object, you can get help right in R via reticulate::py_help(). Et sequitur [3]: py_help(mailbox) will bring up the text help for that module and py_help(mailbox$mbox) (remember, we swap out dots for dollars when referencing Python object components in R) will do the same for the mailbox.mbox class.
Text help is great and all, but we can also render it to HTML with this helper function:
py_doc <- function(x) {
  require("htmltools")
  require("reticulate")
  pydoc <- reticulate::import("pydoc")
  htmltools::html_print(
    htmltools::HTML(
      pydoc$render_doc(x, renderer = pydoc$HTMLDoc())
    )
  )
}
Here's what the text and HTML help for mailbox.mbox look like side-by-side:
We can also use a helper function to view the online documentation:
readthedocs <- function(obj, py_ver = 3, check_keywords = "yes") {
  require("glue")
  query <- obj$`__name__`
  browseURL(
    glue::glue(
      "https://docs.python.org/{py_ver}/search.html?q={query}&check_keywords={check_keywords}"
    )
  )
}
Et sequitur: readthedocs(mailbox$mbox) will take us to this results page.
Going back to the task at hand, we need to cycle through the messages and make a data frame for the bits we (well, I) care about. The reticulate package does an amazing job making Python objects first-class citizens in R, but Python objects may feel "opaque" to R users since we have to use the $ syntax to get to methods and values and, very often, familiar helpers such as str() are less than helpful on these objects. Let's try to look at the first message (remember, Python is 0-indexed):
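# open the same archive with Python's mailbox module (this replaces the tm corpus in `mbox`)
mbox <- mailbox$mbox(path.expand("~/Data/test.mbox/mbox"))

msg1 <- mbox$get(0)

str(msg1)
msg1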
The output for those last two calls is not shown because they both are just a large text dump of the message source. #unhelpful
We can get more details, and we'll wrap some punctuation-filled calls in two small helper functions that have names that will sound familiar:
pstr <- function(obj, ...) { str(obj$`__dict__`, ...) } # like 'str()'

pnames <- function(obj) { import_builtins()$dir(obj) } # like 'names()' but more complete
Let's see them in action:
pstr(msg1, 1) # we can pass any params str() will take
## List of 10
## $ _from : chr "[email protected] Wed Aug 01 15:02:23 2018"
## $ policy :Compat32()
## $ _headers :List of 56
## $ _unixfrom : NULL
## $ _payload : chr "Try this:\n\n> library(lubridate)\n> library(tidyverse)\n> input <- read.csv(text =3D \"date,str1,str2,str3\n+ "| __truncated__
## $ _charset : NULL
## $ preamble : NULL
## $ epilogue : NULL
## $ defects : list()
## $ _default_type: chr "text/plain"

pnames(msg1)
## [1] "__bytes__" "__class__"
## [3] "__contains__" "__delattr__"
## [5] "__delitem__" "__dict__"
## [7] "__dir__" "__doc__"
## [9] "__eq__" "__format__"
## [11] "__ge__" "__getattribute__"
## [13] "__getitem__" "__gt__"
## [15] "__hash__" "__init__"
## [17] "__init_subclass__" "__iter__"
## [19] "__le__" "__len__"
## [21] "__lt__" "__module__"
## [23] "__ne__" "__new__"
## [25] "__reduce__" "__reduce_ex__"
## [27] "__repr__" "__setattr__"
## [29] "__setitem__" "__sizeof__"
## [31] "__str__" "__subclasshook__"
## [33] "__weakref__" "_become_message"
## [35] "_charset" "_default_type"
## [37] "_explain_to" "_from"
## [39] "_get_params_preserve" "_headers"
## [41] "_payload" "_type_specific_attributes"
## [43] "_unixfrom" "add_flag"
## [45] "add_header" "as_bytes"
## [47] "as_string" "attach"
## [49] "defects" "del_param"
## [51] "epilogue" "get"
## [53] "get_all" "get_boundary"
## [55] "get_charset" "get_charsets"
## [57] "get_content_charset" "get_content_disposition"
## [59] "get_content_maintype" "get_content_subtype"
## [61] "get_content_type" "get_default_type"
## [63] "get_filename" "get_flags"
## [65] "get_from" "get_param"
## [67] "get_params" "get_payload"
## [69] "get_unixfrom" "is_multipart"
## [71] "items" "keys"
## [73] "policy" "preamble"
## [75] "raw_items" "remove_flag"
## [77] "replace_header" "set_boundary"
## [79] "set_charset" "set_default_type"
## [81] "set_flags" "set_from"
## [83] "set_param" "set_payload"
## [85] "set_raw" "set_type"
## [87] "set_unixfrom" "values"
## [89] "walk"

names(msg1)
## [1] "add_flag" "add_header"
## [3] "as_bytes" "as_string"
## [5] "attach" "defects"
## [7] "del_param" "epilogue"
## [9] "get" "get_all"
## [11] "get_boundary" "get_charset"
## [13] "get_charsets" "get_content_charset"
## [15] "get_content_disposition" "get_content_maintype"
## [17] "get_content_subtype" "get_content_type"
## [19] "get_default_type" "get_filename"
## [21] "get_flags" "get_from"
## [23] "get_param" "get_params"
## [25] "get_payload" "get_unixfrom"
## [27] "is_multipart" "items"
## [29] "keys" "policy"
## [31] "preamble" "raw_items"
## [33] "remove_flag" "replace_header"
## [35] "set_boundary" "set_charset"
## [37] "set_default_type" "set_flags"
## [39] "set_from" "set_param"
## [41] "set_payload" "set_raw"
## [43] "set_type" "set_unixfrom"
## [45] "values" "walk"

# See the difference between pnames() and names()
setdiff(pnames(msg1), names(msg1))
## [1] "__bytes__" "__class__"
## [3] "__contains__" "__delattr__"
## [5] "__delitem__" "__dict__"
## [7] "__dir__" "__doc__"
## [9] "__eq__" "__format__"
## [11] "__ge__" "__getattribute__"
## [13] "__getitem__" "__gt__"
## [15] "__hash__" "__init__"
## [17] "__init_subclass__" "__iter__"
## [19] "__le__" "__len__"
## [21] "__lt__" "__module__"
## [23] "__ne__" "__new__"
## [25] "__reduce__" "__reduce_ex__"
## [27] "__repr__" "__setattr__"
## [29] "__setitem__" "__sizeof__"
## [31] "__str__" "__subclasshook__"
## [33] "__weakref__" "_become_message"
## [35] "_charset" "_default_type"
## [37] "_explain_to" "_from"
## [39] "_get_params_preserve" "_headers"
## [41] "_payload" "_type_specific_attributes"
## [43] "_unixfrom"
Using just names() excludes the "hidden" builtins for Python objects, but knowing they are there and what they are can be helpful, depending on the program context.
Let's continue on the path to our messaging goal and see what headers are available. We'll use some domain knowledge about the _headers component, though we won't end up going that route to build a data frame:
map_chr(msg1$`_headers`, ~.x[[1]])
## [1] "Delivered-To" "Received"
## [3] "X-Google-Smtp-Source" "X-Received"
## [5] "ARC-Seal" "ARC-Message-Signature"
## [7] "ARC-Authentication-Results" "Return-Path"
## [9] "Received" "Received-SPF"
## [11] "Authentication-Results" "Received"
## [13] "X-Virus-Scanned" "Received"
## [15] "Received" "Received"
## [17] "X-Virus-Scanned" "X-Spam-Flag"
## [19] "X-Spam-Score" "X-Spam-Level"
## [21] "X-Spam-Status" "Received"
## [23] "Received" "Received"
## [25] "Received" "DKIM-Signature"
## [27] "X-Google-DKIM-Signature" "X-Gm-Message-State"
## [29] "X-Received" "MIME-Version"
## [31] "References" "In-Reply-To"
## [33] "From" "Date"
## [35] "Message-ID" "To"
## [37] "X-Tag-Only" "X-Filter-Node"
## [39] "X-Spam-Level" "X-Spam-Status"
## [41] "X-Spam-Flag" "Content-Disposition"
## [43] "Subject" "X-BeenThere"
## [45] "X-Mailman-Version" "Precedence"
## [47] "List-Id" "List-Unsubscribe"
## [49] "List-Archive" "List-Post"
## [51] "List-Help" "List-Subscribe"
## [53] "Content-Type" "Content-Transfer-Encoding"
## [55] "Errors-To" "Sender"
The message object does provide a get() method to retrieve header values, so we'll go that route to build our data frame, but we'll make yet-another helper since doing something like msg1$get("this header does not exist") will return NULL just like list(a=1)$b would. We'll actually make two new helpers since we want to be able to safely work with the payload content and that means ensuring it's in UTF-8 encoding (mail systems are horribly diverse beasts and the R community is international and, remember, we're using R mailing list messages):
# execute an object's get() method and return a character string or NA if no value was present for the key
get_chr <- function(.x, .y) { as.character(.x[["get"]](.y)) %|0|% NA_character_ }

# get the object's value as a valid UTF-8 string
utf8_decode <- function(.x) { .x[["decode"]]("utf-8", "ignore") %|0|% NA_character_ }
We're also doing this because I get really tired of using the $ syntax.
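The NULL-to-NA step that get_chr() relies on can be checked without Python at all. In this sketch a plain R list stands in for the message object (fake_msg is made up for the illustration), and the %|0|% helper is redefined so the block runs on its own.

```r
`%|0|%` <- function(x, y) { if (length(x) == 0) y else x }  # same helper as earlier in the post

# stand-in for a message: a plain R list plays the role of the Python object
fake_msg <- list(subject = "Re: [R] read txt file - date - no space")

missing <- fake_msg[["x-spam-score"]]   # no such element
is.null(missing)
length(as.character(missing))           # NULL becomes character(0), not NA

# a length-one NA fits into a data frame row; character(0) would not
as.character(missing) %|0|% NA_character_
as.character(fake_msg[["subject"]]) %|0|% NA_character_
```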
We also want the message content or payload. Modern mail messages can be really complex structures with many multipart entities. To put it a different way, there may be HTML, RTF and plaintext versions of a message all in the same envelope. We want the plaintext ones so we'll have to iterate through any multipart messages to (hopefully) get to a plaintext version. Since this post is already pretty long and we ignored errors in the tm portion, I'll refrain from including any error handling code here as well.
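To picture what "iterating through multipart messages" means, here is a toy model in base R. Nested lists stand in for the Python message objects (the type, parts and payload fields are invented for this sketch), and first_plain() is a hypothetical helper that searches depth-first for the first text/plain leaf. That is a slightly different strategy from always descending into the first part, so read it as an illustration of the structure and not as the code used in this post.

```r
# hypothetical stand-in for a parsed message: each node has a content type and
# either a list of child parts (multipart) or a character payload (leaf)
msg <- list(
  type = "multipart/alternative",
  parts = list(
    list(type = "text/html",  payload = "<p>Try this:</p>"),
    list(type = "text/plain", payload = "Try this:")
  )
)

# depth-first search for the first text/plain leaf; NA if there is none
first_plain <- function(node) {
  if (is.null(node[["parts"]])) {
    is_plain <- identical(node[["type"]], "text/plain")
    return(if (is_plain) node[["payload"]] else NA_character_)
  }
  for (part in node[["parts"]]) {
    found <- first_plain(part)
    if (!is.na(found)) return(found)
  }
  NA_character_
}

first_plain(msg)
first_plain(list(type = "image/png", payload = "..."))
```

With the real Python objects, is_multipart() and get_payload() play the roles of the parts check and the payload field here.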
map_df(1:py_len(mbox), ~{

  m <- mbox$get(.x-1) # python uses 0-index lists

  list(
    date = as.POSIXct(get_chr(m, "date"), format = "%a, %e %b %Y %H:%M:%S %z"),
    from = get_chr(m, "from"),
    to = get_chr(m, "to"),
    subj = get_chr(m, "subject"),
    spam_score = get_chr(m, "X-Spam-Score")
  ) -> mdf

  content_type <- m$get_content_maintype() %|0|% NA_character_
  mdf$content_type <- content_type[1] # keep it as a column (it shows up in the output below)

  if (content_type[1] == "text") { # we don't want images
    while (m$is_multipart()) m <- m$get_payload()[[1]] # cycle through until we get to something we can use
    mtmp <- m$get_payload(decode = TRUE) # get the message text
    mdf$body <- utf8_decode(mtmp) # make it safe to use
  }

  mdf

}) -> mbox_df

glimpse(mbox_df)
## Observations: 198
## Variables: 7
## $ date 2018-08-01 11:01:17, 2018-08-01 09:09:18, 20...
## $ from <chr> "jim holtman ", "PIKAL Pe...
## $ to <chr> "[email protected], R mailing list "Re: [R] read txt file - date - no space", "R...
## $ spam_score <chr> "-3.631", "-3.533", "-3.631", "-3.631", "-3.5...
## $ content_type <chr> "text", "text", "text", "text", "text", "text...
## $ body <chr> "Try this:\n\n library(lubridate)\n library...
FIN
By now, you've likely figured out this post really had nothing to do with reading mbox files. I mean, it did (and this was a task I had to do this week), but the real goal was to use a fairly basic task to help R folks edge a bit closer to becoming more friendly with Python in R. There are hundreds of thousands of Python packages out there and, while I'm one to wax poetic about having R or C[++]-backed R-native packages (and am wont to point out Python's egregiously prolific flaws), sometimes you just need to get something done quickly and wish to avoid reinventing the wheel. The reticulate package makes that eminently possible.
I'll be wrapping up some of the reticulate helper functions into a small package soon, so keep your eyes on RSS.
: You might want to read this even if you're not interested in mbox files. FIN (right above this note) might have some clues as to why.
1: yes, the section title was a stretch
2: am I doing this right, Mara? 😉
3: Make Latin Great Again