Moving largish data from R to H2O – spam detection with Enron emails
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Moving around sparse matrices of text data – the limitations of as.h2o
This post is the resolution of a challenge I first wrote about in late 2016, moving large sparse data from an R environment onto an H2O cluster for machine learning purposes. In that post, I experimented with functionality recently added by the H2O team to their supporting R package, the ability for as.h2o()
to interpret a sparse Matrix
object from R and convert it to an H2O frame. The Matrix
and as.h2o
method is ok for medium sized data but broke down on my hardware with a larger dataset – a bags of words from New York Times articles with 300,000 rows and 102,000 columns. Cell entries are the number of times a particular word is used in the document represented by a row and are mostly empty, so my 12GB laptop has no problem managing the data in a sparse format like Matrix
from the Matrix
package or a simple triplet matrix from the slam
package. I’m not sure what as.h2o
does under the hood in converting from Matrix
to an H2O frame, but it’s too much for my laptop.
My motivation for this is that I want to use R for convenient pre-processing of textual data using the tidytext
approach; but H2O for high powered machine learning. tidytext
makes it easy to create a sparse matrix with cast_dtm
or cast_sparse
, but uploading this to H2O can be a challenge.
How to write from R into SVMLight format
After some to-and-fro on Stack Overflow, the best advice was to export the sparse matrix from R into a SVMLight/LIBSVM format text file, then read it into H2O with h2o.importFile(..., parse_type = "SVMLight")
. This turned the problem from an difficult and possibly intractable memory managment challenge into a difficult and possibly intractable data formatting and file writing challenge – how to efficiently write files in SVMLight format.
SVMLight format combines a data matrix with some modelling information ie the response value of a model, or “label” as it is (slightly oddly, I think) often called in this world. Instead of a more conventional row-based sparse matrix format which might convey information in row-column-value triples, it uses label-column:value indicators. It looks like this:
1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:0.3
That example is equivalent to two rows of a sparse matrix with at least 34,567 columns. The first row has 1
as the response value, 3.4
in the 10th column of explanatory variables, and 0.231 in the 34,567th column; the second row has 0.2 as the response value, 1
in the 22nd column, and so on.
Writing data from R into this format is a known problem discussed in this Q&A on Stack Overflow. Unfortunately, the top rated answer to that question, e1071::write.svm
is reported as being slow, and also it is integrated into a workflow that requires you to first fit a Support Vector Machine model to the data, a step I wanted to avoid. That Q&A led me to a GitHub repo by zygmuntz that had a (also slow) solution for writing dense matrices into SVMLight format, but that didn’t help me as my data were too large for R to hold in dense format. So I wrote my own version for taking simplet triplet matrices and writing SVMLight format. My first version depended on nested paste
statements that were apply
d to each row of the data and was still too slow at scale, but with the help of yet another Stack Overflow interaction and some data table wizardry by @Roland this was able to reduce the expected time writing my 300,000 by 83,000 New York Times matrix (having removed stop words) from several centuries to two minutes.
I haven’t turned this into a package – it would seem better to find an existing package to add it to than create a package just for this one function, any ideas appreciated. The functions are available on GitHub but in case I end up moving them, here they are in full. One function creates a big character vector; the second writes that to file. This means multiple copies of the data need to be held in R and hence creates memory limitations, but is much much faster than writing it one line at a time (seconds rather than years in my test cases).
library(data.table)
library(slam)
# Convert a simple triplet matrix to svm format
#' @author Peter Ellis
#' @return a character vector of length n = nrow(stm)
calc_stm_svm <- function(stm, y){
# returns a character vector of length y ready for writing in svm format
if(!"simple_triplet_matrix" %in% class(stm)){
stop("stm must be a simple triple matrix")
}
if(!is.vector(y) | nrow(stm) != length(y)){
stop("y should be a vector of length equal to number of rows of stm")
}
n <- length(y)
# data.table solution thanks to @roland at http://stackoverflow.com/questions/41477700/optimising-sapply-or-for-paste-to-efficiently-transform-sparse-triplet-m/41478999#41478999
stm2 <- data.table(i = stm$i, j = stm$j, v = stm$v)
res <- stm2[, .(i, jv = paste(j, v, sep = ":"))
][order(i), .(res = paste(jv, collapse = " ")), by = i][["res"]]
out <- paste(y, res)
return(out)
}
#' @param stm a simple triplet matrix (class exported slam) of features (ie explanatory variables)
#' @param y a vector of labels. If not provided, a dummy of 1s is provided
#' @param file file to write to.
#' @author Peter Ellis
write_stm_svm <- function(stm, y = rep(1, nrow(stm)), file){
out <- calc_stm_svm(stm, y)
writeLines(out, con = file)
}
Example application - spam detection with the Enron emails
Although I’d used the New York Times bags of words from the UCI machine learning dataset repository for testing the scaling up of this approach, I actually didn’t have anything I wanted to analyse that data for in H2O. So casting around for an example use case I decided on using the Enron email collection for spam detection, first analysed in a 2006 conference paper by V. Metsis, I. Androutsopoulos and G. Paliouras. As well as providing one of the more sensational corporate scandals of recent times, the Enron case has blessed data scientists with one of the largest published sets of emails collected from their natural habitat.
The original authors classified the emails as spam or ham and saved these pre-processed data for future use and reproducibility. I’m not terribly knowledgeable (or interested) in spam detection, so please take the analysis below as a crude and naive example only.
Data
First the data need to be downloaded and unzipped. The files are stored as 6 Tape ARchive files
library(tidyverse)
library(tidytext)
library(tm)
library(testthat)
library(data.table)
library(h2o)
library(stringr)
library(forcats)
#===========download files=============
# Files described at http://csmining.org/index.php/enron-spam-datasets.html
baseurl <- "http://csmining.org/index.php/enron-spam-datasets.html?file=tl_files/Project_Datasets/Enron-Spam%20datasets/Preprocessed/enron"
filenames <- paste0(baseurl, 1:6, ".tar.tar")
for(i in 1:6){
message("Downloading ", i)
dfile <- paste0("enron", i, ".tar.tar")
download.file(filenames[i], destfile = dfile, mode = "wb")
message("Un-archiving ", i)
untar(dfile)
}
This creates six folders with the names enron1
, enron2
etc; each with a spam
and a ham
subfolder containing numerous text files. The files look like this example piece of ham (ie non-spam; a legitimate email), chosen at random:
Subject: re : creditmanager net meeting
aidan ,
yes , this will work for us .
vince
" aidan mc nulty " on 12 / 16 / 99 08 : 36 : 14 am
to : vince j kaminski / hou / ect @ ect
cc :
subject : creditmanager net meeting
vincent , i cannot rearrange my schedule for tomorrow so i would like to
confirm that we will have a net - meeting of creditmanager on friday 7 th of
january at 9 . 30 your time .
regards
aidan mc nulty
212 981 7422
The pre-processing has removed duplicates, emails sent to themselves, some of the headers, etc.
Importing the data into R and making tidy data frames of documents and word counts is made easy by Silge and Robinson’s tidytext
package which I never tire of saying is a game changer for convenient analysis of text by statisticians:
#=============import to R================
# Adapting the example at http://tidytextmining.com/usenet.html
folders <- paste0("enron", 1:6)
spam <- data_frame(file = dir(paste0(folders, "/spam"), full.names = TRUE)) %>%
mutate(text = map(file, readLines, encoding = "Latin-1")) %>%
transmute(id = basename(file), text) %>%
unnest(text) %>%
mutate(SPAM = "spam")
ham <- data_frame(file = dir(paste0(folders, "/ham"), full.names = TRUE)) %>%
mutate(text = map(file, readLines, encoding = "Latin-1")) %>%
transmute(id = basename(file), text) %>%
unnest(text) %>%
mutate(SPAM = "ham")
enron_raw <- rbind(spam, ham)
#============error checks - failing!===================
# Check against the counts provided at http://csmining.org/index.php/data.html
# Turns out some 3013 spam messages have gone missing
# should be 20170 spam messages and 16545 ham messages:
# returns an error
expect_equal(length(unique(enron_raw$id)), 20170 + 16545)
enron_raw %>%
select(id, SPAM) %>%
distinct() %>%
summarise(spam_count = sum(SPAM == "spam"), ham_count = sum(SPAM == "ham"))
# For my purposes I decide not to worry about this.
#=================further processing==================
enron <- enron_raw %>%
# will just remove the "Subject:" and "Subject :" and treat subject words like any other
mutate(text = gsub("^Subject *: *", "", text),
text = gsub("<U.....>", "", text, fixed = FALSE))
enron_words <- enron %>%
unnest_tokens(word, text) %>%
select(-SPAM)
As well as basic word counts, I wanted to experiment with other characteristics of emails such as number of words, number and proportion of of stopwords (frequently used words like “and” and “the”). I create a traditional data frame with a row for each email, identified by id
, and columns indicating whether it is SPAM and those other characteristics of interest.
# First I'm creating a summary, dense data frame with some numeric info on each document
enron_sum1 <- enron %>%
mutate(number_characters = nchar(text)) %>%
group_by(id, SPAM) %>%
summarise(number_characters = sum(number_characters))
enron_sum2 <- enron_words %>%
group_by(id) %>%
summarise(number_words = length(word))
enron_sum3 <- enron_words %>%
anti_join(stop_words, by = "word") %>%
group_by(id) %>%
summarise(number_nonstop_words = length(word))
enron_sum_total <- enron_sum1 %>%
left_join(enron_sum2, by = "id") %>%
left_join(enron_sum3, by = "id") %>%
mutate(number_stop_words = number_words - number_nonstop_words,
proportion_stop_words = number_stop_words / number_words) %>%
select(-number_nonstop_words)
enron_sum_total
Source: local data frame [33,702 x 6]
Groups: id [33,702]
id SPAM number_characters number_words number_stop_words
<chr> <chr> <int> <int> <int>
1 0001.1999-12-10.farmer.ham.txt ham 28 4 0
2 0001.1999-12-10.kaminski.ham.txt ham 24 4 3
3 0001.2000-01-17.beck.ham.txt ham 3486 559 248
4 0001.2000-06-06.lokay.ham.txt ham 3603 536 207
5 0001.2001-02-07.kitchen.ham.txt ham 322 48 18
6 0001.2001-04-02.williams.ham.txt ham 1011 202 133
7 0002.1999-12-13.farmer.ham.txt ham 4194 432 118
8 0002.2001-02-07.kitchen.ham.txt ham 385 64 40
9 0002.2001-05-25.SA_and_HP.spam.txt spam 990 170 80
10 0002.2003-12-18.GP.spam.txt spam 1064 175 63
I next make my sparse matrix as a document term matrix (which is a special case of a simplet triplet matrix from the slam
package), with a column for each word (having first limited myself to interesting words)
used_words <- enron_words %>%
# knock out stop words:
anti_join(stop_words, by = "word") %>%
# knock out numerals, and words with only 2 or 1 letters:
mutate(word = gsub("[0-9]", "", word),
wordlength = nchar(word)) %>%
filter(wordlength > 2) %>%
group_by(word) %>%
summarise(count = length(word)) %>%
ungroup() %>%
# knock out words used less than 10 times:
filter(count >= 10)
enron_dtm <- enron_words %>%
right_join(used_words, by = "word") %>%
cast_dtm(id, word, count)
# we need a version of the dense data in the same order as the document-term-matrix, to do a sort of
# manual join of the sparse matrix with the dense one later in H2O.
rows <- data_frame(id = rownames(enron_dtm))
enron_dense <- left_join(rows, enron_sum_total, by = "id")
expect_equal(nrow(enron_dtm), nrow(enron_dense))
expect_equal(rownames(enron_dtm), enron_dense$id)
Now we can load our two datasets onto an H2O cluster for analysis:
#================import to h2o and join up there============
h2o.init(nthreads = -1, max_mem_size = "8G")
# Load up the dense matrix with counts of stopwords etc:
enron_dense_h2o <- as.h2o(enron_dense)
# Load up the sparse matrix with columns for each word:
thefile <- tempfile()
write_stm_svm(enron_dtm, file = thefile)
enron_sparse_h2o <- h2o.uploadFile(thefile, parse_type = "SVMLight")
unlink(thefile)
# Number of rows should be equal:
expect_equal(nrow(enron_sparse_h2o), nrow(enron_dtm))
# Number of columns should be 1 extra in H2O, dummy variable of labels (1) added by write_stm_svm:
expect_equal(ncol(enron_sparse_h2o), ncol(enron_dtm) + 1)
# First column should be the dummy labels = all one
expect_equal(mean(enron_sparse_h2o[ , 1]), 1)
enron_fulldata <- h2o.cbind(enron_sparse_h2o, enron_dense_h2o)
head(colnames(enron_fulldata), 10)
# Convert the target variable to a factor so h2o.glm and other modelling functions
# know what to do with it:
enron_fulldata[ , "SPAM"] <- as.factor(enron_fulldata[ , "SPAM"])
I now have an H2O frame with 33602 rows and 26592 columns; most of the columns representing words and the cells being counts; but some columns representing other variables such as number of stopwords.
Analysis
To give H2O a workout, I decided to fit four different types of models trying to understand which emails were ham and which spam:
- generalized linear model, with elastic net regularization to help cope with the large number of explanatory variables
- random forest
- gradient boosting machine
- neural network
I split the data into training, validation and testing subsets; with the idea that the validation set would be used for choosing tuning parameters, and the testing set used as a final comparison of the predictive power of the final models. As things turned out, I didn’t have patience to do much in the way of tuning. This probably counted against the latter three of my four models, because I’m pretty confident better performance would be possible with more careful choice of some of the meta parameters. Here’s the eventual results from my not-particularly-tuned models:
The humble generalized linear model (GLM) performs pretty well; outperformed clearly only by the neural network. The GLM has a big advantage in interpretability too. Here are the most important variables for the GLM in predicting spam (NEG means a higher count of the word means less likely to be spam)
names coefficients sign word
1 C9729 1.1635213 NEG enron
2 C25535 0.6023054 NEG vince
3 C1996 0.5990230 NEG attached
4 C15413 0.4524011 NEG louise
5 C19891 0.3905246 NEG questions
6 C11478 0.2993239 NEG gas
7 proportion_stop_words 0.2935112 NEG <NA>
8 C12866 0.2774074 NEG hourahead
9 C7268 0.2600452 NEG daren
10 C16257 0.2497282 NEG meter
11 C12878 0.2439315 NEG houston
12 C21441 0.2345008 NEG sally
13 C16106 0.2179897 NEG meeting
14 C12894 0.1965571 NEG hpl
15 C16618 0.1909332 NEG monday
16 C11270 0.1873195 NEG friday
17 C8553 0.1704185 NEG doc
18 C7386 0.1673093 NEG deal
19 C21617 0.1636310 NEG schedule
20 C4185 0.1510748 NEG california
21 C12921 0.3695006 POS http
22 C16624 0.2132104 POS money
23 C22597 0.2034031 POS software
24 C15074 0.1957970 POS life
25 C5394 0.1922659 POS click
26 C17683 0.1915608 POS online
27 C25462 0.1703109 POS viagra
28 C16094 0.1605989 POS meds
29 C21547 0.1583438 POS save
30 C9483 0.1498732 POS email
So, who knew, emails containing the words “money”, “software”, “life”, “click”, “online”, “viagra” and “meds” are (or at least were in the time of Enron - things may have changed) more likely to be spam.
Here’s the code for the analysis all together:
#=====================analysis in H2O=============
#-----------prep------------------------
# names of the explanatory variables - all the columns in the sparse matrix (which are individual words)
# except the first one which is the dummy "labels" created just for the SVMLight format. And the
# four summary variables in the dense dataframe:
xnames <- c(colnames(enron_sparse_h2o)[-1],
"number_characters", "number_words", "number_stop_words", "proportion_stop_words")
# A lookup table of the column names that refer to words to the words they actually mean.
# Useful when we look at variable performance down the track.
wordcols <- data_frame(
variable = colnames(enron_sparse_h2o),
word = c("", colnames(enron_dtm))
)
# The slower models (ie apart from glm) take ages for cross-validation so
# we'll settle for single-split validation
enron_split <- h2o.splitFrame(enron_fulldata, ratios = c(0.6, 0.2))
dim(enron_split[[1]])
#-------------------GLM---------------------
# Binomial GLM, with elastic net regularization. This is the minimal baseline sort of model.
# 200 seconds
system.time({
mod.glm <- h2o.glm(x = xnames, y = "SPAM", training_frame = enron_split[[1]],
validation_frame = enron_split[[2]],
family = "binomial",
alpha = 0.5, lambda_search = TRUE)
})
h2o.varimp(mod.glm) %>%
slice(1:30) %>%
left_join(wordcols, by = c("names" = "variable")) %>%
arrange(sign, desc(coefficients))
h2o.performance(mod.glm, valid = TRUE) # 131 / 6640
#--------------------Random Forest-------------------
# with ntrees = 50, nfolds = 10, max_depth = 20, took 25 minutes to get
# 5% through so I stopped it and went back to having a single validation frame.
# Can view progress by going to http://127.0.0.1:54321/flow/index.html
# 1340 seconds
system.time({
mod.rf <- h2o.randomForest(x = xnames, y = "SPAM", training_frame = enron_split[[1]],
validation_frame = enron_split[[2]],
ntrees = 500, max_depth = 20,
stopping_tolerance = 0.0001, stopping_rounds = 3, score_tree_interval = 25)
})
# note from watching progress in flow, scoring every score_tree_interval models takes quite a lot of time.
# so I pushed out score_tree_interval. On current settings it will score
# the latest model against the validation frame every 25 trees, and stop if it hasn't improved since 3*25 trees ago.
# If score_tree_interval is a small number, it stops growing trees too quickly
h2o.performance(mod.rf, valid = TRUE) # 172/6640 error rate
#-------------------gradient boosting machine-----------------
# 1240 seconds
system.time({
mod.gbm <- h2o.gbm(x = xnames, y = "SPAM", training_frame = enron_split[[1]],
validation_frame = enron_split[[2]],
ntrees = 100, max_depth = 5,
stopping_tolerance = 0.0001, stopping_rounds = 3, score_tree_interval = 5)
})
h2o.performance(mod.gbm, valid = TRUE) # 240/6640 error rate
#-------------------artificial neural network------------------------
# 900 seconds; much faster when sparse = TRUE
system.time({
mod.dl <- h2o.deeplearning(x = xnames, y = "SPAM", training_frame = enron_split[[1]],
validation_frame = enron_split[[2]],
hidden = c(200, 200),
stopping_tolerance = 0.0001, stopping_rounds = 5, sparse = TRUE)
})
h2o.performance(mod.dl, valid = TRUE) # 82/6640
#-----------------Naive Bayes - doesn't work--------------------
# Conditional probabilities won't fit in the driver node's memory (20.99 GB > 6.06GB)
mod.nb <- h2o.naiveBayes(x = xnames, y = "SPAM", training_frame = enron_split[[1]],
validation_frame = enron_split[[2]])
#==============presentation of results==========================
perf <- function(newdata){
return(c(
glm = as.character(h2o.confusionMatrix(mod.glm, newdata = newdata)$Rate[1]),
rf = as.character(h2o.confusionMatrix(mod.rf, newdata = newdata)$Rate[1]),
gbm = as.character(h2o.confusionMatrix(mod.gbm, newdata = newdata)$Rate[1]),
dl = as.character(h2o.confusionMatrix(mod.dl, newdata = newdata)$Rate[1])
))
}
# this isn't the most efficient computing wise, because the performance on the training
# and validation sets could be extracted more directly but it is quick and easy to code:
perfs <- lapply(enron_split, perf)
perfsdf <- data_frame(value = unlist(perfs),
model = rep(c("Generalized Linear Model", "Random Forest", "Gradient Boosting Machine", "Deep Learning"), 3),
dataset = rep(c("Training", "Validation", "Testing"), each = 4),
abbrev = rep(names(perfs[[1]]), 3)) %>%
mutate(errors = as.numeric(str_sub(str_extract(value, "=[0-9]+"), start = 2)),
denom = as.numeric(str_sub(str_extract(value, "/[0-9]+"), start = 2)),
errorrate = errors / denom,
model = factor(model, levels = c("Deep Learning", "Generalized Linear Model",
"Random Forest", "Gradient Boosting Machine")),
dataset = factor(dataset, levels = c("Training", "Validation", "Testing")))
perfsdf %>%
ggplot(aes(x = errorrate, y = dataset, label = abbrev, colour = model)) +
geom_path(aes(group = abbrev), alpha = 0.5, linetype = 2) +
geom_text() +
scale_colour_brewer(palette = "Set1") +
scale_x_continuous("Error rate", label = percent) +
labs(y = "", colour = "") +
ggtitle("Error rates detecting ham from spam",
"Enron email dataset, four different statistical learning methods")
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.