IMPORTANT NOTE
The code from this post is outdated (the package API has changed since publication). See this post.
Today I am pleased to present tmlite – a small but fast and robust package for text-mining tasks in R. It is not yet available on CRAN, but you can install it directly from GitHub:
devtools::install_github("dselivanov/tmlite")
A reasonable question is: why a new package, when R already has the great tm package and its companions tau and NLP?
I'll try to answer this question in the last part of the post.
Focus
As the Unix philosophy says – Do One Thing and Do It Well – we will focus on one particular problem: infrastructure for text analysis. The R ecosystem contains lots of packages that are well suited for working with sparse high-dimensional data (and are thus suitable for text modeling). Here are my favourites:
- lda – a blazing fast package for topic modeling.
- glmnet – for L1- and L2-regularized linear models.
- xgboost – for gradient boosting.
- LiblineaR – a wrapper for the liblinear SVM library.
- irlba – a fast and memory-efficient method for computing a few approximate singular values and singular vectors of large matrices.
These are all excellent and very efficient packages, so tmlite will focus (at least in the near future) not on modeling, but on infrastructure – Document-Term matrix construction and manipulation, the basis of any text-mining analysis. tmlite is partially inspired by gensim – a robust and well designed Python library for text mining. In the near future we will try to replicate some of its functionality.
tmlite is designed for practitioners (and kagglers!) who:
- understand what they want and how to do it, so we will not expose trivial high-level APIs like findAssocs, findFreqTerms, etc.
- work with medium to large collections of documents
- have at least a medium level of experience in R and know basic concepts of functional programming
Key features
Note that the package is at a very early alpha stage. This doesn't mean the package is not robust, but it does mean that the API can change at any time.
- Flexible and easy functional-style API. Easy chaining.
- Efficient and memory-friendly streaming corpus construction. tmlite provides an API for constructing corpora from character vectors and, more importantly, from connections. Read more about connections here. So it is possible (and easy!) to construct Document-Term matrices for collections of documents that don't fit in memory.
- Fast – core functions are written in C++, thanks to the Rcpp authors.
- Two main corpus classes:
  - DictCorpus – traditional dictionary-based container used for Document-Term matrix construction.
  - HashCorpus – container that implements feature hashing, or the "hashing trick". Similar to scikit-learn's FeatureHasher and gensim's corpora.hashdictionary. HashCorpus is a high-speed, low-memory vectorizer: instead of building a hash table of the features encountered in training, as dictionary-based vectorizers do, it applies a hash function to each feature to determine its column index in the sample matrix directly (see the sketch after this list).
- The Document-Term matrix is the key object. At the moment it can be extracted from a corpus as a dgCMatrix, a dgTMatrix, or in the LDA-C format, which is the standard for the lda package. dgCMatrix is the default sparse-matrix class in R, and most packages that work with sparse matrices expect it, so interacting with them will be easy.
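To make the hashing trick concrete, here is a minimal illustrative sketch in plain R. This is not tmlite's implementation (which lives in C++); the hash_size value and the use of murmur32 from the digest package are assumptions for illustration only:

library(digest)

# map each token straight to a column index in [1, hash_size] -
# no dictionary of seen terms is ever built
hashing_vectorizer <- function(tokens, hash_size = 2^18) {
  counts <- integer(hash_size)
  for (token in tokens) {
    h <- digest(token, algo = "murmur32", serialize = FALSE)  # 8 hex chars
    # keep 7 hex digits so the value fits into R's 32-bit signed integers
    j <- strtoi(substr(h, 2, 8), base = 16L) %% hash_size + 1L
    counts[j] <- counts[j] + 1L
  }
  counts
}

# "fast" always lands in the same column; distinct tokens may collide
which(hashing_vectorizer(c("fast", "robust", "fast")) > 0)

The price of this design is the possibility of collisions – two distinct terms mapping to the same column – which the quick-reference example below runs into.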
Quick reference
The first quick example is based on data from Kaggle's Bag of Words Meets Bags of Popcorn competition – labeledTrainData.tsv.zip.
Here I'll demonstrate the flexibility of the corpus-creation procedure and how to vectorize a large collection of documents.
Suppose the text file is very large but contains 3 tab-separated columns, only one of which is relevant (the third column in the example below). We want to create a corpus but can't read the whole file into memory. Here is how this is resolved. First, load the libraries:
library(methods)
library(tmlite)
## Loading required package: Matrix
# for pipe syntax
library(magrittr)
The file contains 3 columns – id, sentiment, review. Only review is relevant.
A simple preprocessing function will do the trick for us – we will only read the third column, the text of the review.
# function receives a character vector - a batch of rows
preprocess_fun <- function(x) {
  # file is tab-separated - split each row by '\t'
  rows <- strsplit(x, '\t', fixed = TRUE)
  # text of the review is in the third column
  txt <- sapply(rows, function(x) x[[3]])
  # tolower, keep only letters
  simple_preprocess(txt)
}
Read the documents and create a dictionary-based corpus:
# we don't want to read the whole file into RAM - we will read it iteratively, batch by batch
path <- '~/Downloads/labeledTrainData.tsv'
con <- file(path, open = 'r', blocking = FALSE)
corp <- create_dict_corpus(src = con,
                           preprocess_fun = preprocess_fun,
                           # simple_tokenizer - split string by whitespace
                           tokenizer = simple_tokenizer,
                           # read in batches of 1000 documents
                           batch_size = 1000,
                           # skip first row - header
                           skip = 1,
                           # do not show progress bar because of knitr
                           progress = FALSE)
Now we want to try to predict sentiment based on the review text. For that we will use the glmnet package, so we have to create a Document-Term matrix in dgCMatrix format. That is easy with the get_dtm function:
dtm <- get_dtm(corpus = corp, type = "dgCMatrix") %>%
  # remove very common and very uncommon words
  dtm_transform(filter_commons_transformer, term_freq = c(common = 0.001, uncommon = 0.975)) %>%
  # make tf-idf transformation
  dtm_transform(tfidf_transformer)
dim(dtm)
## [1] 25000 10067
Cool. We have the feature matrix, but we don't have the response variable, which is still in the large file (which possibly won't fit into memory). Fortunately, reading particular columns is easy – see, for example, this stackoverflow discussion. We will use the fread() function from the data.table package:
library(data.table)
# read only the second column - value of sentiment
dt <- fread(path, select = c(2))
Now everything is ready for model fitting.
library(glmnet)
## Loading required package: foreach
## Loaded glmnet 2.0-2
# I have a 4-core machine, so I will use a parallel backend for n-fold cross-validation
library(doParallel)
## Loading required package: iterators
## Loading required package: parallel
registerDoParallel(4)
# train logistic regression with 4-fold cross-validation, maximizing AUC
fit <- cv.glmnet(x = dtm, y = dt[['sentiment']],
                 family = "binomial",
                 type.measure = "auc",
                 nfolds = 4,
                 parallel = TRUE)
plot(fit)
print(paste("max AUC = ", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.9483"
Not bad!
Now let's try to construct a DTM using the HashCorpus class. Our data is tiny, but for larger data or streaming environments HashCorpus is the natural choice. Read the documents and create a hash-based corpus:
con <- file(path, open = 'r', blocking = FALSE)
hash_corp <- create_hash_corpus(src = con,
                                preprocess_fun = preprocess_fun,
                                # simple_tokenizer - split string by whitespace
                                tokenizer = simple_tokenizer,
                                # read in batches of 1000 documents
                                batch_size = 1000,
                                # skip first row - header
                                skip = 1,
                                # don't show progress bar because of knitr
                                progress = FALSE)
hash_dtm <- get_dtm(corpus = hash_corp, type = "dgCMatrix") %>%
  dtm_transform(filter_commons_transformer, term_freq = c(common = 0.001, uncommon = 0.975)) %>%
  dtm_transform(tfidf_transformer)
# note that ncol(hash_dtm) > ncol(dtm) - an effect of collisions;
# we can fix this by increasing the `hash_size` parameter
dim(hash_dtm)
## [1] 25000 10107
registerDoParallel(4)
hash_fit <- cv.glmnet(x = hash_dtm, y = dt[['sentiment']],
                      family = "binomial",
                      type.measure = "auc",
                      nfolds = 4,
                      parallel = TRUE)
plot(hash_fit)
# nearly the same result
print(paste("max AUC = ", round(max(hash_fit$cvm), 4)))
## [1] "max AUC = 0.9481"
Future work
The project has an issue tracker on GitHub where I file feature requests and notes for future work. Any ideas are greatly appreciated.
If you like it, you can help:
- Test it and leave feedback on the GitHub issue tracker (preferably) or directly by email.
  - The package is tested on Linux and OS X, so Windows users are especially welcome.
- Fork it and start contributing. Vignettes, docs, tests, and use cases are very welcome.
- Or just give me a star on the project page 🙂
Short-term plans
- add tests
- add n-gram tokenizers
- add methods for tokenization in C++ (at the moment tokenization takes almost half of the runtime)
- switch to the murmur3 hash and add a second hash function to reduce the probability of collisions (see the sketch below)
- push dictionary and stopword filtering into the C++ code
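On the second-hash point above, here is a tiny hypothetical sketch of the idea – this is not tmlite code, and the digest package and the seed value are assumptions. An independently seeded hash picks the sign of each count update, so tokens that collide on the same column tend to cancel rather than accumulate (the same trick scikit-learn's FeatureHasher uses):

library(digest)

# a second, independently seeded hash decides the sign of each update;
# colliding tokens then contribute with opposite signs on average,
# so collisions tend to cancel instead of inflating counts
hash_sign <- function(token) {
  h <- digest(token, algo = "murmur32", serialize = FALSE, seed = 42L)
  if (strtoi(substr(h, 8, 8), base = 16L) %% 2L == 0L) 1L else -1L
}

# the update rule from the earlier sketch becomes:
# counts[j] <- counts[j] + hash_sign(token)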
Middle-term plans
- add a word2vec wrapper. It is strange that the R community still doesn't have one.
- add corpus serialization
Long-term plans
- integrate models as is done in gensim
- try to implement out-of-core transformations like gensim does
Reasons why I started developing tmlite
All conclusions below are based on personal experience, so they may be heavily biased.
I first started using tm at the end of 2014. I tried to process a collection of text documents that was less than 1 Gb – about 10000 texts. Surprisingly, I wasn't able to process them on a machine with 16 Gb of RAM! But what is really cool – R and all its packages are open source, so I started to examine the source code. Unfortunately, I ended up rewriting most of the package. That first version (anyone interested can browse the commit history on GitHub) was quite robust and could handle such tiny-to-medium collections of documents. After that I tried it in some Kaggle competitions, but didn't do any new development, since my work wasn't related to text analysis and I had no time for it. I also noted that almost all text-mining packages in R have a tm dependency. We will try to develop an alternative.
About a month ago I started a full redesign (based on the previous experience); I have now rewritten the core functions in C++ and want to bring the alpha version to the community.
So here is why you should not use tm:
- tm has a lot of functions – the reference manual contains more than 50 pages – but its API is very messy. A lot of packages depend on it, so it is hard to redesign.
- tm is not very efficient (in my experience). I found it very slow and, what is more important, very RAM-unfriendly and RAM-greedy (I'll provide a few examples below). As I understand it, it is designed more for academic researchers than for data science practitioners. It handles metadata perfectly and processes different encodings. The API is very high-level, but the price for that is performance.
- tm can only handle documents that fit in RAM. (To be fair, I should say that there is a PCorpus() function, but it seems it cannot help with Document-Term matrix construction when the size of the documents is larger than RAM – see the examples below. DocumentTermMatrix() is very RAM-greedy.)
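For completeness, a disk-backed tm corpus looks roughly like this. This is a minimal sketch assuming the filehash package is installed; the database file name is hypothetical and txt stands for a character vector of documents:

library(tm)
library(filehash)
# documents are stored in a local filehash database instead of RAM...
pcorp <- PCorpus(VectorSource(txt),
                 dbControl = list(dbName = "corpus.db", dbType = "DB1"))
# ...but DocumentTermMatrix() still does its work in memory,
# which is where the bottleneck described above remains
dtm_p <- DocumentTermMatrix(pcorp)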
Comparison with tm
Some naive benchmarks on Document-Term matrix construction
Here I'll provide a simple benchmark that can give some impression of tmlite's speed compared to tm. For now we assume that the documents are already in memory, so we only need to clean the text and tokenize it:
library(tm)
## Loading required package: NLP
library(data.table)
library(tmlite)
dt <- fread('~/Downloads/labeledTrainData.tsv')
txt <- dt[['review']]
print(object.size(txt), quote = FALSE, units = "Mb")
## 32.8 Mb
system.time( corpus_tm <- VCorpus(VectorSource(txt)) )
## user system elapsed
## 2.081 0.011 2.095
print(object.size(corpus_tm), quote = FALSE, units = "Mb")
## 121.4 Mb
# 121.4 Mb!!!
system.time( corpus_tm <- tm_map(corpus_tm, content_transformer(simple_preprocess)) )
## user system elapsed
## 10.761 0.281 6.591
system.time( dtm_tm <- DocumentTermMatrix(corpus_tm, control = list(tokenize = words)) )
## user system elapsed
## 15.002 0.740 12.227
Now let's check the timings for tmlite:
system.time(
  corp <- create_dict_corpus(src = txt,
                             preprocess_fun = simple_preprocess,
                             # simple_tokenizer - split string by whitespace
                             tokenizer = simple_tokenizer,
                             # read in batches of 5000 documents
                             batch_size = 5000,
                             # do not show progress bar because of knitr
                             progress = FALSE)
)
## user system elapsed
## 10.127 0.079 10.224
# get the dtm in dgTMatrix form, because tm stores its dtm in triplet form
system.time( dtm <- get_dtm(corpus = corp, type = "dgTMatrix") )
## user system elapsed
## 0.042 0.008 0.050
Well, only about two times faster. Is it worth the effort? Let's check another example. Here we will use data from the excellent Mining Massive Datasets course. This is quite a large collection of short texts – more than 9 million rows, 500 Mb zipped and about 1.4 Gb unzipped.
# we will read only a small fraction - 200000 rows (~42 Mb)
txt <- readLines('~/Downloads/sentences.txt', n = 2e5)
print(object.size(txt), quote = FALSE, units = "Mb")
## 41.7 Mb
# VCorpus is very slow - about 20 sec on my computer
system.time( corpus_tm <- VCorpus(VectorSource(txt)) )
## user system elapsed
## 19.340 0.204 19.573
print(object.size(corpus_tm), quote = FALSE, units = "Mb")
## 749.8 Mb
# 749.8 Mb!!! wow!
system.time( corpus_tm <- tm_map(corpus_tm, content_transformer(simple_preprocess)) )
## user system elapsed
## 20.629 1.487 29.161
# ~29 sec to process 42 Mb of text
But the following is truly absurd. DocumentTermMatrix() forks 2 processes (because it uses mclapply internally), and each process uses 1.3 Gb of RAM – 2.6 Gb of RAM to process a 42 Mb text chunk. And this takes more than 50 seconds on my MacBook Pro with the latest Intel Core i7 chip. In fact, it is not possible to process 1 million rows (200 Mb) on my MacBook Pro with 16 Gb of RAM.
system.time( dtm_tm <- DocumentTermMatrix(corpus_tm, control = list(tokenize = words)) )
## user system elapsed
## 99.256 3.884 53.380
Compare with tmlite:
system.time(
  corp <- create_dict_corpus(src = txt,
                             preprocess_fun = simple_preprocess,
                             # simple_tokenizer - split string by whitespace
                             tokenizer = simple_tokenizer,
                             # read in batches of 5000 documents
                             batch_size = 5000,
                             # do not show progress bar because of knitr
                             progress = FALSE)
)
## user system elapsed
## 10.025 0.050 10.081
# only around 10 sec and 120 Mb of RAM
system.time( dtm_tmlite <- get_dtm(corpus = corp, type = "dgTMatrix") )
## user system elapsed
## 0.116 0.016 0.133
# less than 1 second
So here tmlite is 8 times faster and, much more importantly, consumes 20 times less RAM. On large collections of documents the speed-up will be even more significant.
Document-Term Matrix manipulations
In practice it can be useful to remove common and uncommon terms, and both packages provide functions for that: removeSparseTerms() in tm and dtm_remove_common_terms in tmlite. Also note that removeSparseTerms() can only remove uncommon terms, so to be fair we will test only that functionality:
system.time( dtm_tm_reduced <- removeSparseTerms(dtm_tm, 0.99) )
## user system elapsed
## 1.422 0.104 1.535
# common = 1 => do not remove common terms
system.time( dtm_tmlite_reduced <- dtm_tmlite %>%
  dtm_transform(filter_commons_transformer, term_freq = c(common = 0.001, uncommon = 0.975)) )
## user system elapsed
## 0.350 0.081 0.431
3-5 times faster – not bad. Now let's compare the tf-idf transformation:
system.time( dtm_tm_tfidf <- weightTfIdf(dtm_tm, normalize = T) )
## Warning in weightTfIdf(dtm_tm, normalize = T): empty document(s): 6782
## 26135 26136 26137 26138 26139 26140 26141 26142 26143 26144 26145 27664
## 60895 60896 60897 60898 60899 60900 88953 106921 122685 141442 141443
## 141449 141454 152656
## user system elapsed
## 0.246 0.028 0.274
# timings are slightly greater than weightTfIdf, because all transformations are optimized for
# the dgCMatrix format, which is the standard for sparse matrices in R
system.time( dtm_tmlite_tfidf <- dtm_tmlite %>% dtm_transform(tfidf_transformer) )
## user system elapsed
## 0.390 0.091 0.481
# for a dtm in dgCMatrix format, timings should be equal
dtm_tmlite_dgc <- as(dtm_tmlite, "dgCMatrix")
system.time( dtm_tmlite_tfidf <- dtm_tmlite_dgc %>% dtm_transform(tfidf_transformer) )
## user system elapsed
## 0.252 0.049 0.302
Equal timings – great (and a surprise for me): over the last year the tm authors have significantly improved its performance!