text2vec 0.3
Today I’m pleased to announce a preview of the new version of text2vec. It lives in the 0.3 development branch, but very soon (probably in about a week) it will be merged into master.
To reproduce the examples below, please install text2vec@0.3 from GitHub:
I’m also waiting for feedback from text2vec users, so please spend a few minutes on these questions:
- What APIs are not clear / not intuitive?
- What functionality is missing?
- Do you have any problems with speed / RAM usage?
Overview
In two words: text2vec became faster and more user-friendly. While working on this version I barely touched the underlying core C++ code and focused on high-level features and usability. First I will briefly describe the main improvements, and then I will provide a full-featured example.
In this post I would like to highlight the following improvements:
- an important bugfix
- dtm keeps document ids as rownames
- several API breaks: some functions were removed, some were renamed, and some have different default arguments
- performance improvements: all core functions have a parallel mode
The full list of features and changes is available on GitHub, marked with the 0.3 tag.
Bugfix
There was one significant bug: when the last document had no terms (at least none from the vocabulary), i.e. the last row of the dtm was all zeros, the get_dtm() function omitted this last row. So the dtm had fewer rows than the number of documents in the corpus. This is now fixed.
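One way a trailing empty row can go missing is sketched below with the Matrix package. This is an illustration of the general failure mode, not text2vec's actual code: if a sparse matrix is assembled from (row, column, value) triplets and its dimensions are inferred from the largest index seen, a last document without terms never contributes a row. The toy indices are made up.

```r
library(Matrix)

# Three documents; the third has no terms from the vocabulary,
# so it contributes no (row, column, value) triplets at all.
n_docs  <- 3L
n_terms <- 4L
i <- c(1L, 1L, 2L)   # document index of each non-zero entry
j <- c(1L, 3L, 4L)   # term index of each non-zero entry
x <- c(2, 1, 1)      # term counts

# Dimensions inferred from max(i) and max(j): the empty last row is lost.
dtm_inferred <- sparseMatrix(i = i, j = j, x = x)

# Dimensions stated explicitly: the all-zero last row is kept.
dtm_explicit <- sparseMatrix(i = i, j = j, x = x, dims = c(n_docs, n_terms))

dim(dtm_inferred)
dim(dtm_explicit)
```

Stating the dimensions explicitly is the usual remedy whenever the number of documents is known up front.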
Preserving document ids in corpus and dtm
I’m not only the developer of text2vec, but also probably its most active user. Since the first public release I have felt that I needed to smooth some rough edges. One of the most obviously missing things was a mechanism for keeping document ids during corpus (and dtm) construction. Now it is straightforward: if the input of the itoken function has names, these names will be used as document ids.
New high-level API
In 0.2 the corpus was the central object. We can think of it as a container with reference semantics, which allows us to perform vectorization and collect term co-occurrence statistics simultaneously. After the corpus is created, only two functions are useful in 99% of cases: get_dtm and get_tcm. After that, users usually work with matrices. This means that the corpus is really an intermediate object that should mainly be used internally. In real life users usually need a Document-Term matrix (dtm) or a Term-Cooccurrence matrix (tcm), and getting these directly simplifies the transition from raw text to a vector space.
In 0.3 I introduce a new higher-level API for direct dtm and tcm creation: the create_dtm() and create_tcm() functions. This simplification also allows me to implement efficient concurrent growing of the dtm and tcm. create_dtm() and create_tcm() use create_corpus() internally, but hide all the gory details and take care of parallel execution. Experienced users who need to vectorize a corpus and collect co-occurrence statistics simultaneously can still use create_corpus() and the corresponding get_dtm() and get_tcm() functions.
Another refinement is the introduction of the vectorizer concept. A vectorizer is a function which performs the mapping from raw text space to vector space. There are 2 kinds of vectorizers:
- vocab_vectorizer, which uses a vocabulary to perform bag-of-ngrams vectorization;
- hash_vectorizer, which uses feature hashing (the hashing trick).
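The difference between the two kinds can be sketched in plain R. This is a toy illustration of the two ideas, not text2vec code: toy_hash() is a deliberately simple, hypothetical hash function, and the vocabulary and hash size are made up.

```r
tokens <- c("the", "cat", "sat", "on", "the", "mat")

# 1. Vocabulary-based: a term's column is its position in a known vocabulary.
vocab <- c("cat", "mat", "sat", "the")
vocab_cols <- match(tokens, vocab)   # NA for out-of-vocabulary terms ("on")
vocab_counts <- tabulate(vocab_cols[!is.na(vocab_cols)], nbins = length(vocab))
names(vocab_counts) <- vocab

# 2. Hashing trick: a term's column is a hash of the term modulo the table size.
# No vocabulary (and no extra pass over the data) is needed, but different
# terms may collide in the same column.
toy_hash <- function(term, hash_size) {
  codes <- utf8ToInt(term)
  sum(codes * seq_along(codes)) %% hash_size + 1L
}
hash_size <- 8L
hash_cols <- vapply(tokens, toy_hash, integer(1), hash_size = hash_size)
hash_counts <- tabulate(hash_cols, nbins = hash_size)

vocab_counts
hash_counts
```

The vocabulary route drops unknown terms and gives interpretable columns; the hashing route keeps every token but loses the mapping from column back to term.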
Iterators
As was pointed out here, in the case of vocabulary vectorization we perform 2 passes over the input source. This means we read, preprocess and tokenize twice. While I/O is usually not an issue (if you use an efficient reader like data.table::fread or functions from the readr package), preprocessing can take a significant amount of time. For this reason I created an itoken S3 method which works with a list of character vectors, i.e. a list of tokens. Now the user can tokenize the input once and then reuse the list of tokens in vocabulary, dtm and tcm construction. See the examples below.
Vocabulary
There were several improvements to vocabulary construction:
- stopwords filtering during vocabulary construction (especially useful for ngrams with n > 1);
- vocabulary can be built in parallel using all your CPU cores;
- prune_vocabulary() became slightly more efficient: it performs fewer unnecessary computations.
Transformers
All transformers were renamed; they now all start with transformer_* (this was done for more convenient work with autocompletion):
- transformer_binary
- transformer_tfidf
- transformer_tf
- transformer_filter_commons: still useful, even with some overlap with prune_vocabulary
The following example demonstrates the new pipeline with many text2vec features (note how flexible text2vec can be, thanks to its functional style):
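A minimal sketch of the vocabulary step of such a pipeline. It assumes the 0.3 function names used in this post (itoken, vocabulary, prune_vocabulary), the movie_review dataset and word_tokenizer() shipped with the package, and the argument names stopwords and term_count_min; the stop-word list is an illustrative choice.

```r
library(text2vec)
data("movie_review")

txt <- movie_review$review
names(txt) <- movie_review$id   # names are meant to become document ids

# Illustrative stop-word list (an assumption, chosen for this sketch).
stop_words <- c("i", "me", "my", "myself", "we", "our",
                "ours", "ourselves", "you", "your", "yours")

# Iterator over lower-cased, tokenized documents.
it <- itoken(txt, tolower, word_tokenizer)

# First pass: collect the vocabulary, skipping stop words
# (argument name `stopwords` assumed).
vocab <- vocabulary(it, stopwords = stop_words)

# Drop rare terms (argument name `term_count_min` assumed).
vocab <- prune_vocabulary(vocab, term_count_min = 5)

str(vocab)
```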
Loading required package: methods
List of 4
 $ vocab         :Classes 'data.table' and 'data.frame': 9595 obs. of 3 variables:
  ..$ terms       : chr [1:9595] "fiorentino" "bfg" "tadashi" "kabei" ...
  ..$ terms_counts: int [1:9595] 5 8 5 5 11 5 6 10 6 8 ...
  ..$ doc_counts  : int [1:9595] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, ".internal.selfref")=<externalptr>
 $ ngram         : Named int [1:2] 1 1
  ..- attr(*, "names")= chr [1:2] "ngram_min" "ngram_max"
 $ document_count: int 5000
 $ stopwords     : chr [1:11] "i" "me" "my" "myself" ...
 - attr(*, "class")= chr "text2vec_vocabulary"
One important note. In the current R implementation, iterators are mutable. So at this point our iterator is empty:
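The same behaviour can be seen with a plain iterator from the iterators package. Whether itoken relies on that package internally is an assumption here; the point is only that an iterator object remembers how far it has been read.

```r
library(iterators)

it <- iter(c("first doc", "second doc"))

nextElem(it)  # consumes the first element
nextElem(it)  # consumes the second element

# A third call signals an error with the message "StopIteration":
# the iterator has been exhausted by the calls above.
exhausted <- tryCatch(
  { nextElem(it); FALSE },
  error = function(e) conditionMessage(e) == "StopIteration"
)

# Creating the iterator again resets the position.
it <- iter(c("first doc", "second doc"))
first_again <- nextElem(it)
```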
So before corpus / dtm / tcm construction we need to reinitialise it. Here we create the dtm directly:
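A sketch of direct dtm creation, assuming that create_dtm() takes an itoken iterator followed by a vectorizer, that vocab_vectorizer() takes the vocabulary as its first argument, and that the movie_review dataset is available. The last two lines check that the document ids end up as rownames.

```r
library(text2vec)
data("movie_review")

txt <- movie_review$review
names(txt) <- movie_review$id   # names are meant to become document ids

# First pass: build the vocabulary (this exhausts the iterator).
vocab <- vocabulary(itoken(txt, tolower, word_tokenizer))

# Second pass: a fresh iterator, vectorized with the vocabulary
# (argument order of create_dtm() assumed).
it <- itoken(txt, tolower, word_tokenizer)
dtm <- create_dtm(it, vocab_vectorizer(vocab))

head(rownames(dtm))
all(rownames(dtm) == names(txt))
```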
[1] "5814_8" "2381_9" "7759_3" "3630_4" "9495_8" "8196_8"
[1] TRUE
Or the tcm:
Old-style simultaneous vectorization and collection of co-occurrence statistics:
Another option is to use hash_vectorizer. The procedure is the same:
Parallel mode
Now create_dtm, create_tcm and vocabulary take advantage of multicore machines, and do so in a transparent manner. In contrast to GloVe fitting, which uses low-level thread parallelism via RcppParallel, the other functions use standard R high-level parallelism on top of the foreach package. They are flexible and can use different parallel backends: doParallel, doRedis, etc. But the user should remember that such high-level parallelism can involve significant overhead.
The user has to do only two things manually to take advantage of a multicore machine:
- prepare splits of the input data in the form of a list of itoken iterators;
- register a parallel backend.
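The two steps can be sketched with foreach and doParallel alone. This is a generic illustration of the split / register / combine pattern, not text2vec's internals: the chunks here are raw strings rather than itoken iterators, and the toy documents and worker count are made up.

```r
library(foreach)
library(doParallel)

docs <- c("a b c", "a a d", "b c c", "d e f", "a e e", "f f b")

# 1. Split the input into one chunk per worker.
n_workers <- 2L
splits <- split(docs, rep(seq_len(n_workers), length.out = length(docs)))

# 2. Register a parallel backend.
registerDoParallel(n_workers)

# Each worker counts terms in its own chunk ...
partial_counts <- foreach(chunk = splits) %dopar% {
  table(unlist(strsplit(chunk, " ", fixed = TRUE)))
}

# ... and the partial counts are merged afterwards.
term_counts <- tapply(
  unlist(lapply(partial_counts, as.vector)),
  unlist(lapply(partial_counts, names)),
  sum
)

stopImplicitCluster()
term_counts
```

With inputs this small the parallel version is slower than a plain loop; the pattern only pays off when each chunk carries a substantial amount of work.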
Here is a simple example with timings:
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
   user  system elapsed
  0.363   0.000   0.364
   user  system elapsed
  0.020   0.019   0.260
   user  system elapsed
  0.435   0.043   0.957
   user  system elapsed
  1.288   0.301   0.693
   user  system elapsed
  0.488   0.157   0.930
   user  system elapsed
  0.764   0.183   0.542
   user  system elapsed
  0.787   0.285   3.053
   user  system elapsed
  2.871   0.202   1.829
As you can see, the speedup is not perfect. This is because R’s high-level parallelism has significant overhead on small tasks. On larger tasks you can expect almost linear speedup!
Bonus: how fast is fast?
On a 16-core machine I was able to vectorize English Wikipedia (unigrams; 13 GB of text, 4M documents) in 2.5 minutes using the hash vectorizer and in 6 minutes using the vocabulary vectorizer. The timings include the time spent reading from disk! The resulting dtm was about 13 GB, and at peak the R processes consumed about 30 GB of RAM. (Try to do that with any other R package or Python module.)
Here is the code: