The latest release of the stringdist package for approximate text matching has two performance-enhancing novelties. First of all, encoding conversion got a lot faster since this is now done from C rather than from R.
Secondly, stringdist now employs multithreading based on the OpenMP protocol. This means that calculations are now parallelized on multicore machines running OSs that support OpenMP.
The stringdist package offers two main functions, both of which are now parallelized with OpenMP:
stringdist can compute a number of different string metrics between vectors of strings (see here).
amatch is an approximate text matching version of R’s native match function.
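As a minimal sketch of how the two functions are called (the example strings are only an illustration and are not taken from the benchmark below):

```r
library(stringdist)

# Distance between two strings. The default method is the optimal
# string alignment distance ("osa").
stringdist("kitten", "sitting")

# Arguments are recycled, so one string can be compared against a vector.
stringdist("leia", c("uhura", "leela"))

# amatch() returns the index of the closest entry in the lookup table,
# or NA when nothing lies within maxDist.
amatch("leia", c("uhura", "leela"), maxDist = 5)
```

Note that amatch needs an explicit maxDist here: with its small default, only (near-)exact matches would be returned.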
By default, the package now uses the following number of cores: if your machine has one or two cores, all of them are used. If your machine has 3 or more cores, n-1 cores are used, where n is the number of cores reported by parallel::detectCores(). This way, you can still use your computer for other things while stringdist is doing its job. I set this default since I noticed in some benchmarks that using all cores in a computation is sometimes slower than using n-1 cores.
To override the default, amatch and stringdist now have an nthread argument. You may also alter the global option sd_num_thread, for example with options(sd_num_thread = 2), or change the environment variable OMP_THREAD_LIMIT prior to loading stringdist, but I’m digressing into details now.
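The ways of controlling the thread count can be sketched as follows. The default_threads calculation only mirrors the rule described above and is not the package's internal code. The input vectors a and b are made up for illustration.

```r
library(stringdist)

# Mirror of the default rule: all cores on 1- or 2-core machines,
# n-1 otherwise. detectCores() may return NA, so fall back to 1.
n <- parallel::detectCores()
default_threads <- if (is.na(n)) 1L else if (n >= 3L) n - 1L else n

a <- c("stringdist", "openmp", "thread")
b <- c("stringdust", "openmq", "threat")

# Per-call override through the nthread argument.
d1 <- stringdist(a, b, nthread = 1)
d2 <- stringdist(a, b, nthread = 2)

# The number of threads changes the speed, not the result.
identical(d1, d2)

# Session-wide override through the global option; options() returns
# the previous value so it can be restored afterwards.
old <- options(sd_num_thread = 2)
getOption("sd_num_thread")
options(old)

# The environment variable has to be set before the package is loaded,
# for example with Sys.setenv(OMP_THREAD_LIMIT = "2") in a fresh session.
```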
A simple benchmark on my quad-core Linux machine (code at the end of the post) shows a near-linear speedup as a function of the number of cores. The (default) distance computed here is the optimal string alignment distance. For this benchmark I sampled 10k strings of lengths between 5 and 31 characters. The first benchmark (left panel) shows the time it takes to compute 10k pairwise distances as a function of the number of threads used (nthread=1,2,3,4). The right panel shows how much time it takes to fuzzy-match 15 strings against a table of 10k strings as a function of the number of threads. The areas around the lines show the range between the 1st and 3rd quartile of the timings (thanks to the awesome microbenchmark package of Olaf Mersmann).
According to the Writing R Extensions manual, certain commercially available operating systems have extra (fixed?) overhead when running OpenMP-based multithreading. However, for larger computations this shouldn’t really matter.
library(microbenchmark)
library(stringdist)
set.seed(2015)
# number of strings
N
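For readers who want to try something similar, here is a minimal sketch of such a benchmark. It is not the code behind the figure: the helper rand_strings, the maxDist value, the number of repetitions, and the use of the first 15 strings as the query set are all assumptions chosen for illustration.

```r
library(microbenchmark)
library(stringdist)

set.seed(2015)

# Hypothetical helper: n random lowercase strings of 5 to 31 characters.
rand_strings <- function(n, min_len = 5L, max_len = 31L) {
  lens <- sample(min_len:max_len, n, replace = TRUE)
  vapply(
    lens,
    function(k) paste(sample(letters, k, replace = TRUE), collapse = ""),
    character(1)
  )
}

N <- 10000L
x <- rand_strings(N)
y <- rand_strings(N)

# Try 1 to 4 threads, but never more than the machine reports.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L
threads <- seq_len(min(4L, n_cores))

# Median run time in milliseconds for each thread count.
# microbenchmark records individual timings in nanoseconds.
bench <- function(f, times = 10L) {
  vapply(threads, function(k) {
    mb <- microbenchmark(f(k), times = times)
    median(mb$time) / 1e6
  }, numeric(1))
}

# 10k pairwise distances (default method: optimal string alignment).
pairwise_ms <- bench(function(k) stringdist(x, y, nthread = k))

# Fuzzy-match 15 strings against a table of 10k strings.
# maxDist = 15 is an assumed value.
amatch_ms <- bench(function(k) amatch(x[1:15], y, maxDist = 15, nthread = k))

data.frame(nthread = threads, pairwise_ms, amatch_ms)
```

Timings will differ per machine, so the resulting table is best read for the trend across nthread rather than for the absolute numbers.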