An OpenBLAS-based Rblas for Windows 64


One of the more important pieces of software that powers R is its BLAS, which stands for Basic Linear Algebra Subprograms. This is the suite of programs which, as its name implies, performs basic linear algebra routines such as vector copying, scaling, and dot products; linear combinations; and matrix multiplication. It is the engine on which LINPACK, EISPACK, and the now industry-standard LAPACK and its variants are built. As a lower-level software suite, the BLAS can be fine-tuned to different computer architectures, and a tuned BLAS can provide a dramatic speedup in many fundamental R routines. A number of projects develop tuned BLAS suites, such as the open-source ATLAS, GotoBLAS2, and OpenBLAS, and the closed-source Intel MKL and AMD ACML libraries.
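To see why this matters in R, note that dense matrix arithmetic is handed off to the BLAS; for example, %*% on double matrices calls the BLAS routine dgemm, so swapping Rblas.dll directly changes the speed of a line like the following (a minimal illustration; the matrix size is arbitrary):

m <- matrix(rnorm(1e6), 1000, 1000)  # a 1000x1000 matrix of random normals
system.time(m %*% m)                 # %*% dispatches to the BLAS routine dgemm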

In R itself, the BLAS is stored in the shared library file Rblas.dll, and a properly compiled tuned version can simply be dropped into the bin subdirectory and used immediately. For people working in Linux, there is significant support for specially tuned BLAS (and LAPACK) files, described in great detail in the R Installation manual; the support for Windows is somewhat less robust, to be charitable. For many years, I was constrained to working in a 32-bit Windows environment, and took advantage of the 32-bit Windows compiled versions that existed in R’s repository of tuned BLAS files. However, at that time, the most recent version was for a Pentium 4, so over the next few years I struggled through compiling ATLAS-based Rblas files for 32-bit Windows. I did this for the Core 2 Duo (C2D) and the quad-core SandyBridge (Core i7, called C2i7 in the repository). The speedup was dramatic (see this blog, which uses the Core 2 BLAS I compiled, for an example). However, I found ATLAS difficult to compile and subject to any number of issues, and, not really being a programmer, I often ran into problems I could not solve. Moreover, I was never able to successfully compile a 64-bit BLAS which passed the comprehensive make check-all test suite.

Recently, I made the complete switch to 64-bit Windows at both work and home, so finding a 64-bit Rblas became much more important. There are some pre-compiled 64-bit BLAS files for R, graciously compiled by Dr. Ei-ji Nakama, which can be found here. Using these binaries, I found a dramatic increase in speed over the reference BLAS. However, the most recent processor-specific BLAS in that repository is for the Nehalem architecture, and cannot take advantage of the AVX instructions built into the newer SandyBridge and IvyBridge cores. Much to my frustration, I had many failures trying to compile a 64-bit ATLAS-based BLAS which, when compiled into R, would pass all the checks. With each try taking between six and a dozen hours, I eventually gave up trying to use ATLAS and resigned myself to living with the GotoBLAS-based files, which honestly was not much of a resignation.

A bit more recently, I came across the OpenBLAS project, which claims to have reached near-MKL speeds on the Sandy/IvyBridge architecture thanks to some elegant hand-coding and optimization, and I was hoping to take advantage of this in R. Unfortunately, there are no pre-compiled binaries to make use of, so I had to attempt the compilation on my own. What made this a bit more difficult is that, officially, R for Windows does not support using Goto- or ACML-based BLAS routines, and even ATLAS has limited support (see this thread, specifically Dr. Ripley’s response). This called for a lot of trial and error, which originally resulted in dismal failure.

Serendipitously, around the time of the R 3.0.1 release, there was an OpenBLAS update as well. Trying again, I was finally successful in compiling a single-threaded BLAS, based on OpenBLAS 0.2.8, for the SandyBridge architecture on 64-bit Windows that, when used in the R compilation, created an Rblas that passed make check-all! For those interested in compiling their own: once OpenBLAS is compiled, it can be treated as an ATLAS BLAS in R’s Makefile.local, with the only additional change being to point to the compiled .a file in src\extra\blas\Makefile.win.
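For reference, the two changes amount to something like the following. This is a sketch of my setup, not official documentation: the USE_ATLAS and ATLAS_PATH variables come from R’s existing ATLAS support, and the paths and library names will differ on your machine.

## In Makefile.local (MkRules.local in some setups), enable the ATLAS hook
## and point it at the directory holding the compiled OpenBLAS:
USE_ATLAS = YES
ATLAS_PATH = c:/OpenBLAS/lib

## In src/extra/blas/Makefile.win, link the OpenBLAS archive where the
## ATLAS libraries would otherwise be linked, e.g. replace something like
##     -L"$(ATLAS_PATH)" -lf77blas -latlas
## with the single compiled archive:
##     "$(ATLAS_PATH)"/libopenblas.a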

Once I was successful with the SandyBridge-specific file, I compiled an Rblas that was not SandyBridge-dependent but could be used on any i386 machine. I plan on submitting both to Dr. Uwe Ligges at R, and hope that, like the other BLAS files I submitted, they will eventually be posted.

To demonstrate the increase in speed that using a tuned BLAS can provide in R, I ran a few tests. First, I created two 1000×1000 matrices populated with random normal variates. Specifically:

A <- matrix(rnorm(1e6, 1000, 100), 1000, 1000)
B <- matrix(rnorm(1e6, 1000, 100), 1000, 1000)
write.csv(A, file="C:/R/A.csv", row.names=FALSE)
write.csv(B, file="C:/R/B.csv", row.names=FALSE)

I can provide the specific matrices to anyone interested. I then compiled a basically vanilla R 3.0.2 for 64-bit Windows, using only -mtune=corei7-avx -O3 for optimization, so the code should run on any i386. I followed the compilation steps for a full installation, so it included base, bitmapdll, cairodevices, recommended, vignettes, manuals, and rinstaller for completeness. Using a Dell M4700 (i7-3740QM 2.7GHz, Windows 7 Professional 64-bit, 8GB RAM), I tested the following BLAS files:

- Reference BLAS (as shipped with R)
- GotoBLAS_Dynamic (from Dr. Nakama's repository)
- GotoBLAS_Nehalem (from Dr. Nakama's repository)
- OpenBLAS_Dynamic (my build for any i386)
- OpenBLAS_SandyBridge (my SandyBridge-specific build)

I updated all packages and installed the microbenchmark package (all from source). To test the effect of each BLAS, I renamed and copied the appropriate file to Rblas.dll each time, as sketched below. The test ran multiple iterations of crossprod, solve, qr, svd, eigen, and lu (the last needs the Matrix package, but that is a recommended package).
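The swap itself is just a file copy; here is a minimal sketch. The paths are illustrative, and the copy has to happen while no R session has the DLL loaded, since Windows locks Rblas.dll while R is running (so run this from a different R installation, or do the equivalent copy in Explorer or a command prompt):

rhome <- "C:/R/R-3.0.2"                  # illustrative install location
blas  <- file.path(rhome, "bin/x64/Rblas.dll")
file.copy(blas, file.path(rhome, "bin/x64/Rblas_reference.dll"))  # keep a backup
file.copy("C:/R/BLAS/OpenBLAS_SandyBridge.dll", blas, overwrite = TRUE)

The actual test code is: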

library(microbenchmark)
library(Matrix)
A <- as.matrix(read.csv(file="C:/R/BLAS/A.csv", colClasses='numeric'))
B <- as.matrix(read.csv(file="C:/R/BLAS/B.csv", colClasses='numeric'))
colnames(A) <- colnames(B) <- NULL
microbenchmark(crossprod(A,B), solve(A), qr(A, LAPACK=TRUE), svd(A), eigen(A),
               lu(A), times=100L, unit='ms')

The results are illuminating:

Reference BLAS:

Unit: milliseconds
                 expr       min        lq    median        uq       max neval
      crossprod(A, B) 1173.4761 1184.7500 1190.9430 1198.4409 1291.9620   100
             solve(A) 1000.1138 1011.5769 1018.7613 1027.0789 1172.6551   100
 qr(A, LAPACK = TRUE)  625.3756  633.2889  638.0123  644.8645  713.7831   100
               svd(A) 3045.0855 3074.7472 3093.8418 3138.0180 3621.9590   100
             eigen(A) 5491.4986 5563.1129 5586.9327 5632.5695 5836.6824   100
                lu(A)  170.8486  172.5362  175.5632  178.7300  247.1116   100

GotoBLAS_Dynamic:

Unit: milliseconds
                 expr         min          lq      median         uq        max neval
      crossprod(A, B)    63.68471    83.16222    92.33265   104.5118   184.4941    20
             solve(A)   207.38801   229.81548   250.53656   276.6736   391.5336    20
 qr(A, LAPACK = TRUE)   171.93558   175.22074   178.57133   186.3217   996.7584    20
               svd(A)   892.90194   920.55781   969.79708  1069.5740 12185.9414    20
             eigen(A) 14870.65481 14943.13804 15113.49240 15513.8285 29100.0788    20
                lu(A)    70.65831    76.07561    83.52785   146.1552   152.5949    20

GotoBLAS_Nehalem:

Unit: milliseconds
                 expr         min          lq      median          uq         max neval
      crossprod(A, B)    52.86457    64.42491    73.75266    77.89007    82.05677    20
             solve(A)   203.96266   209.78135   220.36275   229.09164   298.06924    20
 qr(A, LAPACK = TRUE)   171.22182   174.58935   175.90882   181.63144   247.92046    20
               svd(A)   895.20834   904.17256   950.78584   970.47127  1057.90653    20
             eigen(A) 15429.49102 15470.62708 15552.95074 15627.39700 15759.67103    20
                lu(A)    67.13746    72.27017    74.26725    77.52482    85.24281    20

OpenBLAS_Dynamic:

Unit: milliseconds
                 expr        min         lq     median         uq       max neval
      crossprod(A, B)  102.54157  102.99851  104.42070  105.95056  177.8572   100
             solve(A)  180.64641  182.99257  186.41507  188.91836  261.6805   100
 qr(A, LAPACK = TRUE)  188.08381  194.37999  197.10319  199.66507  281.2404   100
               svd(A) 1027.00070 1040.84424 1055.35816 1108.67206 1193.3480   100
             eigen(A) 2114.96652 2190.11265 2208.06810 2250.41861 2476.9335   100
                lu(A)   57.47394   58.53336   59.80013   62.60323  139.1076   100

OpenBLAS_SandyBridge:

Unit: milliseconds
                 expr        min         lq     median        uq       max neval
      crossprod(A, B)  102.41031  102.71278  104.14695  105.4141  179.8048   100
             solve(A)  180.52238  181.97006  184.94760  187.6423  260.5379   100
 qr(A, LAPACK = TRUE)  186.89180  191.01798  193.69381  199.2825  286.8123   100
               svd(A) 1024.47896 1034.60627 1043.91214 1099.8361 1153.3881   100
             eigen(A) 2108.22006 2172.05087 2192.65913 2216.7981 2375.9831   100
                lu(A)   57.55384   59.10672   61.08402   62.9213  133.6456   100

All the tuned BLAS results are much better than the reference, with the exception of the eigenvalue decomposition for the GotoBLAS-based Rblas. I do not know why that is the case, but the difference was so severe that I ran those benchmarks only 20 times to get results in a reasonable time. For the other routines, sometimes the OpenBLAS-based version is quicker, other times not. I personally will use it exclusively, as the minor lag relative to some of the GotoBLAS timings is more than compensated for by the eigenvalue speedup; overall, it is still anywhere between roughly 2.5 and 11 times faster than the reference BLAS, depending on the routine. Regardless, it is clear that using a tuned BLAS can speed up calculations considerably.
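As a quick check on that range, the median timings from the tables above give the following per-routine speedups of the OpenBLAS_SandyBridge build over the reference BLAS (the numbers are copied from the output above):

ref <- c(crossprod = 1190.94, solve = 1018.76, qr = 638.01,
         svd = 3093.84, eigen = 5586.93, lu = 175.56)
obs <- c(crossprod = 104.15, solve = 184.95, qr = 193.69,
         svd = 1043.91, eigen = 2192.66, lu = 61.08)
round(ref / obs, 1)
## crossprod     solve        qr       svd     eigen        lu
##      11.4       5.5       3.3       3.0       2.5       2.9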

Using a tuned BLAS is not the only way to increase R’s speed. For anyone compiling R themselves, setting the proper compiler flags during compilation can squeeze a bit more speed out of it, and activating byte-compilation can do a bit more. In my next post, I hope to show similar timing numbers, but this time using an R compiled for my specific machine (IvyBridge) in concert with the tuned BLAS.
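To give a flavor of the byte-compilation point, here is a minimal sketch using the compiler package that ships with R; cmpfun byte-compiles a single function, and loop-heavy pure-R code like this toy sum of squares is where it tends to help most:

library(compiler)
f  <- function(x) { s <- 0; for (i in seq_along(x)) s <- s + x[i]^2; s }
fc <- cmpfun(f)      # byte-compiled version of the same function
x  <- rnorm(1e6)
system.time(f(x))    # interpreted
system.time(fc(x))   # byte-compiled, typically noticeably faster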
