Oracle R Distribution Performance Benchmarks
Oracle R Distribution provides dramatic performance gains with MKL
Using the widely recognized R-benchmark-25.R test script, we compared the performance of Oracle R Distribution with and without the dynamically loaded, high-performance Math Kernel Library (MKL) from Intel. The results show that Oracle R Distribution is significantly faster with the library loaded. R users analyzing data on 64-bit architectures can immediately gain performance over open source R, because specific R functions invoke computations performed by these high-performance libraries and so benefit from their parallel processing.
The community-developed test consists of matrix calculations and functions, program control, matrix multiplication, Cholesky Factorization, Singular Value Decomposition (SVD), Principal Component Analysis (PCA), and Linear Discriminant Analysis (LDA). Such computations form a core component of many real-world problems, often taking the majority of compute time. Speeding up these computations means faster results for faster decision making.
Although the reported benchmarks were run using Intel MKL, Oracle R Distribution also supports AMD Core Math Library (ACML) and Solaris Sun Performance Library.
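To get a feel for the kind of operations being measured, the following minimal sketch times a matrix multiplication, a Cholesky factorization, and an SVD with base R's `system.time()`. It is not the R-benchmark-25.R script itself; the matrix size, the seed, and all variable names are assumptions chosen for illustration.

```r
# Small-scale timing sketch of three operations covered by the benchmark.
# The size n = 400 and the seed are arbitrary choices for illustration.
set.seed(42)
n <- 400
a <- matrix(rnorm(n * n), nrow = n)

# crossprod(a) computes t(a) %*% a, which is symmetric positive definite
# for a full-rank random matrix, so it is a valid input for chol().
spd <- crossprod(a)

# Elapsed wall-clock time in seconds for each operation.
timings <- c(
  matrix_multiply = system.time(prod_aa <- a %*% a)[["elapsed"]],
  cholesky        = system.time(r <- chol(spd))[["elapsed"]],
  svd             = system.time(s <- svd(a))[["elapsed"]]
)
print(timings)

# Sanity checks: the factorizations should reproduce their inputs.
chol_error <- max(abs(crossprod(r) - spd))
svd_error  <- max(abs(s$u %*% diag(s$d) %*% t(s$v) - a))
```

Running the same script against different BLAS/LAPACK libraries is one way to compare them on a given machine. The absolute timings will depend on hardware and matrix size.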
Oracle R Distribution 2.15.1 x64 Benchmark Results (time in seconds)

| Test | ORD with internal BLAS/LAPACK, 1 thread | ORD + MKL, 1 thread | ORD + MKL, 2 threads | ORD + MKL, 4 threads | ORD + MKL, 8 threads | Performance gain, ORD + MKL 4 threads | Performance gain, ORD + MKL 8 threads |
|---|---|---|---|---|---|---|---|
| Matrix Calculations | 11.2 | 1.9 | 1.3 | 1.1 | 0.9 | 9.2x | 11.4x |
| Matrix Functions | 7.2 | 1.1 | 0.6 | 0.4 | 0.4 | 17.0x | 17.0x |
| Program Control | 1.4 | 1.3 | 1.5 | 1.4 | 0.8 | 0.0x | 0.8x |
| Matrix Multiply | 517.6 | 21.2 | 10.9 | 5.8 | 3.1 | 88.2x | 166.0x |
| Cholesky Factorization | 25 | 3.9 | 2.1 | 1.3 | 0.8 | 18.2x | 30.3x |
| Singular Value Decomposition | 103.5 | 15.1 | 7.8 | 4.9 | 3.4 | 20.1x | 29.4x |
| Principal Component Analysis | 490.1 | 42.7 | 24.9 | 15.9 | 11.7 | 29.8x | 40.9x |
| Linear Discriminant Analysis | 419.8 | 120.9 | 110.8 | 94.1 | 88.0 | 3.5x | 3.8x |
This benchmark was executed on a 3-node cluster, with 24 cores at 3.07GHz per CPU and 47 GB RAM, using Linux 5.5.
In the first graph, we see significant performance improvements. For example, SVD with ORD plus MKL executes 20 times faster using 4 threads, and 29 times faster using 8 threads. For Cholesky Factorization, ORD plus MKL is 18 and 30 times faster for 4 and 8 threads, respectively.
In the second graph, we focus on the three longer running tests. Matrix multiplication is 88 and 166 times faster for 4 and 8 threads, respectively. PCA is 30 and 41 times faster, and LDA is over 3 times faster.
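The gain figures for the three longer running tests can be reproduced from the timing columns of the table. The sketch below assumes the gain is computed as the baseline time divided by the MKL time, minus one. That formula is inferred from the published numbers rather than stated in the post, and the data frame and function names are hypothetical.

```r
# Timings in seconds, copied from the results table.
timings <- data.frame(
  test     = c("Matrix Multiply", "Principal Component Analysis",
               "Linear Discriminant Analysis"),
  internal = c(517.6, 490.1, 419.8),  # ORD with internal BLAS/LAPACK, 1 thread
  mkl_4    = c(5.8, 15.9, 94.1),      # ORD + MKL, 4 threads
  mkl_8    = c(3.1, 11.7, 88.0)       # ORD + MKL, 8 threads
)

# Assumed formula (inferred, not stated in the post):
# gain = baseline time / accelerated time - 1
performance_gain <- function(baseline, accelerated) {
  baseline / accelerated - 1
}

timings$gain_4 <- round(performance_gain(timings$internal, timings$mkl_4), 1)
timings$gain_8 <- round(performance_gain(timings$internal, timings$mkl_8), 1)
print(timings)
```

With these inputs the computed gains agree with the table: 88.2 and 166.0 for matrix multiplication, 29.8 and 40.9 for PCA, and 3.5 and 3.8 for LDA.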
This level of performance improvement can significantly reduce application execution time and make interactive, dynamically generated results readily achievable. Note that ORD plus MKL improves performance not only on the client side, but also for R scripts executed using Oracle R Enterprise Embedded R Execution. Such R scripts, which execute on the database server machine, reap these performance gains as well.