R and Linear Algebra
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Joseph Rickert
I was recently looking through upcoming Coursera offerings and came across the course Coding the Matrix: Linear Algebra through Computer Science Applications taught by Philip Klein from Brown University. This looks like a fine course; but why use Python to teach linear algebra? I suppose this is a blind spot of mine: MATLAB I can see. That software has a long tradition of being used in applied mathematics and engineering applications. The Linear Algebra course from MIT open courseware is based on MATLAB and half the linear algebra books published by SIAM use MATLAB.
I expected MATLAB and Python seems like a stretch, but where do we stand in the R world vis-a-vis linear algebra? Well, from a pedagogical point of view there does not appear to be much material “out there” that specifically relates to teaching linear algebra with R. It seems that not much has changed since this 2009 post. A google search yields some nice, short documents (Hyde, Højsgaard, Petris, and Carstensen) that look like the online residue from efforts to teach linear algebra using R. And, although most introductory R books have some material devoted to linear algebra (e.g. the extended markov chain in The Art of R Programming), one would be hard pressed to find a book entirely devoted to teaching linear algebra with R. Hands-On Matrix Algebra Using R: Active and Motivated Learning with Applications by Hrishikesh D. Vinod is the exception. (This looks like a gem waiting to be discovered.)
My guess, however, is that we will see much more introductory R material focused on linear algebra as scientists and engineers with computational needs outside of statistics proper discover R. R is, after all, well suited for doing the matrix computations associated with linear algebra, and here are some reasons:
(1) As a language designed for doing computational statistics, R is built on an efficient foundation of well-tested and trusted linear algebra code. From the very beginning, R was good at linear algebra. The vignette 2nd Introduction to the Matrix package lays out a little of the history of R’s linear algebra underpinnings:
Initially the numerical linear algebra functions in R called underlying Fortran routines from the Linpack (Dongarra et al., 1979) and Eispack (Smith et al., 1976) libraries but over the years most of these functions have been switched to use routines from the Lapack (Anderson et al., 1999) library which is the state-of-the-art implementation of numerical dense linear algebra.
(For example, the base R functions for computing eigenvalues, eigen(), Cholesky decompositions, chol(), and singular value decompositions svd() all use LAPACK or LINPACK code.)
(2) R’s notation, indexing and operators are very close to the matrix notation which mathematicians normally use to express linear algebra, and there are basic R functions for several matrix operations (See Quick R for a summary)
(3) The way R functions operate on whole objects: vectors, matrices arrays etc model very closely the conceptual processes of manipulating matrices as single entities.
(4) The seamless interplay between the data frame and matrix data structures in R make it easy to populate matrices from the appropriate columns in heterogeneous data sets.
(5) R is an extensible language and there is a considerable amount of work being done in the R community to “go deep”, to “go sparse” and to “go big”.
Going Deep
“Going Deep” means making it easy to access the computational resources that may be necessary for building production level applications. To this end, the Rcpp package makes it easy to write R functions that call C++ code to do the heavy lifting. Moreover, the RcppArmadillo and RcppEigen packages provide direct and efficient access to the C++ Armadillo and Eigen libraries for doing linear algebra.
Going Sparse
Modern statistical applications on even moderately large data sets often produce sparse matrices. One way to work with them in R is via Matrix, a “recommended” package that provides S4 classes for both dense and sparse matrices that extend R’s basic matrix data type. Methods for R functions that work on Matrix objects provide access to efficient linear algebra libraries including BLAS, Lapack CHOLMOD including AMD and COLAMD and Csparse. (Getting oriented in the world of linear algebra software is not an easy task, I found this chart from an authoritative source helpful.)
The following code which provides a very first look at the Matrix package shows a couple of notable features: (1) the Matrix() function evaluates a matrix to determine its class and (2) once the Cholesky factorization is computed it automatically becomes part of the matrix object.# A very first look at library(Matrix) set.seed (1) m <- 10; n <- 5 # Set dim of matrix A <- matrix (runif (m*n),m,n) # Construct a random matrix B <- crossprod(A) # Make a positive-definite symmetric matrix A;B # Have a look # A.M <- Matrix(A) # Matrix recognizes as a dense matrix and class(A.M) # automatically assigns A.M to class dgeMatrix [1] "dgeMatrix" # B.M <- Matrix(B) # Matrix recognizes B as a symmetric, dense matrix class(B.M) # and automatically assign B.M to class dysMatrix [1] "dsyMatrix" # chol(B.M) # Compute the Cholesky decomposition of B.M str(B.M) # The Cholesky decomposition is now part of the B.M object Formal class 'dsyMatrix' [package "Matrix"] with 5 slots ..@ x : num [1:25] 3.94 3.09 2.05 2.87 3.07 ... ..@ Dim : int [1:2] 5 5 ..@ Dimnames:List of 2 .. ..$ : NULL .. ..$ : NULL ..@ uplo : chr "U" ..@ factors :List of 1 .. ..$ Cholesky:Formal class 'Cholesky' [package "Matrix"] with 5 slots .. .. .. ..@ x : num [1:25] 1.98 0 0 0 0 ... .. .. .. ..@ Dim : int [1:2] 5 5 .. .. .. ..@ Dimnames:List of 2 .. .. .. .. ..$ : NULL .. .. .. .. ..$ : NULL .. .. .. ..@ uplo : chr "U" .. .. .. ..@ diag : chr "N" B.M@factors # Access the B.M object $Cholesky 5 x 5 Matrix of class "Cholesky" [,1] [,2] [,3] [,4] [,5] [1,] 1.9845453 1.5548529 1.0323413 1.4475384 1.5451499 [2,] . 1.1677451 0.4304767 0.5193746 0.6326812 [3,] . . 1.1606778 0.4354018 0.9637288 [4,] . . . 0.8844295 0.1572823 [5,] . . . . 0.646782
Other contributed sparse matrix packages are SparseM, slam and spam. For more practical information see the short tutorial from John Myles White and the Sparse Models Vignette by Martin Maechler
Going Big
R has had the capability to work with big matrices for some time and this continues to be an area of active development. The R packages bigmemory, and biganalytics provide structures for working with matrices that are too large to fit into memory. bigalgebra contains functions for doing linear algebra with bigmemory structures. The scidb package enables R users to do matrix computations on huge arrays stored in a SciDB data base, and Revolution Analytics’ commercial distribution of R, Revolution R Enterprise (RRE), makes high performance matrix calculations readily available from R. Because RRE is compiled with the Intel Math Kernel Library most common R functions based on linear algebra calculations automatically get a significant performance boost. However, the real linear algebra benefit RRE provides comes from the ability to compute very large matrices in seconds and seamlessly integrate them into an R workflow. A post from 2011 shows the code for doing a principal components analysis on 50 years of stock data with over 9 million observations and 2,800 stocks. The function rxCor() took only 0.57 seconds on my 8GB, 2-core laptop to compute the correlation matrix .
Of course these there categories are not mutually exclusive. In this short video Bryan Lewis shows how the winners of the NetFlix prize used R to go deep, to go big and to go sparse.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.