Site icon R-bloggers

Find coefficients from linear regression in R (really fast)

[This article was first published on R – Faux Value, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

[This is the first part of a three part series that demonstrates how to create an R package that includes RcppArmadillo source code; follow these links for parts two and three]

Every journey needs motivation … so let’s say you want to run a LOT of regressions. Additionally, let’s say you are really only interested in the coefficients.

In that case, lm.fit in R does way too much work and we can speed it up with a few relatively simple tricks (TL;DR RcppArmadillo is fastest!)

First of all, it should be noted that it might not be worth the effort. You can get a long way without linking to C++ code.  In the present slightly contrived case, the lm.fit and .lm.fit functions are much faster than lm.


set.seed(1)

ROWS <- 1e5
Y <- 1:ROWS + runif(ROWS, -1, 1)
X <- 1:ROWS + rnorm(ROWS, 0, 3)

# get coefs
coef_1st <- function() lm(Y ~ X)$coefficients
coef_2nd <- function() lm.fit(cbind(rep(1, length(X)), X), Y)$coefficients
coef_3rd  microbenchmark(coef_1st(), coef_2nd(), coef_3rd(), times = 1e4)

# benchmark
microbenchmark(coef_1st(), coef_2nd(), coef_3rd(), times = 1e4)
Unit: milliseconds
       expr       min        lq      mean    median        uq       max neval
 coef_1st() 25.453486 34.005381 49.059493 45.734325 60.332143 1476.2446 10000
 coef_2nd()  3.976818  5.264472  9.967156  7.899375 11.087321  177.8020 10000
 coef_3rd()  2.868874  3.476643  7.131980  5.091950  7.207663  114.3153 10000

Using medians,  lm.fit is ~6x faster than lm, and .lm.fit is a little over 9x faster! These are big improvements for little effort. So if this is your exact problem, you can now stop reading.

HOWEVER if you have REAL work to do, you might still be interested in C++.  And if you want to do regressions, you need linear algebra.  Enter RcppArmadillo!

You could of course use the ​fastLmPure function in the RcppArmadillo package. I have done that below.


require(RcppArmadillo)

coef_4th <- function() {
# note to make XX to get intercept from fastLmPure;
# this behavior is the default for the other functions
    XX <- cbind(rep(1L, length(X)), X)
    fastLmPure(XX, Y)$coefficients
}

Note that we need to prepare the data differently, by adding the column of 1s to the matrix (this is the intercept).  This isn't free so i included that step in the benchmarking function.

Right away you should be thinking that we can do better … we could move the code that adds the column of 1s (creating XX) into C++, and also do only the calculations we need.

This C++ code does just that:


#include

// [[Rcpp::depends(RcppArmadillo)]]

//[[Rcpp::export]]
arma::colvec getCoef(const arma::vec & X, const arma::vec & Y) {
	// this function takes two numeric vectors
	// checks to make sure the dimensions match
	// returns the coefficients from a bivariate regression
	// TODO: add an intercept switch ... noInt

	int Xlen = X.size();
	int Ylen = Y.size();

	if( Xlen != Ylen ) Rcpp::stop("X and Y must have the same length");

	arma::mat XX(Xlen, 2);

	for( int i = 0; i < Xlen; i++) {
		XX(i, 0) = 1; // constant
		XX(i, 1) = X(i);
	}

	// find coefficients
	arma::colvec coef = arma::solve(XX, Y);

	return coef;
}

Now we source the C++ function using Rcpp::sourceCpp, loading getCoef as a function in the global environment.

R> Rcpp::sourceCpp("~/Dropbox/Cpp/R/fastLMCo.cpp")

As you would expect, our custom getCoef function is fastest, besting RcppArmadillo’s fastLmPure by ~25%


coef_5th <- function() getCoef(X, Y)

microbenchmark(coef_1st(), coef_2nd(), coef_3rd(),
               coef_4th(), coef_5th(), times = 1e4)

Unit: milliseconds
expr min lq mean median uq max neval
coef_1st() 24.688233 28.716522 33.826753 32.054957 35.927653 187.17681 10000
coef_2nd() 3.672581 4.328348 7.166927 4.916639 8.739248 87.31223 10000
coef_3rd() 2.508058 3.079421 5.142339 3.409392 4.531942 101.43583 10000
coef_4th() 2.242320 2.614726 3.578321 2.834551 3.240812 106.49090 10000
coef_5th() 1.986783 2.169972 2.467147 2.260406 2.530783 44.72023 10000

To leave a comment for the author, please follow the link and comment on their blog: R – Faux Value.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.