Parallelizing Linear Regression or Using Multiple Sources

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

My previous post was explaining how mathematically it was possible to parallelize computation to estimate the parameters of a linear regression. More speficially, we have a matrix X which is n×k matrix and y a n-dimensional vector, and we want to compute ˆβ=[XTX]1XTy by spliting the job. Instead of using the n observations, we’ve seen that it was to possible to compute “something” using the first n1 rows, then the next n2 rows, etc. Then, finally, we “aggregate” the m objects created to get our overall estimate.

Parallelizing on multiple cores

Let us see how it works from a computational point of view, to run each computation on a different core of the machine. Each core will see a slave, computing what we’ve seen in the previous post. Here, the data we use are

y = cars$dist
X = data.frame(1,cars$speed)
k = ncol(X)

On my laptop, I have three cores, so we will split it in m=3 chunks

library(parallel)
library(pbapply)
ncl = detectCores()-1
cl = makeCluster(ncl)

This is more or less what we will do: we have our dataset, and we split the jobs,

We can then create lists containing elements that will be sent to each core, as Ewen suggested,

chunk = function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
a_parcourir = chunk(seq_len(nrow(X)), ncl)
for(i in 1:length(a_parcourir)) a_parcourir[[i]] = rep(i, length(a_parcourir[[i]]))
Xlist = split(X, unlist(a_parcourir))
ylist = split(y, unlist(a_parcourir))

It is also possible to simplify the QR functions we will use

compute_qr = function(x){
  list(Q=qr.Q(qr(as.matrix(x))),R=qr.R(qr(as.matrix(x))))
}
get_Vlist = function(j){
  Q3 = QR1[[j]]$Q %*% Q2list[[j]]
  t(Q3) %*% ylist[[j]]
}
clusterExport(cl, c("compute_qr", "get_Vlist"), envir=environment())

Then, we can run our functions on each core. The first one is

  QR1 = parLapply(cl=cl,Xlist, compute_qr)

note that it is also possible to use

  QR1 = pblapply(Xlist, compute_qr, cl=cl)

which will include a progress bar (that can be nice when the database is rather large). Then use

  R1 = pblapply(QR1, function(x) x$R, cl=cl) %>% do.call("rbind", .)
  Q1 = qr.Q(qr(as.matrix(R1)))
  R2 = qr.R(qr(as.matrix(R1)))
  Q2list = split.data.frame(Q1, rep(1:ncl, each=k))
  clusterExport(cl, c("QR1", "Q2list", "ylist"), envir=environment())
  Vlist = pblapply(1:length(QR1), get_Vlist, cl=cl)
  sumV = Reduce('+', Vlist)

and finally the ouput is

solve(R2) %*% sumV
         [,1]
X1 -17.579095
X2   3.932409

which is what we were expecting…

Using multiple sources

In practice, it might also happen that various “servers” have the data, but we cannot get a copy. But it is possible to run some functions on their server, and get some output, that we can use afterwards.

Datasets are supposed to be available somewhere. We can send a request, and get a matrix. Then we we aggregate all of them, and send another request. That’s what we will do here. Provider j should run f1(X) on his part of the data, that function will return R(1)j. More precisely, to the first provider, send

function1 = function(subX){
return(qr.R(qr(as.matrix(subX))))}
R1 = function1(Xlist[[1]])

and actually, send that function to all providers, and aggregate the output

for(j in 2:m) R1 = rbind(R1,function1(Xlist[[j]]))

The create on your side the following objects

Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list=list()
for(j in 1:m) Q2list[[j]] = Q1[(j-1)*k+1:k,]

Finally, contact one last time the providers, and send one of your objects

function2=function(subX,suby,Q){
Q1=qr.Q(qr(as.matrix(subX)))
Q2=Q
return(t(Q1%*%Q2) %*% suby)}

Provider j should then run f2(X,y,Q(2)j) on his part of the data, using also Q(2)j as argument (that we obtained on own side) and that function will return (Q(2)jQ(1)j)Tjyj. For instance, ask the first provider to run

sumV = function2(Xlist[[1]],ylist[[1]], Q2list[[1]])

and do the same with all providers

for(j in 2:m) sumV = sumV+ function2(Xlist[[j]],ylist[[j]], Q2list[[j]])
solve(R2) %*% sumV
         [,1]
X1 -17.579095
X2   3.932409

which is what we were expecting…

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)