Parallelizing Linear Regression or Using Multiple Sources
My previous post explained how it is mathematically possible to parallelize the computation that estimates the parameters of a linear regression. More specifically, we have an $n\times k$ matrix $X$ and an $n$-dimensional vector $y$, and we want to compute $\hat{\beta}=[X^TX]^{-1}X^Ty$ by splitting the job. Instead of using all $n$ observations at once, we've seen that it is possible to compute "something" using the first $n_1$ rows, then the next $n_2$ rows, etc. Finally, we "aggregate" the $m$ objects created to get our overall estimate.
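In formulas (a quick recap of the previous post, in slightly compressed notation): each block of rows gets its own QR decomposition, the stacked triangular factors get a second one,

$$X_j = Q^{(1)}_j R^{(1)}_j, \qquad \begin{pmatrix} R^{(1)}_1 \\ \vdots \\ R^{(1)}_m \end{pmatrix} = Q^{(2)} R^{(2)}, \qquad Q^{(2)} = \begin{pmatrix} Q^{(2)}_1 \\ \vdots \\ Q^{(2)}_m \end{pmatrix},$$

so that $X_j = Q^{(1)}_j Q^{(2)}_j R^{(2)}$, and the estimate follows from

$$\hat{\beta} = \big[R^{(2)}\big]^{-1} \sum_{j=1}^m \big(Q^{(1)}_j Q^{(2)}_j\big)^T y_j.$$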
Parallelizing on multiple cores
Let us see how it works from a computational point of view, running each computation on a different core of the machine. Each core will act as a slave, computing what we've seen in the previous post. Here, the data we use are
y = cars$dist
X = data.frame(1, cars$speed)
k = ncol(X)
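Since we will want to check the final result, note (this sanity check is my addition) that the benchmark is simply the ordinary least-squares fit:

lm(dist ~ speed, data = cars)$coefficients
# (Intercept)       speed 
#  -17.579095    3.932409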
On my laptop, I have three cores I can use, so we will split the job into m=3 chunks
library(parallel)
library(pbapply)
ncl = detectCores() - 1
cl = makeCluster(ncl)
This is more or less what we will do: we have our dataset, and we split the job across the cores.
We can then create lists containing the elements that will be sent to each core, as Ewen suggested:
# split the indices 1..n into ncl consecutive chunks
chunk = function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))
a_parcourir = chunk(seq_len(nrow(X)), ncl)
for(i in 1:length(a_parcourir)) a_parcourir[[i]] = rep(i, length(a_parcourir[[i]]))
# one sub-matrix and one sub-vector per core
Xlist = split(X, unlist(a_parcourir))
ylist = split(y, unlist(a_parcourir))
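To see what the chunking does, here is a quick illustration (a toy call of my own, not part of the pipeline), splitting ten indices into three groups:

chunk(1:10, 3)
# $`1`
# [1] 1 2 3 4
# $`2`
# [1] 5 6 7
# $`3`
# [1]  8  9 10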
We also write small helper functions for the QR computations we will use
compute_qr = function(x){
  QR = qr(as.matrix(x))
  list(Q = qr.Q(QR), R = qr.R(QR))
}
# uses QR1, Q2list and ylist, which will be exported to the slaves below
get_Vlist = function(j){
  Q3 = QR1[[j]]$Q %*% Q2list[[j]]
  t(Q3) %*% ylist[[j]]
}
clusterExport(cl, c("compute_qr", "get_Vlist"), envir = environment())
Then, we can run our functions on each core. The first one is
QR1 = parLapply(cl = cl, Xlist, compute_qr)
Note that it is also possible to use
QR1 = pblapply(Xlist, compute_qr, cl = cl)
which will include a progress bar (nice when the dataset is rather large). Then use
library(magrittr)  # for the pipe used below
# collect the R blocks and stack them
R1 = pblapply(QR1, function(x) x$R, cl = cl) %>% do.call("rbind", .)
# second QR decomposition, on the stacked R blocks
Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list = split.data.frame(Q1, rep(1:ncl, each = k))
clusterExport(cl, c("QR1", "Q2list", "ylist"), envir = environment())
# each slave computes its (Q1_j Q2_j)^T y_j, then we sum
Vlist = pblapply(1:length(QR1), get_Vlist, cl = cl)
sumV = Reduce('+', Vlist)
and finally the output is
solve(R2) %*% sumV
         [,1]
X1 -17.579095
X2   3.932409
which is what we were expecting…
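One last housekeeping step (my addition, not in the original workflow, but good practice with parallel): release the workers once the computation is done.

stopCluster(cl)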
Using multiple sources
In practice, it might also happen that various "servers" hold the data, but we cannot get a copy. It is, however, possible to run some functions on their servers and get some output that we can use afterwards.
Datasets are supposed to be available somewhere. We can send a request and get a matrix; then we aggregate all of them, and send another request. That is what we will do here. Provider $j$ should run $f_1(X_j)$ on his part of the data; that function will return $R^{(1)}_j$. More precisely, to the first provider, send
function1 = function(subX){
  return(qr.R(qr(as.matrix(subX))))
}
R1 = function1(Xlist[[1]])
then send that function to all the other providers, and aggregate the outputs
m = ncl  # number of providers (here, the three chunks from above)
for(j in 2:m) R1 = rbind(R1, function1(Xlist[[j]]))
Then create, on your side, the following objects
Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list = list()
for(j in 1:m) Q2list[[j]] = Q1[(j-1)*k + 1:k, ]
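As a sanity check (my own addition, not required by the protocol), each provider's block should satisfy $X_j = Q^{(1)}_j Q^{(2)}_j R^{(2)}$; for the first provider:

Q1_1 = qr.Q(qr(as.matrix(Xlist[[1]])))
max(abs(as.matrix(Xlist[[1]]) - Q1_1 %*% Q2list[[1]] %*% R2))
# should be numerically zero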
Finally, contact the providers one last time, and send them one of your objects
function2 = function(subX, suby, Q){
  Q1 = qr.Q(qr(as.matrix(subX)))
  Q2 = Q
  return(t(Q1 %*% Q2) %*% suby)
}
Provider $j$ should then run $f_2(X_j, y_j, Q^{(2)}_j)$ on his part of the data, using the $Q^{(2)}_j$ we obtained on our side as an argument; that function will return $(Q^{(1)}_j Q^{(2)}_j)^T y_j$. For instance, ask the first provider to run
sumV = function2(Xlist[[1]], ylist[[1]], Q2list[[1]])
and do the same with all providers
for(j in 2:m) sumV = sumV + function2(Xlist[[j]], ylist[[j]], Q2list[[j]])
solve(R2) %*% sumV
         [,1]
X1 -17.579095
X2   3.932409
which is what we were expecting…