Graph Bisection in R

11 years ago

[This article was first published on Christophe Ladroue » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently I had to partition a set of SNPs into a training set and a test set. Making a random split would not do: both sets would likely contain very similar SNPs due to linkage disequilibrium (LD), making them non-independent. So what I really needed was a partition that minimises the LD between the two sets.

The problem is equivalent to bisecting a graph (with nodes being the SNPs, linked with LD) and it’s NP hard so we shouldn’t expect to always find the optimal solution. There are very good heuristic algorithms for it though, and I decided to implement the Kernighan/Lin algorithm, which gives a good solution in . See the code at the end of this post.

A first, direct implementation using loops convinced me that it was working well, but as I needed to process a 2000×2000 matrix, efficiency rapidly became an issue: bisecting a 1000×1000 matrix was taking 6 min and the time taken was going up at an exponential rate, giving an estimated time of completion of 1 hour for a 2000×2000 matrix. It turned out that the code was very amenable to vectorization and after spending some time putting that in place, I gained almost 2 orders of magnitudes: it takes 43 seconds instead of 55 minutes on test data. In other words, it is actually usable.

The results are really good as well. Here is an example on a non-separable graph. The partition is such that the terms off the (block) diagonal are small. This is much better than a random partition.

Selec All Code:

  N<-100
  topMatrix<-matrix(round(runif(N*N)*50),nrow=N,ncol=N)
  bottomMatrix<-matrix(round(runif(N*N)*50),nrow=N,ncol=N)
  smallValues<-matrix(runif(N*N)*1,nrow=N,ncol=N)
 
  weightMatrix<-rbind(
  cbind(topMatrix+t(topMatrix),smallValues),
  cbind(t(smallValues),bottomMatrix+t(bottomMatrix)))
 
  partition<-approximateBisection(weightMatrix)

(the diagonal goes from the bottom left to the top right)
And here is the code:

Selec All Code:

## Approximate bisection
# returns a bisection of the graph that minimizes the cost using Kernighan/Lin Algorithm
# http://www.eecs.berkeley.edu/~demmel/cs267/lecture18/lecture18.html#link_4.2
# partition<-approximateBisection(weightMatrix)
# weightMatrix is symmetric matrix of size 2Nx2N made of non-negative values.
# partition is a list of two vectors of N indices.
# C.Ladroue
 
approximateBisection<-function(weightMatrix,mode="matrix",minimumGain=1e-5){
#   minimumGain<-1e-5 # minimum value for gain, setting it to 0 might lead to infinite loop due to numerical inaccuracy
 
  N<-dim(weightMatrix)[1] # number of elements
  m<-N/2
 
  # start off with a random partition
  A<-sample(1:N,N/2,replace=FALSE)
  B<-(1:N)[-A]
 
  maxGain<-Inf
  while(maxGain>minimumGain){
    DA<-rowSums(weightMatrix[A,B])-rowSums(weightMatrix[A,A])+diag(weightMatrix[A,A])
    DB<-rowSums(weightMatrix[B,A])-rowSums(weightMatrix[B,B])+diag(weightMatrix[B,B])
    unmarkedA<-1:m
    unmarkedB<-1:m
    markedA<-rep(0,m)
    markedB<-rep(0,m)
    gains<-rep(0,m)
    for(k in 1:m){
      # find best pair from remainder
      # with 2 loops, slow but easy on memory
if(mode=='2loops'){
      bestGain<--Inf
      besti<-0
      bestj<-0
      for(i in unmarkedA)
        for(j in unmarkedB){
          gain<-DA[i]+DB[j]-2*weightMatrix[A[i],B[j]]
          if(gain>bestGain) {bestGain<-gain; besti<-i;bestj<-j}
        }
#           mark the best pair
        unmarkedA<-unmarkedA[-which(unmarkedA==besti)]
        unmarkedB<-unmarkedB[-which(unmarkedB==bestj)]
        markedA[k]<-besti
        markedB[k]<-bestj
}              
        # with one matrix, much faster but builds a matrix as large as weightMatrix
if(mode=='matrix'){
        dimension<-m+1-k
        fasterGain<-matrix(DA[unmarkedA],nrow=dimension,ncol=dimension,byrow=FALSE)+
                matrix(DB[unmarkedB],nrow=dimension,ncol=dimension,byrow=TRUE)-
                  2*weightMatrix[A[unmarkedA],B[unmarkedB]]
        # mark the best pair
        best<-arrayInd(which.max(fasterGain),.dim=c(dimension,dimension))
        besti<-unmarkedA[best[1]]
        bestj<-unmarkedB[best[2]]
        bestGain<-fasterGain[best]
        markedA[k]<-unmarkedA[best[1]]
        markedB[k]<-unmarkedB[best[2]]
        unmarkedA<-unmarkedA[-best[1]]
        unmarkedB<-unmarkedB[-best[2]]
}
        # record gain
        gains[k]<-bestGain
 
        # update D for unmarked indices 
        DA[unmarkedA]<-DA[unmarkedA]+2*weightMatrix[A[unmarkedA],A[besti]]-2*weightMatrix[A[unmarkedA],B[bestj]]
        DB[unmarkedB]<-DB[unmarkedB]+2*weightMatrix[B[unmarkedB],B[bestj]]-2*weightMatrix[B[unmarkedB],A[besti]]
      }
      gains<-cumsum(gains)
      bestPartition<-which.max(gains)
      maxGain<-gains[bestPartition]
      if(maxGain>minimumGain){ 
          # swap best pairs
          A1<-c(A[-markedA[1:bestPartition]],B[markedB[1:bestPartition]])
          B1<-c(B[-markedB[1:bestPartition]],A[markedA[1:bestPartition]])
          A<-A1
          B<-B1
          }
    }
  list(A,B)
}

To leave a comment for the author, please follow the link and comment on their blog: Christophe Ladroue » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.