Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In my last post I was struggling with getting a big memory version of the distance matrix to work fast. Nick and other readers had some suggestions and after puttering around with Nicks code I’ve adapted it to big memory and not impacted the run time very much. For comparison writing a 10K by 10K matrix was taking minutes and I projected about 24 hours for berkeley data. Nick came to the rescue with two really cool ideas. The first is a different method for calculating distance on the sphere. Here he expanded on my idea of pre computing sin() and cos() for latitudes and longitudes and he developed an approach that allows us to compute these values once and then solve a distance equation using super fast matrix maths. On top of that he came up with a “block” approach. At first this looks like it will be slower because we calculate the entire matrix as oppsoed to just the upper half. In his approach Nick wrote out files for every block. In my approach I write blocks to a filebacked big matrix. Thats a bit slower but I get one file that I can then random access. Here’s the code
createDistanceMatrix <- function(Inventory, blockSize = 20, Directory= getwd(), filename
= “distance.bin”){
# based on code by Nick Stokes
require(bigmemory)
options(bigmemory.allow.dimnames=TRUE)
if(!grepl(“\\.bin”,filename))stop(“please use a bin extension”)
columnsPresent <- intersect(colnames(Inventory),c(“Id”,”Lon”,”Lat”))
if(length(columnsPresent) != 3) stop(“missing the correct column names”)
descName <- sub(“bin”,”desc”,filename)
if(class(Inventory) == “matrix”) Inventory <- as.data.frame(Inventory)
L <- cbind(Inventory$Lat,Inventory$Lon)*(pi/180)
s <- cbind(cos(L),sin(L))
# z is the array of 3D coords of stations. All unit vecs
z <- cbind(s[,1]*s[,2],s[,1]*s[,4],s[,3])
z2 <- z/sqrt(2) # Better to mult now than after outer prod
D <- 6371*2 #Diam of Earth
n <- nrow(L)
if(n <= blockSize)stop(“BlockSize must be less than rows in Inventory”)
blocks <- n %/% blockSize
if((n %% blockSize) > 0)blocks <- blocks + 1
dex <- 1:blockSize
BM <- filebacked.big.matrix(nrow = n , ncol = n,
dimnames = list(as.list(Inventory$Id),NULL),
init = NA,
backingpath = Directory,
backingfile = filename,
descriptorfile = descName,
type = “double”)
startTime <- Sys.time()
for(i in 1:blocks){
p <- dex + (i-1)*10
p <- p[p<= n]
for(j in 1:blocks){
q <- dex +(j-1)*10
q <- q[q<=n]
x <- 0.5000000001-z2[p,]%*%t(z2[q,])
x <- D * asin(sqrt(x))
if(identical(p,q))diag(x)<-0
BM[p,q]<- x
}
}
print(Sys.time()-startTime)
return(BM)
}
That takes 46 seconds. orders of magintude improvement
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.