Simplifying polygon shapefiles in R
[This article was first published on The Praise of Insects, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Recently I downloaded the Crosby Code shapefile from Landcare Research’s LRIS server for use in some publications I’m preparing. This shapefile is incredibly detailed, far more so than what I require. This detail means that it takes a while for the map to be plotted each time. As detail is less important for me than speed of plotting, I decided to try and simplify the map.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The first thing we needed is to load the packages required and the shapefile itself:
library(maptools)
library(shapefiles)
CCBound <- readShapePoly("nz-area-codes-for-recording-specimen-localities")
CCBound2 <- CCBound
system.time(plot(CCBound2))
user system elapsed
9.537 14.789 142.760
I’m wanting to plot this map multiple times, so waiting 2.5 minutes each time is not at all desirable. The first thing to do was to check out the structure and composition of the object.
str(CCBound, max.level = 3)
sapply(CCBound@polygons, function(x) length(x@Polygons))
62 62 1 1 182 216 2332 81 158 2012 1 22 126 284 45 41 1 35 1 3 1 68 30 78 142 333 1 14 1070 1257 716
area <- lapply(CCBound@polygons, function(x) sapply(x@Polygons, function(y) y@area))
quantile(unlist(area))
0% 25% 50% 75% 100%
3.379739e-10 1.978036e-08 6.223511e-08 2.120671e-07 1.952179
Part of the reason why this is plotting so slowly is the enormous number of polygons to be plotted—9376 in total. We can also see that the majority of these polygons are very small. These polygons represent the myriad of small offshore islands around the NZ coast. As I’m wanting to look at NZ as a whole, these islands are not significant. If we can get rid of these, we will be a long way toward making a leaner, meaner version of the shapefile. We first need to find out what polygons within each larger group are large enough to be worth keeping, then remove them.
mainPolys <- lapply(area, function(x) which(x > 0.001))
CCBound@data <- CCBound@data[-c(1:2),]
CCBound@polygons <- CCBound@polygons[-c(1:2)]
CCBound@plotOrder <- 1:length(CCBound@polygons)
mainPolys <- mainPolys[-c(1:2)]
for(i in 1:length(mainPolys)){
if(length(mainPolys[[i]]) >= 1 && mainPolys[[i]][1] >= 1){
CCBound@polygons[[i]]@Polygons <- CCBound@polygons[[i]]@Polygons[mainPolys[[i]]]
CCBound@polygons[[i]]@plotOrder <- 1:length(CCBound@polygons[[i]]@Polygons)
}
}
When we look at mainPolys, we see the first two larger groupings of polygons have no significantly sized polygons in them. We get rid of these in the following block of data. @plotOrder needs to be the same length as @polygons or else the object will create an error. The second block of code crawls through the object retaining only those polygons we found in mainPolys.
When we plot the returned object:
system.time(plot(CCBound))We see that the plotting time is much more efficient. Depending on our requirements, we could leave it as it is. Being a devil for punishment however, 3 seconds is still a little too slow. There is still a lot of detail in the edges of the main polygons that is unecessary for my requirements. This detail can be simplified using the dp() function in the shapefiles package. Once again, we create a loop that goes through the object simplifying each of the polygons within it. dp() works only on dataframes, so we need to break each @coords matrix into a dataframe, run dp(), then convert it back into a matrix.
user system elapsed
2.536 0.292 3.211
for(i in 1:length(CCBound@polygons)){
for(j in 1:length(CCBound@polygons[[i]]@Polygons)){
temp <- as.data.frame(CCBound@polygons[[i]]@Polygons[[j]]@coords)
names(temp) <- c("x", "y")
temp2 <- dp(temp, 0.01)
CCBound@polygons[[i]]@Polygons[[j]]@coords <- as.matrix(cbind(temp2$x, temp2$y))
}
}
WARNING: This code takes a while to run—around 5 minutes on my machine. The time increases as you decrease the threshold given to dp(). When I had it set to 0.001 it took 7 minutes. However, the time taken to plot the shapefile is now:
system.time(plot(CCBound))Much better! The final thing to do now is to compare the difference between the original and the final product.
user system elapsed
0.072 0.096 0.873
png(“comparison.png”, width=800, height=500)
layout(matrix(1:2, ncol=2))
plot(CCBound2)
title(main=”Original shapefile”)
plot(CCBound)
title(main=”Modified shapefile”)
dev.off()
Looks pretty good to me! When looking at the two of them, you can tell that the modified version has been, well, modified. However, for the purposes I have in mind, the modified version is more than adequate.
Data is of course reproduced with the permission of Landcare Research New Zealand Limited. Get your own copy of it here.
To leave a comment for the author, please follow the link and comment on their blog: The Praise of Insects.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.