Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The MazamaSpatialUtils package on CRAN has just been updated with additional shape file conversion scripts and location buffering so that points located just outside of polygons (i.e. coastal sites) can still be associated with the nearest neighboring polygon. This package can now be used as an off-line equivalent for the functionality available at the geonames web service.
In this post I will demonstrate how to use the package to duplicate the kind of high quality maps one sees elsewhere on the internet.
Background
The Ebola Map described in this post was created as a response to a map seen online, created by Robert Johnston. This map seemed interesting to me and I decided to go about recreating it with help from the MazamaSpatialUtils package by Mazama Science. The map details the number of Ebola cases per 1,000 residents in each of the countries of Guinea, Sierra Leone, and Liberia. These cases are comprised of both confirmed and unconfirmed cases, as recorded by the World Health Organization. The data used for the graph was also given on the website. The following shows step-by-step how the MazamaSpatialUtils package can be used as an aid to create informative visual displays from interesting information.
Here is the original map I will be recreating:
Preliminary Mapping
Before beginning, you must of course install and load the MazamaSpatialUtils and sp packages. Utilities within that package will download, convert and load other spatial datasets. The first steps in any project with MazamaSpatialUtils will be:
- set the directory to use as the data store
- convert any datasets you haven’t already converted
- load the desired spatial data into memory as SpatialPolygonsDataFrames
library(MazamaSpatialUtils) library(sp) # NOTE: Specify the directory in which to store your spatial data setSpatialDataDir('~/SpatialData') # Download and convert the TMWorldBorders dataset convertTMWorldBorders() ## OGR data source with driver: ESRI Shapefile ## Source: "/Users/robinwinstanley/SpatialData/world", layer: "TM_WORLD_BORDERS-0.3" ## with 246 features and 11 fields ## Feature type: wkbPolygon with 2 dimensions # load TMWorldBorders into memory loadSpatialData('TMWorldBorders')
After loading the appropriate data the next step is to identify the border of each country from the TMWorldBorders data. This is done by taking a subset of the TMWorldBorders data that refers to the three countries of interest. The borders between the countries can then be plotted, with thicker lines used to further distinguish the borders.
ebolaCountries <- subset(TMWorldBorders,countryName %in% c('Liberia', 'Sierra Leone', 'Guinea')) countryCodes <- ebolaCountries@data$countryCode plot(ebolaCountries, lwd=3)
The next step is to use the convertGADM()
function to download and convert the required country data from the GADM database at the correct level. The map uses the provinces of Guinea (level 2), the provinces of Sierra Leone (level 2), and the counties of Liberia (level 1).
convertGADM(countryCode=countryCodes[1], admLevel=2) loadSpatialData('GADM_GN_2') convertGADM(countryCode=countryCodes[2], admLevel=1) loadSpatialData('GADM_LR_1') convertGADM(countryCode=countryCodes[3], admLevel=2) loadSpatialData('GADM_SL_2')
Now that each country of interest is divided into provinces/counties, we are ready to proceed to the data acquisition step regarding Ebola cases and their use in our map.
Acquiring Data
Using the rvest
package, we can download the necessary table of Ebola counts per province/county into a local dataframe. While the table also includes information for a 2014 population estimate and the total number of Ebola cases, we are solely interested in the column referring to number of cases/1000 pop as this is what we will be graphing.
# The rvest package makes it super easy to ingest data from web pages url <- 'http://www.johnstonsarchive.net/policy/westafrica-ebola.html' raw <- rvest::html(url) tables <- rvest::html_nodes(raw, "table") # list of tables on the site ebolaDF <- rvest::html_table(tables[[3]])
Using this dataframe, we can sort the cases/100 pop for each province/county into seven levels, which we copy from the original graph in order to achieve as close a match as possible. Note that these levels are not linear. We also used a color picker (e.g. colorzilla) to pull the colors used in the legend of the original graph, again with the goal of providing a match to the original graph. We then sort the data into these seven levels, with corresponding colors. We also add to the end an additional color, the color white, in order to take into account those values from the data which are NA (Values of NA indicate that there were 0 Ebola cases reported, resulting in a NA value for cases/100 pop).
breaks <- c(.003,.01,.03,.1,.3,1,3,10) # NOTE: see table for which rows to use colorIndices <- .bincode(ebolaDF[c language="(2:16,20:34,38:78),4"][/c], breaks) ## Warning in .bincode(ebolaDF[c 1="20:34," 2="38:78)," 3="4" language="(2:16,"][/c], breaks): NAs ## introduced by coercion colorIndices[is.na(colorIndices)] <- 9 colors <- c('#8ed150','#a2d840','#c2e529','#fdfd00','#ffd600','#ff8000','#f40000','#97547e','#ffffff')
Running this code gives us the error, “NAs introduced by coercion.” We can ignore this error, as we account for the NA values in the next line.
Applying Data to Map
Now that we have sorted our data into categories, we can refer to our preliminary mappings to add our chosen colors to the graph. As we have already labeled the data by level and assigned corresponding colors, we simply plot the GADM data for each country, adding the new plot on top of what we have already created. However, adding the colored polygons for each country has overridden the original borders between the countries, so we must add them back using our original plot containing only the country outlines, not the provinces/counties outlines.
We are almost there! Our graph is looking very similar to the original graph, but it is missing a few crucial pieces. We add a legend defining our use of color, as well as labels for each country, and a title for the graph.
plot(GADM_GN_2, col=colors[colorIndices[31:71]]) plot(GADM_LR_1, col=colors[colorIndices[1:15]], add=TRUE) plot(GADM_SL_2, col=colors[colorIndices[16:30]], add=TRUE) plot(ebolaCountries, lwd=3, add=TRUE) legend('bottomleft', legend=rev(breaks[1:7]), fill=rev(colors[1:8]), bty='n', title='case rate pern1000 pop') text(-10.8,5.8,'Liberia', =2) text(-14.1,7.8,'Sierra Leone', =2) text(-14.8,9.8,'Guinea', =2) text(-12.1,4.8, 'Rate of Ebola cases by region,nWest Africa (to 22 Mar 2015)',) text(-12.1,4.3,'(both confirmed and unconfirmed cases)', cex=.8, xpd=NA)
Et voilà!
Variations between the Graphs
Our final product is very similar to the original graph which we were trying to mimic. The main difference lies in the use of color. The original graph appears to use a continuous color scheme, providing a label only for a limited number of the colors used. In contrast, our graph limits itself to six distinct colors, which are used both in the legend and in the graph itself. Aside from this difference, the two plots are quite comparable.
Using the MazamaSpatialUtils package, we were able to choose an existing graph and create our own nearly identical version with just ~35 lines of actual code that did everything from data access and conversion to plotting and annotating.
Have fun creating your own maps of the world!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.