Freshwater access in rural regions, using d3Network to explore similarities
This post describes the construction of a similarity matrix and its use in creating grouped network graphs to examine freshwater access in rural regions of 194 countries around the world. The data comes from the WHO/UNICEF Joint Monitoring Programme (JMP) for Water Supply and Sanitation, downloaded from The World Bank on December 26, 2013.
Dataset construction
To run the code, you’ll need Christopher Gandrud’s d3Network package.
setwd("C:/_Rproject/ForceDirected")
require('d3Network', lib.loc="c:/r/packages/")
The following code snippet reads a .csv file containing two columns from the table linked above, Country (after removing any accents and diacritical marks) and Access_Rural, strips leading and trailing blanks from the fields, and creates a data frame called water.
water <- read.csv(file="3.5_Freshwater_useForCooccurrence_clean.csv",
                  strip.white=TRUE, header=TRUE, sep=",",
                  na.strings=c("."), colClasses=c('character','numeric'))
The metadata, available with the table linked earlier, contains the table name, income group, currency, region, and other fields for each country. The following commands load it into a data frame and keep three columns of interest. I wanted high-income countries in one group, regardless of OECD membership status, so the group names are cleaned before Income.Group is converted into a factor variable. The result is then merged with water to create the final source data frame, also called water.
meta <- read.csv(file="FreshwaterMeta.csv", strip.white=TRUE, head=TRUE, sep=",", na.strings=c(" "))
meta <- subset(meta, Income.Group != "", select=c("Table.Name","Income.Group","Region"))
meta[2] <- lapply(meta[2], as.character)
meta$inc <- ifelse(substr(meta$Income.Group,1,1) == 'H', "High income", meta$Income.Group)
meta$ecogrp <- as.integer(factor(meta$inc, levels=c("Low income","Lower middle income","Upper middle income","High income")))
water <- merge(water, meta, by.x = "Country", by.y = "Table.Name", all.x = TRUE)
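To see what the recoding does, here is a minimal sketch on a made-up table. The rows and the exact income labels below are illustrative assumptions, not values taken from the World Bank metadata file.

```r
# Toy metadata; rows and labels are illustrative only
meta_toy <- data.frame(
  Table.Name   = c("A", "B", "C", "D"),
  Income.Group = c("High income: OECD", "High income: nonOECD",
                   "Low income", "Upper middle income"),
  stringsAsFactors = FALSE
)

# Collapse every label starting with "H" into a single "High income" group
meta_toy$inc <- ifelse(substr(meta_toy$Income.Group, 1, 1) == "H",
                       "High income", meta_toy$Income.Group)

# Explicit level order gives integer codes 1 (low) through 4 (high)
income_levels <- c("Low income", "Lower middle income",
                   "Upper middle income", "High income")
meta_toy$ecogrp <- as.integer(factor(meta_toy$inc, levels = income_levels))

print(meta_toy[, c("Table.Name", "inc", "ecogrp")])
```

Both high-income rows end up with code 4, which is what later lets the graph colour them as one group.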
Given the size of my drawing area, between 800 and 1000 pixels, I divided the data frame by region to restrict the number of countries in each graph to a range of 50-70. The following command creates the data frame for the Europe & Central Asia region and restricts it to records with non-missing Access_Rural values. Other regional data frames were created in the same manner.
waterECA <- subset(water,Region=="Europe & Central Asia" & !is.na(Access_Rural))
Matrix Construction
To create the similarity matrix, I began with a square matrix of zeros with a row for each country.
m <- matrix(0, nrow=nrow(waterECA), ncol=nrow(waterECA))
The waterECA data frame can now be used to populate m with a set of non-negative values, bounded between 0 and 100, that reflect the level of agreement between each pair of countries. In the matrix m, each element (i,j) represents the absolute difference in percentages between country i and country j, so smaller values mean more similar countries.
for(i in seq_along(waterECA$Country)){
  for(j in seq_along(waterECA$Country)){
    m[i, j] <- abs(waterECA$Access_Rural[i]-waterECA$Access_Rural[j])
  }
}
rownames(m) <- waterECA$Country
colnames(m) <- waterECA$Country
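The double loop can also be written as a single vectorized call. The sketch below is an alternative, not the code used for the post; it uses a made-up vector of percentages in place of waterECA$Access_Rural and checks that outer() gives the same matrix as the loop.

```r
# Toy percentages standing in for waterECA$Access_Rural (illustrative values)
access <- c(A = 100, B = 85, C = 92)

# Double loop, following the approach in the post
m_loop <- matrix(0, nrow = length(access), ncol = length(access))
for (i in seq_along(access)) {
  for (j in seq_along(access)) {
    m_loop[i, j] <- abs(access[i] - access[j])
  }
}

# Vectorized equivalent: outer() forms every pairwise difference at once,
# and carries the names over as row and column names
m_outer <- abs(outer(access, access, "-"))

print(m_outer)
```

For 50 to 70 countries the speed difference is negligible, so the choice is mostly one of readability.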
Only the elements above or below m’s diagonal are needed to create the set of edges for the graph. These next steps set m’s upper triangle and diagonal to NA, coerce m into a table of distinct country pairs and their corresponding similarity estimate, and subset the resulting data frame, links, to non-missing values.
m[upper.tri(m, diag=TRUE)] <- NA
links <- as.data.frame(as.table(m))
colnames(links) <- c("source","target","value")
links <- subset(links, !is.na(value))
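A small worked example may make the reshaping step easier to follow. The 3 x 3 matrix below is made up for illustration; the point is that n countries yield n(n-1)/2 edges once the upper triangle and diagonal are blanked.

```r
# 3 x 3 toy distance matrix (illustrative values)
m_toy <- matrix(c( 0, 15, 8,
                  15,  0, 7,
                   8,  7, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Blank the upper triangle and the diagonal, keeping one copy of each pair
m_toy[upper.tri(m_toy, diag = TRUE)] <- NA

# as.table() + as.data.frame() melts the matrix into (row, column, value) rows
links_toy <- as.data.frame(as.table(m_toy))
colnames(links_toy) <- c("source", "target", "value")
links_toy <- subset(links_toy, !is.na(value))

print(links_toy)
```

Three countries give three rows: B-A, C-A, and C-B.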
Before passing links to d3Network, these next steps assign zero-based integer indices to the source and target countries. By default, the levels of source and target are the unique, alphabetically sorted country names from the same data frame (waterECA), so I used R's internal ordering of these factors, via the as.integer() function, to set both indices.
links$sourceN <- as.integer(links$source) - 1 # shift to a zero-based index
links$targetN <- as.integer(links$target) - 1 # shift to a zero-based index
links <- subset(links, sourceN != targetN)
links <- subset(links, select=c("sourceN","targetN","value"))
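The indices only work if they line up with the row order of the node table, because D3 looks nodes up by position starting at 0. Here is a hedged sketch of a sanity check on toy data; the names nodes_toy and links_toy are hypothetical and the values are made up.

```r
# Toy nodes and links (illustrative); factor levels follow the node order
countries <- c("A", "B", "C")
nodes_toy <- data.frame(Country = countries, stringsAsFactors = FALSE)
links_toy <- data.frame(
  source = factor(c("B", "C", "C"), levels = countries),
  target = factor(c("A", "A", "B"), levels = countries),
  value  = c(15, 8, 7)
)

# Factor codes start at 1; D3 indexes its node array from 0
links_toy$sourceN <- as.integer(links_toy$source) - 1
links_toy$targetN <- as.integer(links_toy$target) - 1

# Sanity check: index + 1 must point back at the same country in nodes_toy
stopifnot(
  identical(nodes_toy$Country[links_toy$sourceN + 1], as.character(links_toy$source)),
  identical(nodes_toy$Country[links_toy$targetN + 1], as.character(links_toy$target))
)

print(links_toy[, c("sourceN", "targetN", "value")])
```

If the node table were sorted differently from the factor levels, the stopifnot() call would fail instead of silently mislabelling nodes in the graph.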
The nodes data frame was created from unique values of the waterECA data frame.
nodes <-as.data.frame(unique(waterECA[,c("Country","ecogrp")]))
Graphing
d3Network’s d3ForceNetwork function will send the contents of the HTML file that displays the graph to the console unless the output is redirected. Since I have to modify the code slightly to render the graph in WordPress and make some other adjustments (described later), I called the sink function first to divert the output to a text file in my working directory.
sink("d3force-waterECA.txt")
d3ForceNetwork(Links = links, Nodes = nodes,
               Source = "sourceN", Target = "targetN", Value = "value",
               NodeID = "Country", Group = "ecogrp",
               width = 800, height = 800, opacity = 0.9)
sink() # restore output to the console
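The sink pattern is easy to try without d3Network installed. In the sketch below, print_html is a hypothetical stand-in for any function that prints HTML to the console, and the output goes to a temporary file rather than the working directory.

```r
# Hypothetical stand-in for a function that prints HTML to the console
print_html <- function() cat("<html><body>graph goes here</body></html>\n")

out_file <- tempfile(fileext = ".txt")

sink(out_file)   # start diverting console output to the file
print_html()
sink()           # stop diverting; otherwise later output also lands in the file

# Read the captured text back for editing
html <- readLines(out_file)
print(html)
```

Calling sink() with no arguments closes the diversion, which is worth doing before any further interactive work.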
The output of d3ForceNetwork can be easily customized. For example, by default, the link distance is fixed and the values in the set of edges determine the stroke width. Because each node in this data is connected to every other node, the graph produced without modification is dense and hard to read.
Opening the text output file and varying the force layout's linkDistance and charge attributes helped make the graph more readable.
Original output:
var force = d3.layout.force()
    .nodes(d3.values(nodes))
    .links(links)
    .size([width, height])
    .linkDistance(50)
    .charge(-120)
    .on("tick", tick)
    .start();
Sample modification:
    .linkDistance(function(d) { return (d.value + 1) * 9; })
    .charge(-1 * Math.pow(nodes.length, 2))
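For a sense of scale, the sample modification can be evaluated by hand. The sketch below redoes the two JavaScript expressions in R for a few made-up inputs; the node count of 50 is an assumption chosen to match the 50-70 range mentioned earlier.

```r
# Link distance from the sample modification: (d.value + 1) * 9
link_distance <- function(value) (value + 1) * 9

# Charge from the sample modification: -1 * nodes.length^2
charge <- function(n_nodes) -1 * n_nodes^2

values <- c(0, 10, 50, 100)   # absolute differences in percentage points
print(data.frame(value = values, distance = link_distance(values)))

print(charge(50))             # 50 nodes is an illustrative count
```

Countries with identical access get a target distance of 9 (the +1 keeps it from collapsing to zero), a 10-point gap gives 99, and a 100-point gap gives 909. With 50 nodes the charge is -2500, far stronger repulsion than the default of -120.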
An example of the graph produced using this method appears here: Europe and Central Asia. Since I chose not to show the connecting lines in the final graph, I set their opacity value to 0. I also replaced the d3ForceNetwork default nodes information, which can only contain the node name and group level for now, with a JSON-formatted list containing a third variable, the percentage of freshwater access, and added this field to the node's text element.