Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Most of my free coding time has been spent tweaking a D3-based live power outage tracker for Central Maine Power customers (there’s also a woefully less-featured Shiny app for it, too). There is some R associated with the D3 vis, but it’s limited to a cron job that’s makes the CSV files for the sparklines in D3 vis by
- reading historical outage data from a database of scraped readings I’ve been keeping
- filling out the time series
- reducing it to an hourly time series, and
- trimming the data set to the last 30 days of records:
#!/usr/bin/Rscript # running in a cron job so no spurious text output pls options(warn=-1) options(show.error.messages=FALSE) suppressMessages(library(methods)) suppressMessages(library(zoo)) library(chron) library(xts) library(reshape2) library(DBI) library(RMySQL) m <- dbDriver("MySQL"); con <- dbConnect(m, user='DBUSER', password='DBPASSWORD', host='localhost', dbname='DBNAME'); res <- dbSendQuery(con, "SELECT * FROM outage") # cld just pull the 2 fields I need outages <- fetch(res, n = -1) outages$ts <- as.POSIXct(gsub("\\:[0-9]+\\..*$","", outages$ts), format="%Y-%m-%d %H:%M") # for each county we have data for for (county in unique(outages$county)) { # get the 2 fields I need (shld prbly filter that in the SELECT) outage.raw <- outages[outages$county == county,c(1,4)] # make it a zoo object outage.zoo <- zoo(outage.raw$withoutpower, outage.raw$ts) # fill out the 15-minute readings complete.zoo <- merge(outage.zoo, zoo(, seq(start(outage.zoo), max(outages$ts), by="15 min")), all=TRUE) complete.zoo[is.na(complete.zoo)] <- 0 # shrink to hourly and trim at 30 days hourly.zoo <- last(to.hourly(complete.zoo), "30 days") # crank out a CSV df <- data.frame(hourly.zoo) df <- data.frame(ts=rownames(df), withoutPower=df$complete.zoo.High) write.csv(df, sprintf("OUTPOUT_LOCATION/%s.csv",county), row.names=FALSE) } |
I knew there were other power companies in Maine, but CMP is the largest, so I focused my attention on getting data from it. After hearing an outage update on MPBN I decided to see if Bangor Hydro Electric had scrape-able content and it turns out there was a lovely (well, as lovely as XML can be) XML file delivered on the page with this “meh” Google push-pin map:
The XML file is used to build the markers for the map and has marker
tags that look like this:
<marker name="Exeter" outages="18" lat="44.96218" lng="-69.12253" streets="BUTTERS RD;CHAMPEON RD;GREENBUSH RD;" reflabels="87757-1;84329-1;85032-1;"/> |
I’m really only tracking county-level data and BHE does not provide that, even in the huge table of street-level outages that you’ll see on that outage page. I decided to use R to aggregate the BHE data to the county level via the “point-in-polygon” method.
Getting Right To the Point
To perform the aggregation in R, I needed county-level polygons for Maine. I already had that thanks to the previous work, but I wanted to optimize the search process, so I took the US counties shapefile and used OGR
from the GDAL
(Geospatial Data Abstraction Library) suite to extract just the Maine county polygons:
ogr2ogr -f "ESRI Shapefile" \ -where "STATE_NAME = 'MAINE'" maine.shp counties.shp |
You can see both a reduction in file size and complexity via ogrinfo
:
$ll *shp -rwxr-xr-x@ 1 bob staff 1517624 Jan 8 2010 counties.shp -rw-r--r-- 1 bob staff 12588 Dec 26 23:03 maine.shp |
$ ogrinfo -sql "SELECT COUNT(*) FROM counties" counties.shp INFO: Open of 'counties.shp' using driver 'ESRI Shapefile' successful. Layer name: counties Geometry: Polygon Feature Count: 1 Layer SRS WKT: (unknown) COUNT_*: Integer (0.0) OGRFeature(counties):0 COUNT_* (Integer) = 3141 |
$ ogrinfo -sql "SELECT COUNT(*) FROM maine" maine.shp INFO: Open of 'maine.shp' using driver 'ESRI Shapefile' successful. Layer name: maine Geometry: Polygon Feature Count: 1 Layer SRS WKT: (unknown) COUNT_*: Integer (0.0) OGRFeature(maine):0 COUNT_* (Integer) = 16 |
The conversion reduces the file size from 1.5MB to ~12K and shrinks the number of polygons from 3,141 to 16. The counties.shp
and maine.shp
shapefiles were built from U.S. census data and have scads more information that you might want to use (i.e. perhaps, for correlation with the outage info, though storms are the prime causal entity for the outages :-):
$ ogrinfo -al -so counties.shp INFO: Open of 'counties.shp' using driver 'ESRI Shapefile' successful. Layer name: counties Geometry: Polygon Feature Count: 3141 Extent: (-176.806138, 18.921786) - (-66.969271, 71.406235) Layer SRS WKT: GEOGCS["GCS_WGS_1984", DATUM["WGS_1984", SPHEROID["WGS_84",6378137.0,298.257223563]], PRIMEM["Greenwich",0.0], UNIT["Degree",0.0174532925199433]] NAME: String (32.0) STATE_NAME: String (25.0) STATE_FIPS: String (2.0) CNTY_FIPS: String (3.0) FIPS: String (5.0) POP2000: Integer (9.0) POP2007: Integer (9.0) POP00_SQMI: Real (10.1) POP07_SQMI: Real (7.1) WHITE: Integer (9.0) BLACK: Integer (9.0) AMERI_ES: Integer (9.0) ASIAN: Integer (9.0) HAWN_PI: Integer (9.0) OTHER: Integer (9.0) MULT_RACE: Integer (9.0) HISPANIC: Integer (9.0) MALES: Integer (9.0) FEMALES: Integer (9.0) AGE_UNDER5: Integer (9.0) AGE_5_17: Integer (9.0) AGE_18_21: Integer (9.0) AGE_22_29: Integer (9.0) AGE_30_39: Integer (9.0) AGE_40_49: Integer (9.0) AGE_50_64: Integer (9.0) AGE_65_UP: Integer (9.0) MED_AGE: Real (9.1) MED_AGE_M: Real (9.1) MED_AGE_F: Real (9.1) HOUSEHOLDS: Integer (9.0) AVE_HH_SZ: Real (9.2) HSEHLD_1_M: Integer (9.0) HSEHLD_1_F: Integer (9.0) MARHH_CHD: Integer (9.0) MARHH_NO_C: Integer (9.0) MHH_CHILD: Integer (9.0) FHH_CHILD: Integer (9.0) FAMILIES: Integer (9.0) AVE_FAM_SZ: Real (9.2) HSE_UNITS: Integer (9.0) VACANT: Integer (9.0) OWNER_OCC: Integer (9.0) RENTER_OCC: Integer (9.0) NO_FARMS97: Real (11.0) AVG_SIZE97: Real (11.0) CROP_ACR97: Real (11.0) AVG_SALE97: Real (7.2) SQMI: Real (8.1) OID: Integer (9.0) |
With the new shapefile in hand, the basic workflow to get BHE outages at the county level is:
- Read and parse the BHE outages XML file to get the lat/long pairs
- Build a
SpatialPoints
object out of those pairs - Make a
SpatialPolygonsDataFrame
out of the reduced Maine counties shapefile - Overlay the points in the polygons and get the feature metadata intersection (including county)
- Aggregate the outage data
The R code (below) does all that and is liberally commented. One has to appreciate how succinct the XML parsing is and the beautiful simplicity of the over()
function (which does all the really hard work).
library(XML) library(maptools) library(sp) library(plyr) # Small script to get county-level outage info from Bangor Hydro # Electric's town(-ish) level info # # BHE's outage google push-pin map is at # http://apps.bhe.com/about/outages/outage_map.cfm # read BHE outage XML file that was intended for the google map # yep. One. Line. #takethatpython doc <- xmlTreeParse("http://apps.bhe.com/about/outages/outage_map.xml", useInternalNodes=TRUE) # xmlToDataFrame() coughed up blood on that simple file, so we have to # resort to menial labor to bend the XML to our will doc.ls <- xmlToList(doc) doc.attrs <- doc.ls$.attrs doc.ls$.attrs <- NULL # this does the data frame conversion magic, tho it winces a bit suppressWarnings(doc.df <- data.frame(do.call(rbind, doc.ls), stringsAsFactors=FALSE)) # need numbers for some of the columns (vs strings) doc.df$outages <- as.numeric(doc.df$outages) doc.df$lat <- as.numeric(doc.df$lat) doc.df$lng <- as.numeric(doc.df$lng) # SpatialPoints likes matrices, note that it's in LON, LAT order # that always messes me up for some reason doc.m <- as.matrix(doc.df[,c(4,3)]) doc.pts <- SpatialPoints(doc.m) # I trimmed down the country-wide counties file from # http://www.baruch.cuny.edu/geoportal/data/esri/usa/census/counties.zip # with # ogr2ogr -f "ESRI Shapefile" -where "STATE_NAME = 'MAINE'" maine.shp counties.shp # to both save load time and reduce the number of iterations for over() later counties <- readShapePoly("maine.shp", repair=TRUE, IDvar="NAME") # So, all the above was pretty much just for this next line which does # the "is this point 'a' in polygon 'b' automagically for us. found.pts <- over(doc.pts, counties) # steal the column we need (county name) and squirrel it away with outage count doc.df$county <- found.pts$NAME doc.sub <- doc.df[,c(2,7)] # aggregate the result to get outage count by county count(doc.sub, c("county"), wt_var="outages") ## county freq ##1 Hancock 4440 ##2 Penobscot 869 ##3 Waldo 28 ##4 Washington 545 ##5 <NA> 328 |
Astute readers will notice unresolved points (the NAs
). I suspect they are right on coastal boundaries that were probably missed in these simplified county polygons. We can see what they are by looking at the NA
entries in the merged data frame:
doc.df[is.na(doc.df$county),c(1:4)] name outages lat lng 35 Deer Isle 1 44.22451 -68.67778 38 Harborside 166 44.34900 -68.81555 39 Sorrento 43 44.47341 -68.17723 62 Bucksport 71 44.57369 -68.79562 70 Penobscot 40 44.44552 -68.81780 78 Bernard 1 44.24119 -68.35585 79 Stonington 5 44.15619 -68.66669 80 Roque Bluffs 1 44.61286 -67.47971 |
But a map would be more useful for those not familiar with Maine geography/extents:
library(ggplot2) ff = fortify(counties, region = "NAME") missing <- doc.df[is.na(doc.df$county),] gg <- ggplot(ff, aes(x = long, y = lat)) gg <- gg + geom_path(aes(group = group), size=0.15, fill="black") gg <- gg + geom_point(data=missing, aes(x=lng, y=lat), color="#feb24c", size=3) gg <- gg + coord_map(xlim=extendrange(range(missing$lng)), ylim=extendrange(range(missing$lat))) gg <- gg + theme_bw() gg <- gg + labs(x="", y="") gg <- gg + theme(plot.background = element_rect(fill = "transparent",colour = NA), panel.border = element_blank(), panel.background =element_rect(fill = "transparent",colour = NA), panel.grid = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(), legend.position="right", legend.title=element_blank()) gg |
The “zoom in” is done by taking and slightly extending the range of the extracted points via range()
and extendrange()
, reproduced below:
range(missing$lng) [1] -68.81780 -67.47971 range(missing$lat) [1] 44.15619 44.61286 extendrange(range(missing$lng)) [1] -68.88470 -67.41281 extendrange(range(missing$lat)) [1] 44.13336 44.63569 |
It turns out my suspicion was right, so to use this in “production” I’ll need a more accurate shapefile for Maine counties (which I have, but Descent is calling me, so it will have to wait for another day).
I’ll leave you with a non-Google push-pin map of outages that you can build upon (it needs some tweaking):
gg <- ggplot(ff, aes(x = long, y = lat)) gg <- gg + geom_polygon(aes(group = group), size=0.15, fill="black", color="#7f7f7f") gg <- gg + geom_point(data=doc.df, aes(x=lng, y=lat, alpha=outages, size=outages), color="#feb24c") gg <- gg + coord_map(xlim=c(-71.5,-66.75), ylim=c(43,47.5)) gg <- gg + theme_bw() gg <- gg + labs(x="", y="") gg <- gg + theme(plot.background = element_rect(fill = "transparent",colour = NA), panel.border = element_blank(), panel.background =element_rect(fill = "transparent",colour = NA), panel.grid = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(), legend.position="right", legend.title=element_blank()) gg |
You can find all the R code in one, compact gist.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.