Yet-Another-Power Outages Post : Full Tidyverse Edition

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This past weekend, violent windstorms raged through New England. We — along with over 500,000 other Mainers — went “dark” in the wee hours of Monday morning and (this post was published on Thursday AM) we still have no utility-provided power nor high-speed internet access. The children have turned iFeral, and being a remote worker has been made that much more challenging. Earlier in the week, even making a cellular phone call (not an Google Voice or other VoIP-ish call, just pressing buttons in the phone “app” in iOS) resulted in an “All circuits are busy” message vs human contact. (I had to repair our generator at some point between then and now, but it’s all a blur at this point).

Late Tuesday night, we checked the Central Maine Power outage info and were greeting with a “November 4, 2017” estimate. After regaining composure, we doubled down on the fact that we’d be extreme indoor glamping for a while longer.

As noted, I cope by coding and have a history of scraping Central Maine Power’s site for outage info. I ceased when I discovered the site & twitter bot I mentioned in a recent post, as it does that for the entirety of the U.S. (though many power companies continue make it difficult to scrape their outage info).

However, I wanted to see just how many other streets were in the same position as we are (I should note that less than a mile from us there are folks with power and internet, due mostly to their proximity to “vital” resources and how screwed up the Maine power grid is). Rather than reuse existing code, I wanted to make a modern, tidyverse edition of scrapers. If you follow enough paths in the aforementioned outage site, you’ll see that you eventually come to a page with a pretty ugly iframe that lets you poke around counties and towns. The following code traverses that tree to get street-level outage info:

library(rvest)
library(stringi)
library(hrbrthemes)
library(tidyverse)

# helper to make numbers from comma strings; i still find it amusing that
# this has never been a core "S" or base "R" function given that the
# central goal of both languages are to work with data
to_num <- function(x) { as.numeric(stri_replace_all_fixed(x, ",", "")) }

# top of the tree
pg <- read_html("http://www3.cmpco.com/OutageReports/CMP.html")

# basic idiom all the way down is to find the links to traverse until we get to
# the street level data, plucking the timestamp of the CMP report along the way
html_nodes(pg, "a") %>% 
  map_df(~{
    
    county <- stri_trans_totitle(html_text(.x))
    cpg <- read_html(sprintf("http://www3.cmpco.com/OutageReports/%s", html_attr(.x, "href")))
        
    message(sprintf("Processing %s...", county))
    
    html_nodes(cpg, xpath=".//td[not(@colspan = '2') and not(@align = 'right')][1]/a") %>% 
      map_df(~{
        
        town <- stri_trans_totitle(html_text(.x))
        tpg <- read_html(sprintf("http://www3.cmpco.com/OutageReports/%s", html_attr(.x, "href")))
        
        message(sprintf("  - %s", town))
    
        html_node(tpg, xpath=".//td[contains(., 'Update:')]") %>%
          html_text() %>%
          stri_replace_first_regex("Update: ", "") %>%
          as.POSIXct("%b %d, %Y %I:%M %p", tz="America/New_York") -> ts
        
        html_node(tpg, "table") %>% 
          html_table() %>% 
          docxtractr::assign_colnames(3) %>%  
          docxtractr::mcga() %>% # in github version
          mutate(street = stri_trans_totitle(street)) %>% 
          mutate_at(vars(-estimated_restoration, -street), .funs=to_num) %>% 
          filter(!is.na(total_customersby_street)) %>% 
          mutate(timestamp = ts) %>% 
          mutate(county = county) %>% 
          mutate(town = town) %>% 
          tbl_df()
      })
    
  }) -> xdf

xdf <- mutate(xdf, estimated_restoration = as.POSIXct(estimated_restoration, "%b %d, %Y %I:%M %p", tz="America/New_York"))

xdf
## # A tibble: 10,601 x 7
##           street total_customersby_street customerswithout_power estimated_restoration           timestamp       county   town
##            <chr>                    <dbl>                  <dbl>                <dttm>              <dttm>        <chr>  <chr>
##  1        2Nd St                       59                     14                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  2        3Rd St                      128                     53                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  3        4Th St                       89                     15                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  4        5Th St                       56                      3                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  5     Adams Ave                        4                      4   2017-11-03 19:00:00 2017-11-02 06:46:00 Androscoggin Auburn
##  6     Allain St                        8                      8                    NA 2017-11-02 06:46:00 Androscoggin Auburn
##  7     Atwood St                        6                      3   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
##  8    Baxter Ave                       32                      9   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
##  9     Beaver Rd                        9                      4   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
## 10 Bellflower Dr                       10                      9   2017-11-04 22:00:00 2017-11-02 06:46:00 Androscoggin Auburn
## # ... with 10,591 more rows

One truly upsetting revelation from data is the number of folks still in an “Assessing” condition (i.e. no restoration time estimate):

filter(xdf, is.na(estimated_restoration)) %>% 
  summarise(streets = n(), customers_impacted = sum(total_customersby_street))
## # A tibble: 1 x 2
##   streets customers_impacted
##     <int>              <dbl>
## 1    2255              42067

I’m thankful (for them and us) that Winter has not yet hit and that the weather has been and is going to be sufficiently mild to not make things life-threatening for most folks (it does get cold in northern Maine at this time of year).

It’s About Time

We can get an overview of when things are slated to get better for everyone but the folks I just mentioned:

select(xdf, county, estimated_restoration) %>% 
  mutate(day = as.Date(estimated_restoration, tz="America/New_York")) %>% 
  filter(!is.na(day)) %>%
  count(day) %>% 
  ggplot(aes(day, n)) +
  geom_col() +
  scale_x_date(date_labels = "%b\n%d", date_breaks = "1 day") +
  scale_y_comma() +
  labs(
    x=NULL, y="# Streets",
    title="Distribution of Street Estimated Restoration Target Dates",
    subtitle=sprintf("Central Maine Power / Generated: %s", Sys.time())
  ) +
  theme_ipsum_rc(grid="Y")

It seems that most of us are in the same “November 4th” bucket. But, we can also see that Central Maine Power’s data curation leaves much to be desired since there should be no dates in the past in that chart, but there are.

With the scraping data above, we can explore the outage info in many ways, but — as time and bandwidth are precious commodities — I’ll leave you with the total number of customers still without power:

count(xdf, wt = customerswithout_power) %>% pull(n)
## [1] 153465

and, a county-level view of the outage:

select(xdf, county, estimated_restoration) %>% 
  mutate(day = as.Date(estimated_restoration, tz="America/New_York")) %>% 
  filter(!is.na(day)) %>% 
  count(county, day) %>% 
  complete(county, day, fill=list(n=0)) %>% 
  filter(day >= Sys.Date()) %>% 
  ggplot(aes(day, n)) +
  geom_segment(aes(xend=day, yend=0), color="steelblue", size=4) +
  scale_x_date(date_labels = "%b\n%d", date_breaks = "1 day") +
  scale_y_comma(limits=c(0,1250)) +
  facet_wrap(~county, scales="free", ncol=5) +
  labs(
    x=NULL, y="# Streets",
    title="Distribution of Street Estimated Restoration Target Dates by County",
    subtitle=sprintf("Central Maine Power / Generated: %s", Sys.time())
  ) +
  theme_ipsum_rc(grid="Y", strip_text_face = "bold", axis="xy") +
  theme(panel.spacing.x=unit(3, "lines")) +
  theme(panel.spacing.y=unit(2, "lines"))

FIN

In a way, I wish I had continued scraping CMP data since the power outages site I mentioned doesn’t seem to provide access to the raw data and getting a historical perspective of the outage locations and analyzing by geography and other demographics might be interesting.

Hopefully the scraping code will be useful for some folks. It was definitely therapeutic for me 🙂

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)