Visualizing Historical & Most-likely First Snowfall Dates for U.S. Regions
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
UPDATE: You can now run this as a local Shiny app by entering shiny::runGist("95ec24c1b0cb433a76a5", launch.browser=TRUE)
at an R prompt (provided all the dependent libraries (below) are installed) or use it interactively over at Shiny Apps.
The impending arrival of the first real snowfall of the year in my part of Maine got me curious about what the most likely “first snow” dates are for my region. The U.S. Historical Climatology Network (USHCN) maintains historical daily climate records for each station in each state and has data (for some stations) going back as far as the 1800’s. A quick look at their data files indicated that they would definitely help satiate my curiosity (and make for a late night of cranking out some R code and ggplot visualizations).
To start, we’ll need a bit more than base R to get the job done:
library(pbapply) library(data.table) library(dplyr) library(ggplot2) library(grid) library(gridExtra) library(stringi) |
In all honesty, pbapply
, dplyr
and stringi
are not necessary, but they definitely make life easier by (respectively) giving us:
- free progress bars for
*apply
operations, - high efficacy data manipulation idioms, and
- a handy utility for converting strings to title case.
With setup out of the way, the first real task is to see which observer station is closest to my area. To figure that out we need to read in the station data file which is, sadly, in fixed-width format. Some stations have #
characters in their titles, to we have to account for that when we call read.fwf
. After reading in the station database we use a naive–but-usable distance calculation to find the closest station:
stations <- read.fwf("data/ushcn-stations.txt", widths=c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3), col.names=c("coop_id", "latitude", "longitude", "elevation", "state", "name", "component_1", "component_2", "component_3", "utc_offset"), colClasses=c("character", "numeric", "numeric", "numeric", "character", "character", "character", "character", "character", "character"), comment.char="", strip.white=TRUE) # not a great circle, but it gets the job done here closestStation <- function(stations, lat, lon) { index <- which.min(sqrt((stations$latitude-lat)^2 + (stations$longitude-lon)^2)) stations[index,] } # what's the closest station? closestStation(stations, 43.2672, -70.8617) ## coop_id latitude longitude elevation state name component_1 component_2 component_3 utc_offset 633 272174 43.15 -70.95 24.4 NH DURHAM ------ ------ ------ +5 |
As a Mainer, I’m not thrilled that this is the actual, closest station, so we’ll also see what the closest one is in Maine:
closestStation(stations %>% filter(state=="ME"), 43.2672, -70.8617) ## coop_id latitude longitude elevation state name component_1 component_2 component_3 utc_offset 10 176905 43.6497 -70.3003 13.7 ME PORTLAND JETPORT ------ ------ ------ +5 |
The analysis is easy enough to do for both, so we’ll first take a look at Durham, New Hampshire then do the exact same valuation for Portland, Maine.
Despite being fixed-width, the station database was not too difficult to wrangle. The state-level files that contain the readings are another matter:
Variable | Columns | Type |
---|---|---|
COOP ID | 1-6 | Character |
YEAR | 7-10 | Integer |
MONTH | 11-12 | Integer |
ELEMENT | 13-16 | Character |
VALUE1 | 17-21 | Integer |
MFLAG1 | 22 | Character |
QFLAG1 | 23 | Character |
SFLAG1 | 24 | Character |
VALUE2 | 25-29 | Integer |
MFLAG2 | 30 | Character |
QFLAG2 | 31 | Character |
SFLAG2 | 32 | Character |
… | … | … |
VALUE31 | 257-261 | Integer |
MFLAG31 | 262 | Character |
QFLAG31 | 263 | Character |
SFLAG31 | 264 | Character |
We have fixed-width, wide-format records with 31 days for each month, which proves the existence of truly insidious people in the world. Rather than use read.fwf
again, we’ll take a different approach (since we ultimately need the data in long format) and use readLines
to read in all the records from the NH data file, then filter out everything but snowfall entries from the station we’re interested in.
Next, we setup nested lapply
calls to build a long data frame from each month then combine them all together into a single data frame:
snow <- readLines("data/state27_NH.txt") snow <- grep("SNOW", snow, value=TRUE) snow <- grep("^272174", snow, value=TRUE) snow_dat <- rbindlist(pblapply(snow, function(x) { rbindlist(lapply(1:31, function(i) { # record format described here: # http://cdiac.ornl.gov/ftp/ushcn_daily/data_format.txt start <- 17 + (i-1)*8 list(coop_id=substr(x, 1, 6), date=sprintf("%s-%02d-%02d", substr(x, 7, 10), as.numeric(substr(x, 11, 12)), i), element=substr(x, 13, 16), value=as.numeric(substr(x, start, start+4)), mflag=substr(x, start+5, start+5), qflag=substr(x, start+6, start+6), sflag=substr(x, start+7, start+7)) })) })) |
Now, we’ll clean up the records even further by removing invalid entries (those with a value
== -9999
) and convert record dates to actual Date
objects and filter out invalid dates:
snow_dat <- snow_dat %>% filter(value != -9999) # since the data file has 31 days for each records regardless of whether # that's valid or not we do a shortcut to remove invalid dates by doing the # a vectorized Date conversion, then removing records with NA dates snow_dat$date <- as.Date(snow_dat$date) snow_dat <- snow_dat %>% filter(!is.na(date)) # having the year extracted is handy for filtering snow_dat$year <- format(snow_dat$date, "%Y") |
Given that Winter in the U.S. spans across two calendar years, we need a way to keep dates in January-May associated with the previous year (yes, that adds an inherent assumption that no first snow is in June, which might not hold true for Alaska). To facilitate this, we’ll convert each date to its corresponding day of year value then add the number of total days in the start year to those values for all months <= May. We really do need to do this, too, since there are many cases where the first snowfall will be in January-March for many states.
snow_dat$doy <- as.numeric(format(snow_dat$date, "%j")) snow_dat$doy <- ifelse(snow_dat$doy<=180, snow_dat$doy + as.numeric(format(as.Date(sprintf("%s-12-31", snow_dat$year)), "%j")), snow_dat$doy) |
Now, the fun begins. We use (mostly) dplyr
to extract the first snowfall day from each year, then make a dot-line plot from the data:
first <- snow_dat %>% filter(value>0) %>% # ignore 0 values filter(date>=as.Date("1950-01-01")) %>% # start at 1950 (arbitrary) merge(stations, by="coop_id", all.x=TRUE) %>% # merge station details group_by(coop_id, year) %>% # group by station and year arrange(doy) %>% # sort by our munged day of year filter(row_number(doy) == 1) %>% # grab the first entry by group select(name, state, date, value, doy) # we only need some variables title_1 <- sprintf("First observed snowfall (historical) at %s, %s", stri_trans_totitle(unique(first$name)), unique(first$state)) gg <- ggplot(first, aes(y=year, x=doy)) gg <- gg + geom_segment(aes(xend=min(first$doy)-20, yend=year), color="#9ecae1", size=0.25) gg <- gg + geom_point(aes(color=coop_id), shape=8, size=3, color="#3182bd") gg <- gg + geom_text(aes(label=format(date, "%b-%d")), size=3, hjust=-0.2) gg <- gg + scale_x_continuous(expand=c(0, 0), limits=c(min(first$doy)-20, max(first$doy)+20)) gg <- gg + labs(x=NULL, y=NULL, title=title_1) gg <- gg + theme_bw() gg <- gg + theme(legend.position="none") gg <- gg + theme(panel.grid=element_blank()) gg <- gg + theme(panel.border=element_blank()) gg <- gg + theme(axis.ticks.x=element_blank()) gg <- gg + theme(axis.ticks.y=element_blank()) gg <- gg + theme(axis.text.x=element_blank()) gg <- gg + theme(axis.text.y=element_text(color="#08306b")) by_year <- gg |
While that will help us see the diversity across years, we have to do quite a bit of eye tracking to get the most likely date range for the first snowfall, so we’ll add a boxplot into the mix and use summary
to figure out the quartiles (for labeling the chart) for the actual date values:
wx_range <- summary(as.Date(format(first$date, "2013-%m-%d"))) names(wx_range) <- NULL min_wx <- gsub("2013-", "", wx_range[2]) max_wx <- gsub("2013-", "", wx_range[5]) title_2 <- sprintf("Most likely first snowfall will be between %s & %s", min_wx, max_wx) gg <- ggplot(first %>% mutate(name="0000"), aes(name, doy)) gg <- gg + geom_boxplot(fill="#3182bd", color="#08306b", outlier.colour="#08306b") gg <- gg + scale_y_continuous(expand=c(0, 0), limits=c(min(first$doy)-20, max(first$doy)+20)) gg <- gg + coord_flip() gg <- gg + labs(x=NULL, y=NULL, title=title_2) gg <- gg + theme_bw() gg <- gg + theme(legend.position="none") gg <- gg + theme(panel.grid=element_blank()) gg <- gg + theme(panel.border=element_blank()) gg <- gg + theme(axis.ticks.x=element_blank()) gg <- gg + theme(axis.text.x=element_blank()) gg <- gg + theme(axis.ticks.y=element_line(color="white")) gg <- gg + theme(axis.text.y=element_text(color="white")) gg <- gg + theme(plot.title=element_text(size=11)) box_wx <- gg |
Finally, we’ll combine them together to get the finished product:
grid.arrange(by_year, box_wx, nrow=2, heights=unit(c(0.9, 0.1), "npc")) |
And, do the same for Portland:
There are many more analyses and visualizations that can be performed on these data sets, but be wary when creating them as I’ve seen a few files with fixed-width formatting errors and have also noticed missing records for some observer stations.
You can find the complete, commented code up on github.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.