NHS Winter Situation Reports: Shiny Viewer v2
Having got my NHS Winter sitrep data scraper into shape (I think!), and dabbled with a quick demo using the R Shiny library, I thought I’d tidy it up a little over the weekend and along the way learn a few new presentation tricks.
To quickly recap the data availability, the NHS publish a weekly spreadsheet (with daily reports for Monday to Friday – weekend data is rolled over to the Monday) as an Excel workbook. The workbook contains several sheets, corresponding to different data collections. A weekly scheduled scraper on Scraperwiki grabs each spreadsheet and pulls the data into a rolling database: NHS Sitreps scraper/aggregator. This provides us with a more convenient longitudinal dataset if we want to look at sitrep measures for a period longer than a single week.
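If you want to pull the aggregated data into R yourself, something along the following lines should work against the Scraperwiki CSV export API (a sketch only: the scraper shortname 'nhs_sitreps' and table name 'swdata' are assumptions – check the scraper page for the actual values):

#Pull the rolling sitrep database from the Scraperwiki datastore as CSV
#The scraper shortname and table name below are placeholders/assumptions
sw.base='https://api.scraperwiki.com/api/1.0/datastore/sqlite'
sw.query=URLencode('select * from swdata')
sw.url=paste(sw.base, '?format=csv&name=nhs_sitreps&query=', sw.query, sep='')
sitreps=read.csv(sw.url, stringsAsFactors=FALSE)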
So here’s where I’ve got to now – NHS sitrep demo:
The panel on the left controls user actions. The PCT drop down list (which should probably be relabelled as “Trust”) is populated based on the selection of a Strategic Health Authority. The Report types follow the separate sheets in the Winter sitrep spreadsheet (though some of them include several reported measures, which is handled in the graphical display). The Download button allows you to download the data for the selected report as CSV. By default, it downloads data at the SHA level (that is, data for each Trust in the selected SHA), although a checkbox control allows you to limit the downloaded results to just the data for the selected Trust:
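In case it’s useful, here’s a minimal sketch of how a dependent drop down like that can be wired up using Shiny’s dynamic UI (illustrative rather than the app’s actual code; the trusts lookup data frame and its columns are assumptions):

#In ui.R, something like:
#  selectInput('sha', 'Strategic Health Authority:', shaList),
#  uiOutput('pctControl')

#In server.R, repopulate the Trust list whenever the SHA selection changes
#trusts is assumed to be a lookup data frame with sha and Name columns
output$pctControl = renderUI(
  selectInput('tbl', 'PCT/Trust:', subset(trusts, sha==input$sha)$Name)
)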
Using just these controls, then, the user can select and download Winter sitrep data (to date), as a CSV file, for any selected Trust, or for all the Trusts in a given SHA. Here’s how the downloader was put together using Shiny:
So how does the Download work? Quite straightforwardly, as it turns out:
#This function marshals the data for download
downloadData <- reactive(function() {
  ds=results.data()
  if (input$pctdownonly==TRUE)
    ds=subset(ds, tid==input$rep & Code==input$tbl,
              select=c('Name','fromDateStr','toDateStr','tableName','facetB','value'))
  ds
})

output$downloadData <- downloadHandler(
  #Add a little bit of logic to name the download file appropriately
  filename = function() {
    if (input$pctdownonly==FALSE) paste(input$sha, '_', input$rep, '.csv', sep='')
    else paste(input$tbl, '_', input$rep, '.csv', sep='')
  },
  content = function(file) {
    write.csv(downloadData(), file, row.names=FALSE)
  }
)
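On the ui.R side, the corresponding controls might look something like this fragment (a sketch – the actual layout in the app may differ; the input/output ids match the ones used in the handler above):

#Inside e.g. sidebarPanel() in ui.R
checkboxInput('pctdownonly', 'Download selected Trust only', FALSE),
downloadButton('downloadData', 'Download CSV')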
Graphical reports are split into two panels: at the top, views over the report data for each Trust in the selected SHA; at the bottom, more focussed views over the currently selected Trust.
Working through the charts, the SHA level stacked bar chart is intended to show summed metrics at the SHA level:
My thinking here was that it may be useful to look at bed availability across an SHA, for example. The learning I had to do for this view was in the layout of the legend:
#g is a ggplot object
g=g+theme( legend.position = 'bottom' )
g=g+scale_fill_discrete( guide = guide_legend(title = NULL, ncol=2) )
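For context, the bar chart itself is built along these lines (a sketch; the data frame and column names are assumptions based on the download code above):

#df is assumed to hold the data for each Trust in the selected SHA
library(ggplot2)
g=ggplot(df, aes(x=Name, y=value, fill=facetB))
g=g+geom_bar(stat='identity')
#...the legend tweaks above are then added to g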
The facetted, multiplot view also uses independent y-axis scales for each plot (sometimes this makes sense, sometimes it doesn’t. Maybe I need to add some logic to control when to use this and when not to?)
#The 'scales' parameter allows independent y-axis limits for each facet plot
g=g+facet_wrap( ~tableName+facetB, scales = "free_y" )
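As to when to free the scales, one possible heuristic (an assumption on my part, not something the app currently does) would be to only use free y-axes when the facet maxima differ by more than, say, an order of magnitude:

#Only free the y-axis scales if the facet maxima are wildly different
#df and its column names are assumptions
library('plyr')
facetMax=ddply(df, .(tableName, facetB), summarise, mx=max(value, na.rm=TRUE))
yscales=if (max(facetMax$mx)/max(min(facetMax$mx), 1) > 10) 'free_y' else 'fixed'
g=g+facet_wrap( ~tableName+facetB, scales=yscales )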
The line chart shows the same data in a more connected way:
To highlight the data trace for the currently selected Trust, I overplot that line with dots that show the value of each data point for that Trust. I’m not sure whether these should be coloured? Again, the y-axis scales are free.
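A minimal sketch of that overplotting approach (column names assumed as before; fdate is assumed to be a date-typed column, and input$tbl holds the selected Trust code, as in the download handler):

#Plot a line per Trust, then overplot the selected Trust's values as points
g=ggplot(df, aes(x=fdate, y=value, group=Name))
g=g+geom_line()
g=g+geom_point(data=subset(df, Code==input$tbl))
g=g+facet_wrap( ~tableName+facetB, scales='free_y' )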
The SHA Boxplot shows the distribution of values for each Trust in the SHA. I overplot the box for the selected Trust using a different colour.
(I guess a “semantic depth of field”/blur approach might also be used to focus attention on the plot for the currently selected Trust?)
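The highlighting itself is simple enough – a sketch (again, column names are assumptions):

#Boxplot per Trust, with the selected Trust's box re-plotted in a highlight colour
g=ggplot(df, aes(x=Name, y=value))
g=g+geom_boxplot()
g=g+geom_boxplot(data=subset(df, Code==input$tbl), fill='orange')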
My original attempt at this graphic was distorted by very long text labels, which were also misaligned. To get round this, I generated a new label attribute that included line breaks:
#Wordwrapper via:
##http://stackoverflow.com/questions/2351744/insert-line-breaks-in-long-string-word-wrap
#Limit the length of each line to 15 chars
limiter=function(x) gsub('(.{1,15})(\\s|$)', '\\1\n', x)
d$sName=sapply(d$Name,limiter)
#We can then print axis tick labels using d$sName
We can offset the positioning of the label when it is printed:
#Tweak the positioning using vjust, rotate it and also modify label size
g=g+theme( axis.text.x=element_text(angle=-90, vjust = 0.5, size=5) )
The Trust Barchart and Linechart are quite straightforward. The Trust Daily Boxplot is a little more involved. The intention of the Daily plot is to try to identify whether or not there are distributional differences according to the day of the week. (Note that some of the data reports relate to values summed over the weekend, so these charts are likely to show comparatively high values for the Monday figure that carries the weekend reports!)
I ‘borrowed’ a script for identifying days of the week… (I need to tweak the way these are ordered – the original author had a very particular application in mind.)
library('zoo')
library('plyr')
#http://margintale.blogspot.co.uk/2012/04/ggplot2-time-series-heatmaps.html
tmp$year<-as.numeric(as.POSIXlt(tmp$fdate)$year+1900)
# the month too
tmp$month<-as.numeric(as.POSIXlt(tmp$fdate)$mon+1)
# but turn months into ordered factors to control the appearance/ordering in the presentation
tmp$monthf<-factor(tmp$month,levels=as.character(1:12),
                   labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE)
# the day of week is again easily found
tmp$weekday = as.POSIXlt(tmp$fdate)$wday
# again turn into factors to control appearance/abbreviation and ordering
# I use the reverse function rev here to order the week top down in the graph
# you can cut it out to reverse week order
tmp$weekdayf<-factor(tmp$weekday,levels=rev(0:6),labels=rev(c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")),ordered=TRUE)
# the monthweek part is a bit trickier
# first a factor which cuts the data into month chunks
tmp$yearmonth<-as.yearmon(tmp$fdate)
tmp$yearmonthf<-factor(tmp$yearmonth)
# then find the "week of year" for each day
tmp$week <- as.numeric(format(tmp$fdate,"%W"))
# and now for each monthblock we normalize the week to start at 1
tmp<-ddply(tmp,.(yearmonthf),transform,monthweek=1+week-min(week))
The weekdayf value could then be used as the basis for plotting the results by day of week.
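For example (the val column is assumed to hold the reported values, as in the jitter snippet below):

#Distribution of reported values by day of week
g=ggplot(tmp, aes(x=weekdayf, y=val))
g=g+geom_boxplot()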
To add a little more information to the chart, I overplot the boxplot with the actual data points, adding a small amount of jitter to the x-component (the y-value is true).
g=g+geom_point(aes(x=weekdayf,y=val),position = position_jitter(w = 0.3, h = 0))
I guess it would be more meaningful if the data points were actually ordered by week/year. (Indeed, what I originally intended to do was a seasonal subseries style plot at the day level, to see whether there were any trends within a day of week over time, as well as pull out differences at the level of day of week.)
Finally, the Trust datatable shows the actual data values for the selected report and Trust:
(Remember, this data, or the data for this report for each Trust in the selected SHA, can also be downloaded directly as a CSV file.)
The thing I had to learn here was how to disable the printing of the dataframe row names in the Shiny context:
output$view = reactiveTable(function() {
  #...get the data and return it for printing
}, include.rownames=FALSE)
As a learning exercise, this app got me thinking about solving several presentational problems, as well as trying to consider what reports might be informative or pattern revealing (for example, the Daily boxplots).
The biggest problem, of course, is coming up with views that are meaningful and useful to end-users, the sorts of questions they may want to ask of the data, and the sorts of things they may want to pull from it. I have no idea who the users, if any, of the Winter sitrep data as published on the NHS website might be, or how they make use of the data, either in mechanistic terms – what do they actually do with the spreadsheets – or at the informational level – what stories they look for in the data/pull out of it, and what they then use that information for.
This tension is manifest around a lot of public data releases, I think – hacks’n’hackers look for whizzy, shiny(?!) things they can do with the data, though often out of any sort of context other than demonstrating technical prowess or quick technical hacks. Users of the data may possibly struggle with doing anything other than opening the spreadsheet in Excel and then copying and pasting it into other spreadsheets, although they might know exactly what they want to get out of the data as presented to them. Some users may be frustrated at a technical level, in the sense of knowing what they’d like to be able to get from the data (for example, getting monthly timeseries from weekly timeseries spreadsheets) but not being able to do it easily for lack of technical skills. Some users may not know what can be readily achieved with the way data is organised, aggregated and mixed with other datasets, and what this data manipulation then affords in its potential for revealing stories, trends, structures and patterns in the data; here we have a problem with even knowing what value might become unlockable (“Oh, I didn’t know you could do that with it…”). This is one reason why hackdays – such as the NHS Hackday and various govcamps – can be so very useful (I’m also reminded of MashLib/Mashed Library events where library folk and techie shambrarians come together to learn from each other). What I think I’d like to see more of, though, is people with real/authentic questions that might be asked of data, or real answers they’d like to be able to find from data, starting to air them as puzzles that the data junkies, technicians, wranglers and mechanics amongst us can start to play with from a technical side.
PS this could be handy… downloading PDF docs from Shiny.
PPS Radio 4’s Today programme today had a package on the NHS release of surgeon success data. In an early interview with someone from the NHS, the interviewee made the point that the release of the data was there for quality/improvement purposes and to help identify opportunities for supporting best practice (eg along the lines of previous releases of heart surgery performance data). The 8am, after 8 interview, and 8.30am news bulletins all pushed the faux and misleading line of how this data could be used for “patient choice” (I complained bitterly – twice – via Twitter;-), though the raising standards line was taken in the 9am bulletin. There’s real confusion, I think, about how all this data stuff might, could and should be used (I’m thoroughly confused by it myself), and I’m not sure the press are always very helpful in communicating it…