cusumcharter available on CRAN
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I submitted my cusumcharter package to CRAN late last Friday evening and was pleasantly surprised to receive an email early on Monday advising me it was on its way to CRAN.
Big thanks to the CRAN maintainers who manage to work through the queues of packages so quickly.
The package has several functions aimed at making it easier to create CUSUM charts, and more importantly, CUSUM control charts.
Process control charts are all the rage in the NHS, and rightly so – they are pretty easy to understand, and provide a robust mechanism for detecting statistically significant change.
However, these charts rely on signals that are either extremely unlikely to happen (e.g. a single point outside the control limits) or involve being off kilter for some time ( for example, a run of 8 consecutive points either side of the centre line).
Sometimes, you can’t afford to wait that long. Or, you absolutely don’t want a ‘black swan’ event to happen before you realise you have a problem. Other times, you don’t have, or can’t obtain, much data.
CUSUM charts pick up on small shifts of less than 1 standard deviation from the mean / target (whereas control limits are set at 3 sigma, which may or may not also equate to 3 standard deviations, depending on the chart you are using – although most times, it won’t be the same).
The point is – they are good at highlighting that change is underway.
For some reason, I’m getting flashbacks to the Jaws movie – didn’t they rig up some sort of system on the boat, involving barrels, that alerted them that something was afoot? Well, in this case CUSUMS are the barrel system, and SPC and points outside control limits are akin to the bit where the shark pops up and chomps the gnarly old shark hunter (sorry for any spoilers there).
Do you want an early warning? Then CUSUMS are your friend.
I’m loathe to use COVID-19 as an example – but here is some code I rigged up to grab the latest data from Public Health Scotland (which, depending on the time of day, should be recent) and then plots CUSUM charts for each Council Area – 32 in all.
These show rolling 28 day counts, the idea being that the charts will indicate if the number of positive cases is on the increase, or vice versa.
You can of course, adapt this to your needs.
library(dplyr) library(data.table) library(ggplot2) library(cusumcharter) library(ggExtra) # make the link dynamic part1 <- "https://www.opendata.nhs.scot/dataset/" part2 <- "b318bddf-a4dc-4262-971f-0ba329e09b87/" part3 <- "resource/427f9a25-db22-4014-a3bc-893b68243055/" part4 <- "download/trend_ca_" part5 <- ".csv" today <- gsub('-','',as.character(Sys.Date())) link <- paste0(part1, part2, part3, part4, today, part5, sep = '') dates <- seq.Date(as.Date(Sys.Date()-27), as.Date(Sys.Date()), by = '1 day') DT <- data.table::fread(link) DT[, Date := as.character(Date)] DT[, Date := as.IDate(Date, format = "%Y%m%d")] positives <- DT[Date >= as.Date(Sys.Date() -28),.(Date, CAName, DailyPositive)][] ggplot(positives,aes(Date, DailyPositive)) + geom_line() + geom_point() + facet_wrap(~ CAName, ncol = 4, scales = "free_y") + theme_minimal() + labs(x = NULL, y = NULL) + ggExtra::rotateTextX() p <- positives %>% group_by(CAName) %>% group_modify(~ cusum_control(.$DailyPositive), .keep = TRUE) %>% ungroup() %>% group_by(CAName) %>% mutate(Date = dates) %>% ungroup() %>% cusum_control_plot(., xvar = Date, facet_var = CAName, facet_scales = 'free_y', title_text = "CUSUM Rolling 28 Day Positive Cases") p <- p + facet_wrap(~CAName, ncol = 4, scales = 'free_y') print(p)
This results in the following:
Compare to the following, which is a plot of the daily counts
CUSUM charts do not plot values from your original data, instead the value depend on the cumulative sum of the variance between the observed value and the target value, but the plotted values also depend on the previous value, resulting in what looked like a hideous formula, before I managed to decipher it and turn it into R code.
The cusum_control
function does the hard work of calculating the control limits, so you don’t have to worry about it - although it’s a good idea to do your own research on these charts.
There are functions for calculating simple CUSUM statistics using either a supplied target, or the mean or median of the data.
Next steps
When I first started working on this, I had no intention of making a package, I just wanted to get to the point of calculating the control limits. These functions initially, (and to a large extent, still do) assumed a vector was being passed in - I need to create a function that takes a dataframe, and keeps that information to avoid the faffing around with adding the dates column back to the output of the cusum_control function.
(The cusum_control
works from a supplied vector add adds additional columns to it, and returns a separate dataframe).
Also - I need to add in an argument for the number of columns to use in the facets - something which is available in {runcharter} and {spccharter}.
That said, I’m pleased to have this package available - for one, it makes it easier to share my work with my colleagues, and also makes it easier for those who are (rightly) nervous of installing directly from GitHub.
The most pleasing thing was that this actually made it to CRAN at the first time of asking, so I have a better idea now of what is required for future submissions.
This was my second package to make it to CRAN in less than a week ({runcharter} made it the previous Tuesday) and I’ll write up my experience of that, which might help some others get through it.
In the meantime, happy CUSUM-charting!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.