The Problem
We want to have a closer look at the time–dependence of turnaround times (TATs). In particular, we would like to see whether there is a significant trend in TAT over time (improvement or deterioration), and we would like the data to inform us of slowdowns and potentially unexpected problems that occur throughout each week. This should allow us to identify areas of the pre–analytical and/or analytical process that require attention.
My interest in this topic (which in the past seemed entirely banal) came from the frustration of receiving monthly TAT reports showing spaghetti plots produced in Excel. In examining these figures it was entirely unclear to me whether any observed changes in the median (the only measure of central tendency provided) represented stochastic behaviour or a real problem. Ultimately, we want to be able to identify real problems in the pre–analytical and analytical process, but to do this we need to visualize the data in a more sophisticated manner.
To do this, we are going to look at order–to–file times for a whole year for a nameless test X. You should be able to adapt this approach to however your data is provided to you.
The real data was a little dirty but I have pre–cleaned it; that will have to be the topic of another post. In short, I purged the cancelled tests, removed duplicate records and limited my analysis to stat tests based on a stat flag that is stored in the laboratory information system (LIS). I won’t discuss this process here. The buffed–up file is named “2014_and_All_clean.txt”. This happens to be a tab–delimited txt file, so I used read.delim() rather than read.csv(). These are basically the same function with different defaults for the separator: read.delim() expects a tab and read.csv() expects a comma. Please see our first post on TAT to understand how we are using the lubridate function ymd_hm().
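If you want to see that equivalence explicitly, here is a minimal sketch; the file names below are placeholders, not the file used in this post:

#read.delim() and read.csv() are both wrappers around read.table();
#the practical difference is the default separator
#("my_data.txt" and "my_data.csv" are placeholder file names)
tabData   <- read.delim("my_data.txt")                             #sep = "\t" by default
tabData2  <- read.table("my_data.txt", sep = "\t", header = TRUE)  #essentially the same call
commaData <- read.csv("my_data.csv")                               #sep = "," by default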
Loading the Data
library(lubridate)

myData <- read.delim(file = "2014_and_All_clean.txt")
myData$ordered <- ymd_hm(myData$ordered)
myData$collected <- ymd_hm(myData$collected)
myData$received <- ymd_hm(myData$received)
myData$resulted <- ymd_hm(myData$resulted)

#confirm success
head(myData)

##               ordered           collected            received
## 1 2014-01-01 17:53:00 2014-01-01 17:54:00 2014-01-01 18:08:00
## 2 2014-01-01 15:10:00 2014-01-01 15:19:00 2014-01-01 15:21:00
## 3 2014-01-01 17:07:00 2014-01-01 17:15:00 2014-01-01 17:17:00
## 4 2014-01-01 18:20:00 2014-01-01 18:30:00 2014-01-01 18:35:00
## 5 2014-01-01 11:19:00 2014-01-01 11:25:00 2014-01-01 11:29:00
## 6 2014-01-01 11:00:00 2014-01-01 11:08:00 2014-01-01 11:11:00
##              resulted
## 1 2014-01-01 18:45:00
## 2 2014-01-01 16:33:00
## 3 2014-01-01 18:09:00
## 4 2014-01-01 19:45:00
## 5 2014-01-01 13:33:00
## 6 2014-01-01 11:47:00
Now we want to look at a TAT. As in our first post on this topic, we will look at the order–to–file time.
otf <- difftime(myData$resulted, myData$ordered, units = "min")
myData <- cbind(myData, otf)
Sanity Check
Let’s just have a quick look at this to make sure nothing crazy is happening.
hist(as.numeric(myData$otf), xlim = c(0,200), breaks = 150, col = "orange",
     xlab = "TAT for X (min)", main = "Histogram of TAT for X")
summary(as.numeric(myData$otf))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    3.00   51.00   62.00   70.85   77.00 1656.00
Some Nutty Stuff
We do note one thing: there is a sample with a TAT of 1656 min. This is a little crazy, so we could investigate such samples to see whether the delay is real (because of a lost sample) or an artifact of an add–on analysis being misidentified as a stat, or some other similarly nonsensical event.
If you wanted to list all of these extreme outliers for the year, you could do so like this:
library("dplyr")

##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:lubridate':
##
##     intersect, setdiff, union
##
## The following objects are masked from 'package:stats':
##
##     filter, lag
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

head(arrange(myData, desc(otf)), 10)

##                ordered           collected            received
## 1  2014-09-21 21:50:00 2014-09-21 21:58:00 2014-09-21 22:04:00
## 2  2014-09-07 14:59:00 2014-09-08 12:00:00 2014-09-08 12:04:00
## 3  2014-03-18 13:45:00 2014-03-18 14:10:00 2014-03-18 14:29:00
## 4  2014-04-01 01:48:00 2014-04-01 02:03:00 2014-04-01 02:06:00
## 5  2014-04-01 02:21:00 2014-04-01 02:28:00 2014-04-01 02:31:00
## 6  2014-09-04 21:35:00 2014-09-04 21:45:00 2014-09-04 22:08:00
## 7  2014-10-08 10:38:00 2014-10-08 10:42:00 2014-10-08 10:46:00
## 8  2014-05-25 13:23:00 2014-05-25 13:35:00 2014-05-25 13:45:00
## 9  2014-08-06 16:50:00 2014-08-06 17:09:00 2014-08-06 17:15:00
## 10 2014-08-02 11:52:00 2014-08-02 21:30:00 2014-08-02 21:44:00
##               resulted       otf
## 1  2014-09-23 01:26:00 1656 mins
## 2  2014-09-08 13:26:00 1347 mins
## 3  2014-03-19 11:17:00 1292 mins
## 4  2014-04-01 17:29:00  941 mins
## 5  2014-04-01 16:20:00  839 mins
## 6  2014-09-05 11:07:00  812 mins
## 7  2014-10-08 22:41:00  723 mins
## 8  2014-05-26 01:20:00  717 mins
## 9  2014-08-07 03:46:00  656 mins
## 10 2014-08-02 22:44:00  652 mins
which gives you the TAT of the 10 (or whatever number you prefer) worst specimens for the year. Obviously, when you do this kind of analysis on your own data, you will retain the specimen ID in the data set and you can explore what is going on here, such as whether these are add–ons. You discover interesting things when you dig into your data.
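As an aside, if you preferred a complete list of everything above a fixed cutoff rather than the ten worst, a sketch like the following would do it. The 240 min threshold is arbitrary and the specimen_id column is hypothetical (my de-identified data does not carry one):

#list every specimen whose order-to-file time exceeds an arbitrary 240 min cutoff
slow <- subset(myData, as.numeric(otf) > 240)
slow <- arrange(slow, desc(otf))   #worst first, using dplyr again
nrow(slow)                         #how many specimens exceeded the cutoff
#slow$specimen_id                  #hypothetical ID column you would take back to the LIS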
Time Dependence
But we are interested in time-dependence of the TAT, so let’s look at a scatterplot of the whole year.
start <- ceiling_date(min(myData$collected))
finish <- start + days(365)

plot(myData$collected, myData$otf, pch = 19, xlim = c(start, finish),
     col = "#00000020", cex = 0.5, ylim = c(0,200),
     ylab = "TAT of X (min)", xlab = "Date")
So, that’s pretty hard to draw inferences from. We can see that there are some outliers with inconceivably low TAT. We will have to investigate what is going on with those collections but not right now. These outliers will not affect the non–parametric measures of central tendency.
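To see why a few wild values leave the median essentially untouched, here is a tiny illustration with made-up numbers, one impossibly low and one absurdly high:

tat <- c(1, 45, 50, 55, 60, 1656)  #made-up TATs with two extreme values
mean(tat)                          #311.17, pulled around by the extremes
median(tat)                        #52.5, essentially unmoved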
Tunnelling Down
Let’s have a look at one week.
finish <- start + days(7)
ticks <- seq(from = start, to = finish, by = "days")

plot(myData$collected, myData$otf, pch = 19, xlim = c(start, finish),
     col = "#00000020", cex = 0.5, ylim = c(0,200),
     ylab = "TAT of X (min)", xlab = "", xaxt = "n")
axis.POSIXct(side = 1, ticks, at = ticks, las = 2, cex.axis = 0.6,
             col.axis = "gray30", format = "%b %d %Y")
mtext("Date of analysis", side = 1, line = 4)
See the first post on this topic for more information about the plotting parameters.
We can see there is a definite (and unsurprising) periodicity in the number of tests per hour. We can look at “volumes” another time. What we want to do now is look for time–dependence in the TAT so we can ultimately investigate which days of the week and times of the day are worse. But we don’t want to do this for one week; we want to do this for all weeks in the year. It would be nice, for example, to plot all the Sundays, Mondays, Tuesdays etc. overlapping and then see if we can see day–of–week and time–of–day trends.
Some More Lubridate Magic
Therefore, we need to assign every point in our myData dataframe a day of the week. The lubridate function wday() does this for us.
start

## [1] "2014-01-01 02:39:00 UTC"

wday(start, label = TRUE)

## [1] Wed
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

#if you want them as numbers, just leave the label option out.
wday(start)

## [1] 4
So, January 1, 2014 was a Wednesday, which is the 4th day of the week. Let’s assign the day of the week for all our days and then bind this to our data.
weekday <- wday(myData$collected)
myData <- cbind(myData, weekday)
head(myData)

##               ordered           collected            received
## 1 2014-01-01 17:53:00 2014-01-01 17:54:00 2014-01-01 18:08:00
## 2 2014-01-01 15:10:00 2014-01-01 15:19:00 2014-01-01 15:21:00
## 3 2014-01-01 17:07:00 2014-01-01 17:15:00 2014-01-01 17:17:00
## 4 2014-01-01 18:20:00 2014-01-01 18:30:00 2014-01-01 18:35:00
## 5 2014-01-01 11:19:00 2014-01-01 11:25:00 2014-01-01 11:29:00
## 6 2014-01-01 11:00:00 2014-01-01 11:08:00 2014-01-01 11:11:00
##              resulted      otf weekday
## 1 2014-01-01 18:45:00  52 mins       4
## 2 2014-01-01 16:33:00  83 mins       4
## 3 2014-01-01 18:09:00  62 mins       4
## 4 2014-01-01 19:45:00  85 mins       4
## 5 2014-01-01 13:33:00 134 mins       4
## 6 2014-01-01 11:47:00  47 mins       4
Now let’s plot all of the Monday data for the whole year and look at the within–day trends for Mondays. We are going to convert all of the TATs and the times at which they were collected to decimal numbers so we don’t run into any hassles. (Yes, I ran into hassles when I did not do this.)
This little function accomplishes this for us:
#convert the time to a decimal number of hours since midnight for simplicity in plotting
timeconvert <- function(t){hour(t) + minute(t)/60}

mondayData <- subset(myData, weekday == 2)
mondayTimes <- timeconvert(mondayData$collected)
mondayData <- cbind(mondayData, times = mondayTimes)
mondayData$otf <- as.numeric(mondayData$otf)
head(mondayData)

##                 ordered           collected            received
## 225 2014-01-06 23:35:00 2014-01-06 23:30:00 2014-01-07 00:09:00
## 226 2014-01-06 11:02:00 2014-01-06 11:10:00 2014-01-06 11:14:00
## 227 2014-01-06 14:08:00 2014-01-06 14:20:00 2014-01-06 14:24:00
## 228 2014-01-06 10:06:00 2014-01-06 10:16:00 2014-01-06 10:19:00
## 229 2014-01-06 14:42:00 2014-01-06 15:09:00 2014-01-06 15:16:00
## 230 2014-01-06 20:02:00 2014-01-06 20:17:00 2014-01-06 20:22:00
##                resulted otf weekday    times
## 225 2014-01-07 00:46:00  71       2 23.50000
## 226 2014-01-06 12:42:00 100       2 11.16667
## 227 2014-01-06 16:02:00 114       2 14.33333
## 228 2014-01-06 11:44:00  98       2 10.26667
## 229 2014-01-06 16:04:00  82       2 15.15000
## 230 2014-01-06 20:55:00  53       2 20.28333
So this seems to have worked and now we can make a scatter plot.
Monday Monday, So Good to Me?
plot(mondayData$times, mondayData$otf, pch = 19, col = "#00000020", cex = 0.5,
     ylim = c(0,200), xlim = c(0,24), xaxt = "n",
     xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")
But now for the interesting part. We want to see how the median TAT is related to the time of day. We might want to look at, say, the running median over a one–hour window all day long. Notice that I have made the times, t, go from 0.5 to 23.5 because these are the only times for which a 60 min moving median can be calculated without leaving the day. Otherwise we’d have the really annoying situation of having to fetch data from the last half–hour of Sunday and the first half–hour of Tuesday. I don’t need that level of perfectionism at present.
plot(mondayData$times, mondayData$otf, pch = 19, col = "#00000020", cex = 0.5,
     ylim = c(30,100), xlim = c(0,24), xaxt = "n",
     xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

#60 min moving median calculation:
#create a point at every minute of the day
t <- seq(from = 0.5, to = 23.5, by = 1/60)

#create an empty vector to store the data
med.tats <- vector()

for(i in 1:length(t)){
  #get all points within half an hour of t[i] (a one-hour window)
  tats <- subset(mondayData, (mondayData$times >= (t[i] - 0.5)) & (mondayData$times < (t[i] + 0.5)))
  #alternatively using filter() from the dplyr package
  #tats <- filter(mondayData, times >= (t[i] - 0.5) & times < (t[i] + 0.5))
  med.tats[i] <- median(tats$otf)
}

lines(t, med.tats, col = "blue")
grid(NA, NULL)
Removing the For–ness
Many R folks don’t like for–loops and would rather use the apply() family of functions. I’m not sure I always understand the contempt for loops in small, simple tasks, but if you wanted to accomplish the same looping task without a for–loop, you could do the following:
t <- seq(from = 0.5, to = 23.5, by = 1/60)

movingmed <- function(t, x){
  tats <- filter(x, times >= (t - 0.5) & times < (t + 0.5))
  median(tats$otf)
}

med.tats <- sapply(t, movingmed, x = mondayData)
Smoothing
This approach is reasonable, but the problem (I have found) is that it is computationally expensive on large data sets. For this reason, it is nice to use a canned smoothing algorithm like LOWESS, which is much faster. The parameter f of the lowess() function has a default of 2/3, which in our case results in a fit that is far too smooth. I played around with f until I got something that more or less tracked the 60–min moving median; a quick comparison of a few values of f appears after the next plot. There are many approaches to smoothing–don’t get lost in the vortex.
Lowess Smoothing
plot(mondayData$times, mondayData$otf, pch = 19, col = "#00000020", cex = 0.5,
     ylim = c(30,100), xlim = c(0,24), xaxt = "n",
     xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

#running median
#create a point at every minute of the day
t <- seq(from = 0.5, to = 23.5, by = 1/60)
med.tats <- vector()

for(i in 1:length(t)){
  #get all points within half an hour of t[i] (a one-hour window)
  tats <- subset(mondayData, (mondayData$times >= (t[i] - 0.5)) & (mondayData$times < (t[i] + 0.5)))
  #alternatively using filter() from the dplyr package
  #tats <- filter(mondayData, times >= (t[i] - 0.5) & times < (t[i] + 0.5))
  med.tats[i] <- median(tats$otf)
}

lines(t, med.tats, col = "blue")
grid(col = "black")

#lowess fit with a much smaller span than the default f = 2/3
mondayFit <- lowess(mondayData$times, mondayData$otf, f = 0.05)
lines(mondayFit, col = "red")
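To see for yourself how the choice of f changes the lowess() fit, here is a quick sketch overlaying a few spans on the Monday data; the particular values of f are arbitrary.

#compare a few lowess() spans; the default f = 2/3 is far too smooth here
fs   <- c(2/3, 0.2, 0.05)
cols <- c("darkgreen", "purple", "red")
plot(mondayData$times, mondayData$otf, pch = 19, col = "#00000020", cex = 0.5,
     ylim = c(30,100), xlim = c(0,24), xaxt = "n",
     xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")
for (i in seq_along(fs)){
  lines(lowess(mondayData$times, mondayData$otf, f = fs[i]), col = cols[i])
}
legend("topright", legend = paste("f =", signif(fs, 2)), col = cols, lty = 1, cex = 0.7)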
So, that’s cool. Now let’s loop over all the days of the week and make plots for each day.
#create a 2x4 plot window
par(mfrow = c(2,4))

#loop over days
for (i in 1:7){
  #subset the data for day i and make the lowess fit
  mydayData <- subset(myData, weekday == i)
  mydayTimes <- timeconvert(mydayData$collected)
  mydayData <- cbind(mydayData, times = mydayTimes)
  mydayFit <- lowess(mydayData$times, mydayData$otf, f = 0.05)

  plot(NA, NA, ylim = c(30,100), xlim = c(0,24), xaxt = "n",
       xlab = "Hour of Day", ylab = "Turnaround Time (min)")
  axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

  #running median
  #create a point at every minute of the day
  t <- seq(from = 0.5, to = 23.5, by = 1/60)
  med.tats <- vector()
  for(j in 1:length(t)){
    #get all points within half an hour of t[j] (a one-hour window)
    tats <- subset(mydayData, (mydayData$times >= (t[j] - 0.5)) & (mydayData$times < (t[j] + 0.5)))
    #alternatively using filter() from the dplyr package
    #tats <- filter(mydayData, times >= (t[j] - 0.5) & times < (t[j] + 0.5))
    med.tats[j] <- median(tats$otf)
  }
  lines(t, med.tats, col = "blue")
  grid(col = "black")

  lines(mydayFit, col = "red")

  #put in horizontal gridding
  grid(NA, NULL)

  #you could add a legend on all the plots if you wanted
  #legend("bottomright", c("moving median","lowess"), lty = c(1,1), col = c("blue","red"), bty = "n")
}
Now, let’s overplot all the lowess fits on a single graph and see what practical observations we can make. I have increased the lowess() smoothing to make things easier to look at.
par(mfrow = c(1,1))

plot(0, 0, ylim = c(40,80), xlim = c(0,24), xaxt = "n",
     xlab = "Hour of Day", ylab = "Turnaround Time (min)")
axis(side = 1, 0:24, at = 0:24, las = 2, cex.axis = 0.6, col.axis = "gray30")

for (i in 1:7){
  mydayData <- subset(myData, weekday == i)
  mydayTimes <- timeconvert(mydayData$collected)
  mydayData <- cbind(mydayData, times = mydayTimes)
  mydayFit <- lowess(mydayData$times, mydayData$otf, f = 0.1)
  lines(mydayFit, col = rainbow(7)[i])
}

legend("bottomright", as.character(wday(1:7, label = TRUE)),
       col = rainbow(7), lty = rep(1,7), cex = 0.5)
Observations
We can immediately see some issues. Weekends in the early hours of the morning are bad. 8 am is bad across all days. Noon is generally problematic and particularly so on Saturdays. There is also a slowdown in mid–afternoon and in the early evenings. Saturday midnight is the most problematic time, although the endpoints of the figure have fewer local weighting points and their confidence intervals are wider. This is something we can cover another time.
Remember, also, this is only the median we have looked at. Other horrors may be lurking in the 90th percentile.
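If you wanted to peek, a rough sketch using base R’s aggregate() would tabulate the 90th percentile of TAT by weekday and hour of collection; the otf_min and hour columns are created on the fly here and are not part of the data set built above.

#90th percentile of TAT by day of week and hour of collection
myData$otf_min <- as.numeric(myData$otf)
myData$hour    <- hour(myData$collected)
p90 <- aggregate(otf_min ~ weekday + hour, data = myData,
                 FUN = quantile, probs = 0.9)
head(p90)   #first few rows of the weekday-by-hour table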
Next time, we will move all of this TAT visualization to a 3D representation so we can more easily spot the problematic times.
-Dan
The lot is cast into the lap, but its every decision is from the LORD.
Proverbs 16:33