Strange behavior from the cut function with dates in R
I recently encountered some strange behavior from R when using the cut.POSIXt method with “day” as the interval specification. The function isn’t doing what I intended, and I doubt that it is working properly. I’ll show you the behavior I’m seeing (and what I was expecting), then my current base R workaround. To generate a reproducible example, I’ll use this latemail function, which I gleaned from this Stack Overflow post.
latemail <- function(N, st = "2013/01/01", et = "2013/12/31") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  # length of the window in seconds
  dt <- as.numeric(difftime(et, st, units = "secs"))
  # N sorted random offsets within the window
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
}
And generate some data…
set.seed(7110)
# generate 1000 random POSIXct dates and times
bar <- data.frame("date" = latemail(1000, st = "2013/03/02", et = "2013/03/30"))
# assign factor codes based on the day portion of the POSIXct object
bar$dateCut <- cut(bar$date, "day", labels = FALSE)
I expected that all rows with the date 2013-03-01 would receive factor 1, all rows with the date 2013-03-02 would receive factor 2, and so on. At first glance this seems to be what is happening.
head(bar, 10)
                  date dateCut
1  2013-03-01 19:10:31       1
2  2013-03-01 19:31:31       1
3  2013-03-01 19:55:02       1
4  2013-03-01 20:09:36       1
5  2013-03-01 20:13:32       1
6  2013-03-01 22:15:42       1
7  2013-03-01 22:16:06       1
8  2013-03-01 23:41:50       1
9  2013-03-02 00:30:53       2
10 2013-03-02 01:08:52       2
Note that at row 9 the date changes from March 1 to March 2 and the factor (dateCut) changes from 1 to 2. So far so good. But we shall see some strange things in the midnight hour.
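A side note on why the data starts on March 1 even though st was "2013/03/02": as.POSIXct() applied to a Date gives midnight UTC, and the result is then printed in the session's local time zone. The sketch below illustrates this. The zone "America/Chicago" is purely an assumption for illustration; the post does not say where the session was run.

```r
# Illustration only: "America/Chicago" is an assumed local zone, not stated in the post.
st <- as.POSIXct(as.Date("2013/03/02"))   # Date -> POSIXct lands on midnight UTC

# The same instant, rendered in two different time zones
utc_view   <- format(st, "%Y-%m-%d %H:%M", tz = "UTC")
local_view <- format(st, "%Y-%m-%d %H:%M", tz = "America/Chicago")

print(utc_view)    # "2013-03-02 00:00"
print(local_view)  # "2013-03-01 18:00"
```

So the random times are drawn from a window that begins in the evening of March 1 in any US time zone.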
For additional locations where I see the expected behavior you can check
bar[ c(259, 260, 294, 295), ]
259 2013-03-08 23:22:15  8
260 2013-03-09 00:11:08  9
294 2013-03-09 23:59:11  9
295 2013-03-10 00:56:19 10
Now the weirdness.
bar[320:326, ]
320 2013-03-10 22:14:22 10
321 2013-03-10 22:28:03 10
322 2013-03-11 00:08:27 10
323 2013-03-11 00:30:08 10
324 2013-03-11 00:56:23 10
325 2013-03-11 01:19:54 11
326 2013-03-11 01:22:43 11
At row 322 the date changes from March 10 to March 11, but the dateCut factor doesn’t change until row 325. After 1:00 AM things seem to behave as expected. At first I thought some sort of floor rounding was pushing midnight back to the previous day, but notice that the earlier examples included times between midnight and 1:00 AM that were cut as expected. More examples of the weirdness:
bar[398:405,]
398 2013-03-12 23:56:20 12
399 2013-03-13 00:53:47 12
400 2013-03-13 01:30:33 13
401 2013-03-13 01:45:31 13
bar[430:435,]
430 2013-03-13 23:45:48 13
431 2013-03-14 00:28:40 13
432 2013-03-14 00:46:24 13
433 2013-03-14 00:55:16 13
434 2013-03-14 01:33:19 14
435 2013-03-14 02:02:45 14
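A possible explanation, offered as a hypothesis rather than a confirmed diagnosis: US daylight saving time began on 2013-03-10, and R documents "day" in seq.POSIXt as a fixed 24 hours, whereas "DSTday" keeps the same clock time each day. If the daily breaks are spaced 24 hours apart, they would drift from midnight to 1:00 AM once the clocks spring forward, which matches the pattern above. The sketch below assumes the time zone "America/Chicago", which is not stated in the post.

```r
# Assumption: a US time zone that observed DST in March 2013.
tz <- "America/Chicago"
start <- as.POSIXct("2013-03-09 00:00:00", tz = tz)

# "day" steps by exactly 24 hours; "DSTday" steps to the same wall-clock time
by_day    <- seq(start, by = "day",    length.out = 4)
by_dstday <- seq(start, by = "DSTday", length.out = 4)

print(format(by_day,    "%Y-%m-%d %H:%M %Z"))  # last two entries fall at 01:00
print(format(by_dstday, "%Y-%m-%d %H:%M %Z"))  # every entry falls at 00:00
```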
I see even stranger behavior when I truncate to just the date.
bar$datetrunc <- trunc(bar$date, "day")
bar$truncCut <- cut(bar$datetrunc, "day", labels = FALSE)
Again, things work fine for a while.
head(bar, 10)
                  date dateCut  datetrunc truncCut
1  2013-03-01 19:10:31       1 2013-03-01        1
2  2013-03-01 19:31:31       1 2013-03-01        1
3  2013-03-01 19:55:02       1 2013-03-01        1
4  2013-03-01 20:09:36       1 2013-03-01        1
5  2013-03-01 20:13:32       1 2013-03-01        1
6  2013-03-01 22:15:42       1 2013-03-01        1
7  2013-03-01 22:16:06       1 2013-03-01        1
8  2013-03-01 23:41:50       1 2013-03-01        1
9  2013-03-02 00:30:53       2 2013-03-02        2
10 2013-03-02 01:08:52       2 2013-03-02        2
But eventually wind up worse than ever.
bar[320:330,]
                   date dateCut  datetrunc truncCut
320 2013-03-10 22:14:22      10 2013-03-10       10
321 2013-03-10 22:28:03      10 2013-03-10       10
322 2013-03-11 00:08:27      10 2013-03-11       10
323 2013-03-11 00:30:08      10 2013-03-11       10
324 2013-03-11 00:56:23      10 2013-03-11       10
325 2013-03-11 01:19:54      11 2013-03-11       10
326 2013-03-11 01:22:43      11 2013-03-11       10
327 2013-03-11 02:29:34      11 2013-03-11       10
328 2013-03-11 02:34:23      11 2013-03-11       10
329 2013-03-11 02:51:47      11 2013-03-11       10
330 2013-03-11 03:11:00      11 2013-03-11       10
The dateCut factor changes 3 rows too late, but the truncCut factor stays stuck at 10 for a long time (47 rows). At row 369, the dateCut factor changes to 12 (correctly) and the truncCut factor finally turns over to 11.
bar[365:375,]
                   date dateCut  datetrunc truncCut
365 2013-03-11 19:49:05      11 2013-03-11       10
366 2013-03-11 21:19:31      11 2013-03-11       10
367 2013-03-11 21:31:58      11 2013-03-11       10
368 2013-03-11 22:06:44      11 2013-03-11       10
369 2013-03-12 02:45:14      12 2013-03-12       11
370 2013-03-12 03:14:56      12 2013-03-12       11
371 2013-03-12 04:02:03      12 2013-03-12       11
372 2013-03-12 05:12:03      12 2013-03-12       11
373 2013-03-12 05:31:53      12 2013-03-12       11
374 2013-03-12 05:56:08      12 2013-03-12       11
375 2013-03-12 06:40:45      12 2013-03-12       11
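If the cause is a daylight saving transition (an assumption, since the session's time zone isn't given), the truncCut result would also make sense: a timestamp truncated to local midnight sits before a break that has drifted to 1:00 AM, so every row on that date lands in the previous bin. cut.POSIXt also accepts "DSTday" as an interval specification, which may be worth trying. A minimal sketch, using a few of the timestamps above and the assumed zone "America/Chicago":

```r
# Assumption: the data were viewed in a US zone with DST starting 2013-03-10.
tz <- "America/Chicago"
x <- as.POSIXct(c("2013-03-10 22:14:22",
                  "2013-03-11 00:08:27",
                  "2013-03-11 00:56:23",
                  "2013-03-11 01:19:54"), tz = tz)

# Breaks spaced 24 hours apart versus breaks at local midnight
day_codes    <- cut(x, "day",    labels = FALSE)
dstday_codes <- cut(x, "DSTday", labels = FALSE)

print(day_codes)
print(dstday_codes)  # the three March 11 times share one code
```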
My initial sidestep involved the rank() function (it achieved the desired result, but was S L O W). I won’t torture you with it here. I consulted with Dr. Erin Hodgess and devised this workaround, which is pretty speedy.
# one entry per distinct truncated date, in order of first appearance
foo <- unique(bar$datetrunc)
# the position of each row's date in that list becomes its day code
bar$truncMatch <- match(bar$datetrunc, foo)
Here’s that strange section where the truncCut factor behaved so poorly. No problem for my new truncMatch factor.
bar[320:330,]
                   date dateCut  datetrunc truncCut truncMatch
320 2013-03-10 22:14:22      10 2013-03-10       10         10
321 2013-03-10 22:28:03      10 2013-03-10       10         10
322 2013-03-11 00:08:27      10 2013-03-11       10         11
323 2013-03-11 00:30:08      10 2013-03-11       10         11
324 2013-03-11 00:56:23      10 2013-03-11       10         11
325 2013-03-11 01:19:54      11 2013-03-11       10         11
326 2013-03-11 01:22:43      11 2013-03-11       10         11
327 2013-03-11 02:29:34      11 2013-03-11       10         11
328 2013-03-11 02:34:23      11 2013-03-11       10         11
329 2013-03-11 02:51:47      11 2013-03-11       10         11
330 2013-03-11 03:11:00      11 2013-03-11       10         11
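One thing to keep in mind about the match()/unique() approach: it numbers the dates that actually occur, so a day with no rows does not leave a gap in the codes. If calendar-day offsets are wanted instead, a variation based on Date arithmetic is sketched below. The toy timestamps and the zone "America/Chicago" are assumptions for illustration, not taken from the original data.

```r
# Hypothetical toy data; note that no timestamp falls on 2013-03-12.
tz <- "America/Chicago"
x <- as.POSIXct(c("2013-03-10 22:14:22",
                  "2013-03-11 00:08:27",
                  "2013-03-11 01:19:54",
                  "2013-03-13 00:53:47"), tz = tz)

# Calendar date of each timestamp in the stated time zone
day <- as.Date(x, tz = tz)

# Consecutive codes for the dates present (same idea as the workaround above)
by_match <- match(day, unique(day))

# Calendar-day offset from the first date, so missing days leave gaps
by_offset <- as.integer(day - min(day)) + 1L

print(by_match)   # 1 2 2 3
print(by_offset)  # 1 2 2 4
```

Which of the two is preferable depends on whether empty days should consume a code.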