Site icon R-bloggers

Strange behavior from the cut function with dates in R

[This article was first published on Paleocave Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I recently encountered some strange behavior from R when using the cut.POSIXt method with “day” as the interval specification. This function isn’t working as I intended and I doubt that it is working properly. I’ll show you the behavior I’m seeing (and what I was expecting) then I’ll show you my current base R workaround. To generate a reproducible example, I’ll use this latemail function I gleaned from this stack overflow post.

latemail <- function(N, st="2013/01/01", et="2013/12/31") {
 st <- as.POSIXct(as.Date(st))
 et <- as.POSIXct(as.Date(et))
 dt <- as.numeric(difftime(et,st,unit="sec"))
 ev <- sort(runif(N, 0, dt))
 rt <- st + ev
 }

And generate some data…

set.seed(7110)
#generate 1000 random POSIXlt dates and times
bar<-data.frame("date"=latemail(1000, st="2013/03/02", et="2013/03/30"))
# assign factors based on the day portion of the POSIXlt object
bar$dateCut <- cut(bar$date, "day", labels = FALSE)

I expected that all rows with the date 2013-03-01 would receive factor 1, all rows with the date 2013-03-02 would receive factor 2, and so on. At first glance this seems to be what is happening.

head(bar, 10)
     date                 dateCut
1    2013-03-01 19:10:31  1
2    2013-03-01 19:31:31  1
3    2013-03-01 19:55:02  1
4    2013-03-01 20:09:36  1
5    2013-03-01 20:13:32  1
6    2013-03-01 22:15:42  1
7    2013-03-01 22:16:06  1
8    2013-03-01 23:41:50  1
9    2013-03-02 00:30:53  2
10   2013-03-02 01:08:52  2

Note that at row 9 the date changes from March 1 to March 2 and the factor (dateCut) changes from 1 to 2. So far so good. But we shall see some strange things in the midnight hour.  

For additional locations where I see the expected behavior you can check

bar[ c(259, 260, 294, 295), ]
259  2013-03-08 23:22:15  8
260  2013-03-09 00:11:08  9
294  2013-03-09 23:59:11  9
295  2013-03-10 00:56:19  10

Now the weirdness.

bar[320:326, ]
320  2013-03-10 22:14:22  10
321  2013-03-10 22:28:03  10
322  2013-03-11 00:08:27  10
323  2013-03-11 00:30:08  10
324  2013-03-11 00:56:23  10
325  2013-03-11 01:19:54  11
326  2013-03-11 01:22:43  11

At row 322 the date changes from March 10 to March 11 but the dateCut factor doesn’t change until line 325. After 1:00 AM things seem to behave as expected. At first I thought maybe some sort of floor rounding was going on which was rounding midnight back to the previous day, but notice that the previous examples included times between midnight and 1:00 that were cut as expected. More weirdness examples:

bar[398:405,]
398  2013-03-12 23:56:20  12
399  2013-03-13 00:53:47  12
400  2013-03-13 01:30:33  13
401  2013-03-13 01:45:31  13
bar[430:435,]
430  2013-03-13 23:45:48  13
431  2013-03-14 00:28:40  13
432  2013-03-14 00:46:24  13
433  2013-03-14 00:55:16  13
434  2013-03-14 01:33:19  14
435  2013-03-14 02:02:45  14

I see even stranger behavior when I truncate to just the date.

bar$datetrunc=trunc(bar$date, "day")  
bar$truncCut <- cut(bar$datetrunc, "day", labels = FALSE) 

Again, things work fine for a while

head(bar, 10)
   date             dateCut datetrunc truncCut
1  2013-03-01 19:10:31 1   2013-03-01  1
2  2013-03-01 19:31:31 1   2013-03-01  1
3  2013-03-01 19:55:02 1   2013-03-01  1
4  2013-03-01 20:09:36 1   2013-03-01  1
5  2013-03-01 20:13:32 1   2013-03-01  1
6  2013-03-01 22:15:42 1   2013-03-01  1
7  2013-03-01 22:16:06 1   2013-03-01  1
8  2013-03-01 23:41:50 1   2013-03-01  1
9  2013-03-02 00:30:53 2   2013-03-02  2
10 2013-03-02 01:08:52 2   2013-03-02  2

But eventually wind up worse than ever.

bar[320:330,]
    date               dateCut datetrunc truncCut
320 2013-03-10 22:14:22  10  2013-03-10  10
321 2013-03-10 22:28:03  10  2013-03-10  10
322 2013-03-11 00:08:27  10  2013-03-11  10
323 2013-03-11 00:30:08  10  2013-03-11  10
324 2013-03-11 00:56:23  10  2013-03-11  10
325 2013-03-11 01:19:54  11  2013-03-11  10
326 2013-03-11 01:22:43  11  2013-03-11  10
327 2013-03-11 02:29:34  11  2013-03-11  10
328 2013-03-11 02:34:23  11  2013-03-11  10
329 2013-03-11 02:51:47  11  2013-03-11  10
330 2013-03-11 03:11:00  11  2013-03-11  10

The timeCut factor changes 3 rows too late but the truncCut factor stays stuck at 10 for a long time (47 rows). At row 369, the timeCut factor changes to 12 (correctly) and the truncCut factor finally turns over to 11.

bar[365:375,]
    date              dateCut datetrunc truncCut
365 2013-03-11 19:49:05  11  2013-03-11  10
366 2013-03-11 21:19:31  11  2013-03-11  10
367 2013-03-11 21:31:58  11  2013-03-11  10
368 2013-03-11 22:06:44  11  2013-03-11  10
369 2013-03-12 02:45:14  12  2013-03-12  11
370 2013-03-12 03:14:56  12  2013-03-12  11
371 2013-03-12 04:02:03  12  2013-03-12  11
372 2013-03-12 05:12:03  12  2013-03-12  11
373 2013-03-12 05:31:53  12  2013-03-12  11
374 2013-03-12 05:56:08  12  2013-03-12  11
375 2013-03-12 06:40:45  12  2013-03-12  11

My initial sidestep involved the rank() function (it achieved the desired result, but was S L O W). I won’t torture you with it here. I consulted with Dr. Erin Hodgess and devised this work around, which is pretty speedy.

foo <- unique(bar$datetrunc)
bar$truncMatch <- match(bar$datetrunc, foo)

Here’s that strange section where the truncCut factor behaved so poorly. No problem for my new truncMatch factor.

 

bar[320:330,]
    date            dateCut datetrunc truncCut truncMatch
320 2013-03-10 22:14:22  10  2013-03-10  10   10
321 2013-03-10 22:28:03  10  2013-03-10  10   10
322 2013-03-11 00:08:27  10  2013-03-11  10   11
323 2013-03-11 00:30:08  10  2013-03-11  10   11
324 2013-03-11 00:56:23  10  2013-03-11  10   11
325 2013-03-11 01:19:54  11  2013-03-11  10   11
326 2013-03-11 01:22:43  11  2013-03-11  10   11
327 2013-03-11 02:29:34  11  2013-03-11  10   11
328 2013-03-11 02:34:23  11  2013-03-11  10   11
329 2013-03-11 02:51:47  11  2013-03-11  10   11
330 2013-03-11 03:11:00  11  2013-03-11  10   11


 

To leave a comment for the author, please follow the link and comment on their blog: Paleocave Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.