In any programming tool, dates, times, and timezones are hard. Deceptively hard. They’ve been shaped by politics and whimsy for hundreds of years: timezones can shift with minimal notice, countries have skipped or repeated certain days, some zones are offset by weird increments, some observe Daylight Saving Time, and then there are leap years and leap seconds; the list goes on. Luckily, we rarely need to worry about most of those details, because other teams of very smart people have spent a lot of time providing nice abstractions that handle most of the weird edge cases for us.
Here at Methods we’ve been bitten by some timezone oddities in R recently, so I wanted to focus on that part of R’s “date/time stack”, starting with a crash course on datetimes and timezones before showing how to avoid common errors, with a special focus on reading from and writing to files.
Timezones in base R
Let’s start with a single string representing noon on the first day of this year. We want to make this a proper date-time object, which in R means the POSIXct class; we can do this with as.POSIXct().

```r
x <- "2018-01-01 12:00:00"
as.POSIXct(x)
## [1] "2018-01-01 12:00:00 EST"
```
Notice the timezone code at the end; it defaulted to “EST” because of my default timezone, which might be different for you.
```r
Sys.timezone()
## [1] "America/Detroit"
```
If our x isn’t supposed to represent Eastern time, we can specify a different timezone as an argument to as.POSIXct().

```r
as.POSIXct(x, tz = "America/Chicago")
## [1] "2018-01-01 12:00:00 CST"
```
Notice the only thing that changed is the timezone; the clock time is the same. CST and EST are an hour apart, though, so our two conversions represent different instants of time; noon in Detroit occurs one hour earlier than noon in Chicago. We can check this by subtracting the two datetimes:
```r
as.POSIXct(x, tz = "America/Chicago") - as.POSIXct(x)
## Time difference of 1 hours
```
I’m not really in Detroit, I’m in Ypsilanti (another city in Michigan); what happens if we use that instead?
```r
as.POSIXct(x, tz = "America/Ypsilanti")
## [1] "2018-01-01 12:00:00 America"
```
Okay, that’s weird; what does the “America” timezone code even mean? Is it the same as the first conversion we did? If it is, there should be a 0-hour difference between them.
```r
as.POSIXct(x, tz = "America/Ypsilanti") - as.POSIXct(x)
## Time difference of -5 hours
```
That’s not right! “America/Ypsilanti” isn’t a valid timezone, but rather than throwing an error (or even a warning), R silently created a datetime with a zero UTC offset.
UTC is a special standard that has no timezones, no Daylight Saving Time, nor any of the other weird things that make timezones hard. In R, and in most other programming languages, UTC is the common underlying representation of (date)time, and things like timezones are defined in terms of UTC (EST is UTC-5, hence the “-5” above). It’s roughly the same as GMT (Greenwich Mean Time), but not quite identical (GMT can have Daylight Saving Time!).
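To make this concrete, here’s a small sketch (the 5pm/noon pairing is my own illustration, not from the examples above) showing that a POSIXct value is just a count of seconds since the Unix epoch in UTC, and the timezone only affects display:

```r
# Noon EST and 5pm UTC are the same instant; only the printed form differs
t_utc <- as.POSIXct("2018-01-01 17:00:00", tz = "UTC")
t_est <- as.POSIXct("2018-01-01 12:00:00", tz = "EST")  # EST is UTC-5

# Both store the same number of seconds since 1970-01-01 00:00:00 UTC
as.numeric(t_utc) == as.numeric(t_est)
## [1] TRUE
```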
This behavior in R can be really dangerous; even if you intend to use a proper timezone name, misspelling it (e.g. “America/Chicgo”) will produce this same bad behavior. We need to make sure to only use valid timezone names. The list of options can be accessed with the OlsonNames() function; on my machine, this gives 606 different options! We can see “America/Chicago” and “America/Detroit” are in there, but “America/Ypsilanti” isn’t.
```r
c("America/Chicago", "America/Detroit", "America/Ypsilanti") %in% OlsonNames()
## [1]  TRUE  TRUE FALSE
```
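One defensive habit in base R is to validate any timezone string against OlsonNames() before using it. Here’s a minimal sketch of such a check; check_tz() is our own hypothetical helper, not a base R or lubridate function:

```r
# Hypothetical helper: error early on a bad timezone name instead of
# letting R silently fall back to a zero UTC offset
check_tz <- function(tz) {
  if (!tz %in% OlsonNames()) {
    stop("Unknown timezone: ", tz, call. = FALSE)
  }
  tz
}

as.POSIXct("2018-01-01 12:00:00", tz = check_tz("America/Detroit"))    # fine
# as.POSIXct("2018-01-01 12:00:00", tz = check_tz("America/Ypsilanti")) # errors
```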
Enter lubridate
These kinds of bad-but-quiet misunderstandings between the programmer and the language exist all over base R, but can often be prevented with additional packages. The tidyverse family in particular aims to prevent silent errors like this, and its date/time package lubridate is no exception.
Rather than using as.POSIXct() to turn our string into a datetime object, we’ll use lubridate::as_datetime().

```r
library(lubridate)
as_datetime(x)  # ymd_hms works as well
## [1] "2018-01-01 12:00:00 UTC"
```
Notice as_datetime() used UTC by default, rather than my local timezone. This is great because it means that command will return the same thing on any other computer, regardless of what that computer’s timezone is set to. lubridate makes code more portable!
This as_datetime() function still produces POSIXct objects,

```r
class(as_datetime(x))
## [1] "POSIXct" "POSIXt"
```
which means any non-lubridate functions for working with these datetime objects will still work.
We can still specify whatever timezone we desire,

```r
as_datetime(x, tz = "America/Detroit")
## [1] "2018-01-01 12:00:00 EST"
```
but more importantly, lubridate lets us know if we use a bad timezone name!

```r
as_datetime(x, tz = "America/Ypsilanti")
## Error in C_force_tz(time, tz = tzone, roll): CCTZ: Unrecognized output timezone: "America/Ypsilanti"
```
This alone, saving us from ourselves, might be reason enough to use lubridate whenever you deal with timezones in R, but it can do a lot more, most of which we won’t cover here.
Why should we care?
For a lot of programs utilizing datetimes, you can get away with unconsidered or incorrect timezones. I can’t enumerate all the ways that could come back to bite you, but one is Daylight Saving Time.
Suppose we’re given the following datetimes as strings, perhaps read from some file:
```r
x <- c("2018-03-10 23:30:00", "2018-03-11 00:30:00", "2018-03-11 01:30:00",
       "2018-03-11 03:30:00", "2018-03-11 04:30:00", "2018-03-11 05:30:00")
x
## [1] "2018-03-10 23:30:00" "2018-03-11 00:30:00" "2018-03-11 01:30:00"
## [4] "2018-03-11 03:30:00" "2018-03-11 04:30:00" "2018-03-11 05:30:00"
```
To do much with these, we need to convert from character to POSIXct. We’ll use lubridate again, but instead of parsing with the very general as_datetime() we’ll use the more idiomatic ymd_hms(), which expects inputs formatted as year-month-day, hour-minute-second (exactly what we have!).

```r
ymd_hms(x)
## [1] "2018-03-10 23:30:00 UTC" "2018-03-11 00:30:00 UTC"
## [3] "2018-03-11 01:30:00 UTC" "2018-03-11 03:30:00 UTC"
## [5] "2018-03-11 04:30:00 UTC" "2018-03-11 05:30:00 UTC"
```
It’s possible that those times are supposed to be UTC and we got exactly what we wanted. If instead these strings represent clock times from a place that observes Daylight Saving Time, we’ve got a problem. These times span the “spring ahead” transition, where clocks jump straight from 2am to 3am. This means that in a DST-observing timezone all these clock times are evenly spaced, since 1:30am → 3:30am is only one hour.
Since UTC does not observe DST, the elements of x are currently not evenly spaced (it’s not a regular timeseries), which we can see quickly with the diff() function:

```r
diff(ymd_hms(x))
## Time differences in hours
## [1] 1 1 2 1 1
```
Using a DST-observing timezone on conversion will fix this.
```r
diff(ymd_hms(x, tz = "America/Detroit"))
## Time differences in hours
## [1] 1 1 1 1 1
```
Depending on what you want to do with these datetimes, that may not matter, but it could be absolutely critical.
Something lubridate won’t help prevent is using valid timezone names which don’t align with what you’re trying to do. “EST” (Eastern Standard Time) and “EDT” (Eastern Daylight Time) are both in OlsonNames(), but will ignore the DST shift in our vector of datetimes:

```r
diff(ymd_hms(x, tz = "EST"))
## Time differences in hours
## [1] 1 1 2 1 1

diff(ymd_hms(x, tz = "EDT"))
## Time differences in hours
## [1] 1 1 2 1 1
```
When we specify “EST”, the first 3 are correct, but the latter half is wrong because they’re really “EDT” datetimes; the opposite occurs when we specify “EDT”.
In almost all cases, use location-based timezone names like “America/City” to properly handle not only DST but also other potential oddities that may have occurred in that location, such as a shift between observing and not observing DST.
Timezones and I/O
It’s rare that we need to create new datetime objects from scratch; they’re nearly always coming into a project from some external dataset, generally in Excel or CSV formats.
We can make a little CSV using writeLines(); we’ll have just one column of our DST-crossing datetimes, and we’ll call it datetime.

```r
writeLines(c("datetime", x), "data/dst-example.csv")
```
There are very few excuses to be using read.csv() in 2018, so we’ll use the far-superior read_csv() from readr, which will automatically parse our datetimes into POSIXct objects.

```r
library(readr)
tbl <- read_csv("data/dst-example.csv")
## Parsed with column specification:
## cols(
##   datetime = col_datetime(format = "")
## )
tbl
## # A tibble: 6 x 1
##   datetime
##   <dttm>
## 1 2018-03-10 23:30:00
## 2 2018-03-11 00:30:00
## 3 2018-03-11 01:30:00
## 4 2018-03-11 03:30:00
## 5 2018-03-11 04:30:00
## 6 2018-03-11 05:30:00
```
What timezone are those? lubridate makes this easy with its tz() function.

```r
tz(tbl$datetime)
## [1] "UTC"
```
If we know these aren’t UTC times, but are instead clock times from New York, we need to do some sort of conversion here, as readr doesn’t allow you to specify the timezone when reading.
lubridate provides two “timezone conversion” functions:

- with_tz() applies a timezone so the clock time changes, but the instant of time (e.g. UTC) is the same, e.g. 5pm EST → 2pm PST.
- force_tz() changes the instant of time so the clock time remains constant, e.g. 5pm EST → 5pm PST.
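A minimal side-by-side sketch of the two (the 5pm example is my own, mirroring the descriptions above):

```r
library(lubridate)

t <- ymd_hms("2018-01-01 17:00:00", tz = "America/New_York")  # 5pm EST

# Same instant, different clock time: 5pm EST is 2pm PST
with_tz(t, "America/Los_Angeles")
## [1] "2018-01-01 14:00:00 PST"

# Same clock time, different instant: shifted 3 hours in absolute terms
force_tz(t, "America/Los_Angeles")
## [1] "2018-01-01 17:00:00 PST"
```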
In this case we need force_tz() because the times in our file are non-UTC (New York, in this case) clock times.

```r
library(dplyr)
tbl <- tbl %>% mutate(datetime = force_tz(datetime, "America/New_York"))
tbl
## # A tibble: 6 x 1
##   datetime
##   <dttm>
## 1 2018-03-10 23:30:00
## 2 2018-03-11 00:30:00
## 3 2018-03-11 01:30:00
## 4 2018-03-11 03:30:00
## 5 2018-03-11 04:30:00
## 6 2018-03-11 05:30:00
```
Notice the display isn’t any different; tibble doesn’t print timezones for datetime columns. The timezone has changed,

```r
tz(tbl$datetime)
## [1] "America/New_York"
```
and they’re now evenly spaced, too:
```r
diff(tbl$datetime)
## Time differences in hours
## [1] 1 1 1 1 1
```
I’m not sure we’ve ever had a client send us a file with UTC datetimes in it, so whenever we read in a client-provided file with datetimes, we immediately call force_tz() to correct things.
Reading from Excel
This workflow is essentially the same with Excel files. We’ll use the readxl package, another tidyverse gem, which is much nicer than the Java-based libraries that came before it (e.g. xlsx). I’ve already created an Excel file (which you can download here) whose first sheet is identical to the CSV we just made, so we’ll read that in.

```r
library(readxl)
tbl <- read_excel("data/timezone-examples.xlsx")
tbl
## # A tibble: 6 x 1
##   datetime
##   <dttm>
## 1 2018-03-10 23:30:00
## 2 2018-03-11 00:30:00
## 3 2018-03-11 01:30:00
## 4 2018-03-11 03:30:00
## 5 2018-03-11 04:30:00
## 6 2018-03-11 05:30:00
```
Like readr::read_csv(), readxl::read_excel() is smart enough to see we have datetimes and parse them appropriately, but we have the same problem as before:

```r
tz(tbl$datetime)
## [1] "UTC"
```
Thus we need the same fix. An alternative to force_tz() is to assign to tz():

```r
tz(tbl$datetime) <- "America/New_York"
tbl
## # A tibble: 6 x 1
##   datetime
##   <dttm>
## 1 2018-03-10 23:30:00
## 2 2018-03-11 00:30:00
## 3 2018-03-11 01:30:00
## 4 2018-03-11 03:30:00
## 5 2018-03-11 04:30:00
## 6 2018-03-11 05:30:00
diff(tbl$datetime)
## Time differences in hours
## [1] 1 1 1 1 1
```
This is very base-R style and doesn’t fit nicely into pipelines of commands with %>%, so we generally opt for force_tz() + mutate(), but this form can also be useful.
Split Date & Time
We often receive data where the date is in one column, but the time is in the next column over. This presents some new challenges because R doesn’t have date-free time classes; we want both pieces in the same column.
We find this happens most often in Excel files; the second sheet of our timezone-examples.xlsx file provides an example with the same data as before, but the date & time in two separate columns.

```r
tbl <- read_excel("data/timezone-examples.xlsx", sheet = 2)
tbl
## # A tibble: 6 x 2
##   date                time
##   <dttm>              <dttm>
## 1 2018-03-10 00:00:00 1899-12-31 23:30:00
## 2 2018-03-11 00:00:00 1899-12-31 00:30:00
## 3 2018-03-11 00:00:00 1899-12-31 01:30:00
## 4 2018-03-11 00:00:00 1899-12-31 03:30:00
## 5 2018-03-11 00:00:00 1899-12-31 04:30:00
## 6 2018-03-11 00:00:00 1899-12-31 05:30:00
```
We sure didn’t expect the time to be read as datetimes with some strange, irrelevant date! We’d prefer to have a single datetime column where the dates come from date and the times come from time. There are at least two ways we could do this:

- change the dates in the time column to be the dates in the date column, then rename time -> datetime
- create a new datetime column by taking the pieces we need from date & time
We think the latter is a bit more “tidy” or idiomatic, so that’s what we’ll do. It’s a bit clunky, but we can extract the dates and times from the appropriate column, paste them together, then re-parse those strings into datetime (POSIXct) objects. This also lets us specify the timezone as we do so, avoiding the need for force_tz(). The hardest part is pulling only the time part out of the POSIXct objects in our time column; we’ll use format() for that.

```r
tbl <- tbl %>%
  mutate(datetime = paste(date, format(time, "%T")) %>% ymd_hms(tz = "America/New_York")) %>%
  select(datetime)  # drop date/time cols
tbl
## # A tibble: 6 x 1
##   datetime
##   <dttm>
## 1 2018-03-10 23:30:00
## 2 2018-03-11 00:30:00
## 3 2018-03-11 01:30:00
## 4 2018-03-11 03:30:00
## 5 2018-03-11 04:30:00
## 6 2018-03-11 05:30:00
tz(tbl$datetime)
## [1] "America/New_York"
diff(tbl$datetime)
## Time differences in hours
## [1] 1 1 1 1 1
```
There we go! One ugly line of code, but that’s all we needed.
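An alternative sketch, if the paste-and-reparse feels too string-based: build the result from components with lubridate::make_datetime(). Shown here on a single standalone pair of values (my own toy inputs mimicking one row of the sheet) rather than the whole tibble:

```r
library(lubridate)

date_part <- as.POSIXct("2018-03-10", tz = "UTC")
time_part <- as.POSIXct("1899-12-31 23:30:00", tz = "UTC")  # Excel's dummy date

result <- make_datetime(year(date_part), month(date_part), mday(date_part),
                        hour(time_part), minute(time_part), second(time_part),
                        tz = "America/New_York")
result
## [1] "2018-03-10 23:30:00 EST"
```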
With CSVs this is a bit easier, because the time column will come back as a character vector rather than a POSIXct vector (or you can force it to POSIXct with the col_types argument); then you don’t need the format() part above. If we tried to force that behavior using read_excel()’s col_types argument, we’d get back some hard-to-interpret numbers due to the horrible way Excel stores time.
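For the CSV case, here’s a sketch of declaring the split columns’ types up front with readr’s col_date() and col_time(), then combining; the little temp-file CSV is my own toy example:

```r
library(readr)
library(lubridate)

# A tiny two-column CSV like the ones we receive
tf <- tempfile(fileext = ".csv")
writeLines(c("date,time", "2018-03-10,23:30:00", "2018-03-11,00:30:00"), tf)

tbl <- read_csv(tf, col_types = cols(date = col_date(), time = col_time()))

# date is a Date, time is an hms; paste and parse with the right timezone
parsed <- ymd_hms(paste(tbl$date, tbl$time), tz = "America/New_York")
parsed
```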
Writing back out
So we’ve got our inputs formatted the way we want, then we do something, and at some point we probably want to write that back out. This is super easy with readr:

```r
write_csv(tbl, "data/example-output.csv")
```
What was written to that text file might not be what we expected; let’s investigate with readr::read_lines():

```r
read_lines("data/example-output.csv")
## [1] "datetime"             "2018-03-11T04:30:00Z" "2018-03-11T05:30:00Z"
## [4] "2018-03-11T06:30:00Z" "2018-03-11T07:30:00Z" "2018-03-11T08:30:00Z"
## [7] "2018-03-11T09:30:00Z"
```
Two things to notice: the format is ISO 8601 (the “T” separator), and the times are stored as UTC (the trailing “Z” denotes UTC, i.e. a zero offset). Just like readr assumes everything is UTC coming in, it also makes things UTC going out.
For programmers and programming languages, this is ideal: it’s consistent, unambiguous, and portable.
If this file is going to be used by non-programmers, however, this can be a problem. If you open it in Excel, you’ll see the UTC times, and unlike most sane programming languages, Excel can’t store UTC but display EST/EDT/PST/etc. “What are all these weird times? Why are they off by X hours sometimes, and X+1 hours other times?” are not questions we like getting after handing off a deliverable to a client.
We have just two options:

- explain to the next user (client, colleague, etc.) the excellence and utility of UTC and convince them it’s the right way to do things
- write the datetimes out a bit differently
Depending on the next user, option 1 can be a total no-go, in which case we need a way to write our datetimes as clock times rather than instants of time (UTC).
The simplest way is to convert to character first:

```r
tbl %>%
  mutate_if(is.POSIXct, as.character) %>%  # handy if you have multiple datetime cols
  write_csv("data/example-output.csv")
read_lines("data/example-output.csv")
## [1] "datetime"            "2018-03-10 23:30:00" "2018-03-11 00:30:00"
## [4] "2018-03-11 01:30:00" "2018-03-11 03:30:00" "2018-03-11 04:30:00"
## [7] "2018-03-11 05:30:00"
```
That seems to have done what we want, and if necessary we could replace as.character with something else to get a different format.
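For instance, swapping in format() with a format string (the US-style string below is just one arbitrary choice, and the one-row tibble is my own toy example):

```r
library(dplyr)
library(lubridate)

tbl <- tibble::tibble(datetime = ymd_hms("2018-03-10 23:30:00", tz = "America/New_York"))

# format() with extra arguments passed through mutate_if
out <- tbl %>%
  mutate_if(is.POSIXct, format, "%m/%d/%Y %I:%M %p")
out$datetime
## [1] "03/10/2018 11:30 PM"
```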
This works about as well when writing to Excel files (e.g. with writexl::write_xlsx()), but in general we try to avoid writing directly to Excel files. If we know the next user will be using Excel, we prefer readr::write_excel_csv(), though writexl is the tool of choice for files with multiple tables spread over different sheets.
This only scratched the surface of issues that can arise when dealing with datetimes & timezones, but if there’s one lesson we hope sticks with you, it’s use lubridate. It saves lives; check out this decade-old article from Revolution Analytics as further proof that life was a bit tougher before lubridate.
If you want to dig deeper into the scary world of datetimes we’re mostly abstracted from, check out this excellent article from Zach Holman to learn about that one time Samoa skipped a day and so much more.