A Tour of Timezones (& Troubles) in R

Posted on July 2, 2018 by Rstats on pi: predict/infer in R bloggers | 0 Comments

[This article was first published on Rstats on pi: predict/infer, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In any programming tool, dates, times, and timezones are hard. Deceptively hard. They’ve been shaped by politics and whimsy for hundreds of years: timezones can shift with minimal notice, countries have skipped or repeated certain days, some are offset by weird increments, some observe Daylight Saving Time, leap years, leap seconds, the list goes on. Luckily, we rarely need to worry about most of those details because other teams of very smart people have spent a lot of time providing nice abstractions for us that handle most of the weird edge cases.

Here at Methods we’ve been bit by some timezone oddities in R recently, so I wanted to focus on that part of R’s “date/time stack”, starting with a crash-course on datetimes and timezones before showing how to avoid common errors, with a special focus on reading from and writing to files.

Timezones in base R

Let’s start with a single string representing noon on the first day of this year. We want to make this a proper date-time object, which in R means the POSIXct class; we can do this with as.POSIXct().

x <- "2018-01-01 12:00:00"

as.POSIXct(x)
## [1] "2018-01-01 12:00:00 EST"

Notice the timezone code at the end; it defaulted to “EST” because of my default timezone, which might be different for you.

Sys.timezone()
## [1] "America/Detroit"

If our x isn’t supposed to represent Eastern time, we can specify a different timezone as an argument to as.POSIXct.

as.POSIXct(x, tz = "America/Chicago")
## [1] "2018-01-01 12:00:00 CST"

Notice the only thing that changed is the timezone; the clock time is the same. CST and EST are an hour apart, though, so our two conversions represent different instances of time; noon in Detroit occurs one hour later than noon in Chicago. We can check this by subtracting the dates:

as.POSIXct(x, tz = "America/Chicago") - as.POSIXct(x)
## Time difference of 1 hours

I’m not really in Detroit, I’m in Ypsilanti (another city in Michigan); what happens if we use that instead?

as.POSIXct(x, tz = "America/Ypsilanti")
## [1] "2018-01-01 12:00:00 America"

Okay that’s weird; what does the “America” timezone code even mean? Is it the same as the first conversion we did? If it is, there should be a 0-hour difference between them.

as.POSIXct(x, tz = "America/Ypsilanti") - as.POSIXct(x)
## Time difference of -5 hours

That’s not right! “America/Ypsilanti” isn’t a valid timezone, but rather than throwing an error (or even a warning), R created an instance of time with no UTC offset.

UTC is a special standard that has no timezone, no daylight savings time, or any of the other weird things that make timezones hard. In R, and most other programming languages, UTC is the common, underlying representation of (date)time, and things like timezones are defined in terms of UTC (EST is UTC-5, hence the “-5” above). It’s roughly the same as GMT (Greenwich Mean Time), but not quite identical (GMT can have Daylight Saving Time!)

This above behavior in R can be really dangerous; even if you intend to use a proper timezone name, misspelling it (e.g. “America/Chicgo”) will produce this same bad behavior. We need to make sure to only use valid timezone names. The list of options can be accessed with the OlsonName() function; on my machine, this gives 606 different options!

We can see “America/Chicago” and “America/Detroit” are in there, but “America/Ypsilanti” isn’t.

c("America/Chicago", "America/Detroit", "America/Ypsilanti") %in% OlsonNames()
## [1]  TRUE  TRUE FALSE

Enter `lubridate`

These kinds of bad-but-quiet misunderstandings between the programmer and the language exist all over base R, but can often be prevented with additional packages. The tidyverse family in particular aims to prevent silent errors like this, and their date/time package lubridate is no exception.

Rather than using as.POSIXct to turn our string into a datetime object, we’ll use lubridate::as_datetime.

library(lubridate)

as_datetime(x)  # ymd_hms works as well
## [1] "2018-01-01 12:00:00 UTC"

Notice as_datetime used UTC by default, rather than my local timezone. This is great because it means that command will return the same thing on any other computer, regardless what that computers’ timezone is set to. lubridate makes code more portable!

This as_datetime function still produces POSIXct objects,

class(as_datetime(x))
## [1] "POSIXct" "POSIXt"

which means any non-lubridate functions for working with these datetime objects will still work.

We can still specify whatever timezone we desire,

as_datetime(x, tz = "America/Detroit")
## [1] "2018-01-01 12:00:00 EST"

but more importantly, lubridate lets us know if we use a bad timezone name!

as_datetime(x, tz = "America/Ypsilanti")
## Error in C_force_tz(time, tz = tzone, roll): CCTZ: Unrecognized output timezone: "America/Ypsilanti"

This alone, saving us from ourselves, might be reason enough to use lubridate whenever you deal with timezones in R, but it can do a lot more, most of which we won’t cover here.

Why should we care?

For a lot of programs utilizing datetimes, you can get away with unconsidered or incorrect timezones. I can’t enumerate all the ways that could come back to bite you, but one is Daylight Saving Time.

Suppose we’re given the following datetimes as strings, perhaps read from some file:

x <- c("2018-03-10 23:30:00", "2018-03-11 00:30:00", "2018-03-11 01:30:00", "2018-03-11 03:30:00", "2018-03-11 04:30:00", "2018-03-11 05:30:00")
x
## [1] "2018-03-10 23:30:00" "2018-03-11 00:30:00" "2018-03-11 01:30:00"
## [4] "2018-03-11 03:30:00" "2018-03-11 04:30:00" "2018-03-11 05:30:00"

To do much with these, we need to convert from character to POSIXct. We’ll use lubridate again, but instead of parsing with the very general as_datetime() we’ll use the more idiomatic ymd_hms() which expects inputs to be formatted as year-month-day, hour-minute-second (exactly what we have!).

ymd_hms(x)
## [1] "2018-03-10 23:30:00 UTC" "2018-03-11 00:30:00 UTC"
## [3] "2018-03-11 01:30:00 UTC" "2018-03-11 03:30:00 UTC"
## [5] "2018-03-11 04:30:00 UTC" "2018-03-11 05:30:00 UTC"

It’s possible that those times are supposed to be UTC and we got exactly what we wanted. If instead these strings represent clock-times from a place that observes Daylight Saving Time, we’ve got a problem. These times span the “spring ahead” transition, where clocks immediately transition from 2am to 3am. This means that in a DST-observing timezone all these clock-times are evenly spaced, since 1:30am \(\rightarrow\) 3:30am is only one hour.

Since UTC does not observe DST, elements of dst are currently not evenly-spaced (it’s not a regular timeseries), which we can see quickly with the diff function:

diff(ymd_hms(x))
## Time differences in hours
## [1] 1 1 2 1 1

Using a DST-observing timezone on conversion will fix this.

diff(ymd_hms(x, tz = "America/Detroit"))
## Time differences in hours
## [1] 1 1 1 1 1

Depending on what you want to do with these datetimes, that may not matter, but it could be absolutely critical.

Something lubridate won’t help prevent is using valid timezone names which don’t align with what you’re trying to do. “EST” (Eastern Standard Time) and “EDT” (Eastern Daylight Time) are both in OlsonNames(), but will ignore the DST shift in our vector of datetimes:

diff(ymd_hms(x, tz = "EST"))
## Time differences in hours
## [1] 1 1 2 1 1
diff(ymd_hms(x, tz = "EDT"))
## Time differences in hours
## [1] 1 1 2 1 1

When we specify “EST”, the the first 3 are correct, but the latter half is wrong because they’re really “EDT” datetimes; the opposite occurs when we specify “EDT”.

In almost all cases, use location-based timezone names like “America/City” to properly handle not only DST, but other potential oddities that may have occurred in that location, like those that have shifted between observing and not observing DST.

Timezones and I/O

It’s rare that we need to create new datetime objects from scratch; they’re nearly always coming into a project from some external dataset, generally in Excel or CSV formats.

We can make a little CSV using writeLines(); we’ll have just one column of our DST-crossing datetimes, and we’ll call it datetime.

writeLines(c("datetime", x), "data/dst-example.csv")

There’s very few excuses to be using read.csv() in 2018, so we’ll use the far-superior read_csv() from readr which will automatically parse our datetimes into POSIXct objects.

library(readr)

tbl <- read_csv("data/dst-example.csv")
## Parsed with column specification:
## cols(
##   datetime = col_datetime(format = "")
## )
tbl
## # A tibble: 6 x 1
##   datetime           
##   <dttm>             
## 1 2018-03-10 23:30:00
## 2 2018-03-11 00:30:00
## 3 2018-03-11 01:30:00
## 4 2018-03-11 03:30:00
## 5 2018-03-11 04:30:00
## 6 2018-03-11 05:30:00

What timezone are those? lubridate makes this easy with their tz function.

tz(tbl$datetime)
## [1] "UTC"

If we know these aren’t UTC times, but are instead clock times from New York, we need to do some sort of conversion here, as readr doesn’t allow you to specify the timezone when reading.

lubridate provides two “timezone conversion” functions:

with_tz() applies a timezone so the clock time changes, but the instant of time (e.g. UTC) is the same, e.g. 5pm EST \(\rightarrow\) 2pm PST.
force_tz() changes the instant of time so the clock time remains constant, e.g. 5pm EST \(\rightarrow\) 5pm PST.

In this case we need force_tz because the times in our file are non-UTC (New York, in this case) clock times.

library(dplyr)

tbl <- tbl %>%
  mutate(datetime = force_tz(datetime, "America/New_York"))

tbl
## # A tibble: 6 x 1
##   datetime           
##   <dttm>             
## 1 2018-03-10 23:30:00
## 2 2018-03-11 00:30:00
## 3 2018-03-11 01:30:00
## 4 2018-03-11 03:30:00
## 5 2018-03-11 04:30:00
## 6 2018-03-11 05:30:00

Notice the display isn’t any different; tibble doesn’t print timezones for datetime columns. The timezone has changed,

tz(tbl$datetime)
## [1] "America/New_York"

and they’re now evenly spaced, too:

diff(tbl$datetime)
## Time differences in hours
## [1] 1 1 1 1 1

I’m not sure we’ve ever had a client send us a file with UTC datetimes in it, so whenever we read in a client-provided file with datetimes, we immediately call force_tz to correct things.

Reading from Excel

This workflow is essentially the same with Excel files. We’ll use the readxl package, another tidyverse gem, which is much nicer than the Java-based libraries that came before it (e.g. xlsx). I’ve already created an excel file (which you can download here) whose first sheet is identical to the CSV we just made, so we’ll read that in.

library(readxl)

tbl <- read_excel("data/timezone-examples.xlsx")
tbl
## # A tibble: 6 x 1
##   datetime           
##   <dttm>             
## 1 2018-03-10 23:30:00
## 2 2018-03-11 00:30:00
## 3 2018-03-11 01:30:00
## 4 2018-03-11 03:30:00
## 5 2018-03-11 04:30:00
## 6 2018-03-11 05:30:00

Like readr::read_csv(), readxl::read_excel() is smart enough to see we have datetimes, and parse them appropriately, but we have the same problem as before:

tz(tbl$datetime)
## [1] "UTC"

Thus we need the same fix. An alternative to force_tz() is to assign to tz():

tz(tbl$datetime) <- "America/New_York"

tbl
## # A tibble: 6 x 1
##   datetime           
##   <dttm>             
## 1 2018-03-10 23:30:00
## 2 2018-03-11 00:30:00
## 3 2018-03-11 01:30:00
## 4 2018-03-11 03:30:00
## 5 2018-03-11 04:30:00
## 6 2018-03-11 05:30:00
diff(tbl$datetime)
## Time differences in hours
## [1] 1 1 1 1 1

This is very base-R style and doesn’t fit nicely into pipelines of commands with %>%, so we generally opt for force_tz + mutate, but this form can also be useful.

Split Date & Time

We often receive data where the date is in one column, but the time is in the next column over. This presents some new challenges because R doesn’t have date-free time classes; we want both pieces in the same column.

We find this happens most often in Excel files; the second sheet of our timezone-examples.xlsx file provides an example with the same data as before, but the date & time in two separate columns.

tbl <- read_excel("data/timezone-examples.xlsx", sheet = 2)
tbl
## # A tibble: 6 x 2
##   date                time               
##   <dttm>              <dttm>             
## 1 2018-03-10 00:00:00 1899-12-31 23:30:00
## 2 2018-03-11 00:00:00 1899-12-31 00:30:00
## 3 2018-03-11 00:00:00 1899-12-31 01:30:00
## 4 2018-03-11 00:00:00 1899-12-31 03:30:00
## 5 2018-03-11 00:00:00 1899-12-31 04:30:00
## 6 2018-03-11 00:00:00 1899-12-31 05:30:00

We sure didn’t expect the time to be read as datetimes with some strange, irrelevant date! We’d prefer to have a single datetime column where the dates come from date and the times come from time. There’s at least two ways we could do this:

change the dates in the time column to be the dates in the date column, then rename time -> datetime
create a new datetime column by taking the pieces we need from date & time

We think the latter is a bit more “tidy” or idiomatic, so that’s what we’ll do. It’s a bit clunky, but we can extract the dates and times from the appropriate column, paste them together, then re-parse those strings into datetime (POSIXct) objects. This will also let us specify the timezone when we do that, avoiding the need for force_tz. The hardest part is pulling only the time-part out of the POSIXct objects in our time column; we’ll use format for that.

tbl <- tbl %>%
  mutate(datetime = paste(date, format(time, "%T")) %>% ymd_hms(tz = "America/New_York")) %>%
  select(datetime)  # drop date/time cols

tbl
## # A tibble: 6 x 1
##   datetime           
##   <dttm>             
## 1 2018-03-10 23:30:00
## 2 2018-03-11 00:30:00
## 3 2018-03-11 01:30:00
## 4 2018-03-11 03:30:00
## 5 2018-03-11 04:30:00
## 6 2018-03-11 05:30:00
tz(tbl$datetime)
## [1] "America/New_York"
diff(tbl$datetime)
## Time differences in hours
## [1] 1 1 1 1 1

There we go! One ugly line of code, but that’s all we needed.

With CSV’s this is a bit easier because the time column will come back as a character vector rather than a POSIXct vector (or you can force it to POSIXct with the col_types argument); then you don’t need the format part above. If we tried to force that behavior using read_excel()’s col_types argument, we’d get back some hard-to-interpret numbers due to the horrible way Excel stores time.

Writing back out

So we’ve got our inputs formatted the way we want, then we do something, and at some point we probably want to write that back out. This is super easy with readr:

write_csv(tbl, "data/example-output.csv")

What was written to that text file might not be what we expected; let’s investigate with readr::read_lines():

read_lines("data/example-output.csv")
## [1] "datetime"             "2018-03-11T04:30:00Z" "2018-03-11T05:30:00Z"
## [4] "2018-03-11T06:30:00Z" "2018-03-11T07:30:00Z" "2018-03-11T08:30:00Z"
## [7] "2018-03-11T09:30:00Z"

Two things to notice: the format is ISO 8601 (the “T” and “Z”), and they’re stored as UTC (no timezone or offset, indicated by the lack of anything after the “Z”). Just like readr assumes everything is UTC coming in, it also makes things UTC going out.

For programmers and programming languages, this is ideal: it’s consistent, unambiguous, and portable.

If this file is going to be used by non-programmers, however, this can be a problem. If you open this in Excel, you’ll see the UTC times, and unlike all sane programming languages Excel can’t store UTC but display EST/EDT/PST/etc. “What are all these weird times? Why are they off by X hours sometimes, and X+1 hours other times?” are not questions we like getting after handing off a deliverable to a client.

We have just two options:

explain to the next user (client, colleague, etc.) the excellence and utility of UTC and convince them it’s the right way to do things
write the datetimes out a bit differently

Depending on the next user, option 1 can be a total no-go, in which case we need a way to write our datetimes as clock times rather than instants of time (UTC).

The simplest way is to convert to character first

tbl %>%
  mutate_if(is.POSIXct, as.character) %>%  # handy if you have multiple datetime cols
  write_csv("data/example-output.csv")

read_lines("data/example-output.csv")
## [1] "datetime"            "2018-03-10 23:30:00" "2018-03-11 00:30:00"
## [4] "2018-03-11 01:30:00" "2018-03-11 03:30:00" "2018-03-11 04:30:00"
## [7] "2018-03-11 05:30:00"

That seems to have done what we want, and if necessary we could replace as.character with something else to get a different format.

This works about as well when writing to Excel files (e.g. with writexl::write_xlsx), but in general we try to avoid writing directly to Excel files. If we know the next user will be using Excel, we prefer readr::write_excel_csv(), though writexl is the tool of choice for files with multiple tables spread over different sheets.

This only scratched the surface of issues that can arise when dealing with datetimes & timezones, but if there’s one lesson we hope sticks with you, it’s use lubridate. It saves lives; check out this decade-old article from Revolution Analytics as further proof that life was a bit tougher before lubridate.

If you want to dig deeper into the scary world of datetimes we’re mostly abstracted from, check out this excellent article from Zach Holman to learn about that one time Samoa skipped a day and so much more.

To leave a comment for the author, please follow the link and comment on their blog: Rstats on pi: predict/infer.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

A Tour of Timezones (& Troubles) in R

Timezones in base R

Enter `lubridate`

Why should we care?

Timezones and I/O

Reading from Excel

Split Date & Time

Writing back out

Related

Timezones in base R

Enter lubridate

Why should we care?

Timezones and I/O

Reading from Excel

Split Date & Time

Writing back out

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Enter `lubridate`

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)