Tidy time series data using tsibbles
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
There is a new suite of packages for tidy time series analysis, that integrates easily into the tidyverse way of working. We call these the tidyverts
packages, and they are available at tidyverts.org. Much of the work on these packages has been done by Earo Wang and Mitchell O’Hara-Wild.
The first of the packages to make it to CRAN was tsibble, providing the data infrastructure for tidy temporal data with wrangling tools. A tsibble (where “ts” is pronounced as in cats) is a time series object that is much easier to work with than existing classes such as ts
, xts
and others. Existing ts
objects can be easily converted to tsibble
objects using as_tsibble()
. For example
library(tidyverse) library(tsibble) USAccDeaths %>% as_tsibble() ## # A tsibble: 72 x 2 [1M] ## index value ## <mth> <dbl> ## 1 1973 Jan 9007 ## 2 1973 Feb 8106 ## 3 1973 Mar 8928 ## 4 1973 Apr 9137 ## 5 1973 May 10017 ## 6 1973 Jun 10826 ## 7 1973 Jul 11317 ## 8 1973 Aug 10744 ## 9 1973 Sep 9713 ## 10 1973 Oct 9938 ## # … with 62 more rows
This creates an Index column which species the time or date index. These are always explicit in tsibbles, and can take a rich variety of time and date classes to handle any sort of temporal data. The second column is a measurement variable. Note the [1M]
in the header, indicating this is monthly data.
It is also easy to create tsibbles from csv files, by first reading them in using readr::read_csv()
and then using the as_tsibble()
function.
A more interesting tsibble
object is tourism
, containing quarterly overnight trips across Australia.
tourism ## # A tsibble: 24,320 x 5 [1Q] ## # Key: Region, State, Purpose [304] ## Quarter Region State Purpose Trips ## <qtr> <chr> <chr> <chr> <dbl> ## 1 1998 Q1 Adelaide South Australia Business 135. ## 2 1998 Q2 Adelaide South Australia Business 110. ## 3 1998 Q3 Adelaide South Australia Business 166. ## 4 1998 Q4 Adelaide South Australia Business 127. ## 5 1999 Q1 Adelaide South Australia Business 137. ## 6 1999 Q2 Adelaide South Australia Business 200. ## 7 1999 Q3 Adelaide South Australia Business 169. ## 8 1999 Q4 Adelaide South Australia Business 134. ## 9 2000 Q1 Adelaide South Australia Business 154. ## 10 2000 Q2 Adelaide South Australia Business 169. ## # … with 24,310 more rows
Here we have some additional columns called Keys. There should be one row for each unique combination of the keys and index. Columns which are not the index or key variables are measurement variables. In this example, there are three keys (Region
, State
and Purpose
) and one Measurement (Trips
)
All the usual tidyverse wrangling verbs apply. For example, to get the total visitor nights spent on Holiday by State for each quarter (ignoring Regions):
tourism %>% filter(Purpose == "Holiday") %>% group_by(State) %>% summarise(Trips = sum(Trips)) ## # A tsibble: 640 x 3 [1Q] ## # Key: State [8] ## State Quarter Trips ## <chr> <qtr> <dbl> ## 1 ACT 1998 Q1 196. ## 2 ACT 1998 Q2 127. ## 3 ACT 1998 Q3 111. ## 4 ACT 1998 Q4 170. ## 5 ACT 1999 Q1 108. ## 6 ACT 1999 Q2 125. ## 7 ACT 1999 Q3 178. ## 8 ACT 1999 Q4 218. ## 9 ACT 2000 Q1 158. ## 10 ACT 2000 Q2 155. ## # … with 630 more rows
Note that we do not have to explicitly group by the time index as this is assumed in a tsibble
.
To switch to annual data, we can re-index the tsibble:
tourism %>% mutate(Year = lubridate::year(Quarter)) %>% index_by(Year) %>% group_by(Region, State, Purpose) %>% summarise(Trips = sum(Trips)) %>% ungroup() ## # A tsibble: 6,080 x 5 [1Y] ## # Key: Region, State, Purpose [304] ## Region State Purpose Year Trips ## <chr> <chr> <chr> <dbl> <dbl> ## 1 Adelaide South Australia Business 1998 538. ## 2 Adelaide South Australia Business 1999 641. ## 3 Adelaide South Australia Business 2000 787. ## 4 Adelaide South Australia Business 2001 608. ## 5 Adelaide South Australia Business 2002 697. ## 6 Adelaide South Australia Business 2003 690. ## 7 Adelaide South Australia Business 2004 734. ## 8 Adelaide South Australia Business 2005 490. ## 9 Adelaide South Australia Business 2006 568. ## 10 Adelaide South Australia Business 2007 584. ## # … with 6,070 more rows
The index_by()
function is the counterpart of group_by()
when dealing with the index.
I often get questions about dealing with daily and sub-daily data, for which the ts
class is particularly ill-suited. The tsibble
class handles such data with no problem. For example, here are some hourly pedestrian counts at four sites around Melbourne, Australia.
pedestrian ## # A tsibble: 66,037 x 5 [1h] <Australia/Melbourne> ## # Key: Sensor [4] ## Sensor Date_Time Date Time Count ## <chr> <dttm> <date> <int> <int> ## 1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630 ## 2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826 ## 3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01 2 567 ## 4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01 3 264 ## 5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01 4 139 ## 6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01 5 77 ## 7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01 6 44 ## 8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01 7 56 ## 9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01 8 113 ## 10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01 9 166 ## # … with 66,027 more rows
The Date
and Time
variables split the index into two components, representing the date and the hour of day. This makes it easy to produce some interesting plots.
pedestrian %>% mutate( Day = lubridate::wday(Date, label = TRUE), Weekend = (Day %in% c("Sun", "Sat")) ) %>% ggplot(aes(x = Time, y = Count, group = Date)) + geom_line(aes(col = Weekend)) + facet_grid(Sensor ~ .)
The volume and pattern of pedestrian traffic at these four locations is clearly very different. See Wang et al for a more complete analysis with a cool calendar plot.
In summary, I hope tsibbles
will become the standard for handling temporal data in R (including multivariate time series, panel data, ets). I’ve been using them for about a year, and I’m still amazed at how much easier it is to do things than using other structures.
We also have a paper describing tsibbles and the package for those who want more of the detailed thinking behind the design.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.