
Minor updates for ggseas and Tcomp R packages by @ellis2013nz


Updates

I’ve made small updates to two of my R packages on CRAN: ggseas (seasonal adjustment on the fly for ggplot2 graphics) and Tcomp (tourism forecasting competition data). Neither of the packages changes in a noticeable way for most users.

I’ve written more about ggseas and Tcomp elsewhere. Given they’re both being updated at once, here’s a brief demo of them in action together.

There are 1,311 series in the list tourism in Tcomp. Each element of tourism contains some limited metadata about the series (for example its length and the required forecast horizon), the original training data, and the “answer” in the form of the actual observations over the forecast period. These objects are of class Mdata, introduced in Rob Hyndman’s Mcomp package, which comes with a convenient plotting method.

library(tidyverse)
library(scales)
library(ggseas)
library(Tcomp)

# default plot method for forecasting competition datasets of class Mdata
par(bty = "l", font.main = 1)
plot(tourism[["M4"]], main = "Series M4 from the tourism forecasting competition")
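
Out of curiosity, here’s a quick sketch of how you might poke at the structure of one of these objects. The component names x, xx, h and n follow the usual Mcomp conventions (the training data, the actual out-of-sample observations, the forecast horizon and the training length), so treat this as illustrative rather than definitive.

# how many series, and the top-level structure of one of them
length(tourism)
str(tourism[["M4"]], max.level = 1)

# forecast horizon, and the actual observations over the forecast period
tourism[["M4"]]$h
length(tourism[["M4"]]$xx)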

ggseas makes it easy to take a seasonal time series object (ie an object of class ts with frequency > 1), convert it into a data frame, and produce exploratory graphics with it. For example, here’s code to take the training data from that same forecasting competition object and compare the original with a seasonally adjusted version (using X13-SEATS-ARIMA for the seasonal adjustment), and a 12 month rolling average:

# convert a time series to a data frame
the_data <- ggseas::tsdf(tourism[["M4"]]$x)
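
tsdf should give back an ordinary data frame with a time column x and a value column y, which you can check with something like:

# peek at the converted data
head(the_data)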

# draw a graphic
ggplot(the_data, aes(x = x, y = y)) +
  geom_line(aes(colour = "Original")) +
  stat_seas(aes(colour = "Seasonally adjusted")) +
  stat_rollapplyr(aes(colour = "12 month rolling average"), width = 12) +
  scale_colour_manual(values = c("Original" = "grey50", "Seasonally adjusted" = "red", 
                                 "12 month rolling average" = "blue")) +
  theme(legend.position = c(0.2, 0.8)) +
  scale_y_continuous("Unknown units\n", labels = comma) +
  labs(x = "", colour = "") +
  ggtitle("Comparison of statistical summary methods in plotting a time series",
          "Monthly series 4 from the Tourism forecasting competition")

Then the ggsdc function makes it easy to look at a decomposition of a time series. This really comes into its own when we want to look at two or more time series, mapped to colour (there’s an example in the helpfile), but we can use it with a univariate time series too:

ggsdc(the_data, aes(x = x, y = y), method = "seas") +
  geom_line() +
  ggtitle("Seasonal decomposition with `X13-SEATS-ARIMA`, `seasonal` and `ggseas`",
          "Monthly series 4 from the Tourism forecasting competition") +
  scale_y_continuous("Unknown units\n", labels = comma) +
  labs(x = "")
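
For the multiple-series case, a sketch along the lines of the helpfile example might look something like this (assuming M4 and M5 are both monthly series, and letting ggsdc deduce the frequency from the data just as in the univariate example above):

# combine two monthly series into one long data frame,
# with the series identifier mapped to colour
two_series <- rbind(
  cbind(ggseas::tsdf(tourism[["M4"]]$x), series = "M4"),
  cbind(ggseas::tsdf(tourism[["M5"]]$x), series = "M5")
)

ggsdc(two_series, aes(x = x, y = y, colour = series), method = "seas") +
  geom_line() +
  labs(x = "", colour = "")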

I wrote the first version of the functions that later became ggseas in 2012 when working for New Zealand’s Ministry of Business, Innovation and Employment. We had new access to large amounts of monthly electronic transactions data and needed an efficient way to explore it on our way to developing what eventually became the Monthly Regional Tourism Estimates.

Reflections on CRAN

This experience, and some recent discussion on Twitter, have led me to reflect on the CRAN package update process. Both these updates went through really smoothly; I’m always impressed at the professionalism of the volunteers behind CRAN.

I think CRAN is hugely important and an amazing asset for the R community, and it really matters that packages are published there. The mapping of dependencies, and the process for checking that they all keep working together, is the key aspect here.

ggplot2 apparently has more than 2,000 reverse dependencies (ie packages that import some functionality from ggplot2), all of them maintained more or less voluntarily. When major changes are made, I can’t think of anything other than CRAN (or something like it) to help the system react and keep everything working. For instance, I found out that the new ggplot2 v2.3.0 would break ggseas via an automated email from the reverse-dependency tests that Hadley Wickham and others working on ggplot2 ran to identify problems. The RStudio community site was useful for pointing me (and others with similar issues) in the direction of a fix, and of course one uses GitHub or similar to manage issues and code version control, but in the end we need the centralised discipline of something like CRAN as the publication point and the definitive map of the dependencies. So, a big shout out to the volunteers who maintain CRAN and do an awesome job.

Reflections on testing

When a user pointed out a bug in the Tcomp package by raising an issue on GitHub, I was pleased with myself for using genuine test-driven development. That is, I first wrote a test that failed because of the bug:

test_that("series lengths are correct", {
  expect_equal(sum(sapply(tourism, function(s){length(s$x) - s$n})), 0)
})

…and then worked on finding the problem and fixing it upstream. It’s such an effective way to work, and it means that you have a growing body of tests. Packages like Tcomp and ggseas go for years without me doing anything to them, so when I do have to go back and make a change it’s very reassuring to have a bunch of tests (plus running all the examples, and performing the checks for CRAN) to be sure that everything still works the way it’s meant to.
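
For what it’s worth, that routine can be driven from R with devtools, along these lines (assuming the package source is the current working directory):

# run the test suite, the helpfile examples, and the full R CMD check used for CRAN
devtools::test()
devtools::run_examples()
devtools::check()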

There aren’t enough tests in the build process of those packages at the moment; the more experienced I get, the more I think “build more tests, and build them earlier” is one of the keys to success. This isn’t just true of code-driven projects either; I think there’s broad applicability in any professional situation, although developing and automating tests is harder when you’re not just dealing with code.

