Dual axes time series plots with various more awkward data

Posted on August 27, 2016 by Peter's stats stuff - R in R bloggers | 0 Comments

[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my most recent blog post I introduced the dualplot() R function, which allows you to create time series plots with two different scales on the vertical axes in a way that minimises the potential problems of misinterpretation. See that earlier post for a discussion of the pros and cons of the whole approach, which I won’t repeat here.

I’ve now made some minor enhancements:

improved handling of series that start at different times
fixed some minor problems with series that have different frequencies (eg monthly versus quarterly)
added a fairly sensible default choice of axes for when one or both of the series dip into negative territory
improved default legend and axis titles for situations when the data are columns in a data frame
improved x axis labels for when x is a an object with date or time characteristics

The main idea behind my first version was that the two series would be rendered on the screen as though they had first been converted to indexes (like the Consumer Price Index), and then have the original scales restored so the original absolute values can still be seen. The choice to be made then becomes which point to choose as the point to make the series use as their reference (eg the period at which they are defined as 1, 100 or 1000, if we were to leave them as indices).

When time series start at different times

Default to use first available “cross over point”

For the (common) situation that the two series you wish to plot start at different times, I decided it rarely made sense to have both series start at the same vertical position on the page at their respective earliest point. So I’ve set the default to make the two series appear as though their reference period is the first time period when both series have data. In effect, this makes the later series appear to grow out of the first series, and you can see its relative growth from that period onwards.

This is illustrated nicely in the first plot below, which shows how the share price of Fonterra Cooperative Group has performed performed relative to BHP Billiton. For non-Australasian readers, Fonterra (who’s historical origin is as a dairy farming cooperative) is one of the most important economic enterprises in New Zealand; and BHP is an Anglo-Australian multinational mining giant.

starts2

As well as seeing the relative growth since data on both series has become available, we can still see the approximate absolute values, eg just under $6 for a Fonterra Cooperative Group share.

Readers who have been paying attention will notice that in drawing this plot I appear to have broken one of my own rules on when to use a dual axes graphic, which was to not use two axes when the unit of both series is fundamentally the same. In this case, both series are showing dollars per share. I had a hard think about this and think that this is (marginally) a case where it is acceptable to use two different axes. While the units are the same (dollars per share), the magnitudes are sufficiently different that it is worthwhile, if we wish to compare growth paths (typically the issue of most interest with shares), to examine them on different scales.

Here’s the code to load that data from Yahoo Finance and draw the chart with my dualplot() function:

# Load up functionality
library(dplyr)
library(tidyr)
library(quantmod) # for getSymbols for share price data
library(ggseas)   # for nzbop data

# The dualplot() function:
source("https://gist.githubusercontent.com/ellisp/4002241def4e2b360189e58c3f461b4a/raw/9ab547bff18f73e783aaf30a7e4851c9a2f95b80/dualplot.R")     

#=================Different starting points==================
# Load some example data from the stock exchange
# Fonterra data only available from Yahoo for the past few years
fonterra <- getSymbols('FCG.NZ', src='yahoo', auto.assign = FALSE) 
bhp <- getSymbols('BHP.AX', src='yahoo', auto.assign = FALSE) 

# default - a cross over point happen at the earliest point of the shorter series
dualplot(x1 = time(bhp), y1 = bhp$BHP.AX.Close,
         x2 = time(fonterra), y2 = fonterra$FCG.NZ.Close, 
         ylab1 = "BHP", ylab2 = "Fonterra", 
         legx = "topright", main = "Price per share of two key firms")

Note that this is a common case where both series would have been converted to an index somehow in many graphics. The absolute value of share prices is not meaningful to most people so there is little lost in converting them to indices. However, they do mean something for people who are actively concerned with them, so there is value in my dual axis approach which retains the absolute value of the two series while preserving the non-misleading characteristics of the index-based approach.

Starting at the same vertical position

Sometime it might make sense, depending on the data and the purpose of the graphic, to start the later series at the same vertical height as the first series. This is equivalent visually to plots of indexes where the reference point for each series is the first time period of the series. The tweaked version of dualplot() lets the user specify this (or more complex) combinations of reference points.

starts3

# or can override eg - each one begins at the same vertical height on its earliest point
dualplot(x1 = time(bhp), y1 = bhp$BHP.AX.Close,
         x2 = time(fonterra), y2 = fonterra$FCG.NZ.Close, 
         ylab1 = "BHP", ylab2 = "Fonterra", 
         legx = "topright", 
         main = "Price per share of two key firms\n(starting at same vertical position)",
         ylim.ref = c(1, 1), silent = TRUE)

Converging to a common vertical position

… or you might want to adopt a teleological view of history and and see how growth in the past has led to the latest point:

starts4

In this case, the last two charts are almost identical, but that’s not always going to be the result. In this particular case, these two charts being similar just means the decline in BHP Billiton share price from the beginning of the chart to the end of the chart is roughly the same (a bit over 10%) as the decline in Fonterra Cooperative Group share price from the end of 2012 to the end of the chart.

dualplot(x1 = time(bhp), y1 = bhp$BHP.AX.Close,
         x2 = time(fonterra), y2 = fonterra$FCG.NZ.Close, 
         ylab1 = "BHP", ylab2 = "Fonterra", 
         legx = "topright", 
         main = "Price per share of two key firms\n(finishing at same vertical position)",
         ylim.ref = c(nrow(bhp), nrow(fonterra)))

Time series that have different frequencies

I’ve tested the dualplot() function more thoroughly against data where the two series have different frequencies. Here’s an example comparing a longer series of Air New Zealand share prices against New Zealand’s national exports of services (which includes Air New Zealand sales to foreigners).

We can see that both series took a hit from the Global Financial crisis with a dip in the years after 2008; but the Air New Zealand share price was impacted much more severely than were New Zealand service exports as a whole (which of course comprise multiple industries and firms).

To do this I used an old copy of New Zealand balance of payments data that is one of the example datasets in the ggseas R package for seasonal adjustment on the fly. The data are not meant to be up to date, they just illustrate superimposing quarterly data on daily data.

There’s nothing special about the call to dualplot() to do this so I used this as a chance to show some of the optional features of the function; such as drawing gridlines (white ones in this case) and passing arguments through to legend() to colour the space behind the legend white.

airnz <- getSymbols('AIR.NZ', src='yahoo', auto.assign = FALSE, from = "2000-01-01") 

services <- nzbop %>%
   filter(Category == "Services; Exports total") %>%
   filter(TimePeriod > as.Date("1997-12-30"))

dualplot(x1 = services$TimePeriod, y1 = services$Value, x2 = time(airnz), y2 = airnz$AIR.NZ.Close,
	 ylab1 = "New Zealand service exports ($m)\n", ylab2 = "Air New Zealand share price",
	 yleg1 = "All service exports ($m) (left axis)", lwd = c(4,2), col = c("rosybrown", "steelblue"),
	 main ="New Zealand service exports and Air New Zealand share price over time",
	 colgrid = "white", bty = "o", bg = "white", box.col = "white")

For a second example, let’s go back to Fonterra, and this time compare their daily share price to weekly movements in the the world whole milk dairy powder price which I sourced from Quandl.

Having the scales carefully chosen to mimic indices means that we can see immediately the lower level of volatility in the share price than the dairy price; something that would be hidden by other visualisation methods such as showing each series on its own facet with a free y axis chosen to use the full plotting area.

dairy <- Quandl("GDT/WMP_PI")

par(family = "myfont")
dualplot(x1 = dairy[ , 1], y1 = dairy[, 8], 
	 x2 = time(fonterra), y2 = fonterra$FCG.NZ.Close,
	 ylab1 = "Whole milk powder price index\n",
	 ylab2 = "Fonterra Cooperative Group\nshare price ($)",
	 col = colorspace::rainbow_hcl(2, c = 80, l = 50, start = 45, end = 270),
	 colgrid = "grey90",
	 main = "Fonterra share prices are less volatile than global milk prices")

Before I move on, a note on colour. Usually I wouldn’t use so many palettes in a single document for basically the same purpose (identification between two qualitative variables). The effect is (to me) somewhat jarring. But as I’m demonstrating use of a new graphics function some of the normal requirements of good aesthetics are suspended.

Negative data

When data goes negative, graphing an index rarely makes sense. Most indexes are of data that is strictly positive, such as share prices or the consumer price index. This creates a new problem of what scale to use for a dual axis plot, and there is no clear answer, which makes me inclined to doubt that such a graphic is often justified. However, to get at least a reasonable default, I’ve set it so that if one of the series is negative the graphic will choose two vertical scales such that the plotting area includes the mean of each series plus three standard deviations in each direction.

Here’s the code for that example usage with negative data; I struggled to think of a good (non-misleading) use case with negative data so had to simulate two series:

set.seed(123)
data1 <- data.frame(x = 1:100, y = arima.sim(list(ar = c(0.9)), 100))
data2 <- data.frame(x = 1:100, y = arima.sim(list(ar = c(0.9)), 100) * 50 - 3)

dualplot(x1 = data1$x, y1 = data1$y, x2= data1$x, y2 = data2$y)

But the dual axis approach has limitations

Finally, as an offshoot of preparing this blog post, here’s a use case that reminded me that dual axes plots are very often not going to be a good graphic. They are cluttered, and can be hard to interpret.

A graphic that does not superimpose two data series in the same plotting area is best in this case. After various attempts to relate the volume of Air New Zealand share sales with prices, I ended up with this as my best graphic:

airnz

The vertical lines connecting the bursts of high sales volumes in the lower plot to the price changes in the upper plot work far better than superimposing the series (cluttered and hard to see connections); un-annotated facets; or a connected scatter plot (of either the original or transformed data or growth rates).

Here’s the code behind that graphic:

airnz3 <- airnz %>%
   as.data.frame() %>%
   mutate(TimePeriod = time(airnz),
          PeakVol = AIR.NZ.Volume > 2 * 10 ^ 7) 

peaks <- filter(airnz3, PeakVol)$TimePeriod

airnz3 %>%
   select(TimePeriod, AIR.NZ.Close, AIR.NZ.Volume) %>%
   mutate(SquareRootOfVolume = sqrt(AIR.NZ.Volume)) %>%
   rename(ClosingPrice = AIR.NZ.Close) %>%
   gather(variable, value, -TimePeriod, -AIR.NZ.Volume) %>%
   ggplot(aes(x = TimePeriod, y = value)) +
   facet_wrap(~variable, scales = "free_y", ncol = 1) +
   geom_line() +
   geom_vline(xintercept = as.numeric(peaks), colour = "steelblue", size = 2.5, alpha = 0.1) +
   geom_line(colour = "brown") +
   labs(x = "", y = "", title = "Four big trading events for Air New Zealand shares since 2000")

`dualplot()`

Here’s the code for the latest version of dualplot; it’s a Gist on GitHub. The code above includes an example line to source it directly into an R program.

	dualplot <- function(x1, y1, y2, x2 = x1,
	col = c("#C54E6D", "#009380"),
	lwd = c(1, 1), colgrid = NULL,
	mar = c(3, 6, 3, 6) + 0.1,
	ylab1 = paste(substitute(y1), collapse = ""),
	ylab2 = paste(substitute(y2), collapse = ""),
	nxbreaks = 5,
	yleg1 = paste(gsub("\n$", "", ylab1), "(left axis)"),
	yleg2 = paste(ylab2, "(right axis)"),
	ylim1 = NULL, ylim2 = NULL, ylim.ref = NULL,
	xlab = "", main = NULL, legx = "topleft", legy = NULL,
	silent = FALSE, bty = "n", ...){
	# Base graphics function for drawing dual axis line plot.
	# Assumed to be two time series on a conceptually similar, non-identical scale
	#
	# Assumes data is in sequence of x1 and of x2 ie ordered by time
	#
	# Use with caution!
	# Please don't use to show growth rates and the original
	# series at the same time!
	#
	# Peter Ellis, 16-27 August 2016, GNU GPL-3
	# most parameters should be obvious:
	# x1 and y1 are the x and y coordinates for first line
	# x2 and y2 are the x and y coordinates for second line. Often x2 will == x1, but can be overridden
	# ylim1 and ylim2 are the vertical limits of the 2 axes. Recommended NOT to use these, as
	# the default algorithm will set them in a way that makes the axes equivalent to using an index (for
	# positive data) or mean of each series +/- 3 standard deviations (if data include negatives)
	# ylim.ref the two numbers in the two series to use as the reference point for converting them to indices
	# when drawing on the page. If both elements are 1, both series will start together at the left of the plot.
	# nbreaks is number of breaks in horizontal axis
	# lwd and mar are graphics parameters (see ?par)
	# colgrid is colour of gridlines; if NULL there are no gridlines
	# ylab1 and ylab2 are the labels for the two y axes
	# yleg1 and yleg2 are the labels for the two series in the legend
	# xlab and main are for x label and main title as in plot()
	# legx and legy are x and y position fed through to legend()
	# ... is parameters to pass to legend()
	# Note that default colours were chosen by colorspace::rainbow_hcl(2, c = 80, l = 50)

	# strip excess attributes (eg xts etc) from the two vertical axis variables
	ylab1 <- as.character(ylab1)
	ylab2 <- as.character(ylab2)
	y1 <- as.numeric(y1)
	y2 <- as.numeric(y2)

	# is ylim.ref is NULL, calculate a good default
	if(is.null(ylim.ref)){
	if (length(y1) == length(y2)){
	ylim.ref <- c(1, 1)
	} else {
	if (min(x1) > min(x2)){
	ylim.ref <- c(1, which(abs(x2 - min(x1)) == min(abs(x2 - min(x1)))))
	} else {
	ylim.ref <- c(which(abs(x1 - min(x2)) == min(abs(x1 - min(x2)))), 1)
	}
	}


	}


	oldpar <- par(mar = mar)
	xbreaks <- round(seq(from = min(c(x1, x2)), to = max(c(x1, x2)), length.out = nxbreaks))

	# unless ylim1 or ylim2 were set, we set them to levels that make it equivalent
	# to a graphic drawn of indexed series (if all data positive), or to the mean
	# of each series +/- three standard deviations if some data are negative
	if(is.null(ylim1) & is.null(ylim2)){
	if(min(c(y1, y2), na.rm = TRUE) < 0){
	message("With negative values ylim1 or ylim2 need to be chosen by a method other than treating both series visually as though they are indexed. Defaulting to mean value +/- 3 times the standard deviations.")
	ylim1 <- c(-3, 3) * sd(y1, na.rm = TRUE) + mean(y1, na.rm = TRUE)
	ylim2 <- c(-3, 3) * sd(y2, na.rm = TRUE) + mean(y2, na.rm = TRUE)
	}


	if(ylim.ref[1] > length(y1)){
	stop("ylim.ref[1] must be a number shorter than the length of the first series.")
	}
	if(ylim.ref[2] > length(y2)){
	stop("ylim.ref[2] must be a number shorter than the length of the second series.")
	}

	if(!silent) message("The two series will be presented visually as though they had been converted to indexes.")

	# convert the variables to indexes (base value of 1 at the time specified by ylim.ref)
	ind1 <- as.numeric(y1) / y1[ylim.ref[1]]
	ind2 <- as.numeric(y2) / y2[ylim.ref[2]]

	# calculate y axis limits on the "index to 1" scale
	indlimits <- range(c(ind1, ind2), na.rm = TRUE)

	# convert that back to the original y axis scales
	ylim1 = indlimits * y1[ylim.ref[1]]
	ylim2 = indlimits * y2[ylim.ref[2]]
	} else {
	if(!silent) warning("You've chosen to set at least one of the vertical axes limits manually. Up to you, but it is often better to leave it to the defaults.")
	}

	# draw first series - with no axes.
	plot(x1, y1, type = "l", axes = FALSE, lwd = lwd[1],
	xlab = xlab, ylab = "", col = col[1], main = main,
	xlim = range(xbreaks), ylim = ylim1)

	# add in the gridlines if wanted:
	if(!is.null(colgrid)){
	grid(lty = 1, nx = NA, ny = NULL, col = colgrid)
	abline(v = xbreaks, col = colgrid)
	}

	# add in the left hand vertical axis and its label
	axis(2, col = col[1], col.axis= col[1], las=1 ) ## las=1 makes horizontal labels
	mtext(paste0("\n", ylab1, "\n"), side = 2, col = col[1], line = 1.5)

	# Allow a second plot on the same graph
	par(new=TRUE)

	# Plot the second series:
	plot(x2, y2, xlab="", ylab="", axes = FALSE, type = "l", lwd = lwd[2],
	col = col[2], xlim = range(xbreaks), ylim = ylim2)

	## add second vertical axis (on right) and its label
	mtext(paste0("\n", ylab2, "\n"), side = 4, col = col[2], line = 4.5)
	axis(4, col = col[2], col.axis = col[2], las=1)

	# Draw the horizontal time axis
	axis(1, at = xbreaks, labels = xbreaks)

	# Add Legend
	legend(x = legx, y = legy, legend=c(yleg1, yleg2),
	text.col = col, lty = c(1, 1), lwd = lwd, col = col,
	bty = bty, ...)

	par(oldpar)
	}

view raw dualplot.R hosted with ❤ by GitHub

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Dual axes time series plots with various more awkward data

When time series start at different times

Default to use first available “cross over point”

Starting at the same vertical position

Converging to a common vertical position

Time series that have different frequencies

Negative data

But the dual axis approach has limitations

`dualplot()`

Related

When time series start at different times

Default to use first available “cross over point”

Starting at the same vertical position

Converging to a common vertical position

Time series that have different frequencies

Negative data

But the dual axis approach has limitations

dualplot()

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

`dualplot()`

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)