Five ways to improve your chart axes
When it comes to crafting visualisations, people often put a lot of thought into what type of plot they’re going to make and what colour scheme they’re going to use. One thing that sometimes gets less attention than it should is the choice of axes. Many people rely on the default settings of whichever visualisation software they’re using. However, default settings are simply someone else’s choices: someone who hasn’t seen your data and doesn’t know what you are trying to communicate with your plot. That means the default settings aren’t always a good choice. A poor choice of axes can make a chart more difficult to understand and, in some cases, can suggest misleading conclusions. In this blog post, we’ll look at five ways to make better choices about your axes and stop relying on default settings – including attempting to answer the age-old question of “should the y-axis start at zero?” Spoiler alert: it depends!
Five tips for better axes
1. Don’t truncate your bar chart’s y-axis
Let’s start with that “should the y-axis start at zero?” question – specifically thinking about bar charts!
A key feature of bar charts is that the height of the bar represents the value of a variable. If one value is twice as big as another, its bar should be twice as tall. The only way that can be true is if the axis starts at zero.
Look at the difference in the examples below. On the left, with the axis starting from zero, you can see that the number of penguins on Dream Island is around 1.25 times the number on Biscoe Island. On the right, where the y-axis has been truncated and starts at 40, at first glance it looks like there are 4 times as many. You can see how this type of axis truncation can lead to misleading conclusions – whether deliberate or not!
Starting from 0 (left) and starting from 40 (right)
Show code: truncated y-axis
library(palmerpenguins)
library(ggplot2)
library(dplyr)
library(PrettyCols)

# Subset by species
plot_data <- penguins |>
  filter(species == "Adelie")

# Default plot
ggplot(data = plot_data) +
  geom_bar(mapping = aes(
    x = island,
    fill = island
  )) +
  scale_fill_pretty_d("Lively") +
  scale_y_continuous(
    breaks = seq(0, 60, by = 10),
    labels = seq(0, 60, by = 10),
    expand = c(0, 0),
    limits = c(0, 60)
  ) +
  labs(
    x = "", y = "",
    title = "Number of Adelie penguins per island"
  ) +
  theme(
    legend.position = "none",
    plot.background = element_rect(
      fill = "transparent",
      colour = "transparent"
    )
  )

# Truncated axis plot
offset <- 40
ggplot(data = plot_data) +
  geom_bar(mapping = aes(
    x = island,
    y = after_stat(count) - offset,
    fill = island
  )) +
  scale_fill_pretty_d("Lively") +
  scale_y_continuous(
    breaks = seq(0, 60, by = 10),
    labels = seq(0 + offset, 60 + offset, by = 10),
    expand = c(0, 0),
    limits = c(0, 60 - offset)
  ) +
  labs(
    x = "", y = "",
    title = "Number of Adelie penguins per island"
  ) +
  theme(
    legend.position = "none",
    plot.background = element_rect(
      fill = "transparent",
      colour = "transparent"
    )
  )
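The distortion described above can be checked with a little arithmetic. This is a minimal sketch in base R: the counts of 44 (Biscoe) and 56 (Dream) are assumed values that match the ratios quoted in the text, and bar_ratio() is a hypothetical helper written for this illustration rather than a function from any package.

```r
# Adelie penguin counts for two islands (assumed values, consistent with
# the ratios quoted in the text)
counts <- c(Biscoe = 44, Dream = 56)

# Ratio of the drawn bar heights (Dream relative to Biscoe) for a given
# axis baseline. bar_ratio() is a hypothetical helper for illustration.
bar_ratio <- function(counts, baseline = 0) {
  heights <- counts - baseline
  unname(heights["Dream"] / heights["Biscoe"])
}

true_ratio <- bar_ratio(counts)                      # axis starts at 0
truncated_ratio <- bar_ratio(counts, baseline = 40)  # axis starts at 40

round(true_ratio, 2)  # 1.27
truncated_ratio       # 4
```

The data haven’t changed between the two calls, only the baseline, yet the apparent ratio goes from roughly 1.27 to 4.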
Some people may say something along the lines of “the values on the axis are clearly labelled, and a reader can see it starts at 40 if they look at the chart”. This relies on people paying attention to details like the axis range when they look at your chart. When presented with a complex infographic, people might spend more time looking at the details and trying to understand it. However, bar charts are very simple to understand - so you can’t assume that everyone will come back for a second glance to double-check the axis limits. Design your bar charts for someone who is only going to give them a fleeting glance. That reader should come to the same conclusions as the people who do double-check the axis limits.
The argument often presented for truncating the axis is that the bars are of a similar height, and so it’s hard to see the differences between the bars if the axis starts from zero. My response to this reason for truncating the y-axis is - if the bars are actually similar, maybe that’s what you should be showing? Do the small deviations actually matter? If not, there’s no real reason to start from anything other than zero.
But what if the small deviations are important? One alternative is to show both: present a normal bar chart starting from zero, but also show a zoomed in version that allows a reader to distinguish the individual bar values more easily. In the example below, the normal bar chart is on the right hand side with a zoomed in version showing part of the axis on the left. The lines connecting the two versions highlight the truncated axis on the left, show which range of data the truncated axis relates to, and clearly link the two plots.
Showing a zoomed in version of the bar chart to provide both detail and context
Show code: zoomed in bar chart
library(ggforce)

# Uses plot_data from the previous example
ggplot(data = plot_data) +
  geom_bar(mapping = aes(
    x = island,
    fill = island
  )) +
  facet_zoom(ylim = c(40, 60)) +
  scale_fill_pretty_d("Lively") +
  scale_y_continuous(
    breaks = seq(0, 60, by = 10),
    labels = seq(0, 60, by = 10),
    expand = c(0, 0),
    limits = c(0, 60)
  ) +
  labs(
    x = "", y = "",
    title = "Number of Adelie penguins per island"
  ) +
  theme(
    legend.position = "none",
    plot.background = element_rect(
      fill = "transparent",
      colour = "transparent"
    )
  )
You can use a similar approach if one bar is much longer than the others, making it difficult to tell apart all of the small bars.
In R, you can create these zoomed in charts using facet_zoom() from {ggforce}. An alternative is the {ggmagnify} package.
2. Should the y-axis always start at zero?
Does the “axis should start at zero” rule hold for all types of charts? Let’s think about line charts for a second. We often use line charts to show how a variable changes over time. Look at the example below which shows the population of Germany over time - with the default axis on the left, and the axis starting from zero on the right:
Default axis (left) and starting from zero (right)
Show code: choosing a line chart range
library(gapminder)
library(ggplot2)
library(dplyr)

# Filter data
gm_de <- gapminder |>
  filter(country == "Germany")

# Default line chart
ggplot(data = gm_de) +
  geom_line(
    mapping = aes(x = year, y = pop / 1000000)
  ) +
  labs(
    x = "",
    y = "Population (millions)"
  ) +
  theme_minimal()

# Start from 0
ggplot(data = gm_de) +
  geom_line(
    mapping = aes(x = year, y = pop / 1000000)
  ) +
  labs(
    x = "",
    y = "Population (millions)"
  ) +
  expand_limits(y = 0) +
  theme_minimal()
When we extend the y-axis to include zero, the data becomes squashed at the top of the chart. The decrease that is clearly evident on the left is much more difficult to see on the right. Extending the range of the y-axis so far outside the range of the data flattens any trend - most lines will look flat if you squash them enough!
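One rough way to put a number on that squashing is to ask what fraction of the axis height the data actually occupies. The sketch below uses a hypothetical population series (illustrative values, not the gapminder figures), and axis_share() is a made-up helper that ignores the small padding most software adds around the limits.

```r
# Hypothetical population series in millions (illustrative values only)
pop <- c(69, 71, 74, 76, 79, 78, 78, 78, 81, 82)

# Fraction of the axis height occupied by the data when the axis runs
# from `lower` to max(x). axis_share() is a hypothetical helper.
axis_share <- function(x, lower = min(x)) {
  diff(range(x)) / (max(x) - lower)
}

share_default <- axis_share(pop)          # axis fitted to the data
share_zero <- axis_share(pop, lower = 0)  # axis extended to zero

share_default
round(share_zero, 2)
```

With these made-up numbers, extending the axis to zero leaves the line using only about a sixth of the vertical space, which is why the trend looks so much flatter.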
Whether or not you extend the y-axis to include zero will depend on your data. Ask yourself - is zero a plausible value? In the population example above, zero is not a plausible value so it doesn’t make sense to extend the axis to include it. Similar statements can be made about plots of temperature (in Celsius or Fahrenheit), GDP of a particular country, the number of bee colonies in the USA, and many other variables. In contrast, if you were plotting change in average temperature from the previous year, then it would make sense to extend the axis to include zero (even if no zeros are observed in the data). Whether or not a line chart needs a zero baseline will depend on the data - sometimes it does, sometimes it doesn’t.
What about extending the axis below zero? If negative values are possible, it’s fine to extend the axis range to be less than zero. If negative values are not possible, the axis shouldn’t extend to show these values.
3. Choose an appropriate range for your scatter plot
So if bar charts should always start at zero but line charts don’t always have to start at zero, what happens with scatter plots?
By default, most software scales the limits of the axes to the range of the data. Similar to line charts, whether or not you should change the range of the axis will hugely depend on the data you are plotting: both the range of possible values, and the range of values you have observed. For example, if you are plotting percentages on the y-axis and your data ranges from 2% to 98%, then it makes sense to plot your axis from 0 to 100: the range of possible values. Instead, if your data ranges from 2% to 4.5%, it may make more sense to plot your axis from 0% to 5%. It can also be useful to think about the break points of your axis when thinking about the limits. For example, an axis that goes up in steps of 100 is easier to read than an axis that goes up in steps of 123. This advice can also be applied for axes on other types of charts showing continuous variables.
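As a minimal sketch of the percentage example, here is one way to set the limits and break points explicitly in {ggplot2}. The data are simulated purely for illustration, and the names pct_data, pct_limits and pct_breaks are my own assumptions.

```r
library(ggplot2)

# Hypothetical percentages between 2% and 4.5% (simulated, not real data)
set.seed(42)
pct_data <- data.frame(
  x = 1:30,
  pct = runif(30, min = 2, max = 4.5)
)

# Limits just outside the observed range, with evenly spaced round breaks
pct_limits <- c(0, 5)
pct_breaks <- seq(0, 5, by = 1)

p_pct <- ggplot(data = pct_data) +
  geom_point(mapping = aes(x = x, y = pct)) +
  scale_y_continuous(
    limits = pct_limits,
    breaks = pct_breaks,
    # Append a percent sign to each break label
    labels = function(b) paste0(b, "%")
  ) +
  labs(x = "", y = "") +
  theme_minimal()
p_pct
```

If the data instead covered most of the possible range, swapping in limits of c(0, 100) and breaks of seq(0, 100, by = 25) would follow the same pattern.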
Think also about the direction of the axis - it’s most common for the lowest values to be at the bottom of the y-axis with values increasing as you move upwards. However, there may also be times when this direction should be reversed - for example, when plotting rankings. Values plotted further up the chart are naturally interpreted as better, but in ranking data the better results have lower values, so reversing the direction makes sense. See this Datawrapper blog post about plotting Olympic rankings for some examples.
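In {ggplot2}, reversing a continuous axis can be done with scale_y_reverse(). The sketch below uses hypothetical ranking data (the seasons and positions are made up) to put rank 1 at the top of the chart.

```r
library(ggplot2)

# Hypothetical league positions over five seasons (1 = best)
rank_data <- data.frame(
  season = 2019:2023,
  position = c(5, 3, 4, 2, 1)
)

p_rank <- ggplot(
  data = rank_data,
  mapping = aes(x = season, y = position)
) +
  geom_line() +
  geom_point() +
  # Reverse the y-axis so that position 1 sits at the top of the chart
  scale_y_reverse(breaks = 1:5) +
  labs(x = "", y = "Position") +
  theme_minimal()
p_rank
```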
It’s also important to think carefully about what you are trying to show with your scatter plot. Are you trying to determine if there is a linear relationship between the x and y variables? Are you trying to compare to some baseline?
Take the example of creating a scatter plot of residuals from a linear regression model. We often plot the fitted values on the x-axis, and the residuals on the y-axis. One of the things we’re looking for in the residual scatter plot is whether the points are scattered around the zero line. In the case of residual plots, the default settings can often result in a y-axis that isn’t symmetric about 0 (as in the below example on the left). This can make it harder to compare to 0, since points will naturally look scattered around the (non-zero) middle of the plot. Instead, adjusting the range of the y-axis to be symmetric around zero makes it easier to see if more points are above or below zero. Adding a zero-line makes it even easier for a reader.
Default y-axis range (left) and symmetric y-axis range (right)
Show code: adjusting scatter plot ranges
library(palmerpenguins)
library(ggplot2)
library(dplyr)
library(broom)

# Subset by species
plot_data <- penguins |>
  filter(species == "Adelie")

# Fit model
fit <- lm(body_mass_g ~ bill_length_mm,
  data = plot_data
)
fit_df <- augment(fit)

# Default scatter plot
ggplot(data = fit_df) +
  geom_point(mapping = aes(
    x = .fitted,
    y = .resid
  )) +
  labs(
    x = "Fitted values",
    y = "Residuals",
    title = "Residuals of linear model"
  ) +
  theme_minimal()

# Symmetric y-axis
ggplot(data = fit_df) +
  geom_hline(
    yintercept = 0,
    colour = "#D1495B"
  ) +
  geom_point(mapping = aes(
    x = .fitted,
    y = .resid
  )) +
  scale_y_continuous(
    limits = c(-1200, 1200),
    expand = c(0, 0)
  ) +
  labs(
    x = "Fitted values",
    y = "Residuals",
    title = "Residuals of linear model"
  ) +
  theme_minimal()
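The symmetric limits don’t have to be typed in by hand. As a minimal sketch, they can be derived from the largest absolute residual. To keep the example free of extra dependencies, it uses the built-in mtcars data as a stand-in for the penguins model, and the names resid_df and lim are my own.

```r
library(ggplot2)

# Fit a simple model to a built-in dataset (mtcars stands in for the
# penguins data so the sketch needs no extra packages)
fit <- lm(mpg ~ wt, data = mtcars)
resid_df <- data.frame(
  fitted = fitted(fit),
  resid = residuals(fit)
)

# Largest absolute residual, rounded up, gives a limit symmetric about zero
lim <- ceiling(max(abs(resid_df$resid)))

p_resid <- ggplot(data = resid_df) +
  geom_hline(yintercept = 0, colour = "#D1495B") +
  geom_point(mapping = aes(x = fitted, y = resid)) +
  scale_y_continuous(limits = c(-lim, lim)) +
  labs(x = "Fitted values", y = "Residuals") +
  theme_minimal()
p_resid
```

Computing the limit from the data means no point can fall outside the axis range if the model or data change later.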
Similar advice applies to Q-Q plots - using the same axis range for the x- and y-values makes the comparison to a diagonal line easier.
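For a Q-Q plot, one possible approach is to compute a single range covering both sets of quantiles and pass it to both axes. This is a sketch under a few assumptions: it uses standardised residuals from a model fitted to the built-in mtcars data, and the names qq_df and qq_lims are my own.

```r
library(ggplot2)

# Standardised residuals from a simple model on a built-in dataset
fit <- lm(mpg ~ wt, data = mtcars)

# Theoretical (x) and sample (y) quantiles, without drawing the base plot
qq <- qqnorm(rstandard(fit), plot.it = FALSE)
qq_df <- data.frame(theoretical = qq$x, sample = qq$y)

# One shared range, used for both axes
qq_lims <- range(c(qq_df$theoretical, qq_df$sample))

p_qq <- ggplot(data = qq_df) +
  geom_abline(slope = 1, intercept = 0, colour = "#D1495B") +
  geom_point(mapping = aes(x = theoretical, y = sample)) +
  # Equal limits and a 1:1 aspect ratio put the reference line at 45 degrees
  coord_fixed(xlim = qq_lims, ylim = qq_lims) +
  labs(x = "Theoretical quantiles", y = "Sample quantiles") +
  theme_minimal()
p_qq
```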
4. Alternatives to a dual y-axis
If you’re making a line chart and there are multiple variables you want to show, it’s common to plot multiple lines on one chart. But when the units of the variables are very different, it’s difficult for these variables to share one y-axis. What some people do is create a secondary axis on the right hand side of the plot with a different scale. However, there are several problems with this approach.
The choice of transformation for the secondary axis is entirely arbitrary but can hugely impact how the plot is interpreted. Take a look at the example below:
Dual axis line charts with different scaling factors
Show code: dual axis plots
library(gapminder)
library(ggplot2)
library(dplyr)

# Load data
gm_uk <- gapminder |>
  filter(country == "United Kingdom")

# Scaling of 250
coeff <- 250
ggplot(
  data = gm_uk,
  mapping = aes(x = year)
) +
  geom_line(
    mapping = aes(y = lifeExp),
    colour = "#4c7d96",
    linewidth = 1.5
  ) +
  geom_line(
    mapping = aes(y = gdpPercap / coeff),
    colour = "#EC9F05",
    linewidth = 1.5
  ) +
  scale_y_continuous(
    name = "Life Expectancy",
    sec.axis = sec_axis(~ . * coeff, name = "GDP per capita")
  ) +
  labs(x = "") +
  theme_minimal() +
  theme(
    axis.title.y = element_text(color = "#4c7d96", size = 13),
    axis.title.y.right = element_text(color = "#EC9F05", size = 13)
  )

# Scaling of 350
coeff <- 350
ggplot(
  data = gm_uk,
  mapping = aes(x = year)
) +
  geom_line(
    mapping = aes(y = lifeExp),
    colour = "#4c7d96",
    linewidth = 1.5
  ) +
  geom_line(
    mapping = aes(y = gdpPercap / coeff),
    colour = "#EC9F05",
    linewidth = 1.5
  ) +
  scale_y_continuous(
    name = "Life Expectancy",
    sec.axis = sec_axis(~ . * coeff, name = "GDP per capita")
  ) +
  labs(x = "") +
  theme_minimal() +
  theme(
    axis.title.y = element_text(color = "#4c7d96", size = 13),
    axis.title.y.right = element_text(color = "#EC9F05", size = 13)
  )
The only difference between these two plots is that one has a secondary axis scaling of 250 and the other 350 - there’s no difference in the values that are plotted. However, with this plot, a reader is naturally drawn to where the lines intersect. On the left, a reader may focus on the early 1980s. On the right, the late 1990s. But there’s no reason for them to be interested in either of these dates specifically. Much like truncating the y-axis of a bar chart, a particular axis scaling may mislead a reader to draw some conclusion. You might also choose a scaling where the lines do not intersect at all, but this has similar issues in looking at where the lines start to diverge or converge.
So what are the alternatives? There are a few options:
- The simplest approach is to plot the two variables on separate plots, each with their own axis, and place the plots side-by-side.
- If you are absolutely determined to create a dual axis plot, you can make it less misleading by directly labelling the data points. This means a reader won’t have to think about which axis relates to which line, and whether they’re looking at the correct scale.
- Plot different variables on the x- and y-axes. In the example above, we have 3 variables: time, life expectancy, and GDP. There’s no rule that says the only option is time on the x-axis and the other two on the y-axis. You could plot GDP on the x-axis and life expectancy on the y-axis. You could then create a connected scatter plot, or colour the points by time.
- Rescale the variables, rather than the axis. What are you trying to show by plotting the change in these two variables over time? Are you trying to show they change at the same rate, or different rates? One option may be to look at differences, rather than absolute values. For example, plot the change in each value compared to the first value in the data, as shown in the example below. Another, similar option may be to plot the year-on-year percentage differences.
Rescaling variables to plot both lines on the same chart
Show code: rescaling as a dual axis alternative
library(tidyr)

# Uses gm_uk (and the loaded {dplyr} and {ggplot2}) from the dual axis example

# Baseline values from the first year in the data
baseline <- gm_uk |>
  filter(year == 1952)

# Compute percentage change relative to 1952
plot_data4 <- gm_uk |>
  mutate(
    lifeExpPerc = 100 * (lifeExp - baseline$lifeExp) / baseline$lifeExp,
    gdpPercapPerc = 100 * (gdpPercap - baseline$gdpPercap) / baseline$gdpPercap
  ) |>
  select(year, gdpPercapPerc, lifeExpPerc) |>
  pivot_longer(-year)

# Plot
ggplot(
  data = plot_data4,
  mapping = aes(x = year)
) +
  geom_line(
    mapping = aes(y = value, colour = name),
    linewidth = 1.5
  ) +
  scale_colour_manual(
    name = "",
    values = c("#EC9F05", "#4c7d96"),
    breaks = c("gdpPercapPerc", "lifeExpPerc"),
    labels = c("GDP per capita", "Life expectancy")
  ) +
  labs(x = "", y = "Percentage change since 1952") +
  theme_minimal()
This approach does also have its limits, since for some variables (such as life expectancy in the above example) the percentage increase is capped, but for other variables it is not.
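Two of the other alternatives listed above, separate panels and a connected scatter plot, can be sketched in a similar way. The data frame below contains hypothetical values for illustration only (not the gapminder figures), and the names toy, toy_long, p_panels and p_connected are my own.

```r
library(ggplot2)

# Hypothetical values for illustration only (not the gapminder figures)
toy <- data.frame(
  year = seq(1980, 2000, by = 5),
  life_exp = c(74, 75, 76, 77, 78),
  gdp = c(18000, 21000, 23000, 26000, 29000)
)

# Long format: one row per year and variable (base R, no extra packages)
toy_long <- data.frame(
  year = rep(toy$year, times = 2),
  variable = rep(c("Life expectancy", "GDP per capita"), each = nrow(toy)),
  value = c(toy$life_exp, toy$gdp)
)

# Alternative 1: separate panels, each with its own y-axis
p_panels <- ggplot(
  data = toy_long,
  mapping = aes(x = year, y = value)
) +
  geom_line(linewidth = 1.5) +
  facet_wrap(~variable, scales = "free_y") +
  labs(x = "", y = "") +
  theme_minimal()
p_panels

# Alternative 2: connected scatter plot, with points coloured by time
p_connected <- ggplot(
  data = toy,
  mapping = aes(x = gdp, y = life_exp)
) +
  geom_path(colour = "grey60") +
  geom_point(mapping = aes(colour = year), size = 3) +
  labs(x = "GDP per capita", y = "Life expectancy", colour = "Year") +
  theme_minimal()
p_connected
```

In both versions each variable keeps its own honest scale, so there is no arbitrary scaling factor to choose and no artificial crossing point for a reader to latch on to.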
5. Alphabetical categories don’t often make sense
So far, we’ve only considered how to make better choices about numerical (continuous) axes. But, of course, that’s not the only type of axis we might have. How do we deal with categorical (discrete) axes? In the absence of further instructions, most visualisation software will do one of two things:
- plot the categories in alphabetical order; or
- plot the categories in the order they appear in the data.
There are certainly cases where each of these orderings will make sense. However, they’re often not the best choice. When it comes to deciding on the order of categories, start by asking yourself one question: is there a natural ordering of the categories? For example, days of the week, months of the year, age groups, indices of deprivation, and Low/Medium/High rankings all have a natural order. If there’s a natural order, that’s (almost always) the order that should be used in the plot.
For example, if you are plotting data relating to the days of the week, it’s confusing if they’re not arranged in the order they actually occur in. Which day of the week is the first category is yet another choice. In many parts of the world, Monday is a natural choice for the first category. This leaves the data for the weekend grouped together at the end, making it easier to see a weekday/weekend difference as well as individual day differences. Depending on your data, a different day of the week may make more sense.
Bar chart with weekdays in default alphabetical order (left) and in a more natural order (right)
Show code: sorting categories appropriately
library(ggplot2)

# Simulate data
# (1970-01-05 was a Monday, so dow runs Monday to Sunday;
# the day names depend on your locale)
dow <- weekdays(as.Date(4, "1970-01-01", tz = "GMT") + 0:6)
set.seed(1234)
plot_data5 <- data.frame(
  x = dow,
  y = c(runif(5, 40, 45), runif(2, 20, 30))
)

# Default order
ggplot(plot_data5) +
  geom_col(
    mapping = aes(x = x, y = y),
    fill = "#EC9F05"
  ) +
  labs(x = "", y = "Average time spent on public transport (minutes)") +
  theme_minimal()

# Natural order
ggplot(plot_data5) +
  geom_col(
    mapping = aes(x = factor(x, levels = dow), y = y),
    fill = "#EC9F05"
  ) +
  labs(x = "", y = "Average time spent on public transport (minutes)") +
  theme_minimal()
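If a different first day suits your data better, the factor levels can be rotated rather than retyped. This is a small base R sketch: start_week_on() is a hypothetical helper written for this post, and the day names are written out in English so the result doesn’t depend on your locale.

```r
# English day names written out so the result does not depend on locale
dow <- c(
  "Monday", "Tuesday", "Wednesday", "Thursday",
  "Friday", "Saturday", "Sunday"
)

# Rotate a vector of days so that `first` comes first.
# start_week_on() is a hypothetical helper, not from any package.
start_week_on <- function(days, first) {
  i <- match(first, days)
  if (is.na(i)) stop("`first` must be one of `days`")
  c(days[i:length(days)], days[seq_len(i - 1)])
}

sunday_first <- start_week_on(dow, "Sunday")

# Use the rotated order as factor levels, as in the natural order plot
x <- factor(c("Tuesday", "Sunday", "Friday"), levels = sunday_first)
levels(x)
sort(x)
```

The resulting factor can be passed straight to the x aesthetic in place of factor(x, levels = dow).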
What if there’s no natural ordering to the categories? There is an argument here for alphabetical ordering since it makes it easier for a reader to find the particular category they are interested in. However, this approach can make it more difficult to compare between categories - particularly if the categories have similar values, it can be hard to tell which one is larger if they are far apart on the plot.
Instead, ordering categories based on the value of the category (from largest to smallest or vice versa) makes it easier to compare categories, rank them, and see which is highest and lowest.
Bar chart with car models in default alphabetical order (left) and sorted by weight (right)
Show code: sorting categories by another variable
library(ggplot2)
library(tibble)

# Move the car names from row names into a column
mtcars2 <- mtcars |>
  rownames_to_column()

# Default order
ggplot(mtcars2) +
  geom_col(
    mapping = aes(x = wt, y = rowname),
    fill = "#5869C7"
  ) +
  labs(x = "Weight (1000 lbs)", y = "") +
  theme_minimal()

# Sorted order
ggplot(mtcars2) +
  geom_col(
    mapping = aes(x = wt, y = reorder(rowname, wt)),
    fill = "#5869C7"
  ) +
  labs(x = "Weight (1000 lbs)", y = "") +
  theme_minimal()
Though the examples shown here are bar charts, the same advice applies to any categorical axis - including heatmaps or multiple boxplots.
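As a sketch of the same idea for multiple boxplots, the categories can be ordered by a summary of each group, such as the median. This example uses the built-in InsectSprays dataset and base R’s reorder(); the names sprays, spray_sorted and p_box are my own.

```r
library(ggplot2)

# InsectSprays is a built-in dataset: insect counts for six sprays
sprays <- InsectSprays

# Order the spray categories by their median count rather than alphabetically
sprays$spray_sorted <- reorder(sprays$spray, sprays$count, FUN = median)

p_box <- ggplot(data = sprays) +
  geom_boxplot(
    mapping = aes(x = count, y = spray_sorted),
    fill = "#5869C7"
  ) +
  labs(x = "Insect count", y = "") +
  theme_minimal()
p_box
```

Which summary to sort by (median, mean, maximum) is itself a design choice, and the median is only one reasonable option for boxplots.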
The other element of plotting categorical variables is deciding whether the categories go on the x- or y- axis. If categorical variables are plotted on the x-axis, the category labels can end up looking squashed, overlapping, or being printed with text that’s too small. The solution of rotating the text by 45 or 90 degrees only gives your readers a sore neck. For this reason, it makes more sense to put categorical variables on the y-axis. An exception is when you are plotting categories related to time e.g. data collected over months of the year - it’s much more common to put time on the x-axis.
Closing thoughts
I hope this blog post has illustrated that the choice of axes can have a big impact on the clarity of your visualisation, and that relying on the default settings of software isn’t always a good idea. Are there blanket rules about axes that you can apply to every type of visualisation? No. Instead, I’d advocate for having a little bit of common sense and actively thinking about your design choices. Think about the context of what you’re trying to communicate, and whether or not your data visualisation of choice communicates that effectively and honestly.
This blog post hasn’t covered everything you need to make a good choice of axis - we’ve mainly focused on choosing the axis scale. You also need to think about axis titles, tick marks, grid lines, units, orientation, aspect ratios, break points, and labels. The following resources provide some information about these topics.
Further resources
- The Royal Statistical Society’s Best Practices for Data Visualisation has a section on axes, including information on transformations of axes (such as logarithmic axes): rss.org.uk/datavisguide. Note that I co-authored this guide, so I might be a little bit biased…
- The Office for National Statistics also has a data visualisation guide that includes a section on axes. It has further information on tick marks and grid lines, including some advice for designing charts for viewing on mobile devices: service-manual.ons.gov.uk/data-visualisation/guidance/axes-and-gridlines
- Claus Wilke’s Fundamentals of Data Visualization book has a section on Proportional Ink which explains why bar charts need to start at zero and how this relates to different axis transformations: clauswilke.com/dataviz/proportional-ink
Happy plotting!