An R tutorial to download and plot some Queensland population data

Home

4 years ago

[This article was first published on Home, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

In this tutorial I will download Queensland population data from the Australian Bureau of Statistics (ABS), tidy it up a bit, and plot some graphs with ggplot. The data in the spreadsheet are not in the most useful format so this tutorial will use tidyverse tools to manipulate the data into something a bit more useful. < !-- more -->

The objectives are to:

Download the spreadsheet from the ABS,
Extract data from the spreadsheet,
Wrangle the data to get into into a useful format,
Plot some of the data using ggplot.

The tools you will use are:

readxl to selectively use data from a spreadsheet,
stringr to tidy up column names,
kable and kableExtra to display data in a pretty table
lubridate to change a date format,
gather and separate to make useful variables in the data set,
ggplot to make a line graph,
wesanderson colour themes for nice plot colours,
you will also see how to make a custom ggplot theme.

Load the required libraries. I use the wesanderson package for picking manual colours in my plots.

library(tidyverse)
library(lubridate)
library(readxl)
library(stringr)
library(wesanderson)
library(kableExtra)

Make a custom `ggplot` theme

Making a theme is not too hard. You reaaly just have to type a ggplot theme name at the console without the parentheses and copy what you need into a function. This is my custom ggplot theme. It sets some ggplot options so I don’t have to add them to the plotting code.

theme_marquess <- function(){
  theme_minimal(base_size = 12, base_family="Avenir") %+replace%
    theme(rect = element_rect(fill = 'white', linetype = 0, colour = NA, size=NA),
          panel.background = element_blank(),
          panel.grid.minor.x = element_line(colour = NA, size = NA),
          panel.grid.minor.y = element_line(colour = 'gray90', size = 0.2),
          panel.grid.major.x = element_line(size = NA,  colour = NA),
          panel.grid.major.y = element_line(size = 0.3, linetype = 'solid', colour = "gray80"),
          strip.background = element_rect(fill = "grey80", colour = "grey20"),
          # Optional - sets x and y axis lines
          # axis.line = element_line(colour="gray50", size=0.3, lineend="square"),
          # Optional - sets a border around the chart. Leaving it off provides a cleab 538 style chart
          # panel.border = element_rect(linetype = "solid", fill = NA, color='gray60', size=1),
          legend.key = element_rect(fill="transparent", colour=NA)
    )
}

Get the data.

The spreadsheet files are available on the ABS website. I am using TABLE 53. Estimated Resident Population By Single Year Of Age, Queensland. I tend to download the spreadsheet locally and read in the data with readxl because the spreadsheets are periodically updated.

When I have downloaded the file to my working directory I read it in with the code below. There are a number of worksheets in the spreadsheet and we are only interested in two of them. One is the main data worksheet (Data1), and the other is the Index worksheet that I will use for column names.

The data worksheet is relatively straightforward. The range argument allows a certain range of cells that have data to be loaded. This means eyeballing the spreadsheet in advance to identify what needs to be extracted. Obtaining the column names reads the Index worksheet, specifies a range of cells to read, defines the data type in the cells (as text), and gives the tibble column a name.

qld <- read_excel(path = '3101053.xls',
                  sheet = 'Data1',
                  range = 'A10:GU57')

columns <- read_excel(path = '3101053.xls',
                      sheet = 1,
                      range='A13:A214',
                      col_types = 'text',
                      col_names = 'Header')

Tidy up the data (part 1).

The data set you get is 47 rows and 203 columns of population data. The columns show the unit age population levels from males aged 1 years old and under to 100 years old and over. Following that is the same for females. So 101 columns for males and 101 for females, plus 1 column for years results in 203 columns.

The Index worksheet that was used to extract strings to use as column headers is long and contains redundant information and punctuation. These can all be removed using the stringr package as follows. Just remove what is not necessary and trim the whitespace as follows.

columns$Header <- str_remove(columns$Header, pattern = "Estimated Resident Population ;  ")
columns$Header <- str_remove(columns$Header, pattern = "\ ; ")
columns$Header <- str_remove(columns$Header, pattern = "and over")
columns$Header <- str_remove(columns$Header, pattern = ";")
columns$Header <- str_trim(columns$Header, side = 'right')


colnames(qld) <- c('Year', columns$Header)

Look at the data.

You can use the kable and kableExtra packages to make a pretty table to look at bits of the data. Obviously in a regular R script you would just output to console. However, I just discovered kableExtra yesterday so I am keen to include the kable_styling function to make a Bootstrap style table.

kable(qld[1:5, 1:5], "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Year	Male 0	Male 1	Male 2	Female 0	Female 1	Female 2	Female 3
1971-06-01	19091	18909	18480	18378	17975	17982	17511
1972-06-01	20106	19460	19239	19380	18669	18349	18172
1973-06-01	19586	20438	19999	18819	19785	19122	18844
1974-06-01	19281	19988	20898	18266	19234	20287	19654
1975-06-01	18738	19513	20275	17834	18583	19582	20681

Tidy up the data (part 2).

As you can see the data in wide and untidy. Each column has tow components, sex and age, that should be separated out. Also, the data would benefit from being converted to long format. The Year column is not formatted as a year, it is still in a date format. We can fix this with a short number of steps.

Gather the data to long format using the columns with Sex_Age the key, Population as the value, and ignore the Year column.
Covert the Sex_Age columns to 2 columns, Sex and Age, using separate.
Convert the date to a year.
Convert the data in the Age columnn to integers.

qld_long <- qld %>% gather(Sex_Age, Population, -'Year')
qld_pop <- qld_long %>% separate(Sex_Age, into = c('Sex', 'Age'), sep = " ")
qld_pop$Year <- year(qld_pop$Year)
qld_pop$Age <- as.integer(qld_pop$Age)

Plot the data

Now we can begin to plot the data using ggplot. There are many ways to slice the data and plot it, but to keep thing simple let’s just do a line plot of population count of males and females by year for each year of age, for a particular year.

We will use filter to pick out a particular year to plot and make a line graph using geom_line, and my theme that I included earier.

qld_pop %>% filter(Year == 2017) %>%
  ggplot(aes(x=Age, y=Population, color=Sex)) +
  geom_line(size=1.1, alpha=0.8) +
  labs(x = "Age",
       y = 'Number of people',
       title = "Population of Queensland in 2017 by sex and age.") +
  scale_color_manual(values=wes_palette("Royal1")) +
  theme_marquess()

Click the image below to enlarge.

To leave a comment for the author, please follow the link and comment on their blog: Home.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.