Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
In this tutorial I will download Queensland population data from the Australian Bureau of Statistics (ABS), tidy it up a bit, and plot some graphs with ggplot
. The data in the spreadsheet are not in the most useful format so this tutorial will use tidyverse
tools to manipulate the data into something a bit more useful.
< !-- more -->
The objectives are to:
- Download the spreadsheet from the ABS,
- Extract data from the spreadsheet,
- Wrangle the data to get into into a useful format,
- Plot some of the data using
ggplot
.
The tools you will use are:
readxl
to selectively use data from a spreadsheet,stringr
to tidy up column names,kable
andkableExtra
to display data in a pretty tablelubridate
to change a date format,gather
andseparate
to make useful variables in the data set,ggplot
to make a line graph,wesanderson
colour themes for nice plot colours,- you will also see how to make a custom
ggplot
theme.
Load the required libraries. I use the wesanderson
package for picking manual colours in my plots.
library(tidyverse) library(lubridate) library(readxl) library(stringr) library(wesanderson) library(kableExtra)
Make a custom ggplot
theme
Making a theme is not too hard. You reaaly just have to type a ggplot
theme name at the console without the parentheses and copy what you need into a function. This is my custom ggplot
theme. It sets some ggplot
options so I don’t have to add them to the plotting code.
theme_marquess <- function(){ theme_minimal(base_size = 12, base_family="Avenir") %+replace% theme(rect = element_rect(fill = 'white', linetype = 0, colour = NA, size=NA), panel.background = element_blank(), panel.grid.minor.x = element_line(colour = NA, size = NA), panel.grid.minor.y = element_line(colour = 'gray90', size = 0.2), panel.grid.major.x = element_line(size = NA, colour = NA), panel.grid.major.y = element_line(size = 0.3, linetype = 'solid', colour = "gray80"), strip.background = element_rect(fill = "grey80", colour = "grey20"), # Optional - sets x and y axis lines # axis.line = element_line(colour="gray50", size=0.3, lineend="square"), # Optional - sets a border around the chart. Leaving it off provides a cleab 538 style chart # panel.border = element_rect(linetype = "solid", fill = NA, color='gray60', size=1), legend.key = element_rect(fill="transparent", colour=NA) ) }
Get the data.
The spreadsheet files are available on the ABS website. I am using TABLE 53. Estimated Resident Population By Single Year Of Age, Queensland. I tend to download the spreadsheet locally and read in the data with readxl
because the spreadsheets are periodically updated.
When I have downloaded the file to my working directory I read it in with the code below. There are a number of worksheets in the spreadsheet and we are only interested in two of them. One is the main data worksheet (Data1
), and the other is the Index
worksheet that I will use for column names.
The data worksheet is relatively straightforward. The range
argument allows a certain range of cells that have data to be loaded. This means eyeballing the spreadsheet in advance to identify what needs to be extracted. Obtaining the column names reads the Index
worksheet, specifies a range of cells to read, defines the data type in the cells (as text), and gives the tibble column a name.
qld <- read_excel(path = '3101053.xls', sheet = 'Data1', range = 'A10:GU57') columns <- read_excel(path = '3101053.xls', sheet = 1, range='A13:A214', col_types = 'text', col_names = 'Header')
Tidy up the data (part 1).
The data set you get is 47 rows and 203 columns of population data. The columns show the unit age population levels from males aged 1 years old and under to 100 years old and over. Following that is the same for females. So 101 columns for males and 101 for females, plus 1 column for years results in 203 columns.
The Index
worksheet that was used to extract strings to use as column headers is long and contains redundant information and punctuation. These can all be removed using the stringr
package as follows. Just remove what is not necessary and trim the whitespace as follows.
columns$Header <- str_remove(columns$Header, pattern = "Estimated Resident Population ; ") columns$Header <- str_remove(columns$Header, pattern = "\ ; ") columns$Header <- str_remove(columns$Header, pattern = "and over") columns$Header <- str_remove(columns$Header, pattern = ";") columns$Header <- str_trim(columns$Header, side = 'right') colnames(qld) <- c('Year', columns$Header)
Look at the data.
You can use the kable
and kableExtra
packages to make a pretty table to look at bits of the data. Obviously in a regular R script you would just output to console. However, I just discovered kableExtra
yesterday so I am keen to include the kable_styling
function to make a Bootstrap style table.
kable(qld[1:5, 1:5], "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year | Male 0 | Male 1 | Male 2 | Female 0 | Female 1 | Female 2 | Female 3 |
---|---|---|---|---|---|---|---|
1971-06-01 | 19091 | 18909 | 18480 | 18378 | 17975 | 17982 | 17511 |
1972-06-01 | 20106 | 19460 | 19239 | 19380 | 18669 | 18349 | 18172 |
1973-06-01 | 19586 | 20438 | 19999 | 18819 | 19785 | 19122 | 18844 |
1974-06-01 | 19281 | 19988 | 20898 | 18266 | 19234 | 20287 | 19654 |
1975-06-01 | 18738 | 19513 | 20275 | 17834 | 18583 | 19582 | 20681 |
Tidy up the data (part 2).
As you can see the data in wide and untidy. Each column has tow components, sex and age, that should be separated out. Also, the data would benefit from being converted to long format. The Year column is not formatted as a year, it is still in a date format. We can fix this with a short number of steps.
- Gather the data to long format using the columns with
Sex_Age
the key,Population
as the value, and ignore theYear
column. - Covert the
Sex_Age
columns to 2 columns,Sex
andAge
, using separate. - Convert the date to a year.
- Convert the data in the
Age
columnn to integers.
qld_long <- qld %>% gather(Sex_Age, Population, -'Year') qld_pop <- qld_long %>% separate(Sex_Age, into = c('Sex', 'Age'), sep = " ") qld_pop$Year <- year(qld_pop$Year) qld_pop$Age <- as.integer(qld_pop$Age)
Plot the data
Now we can begin to plot the data using ggplot
. There are many ways to slice the data and plot it, but to keep thing simple let’s just do a line plot of population count of males and females by year for each year of age, for a particular year.
We will use filter
to pick out a particular year to plot and make a line graph using geom_line
, and my theme that I included earier.
qld_pop %>% filter(Year == 2017) %>% ggplot(aes(x=Age, y=Population, color=Sex)) + geom_line(size=1.1, alpha=0.8) + labs(x = "Age", y = 'Number of people', title = "Population of Queensland in 2017 by sex and age.") + scale_color_manual(values=wes_palette("Royal1")) + theme_marquess()
Click the image below to enlarge.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.