Data Visualization in R with ggplot2: A Beginner Tutorial

[This article was first published on R tutorial – Dataquest, and kindly contributed to R-bloggers.]

A famous general is thought to have said, “A good sketch is better than a long speech.” That advice may have come from the battlefield, but it’s applicable in lots of other areas — including data science. “Sketching” out our data by visualizing it using ggplot2 in R is more impactful than simply describing the trends we find.

Sketching out the design for a house communicates much more clearly than trying to describe it with words. The same thing is often true for data — and that’s where data visualization with ggplot2 comes in!

This is why we visualize data: it’s easier to learn from something we can see than from something we have to read. And thankfully for data analysts and data scientists who use R, there’s a tidyverse package called ggplot2 that makes data visualization a snap!

In this blog post, we’ll learn how to take some data and produce a visualization using R. To work through it, it’s best if you already have an understanding of R programming syntax, but you don’t need to be an expert or have any prior experience working with ggplot2.

Introducing the Data

The National Center for Health Statistics has been tracking United States mortality trends since 1900. They’ve compiled data on life expectancy and death rate of United States citizens.

We would like to know how life expectancy has been changing through time. With advances in medicine and technology, we would expect that life expectancy would be increasing, but we won’t know for sure until we have a look!

If you’d like to reproduce the graphs we’ll create in this blog post, download the data set here and follow along! 

(Not sure how you can work with R on your personal computer? Check out how to get started with RStudio!)

What’s in a Graph?

Before we dive into the post, some context is needed. There are many types of visualizations out there, but most of them will boil down to the following:

We can break down this plot into its fundamental building blocks:

  1. The data used to create the plot.

  2. The axes of the plot.

  3. The geometric shapes used to visualize the data. In this case, a line.

  4. The labels or annotations that will help a reader understand the plot.

Breaking down a plot into layers is important because it is how the ggplot2 package understands and builds a plot. The ggplot2 package is one of the packages in the tidyverse, and it is responsible for visualization. As you continue reading through the post, keep these layers in mind.

Importing the Data

In order to start on the visualization, we need to get the data into our workspace. We’ll load the tidyverse packages and use the read_csv() function to import the data. Our file is named life_expec.csv; if you saved the download under a different name, change the file name in the code to match.

library(tidyverse)
life_expec <- read_csv("life_expec.csv")

Let’s see what data we’re working with:

colnames(life_expec)
[1] "Year"               "Race"               "Sex"                "Avg_Life_Expec"     "Age_Adj_Death_Rate"

We can see that time is encoded in terms of years via the Year column. There are two columns that allow us to distinguish between different race and sex categories. Finally, the last two columns correspond to life expectancy and death rate.
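As a complement to colnames(), glimpse() prints each column’s type next to its first few values. The sketch below is only an illustration and makes a few assumptions: so that it runs without the real download, it writes three of the year-2000 rows printed later in this post to a temporary file and reads them back. The names sample_path and sample_data are ours, not part of the tutorial’s data set.

```r
library(readr)  # read_csv()
library(dplyr)  # glimpse()

# Write a tiny stand-in CSV to a temporary file so the example runs
# without the real download. Values are copied from the year-2000 rows
# printed in this post (death rates as rounded in that printout).
sample_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Year,Race,Sex,Avg_Life_Expec,Age_Adj_Death_Rate",
    "2000,All Races,Both Sexes,76.8,869",
    "2000,All Races,Female,79.7,731",
    "2000,All Races,Male,74.3,1054"
  ),
  sample_path
)

# show_col_types = FALSE silences the column-specification message
sample_data <- read_csv(sample_path, show_col_types = FALSE)

# One line per column: name, type, and the first few values
glimpse(sample_data)
```

With the real file, the same glimpse() call on life_expec should make it easy to confirm that Year and the two measurement columns were read as numbers rather than text.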

Let’s have a quick look at the data to see what it looks like for one particular year:

life_expec %>%
  filter(Year == 2000)

For the year 2000, there are nine data points:

## # A tibble: 9 x 5
##    Year Race      Sex        Avg_Life_Expec Age_Adj_Death_Rate
##   <dbl> <chr>     <chr>               <dbl>              <dbl>
## 1  2000 All Races Both Sexes           76.8               869
## 2  2000 All Races Female               79.7               731.
## 3  2000 All Races Male                 74.3              1054.
## 4  2000 Black     Both Sexes           71.8              1121.
## 5  2000 Black     Female               75.1               928.
## 6  2000 Black     Male                 68.2              1404.
## 7  2000 White     Both Sexes           77.3               850.
## 8  2000 White     Female               79.9               715.
## 9  2000 White     Male                 74.7              1029.

One year has nine different rows, each one corresponding to a different demographic division. For this visualization, we’ll focus on the United States overall, so we’ll need to filter the data down accordingly:

life_expec <- life_expec %>%
  filter(Race == "All Races", Sex == "Both Sexes")
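A quick sanity check after a filter like this is to count rows per year: if the filter did its job, every year should appear exactly once. The sketch below rebuilds only the nine year-2000 rows shown above (the rows_2000 and overall names are ours, and the death-rate column is left out for brevity), so it illustrates the check rather than running it on the full file.

```r
library(dplyr)

# The nine year-2000 rows shown above (death-rate column omitted for brevity)
rows_2000 <- tibble(
  Year = 2000,
  Race = rep(c("All Races", "Black", "White"), each = 3),
  Sex = rep(c("Both Sexes", "Female", "Male"), times = 3),
  Avg_Life_Expec = c(76.8, 79.7, 74.3, 71.8, 75.1, 68.2, 77.3, 79.9, 74.7)
)

# Same filter as in the tutorial
overall <- rows_2000 %>%
  filter(Race == "All Races", Sex == "Both Sexes")

# Count rows per year; after the filter each year should appear once
overall %>%
  count(Year)
```

On the full data set, any year with a count other than 1 would be worth a closer look before plotting.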

The data is in a good place, so we can pipe it into a ggplot() function to begin creating a graph. We use the ggplot() function to indicate that we want to create a plot.

life_expec %>%
  ggplot()

This code produces a blank graph (as we see below). But it now “knows” to use the life_expec data, even though we don’t see it charted yet.
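To see what “knows” means here, we can inspect the object that ggplot() returns. This is a minimal sketch that assumes a stand-in data set: economics, which ships with ggplot2, is used in place of life_expec so the code runs without the download. The blank_plot name is ours.

```r
library(ggplot2)
library(dplyr)

# Stand-in for life_expec: `economics` is a data set bundled with ggplot2
blank_plot <- economics %>%
  ggplot()

# The plot object carries the data ...
nrow(blank_plot$data)

# ... but has no layers yet, which is why nothing is drawn
length(blank_plot$layers)
```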

Building the Axes

Now that we’ve prepared the data, we can start building our visualization. The next layer that we need to establish is the axes. We are interested in looking at how life expectancy changes with time, so this indicates what our two axes are: Year and Avg_Life_Expec.

In order to specify the axes, we need to use the aes() function. aes is short for “aesthetic”, and it is where we tell ggplot what columns we want to use for different parts of the plot. We are trying to look at life expectancy through time, so this means that Year will go to the x-axis and Avg_Life_Expec will go to the y-axis.

life_expec %>%
  ggplot(aes(x = Year, y = Avg_Life_Expec))

With the addition of the aes() function, the graph now knows what columns to attribute to the axes:

But notice that there’s still nothing on the plot! We still need to tell ggplot() what kind of shapes to use to visualize the relationships between Year and Avg_Life_Expec.
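The same pattern works for any pair of columns, and the mapping can be inspected directly. As a small sketch, the code below uses only the year-2000 “All Races” rows shown earlier and puts a categorical column, Sex, on the x-axis. The all_races_2000 and axes_only names are ours.

```r
library(ggplot2)

# Year-2000 rows for "All Races", copied from the output shown earlier
all_races_2000 <- data.frame(
  Sex = c("Both Sexes", "Female", "Male"),
  Avg_Life_Expec = c(76.8, 79.7, 74.3)
)

# Same pattern as above, but with a categorical column on the x-axis
axes_only <- ggplot(all_races_2000, aes(x = Sex, y = Avg_Life_Expec))

# Printing the mapping shows which column feeds which aesthetic
axes_only$mapping

# No geom has been added, so there are still no layers to draw
length(axes_only$layers)
```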

Specifying Geoms

When we think of visualizations, we usually think about the type of graph first, since the shapes we see carry most of the information. While the ggplot2 package gives us a lot of flexibility in choosing a shape to draw the data, it’s worth taking some time to consider which one is best for our question.

We are trying to visualize how life expectancy has changed through time. This means that there should be a way for us to compare earlier years directly with later ones. In other words, we want a shape that helps show a relationship between two consecutive years. For this, a line graph is great.

To create a line graph with ggplot(), we use the geom_line() function. A geom is the name for the specific shape that we want to use to visualize the data. All of the functions that are used to draw these shapes have geom in front of them. geom_line() creates a line graph, geom_point() creates a scatter plot, and so on.

life_expec %>%
  ggplot(aes(x = Year, y = Avg_Life_Expec)) +
  geom_line()

Notice how after the use of the ggplot() function, we start to add more layers to it using a + sign. This is important to note because we use %>% to tell ggplot() what data to use. After using ggplot(), we use + to add more layers to the plot.
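The pipe is a convenience rather than a requirement: ggplot() takes the data as its first argument, so the piped and unpiped forms should build the same plot. The sketch below checks this, again assuming the economics data set bundled with ggplot2 as a stand-in for life_expec. The piped and direct names are ours.

```r
library(ggplot2)
library(dplyr)

# Stand-in data: `economics` ships with ggplot2 (monthly US figures)
piped <- economics %>%
  ggplot(aes(x = date, y = unemploy)) +
  geom_line()

# Equivalent call without the pipe: the data is the first argument
direct <- ggplot(data = economics, mapping = aes(x = date, y = unemploy)) +
  geom_line()

# layer_data() returns the values a layer will draw; both plots should agree
all.equal(layer_data(piped)$y, layer_data(direct)$y)
```

Either way, everything after the ggplot() call is joined with +, never with %>%.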

This graph is exactly what we were looking for! Having a look at the general trend, life expectancy has grown over time.
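Because geoms are just layers, swapping or stacking them is a one-line change. The sketch below reuses one set of axes with three different geom combinations. It assumes the economics data set bundled with ggplot2 as a stand-in time series, and the variable names are ours.

```r
library(ggplot2)

# Data and axes defined once, with no geom yet
base_plot <- ggplot(economics, aes(x = date, y = unemploy))

# Same axes, different shapes
line_plot <- base_plot + geom_line()
point_plot <- base_plot + geom_point()

# Geoms can also be stacked: a line with small points on top
both_plot <- base_plot + geom_line() + geom_point(size = 0.5)

# Printing a plot object draws it
both_plot
```

For data ordered in time, the line usually reads best on its own; points become useful when individual observations matter.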

We could stop the plot here if we were just looking at the data quickly, but this is rarely the case. More common is that you’ll be creating a visualization for a report or for others on your team. In this case, the plot is not complete: if we were to give it to a teammate with no context, they wouldn’t understand the plot. Ideally, all of your plots should be able to explain themselves through the annotations and titles.

Adding a Title and Axis Labels

Currently the graph keeps the column names as the labels for both of the axes. This is sufficient for Year, but we’ll want to change the y-axis label. In order to change the axis labels for a plot, we can use the labs() function and add it as a layer onto the plot. labs() can change both the axis labels and the title, so we’ll set both here.

life_expec %>% # data layer
  ggplot(aes(x = Year, y = Avg_Life_Expec)) + # axes layer
  geom_line() + # geom layer
  labs(  # annotations layer
    title = "United States Life Expectancy: 100 Years of Change",
    y = "Average Life Expectancy (Years)"
  )

Our final polished graph is:
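Beyond title and y, labs() also accepts x, subtitle, and caption, which can help a plot explain itself when it travels without surrounding text. Here is a brief sketch, assuming the economics data set bundled with ggplot2 as a stand-in. The label text and the labeled_plot name are ours.

```r
library(ggplot2)

labeled_plot <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(
    title = "Unemployment in the United States",
    subtitle = "Monthly figures",                 # shown under the title
    x = "Date",
    y = "Unemployed (thousands)",
    caption = "Data: economics data set bundled with ggplot2"  # bottom of plot
  )

# Printing the object draws the labeled plot
labeled_plot
```

For the life expectancy plot, a caption would be a natural place to credit the National Center for Health Statistics as the data source.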

Conclusion: ggplot2 is Powerful!

In only a few lines of code, we produced a visualization that clearly shows how life expectancy has changed for the general population in the United States. Visualization is an essential skill for all data analysts, and R makes it easy to pick up.

Check out our Data Analyst in R path if you’re interested in learning more! The Data Analyst in R path includes a course on data visualization in R using ggplot2, where you’ll learn how to:

  • Visualize changes over time using line graphs.
  • Use histograms to understand data distributions.
  • Compare graphs using bar charts and box plots.
  • Understand relationships between variables using scatter plots.

Christian Pascual

Christian is currently a student at the University of California San Diego pursuing a PhD in Biostatistics.

The post Data Visualization in R with ggplot2: A Beginner Tutorial appeared first on Dataquest.
