Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Histograms with R and ggplot2
Be honest. How uninspiring are your data visualizations? Expert designers make graph design look effortless, but in reality, that couldn’t be further from the truth. Luckily, the R programming language provides countless ways to make your visualizations eye-catching.
This article will show you how to make stunning histograms with R’s ggplot2 library. We’ll start with a brief introduction and theory behind histograms, just in case you’re rusty on the subject. You’ll then see how to create and tweak ggplot histograms, taking them to new heights.
Table of contents:
- What Is a Histogram?
- Make Your First ggplot Histogram
- How to Style and Annotate ggplot Histograms
- Add Text, Titles, Subtitles, Captions, and Axis Labels to ggplot Histograms
- Conclusion
What is a Histogram?
A histogram is a way to graphically represent the distribution of your data using bars of different heights. A single bar (bin) represents a range of values, and the height of the bar represents how many data points fall into the range. You can change the number of bins easily.
The easiest way to understand them is through visualization. The image below shows a histogram of 10,000 numbers drawn from a standard normal distribution (mean = 0, standard deviation = 1):
Although at first glance the histogram doesn’t look like much, it actually tells you a lot. When data is distributed normally (bell curve), you can draw the following conclusions:
- About 68.27% of the data points lie within 1 standard deviation of the mean (34.13% on either side).
- About 95.45% of the data points lie within 2 standard deviations of the mean (47.72% on either side).
- About 99.73% of the data points lie within 3 standard deviations of the mean (49.87% on either side).
- Anything more than 3 standard deviations away from the mean is often treated as an outlier.
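You can check these percentages yourself. The short sketch below uses base R’s pnorm(), the cumulative distribution function of the normal distribution, to compute the share of a standard normal distribution that falls within 1, 2, and 3 standard deviations of the mean (the variable names are ours):

```r
# Share of a standard normal distribution within k standard deviations of the mean
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)

# Express the shares as percentages, rounded to two decimals
round(100 * within_k, 2)
# 68.27 95.45 99.73
```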
In reality, you’re rarely dealing with a perfectly normal distribution. It’s usually skewed in either direction or has multiple peaks. Keep this in mind when drawing conclusions from the shape of a histogram alone.
Let’s see how you can use R and ggplot to visualize histograms.
Make Your First ggplot Histogram
We’ll use the Gapminder dataset throughout the article to visualize histograms. It’s a relatively small dataset showing life expectancy, population, and GDP per capita for countries between 1952 and 2007. We’ll use only a subset that shows countries in Europe and discard everything else.
Here’s the code you need to import libraries, load, and filter the dataset:
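A minimal sketch of that setup, assuming the dataset comes from the gapminder package on CRAN and that dplyr handles the filtering:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Keep only European countries and discard everything else
gm_eu <- gapminder %>%
  filter(continent == "Europe")
```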
Here’s what the first few rows of gm_eu look like:
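To print those rows yourself, head() is enough. The sketch below rebuilds gm_eu first so it runs on its own, again assuming the gapminder package as the data source:

```r
library(dplyr)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

# Show the first six rows: country, continent, year, lifeExp, pop, gdpPercap
head(gm_eu)
```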
We’ll visualize the lifeExp column with histograms, as it provides enough continuous data to play around with.
Let’s make the most basic ggplot histogram first. You can use the geom_histogram() function to do so, provided you’ve passed in the dataset and the default aesthetics:
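A minimal sketch of such a call, assuming gm_eu is the European subset of the gapminder package’s data:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

# The dataset and the x aesthetic go into ggplot(); geom_histogram() does the binning
p <- ggplot(gm_eu, aes(lifeExp)) +
  geom_histogram()
p
```

Expect ggplot2 to print a message suggesting you pick a bin count or bin width yourself. It’s a reminder, not an error.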
Well, you won’t see anything like that on a website or in a magazine, so we’d better get our keyboards dirty with some tweaking.
Let’s start by changing the number of bins (bars). The default value is 30, and it works in most cases. If you want your histograms to look boxier, use fewer bins. On the other hand, go big if you want your histograms to look like density plots. Here’s what a histogram with 10 bins looks like:
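The bin count is controlled by the bins argument of geom_histogram(). A minimal sketch, assuming the same gm_eu subset built from the gapminder package:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

# Fewer bins give wider, boxier bars
p <- ggplot(gm_eu, aes(lifeExp)) +
  geom_histogram(bins = 10)
p
```

If you’d rather think in units of the data, binwidth is the alternative: binwidth = 2 would give each bar a span of two years of life expectancy.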
Let’s stick with the default number of bins for the rest of the article, as it looks somewhat better.
The coloring is painful to look at. There’s nothing wrong with gray, but it looks too boring. Here’s how to enhance your ggplot histogram to give it some Appsilon flair: a blue fill color with black borders.
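A minimal sketch of that styling. The border is set with color and the bar interior with fill; the hex value #0099F8 is our assumption for the shade of blue, so swap in whatever matches your palette:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

# color = bar outline, fill = bar interior (the blue hex code is an assumed value)
p <- ggplot(gm_eu, aes(lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8")
p
```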
Much better, provided you like the blue color. Let’s dive deeper into styling and annotations next.
How to Style and Annotate ggplot Histograms
Styling
You can bring more life to your ggplot histogram. For example, we sometimes like to add a vertical line representing the mean, and two surrounding lines representing the range between -1 and +1 standard deviations from the mean. It’s a good idea to style the lines differently, just so your histogram isn’t confusing.
The following code snippet draws a black line at the mean, and dashed black lines at -1 and +1 standard deviation marks:
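One way to write it is sketched below. It assumes ggplot2 3.4 or newer, where line thickness is set with linewidth, and it computes the mean and standard deviation up front (life_mean and life_sd are our names) so each line is drawn once:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

# Compute the summary statistics once, outside the plot
life_mean <- mean(gm_eu$lifeExp)
life_sd <- sd(gm_eu$lifeExp)

p <- ggplot(gm_eu, aes(lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  # Solid line at the mean
  geom_vline(xintercept = life_mean, color = "#000000", linewidth = 1.25) +
  # Dashed lines one standard deviation below and above the mean
  geom_vline(
    xintercept = life_mean + c(-1, 1) * life_sd,
    color = "#000000", linewidth = 1, linetype = "dashed"
  )
p
```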
Are you up for a challenge? Try to recreate our histogram from Image 1. Hint: use geom_segment() instead of geom_vline().
Every so often you want to make your ggplot histogram richer by combining it with a density plot. It shows more or less the same information, just in a smoother format. Here’s how you can add a density plot overlay to your histogram:
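A minimal sketch of the overlay, assuming ggplot2 3.3 or newer for after_stat(). The histogram has to be rescaled from counts to densities so both layers share the same Y-axis; the orange fill and the transparency are assumed values:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

p <- ggplot(gm_eu, aes(lifeExp)) +
  # Plot densities instead of counts so the bars and the curve are comparable
  geom_histogram(aes(y = after_stat(density)), color = "#000000", fill = "#0099F8") +
  # Semi-transparent density curve on top (colors are assumptions)
  geom_density(color = "#000000", fill = "#F85700", alpha = 0.6)
p
```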
It’s a somewhat richer data representation than the histogram alone. For example, if you were to embed the above chart in a dashboard, you could let the user toggle the overlay for maximum customizability.
Do you want to build dashboards professionally? Here’s how to start a career as an R Shiny Developer.
Annotations
Finally, let’s see how you can add annotations to your ggplot histogram. Maybe you find vertical lines too intrusive, and you just want a plain textual representation of specific values.
First things first, you’ll need to create a data.frame for annotations. It should contain X and Y values, and also the labels that will be displayed:
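A minimal sketch that annotates the minimum, mean, and maximum life expectancy. The choice of statistics, the label text, and especially the Y positions are our assumptions; tune the Y values until the text sits where you want it:

```r
library(dplyr)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

annotations <- data.frame(
  # X positions: the values being annotated, rounded for display
  x = round(c(min(gm_eu$lifeExp), mean(gm_eu$lifeExp), max(gm_eu$lifeExp)), 2),
  # Y positions: hand-picked heights on the count axis (assumed values)
  y = c(5, 55, 5),
  label = c("Min:", "Mean:", "Max:")
)
annotations
```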
You can now include these in a geom_text() layer. Hint: make the annotations bold, so they’re easier to spot:
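A minimal sketch of that layer. It rebuilds a hypothetical annotations data frame (minimum, mean, and maximum, with assumed Y positions) so the block runs on its own, and passes it to geom_text() through the data argument:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

# Hypothetical annotation table; the y values are hand-picked
annotations <- data.frame(
  x = round(c(min(gm_eu$lifeExp), mean(gm_eu$lifeExp), max(gm_eu$lifeExp)), 2),
  y = c(5, 55, 5),
  label = c("Min:", "Mean:", "Max:")
)

p <- ggplot(gm_eu, aes(lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  # The text layer reads from its own data frame; fontface = "bold" makes it stand out
  geom_text(
    data = annotations,
    aes(x = x, y = y, label = paste(label, x)),
    size = 5, fontface = "bold"
  )
p
```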
The trick with annotations is making sure there’s some gap between them, so the text doesn’t overlap.
Let’s also see how you can remove the grayish background color. The easiest approach is to add a more minimalistic theme to the chart. theme_classic() is one of our top picks:
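A minimal sketch, assuming the same gm_eu subset and the blue fill used as an example color. Adding the theme is a single extra layer:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

p <- ggplot(gm_eu, aes(lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  # White background, no gridlines, plain axis lines
  theme_classic()
p
```

theme_minimal() and theme_bw() are other built-in options if you want to keep the gridlines.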
The only thing missing from our ggplot histogram is the title and axis labels. The users don’t know what they’re looking at without them.
Add Text, Titles, Subtitles, Captions, and Axis Labels to ggplot Histograms
Titles and axis labels are mandatory for production-ready charts. Subtitles or captions are optional, but we’ll show you how to add them as well. The magic happens in the labs() layer. You can use it to specify the values for title, subtitle, caption, X-axis, and Y-axis:
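A minimal sketch of a labs() layer. All of the strings are placeholder text of our choosing:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

p <- ggplot(gm_eu, aes(lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  # Placeholder strings; replace them with text that fits your chart
  labs(
    title = "Histogram of life expectancy in Europe",
    subtitle = "Gapminder data, 1952 to 2007",
    caption = "Source: Gapminder dataset",
    x = "Life expectancy",
    y = "Count"
  ) +
  theme_classic()
p
```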
It’s a good start, but the newly added elements don’t stand out. You can change the font, color, and size, among other things, in the theme() layer. Just make sure to add a complete theme such as theme_classic() before your theme() call. A complete theme replaces all theme settings, so your styles would get overridden otherwise:
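A minimal sketch showing the order of the two layers. The label strings and the specific colors, sizes, and font faces are assumed values for illustration:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gm_eu <- gapminder %>%
  filter(continent == "Europe")

p <- ggplot(gm_eu, aes(lifeExp)) +
  geom_histogram(color = "#000000", fill = "#0099F8") +
  labs(
    title = "Histogram of life expectancy in Europe",
    subtitle = "Gapminder data, 1952 to 2007",
    caption = "Source: Gapminder dataset",
    x = "Life expectancy",
    y = "Count"
  ) +
  # Complete theme first...
  theme_classic() +
  # ...then the custom tweaks, so they are not overridden
  theme(
    plot.title = element_text(color = "#0099F8", size = 16, face = "bold"),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic")
  )
p
```

Swap the last two layers and the title falls back to the plain theme_classic() styling, which is a quick way to see the override in action.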
It’s starting to shape up now. And it also matches the color palette of our ggplot histogram. We’ve covered everything needed to get you started visualizing your data distributions with histograms, so we’ll call it a day here. But there’s so much more you can do with your visualizations. Check out some of our Shiny demos to see where advanced level R programming can take your data visualizations.
Did you know there’s another way to visualize data distributions? Read our complete guide to boxplots.
Conclusion
Today you’ve learned what histograms are, why they are important for visualizing the distribution of continuous data, and how to make them appealing with R and the ggplot2 library. It’s enough to set you on the right track, and now it’s up to you to apply this knowledge to your datasets. We’re sure you can manage it.
At Appsilon, we’ve used histograms and the ggplot2 package in developing enterprise R Shiny dashboards for Fortune 500 companies. If R and R Shiny are something you have experience with, we might have a position ready for you.
Start a career at Appsilon — positions available.
Article How to Make Stunning Histograms in R: A Complete Guide with ggplot2 comes from Appsilon | End to End Data Science Solutions.