Hands-on R and dplyr – Analyzing the Gapminder Dataset
Exploratory Data Analysis With dplyr
When it comes to data analysis in R, you should look no further than the dplyr package. It’s an excellent all-rounder – providing you with extensive drill-down abilities while keeping the coding clean and minimal.
Are you completely new to R? Check out what you can do with the language.
Today you’ll learn how to do exploratory data analysis on the well-known Gapminder dataset. It contains historical (1952-2007) data on various indicators, such as life expectancy and GDP, for countries worldwide.
The article is structured as follows:
- Dataset Loading and Basic Exploration
- Data Summaries
- Creating Derived Variables and Testing Assumptions
- Advanced Analysis
- Conclusion
Dataset Loading and Basic Exploration
If you’re following along, you’ll need to have two packages installed – dplyr and gapminder. Once installed, you can import them with the following code:
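A minimal version looks like this (the install.packages() call is only needed the first time):

```r
# Install once if needed:
# install.packages(c("dplyr", "gapminder"))

library(dplyr)      # data manipulation verbs: filter(), group_by(), summarise(), ...
library(gapminder)  # exposes the dataset as the `gapminder` tibble
```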
A call to the head() function will show the first six rows of the dataset:
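The column set is fixed – country, continent, year, lifeExp, pop, and gdpPercap:

```r
head(gapminder)
# Returns the first six rows; the columns are:
# country, continent, year, lifeExp, pop, gdpPercap
```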
You now have everything loaded, which means you can begin with the analysis.
Let’s start with something simple. For example, let’s say you want the records for the United States for 1997, 2002, and 2007. To get these, you’ll have to filter the dataset by continent, country, and year. It can all be done in a single filter() function:
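A sketch of that filter – the exact snippet from the original post may differ slightly:

```r
gapminder %>%
  filter(
    continent == "Americas",       # exact match
    country == "United States",    # exact match
    year %in% c(1997, 2002, 2007)  # any of several values
  )
```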
Running the snippet keeps just three records – one for each of the listed years.
So, what happened here? As you can see, you can use the filter() function to keep only the records of interest. If you need an exact match, use the == sign. If multiple values match your search criterion, use the %in% operator. As simple as that.
Data Summaries
Summary statistics are a great starting point in any exploratory data analysis. They enable you to find a value that best describes a sample of data or a list of values that best represents each subset of the sample.
A simple average is a good place to start. Here’s how you can find the average life expectancy in the United States for 2007:
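A minimal sketch, with avgLifeExp as an illustrative name for the summary column:

```r
gapminder %>%
  filter(country == "United States", year == 2007) %>%
  summarise(avgLifeExp = mean(lifeExp))
```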
Since only one record matches – the United States in 2007 – the “average” is simply that year’s value.
Let’s take this a step further and calculate the average life expectancy per continent in 2007. You’ll need to use the group_by() function to do so:
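The same pattern, grouped by continent (avgLifeExp again being an illustrative name):

```r
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(avgLifeExp = mean(lifeExp))
```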
The output contains one row per continent – five in total.
If you’re anything like me, you’ll find the above information useful but not presented in the best way. We’re dealing with average life expectancy – meaning higher is better. With that in mind, it’s good practice to sort the results in descending order.
Let’s see how with a slightly different example. The code below sorts continents by their total population:
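A sketch under one assumption – the totals are computed for 2007 only, so each country is counted once; totalPop is an illustrative name. Note that pop is stored as an integer, so the sum is converted to numeric to avoid integer overflow:

```r
gapminder %>%
  filter(year == 2007) %>%                        # assumption: 2007 only
  group_by(continent) %>%
  summarise(totalPop = sum(as.numeric(pop))) %>%  # numeric sum avoids integer overflow
  arrange(desc(totalPop))                         # most populous first
```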
The output lists the continents from most to least populous, with Asia far ahead of the rest.
You now know how to calculate basic summary statistics – an essential part of any data analysis. Next, you’ll learn how to create derived columns and test assumptions.
Creating Derived Variables and Testing Assumptions
A derived column is one created by the analyst – usually by combining values from several existing columns. For example, you could calculate the total GDP of a country by multiplying GDP per capita by the country’s population.
Let’s do just that in code. The mutate() function is used to calculate derived columns, following the newColumn = your_calculation syntax:
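A minimal sketch, with totalGDP as an illustrative column name:

```r
gapminder %>%
  mutate(totalGDP = gdpPercap * pop)
```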
The derived column shows up alongside the original ones.
Let’s apply this knowledge to something useful – testing assumptions. We assume that higher GDP per capita leads to higher life expectancy. Keep in mind that we’re not doing formal hypothesis testing here – we’re just examining the results and eyeballing whether they support our assumption.
To test the assumption, you’ll calculate percentiles from the lifeExp column. This tells you what percentage of countries have a life expectancy identical to or lower than the current country’s:
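One way to sketch this is with dplyr’s cume_dist() window function, which matches the “identical or lower” description (percent_rank() is a close alternative); lifeExpPercentile is an illustrative name:

```r
gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpPercentile = cume_dist(lifeExp) * 100) %>%  # % of countries at or below
  select(country, gdpPercap, lifeExp, lifeExpPercentile) %>%
  arrange(desc(gdpPercap)) %>%  # richest countries first
  head(10)
```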
In the output, countries are sorted by GDP per capita, with their respective life-expectancy percentile on the right. All of them sit well above the median (50th percentile), with the lowest at the 68th percentile.
Before you can “verify” the above claim, you’ll have to look at the other end – are countries with the lowest GDP per capita located near the lowest percentiles?
All it takes is sorting the dataset in ascending order:
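Relative to the previous sketch, only the arrange() call changes:

```r
gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpPercentile = cume_dist(lifeExp) * 100) %>%
  select(country, gdpPercap, lifeExp, lifeExpPercentile) %>%
  arrange(gdpPercap) %>%  # poorest countries first
  head(10)
```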
Sure enough, the countries with the lowest GDP per capita sit near the lowest life-expectancy percentiles. Our assumption seems to hold – but once again, this wasn’t a formal hypothesis test, just a check of a simple assumption.
Advanced Analysis
The term “advanced” is a bit abstract in data analysis, to say the least. If you’re fluent in R and dplyr and have a couple of years of experience, there’s virtually nothing you can’t do, so nothing seems advanced. On the other hand, even basic filtering and aggregating can seem like a big deal if you’re just starting out.
For that reason, this section treats “advanced” as giving a complete answer to a more complicated question – one that requires chaining multiple operations.
For example, let’s say you have to find the top 10 countries in the 90th percentile of life expectancy in 2007. You can reuse some of the logic from the previous sections, but answering this question requires multiple filtering and subsetting steps:
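A sketch that chains it all together, reusing the illustrative lifeExpPercentile column:

```r
gapminder %>%
  filter(year == 2007) %>%                                  # first filter: the year
  mutate(lifeExpPercentile = cume_dist(lifeExp) * 100) %>%
  filter(lifeExpPercentile >= 90) %>%                       # second filter: 90th percentile and up
  top_n(10, wt = lifeExp) %>%                               # keep the best 10 by life expectancy
  arrange(desc(lifeExp))
```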
As you can see, the filter() function was used twice – the first time to select the year, and the second time to remove the records below the 90th percentile, since you’re only interested in the top 10. The top_n() function is used to select the best n records, arranged by the column specified in the wt argument.
The output lists the ten countries with the highest life expectancy in 2007.
But what if you had to calculate the opposite – the worst 10 countries, below the 10th percentile? The syntax is quite similar, except for the second filtering and the top_n() function, where n is prefixed with a minus sign:
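A sketch mirroring the previous one:

```r
gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpPercentile = cume_dist(lifeExp) * 100) %>%
  filter(lifeExpPercentile <= 10) %>%  # 10th percentile and below
  top_n(-10, wt = lifeExp) %>%         # minus sign: bottom 10 instead of top 10
  arrange(lifeExp)
```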
The minus prefix ensures the bottom 10 records are shown instead of the top 10.
And that’s just enough for today. Let’s wrap things up in the next section.
Conclusion
Today you’ve learned how to use the dplyr package for exploratory data analysis. The quality of an analysis depends largely on the quality of your questions, so make sure to ask the right ones first. If you know how to do that, the analysis itself shouldn’t be too much trouble.
If you want to learn more about data analysis and everything R-related, stay tuned to the Appsilon blog. Also, make sure to subscribe to our newsletter, so you never miss an update.
Learn More
- How to Analyze Data with R: A Complete Beginner Guide to dplyr
- Introduction to SQL: 5 Key Concepts Every Data Professional Must Know
- How to Make REST APIs with R: A Beginners Guide to Plumber
- Machine Learning with R: A Complete Guide to Linear Regression
- Machine Learning with R: A Complete Guide to Logistic Regression