Quick Overview
Exploring-Data is a place where I share easily digestible content aimed at making the wrangling and exploration of data more efficient (+fun).
Sign up Here to join the many other subscribers who also nerd out on new tips and tricks.
And if you enjoy the post, be sure to share it.
Business Science
Recently, I completed the Data Science for Business 101 course over at Business Science University. In the course, Matt Dancho teaches students the fundamentals of data science for business with the tidyverse.
The course is jam-packed with material, from basic data wrangling all the way to applied machine learning. I highly recommend it to anyone looking to advance their skills.
Click this link to access the course.
I’ve been tracking down data then applying the techniques to help solidify concepts from the course. One of my favorite parts from Week 1 is turning a generic ggplot() into something that is Business Ready.
In this post I’ll show you how to upgrade your plots in R so that they are Business-Ready.
The Final Plot
This is the plot that we will recreate in the post – it’s crisp, clean, and Business-Ready.
Let’s get started.
Load our Libraries
library(tidyverse) # Work-Horse Package
library(tidyquant) # Business Ready Plots
library(scales)    # Scaling Data for Plots
Let’s Get Some Data
These are Census data that I got here: link to data.
The original data was 4M+ rows and so I’ve already filtered it down a bit.
# Import Data
edu_census_data_raw_tbl <- read_csv("../../static/01_data/edu_census_data.csv")

# Glimpse Data
edu_census_data_raw_tbl %>% glimpse()

## Rows: 228,737
## Columns: 5
## $ name     <chr> "United States", "United States", "United States", "United S…
## $ type     <chr> "nation", "nation", "nation", "nation", "nation", "nation", …
## $ year     <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ variable <chr> "percent_bachelors_degree_or_higher_rank", "percent_graduate…
## $ value    <dbl> 1.0, 10.8, 28.8, 11.1, 1.0, 86.7, 5.8, 18.0, 27.8, 28.1, 8.5…
Filter Data for Plotting
We want to compare educational attainment statistics for the County of Los Angeles against the rest of the Nation – first, let’s do a bit of filtering to get just the data needed for our plot.
# Setup variables: for filter + for using in plot later
year_f <- 2018
nation <- "United States"
county <- "Los Angeles County"

# Data Prep
edu_census_filtered_tbl <- edu_census_data_raw_tbl %>%

  # Filter data to year and areas of interest
  filter(year == year_f,
         name == nation | # OR
           str_detect(name, county))

# View data
edu_census_filtered_tbl

## # A tibble: 10 x 5
##    name                      type    year variable                         value
##    <chr>                     <chr>  <dbl> <chr>                            <dbl>
##  1 United States             nation  2018 percent_less_than_9th_grade        5.3
##  2 United States             nation  2018 percent_high_school_graduate_or…  87.7
##  3 United States             nation  2018 percent_bachelors_degree_or_hig…  31.5
##  4 United States             nation  2018 percent_associates_degree          8.4
##  5 United States             nation  2018 percent_graduate_or_professiona…  12.1
##  6 Los Angeles County, Cali… county  2018 percent_less_than_9th_grade       12.6
##  7 Los Angeles County, Cali… county  2018 percent_high_school_graduate_or…  78.7
##  8 Los Angeles County, Cali… county  2018 percent_bachelors_degree_or_hig…  31.8
##  9 Los Angeles County, Cali… county  2018 percent_associates_degree          7
## 10 Los Angeles County, Cali… county  2018 percent_graduate_or_professiona…  11.1
The 10 x 5 table is exactly what we need to create our first plot.
Making a visualization is a great way to get a few insights in the process of better understanding your data.
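The filter step mixes an exact match (the nation) with a partial match (the county). Here is a minimal sketch of that pattern on a tiny made-up tibble; `toy_tbl` and its rows are hypothetical stand-ins, not the census file used in this post.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data (NOT the census data from the post)
toy_tbl <- tibble(
  name = c("United States",
           "Los Angeles County, California",
           "Orange County, California",
           "United States"),
  year = c(2018, 2018, 2018, 2013)
)

toy_filtered_tbl <- toy_tbl %>%
  # Conditions separated by a comma are combined with AND;
  # the | operator inside the second condition is an OR
  filter(year == 2018,
         name == "United States" | str_detect(name, "Los Angeles County"))

toy_filtered_tbl
```

Only the two 2018 rows for the nation and Los Angeles County should survive. `str_detect()` is used for the county because the full name in the data also carries the state.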
Generic ggplot()
The awesomeness of ggplot() is that we can rapidly produce a plot with just a couple of lines of code, which means we can quickly get insights that will help determine the next steps in exploring these data further.
The stacked bar-chart below is a great starting place.
edu_census_filtered_tbl %>%
  ggplot(aes(x = variable, y = value, fill = name)) +
  geom_col()
We can immediately see that ‘some’ differences exist but it’s difficult to get a sense of the magnitude. It’s also difficult to make out the names of the variables on the x-axis.
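Part of the reason the magnitudes are hard to read is that geom_col() stacks the bars for each fill group by default. As a quick aside (this is not the approach taken in the rest of the post), a minimal sketch of placing the bars side by side with position = "dodge" could look like this. The toy data frame is hypothetical and only reuses two values from the filtered table.

```r
library(ggplot2)

# Hypothetical toy data reusing two values from the filtered table
toy_plot_tbl <- data.frame(
  name     = c("United States", "Los Angeles County, California"),
  variable = "percent_less_than_9th_grade",
  value    = c(5.3, 12.6)
)

# position = "dodge" puts the bars next to each other instead of stacking them
dodged_plot <- ggplot(toy_plot_tbl, aes(x = variable, y = value, fill = name)) +
  geom_col(position = "dodge")

dodged_plot
```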
Making Business-Ready plots can be time-consuming; thankfully, we have the tidyquant library to help expedite the process.
Business Ready Plots
To get those plots business-ready, it’s helpful (+best-practice) to break things up into two steps:
- Data Manipulation (Wrangling)
- Data Visualization
The data manipulation step will pay off immensely once we get to the data visualization step; this was a key learning from Matt in the 101 course, and it keeps your code nice and tidy too.
1) Data Manipulation
# Step 1 - Manipulate Data
data_manipulated_tbl <- edu_census_filtered_tbl %>%

  # Selecting columns to focus on
  select(name, variable, value) %>%

  # Tidy up variable names
  mutate(variable = str_replace(variable, "percent_", ""),
         variable = str_replace_all(variable, "_", " "),
         variable = str_to_title(variable)) %>%

  # Convert value to a pct (ratio)
  mutate(pct = value / 100) %>%

  # Format % Text
  mutate(pct_text = scales::percent(pct, accuracy = 0.1)) %>%

  # Select final columns for plotting
  select(name, variable, contains("pct"))
Now that we’ve wrangled + manipulated our data, let’s take a peek at it before diving into the generation of our visualization.
data_manipulated_tbl

## # A tibble: 10 x 4
##    name                           variable                          pct pct_text
##    <chr>                          <chr>                           <dbl> <chr>
##  1 United States                  Less Than 9th Grade             0.053 5.3%
##  2 United States                  High School Graduate Or Higher  0.877 87.7%
##  3 United States                  Bachelors Degree Or Higher      0.315 31.5%
##  4 United States                  Associates Degree               0.084 8.4%
##  5 United States                  Graduate Or Professional Degree 0.121 12.1%
##  6 Los Angeles County, California Less Than 9th Grade             0.126 12.6%
##  7 Los Angeles County, California High School Graduate Or Higher  0.787 78.7%
##  8 Los Angeles County, California Bachelors Degree Or Higher      0.318 31.8%
##  9 Los Angeles County, California Associates Degree               0.07  7.0%
## 10 Los Angeles County, California Graduate Or Professional Degree 0.111 11.1%
Creating the pct_text column will come in handy for adding clean labels to our plot, a nice touch that will help the audience quickly make sense of the chart.
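To see what the name clean-up and the label formatting do in isolation, here is a small sketch on single values. The names `raw_name`, `clean_name`, and `label_text` are made up for illustration; the inputs mirror the Associates Degree row for Los Angeles County in the table above.

```r
library(stringr)
library(scales)

# One raw variable name, as it appears in the census data
raw_name <- "percent_associates_degree"

# Same three clean-up steps as in the mutate() call, applied one at a time
clean_name <- str_replace(raw_name, "percent_", "")   # drop the prefix
clean_name <- str_replace_all(clean_name, "_", " ")   # underscores to spaces
clean_name <- str_to_title(clean_name)                # capitalize each word
clean_name

# Convert the raw value (7) to a ratio, then format it as a label;
# accuracy = 0.1 keeps one decimal place, so 7 becomes "7.0%" rather than "7%"
label_text <- percent(7 / 100, accuracy = 0.1)
label_text
```

Note that pct_text is a character column, which is why it is only used for the labels while the numeric pct column drives the bar heights.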
2) Data Visualization
# Step 2 - Visualize Data
data_visualized_plot <- data_manipulated_tbl %>%

  # Setup ggplot() canvas for plotting
  ggplot(aes(x = variable, y = pct, fill = name)) +

  # Geometries
  geom_col() +
  geom_label(aes(label = pct_text), fill = "white", hjust = "center") +

  # Facet: splits plot into multiple plots by a categorical feature
  facet_wrap(~ name) +

  # Flip coordinates for readable variable names
  coord_flip() +

  # Formatting
  theme_tq() +
  scale_fill_tq() +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1.0)) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold")) +
  labs(title = str_glue("Comparison of Educational Attainment ({year_f})"),
       subtitle = str_glue("{county} vs. Overall National Statistics"),
       caption = "Census Data",
       x = "", y = "")
We now have the two steps completed and our code is nicely commented for readability (+reproducibility).
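One detail worth calling out is how the title and subtitle reuse the variables defined back in the filtering step. A minimal sketch of that str_glue() interpolation on its own is shown here; `plot_title` and `plot_subtitle` are illustrative names that do not appear in the plotting code.

```r
library(stringr)

# Same setup variables as in the filtering step
year_f <- 2018
county <- "Los Angeles County"

# Anything inside {} is evaluated and pasted into the string
plot_title    <- str_glue("Comparison of Educational Attainment ({year_f})")
plot_subtitle <- str_glue("{county} vs. Overall National Statistics")

plot_title
plot_subtitle
```

Because the labels are built from year_f and county, changing those two variables at the top should update both the filter and the plot text together.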
Display Plot
Let’s take a look at our awesome plot.
data_visualized_plot
Wrap Up
That’s it for today!
You learned how to turn a generic ggplot() into one that is Business-Ready.
Get the code here: GitHub Repo.
Subscribe + Share
Enter your Email Here to get the latest from Exploring-Data in your inbox.
PS: Be Kind and Tidy your Data
Learn R Fast
Interested in expediting your learning path?
Click on the link to head over to Business Science and join me on the journey.