Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Welcome to the last part of the series where I recreate data visualizations in R from the book Knowledge is Beautiful by David McCandless.
Links to parts I, II, and III of the series can be found here.
Plane Crashes
This dataset will be used for a couple of visualizations.
The first visualization is a stacked bar plot showing the causes of every plane crash from 1993 to January 2017 (excluding military, medical, and private chartered flights).
```r
library(dplyr)
library(ggplot2)
library(tidyr)
library(extrafont)  # for using the Georgia font in the plots

df <- read.csv("worst_plane.csv")

# Drop the year plane model entered service
mini_df <- df %>%
  select(-year_service) %>%
  # Gather the wide dataframe into a tidy format
  gather(key = cause, value = proportion, -plane)

# Order by cause
mini_df$cause <- factor(mini_df$cause,
                        levels = c("human_error", "weather", "mechanical", "unknown", "criminal"),
                        ordered = TRUE)

# Plane names in the order they appear in the file (by year they entered service)
plane_names <- as.vector(unique(mini_df$plane))

# Use that order for the factor levels
mini_df$plane <- factor(mini_df$plane, levels = plane_names)

ggplot(mini_df, aes(x = plane, y = proportion, fill = cause)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  # Reverse the order of a categorical axis
  scale_x_discrete(limits = rev(levels(mini_df$plane))) +
  # Select manual colors that McCandless used
  scale_fill_manual(values = c("#8E5A7E", "#A3BEC7", "#E1BD81", "#E9E4E0", "#74756F"),
                    labels = c("Human Error", "Weather", "Mechanical", "Unknown", "Criminal")) +
  labs(title = "Worst Planes", caption = "Source: bit.ly/KIB_PlaneCrashes") +
  scale_y_reverse() +
  theme(legend.position = "right",
        panel.background = element_blank(),
        plot.title = element_text(size = 13, family = "Georgia", face = "bold", lineheight = 1.2),
        plot.caption = element_text(size = 5, hjust = 0.99, family = "Georgia"),
        axis.text = element_text(family = "Georgia"),
        # Get rid of the x axis text/title
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        # and y axis title
        axis.title.y = element_blank(),
        # and legend title
        legend.title = element_blank(),
        legend.text = element_text(family = "Georgia"),
        axis.ticks = element_blank())
```
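The plane ordering in the code above relies on a detail of factor() that is easy to miss: when levels are given explicitly their order is kept, whereas by default they are sorted alphabetically. Here is a minimal base R sketch with made-up plane names (not values from worst_plane.csv), assuming, as the code above does, that the rows are already sorted by the year each model entered service:

```r
# Hypothetical plane names, listed in the order they appear in the file
planes <- c("Model C", "Model A", "Model B", "Model C", "Model A")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(planes))

# Explicit levels: the order of first appearance is preserved
ordered_levels <- levels(factor(planes, levels = unique(planes)))

# Reversed levels, as passed to scale_x_discrete(limits = ...) so that
# the first plane ends up at the top after coord_flip()
reversed_levels <- rev(ordered_levels)

print(default_levels)
print(ordered_levels)
print(reversed_levels)
```

Without the explicit levels, ggplot2 would draw the planes in alphabetical order rather than in order of service entry.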
The second visualization is an alluvial diagram, for which we can use the ggalluvial package. I should mention that the original visualization by McCandless is much fancier than what this code produces, but the version here displays the same basic information.
```r
library(alluvial)
library(ggalluvial)

crash <- read.csv("crashes_alluvial.csv")

# stratum = cause, alluvium = freq
# Note: recent ggalluvial releases may expect y = freq in place of weight = freq,
# and aes(label = after_stat(stratum)) in place of label.strata = TRUE
ggplot(crash, aes(weight = freq, axis1 = phase, axis2 = cause, axis3 = total_crashes)) +
  geom_alluvium(aes(fill = cause), width = 0, knot.pos = 0, reverse = FALSE) +
  # Hide the fill legend
  guides(fill = "none") +
  geom_stratum(width = 1/8, reverse = FALSE) +
  geom_text(stat = "stratum", label.strata = TRUE, reverse = FALSE, size = 2.5) +
  scale_x_continuous(breaks = 1:3, labels = c("phase", "causes", "total crashes")) +
  coord_flip() +
  labs(title = "Crash Cause", caption = "Source: bit.ly/KIB_PlaneCrashes") +
  theme(panel.background = element_blank(),
        plot.title = element_text(size = 13, family = "Georgia", face = "bold",
                                  lineheight = 1.2, vjust = -3, hjust = 0.05),
        plot.caption = element_text(size = 5, hjust = 0.99, family = "Georgia"),
        axis.text = element_text(family = "Georgia"),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank())
```
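The plot code expects one row per combination of axis values, with the count in freq. If you start from one row per crash instead, a table of that shape can be built in base R. This is only a sketch on made-up records: the values are hypothetical, and the column names phase and cause are assumed to match the ones used in the plot code.

```r
# Made-up records, one row per crash (not data from crashes_alluvial.csv)
raw <- data.frame(
  phase = c("landing", "landing", "take-off", "en route", "landing", "take-off"),
  cause = c("human_error", "weather", "human_error", "mechanical",
            "human_error", "human_error"),
  stringsAsFactors = FALSE
)

# Cross-tabulate phase by cause, then convert to a long data frame
# whose count column is named freq
crash_counts <- as.data.frame(table(raw[c("phase", "cause")]),
                              responseName = "freq",
                              stringsAsFactors = FALSE)

# Drop combinations that never occur
crash_counts <- crash_counts[crash_counts$freq > 0, ]

print(crash_counts)
```

Each remaining row corresponds to one flow in the diagram, and its freq value sets how thick that flow is drawn.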
Gender Gap
This visualization depicts the salary gap between men and women by industry in the UK, along with the mean salary across the positions within each category. We can use group_by() and summarize_at() to compute the mean salary for each category, and then use facet_wrap() to draw one panel per category. Since each position belongs to only one category, you need to set scales = "free_x" so that each facet shows only its own positions instead of leaving empty slots for positions from other categories.
```r
gendergap <- read.csv("gendergap.csv")

# gather the dataset
tidy_gap <- gendergap %>%
  gather(key = sex, value = salary, -title, -category)

category_means <- tidy_gap %>%
  group_by(category) %>%
  summarize_at(vars(salary), mean)

tidy_gap %>%
  ggplot(aes(x = title, y = salary, color = sex)) +
  facet_wrap(~ category, nrow = 1, scales = "free_x") +
  geom_line(color = "white") +
  geom_point() +
  scale_color_manual(values = c("#F49171", "#81C19C")) +
  geom_hline(data = category_means, aes(yintercept = salary),
             color = "white", alpha = 0.6, size = 1) +
  theme(legend.position = "none",
        panel.background = element_rect(color = "#242B47", fill = "#242B47"),
        plot.background = element_rect(color = "#242B47", fill = "#242B47"),
        axis.line = element_line(color = "grey48", size = 0.05, linetype = "dotted"),
        axis.text = element_text(family = "Georgia", color = "white"),
        axis.text.x = element_text(angle = 90),
        # Get rid of the y- and x-axis titles
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey48", size = 0.05),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        strip.background = element_rect(color = "#242B47", fill = "#242B47"),
        strip.text = element_text(color = "white", family = "Georgia"))
```
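To see what the category_means step produces, here is a small sketch on made-up salaries. The numbers are hypothetical, and the column names (title, category, female, male) are assumptions about the layout of gendergap.csv rather than something confirmed by the file itself.

```r
library(dplyr)
library(tidyr)

# Made-up data in a wide layout: one row per job title
toy_gap <- data.frame(
  title    = c("Nurse", "Doctor", "Engineer"),
  category = c("Health", "Health", "Tech"),
  female   = c(30000, 60000, 40000),
  male     = c(32000, 68000, 44000)
)

# Long format: one row per title and sex
toy_tidy <- toy_gap %>%
  gather(key = sex, value = salary, -title, -category)

# One mean salary per category, pooled over titles and both sexes
toy_means <- toy_tidy %>%
  group_by(category) %>%
  summarize_at(vars(salary), mean)

print(toy_means)
```

The result has one row per category, which is why geom_hline() can draw a single horizontal line in each facet.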
One thing I’m not sure how to handle is the spacing between positions on the x-axis. Since each facet holds a different number of positions, it would be nice if facet_wrap() had an option for equal spacing along the x-axis; however, I don’t think it’s possible (if you know a workaround, please leave a comment!).
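That said, facet_grid() may get close. Unlike facet_wrap(), it accepts a space argument, and combining scales = "free_x" with space = "free_x" makes each panel's width proportional to the number of x-axis positions it holds. Here is a minimal sketch on made-up data (the titles and salaries below are hypothetical, not taken from gendergap.csv):

```r
library(ggplot2)

# Made-up long-format data: categories with different numbers of titles
toy <- data.frame(
  title    = c("Nurse", "Doctor", "Surgeon", "Engineer",
               "Nurse", "Doctor", "Surgeon", "Engineer"),
  category = c("Health", "Health", "Health", "Tech",
               "Health", "Health", "Health", "Tech"),
  sex      = rep(c("female", "male"), each = 4),
  salary   = c(30000, 60000, 80000, 40000,
               32000, 68000, 90000, 44000)
)

p <- ggplot(toy, aes(x = title, y = salary, color = sex)) +
  geom_point() +
  # space = "free_x" sizes each panel by its number of x positions
  facet_grid(. ~ category, scales = "free_x", space = "free_x") +
  theme(axis.text.x = element_text(angle = 90))

print(p)
```

The trade-off is that facet_grid() lays the panels out in a single row or column and cannot wrap them onto several rows, which is fine here since the original plot uses nrow = 1.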
That’s all from me. It’s been fun doing this series, and I hope you’ve enjoyed it!
Recreating (more) data visualizations from the book “Knowledge is Beautiful”: Part IV was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.