Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve recently begun reading through the book Modern Data Science with R, by Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton. It’s quite clear and informative. One of the things I especially appreciate about it is that I’m not finding the math to be too cumbersome. That is, even for someone like me, whose primary background isn’t in math or statistics, I’m able to follow along with the book quite easily.
As I’m reading through the book, I’m doing the exercises at the back of the chapters, and I recently worked through chapter 3, which covers ggplot2
basics. One of the exercises at the end of this chapter asks us to recreate this graphic from FiveThirtyEight. The goal of the exercise is to use ggplot2
to make production-quality graphics.
The project makes use of the babynames
package, which uses public data on baby names from the Social Security Administration. We then use the make_babynames_dist()
function from the mdsr
package that the authors developed to add variables relevant to the goals of the exercise. Basically, it takes the data from the lifetables
table in the babynames
package and adds variables and filters and returns just the data relevant to 2014.
Truth told, I was stumped by this exercise when I first read it. So I reached out to Nicholas Horton at Amherst and he helped me with the basic scripting and I was able to tweak it to recreate what I was looking for.
So first we load the necessary libraries and inspect the dataset.
library(mdsr) library(dplyr) library(tidyr) library(ggplot2) library(ggthemes) library(babynames) babynames_dist <- make_babynames_dist() head(babynames_dist)
# A tibble: 6 x 9 year sex name n prop alive_prob count_thousands age_today est_alive_today <dbl> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1900 F Mary 16707 0.0526 0 16.7 114 0 2 1900 F Helen 6343 0.0200 0 6.34 114 0 3 1900 F Anna 6114 0.0192 0 6.11 114 0 4 1900 F Marga… 5304 0.0167 0 5.30 114 0 5 1900 F Ruth 4765 0.0150 0 4.76 114 0 6 1900 F Eliza… 4096 0.0129 0 4.10 114 0
So what we need to do is create some new variables that provide the total number of people with a given name who are likely still alive, and from that we can then calculate the percentage who are likely dead. Then (and this is what initially had me stumped) we need to select the top 10 male names and the top 10 female names.
deadest <- babynames_dist %>% filter(year >= 1900) %>% group_by(name, sex) %>% summarise(N = n(), total_est_alive_today = sum(est_alive_today), total = sum(n)) %>% mutate(percent_dead = 1 - (total_est_alive_today / total)) %>% filter(total > 50000) %>% arrange(desc(percent_dead)) %>% group_by(sex) %>% top_n(10)
The above gives us the following dataset, which we can then use to create the graphic:
# A tibble: 20 x 6 # Groups: sex [2] name sex N total_est_alive_today total percent_dead <chr> <chr> <int> <dbl> <int> <dbl> 1 Mabel F 111 20233. 96037 0.789 2 Gertrude F 111 31360. 145693 0.785 3 Myrtle F 99 25491. 108941 0.766 4 Blanche F 111 16509. 69524 0.763 5 Beulah F 110 15642. 63361 0.753 6 Opal F 111 17471. 65821 0.735 7 Florence F 111 77679. 284945 0.727 8 Agnes F 111 37593. 134940 0.721 9 Viola F 111 32957. 116666 0.718 10 Bessie F 111 36824. 130155 0.717 11 Elmer M 111 35548. 116830 0.696 12 Wilbur M 111 17881. 54423 0.671 13 Homer M 111 18809. 55639 0.662 14 Willard M 111 28576. 74821 0.618 15 Hubert M 111 21417. 55340 0.613 16 Chester M 111 44995. 114370 0.607 17 Clarence M 111 113641. 280518 0.595 18 Herbert M 111 88652. 217291 0.592 19 Harry M 111 153501. 374524 0.590 20 Horace M 111 20723. 50340 0.588
So then here is the code to create the actual data viz:
ggplot(deadest, aes(reorder(name, percent_dead), percent_dead, fill = sex)) + geom_bar(stat = "identity") + geom_text(aes(y = percent_dead + 0.05), label = paste(round(deadest$percent_dead * 100, 1))) + coord_flip() + ggtitle("Deadest Names", subtitle = "Estimated % of Americans with a given name born since 1900\nwho were dead as of Jan. 1, 2014") + scale_x_discrete(NULL) + scale_y_continuous(NULL) + scale_fill_manual(values = c("#f6b900", "#008fd5")) + theme_fivethirtyeight() + theme(axis.text.x = element_blank(), panel.grid = element_blank(), legend.position = "none")
I’ve used the fivethirtyeight
theme from the ggthemes
package, and apart from the footer that FiveThirtyEight uses, it looks pretty close.
The post Data viz challenge: Recreating FiveThirtyEight’s ‘Deadest Names’ graphic with ggplot2 appeared first on my (mis)adventures in R programming.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.