CapitalOne contacted me a few months ago and requested that I apply for an internship with them for a data science related position. I never got the job (nor did I really want it; I had already agreed to teach during the summer and I was apprehensive about leaving people hanging, and also about moving), but I did go through part of the interview process. CapitalOne had me complete their data science challenge, which had some problems that were supposedly common tasks in data science. Some of it I was not well equipped for, such as regression; I was used to regression from an econometric point of view, not a computer science or data science point of view, and I was still learning. But there was one part of the challenge that I remember very well, and I was very happy with the solution.
The Social Security Administration (SSA) releases data about baby names on an annual basis, and this data is popular among data junkies, including Hadley Wickham (he has a GitHub project and an R package on CRAN devoted to this data set). The CapitalOne challenge had multiple questions devoted to analyzing this data set, one of which was open-ended; basically, they wanted me to find something interesting in the data.
My response was a Shiny app for visualizing the SSA baby names data on a U.S. map. The app lets users choose a name, a range of birth years, and the gender of the babies with the name, and then displays a map showing the relative popularity of the name in each state. You can see a screenshot below.
After making this app, I was able to identify interesting patterns, particularly surrounding gender-ambiguous names like Jaime or Ashley. Both boys and girls can receive these names, but which gender is commonly associated with which name can depend on the region. For example, look at the screenshots below for male and female use of the name Jaime:
In the South, Jaime is commonly used as a boy’s name, while elsewhere (including my neck of the woods, Utah and Idaho) it’s commonly a girl’s name. Other patterns like this exist as well, and it’s easy to identify them with the app I made.
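The same regional split can be checked numerically. The sketch below computes, for each state, the share of babies with a given name who were recorded as male. It assumes a data frame with the `state`, `gender`, `name`, and `count` columns used later in this post; the function name `male_share_by_state` and the toy numbers are made up purely for illustration and are not SSA figures.

```r
# Share of babies with a given name recorded as male, per state.
# Assumes columns: state, gender ("M"/"F"), name, count.
male_share_by_state <- function(df, target_name) {
  rows <- df[df$name == target_name, ]
  # Fix the gender levels so both columns exist even if one is absent
  rows$gender <- factor(rows$gender, levels = c("F", "M"))
  # Two-way table of counts: one row per state, one column per gender
  counts <- xtabs(count ~ state + gender, data = rows)
  male <- as.vector(counts[, "M", drop = FALSE])
  setNames(male / as.vector(rowSums(counts)), rownames(counts))
}

# Made-up counts, only to show the shape of the input and output
toy <- data.frame(
  state  = c("TX", "TX", "UT", "UT"),
  gender = c("M", "F", "M", "F"),
  name   = "Jaime",
  count  = c(90, 10, 5, 45),
  stringsAsFactors = FALSE
)
male_share_by_state(toy, "Jaime")  # TX 0.9, UT 0.1 for this toy data
```

A share near 1 means the name is mostly given to boys in that state, and a share near 0 means it is mostly given to girls.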
I hosted this app on shinyapps.io and posted about it on Facebook, inviting friends to use it. It turns out the app was so popular among my friends that I ran out of available active hours on shinyapps.io, which meant that the interactive documents I was using to teach my statistics class (MATH 1070, a basic introductory statistics course with only intermediate algebra as a prerequisite, and no use of a computer beyond Microsoft Excel) would no longer work. That was an interesting announcement to make to my class.
How to make this app? The first step is to get the data in a usable format. I downloaded the data directly from the SSA website, though Hadley Wickham’s package may contain the same data in an already usable format. After downloading, I unzipped all the .csv files into a directory, read them all in a for loop, and created a .rda file with the data in the appropriate format. (Is this a great solution? No. But it was good enough.) The code for doing so is shown below:
```r
baby_names <- data.frame(state = character(),
                         gender = character(),
                         year = integer(),
                         name = character(),
                         count = integer())

for (file_name in list.files()) {
  print(file_name)
  baby_names <- rbind(baby_names, read.csv(file_name, header = FALSE))
}
names(baby_names) <- c("state", "gender", "year", "name", "count")

# Save the resulting dataframe for easy loading later
save(baby_names, file = "sca_baby_name_counts.Rda")
```
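Growing a data frame with `rbind()` inside a loop copies the accumulated data on every pass, which is part of why the loop above is only "good enough". A minimal alternative sketch, assuming the unzipped files sit in a directory of their own, reads each file into a list and stacks the pieces once. The function name `read_baby_names`, its `data_dir` argument, and the `.csv` pattern are illustrative assumptions, not part of the original script.

```r
# Read every state file in `data_dir` and stack them in a single step.
# `data_dir` and `pattern` are assumptions about where and how the
# unzipped SSA files were saved.
read_baby_names <- function(data_dir, pattern = "\\.csv$") {
  files <- list.files(data_dir, pattern = pattern, full.names = TRUE)
  pieces <- lapply(
    files, read.csv,
    header = FALSE,
    col.names = c("state", "gender", "year", "name", "count"),
    # Explicit types stop a column of "F" values being read as logical FALSE
    colClasses = c("character", "character", "integer",
                   "character", "integer")
  )
  # One rbind over the whole list instead of one per file
  do.call(rbind, pieces)
}
```

Restricting `list.files()` to a pattern also keeps a saved `.Rda` file in the same directory from being read back in as if it were data.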
From here I can create the app. I found I had to use the choroplethr and choroplethrMaps packages to make the map I wanted. For whatever reason, the R maps package’s USA map does not include Alaska and Hawai’i. I’m sure the package was made after these areas became states, so why they were omitted from the map is beyond me. ggplot2 unfortunately doesn’t offer a convenient way to plot these states either, so these packages were the easiest to work with for the job I wanted to do, which is sad. I feel more could be done to make map plots of the whole USA (including Alaska and Hawai’i) easier.
Anyway, here is the code for the app:
```r
library(shiny)
library(choroplethr)
library(choroplethrMaps)
library(dplyr)

load("sca_baby_name_counts.Rda")

# Keep every state as a factor level so that xtabs() reports a zero for
# states where the chosen name never appears (read.csv() no longer creates
# factors by default in R >= 4.0)
baby_names$state <- factor(baby_names$state)

get_state_name_data <- function(name, years, gender = "MF") {
  if (!(gender %in% c("M", "F", "MF"))) {
    stop("Bad argument to gender: must be 'M', 'F', or 'MF'")
  }
  # Local names that cannot be confused with columns inside filter()
  name_in <- name
  genders_in <- if (gender == "MF") c("M", "F") else gender

  # All births recorded in the selected range of years
  in_range <- baby_names %>%
    filter(year >= years[1] & year <= years[2])

  # Babies with the requested name (and gender), counted by state
  state_count_name <- in_range %>%
    filter(name == name_in & gender %in% genders_in) %>%
    xtabs(count ~ state, data = .)

  # All babies of either gender, counted by state
  state_count <- in_range %>%
    xtabs(count ~ state, data = .)

  # Full state names in the order of the sorted two-letter codes
  # (AK, AL, AR, AZ, ...); note that MI (michigan) sorts before
  # MN (minnesota)
  names(state_count) <- c(
    'alaska', 'alabama', 'arkansas', 'arizona', 'california', 'colorado',
    'connecticut', 'district of columbia', 'delaware', 'florida', 'georgia',
    'hawaii', 'iowa', 'idaho', 'illinois', 'indiana', 'kansas', 'kentucky',
    'louisiana', 'massachusetts', 'maryland', 'maine', 'michigan',
    'minnesota', 'missouri', 'mississippi', 'montana', 'north carolina',
    'north dakota', 'nebraska', 'new hampshire', 'new jersey', 'new mexico',
    'nevada', 'new york', 'ohio', 'oklahoma', 'oregon', 'pennsylvania',
    'rhode island', 'south carolina', 'south dakota', 'tennessee', 'texas',
    'utah', 'virginia', 'vermont', 'washington', 'wisconsin',
    'west virginia', 'wyoming')

  return(data.frame(
    region = names(state_count),
    value = as.vector(state_count_name) / as.vector(state_count)))
}

# Define UI for application that draws a map
ui <- shinyUI(fluidPage(

  # Application title
  titlePanel("Baby Name National Popularity Viewer"),

  # Sidebar with inputs for the year range, name, and sex selected
  sidebarLayout(
    sidebarPanel(
      sliderInput("year", "Birth Years",
                  min = min(baby_names$year),
                  max = max(baby_names$year),
                  value = c(max(baby_names$year), max(baby_names$year))),
      textInput("b_name", "Name", value = "John"),
      radioButtons("sex", "Gender",
                   choices = list("Male" = "M", "Female" = "F",
                                  "Both" = "MF"),
                   selected = "MF")
    ),

    # Show a plot of the generated map
    mainPanel(
      plotOutput("mapPlot")
    )
  )
))

# Define server logic required to draw a map
server <- shinyServer(function(input, output) {
  output$mapPlot <- renderPlot({
    if (input$sex == "M") {
      title_str <- paste("Proportion of Male Babies Named", input$b_name)
    } else if (input$sex == "F") {
      title_str <- paste("Proportion of Female Babies Named", input$b_name)
    } else if (input$sex == "MF") {
      title_str <- paste("Proportion of Babies Named", input$b_name)
    }

    get_state_name_data(input$b_name, input$year, input$sex) %>%
      state_choropleth(title = title_str)
  })
})

# Run the application
shinyApp(ui = ui, server = server)
```
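The hand-typed vector of state names in the app only works if its order matches the sort order of the two-letter codes exactly, which is easy to get wrong. A lookup keyed on the abbreviation avoids depending on order at all. This is a sketch rather than what the app does: `abb_to_region` is a hypothetical helper, built on the `state.abb` and `state.name` vectors that ship with base R. Those cover only the 50 states, so the District of Columbia is added by hand.

```r
# Map two-letter codes to the lowercase full names used in the `region`
# column of the app's data frame. state.abb and state.name come with
# base R and list the 50 states in the same order; DC is added manually.
abb_to_region <- function(abb) {
  lookup <- c(setNames(tolower(state.name), state.abb),
              DC = "district of columbia")
  # as.character() guards against factors being used as integer indices
  unname(lookup[as.character(abb)])
}

abb_to_region(c("MI", "MN", "DC"))
# "michigan" "minnesota" "district of columbia"
```

Inside `get_state_name_data()`, the long assignment could then become `names(state_count) <- abb_to_region(names(state_count))`.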
There is room for improvement in the app. For example, years are formatted funny. Also, it would be great to see a line chart that plots the popularity of a name over time in addition to the choropleth plot. I would love to see such solutions. Also, if you find any interesting patterns, comment below and let me know!
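As a starting point for the line chart, here is a minimal sketch that totals the count for one name per year and draws it with base graphics. It plots raw counts rather than proportions, the helper name `name_by_year` is made up, and the toy data frame only mimics the column layout of `baby_names`; none of the numbers are real.

```r
# Total count per year for one name; `gender` may be "M", "F", or "MF".
# Assumes columns: gender, year, name, count.
name_by_year <- function(df, target_name, gender = "MF") {
  genders <- if (gender == "MF") c("M", "F") else gender
  rows <- df[df$name == target_name & df$gender %in% genders, ]
  if (nrow(rows) == 0) {
    # aggregate() refuses empty input, so return an empty result instead
    return(data.frame(year = integer(), count = integer()))
  }
  aggregate(count ~ year, data = rows, FUN = sum)
}

# Made-up rows, only to show the shape of the input
toy <- data.frame(
  state  = c("TX", "UT", "TX"),
  gender = c("M", "F", "M"),
  year   = c(2000L, 2000L, 2001L),
  name   = "Jaime",
  count  = c(10L, 5L, 7L),
  stringsAsFactors = FALSE
)

trend <- name_by_year(toy, "Jaime")
plot(trend$year, trend$count, type = "l",
     xlab = "Year", ylab = "Babies named Jaime")
```

In the app, this could sit in a second `renderPlot()` next to the map. As for the year formatting, the likely culprit is the slider's thousands separator: `sliderInput()` has a `sep` argument that defaults to `","`, so passing `sep = ""` should show 2015 rather than 2,015.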
As a final note, I would love to be able to create a D3 visualization app and embed it directly in a blog post, or at least somewhere on this site. As I am currently using WordPress.com for my blog, this is not possible, though there is a WordPress plugin that allows for D3 visualizations in WordPress. Imagine if in addition to giving you the code for a Shiny app, I also included a D3 visualization system for the baby names data set for you to play with on your own, online, even if you know nothing about R! Please petition WordPress.com to add Wp-D3 as a plugin, and this will become a reality (you can do so here). As D3 probably is not used by most bloggers, I need as many people as possible to suggest this, as otherwise I may be the only one. If you do so, let me know in the comments and you will have made my day!