Estimate Age from First Name
Today I read a cute post from Flowing Data on the trendiest names in US history. What caught my attention was a link in the article to the source data: yearly lists of baby names registered with the US Social Security Administration since 1880 (see here). I thought it might be good to compile and use these lists at work, for two reasons:
(1) I don’t have experience handling file input programmatically in R (i.e. working with a bunch of files in a directory instead of manually loading one or two), and
(2) it could be useful to have age estimates in the donor files that I work with (using the year when each first name was most popular).
I’ve included the R code in this post at the bottom, after the following explanatory text.
I managed to build a dataframe in which each row contains a name, the number of people registered as born with that name in a given year, the year, the total number of registered births for that year, and the proportion of that year's registered births with that name.
Once I had that dataframe, I built a function that queries it for the year when a given name was most popular, an estimated age based on that year, and the relative proportion of people born with that name in that year.
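To make that structure concrete, here is a minimal sketch on a made-up data frame. The names and counts in `toy` are invented for illustration (they are not SSA figures), and `Year_Total` is a hypothetical column name standing in for the yearly total.

```r
# Toy data: invented counts, NOT real SSA figures.
toy = data.frame(
  Name = c("Ann", "Bob", "Ann", "Bob"),
  Sex  = c("F", "M", "F", "M"),
  Pop  = c(30, 70, 60, 40),
  Year = c(1950, 1950, 1980, 1980)
)

# Total registered births per year, repeated on every row of that year
toy$Year_Total = ave(toy$Pop, toy$Year, FUN = sum)

# Share of that year's registered births carrying each name
toy$Rel_Pop = toy$Pop / toy$Year_Total

# Year in which "Ann" had her largest share of births
ann = toy[toy$Name == "Ann", ]
peak_year = ann$Year[which.max(ann$Rel_Pop)]
print(peak_year)
```

The estimated age is then just the current year minus `peak_year`.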
I don’t have any testing data to check the results against, but I did do an informal check around the office, and it seems okay!
However, I’d like to scale this up so that age estimates can be calculated for every element of a vector of first names. As the code stands below, the function I made is too slow to scale up effectively.
I’m wondering what the best approach is. Some ideas I have so far follow:
Does anyone have any better ideas for me?
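One possible direction, as a minimal sketch rather than a tested solution: find each name's peak year once, store the result in a small lookup table, and then use `match()` to look up a whole vector of names in one go. The data below are invented, and `peaks`, `estimate_ages` and `reference_year` are hypothetical names used only for this example.

```r
# Toy stand-in for the long dataframe: invented numbers, not real SSA figures.
toy = data.frame(
  Name    = c("Ann", "Ann", "Bob", "Bob", "Cy"),
  Year    = c(1950, 1980, 1950, 1980, 1990),
  Rel_Pop = c(0.30, 0.60, 0.70, 0.40, 0.05)
)

# Step 1 (done once): sort so each name's highest-share row comes first,
# then keep only the first row per name.
ordered = toy[order(toy$Name, -toy$Rel_Pop), ]
peaks = ordered[!duplicated(ordered$Name), ]

# Step 2 (done per batch): vectorised lookup; names not in the table give NA.
estimate_ages = function(first_names, reference_year) {
  idx = match(first_names, peaks$Name)
  data.frame(
    Name = first_names,
    year_of_birth = peaks$Year[idx],
    age = reference_year - peaks$Year[idx]
  )
}

result = estimate_ages(c("Bob", "Ann", "Zed", "Ann"), reference_year = 2020)
print(result)
```

The point of the design is that the expensive search over the full dataframe happens once per name instead of once per row of the donor file; the per-row work is reduced to an index lookup.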
library(stringr)
library(plyr)

# We're assuming you've downloaded the SSA files into your R project directory.
# The [3:135] index assumes the yearly files occupy those positions in the directory listing; adjust it to match yours.
file_listing = list.files()[3:135]

for (f in file_listing) {
  year = as.numeric(str_extract(f, "[0-9]{4}")) # keep Year numeric for comparisons and arithmetic
  if (year == 1880) { # Initializing the very long dataframe
    name_data = read.csv(f, header=FALSE)
    names(name_data) = c("Name", "Sex", "Pop")
    name_data$Year = rep(year, dim(name_data)[1])
  } else { # adding onto the very long dataframe
    name_data_new = read.csv(f, header=FALSE)
    names(name_data_new) = c("Name", "Sex", "Pop")
    name_data_new$Year = rep(year, dim(name_data_new)[1])
    name_data = rbind(name_data, name_data_new)
  }
}

# Total registered births per year; ddply names the summed column V1
year_pop_totals = ddply(name_data, .(Year), function (x) sum(x$Pop))
name_data = merge(name_data, year_pop_totals, by.x="Year", by.y="Year", all.x=TRUE)
name_data$Rel_Pop = name_data$Pop/name_data$V1

estimate_age = function (input_name, sex = NA) {
  if (is.na(sex)) {
    name_subset = subset(name_data, Name == input_name & Year >= 1921) # 1921 is a year I chose arbitrarily. Change how you like.
  } else {
    name_subset = subset(name_data, Name == input_name & Year >= 1921 & Sex == sex)
  }
  # Row where the name's share of births peaks; which.max keeps the first row if there is a tie
  peak = name_subset[which.max(name_subset$Rel_Pop), c("Year", "Rel_Pop")]
  current_year = as.numeric(format(Sys.Date(), "%Y"))
  estimated_age = current_year - peak$Year
  return(list(year_of_birth=peak$Year, age=estimated_age, relative_pop=sprintf("%1.2f%%", peak$Rel_Pop*100)))
}
I’ll also accept any suggestions for cleaning up my code as is.
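On the clean-up side, one option that may be worth trying is to read every file into a list and bind the pieces once, instead of growing the dataframe with `rbind` inside a loop. The sketch below writes two tiny made-up files to a temporary directory so it can run on its own. The `yobYYYY.txt` file naming is an assumption about how the downloaded files are named, and `read_year` and `name_data_demo` are hypothetical names.

```r
# Write two tiny files (name,sex,count; no header) to a temporary directory.
# File names and contents are made up for the demo.
demo_dir = file.path(tempdir(), "names_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Ann,F,30", "Bob,M,70"), file.path(demo_dir, "yob1950.txt"))
writeLines(c("Ann,F,60", "Bob,M,40"), file.path(demo_dir, "yob1980.txt"))

# Select files by name pattern rather than by position in the directory listing
files = list.files(demo_dir, pattern = "^yob[0-9]{4}\\.txt$", full.names = TRUE)

# Read one file and tag its rows with the year taken from the file name
read_year = function(f) {
  d = read.csv(f, header = FALSE,
               col.names = c("Name", "Sex", "Pop"),
               colClasses = c("character", "character", "numeric"))
  file_name = basename(f)
  d$Year = as.numeric(regmatches(file_name, regexpr("[0-9]{4}", file_name)))
  d
}

# Read everything first, then bind once
name_data_demo = do.call(rbind, lapply(files, read_year))
print(name_data_demo)
```

Selecting by pattern also removes the need to know where the files fall in the directory listing, and there is no longer a special case for the first year.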
