Names in the U.S., from James Smith to Jose Rodriguez
Two weeks ago, @mona published an interesting post on her blog about a difficult question: What's The Most Common Name In America? There were stats about first names in the U.S., and about last names too. That information is, somehow, easy to get. But it is usually much harder to get the first and the last name together, for confidentiality reasons: the datasets I deal with are supposed to be anonymized, so I never see first and last names. In a previous post, a few years ago, I mentioned the so-called Social Security Death Master File. In that file, we have Social Security numbers, with the date of birth, the date of death, as well as the first and the last name. So I used those files to get stats about the first and last names of American citizens. Of course, it is very restrictive: I only have U.S. citizens who had a Social Security number (which is not compulsory in the U.S., as far as I understand) and who passed away (as mentioned in the name of the dataset: the death master file). Another great thing about that dataset is that I have the date of birth, so I can look at cohort effects (see opendata.stackexchange for an interesting discussion of that dataset).
This dataset is quite big: overall, it is an 8-gigabyte file, zipped into three documents. My first idea was to work directly on the zipped files, and to scan them, piece by piece…
temp <- tempfile()
download.file(paste("http://ssdmf.info/raw_data/ssdm",1,".zip",sep=""),temp)
list.files <- unzip(temp,list=TRUE)
s <- 1
data <- scan(unz(temp,list.files$Name[1]),
             what="character",sep=",",skip=(s-1)*1e5,n=1e5)
(and to loop over the chunks) but it was extremely slow... after 4 days, I was still scanning the files. So I asked @3wen to help me. Using nice functions from several packages, it ran much faster,
library(stringr)
library(lubridate)
library(plyr)
library(data.table)
library(LaF)
The first step was to download the three files and to extract them; here, you need some disk space, around 8 Gb.
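If you want a quick sketch of that download step (I use a ssdmf_files folder here, which is my own choice, adapt it to your setup), something like the following should do,

# sketch: fetch and unzip the three parts of the death master file
# ("ssdmf_files" is a hypothetical target directory)
dir.create("ssdmf_files", showWarnings = FALSE)
for(i in 1:3){
  temp <- tempfile()
  download.file(paste("http://ssdmf.info/raw_data/ssdm", i, ".zip", sep = ""), temp)
  unzip(temp, exdir = "ssdmf_files")
  unlink(temp)
}

To extract the appropriate information from those fixed-width text files, first define the column widths and names,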
cols <- c(1,9,20,4,15,15,1,2,2,4,2,2,4,2,5,5,7)
noms_col <- c("code", "ssn", "last_name", "name_suffix", "first_name",
              "middle_name", "VorPCode", "date_death_m", "date_death_d",
              "date_death_y", "date_birth_m", "date_birth_d", "date_birth_y",
              "state", "zip_resid", "zip_payment", "blanks")
Then, we will count many things,
cpttf_t <- NULL
cpttl_t <- NULL
cpttot_t <- NULL
cpttf_c_t <- NULL
cpttl_c_t <- NULL
cpttot_c_t <- NULL
We need those empty objects to start our code,
temp <- "/user/arthur/ssm/ssdm1"
ssn <- laf_open_fwf(temp, column_widths = cols,
                    column_types = rep("character", length(cols)),
                    column_names = noms_col, trim = TRUE)
go_through <- seq(1, nrow(ssn), by = 1e05)
if(go_through[length(go_through)] != nrow(ssn)) go_through <- c(go_through, nrow(ssn))
go_through <- cbind(go_through[-length(go_through)],
                    c(go_through[-c(1, length(go_through))] - 1,
                      go_through[length(go_through)]))
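To see what this chunking does, here is a toy illustration with a made-up file of 250,000 rows: the construction yields a two-column matrix of start and end row indices, one row per block of at most 100,000 records.

# toy illustration of the chunk boundaries (250000 is an arbitrary example size)
n_ex <- 250000
gt <- seq(1, n_ex, by = 1e05)
if(gt[length(gt)] != n_ex) gt <- c(gt, n_ex)
gt <- cbind(gt[-length(gt)],
            c(gt[-c(1, length(gt))] - 1, gt[length(gt)]))
gt
#        [,1]   [,2]
# [1,]      1 100000
# [2,] 100001 200000
# [3,] 200001 250000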
Now, we can start counting the appearances of each name in the dataset, with a function
count_chunk <- function(s){
  print(s)
  data <- ssn[go_through[s,1]:go_through[s,2],
              c("last_name", "first_name", "date_birth_y")]
  data$cohort <- trunc(as.numeric(data$date_birth_y)/10)*10
The counts are
  cpttf_temp  <- count(data, "first_name")
  cpttl_temp  <- count(data, "last_name")
  cpttot_temp <- count(data, c("last_name", "first_name"))
That gives counts for the first name, the last name, and the (last name, first name) pair. Then, we also break them down by cohort,
  cpttf_temp_c  <- count(data, c("first_name", "cohort"))
  cpttl_temp_c  <- count(data, c("last_name", "cohort"))
  cpttot_temp_c <- count(data, c("last_name", "first_name", "cohort"))
We create a list, to store that information
  list(cpttf = cpttf_temp, cpttl = cpttl_temp, cpttot = cpttot_temp,
       cpttf_c = cpttf_temp_c, cpttl_c = cpttl_temp_c, cpttot_c = cpttot_temp_c)
}
Now that we have our function, we use it
data <- lapply(seq_len(nrow(go_through)), count_chunk)
From that list, we can count the appearances of all names
cpttf  <- do.call("rbind", lapply(data, function(x) x$cpttf))
cpttl  <- do.call("rbind", lapply(data, function(x) x$cpttl))
# and the (last name, first name) pairs
cpttot <- do.call("rbind", lapply(data, function(x) x$cpttot))
as well as the same counts broken down by cohort
cpttf_c  <- do.call("rbind", lapply(data, function(x) x$cpttf_c))
cpttl_c  <- do.call("rbind", lapply(data, function(x) x$cpttl_c))
cpttot_c <- do.call("rbind", lapply(data, function(x) x$cpttot_c))
We are almost done! We just need to aggregate those per-chunk counts,
cpttf    <- count(cpttf, "first_name", wt_var = "freq")
cpttl    <- count(cpttl, "last_name", wt_var = "freq")
cpttot   <- count(cpttot, c("last_name", "first_name"), wt_var = "freq")
cpttf_c  <- count(cpttf_c, c("first_name", "cohort"), wt_var = "freq")
cpttl_c  <- count(cpttl_c, c("last_name", "cohort"), wt_var = "freq")
cpttot_c <- count(cpttot_c, c("last_name", "first_name", "cohort"), wt_var = "freq")
We now have our counts for this file; the same steps are repeated for the other two files, appending the results to the *_t objects created earlier. We can then sort them, just to visualize the top entries,
cpttf_t    <- arrange(cpttf_t, desc(freq))
cpttl_t    <- arrange(cpttl_t, desc(freq))
cpttot_t   <- arrange(cpttot_t, desc(freq))
cpttf_c_t  <- arrange(cpttf_c_t, cohort, desc(freq))
cpttl_c_t  <- arrange(cpttl_c_t, cohort, desc(freq))
cpttot_c_t <- arrange(cpttot_c_t, cohort, desc(freq))
For instance, in the 1930 cohort, our top-6 was
> head(cpttot_c_t[cpttot_c_t$cohort==1930,])
         last_name first_name cohort freq
14033573     SMITH      JAMES   1930  682
14033574     SMITH     ROBERT   1930  616
14033575   JOHNSON     ROBERT   1930  527
14033576   JOHNSON      JAMES   1930  502
14033577  WILLIAMS      JAMES   1930  473
14033578     SMITH       MARY   1930  425
while in the 1970 cohort, it was
> head(cpttot_c_t[cpttot_c_t$cohort==1970,])
         last_name first_name cohort freq
17950485 RODRIGUEZ       JOSE   1970   54
17950486  GONZALEZ       JOSE   1970   41
17950487   JOHNSON    MICHAEL   1970   40
17950488  MARTINEZ       JOSE   1970   38
17950489 HERNANDEZ       JOSE   1970   36
17950490     SMITH    MICHAEL   1970   35
Frequencies are extremely small in the latter. That comes from the fact that everyone in this dataset has passed away (it is a death master file), and only a small fraction of those born in 1970 already appear in it.
Based on those datasets, we can start visualizing. Create lists of first and last names
L <- cpttl_t[,"last_name"]
F <- cpttf_t[,"first_name"]
# keep only first names with more than 2 characters
F <- F[nchar(F)>2]
Consider the 1930 cohort
YEAR <- 1930
N <- sum(cpttl_c_t[(cpttl_c_t$cohort==YEAR),"freq"])
Freq_L <- cpttl_c_t[(cpttl_c_t$cohort==YEAR)&(cpttl_c_t$last_name%in%L[1:20]),
                    c("last_name","freq")]
Freq_F <- cpttf_c_t[(cpttf_c_t$cohort==YEAR)&(cpttf_c_t$first_name%in%F[1:20]),
                    c("first_name","freq")]
Freq_FL <- cpttot_c_t[(cpttot_c_t$cohort==YEAR)&
                      (cpttot_c_t$last_name%in%L[1:20])&
                      (cpttot_c_t$first_name%in%F[1:20]),
                      c("last_name","first_name","freq")]
Here, we focus on the top 20 first names and the top 20 last names. The count matrix is obtained with
dt_Freq_FL <- data.table(Freq_FL)
dt_Freq_FL <- dcast.data.table(dt_Freq_FL, last_name~first_name,
                               fun = sum, value.var = "freq")
df_Freq_FL <- data.frame(dt_Freq_FL)
mat_Freq_FL <- as.matrix(df_Freq_FL[,-1])
rownames(mat_Freq_FL) <- df_Freq_FL[,1]
Nobs <- mat_Freq_FL
that we can plot
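Something like the following sketch, using image(), would do the job (this is only one way to draw it, and the color scale here is arbitrary),

# sketch: heatmap of the observed counts, last names in rows, first names in columns
nr <- nrow(Nobs); nc <- ncol(Nobs)
image(1:nc, 1:nr, t(Nobs[nr:1, ]), col = heat.colors(20)[20:1],
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = 1:nc, labels = colnames(Nobs), las = 2, cex.axis = .6)
axis(2, at = 1:nr, labels = rev(rownames(Nobs)), las = 2, cex.axis = .6)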
We can also use chi-square contributions to compare that matrix with the one expected under the independence assumption,
Nind <- chisq.test(Nobs)$expected
Q <- (Nobs-Nind)^2/Nind
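As a side note, the expected counts under independence are simply the products of the row and column margins divided by the grand total, so Nind could also be computed by hand,

# expected counts under independence: (row total * column total) / grand total
Nind_by_hand <- outer(rowSums(Nobs), colSums(Nobs)) / sum(Nobs)
# all.equal(Nind, Nind_by_hand) should be TRUE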
The color codes are
library(RColorBrewer)
plotclr <- colorRampPalette(brewer.pal(7, "RdYlBu")[7:1])(20)
niv <- c(0, 7.651122e-03, 2.736844e-02, 6.642095e-02, 1.198073e-01,
         1.807768e-01, 2.599493e-01, 3.543850e-01, 4.645050e-01,
         6.037386e-01, 7.536596e-01, 9.836452e-01, 1.215549e+00,
         1.568113e+00, 1.990363e+00, 2.538832e+00, 3.228198e+00,
         4.883766e+00, 6.925505e+00, 1.121456e+01, 10000000)
class_q <- cut(as.vector(Q), breaks = niv, labels = 1:20)
colcode <- matrix(plotclr[as.numeric(class_q)], 20, 20)
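A minimal sketch to draw the heatmap of those contributions, using the same breaks and palette (again, only one way to do it),

# sketch: heatmap of the chi-square contributions, with the breaks defined above
nr <- nrow(Q); nc <- ncol(Q)
image(1:nc, 1:nr, t(Q[nr:1, ]), breaks = niv, col = plotclr,
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = 1:nc, labels = colnames(Q), las = 2, cex.axis = .6)
axis(2, at = 1:nr, labels = rev(rownames(Q)), las = 2, cex.axis = .6)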
The values in niv are quantiles of the 1930 chi-square contributions. I kept them so as to use the same breaks for 1950 and 1970, and compare the three cohorts.
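One way to redo the computation for another cohort is to wrap the steps above in a small function (chi2_cohort is just a helper name I use here, reusing the lists L and F of most common names),

# sketch: recompute the chi-square contributions for a given cohort
chi2_cohort <- function(YEAR){
  Freq_FL <- cpttot_c_t[(cpttot_c_t$cohort==YEAR)&
                        (cpttot_c_t$last_name%in%L[1:20])&
                        (cpttot_c_t$first_name%in%F[1:20]),
                        c("last_name","first_name","freq")]
  dt <- dcast.data.table(data.table(Freq_FL), last_name~first_name,
                         fun = sum, value.var = "freq")
  df <- data.frame(dt)
  Nobs <- as.matrix(df[,-1])
  rownames(Nobs) <- df[,1]
  Nind <- chisq.test(Nobs)$expected
  (Nobs-Nind)^2/Nind
}
Q_1950 <- chi2_cohort(1950)
Q_1970 <- chi2_cohort(1970)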
For the 1950 cohort, the counts were
while the chi-square distance matrix was
In red, we have large contributions to the chi-square distance. But a cell can be red either because the observed count was too small or too large; here, we cannot tell, except for very specific cells that we can guess (e.g. William Williams and David Davis are in red... I believe those are uncommon combinations). Observe also that if the family name is Garcia, then the first name is not distributed the way it is for other family names, at least in 1930 and 1950. But when you look at the 1970 cohort,
the chi-square contributions are much smaller. So clearly, the effect that was also observed in @mona's post is related to the year of birth: in 1950, it was rather rare to observe an English-style first name with a Spanish-style last name, but it seems to be much more common nowadays...