
On a First Name Basis with Statistics Sweden

[This article was first published on Theory meets practice..., and kindly contributed to R-bloggers].

Abstract

Judging from recent R-Bloggers posts, it appears that many data scientists are concerned with scraping data from various media sources (Wikipedia, Twitter, etc.). However, one should be aware that well-structured, high-quality datasets are available from state and national statistical offices. Increasingly, these are offered to the public through direct database access, e.g., using a REST API. We illustrate the usefulness of such an approach by accessing data from Statistics Sweden.


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from .

Introduction

Scandinavian countries are world-class when it comes to public registries. So when in need of reliable population data, this is the place to look. As an example, we access Statistics Sweden data through their API using the pxweb package developed by @MansMeg, @antagomir and @LCHansson. Love was the first speaker at a Stockholm R-Meetup some years ago, where I also gave a talk. Funny how such R-Meetups become useful many years later!

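library(pxweb)
library(dplyr)    # provides %>%, select(), rename(), mutate(), group_by(), ...
library(ggplot2)  # used for the plot further below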

By browsing the Statistics Sweden (in Swedish: Statistiska Centralbyrån (SCB)) data using their web interface one sees that they have two relevant first name datasets: one containing the tilltalsnamn of newborns for each year during 1998-2016 and one for the years 2004-2016. Note: A tilltalsnamn in Sweden is the first name (of several possible first names) by which a person is usually addressed. About two thirds of the persons in the Swedish name registry indicate which of their first names is their tilltalsnamn. For the remaining persons it is automatically implied that their tilltalsnamn is the first of the first names. Also note: For reasons of data protection the 1998-2016 dataset contains only first names used 10 or more times in a given year, whereas the 2004-2016 dataset contains only first names used 2 or more times in a given year.

Downloading such data through the SCB web interface is cumbersome, because the downloads are limited to 50,000 data cells per query. Hence, one has to do several manual queries to get hold of the relevant data. This is where their API becomes a real time-saver. Instead of trying to fiddle with the API directly using rjson or RJSONIO, we use the specially designed pxweb package to fetch the data. One can either use the web interface to determine the name of the desired data matrix to query, or navigate directly through the API using pxweb:

d <- interactive_pxweb(api = "api.scb.se", version = "v1", lang = "en")

and select Population followed by Name statistics and then BE0001T04Ar or BE0001T04BAr, respectively, in order to obtain the relevant data and the API download URL. This leads to the following R code for the download:

names10 <- get_pxweb_data(
  url = "http://api.scb.se/OV0104/v1/doris/en/ssd/BE/BE0001/BE0001T04Ar",
  dims = list(Tilltalsnamn = c('*'),
              ContentsCode = c('BE0001AH'),
              Tid = c('*')),
  clean = TRUE) %>% as.tbl
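To get a feeling for the bookkeeping that the 50,000-cell limit of the web interface would impose on manual downloads, here is a minimal base R sketch. It assumes one cell per name and year; the helper `chunk_years()` and the figure of 4,000 distinct names are made up for illustration and are not taken from the SCB data.

```r
# Hypothetical helper: split the years into batches so that
# (number of names) x (years per batch) stays within the cell limit.
chunk_years <- function(years, n_names, cell_limit = 50000L) {
  per_query <- cell_limit %/% n_names   # how many years fit into one query
  if (per_query < 1L) stop("even a single year exceeds the cell limit")
  split(years, ceiling(seq_along(years) / per_query))
}

# Assumed (not real) number of distinct names: 4000
chunks <- chunk_years(1998:2016, n_names = 4000L)
lengths(chunks)   # number of years covered by each manual query
```

Each element of `chunks` would correspond to one manual query in the web interface.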

For better usability we rename the columns a little and replace NA counts with zero. For illustration we show 5 random lines of the dataset.

names10 <- names10 %>% select(-observations) %>%
  rename(firstname=`first name normally used`,counts=values) %>%
  mutate(counts = ifelse(is.na(counts),0,counts))
## Look at 5 random lines
names10 %>% slice(sample(seq_len(nrow(names10)),size=5))
## # A tibble: 5 × 3
##   firstname   year counts
##      <fctr> <fctr>  <dbl>
## 1   Leandro   2011     15
## 2    Marlon   2004      0
## 3    Andrej   2009      0
## 4     Ester   2002     63
## 5   Muhamed   1998      0

Note: Each spelling variant of a name in the data is treated as a unique name. In a similar fashion we download the BE0001AL dataset as names2.

We now join the two datasets into one large data.frame by

names <- rbind(data.frame(names2,type="min02"), data.frame(names10,type="min10"))

and thus have everything in place to compute the name collision probability over time using the birthdayproblem package (as shown in previous posts).

library(birthdayproblem)
# Per year and dataset: collision probability in a class of 26 and Gini index of the counts
collision <- names %>% group_by(year, type) %>% do({
  data.frame(
    p = pbirthday_up(n = 26L, prob = .$counts / sum(.$counts), method = "mase1992")$prob,
    gini = ineq::Gini(.$counts))
}) %>% ungroup %>% mutate(year = as.numeric(as.character(year)))
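For readers who want an intuition for what this probability means, the following is a minimal Monte Carlo sketch in base R. It is not how the birthdayproblem package computes its result; it simply draws classes of 26 names with probabilities proportional to the counts and checks how often at least two kids share a name. The function `simulate_collision()` and the toy counts are hypothetical.

```r
# Hypothetical helper: estimate the collision probability by simulation.
simulate_collision <- function(counts, n = 26L, n_sim = 20000L) {
  p <- counts / sum(counts)
  hits <- replicate(n_sim, {
    # draw n names independently, each with probability p
    draw <- sample.int(length(p), size = n, replace = TRUE, prob = p)
    anyDuplicated(draw) > 0   # TRUE if at least two draws coincide
  })
  mean(hits)
}

set.seed(1)
# Made-up name counts for one year (not SCB data)
toy_counts <- c(120, 80, 40, rep(10, 50), rep(2, 300))
simulate_collision(toy_counts)
```

Being a simulation, the estimate carries Monte Carlo error, which is one reason to prefer an exact method for the actual analysis.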

And the resulting probabilities based on the two datasets min02 (at least two instances of the name in a given year) and min10 (at least ten instances of the name in a given year) can easily be visualized over time.

ggplot(collision, aes(x=year, y=p, color=type)) + geom_line(size=1.5) +
  scale_y_continuous(labels=scales::percent,limits=c(0,1)) +
  xlab("Year") + ylab("Probability") +
  ggtitle("Probability of a name collision in a class of 26 kids born in year YYYY") +
  scale_colour_discrete(name = "Dataset")

As seen in similar plots for other countries, the collision probability declines over time. Note also that the two curves are upper limits to the true collision probabilities: names below the reporting threshold are dropped, which concentrates the probability mass on the remaining, more common names. The true probabilities, i.e. taking all tilltalsnamn into account, would be based on the hypothetical min01 data set. These probabilities would be slightly, but not substantially, below the min02 line. The same problem occurs, e.g., in the corresponding England and Wales data. There, Table 6 lists all first names with 3 or more uses, but does not state how many newborns have a name occurring once or twice, respectively. With all due respect for the need to anonymise the name statistics, it’s hard to understand why this summary figure is not automatically reported, so one would be able to at least compute correct totals or collision probabilities.
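The upper-limit argument can be checked on synthetic data. The sketch below computes the collision probability exactly under the model that each kid's name is drawn independently with probability proportional to the counts, using P(all distinct) = n! e_n(p), where e_n is the elementary symmetric polynomial of degree n. The function `collision_exact()` and the counts are invented for illustration; this is not necessarily the algorithm behind method="mase1992".

```r
# Hypothetical helper: exact P(at least two of n draws share a name).
collision_exact <- function(counts, n = 26L) {
  p <- counts / sum(counts)
  e <- c(1, numeric(n))   # e[k + 1] holds e_k of the names processed so far
  for (p_i in p) {
    # the right-hand side is evaluated first, so each name enters a product at most once
    e[-1] <- e[-1] + p_i * e[-(n + 1)]
  }
  1 - factorial(n) * e[n + 1]
}

# Made-up counts: a few common names and a long tail of rare ones
counts_all <- c(rep(100, 300), rep(20, 1000), rep(5, 2000), rep(1, 5000))
p_all   <- collision_exact(counts_all)                    # hypothetical "min01"
p_min02 <- collision_exact(counts_all[counts_all >= 2])   # names used >= 2 times
p_min10 <- collision_exact(counts_all[counts_all >= 10])  # names used >= 10 times
c(all = p_all, min02 = p_min02, min10 = p_min10)
```

On these synthetic counts the probability grows with the reporting threshold, in line with the reasoning above: removing the rarest names and renormalising makes the remaining names more likely to coincide.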

Summary

Altogether, I was still quite happy to get proper individual name data, so that the collision probabilities are, in contrast to some of my previous blog analyses, exact!
