German Female Corporate Officers: Where Are You?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last week, the German NGO Open Knowledge Foundation Deutschland e.V. has made German Trade Resister data available via the project OffeneRegister.de, together with the British NGO opencorporates. In my last blog post I checked the general accessibility of the data. In this quick follow-up post I follow an idea inspired by a tweet by Johannes Filter to map the gender balance of German corporate officers.
Here is the code for generating the necessary data. I focus on non-dismissed officers and use a first name list which originates from a list provided by a German computer magazine (c’t) and maintained by Matthias Winkelmann. This is not perfect but the best that I can do, given the data. I eliminate names that have duplicate gender classifications.
library(DBI) library(tidyverse) library(ggmap) library(rgdal) library(rgeos) tmp <- tempdir() db <- "~/Downloads/handelsregister.db" con <- dbConnect(RSQLite::SQLite(), db) sql <- "select id, company_number as company_id, current_status, retrieved_at, registered_address from company" res <- dbSendQuery(con, sql) company <- dbFetch(res) dbClearResult(res) sql <- "select id as officer_id, company_id, firstname, start_date, end_date, dismissed from officer" res <- dbSendQuery(con, sql) officer <- dbFetch(res) dbClearResult(res) dbDisconnect(con) company %>% filter(current_status == "currently registered", !is.na(registered_address)) %>% mutate(plz = str_extract(registered_address, "\\d{5}")) %>% filter(plz != "") %>% select(company_id, plz) -> company_plz officer %>% filter(firstname != "", !is.na(firstname), is.na(dismissed)) -> officer_firstname company_plz %>% left_join(officer_firstname) %>% filter(!is.na(firstname)) -> company_firstname nl <- read_csv2("https://raw.githubusercontent.com/MatthiasWinkelmann/firstname-database/master/firstnames.csv") %>% rename(firstname = name) %>% select(firstname, gender) %>% group_by(firstname) %>% filter(n() == 1) %>% ungroup() company_firstname %>% left_join(nl) %>% filter(gender == 'M' | gender == 'F') -> plz_name_gender
The overall share of female officers in Germany based on this classification is 17.0%. How does this share vary across German regions? Let’s get some OSM shape data.
# Shape file is based on OSM data. # Source: https://www.suche-postleitzahl.org/downloads download.file("https://www.suche-postleitzahl.org/download_files/public/plz-gebiete.shp.zip", file.path(tmp, "shape5.zip")) unzip(file.path(tmp, "shape5.zip"), exdir = tmp) plz5_polys <- readOGR(file.path(tmp, "plz-gebiete.shp")) download.file("https://www.suche-postleitzahl.org/download_files/public/plz-2stellig.shp.zip", file.path(tmp, "shape2.zip")) unzip(file.path(tmp, "shape2.zip"), exdir = tmp) plz2_polys <- readOGR(file.path(tmp, "plz-2stellig.shp"))
Now we can create maps. First at the post area level (identified by the first to digits of the PLZ).
plz_name_gender %>% mutate(plz2 = substr(plz, 1, 2)) %>% group_by(plz2) %>% summarise(nobs = n(), female = sum(gender == 'F')/n()) -> plz2_female_share plz_map <- fortify(plz2_polys, region = "plz") %>% left_join(plz2_female_share, by = c("id" = "plz2")) plz2_map <- ggplot(plz_map, aes(x = long, y = lat, group = group, fill = female)) + geom_polygon(colour = NA, lwd=0, aes(group = group)) + scale_fill_gradient2(name = "Share of female officers", low = "orange", mid = "gray90", high = "purple", midpoint = median(plz2_female_share$female, na.rm = TRUE), breaks = 0.05*(2:4), labels = scales::percent(0.05*(2:4), accuracy = 1)) + coord_map() + theme_void() + theme(legend.justification=c(0,1), legend.position=c(0,1), plot.caption = element_text(hjust = 0)) + labs(caption = "Data as provided by OffeneRegister.de. Graph by @JoachimGassen with #rstats.") plz2_map
This does not look good for Baden-Wuerttemberg. Also, it seems as if the female officer share is larger in former East Germany. Without further analysis I can only guess that regions with significant tourist industry (Northern Mecklenburg-Vorpommern, Southern Bavaria) tend to have a higher share of female officers while areas with a significant agricultural sector (e.g., Northern Brandenburg) have lower female representation.
Can we say something when we look at a more granular level (PLZs instead of two digit PLZ regions)? Here I limit the data to PLZs that list at least 30 officers.
plz_name_gender %>% group_by(plz) %>% summarise(nobs = n(), female = sum(gender == 'F')/n()) %>% filter(nobs >= 30) -> plz_female_share plz_map <- fortify(plz5_polys, region = "plz") %>% left_join(plz_female_share, by = c("id" = "plz")) plz5_map <- ggplot(plz_map, aes(x = long, y = lat, group = group, fill = female)) + geom_polygon(colour = NA, lwd=0, aes(group = group)) + scale_fill_gradient2(name = "Share of female officers", low = "orange", mid = "gray90", high = "purple", midpoint = median(plz_female_share$female, na.rm = TRUE), breaks = 0.1*(0:5), labels = scales::percent(0.1*(0:5), accuracy = 1)) + coord_map() + theme_void() + theme(legend.justification=c(0,1), legend.position=c(0,1), plot.caption = element_text(hjust = 0)) + labs(caption = "Data as provided by OffeneRegister.de. Graph by @JoachimGassen with #rstats.") plz5_map
A similar but significantly more patchy picture surfaces. How does this look like when we focus on cities (e.g., Berlin)?
plz_name_gender %>% group_by(plz) %>% summarise(nobs = n(), female = sum(gender == 'F')/n()) -> plz_female_share berlin_polys <- plz5_polys[str_detect(plz5_polys$note, "\\d{5} Berlin"), ] plz_map <- fortify(berlin_polys, region = "plz") %>% left_join(plz_female_share, by = c("id" = "plz")) berlin_map <- ggplot(plz_map, aes(x = long, y = lat, group = group, fill = female)) + geom_polygon(colour = NA, lwd=0, aes(group = group)) + scale_fill_gradient2(name = "Share of female officers", low = "orange", mid = "gray90", high = "purple", midpoint = median(plz_female_share$female, na.rm = TRUE), breaks = 0.1*(0:5), labels = scales::percent(0.1*(0:5), accuracy = 1)) + coord_map() + theme_void() + theme(legend.justification=c(0,1), legend.position=c(0,1), plot.caption = element_text(hjust = 0)) + labs(caption = "Data as provided by OffeneRegister.de. Graph by @JoachimGassen with #rstats.") berlin_map
Well… Not much interesting here (besides the ladies camp in Marzahn!) but maybe other cities/areas reveal more insightful patterns. I will leave this question to the interested reader.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.