Street names

Michael

4 hours ago

This article was first published on r.iresmi.net, and kindly contributed to R-bloggers.

Day 2 of 30DayMapChallenge: « Lines » (previously).

We’ll make a map of the gender of street names in Lyon. We need a database of French first names that gives us the gender associated with each name, and we will extract the streets of Lyon from OpenStreetMap.

library(arrow)
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
library(ggplot2)
library(stringr)
library(sf)
library(osmdata)
library(ggspatial)
library(glue)
library(knitr)

set.seed(42)

First names

if (!file.exists("freq_prenoms.rds")) {
  freq_prenoms <- read_parquet("https://www.insee.fr/fr/statistiques/fichier/8205621/prenoms-2023-nat.parquet") |> 
    filter(preusuel != "_PRENOMS_RARES") |> 
    mutate(preusuel = iconv(preusuel, to = "ASCII//TRANSLIT")) |> 
    group_by(preusuel, sexe) |> 
    summarise(n = sum(nombre, na.rm = TRUE),
              .groups = "drop_last") |>
    mutate(total = sum(n)) |> 
    ungroup() |> 
    mutate(sexe = case_when(sexe == 1 ~ "M",
                            sexe == 2 ~ "F",
                            .default = NA_character_)) |> 
    pivot_wider(names_from = sexe, 
                values_from = n,
                values_fill = 0) |> 
    mutate(across(c(M, F), \(x) x / total)) |> 
    write_rds("freq_prenoms.rds")
} else {
  freq_prenoms <- read_rds("freq_prenoms.rds")
}

We have 34234 first names and their gender frequencies since 1900.

Sample of first names
preusuel	total	M	F
ZENABOU	48	0	1
EMILIENE	25	0	1
KINGSLEY	878	1	0
DOLOVAN	73	1	0
ERCOLE	67	1	0
YVA	178	0	1
ISSEY	79	1	0
SAWSSEN	121	0	1
MISBAH	24	0	1
GOHANN	20	1	0
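
To make the M and F columns concrete, here is a minimal sketch of the same aggregation on a made-up table. The names and counts in `toy` are invented for illustration and are not taken from the INSEE file; the sex is already recoded to "M"/"F" to keep the example short.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts, in the shape of the INSEE file after recoding `sexe`
toy <- tribble(
  ~preusuel, ~sexe, ~nombre,
  "CAMILLE", "M",   30,
  "CAMILLE", "F",   70,
  "JEAN",    "M",   100
)

toy_freq <- toy |>
  group_by(preusuel, sexe) |>
  summarise(n = sum(nombre), .groups = "drop_last") |>
  # still grouped by first name: total over both sexes
  mutate(total = sum(n)) |>
  ungroup() |>
  pivot_wider(names_from = sexe, values_from = n, values_fill = 0) |>
  # turn the counts into shares of the total
  mutate(across(c(M, F), \(x) x / total))

toy_freq
```

In this toy table CAMILLE ends up with a female share of 0.7, below the 0.8 threshold used later, while JEAN has a male share of 1.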

Map data

lyon_bbox <- getbb("Lyon, France", featuretype = "city")

if (!file.exists("osm.rds")) {
  lyon <- opq(lyon_bbox) |>
    add_osm_features(features = c(
      '"highway"="motorway"',
      '"highway"="trunk"',
      '"highway"="primary"',
      '"highway"="secondary"',
      '"highway"="tertiary"',
      '"highway"="motorway_link"',
      '"highway"="trunk_link"',
      '"highway"="primary_link"',
      '"highway"="secondary_link"',
      '"highway"="tertiary_link"',
      '"highway"="motorway_junction"',
      '"highway"="unclassified"',
      '"highway"="service"',
      '"highway"="pedestrian"',
      '"highway"="living_street"',
      '"highway"="residential"')) |> 
    osmdata_sf() |> 
    pluck("osm_lines") |> 
    select(osm_id, name) |> 
    drop_na(name) |> 
    group_by(name) |> 
    summarise() |> 
    write_rds("osm.rds")
} else {
  lyon <- read_rds("osm.rds")
}

Finding first names in street names

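We use a brute-force method: for each street, we check whether any word of its label appears in our list of female or male first names. We keep only first names that are strongly associated with one gender (a share above 0.8). A street is then labelled with the gender that has the most matches; ties are "undecidable" and streets without any match are "not concerned".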

female <- freq_prenoms |> 
  filter(F > .8,
         str_length(preusuel) > 1,
         preusuel != "LA") |> 
  pull(preusuel)

male <- freq_prenoms |> 
  filter(M > .8, 
         str_length(preusuel) > 1) |> 
  pull(preusuel)

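# One regex per gender: any listed name, matched as a whole word.
# (glue_collapse()'s `last` argument only replaces the separator between the
# last two items, so it cannot be used to close the pattern with "\\b".)
name_pattern <- function(x) str_c("\\b(?:", str_c(x, collapse = "|"), ")\\b")

street_gender <- lyon |> 
  mutate(name = str_to_upper(iconv(name, to = "ASCII//TRANSLIT")),
         m = str_extract_all(name, name_pattern(male)),
         f = str_extract_all(name, name_pattern(female)),
         # .x = female matches, .y = male matches
         gender = map2_chr(f, m, ~ case_when(length(.y) > length(.x) ~ "male",
                                             length(.x) > length(.y) ~ "female",
                                             identical(.x, character(0)) & 
                                               identical(.y, character(0)) ~ "not concerned",
                                             length(.x) == length(.y) ~ "undecidable",
                                             .default = NA_character_)))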

Sample of classification
name	geometry	m	f	gender
COURS DE VERDUN RECAMIER	LINESTRING (4.830426 45.748…			not concerned
IMPASSE DES ANGLAIS	LINESTRING (4.795807 45.753…			not concerned
RUE DES PROVENCES	LINESTRING (4.79335 45.7369…			not concerned
CHEMIN DES PEUPLIERS	LINESTRING (4.866587 45.801…			not concerned
ALLEE DU LEVANT	LINESTRING (4.878859 45.759…			not concerned
RUE ROPOSTE	LINESTRING (4.866353 45.760…			not concerned
ALLEE NELLIE BLY	LINESTRING (4.84882 45.7429…		NELLIE	female
QUAI JEAN MOULIN	MULTILINESTRING ((4.837853 …	JEAN		male
LA VIEILLE ROUTE	LINESTRING (4.769782 45.720…			not concerned
AVENUE DE CHAMPAGNE	MULTILINESTRING ((4.796801 …			not concerned
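
The classification rule can be tried on a handful of labels without downloading anything. This is only a sketch: `male_toy`, `female_toy`, `as_pattern()` and `classify()` are hypothetical stand-ins, and the pattern below is one possible way to require whole-word matches.

```r
library(stringr)

# Hypothetical mini-lists standing in for the `male` and `female` vectors
male_toy   <- c("JEAN", "PAUL")
female_toy <- c("NELLIE", "MARIE")

# Wrap the alternation in word boundaries so that only whole words match
as_pattern <- function(x) str_c("\\b(?:", str_c(x, collapse = "|"), ")\\b")

classify <- function(street) {
  n_m <- str_count(street, as_pattern(male_toy))
  n_f <- str_count(street, as_pattern(female_toy))
  if (n_m == 0 && n_f == 0) return("not concerned")
  if (n_m > n_f) return("male")
  if (n_f > n_m) return("female")
  "undecidable"
}

streets <- c("QUAI JEAN MOULIN", "ALLEE NELLIE BLY",
             "RUE JEAN ET MARIE", "RUE DES MARIETTES")
results <- vapply(streets, classify, character(1), USE.NAMES = FALSE)
results
```

The last label shows why the word boundaries matter: MARIETTES contains MARIE, but not as a whole word, so the street stays "not concerned".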

Map

street_gender |> 
  mutate(gender = factor(gender, levels = c("female", "male", "undecidable", "not concerned"))) |> 
  st_set_crs("EPSG:4326") |> 
  ggplot() +
  geom_sf(aes(color = gender), 
          linewidth = .5,
          key_glyph = "timeseries") +
  scale_color_manual(values = c("female" = "lightpink1",
                                "male" = "lightskyblue",
                                "undecidable" = "lightyellow4",
                                "not concerned" = "seashell2")) +
  annotation_scale(bar_cols =  c("darkgrey", "white"),
                   line_col = "darkgrey",
                   text_col = "darkgrey",
                   height = unit(0.1, "cm")) +
  coord_sf(xlim = lyon_bbox[c(1, 3)],
           ylim = lyon_bbox[c(2, 4)]) +
  labs(title = "Gender in Lyon street names",
       color = "",
       caption = glue("Map data © OpenStreetMap contributors
                      using INSEE Fichier des prénoms 2023
                      r.iresmi.net - {Sys.Date()}")) +
  theme_void() +
  theme(plot.background = element_rect(color = NA, 
                                       fill = "white"),
        plot.caption = element_text(size = 5,
                                    color = "darkgrey"))

Possible misclassifications

Many biases make this map unreliable; it would need manual editing…

Epicene first names

Some first names can be male or female (GWEN, CAMILLE, DOMINIQUE).
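
One way to get a list of such names for manual review is to look for those that reach the 0.8 threshold for neither gender. The sketch below runs on a made-up table in the shape of `freq_prenoms`; the shares are invented, not INSEE values.

```r
library(dplyr)

# Made-up shares in the shape of `freq_prenoms`
toy_freq <- tribble(
  ~preusuel,   ~total, ~M,   ~F,
  "CAMILLE",   1000,   0.25, 0.75,
  "DOMINIQUE", 1000,   0.60, 0.40,
  "JEAN",      1000,   1.00, 0.00
)

# Names that are in neither the `male` nor the `female` list
epicene <- toy_freq |>
  filter(M <= .8, F <= .8) |>
  pull(preusuel)

epicene
```

With the classification used here, streets carrying only such a name fall into "not concerned", since the name is in neither list.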

Not concerned

streets named after a person but without the first name (RUE VILLON)
a title instead of a first name (RUE DE L’AMIRAL COURBET)

Has a gender but shouldn’t

common nouns used as first names (CHEMIN DE LA POMME), mainly for girls…
unusual first names (AUTOROUTE DU SOLEIL: Soleil seems to be a girl’s name…)

Accidentally well classified

the last name is also a first name (COURS BAYARD)
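
The manual editing mentioned above could take the form of a small hand-made corrections table joined onto the automatic result. This is a sketch under assumptions: `classified` stands in for `street_gender` without its geometry, and `overrides` and its `gender_manual` column are hypothetical.

```r
library(dplyr)

# Stand-in for the automatic classification (geometry omitted)
classified <- tibble(
  name   = c("AUTOROUTE DU SOLEIL", "RUE VILLON", "QUAI JEAN MOULIN"),
  gender = c("female", "not concerned", "male")
)

# Hypothetical hand-made corrections, one row per street to fix
overrides <- tibble(
  name          = c("AUTOROUTE DU SOLEIL", "RUE VILLON"),
  gender_manual = c("not concerned", "male")
)

corrected <- classified |>
  left_join(overrides, by = "name") |>
  # use the manual value when there is one, the automatic one otherwise
  mutate(gender = coalesce(gender_manual, gender)) |>
  select(-gender_manual)

corrected
```

Streets that are not listed in `overrides` keep their automatic label.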
