Extracting notable deaths from Wikipedia
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I like Wikipedia. My husband likes it even more, he included it in his PhD thesis acknowledgements! I appreciate the efforts done for sharing knowledge, and also the apparently random stuff you can find on the website. In particular, I’ve been intrigued by the monthly lists of notable deaths such as this one. Who are people (or dogs, yes, dogs) whose life was deemed notable enough to be listed there? Also, using the numbers of such deaths, can I judge whether 2016 was really worse than previous years? The first step in answering these questions was to scrape the data. I’ll describe the process in this post. In another post I’ll have a look at my study population and in a third post I’ll analyse the time series of death counts.
I have extracted all deaths listed in Wikipedia from 2004 to 2016 from monthly pages. For most deaths, I extracted the Wikipedia link, the name of the person, their age and the first part of their presentation, which most often includes a nationality and reason for being famous, e.g. “Italian astrophysicist”. I chose not to get the rest of the line if it was longer, because the length of the reason for being famous was quite variable and cause of death was not consistently indicated.
Note that there are list of notable deaths for deaths occurred before 2004 but they are in a different format so let’s say it’s a challenge for another day (or another person, like you, dear reader?).
Downloading the lists of deaths
I started by downloading the content of all monthly lists of deaths from 2004 to 2016.
library("rvest")
library("dplyr")
library("purrr")
library("tidyr")
library("lazyeval")
library("tibble")
library("fuzzyjoin")
library("stringr")
months <- c("January", "February",
"March", "April",
"May", "June",
"July", "August",
"September", "October",
"November", "December")
years <- 2004:2016
pages_content <- map(months, paste, years, sep = "_") %>%
unlist() %>%
# read page with monthly deaths
map(function(x){
read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_", x))})
Transforming the content of the lists into a table
I actually did all this stuff months ago. At that time apparently I was patient enough to figure out how to extract information from the list. I was surprised to see my code still worked, for getting all of 2016 I just needed to change 2015 into 2016. Note that I use the name list, it’s a list on the webpage, not a table, which means I couldn’t just rely on rvest
and take a nap. My main goal was to be able to get something nice for typical entries and just forget about the non-typical entries, hoping there wouldn’t be many. Here are the two functions I defined. I use stringr
, not stringi
, because it’s an old code where this was enough but since reading Bob Rudis’ post I am more curious about stringi
.
transform_day <- function(day_deaths, day_in_month){
# filter only those that have the format of lines presenting deaths
day_deaths <- str_replace_all(day_deaths$days, "<ul><li>", "")
day_deaths <- str_split(day_deaths, "<li>", simplify = TRUE)
day_deaths <- day_deaths[!str_detect(day_deaths, "Death")]
day_deaths <- day_deaths[!str_detect(day_deaths, "Category")]
# Erases the end of each line
day_deaths <- str_replace_all(day_deaths, "\\(<i>.*", "")
# Create a table
day_deaths <- tibble_(list(line = ~day_deaths))
# Variable for grouping by row
day_deaths <- mutate_(day_deaths, row = interp(~1:nrow(day_deaths)))
day_deaths <- group_by_(day_deaths, "row")
# separate by the word "title", first part is a link, second part description
day_deaths <- mutate_(day_deaths, line = interp(~str_split(line, "title")))
day_deaths <- mutate_(day_deaths, wiki_link = interp(~str_replace_all(line[[1]][1], "<a href=\\\"/wiki/", "")))
# get Wikipedia link or better said the thing to paste to Wikipedia address
day_deaths <- mutate_(day_deaths,
wiki_link = interp(~str_replace_all(wiki_link, "\\\\", "")))
day_deaths <- mutate_(day_deaths,
wiki_link = interp(~str_replace_all(wiki_link, "\"", "")))
# now transform the description into several columns
# the format of the end of the description is variable so
# I only keep the beginning of the reason for notoriety i.e. country of origin
# and a role
# anyway cause of death is not written for all and I don't want to use many details
day_deaths <- mutate_(day_deaths, content = interp(~line[[1]][2]))
day_deaths <- mutate_(day_deaths,
content = interp(~str_replace_all(content, '<a.*a>', "")))
day_deaths <- separate_(day_deaths, "content",
into = c("name", "age", "country_role"),
sep = ",")
# when no age
day_deaths <- mutate_(day_deaths, country_role = interp(~ifelse(is.na(country_role),
age,
country_role)))
day_deaths <- mutate_(day_deaths, age = interp(~ as.numeric(age)))
# improves the name
day_deaths <- mutate_(day_deaths,
name = interp(~ str_replace_all(name, "\">.*", "")))
day_deaths <- mutate_(day_deaths,
name = interp(~ str_replace_all(name, "=\"", "")))
# improves the country_role
day_deaths <- mutate_(day_deaths,
country_role = interp(~ str_replace_all(country_role, "</li>", "")))
day_deaths <- mutate_(day_deaths,
country_role = interp(~ str_replace_all
(country_role, "</ul>", "")))
# get rid of original line
day_deaths <- select_(day_deaths, quote(- line))
# get rid of grouping
day_deaths <- ungroup(day_deaths)
day_deaths <- select_(day_deaths, quote(- row))
return(day_deaths)
}
parse_month_deaths <- function(month_deaths){
# remember month and year
title <- stringr::str_extract(toString(html_nodes(month_deaths, "title")),
"Deaths in .* -")
title <- str_replace_all(title, "Deaths in ", "")
title <- str_replace_all(title, " -", "")
title <- str_replace_all(title, "January", "01")
title <- str_replace_all(title, "February", "02")
title <- str_replace_all(title, "March", "03")
title <- str_replace_all(title, "April", "04")
title <- str_replace_all(title, "May", "05")
title <- str_replace_all(title, "June", "06")
title <- str_replace_all(title, "July", "07")
title <- str_replace_all(title, "August", "08")
title <- str_replace_all(title, "September", "09")
title <- str_replace_all(title, "October", "10")
title <- str_replace_all(title, "November", "11")
title <- str_replace_all(title, "December", "12")
# find days with deaths
content <- toString(month_deaths)
content <- str_split(content, "\n", simplify = TRUE)
days <- which(str_detect(content, "h3"))
paragraphs <- which(str_detect(content, "<ul>"))
last_good <- max(which(str_detect(unlist(lapply(html_nodes(month_deaths, "h3"),
toString)),
pattern = "title=Deaths_in_.*_..")))
possible_days <- 1:last_good
possible_days <- possible_days[diff(c(days[possible_days], 999999)) > 1]
# read only lines
month_deaths <- html_nodes(month_deaths, "ul")
first_not_good <- max(which(str_detect(month_deaths, "Template")))
first_good <- min(which(str_detect(month_deaths, "wiki")))
paragraphs <- paragraphs[first_good:(first_not_good - 1)]
if(length(paragraphs) > length(possible_days)){
jours <- NULL
for(paragraph in paragraphs){
x <- days[days < paragraph]
jours <- c(jours,
possible_days[which(abs(x-paragraph)==min(abs(x-paragraph)))])
}
possible_days <- jours
}
month_deaths <- month_deaths[first_good:(first_not_good - 1)]
month_deaths <- unlist(map(month_deaths, toString))
# transform for getting the different columns
month_deaths_table <- data.frame(days = month_deaths,
day_in_month = possible_days,
title = title) %>%
by_row(transform_day, .to = "all_deaths") %>%
unnest_("all_deaths") %>%
group_by_("wiki_link") %>%
mutate_(index = ~ 1:n()) %>%
filter_(~index == 1) %>%
select_(quote(-index)) %>%
group_by_("days") %>%
mutate_(date = interp(~ lubridate::dmy(paste(day_in_month, title)))) %>%
select_(quote(- day_in_month)) %>%
select_(quote(- title)) %>%
ungroup() %>%
select_(quote(- days))
month_deaths_table
}
Once the functions were written (and tested on a few pages) I simply mapped them to all the pages.
deaths_2004_2016 <- pages_content %>%
map(parse_month_deaths) %>%
dplyr::bind_rows()
knitr::kable(head(deaths_2004_2016))
I was already quite happy to get this table, but I wanted to add a country to most rows, and separate the role of the person from the adjectival.
Get the demonyms table from Wikipedia
I discovered Wikipedia has a table (a table! a table!) of adjectivals for many countries and nations. The only thing I changed was getting one line by adjectivals when there were several ones by country or nation. I also calculated the number of words in this adjectival in order to be able to easily remove it from the "country_role"
column and thus get the role on its own.
demonyms_page <- read_html("https://en.m.wikipedia.org/wiki/List_of_adjectival_and_demonymic_forms_for_countries_and_nations")
demonyms_table <- html_nodes(demonyms_page, "table")[1]
demonyms_table <- html_table(demonyms_table, fill = TRUE)[[1]]
names(demonyms_table) <- c("country", "adjectivals", "demonyms", "colloquial_demonyms")
demonyms_table <- tibble::as_tibble(demonyms_table)
demonyms_table <- by_row(demonyms_table,
function(df){
adjectivals <- tibble_(list(goodadjectivals = interp(~str_split(df$adjectivals, " or "))))
}, .collate = "list", .to = "good_adjectivals") %>%
unnest_("good_adjectivals") %>%
unnest_("goodadjectivals")
demonyms_table <- select_(demonyms_table, quote(- adjectivals))
demonyms_table <- rename_(demonyms_table, "adjectivals" = "goodadjectivals" )
demonyms_table <- by_row(demonyms_table,
function(df){
adjectivals <- tibble_(list(goodadjectivals = interp(~str_split(df$adjectivals, ","))))
}, .collate = "list", .to = "good_adjectivals") %>%
unnest_("good_adjectivals") %>%
unnest_("goodadjectivals")
demonyms_table <- select_(demonyms_table, quote(- adjectivals))
demonyms_table <- rename_(demonyms_table, "adjectivals" = "goodadjectivals" )
demonyms_table <- mutate(demonyms_table, adjectivals = trimws(adjectivals ))
demonyms_table <- by_row(demonyms_table,
function(df){
length(str_split(df$adjectivals, " ")[[1]])},
.to = "adj_length", .collate = "cols")
Using adjectivals to split deaths’ country and role
For finding which country/nation to add to a line I used fuzzyjoin::regex_left_join()
which worked well but a bit slowly given the number of lines.
deaths <- deaths_2004_2016
demonyms <- demonyms_table
deaths <- mutate(deaths, country_role = str_replace(country_role,
"<.*", ""))
demonyms <- mutate(demonyms, adjectivals = strsplit(adjectivals, ","))
demonyms <- unnest(demonyms, adjectivals)
demonyms <- mutate(demonyms,
adjectivals = paste0(adjectivals, " .*"))
deaths <- regex_left_join(deaths,
demonyms, by = c("country_role" = "adjectivals"))
deaths <- mutate(deaths,
country = ifelse(str_detect(country_role, "American"),
"United States",
country))
deaths <- mutate(deaths,
adjectivals = ifelse(country == "United States",
"American", adjectivals))
deaths <- mutate(deaths,
adj_length = ifelse(country == "United States",
1, adj_length))
# keep one country only
deaths <- group_by(deaths, wiki_link) %>%
mutate(index = 1:n()) %>%
filter(index == 1)
deaths <- by_row(deaths,
function(df){
if(!is.na(df$adj_length)) {
country_role <- trimws(df$country_role)
splitted <- str_split(country_role, " ")[[1]]
role <- toString(splitted[(df$adj_length + 1):length(splitted)])
str_replace_all(role, ",", "")
}else{
df$country_role
}
}, .to = "occupation", .collate = "cols")
deaths <- mutate(deaths,
occupation = str_replace_all(occupation, "\\r", ""))
deaths <- mutate(deaths,
occupation = str_replace_all(occupation, "\\n", ""))
deaths <- mutate(deaths,
occupation = str_replace_all(occupation, "\\.", ""))
deaths <- select(deaths, - demonyms, - colloquial_demonyms,
- index)
readr::write_csv(deaths, path = "data/deaths_with_demonyms.csv")
knitr::kable(head(deaths))
wiki_link | name | age | country_role | date | country | adj_length | adjectivals | occupation |
---|---|---|---|---|---|---|---|---|
Harold_Henning | Harold Henning | 69 | South African golfer. | 2004-01-01 | South Africa | 2 | South African .* | golfer |
Elma_Lewis | Elma Lewis | 82 | American arts leader. | 2004-01-01 | United States | 1 | American | arts leader |
Manuel_F%C3%A9lix_L%C3%B3pez | Manuel Félix López | 66 | Ecuadorian politician. | 2004-01-01 | Ecuador | 1 | Ecuadorian .* | politician |
Frederick_Redlich | Frederick Redlich | 93 | Austrian-born American dean of the | 2004-01-01 | United States | 1 | American | American dean of the |
Etta_Moten_Barnett | Etta Moten Barnett | 102 | American actress. | 2004-01-02 | United States | 1 | American | actress |
Lynn_Cartwright | Lynn Cartwright | 76 | U.S. actress. | 2004-01-02 | United States[20] | 1 | U.S. .* | actress |
In the table I have information about 56303 notable deaths. I know the age of 97% of them, a country or nation for 96.2% of them. Not too bad I think! It was then time to stop scraping webpages and to start digging into the data… Who were these people? and How bad was 2016?
I’d like to end this post with a note from my husband, who thinks having a blog makes me an influencer. If you too like Wikipedia, consider donating to the Wikimedia foundation.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.