Who were the notable dead of Wikipedia?

[This article was first published on Maëlle, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As described in my last post, I extracted all notable deaths from Wikipedia over the 2004-2016 period. In this post I want to explore this study population. Who were the notable dead?

How old were notable dead?

Let me assume here most entries of the table are humans. I won’t make the effort to remove dogs or horses from the list yet, which introduces a small mistake.

library("ggplot2")
library("viridis")
library("broom")
library("dplyr")
library("lubridate")
library("tidytext")
library("rcorpora")
deaths <- readr::read_csv("data/deaths_with_demonyms.csv")

As a reminder, in case you didn’t learn the figures from my last post (shame on you), the table contains information about 56303 notable deaths. I could extract the age of 97% of them.

ggplot(deaths) +
  geom_histogram(aes(age)) +
  ggtitle("Age at death of Wikipedia notable dead")

plot of chunk unnamed-chunk-2

Let’s be honest, I expected a bimodal distribution with a first peak at 27.

tidy(summary(deaths$age))
##   minimum q1 median  mean q3 maximum   na
## 1       1 68     80 75.94 88     176 1677

Wow this is a really high maximal age.

arrange(deaths, desc(age)) %>%
  head(n = 10) %>%
  knitr::kable()
wiki_link name age country_role date country adj_length adjectivals occupation
Harriet_(tortoise) Harriet (tortoise) 176 NA 2006-06-23 NA NA NA NA
Eisenhower_Tree Eisenhower Tree 125 American 2014-02-16 United States 1 American NA American
J%C3%B3zef_Piotrowski_(organist) Józef Piotrowski (organist) 118 Polish organist and longevity claimant. 2005-09-08 Poland 1 Polish .* organist and longevity claimant
Misao_Okawa Misao Okawa 117 Japanese supercentenarian 2015-04-01 Japan 1 Japanese .* supercentenarian
Pawe%C5%82_Parniak Pawel Parniak 116 Polish supercentenarian 2006-03-27 Poland 1 Polish .* supercentenarian
Gertrude_Weaver Gertrude Weaver 116 American 2015-04-06 United States 1 American NA American
Jai_Gurudev Jai Gurudev 116 Indian religious leader. 2012-05-18 India 1 Indian .* religious leader
Susannah_Mushatt_Jones Susannah Mushatt Jones 116 American 2016-05-12 United States 1 American NA American
Jiroemon_Kimura Jiroemon Kimura 116 Japanese 2013-06-12 Japan 1 Japanese .* NA Japanese
Jeralean_Talley Jeralean Talley 116 American supercentenarian 2015-06-17 United States 1 American supercentenarian

Ok, so the oldest beings in this table were a tortoise and a tree, which we might want to remove from the rest of the analysis.

deaths <- filter(deaths, age < 125)

What about the deaths at the youngest ages?

arrange(deaths, age) %>%
  head(n = 10) %>%
  knitr::kable()
wiki_link name age country_role date country adj_length adjectivals occupation
Manar_Maged class=mw-redirect Manar Maged 1 Egyptian girl born with two heads 2006-03-26 Egypt 1 Egyptian .* girl born with two heads
Ayelet_Galena Ayelet Galena 2 American child 2012-01-31 United States 1 American child
Colonel_Meow Colonel Meow 2 American Himalayan-Persian cat 2014-01-29 United States 1 American Himalayan-Persian cat
Ben_Bowen Ben Bowen 2 American child cancer victim 2005-02-25 United States 1 American child cancer victim
Marius_(giraffe) Marius (giraffe) 2 Danish giraffe 2014-02-09 Denmark 1 Danish .* giraffe
Disappearance_of_Aisling_Symes class=mw-redirect Disappearance of Aisling Symes 2 New Zealand child whose disappearance initiated major search 2009-10-05 New Zealand 2 New Zealand .* child whose disappearance initiated major search
Paul_the_Octopus Paul the Octopus 2 British-born 2010-10-26 NA NA NA British-born
Chriselliam Chriselliam 3 Irish-bred British-trained Thoroughbred racehorse 2014-02-07 NA NA NA Irish-bred British-trained Thoroughbred racehorse
Eight_Belles Eight Belles 3 American racehorse 2008-05-03 United States 1 American racehorse
Sybil_(cat) Sybil (cat) 3 British Downing Street cat 2009-07-27 England 1 British .* Downing Street cat

As one could have expected, the deaths at youngest ages are some sad stories, about humans but also animals.

Did the age distribution change over time?

deaths <- mutate(deaths, death_year = as.factor(year(date)))
ggplot(deaths) +
  geom_boxplot(aes(death_year, age, fill = death_year)) +
  scale_fill_viridis(discrete = TRUE) +
  theme(legend.position = "none")

plot of chunk unnamed-chunk-7

Well maybe there is an increasing trend? I wouldn’t be surprised if it were the case, since life expectancy tends to increase. I first wrote I wouldn’t take the time to test the trend and then I had a very interesting discussion with Miles McBain and Nick Tierney. I had first thought of a linear model, then of a survival analysis but I only have positive events. While using a linear model or a GLM the residuals were never normally distributed. Then Miles mentioned non-parametric tests which is something I never think about. Googling a bit around I fount the Mann-Kendall test!

I’m quite lucky I want to see if age at death monotically increases over time because that seems to be the usual use case for it. I choose to use the time series of weekly median age, which I’m not too sure is the best choice. I could have chosen monthly average age, etc.

library("trend")
library("lubridate")
weekly_median_age <- deaths %>% 
  filter(!is.na(age)) %>%
  group_by(wiki_link) %>%
  mutate(week = paste(year(date), week(date))) %>%
  group_by(week) %>%
  summarize(age = median(age)) %>% .$age
weekly_median_age <- as.ts(weekly_median_age)
plot(weekly_median_age)

plot of chunk unnamed-chunk-8

res <- mk.test(weekly_median_age)
summary.trend.test(res)
## Mann-Kendall Test
##  
## two-sided homogeinity test
## H0: S = 0 (no trend)
## HA: S != 0 (monotonic trend)
##  
## Statistics for total series
##       S     varS    Z   tau     pvalue
## 1 79925 36111378 13.3 0.337 < 2.22e-16

Using this test I have now more support for the existence of a trend, but not for its direction. The same package has an implementation of Sen’s method to compute the slope.

sens <- sens.slope(weekly_median_age, level = 0.95)
sens
##  
## Sen's slope and intercept
##  
##  
## slope:  0.0065
## 95 percent confidence intervall for slope
## 0.0074 0.0056
##  
## intercept: 77.1831
## nr. of observations: 689

With such a slope in one year one gains 0.34 years. Will we soon have humans as old as Harriet the tortoise?

Where did notable dead come from?

deaths %>% group_by(country) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(n = 10) %>%
  knitr::kable()
country n
United States 20220
England 6163
Canada 2183
Australia 1783
India 1604
France 1277
Germany 1277
Italy 1141
Russia[15] 884
NA 857

Unsurprisingly given what I imagine to be the countries of Wikipedia contributors in English, mostly from developped countries, and then India is a huge English-speaking country. It’d probably be interesting to repeat the same data extraction for all languages and see how we rather know celebrities speaking our own language or sharing our culture.

What were the reasons of notability?

I first played with the idea of using my monkeylearn package to associated an industry to each occupation/reason for being notable, but I soon realized the description was too short for the extractor. I also soon saw I wouldn’t be able to find a good list of jobs, so I resorted to simply look for the most present terms using tidytext. For removing the stop-words I used rcorpora.

stopwords <- corpora("words/stopwords/en")$stopWords

deaths_words <- deaths %>%
  unnest_tokens(word, occupation) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% stopwords)


head(deaths_words, n = 10) %>%
  knitr::kable()
word n
politician 6285
player 4251
actor 2807
footballer 2022
football 1899
actress 1693
singer 1593
writer 1526
american 1476
olympic 1396

From these 10 most prevalent terms we could assume being a politician, some sort of athlete (player could also be a football player) or artist can make you notable. It’s interesting to see there are far more actors than actresses. In case you didn’t get the message, in the table there are 756 businessmen, 44 businesswomen, 4 business persons.

I also noticed that there are 147 murderers and 41 serial killers vs. 232 chemists and 46 statisticians. Since the term “data scientist” is quite young, there is none in my table, and I sure wish you’ll all stay healthy, my friends! In the next post I’ll present the analysis of the time series of monthly count of deaths.

If you liked learning more about notable dead, you can have a look at the analysis Hazel Kavili started doing of celebrity deaths in 2016.

I’d like to end this post with a note from my husband, who thinks having a blog makes me an influencer. If you too like Wikipedia, consider donating to the Wikimedia foundation.

To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)