[This article was first published on You Know, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Wikipedia has pages for each day of the year (e.g., January 1, April 25). Each page contains a list of names of famous people with that birthday along with a short description of each person. I wrote an R script to scrape these lists from each wikipedia page and then analyze the data to find which days have unusually high occurrences of certain words (based on the tf-idf statistic).Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Three days stood out:
- May 2: columnists
- March 16: bicycle racers
- March 23: racing drivers
So, why are Racing Drivers so often born on March 23?
R script:
1: 2: # Prepare ----------------------------------------------------------------- 3: rm(list = ls()) 4: gc() 5: pkg <- c("tidyverse", "rvest", "tidytext", "ggplot2", "beepr") 6: inst <- pkg %in% installed.packages() 7: if(length(pkg[!inst]) > 0) install.packages(pkg[!inst]) 8: lapply(pkg, library, character.only = TRUE) 9: rm(list = c("inst", "pkg")) 10: setwd("/Users/danieldunn/Dropbox/DD Cloud/R/birthday_types") 11: set.seed(4444) 12: 13: 14: # Download wiki birthday data --------------------------------------------- 15: births <- seq(as.POSIXct("2004-01-01"), as.POSIXct("2004-12-31"), by = "days") %>% 16: format(format = "%B_%e") %>% 17: gsub(pattern = "_ ", replacement = "_") %>% 18: as.list() 19: #births <- births[1:5] # remove when ready for whole year 20: urls <- paste0("https://en.wikipedia.org/wiki/", births) 21: ndays <- length(urls) 22: 23: for(i in 1:ndays) { 24: cat("Trying", i, "of", length(births), "urls.\n") 25: section <- ifelse(urls[i] == "https://en.wikipedia.org/wiki/February_29", 26: 2, 27: read_html(urls[i]) %>% 28: html_nodes(xpath = '//*[@id="toc"]/ul') %>% 29: html_text() %>% 30: gsub(pattern = " Births.*$", replacement = "") %>% 31: gsub(pattern = "^.*\n", replacement = "")) 32: births[[i]][3] <- read_html(urls[i]) %>% 33: html_nodes(xpath = paste0('//*[@id="mw-content-text"]/div/ul[', section, ']')) %>% 34: html_text() 35: births[[i]][2] <- gregexpr(pattern = "\\n", text = births[[i]][3]) %>% 36: regmatches(x = births[[i]][3]) %>% 37: lengths() 38: Sys.sleep(time = 0.1) 39: } 40: 41: 42: # Find outlier word frequencies ------------------------------------------- 43: data(stop_words) 44: termset <- data.frame(word = as.character(), n = as.integer(), 45: day = as.character(), tf = as.numeric(), 46: stringsAsFactors = FALSE) 47: for(i in 1:ndays) { 48: terms <- data_frame(text = births[[i]][3]) %>% 49: unnest_tokens(word, text, to_lower = FALSE) %>% 50: anti_join(y = stop_words, by = "word") %>% 51: count(word, sort = TRUE) %>% 52: filter(!grepl(pattern = "[[:digit:]].*$", x = word)) %>% 53: filter(!grepl(pattern = "[[:upper:]].*$", x = word)) %>% 54: mutate(day = births[[i]][1], tf = n / sum(n)) 55: termset <- rbind(termset, terms) 56: } 57: 58: full <- termset %>% 59: count(word, sort = TRUE) %>% 60: mutate(idf = log(x = ndays / nn), base = 10) 61: 62: termset <- termset %>% 63: left_join(y = full, by = "word") %>% 64: mutate(tfidf = round(tf * idf, 3)) %>% 65: arrange(desc(tfidf)) %>% 66: mutate(dayN = as.Date(paste0(gsub(pattern = "_", replacement = " ", x = day), ", 2004"), format = "%B %e, %Y")) 67: 68: 69: # Roll up to word level --------------------------------------------------- 70: wordset <- termset %>% 71: group_by(word) %>% 72: summarize(frequency = sum(n)) %>% 73: ungroup() %>% 74: arrange(desc(frequency)) 75: 76: 77: # Display extremes -------------------------------------------------------- 78: topT <- termset %>% 79: group_by(day) %>% 80: top_n(n = 1, wt = tfidf) %>% 81: ungroup() %>% 82: filter(n >= 2) %>% 83: arrange(dayN) %>% 84: print 85: 86: topW <- wordset %>% 87: top_n(n = 10, wt = frequency) %>% 88: filter(frequency >= 1) %>% 89: arrange(desc(frequency)) %>% 90: print 91: 92: 93: # Finish ------------------------------------------------------------------ 94: beep() 95: 96: 97:
To leave a comment for the author, please follow the link and comment on their blog: You Know.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.