
Text Mining Gun Deaths Data

In this post I will explore public data being collected by Slate.  I previously released code using a much earlier version of this data, demonstrating how to turn it into an animated map.

Collection of this data began as a public response to the horrifying shooting at Sandy Hook Elementary in December of 2012, in an attempt to create a database of all gun deaths in the US.  Since its creation, the database has grown to a list of over 12,000 deaths in a little over a year.

In this post I will explore a little of that database, as well as scrape the web for additional data from the articles it lists.

# Let's load the raw data:
gun.deaths <-  
read.csv( 
"http://slate-interactives-prod.elasticbeanstalk.com/gun-deaths/getCSV.php")
 
tail(gun.deaths)
summary(gun.deaths)
 
# We can see that the vast majority of gun deaths are among men, with 10,153 
# deaths, and only 1,850 among women.  The mean age is 33.34 while the median
# age is a little lower at 30.  Interestingly, the maximum age is 107.  What 
# is not provided, though it would be interesting, is the cause of death:
# murder, suicide, accident, etc.
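 
# A quick sketch of how those figures could be pulled directly from the data
# frame.  The 'age' column is used later in this post; the 'gender' column
# name is an assumption about the Slate CSV and may differ in the actual file.
table(gun.deaths$gender)   # deaths by gender
summary(gun.deaths$age)    # mean and median victim age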
 
library(XML)
 
# Read and parse HTML file
gun.deaths.text <- list()
 
html.last <- ""
 
# The following code will grab the text of all 12,000+ articles.  I have a 
# somewhat slow internet connection, so the loop below starts from wherever 
# gun.deaths.text left off.  This allows you to gather some of the text and 
# come back to it at a future point if you decide not to wait for it to ping 
# all 12,000+ websites.
 
for (i in (length(gun.deaths.text)+1):nrow(gun.deaths)) {
  print(paste("reading html #", i))
 
  # The following code I borrow from a post on Quantum Forest.
  # It grabs the text between paragraph tags in HTML documents.
  # www.quantumforest.com/2011/10/reading-html-pages-in-r-for-text-processing/
  try(doc.html <- htmlTreeParse(gun.deaths$url[i],
                           useInternal = TRUE))
    # I have added a few 'try' error handling functions so that the web scraping
    # loop does not stop when there is a missing URL.
  if (is.null(doc.html)) doc.html <- html.last
  doc.text = unlist(try(xpathApply(doc.html, '//p', xmlValue)))
  doc.text = gsub('\\n', ' ', doc.text)
  doc.text = paste(doc.text, collapse = ' ')
 
  if (identical(html.last, doc.html)) doc.text <- "ERROR Source DROPPED"
 
  gun.deaths.text[i] <- list(doc.text)
 
  # Save the last html read as the current so that dropped documents are not 
  # counted twice.
  html.last <- doc.html
}
 
length(gun.deaths.text)
# For the following results I only collected the first ~3,000 articles.
 
# I suggest saving the data after you have downloaded it all.
save(gun.deaths.text, file="gun.deaths.text.Rdata")
load("gun.deaths.text.Rdata")
 
# We will use the text mining library
library(tm)
 
# We first turn our list of articles into a corpus
gun.deaths.corpus <- Corpus(VectorSource(gun.deaths.text))
# Then we lowercase the words in that list.  (Newer versions of tm require 
# base functions to be wrapped in content_transformer.)
gun.deaths.corpus <- tm_map(gun.deaths.corpus, content_transformer(tolower))
 
# This will create a matrix that lists all of the words
# and how frequently they appear in each article.
# It can be very long.
dtm <- DocumentTermMatrix(gun.deaths.corpus)
 
freqTerms <- findFreqTerms(dtm, 550)
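 
# Printing the frequent terms (those appearing at least 550 times across all 
# articles) is a quick way to see candidate words for the dictionary below.
freqTerms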
 
dtmDic <- as.data.frame(as.matrix(DocumentTermMatrix(gun.deaths.corpus,
  list(dictionary = sort(c("suspect",  "suspects", "gunman", 
                      "fatally", "slaying","witnesses", 
                      "victim" , "victims", "homicide",  
                      "drug",   "crime", "accidentally",
                      "multiple", "suicide",
                      "accidental", "killed","children",
                      "student", "teacher", "charged",
                      "arrested", "self-defense", "defend"))))))
ndict <- ncol(dtmDic)
nobs  <- nrow(dtmDic)
 
# Let's drop the information about how often each word was used and just 
# record whether it was used at all (a 0/1 indicator per article).
bimydf <- as.data.frame((dtmDic > 0) * 1)
 
# Let's see some word frequencies plotted
 
# First we compute the share of articles that use each word.
perc <- apply(bimydf, 2, mean)
 
# I will now create my first bead plot to be saved to the hard drive
png("2013-03-13GunDeaths1.png", width = 650, height = 400)
 
# Adjust the margins
par(mar=c(5,2,3,1))
 
  # Plot the beads
  plot(perc, xaxt = "n", xlab="", ylab="%", cex=1.5,
       main="Percent of Articles Using Word",pch=19)
 
  # Add guide lines
  for (i in 1:ndict) abline(v=i, col=gray(.9-.1*(i %% 4)), lwd=1)
 
  # Add text to identify each bead
  text(cex=1, x=1:length(perc)+.5, y=-.015, names(perc), 
       xpd=TRUE, srt=60, pos=2)
 
dev.off()
 

# This is interesting.  Homicide and crime are common while accidentally 
# and suicide are quite low.
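 
# A quick way to see the full ordering behind that observation is to sort 
# the proportions computed above (just a convenience check).
sort(perc, decreasing = TRUE)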
 
# Let's create a variable that counts how many violence-related words such 
# as homicide, gunman, victim, etc. appear in each article.
violent <- bimydf$homicide + bimydf$gunman + bimydf$victim +
  bimydf$victims + bimydf$crime + bimydf$suspect +
  bimydf$suspects + bimydf$slaying
 
# The average number of references to any of the above terms per article
# is 2.
# Keep only the articles whose text was actually retrieved (drop those 
# flagged "ERROR Source DROPPED" during scraping).
scraped.ok <- unlist(gun.deaths.text) != "ERROR Source DROPPED"
summary(violent[scraped.ok])
 
violent.bi <- as.numeric(violent > 0)
summary(violent.bi[scraped.ok])
 
# 58% of the articles seem to have some reference to violence or crime
 
with(bimydf,
     cor(data.frame(violent.bi, 
               suicide, 
               accidental, 
               accidentally, 
               multiple,
               children,
               drug)))
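 
# For easier reading, the same matrix can be rounded to two decimals.
with(bimydf,
     round(cor(data.frame(violent.bi, suicide, accidental, accidentally,
                          multiple, children, drug)), 2))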
 
# Looking at the correlation matrix we find more results.  Our violent-crime 
# variable is negatively correlated with suicide, accidental, and accidentally, 
# while strongly (and positively) correlated with multiple, children, and drug.
 
# Next I will plot our data over time, with each death mapped to its own row.
library("RColorBrewer")
 
# I use a ramp of colors, with blue representing the youngest victims and 
# dark red representing the oldest victims.
collist <- 
  colorRampPalette(c("blue", "darkred")) (50)
 
# Prepare to save as png
png("2013-03-13GunDeaths2.png", width = 650, height = 1200)
 
  # I adjust the margins.
  par(mar=c(5,1,3,1))
 
  # Open a plot window
  plot(0,0, xlim=c(.75,ndict-.5), ylim=c(.03,.97), 
       type="n", xaxt = "n", xlab="", 
       yaxt = "n", ylab="Account",
       main="Wordcount Beadplot-US Gun Deaths Articles")
 
  # Name the columns
  text(cex=1, x=1:ndict, y=-.015, colnames(dtmDic), 
       xpd=TRUE, srt=60, pos=2)
 
  # Insert column guides
  for (i in 1:ndict) abline(v=i-.25, col=gray(.9-.1*(i %% 4)), lwd=4)
 
  # Insert a small horizontal line for each word associated with 
  # each article
  for (i in 1:ndict) for (ii in 1:nobs)
    if (dtmDic[ii,i]>0) lines(c(i-.75,i+.15),c(ii/nobs,ii/nobs),
                    col=collist[min(gun.deaths$age[ii],50)], lwd=.5)
dev.off()
 

# Blue indicates a young child victim, while dark red indicates a victim who 
# is 50 years or older.
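 
# A legend for the age-to-color mapping could be added while the beadplot 
# device is still open (i.e. just before the dev.off() call above).  This is 
# only a sketch of one possible legend, not part of the original figure.
legend("topright", legend = c("age 1", "age 25", "age 50+"),
       col = collist[c(1, 25, 50)], lwd = 2, bg = "white", cex = .8)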
