Last week I looked at time effects in the shootingtracker database. This week I look at the states. Some (smaller) states never made it into the database; other states appear far too frequently, the worst of these being California. After correcting for population size, Washington DC sticks out with 0.8 shootings per year per 100000 inhabitants. Subsequently, the observed incidences have been shrunk towards the common distribution. In these posterior estimates Massachusetts is best, while Washington DC, Illinois and Louisiana are worst.
Data
Data are from the shootingtracker database and were updated July 25; the last entry in the database is July 25. The script is a slightly adapted version of last week's. Most notably, I did not notice last week that the database uses KA for Kansas, while KS is what my other data uses. Note that in order to keep this post readable, all code is in a separate section.
Adding state information
The first thing is to get one record per state. The datasets package has state information, while the number of inhabitants (2013 data) is plucked from census.gov. After merging, some states have missing counts for shootings; absence from the shootingtracker database is taken as 0 shootings.
Incidence is calculated per 100 000 inhabitants per year (365 days).
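As a minimal sketch (assuming, as in the code section below, a merged data frame r10 with a shooting count x and a Population column, and the date range of the cleaned data r7), the conversion is a one-liner:
# days covered by the database
days <- as.numeric(max(r7$date) - min(r7$date))
# shootings per 100 000 inhabitants per year
r10$incidence <- 100000 * r10$x / r10$Population * 365 / days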
Calculating posteriors
Assuming a beta prior and a binomial distribution for the number of shootings, Jags is used to estimate the posteriors. The one noticeable thing in the model is the prior for b: b really wants to be high, hence the prior is quite a bit wider than usual. The current values seem to be acceptable, see also the output at the bottom of this post.
The plot shows a number of things. In red are the observed values; in black the posterior values, means and 90% intervals of the samples. The blue line is the value towards which the data are shrunk. Washington DC is shrunk quite a bit, because it has relatively few people and a large value. The same is true for Delaware. Illinois, on the other hand, has many inhabitants, hence is shrunk much less; it also has a narrower interval. On the other side, states like Wyoming are shrunk upwards. Again, more inhabitants means less shrinking. Massachusetts ends up with the lowest posterior incidence. Really big states, New York, Texas and California, are not shrunk at all and have quite narrow intervals; Texas and NY on the good side, California on the bad side.
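Because the beta prior is conjugate to the binomial likelihood, the amount of shrinkage can also be read off the analytic posterior mean. The sketch below is illustrative only: it plugs in point estimates for a and b, whereas the Jags run integrates over their posteriors.
# beta-binomial posterior mean for one state, on the incidence scale.
# Small Population: the prior mean a/(a+b) dominates (strong shrinkage);
# large Population: the observed rate count/Population dominates.
post_mean_incidence <- function(count, Population, a, b, scale) {
  scale * (count + a) / (Population + a + b)
}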
The map shows the posteriors in a different way. At this point it might be worth noting that the north east is actually doing quite well, but since these states have so few inhabitants they are shrunk upwards and look bad. The current approach does not provide a natural way to avoid this. In a year or two, additional data may cause at least some change in the posteriors of the smaller states, both at the top and the bottom of this list.
Code
reading data
r13 <- readLines('raw13.txt')
r14 <- readLines('raw14.txt')
r1 <- c(r13, r14)
# remove footnote markers such as [a1] and blank lines
r2 <- gsub('\\[[a-zA-Z0-9]*\\]', '', r1)
r3 <- gsub('^ *$', '', r2)
r4 <- r3[r3 != '']
r5 <- gsub('\\t$', '', r4)
r6 <- gsub('\\t References$', '', r5)
r7 <- read.table(textConnection(r6),
    sep='\t',
    header=TRUE,
    stringsAsFactors=FALSE)
# make Washington DC split cleanly into location and state
r7$Location[r7$Location=='Washington DC'] <-
    'WashingtonDC, DC'
r8 <- read.table(textConnection(as.character(r7$Location)),
    sep=',',
    col.names=c('Location', 'State'),
    stringsAsFactors=FALSE)
r8$State <- gsub(' ', '', r8$State)
# harmonize state abbreviations used in the database
r8$State[r8$State=='Tennessee'] <- 'TN'
r8$State[r8$State=='Ohio'] <- 'OH'
r8$State[r8$State %in% c('Kansas', 'KA')] <- 'KS'
r8$State[r8$State=='Louisiana'] <- 'LA'
r8$State[r8$State=='Illinois'] <- 'IL'
r8$State <- toupper(r8$State)
r7$StateAbb <- r8$State
r7$Location <- r8$Location
r7 <- r7[!(r7$StateAbb %in% c('PUERTORICO', 'PR')), ]
r7$Date <- gsub('/13$', '/2013', r7$Date)
r7$date <- as.Date(r7$Date, format='%m/%d/%Y')
state information
r7$one <- 1
# number of shootings per state
agg1 <- with(r7,
    aggregate(one, by=list(StateAbb=StateAbb), FUN=sum))
library(datasets)
states <- data.frame(StateAbb=as.character(state.abb),
    StateArea=state.area,
    State=as.character(state.name),
    StateRegion=state.region,
    Frost=state.x77[, 'Frost']
)
# http://www.census.gov/popest/data/state/totals/2013/index.html
# data pre-processed so the csv can be used as is
inhabitants <- read.csv('NST-EST2013-01.treated.csv')
# put it all together; DC is not in the datasets package
states <- rbind(states, data.frame(StateAbb='DC',
    StateArea=68.3,
    State='District of Columbia',
    StateRegion='South',
    Frost=NA))
r9 <- merge(states, inhabitants)
r10 <- merge(r9, agg1, all=TRUE)
# no shootings recorded in some states
r10$x[is.na(r10$x)] <- 0
# incidence per 100 000 inhabitants per year
r10$incidence <- 100000*r10$x/r10$Population*365/
    as.numeric((max(r7$date)-min(r7$date)))
Jags model
library(R2jags)
datain <- list(
    count=r10$x,
    Population=r10$Population,
    n=nrow(r10),
    scale=100000*365/
        as.numeric((max(r7$date)-min(r7$date))))
model1 <- function() {
    for (i in 1:n) {
        count[i] ~ dbin(p[i], Population[i])
        p[i] ~ dbeta(a, b)
        pp[i] <- p[i]*scale
    }
    # truncated normal hyperpriors; the prior for b is deliberately wide
    a ~ dnorm(0, .000001) %_% T(0,)
    b ~ dnorm(1e7, .00000000000001) %_% T(0,)
    betamean <- scale * a/(a+b)
}
parameters <- c('a', 'b', 'betamean', 'pp')
jagsfit <- jags(datain,
    model=model1,
    parameters=parameters,
    n.iter=20000,
    n.burnin=1000,
    n.chain=4,
    progress.bar='gui')
jagsfit
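The Rhat values for a and b in the output at the bottom of this post are around 1.07, so a quick visual check of the chains does not hurt. This is not part of the original script; a minimal sketch using the sims.array stored in the fit:
# trace plots of the hyperparameters, one line per chain
sims <- jagsfit$BUGSoutput$sims.array
matplot(sims[, , 'a'], type='l', lty=1, xlab='iteration', ylab='a')
matplot(sims[, , 'b'], type='l', lty=1, xlab='iteration', ylab='b')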
Plot 1
# columns 5:55 of sims.matrix hold the pp[] samples, column 3 holds betamean
samples <- jagsfit$BUGSoutput$sims.matrix[, 5:55]
r10$pred <- apply(samples, 2, mean)
r10$l05 <- apply(samples, 2, function(x) quantile(x, .05))
r10$l95 <- apply(samples, 2, function(x) quantile(x, .95))
library(ggplot2)
r11 <- subset(r10, , c(State, l05, l95, pred, incidence))
r11 <- r11[order(r11$pred), ]
r11$State <- factor(r11$State, levels=r11$State)
p <- ggplot(r11, aes(y=pred, x=State))
limits <- aes(ymin=l05, ymax=l95)
dodge <- position_dodge(width=0.9)
p + geom_point(position=dodge, stat='identity') +
    geom_errorbar(limits, position=dodge, width=0.25) +
    theme(legend.position='bottom') +
    geom_point(aes(y=incidence, x=State),
        position=dodge, stat='identity', col='red') +
    ylab('Incidence') +
    geom_hline(yintercept=mean(jagsfit$BUGSoutput$sims.matrix[, 3]),
        col='blue') +
    coord_flip()
Plot 2
# http://stackoverflow.com/questions/20434343/color-us-states-according-to-value-in-r
library(maps)
# rescale posterior means to [0, 1] for the color ramp
values <- (r10$pred-min(r10$pred))/(max(r10$pred)-min(r10$pred))
#values <- (r10$incidence-min(r10$incidence))/
#    (max(r10$incidence[r10$incidence<.5])-min(r10$incidence))
names(values) <- r10$StateAbb
# getting the names used by map
tmp <- map('state', plot=FALSE, namesonly=TRUE)
# matching (after adjusting using gsub and tolower)
tmp <- match(gsub('(:.*)', '', tmp), tolower(r10$State))
map('state',
    fill=TRUE,
    col=rgb(colorRamp(c('green', 'red'))(values), max=255)[tmp])
Jags output
Inference for Bugs model at "C:/Users/Kees/AppData/Local/Temp/RtmpwbAFvR/modela3419a366f0.txt", fit using jags,
4 chains, each with 20000 iterations (first 1000 discarded), n.thin = 19
n.sims = 4000 iterations saved
mu.vect sd.vect 2.5% 25% 50%
a 9.732 6.530 2.931 5.662 7.924
b 6207140.792 4125444.443 1898003.649 3609272.959 5094169.482
betamean 0.101 0.008 0.087 0.096 0.101
pp[1] 0.087 0.032 0.032 0.064 0.085
pp[2] 0.123 0.030 0.073 0.103 0.120
pp[3] 0.071 0.026 0.026 0.053 0.070
pp[4] 0.099 0.024 0.058 0.083 0.098
pp[5] 0.130 0.014 0.105 0.120 0.130
pp[6] 0.081 0.023 0.041 0.065 0.079
pp[7] 0.097 0.027 0.050 0.078 0.094
pp[8] 0.130 0.041 0.066 0.101 0.123
pp[9] 0.110 0.017 0.079 0.097 0.109
pp[10] 0.094 0.020 0.059 0.081 0.093
pp[11] 0.079 0.030 0.027 0.058 0.077
pp[12] 0.071 0.026 0.026 0.052 0.069
pp[13] 0.076 0.029 0.024 0.055 0.074
pp[14] 0.177 0.029 0.127 0.158 0.176
pp[15] 0.122 0.027 0.077 0.103 0.120
pp[16] 0.122 0.033 0.069 0.099 0.118
pp[17] 0.109 0.028 0.062 0.089 0.106
pp[18] 0.159 0.037 0.100 0.132 0.154
pp[19] 0.056 0.020 0.022 0.041 0.054
pp[20] 0.087 0.023 0.048 0.071 0.086
pp[21] 0.090 0.031 0.035 0.068 0.087
pp[22] 0.124 0.024 0.083 0.107 0.122
pp[23] 0.080 0.023 0.040 0.063 0.078
pp[24] 0.127 0.028 0.080 0.108 0.124
pp[25] 0.079 0.026 0.032 0.061 0.078
pp[26] 0.083 0.031 0.029 0.062 0.081
pp[27] 0.121 0.023 0.080 0.105 0.119
pp[28] 0.087 0.032 0.029 0.065 0.086
pp[29] 0.092 0.030 0.041 0.072 0.090
pp[30] 0.080 0.031 0.027 0.057 0.078
pp[31] 0.106 0.022 0.068 0.090 0.104
pp[32] 0.108 0.032 0.055 0.086 0.104
pp[33] 0.139 0.037 0.081 0.113 0.135
pp[34] 0.076 0.014 0.051 0.066 0.075
pp[35] 0.097 0.019 0.063 0.084 0.096
pp[36] 0.123 0.031 0.074 0.101 0.120
pp[37] 0.071 0.024 0.029 0.054 0.070
pp[38] 0.094 0.018 0.062 0.081 0.093
pp[39] 0.106 0.035 0.049 0.082 0.102
pp[40] 0.110 0.027 0.063 0.091 0.108
pp[41] 0.086 0.032 0.030 0.064 0.084
pp[42] 0.122 0.026 0.078 0.103 0.119
pp[43] 0.078 0.013 0.055 0.069 0.077
pp[44] 0.081 0.027 0.034 0.062 0.080
pp[45] 0.111 0.023 0.070 0.094 0.109
pp[46] 0.089 0.034 0.033 0.066 0.086
pp[47] 0.081 0.021 0.043 0.066 0.080
pp[48] 0.065 0.021 0.029 0.050 0.064
pp[49] 0.103 0.032 0.051 0.080 0.100
pp[50] 0.090 0.034 0.032 0.066 0.087
pp[51] 0.185 0.063 0.095 0.139 0.174
deviance 242.511 13.874 217.150 232.666 241.477
75% 97.5% Rhat n.eff
a 11.529 28.847 1.068 63
b 7337119.577 18410168.237 1.067 64
betamean 0.106 0.117 1.001 4000
pp[1] 0.106 0.158 1.003 910
pp[2] 0.140 0.190 1.001 2600
pp[3] 0.088 0.125 1.006 460
pp[4] 0.114 0.152 1.001 4000
pp[5] 0.139 0.159 1.002 1800
pp[6] 0.096 0.130 1.005 560
pp[7] 0.112 0.157 1.001 4000
pp[8] 0.151 0.230 1.001 3000
pp[9] 0.120 0.147 1.001 2800
pp[10] 0.107 0.137 1.003 1100
pp[11] 0.096 0.143 1.007 450
pp[12] 0.087 0.126 1.005 550
pp[13] 0.094 0.137 1.007 420
pp[14] 0.196 0.238 1.012 250
pp[15] 0.138 0.182 1.001 4000
pp[16] 0.140 0.200 1.001 2900
pp[17] 0.124 0.171 1.001 4000
pp[18] 0.182 0.242 1.008 350
pp[19] 0.069 0.098 1.008 320
pp[20] 0.102 0.137 1.002 2000
pp[21] 0.108 0.158 1.003 1200
pp[22] 0.139 0.178 1.003 970
pp[23] 0.094 0.128 1.002 1800
pp[24] 0.144 0.190 1.004 660
pp[25] 0.096 0.135 1.006 550
pp[26] 0.102 0.151 1.003 3700
pp[27] 0.135 0.170 1.002 1800
pp[28] 0.106 0.159 1.004 1200
pp[29] 0.110 0.160 1.004 1200
pp[30] 0.099 0.145 1.005 520
pp[31] 0.120 0.154 1.002 1400
pp[32] 0.127 0.179 1.003 1800
pp[33] 0.161 0.224 1.006 450
pp[34] 0.085 0.105 1.006 480
pp[35] 0.108 0.138 1.001 3600
pp[36] 0.141 0.191 1.002 1300
pp[37] 0.087 0.123 1.004 700
pp[38] 0.106 0.133 1.001 4000
pp[39] 0.125 0.184 1.002 4000
pp[40] 0.126 0.171 1.002 4000
pp[41] 0.106 0.154 1.003 1200
pp[42] 0.138 0.180 1.006 470
pp[43] 0.086 0.106 1.006 480
pp[44] 0.098 0.143 1.005 590
pp[45] 0.125 0.161 1.002 2500
pp[46] 0.109 0.164 1.004 700
pp[47] 0.094 0.126 1.004 700
pp[48] 0.080 0.110 1.008 320
pp[49] 0.120 0.173 1.004 790
pp[50] 0.109 0.166 1.003 1400
pp[51] 0.217 0.337 1.012 210
deviance 251.840 271.487 1.039 88
For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
DIC info (using the rule, pD = var(deviance)/2)
pD = 92.8 and DIC = 335.3
DIC is an estimate of expected predictive error (lower deviance is better).