[This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Deaths from assault
A friend was looking at comparative data on violent deaths in a range of countries and I couldn’t resist the chance to turn out some graphics. The data come from the OECD, who compiles them from contributions by member and associated country governments – effectively, wealthier countries – and makes them available via a web Application Programming Interface in the SDMX format. I ended up with some graphics that I think tell a powerful story. The data in this post are age-standardised deaths per 100,000 for whole population, male population, and female population; the OECD variables coded as TXCMILTX, TXCMHOTH and TXCMFETF. I’m not an expert in this area – if they mean something else (I’m a little unsure on the age-standardisation) don’t just curse me, let me know.Compare across countries and sex
Looking at snapshot cross-sectional average data since 1990 (for which there is more comparable data across countries, with a number of countries such as Germany and Czech Republic only coming into existence at or around that period) we see two things that met the “intra-ocular impact test” ie they hit you between the eyes:- For nearly all the listed countries, males are much more likely to killed in a assault than are females
- The Americas and the former Soviet Socialist Republics (Russia, Estonia, Latvia, Lithuania) dominate the list of most violent places to live (or rather, to die). Other than countries of these two types, only South Africa makes the top 11 countries. The safer countries include those in eastern Europe (other than former USSR, but including countries that were in its political orbit), Asia, Australasia and western Europe.
Compare across time
Looking at trends across time we see an encouraging sign. Most countries’ violent death rates peaked in the 1980s or 90s and have been declining, in some cases dramatically, in recent years. In the chart below the vertical axis for each country has been set to make most use of the plot area and draw attention to trends rather than absolute levels, so you can’t compare the absolute levels of violence across countries. That’s what the first chart was for, so the two charts complement eachother. The facets in the trend chart below have been ordered from lowest to highest rates (1990+ averages), so there is still some visual indicator of absolute size of the problem. The recent declines are particularly strong in the Eastern Europe and USSR region – Russia, Estonia, Latvia, Lithuania, Poland, Slovak Republic, Slovenia, Czech Republic, and Germany have all seen dramatic drops in rates of death from assault after rapid growth in the early 1990s, associated with the magnitude of the transition in economic and political systems that took place at that time. There’s also the possibility of changing official statistical practice with the changing political system, but there’s no particular reason I know of (I’m not an expert in former and post Soviet statistics systems, either…) to think of that, and the general curve looks plausible given an outsider’s view of the recent history there.Secondary point – Catholic countries and gender (more men getting killed) violence patterns?
Looking into the imbalance between the sexes, we see that the countries with consistently high male to female ratios of rates of violent death are Colombia, Brazil, Mexico, Chile, Costa Rica and South Africa, in all of which males are at least 5 times more likely to be killed violently than are females.- The top plot ignores the Latin Americanism of those five countries and fits a simple linear regression, which finds (illusory) statistically significant evidence of a link between Catholicism and a higher ratio of male to female death-from-assault rates. As mentioned above, this approach is unsound because it ignores the grouping of the five prominent countries that drag up the slope of the line.
- The second plot controls for this by adding a dummy variable for “Americas”. The result is still statistically significant evidence of a Catholic effect, but interacting with the continent effect ie only in the Americas does it seem to matter being an increasingly Catholic country when wondering if males are more likely to be killed in an assault than females.
- The third plot (which I think is my preferred) adds a dummy variable for “Latin American” and finds no statistically significant relationship between Catholicism and male-female death ratios after that effect is accounted for.
Arranging the data
Here’s how I import the data into R. If you know exactly what you’re looking for, you can import data directly from OECD.Stat with the {rsdmx} package by sending an appropriately structured URL (line 18 below) to their website. However, to work out that URL you’ll probably need to navigate their site by hand and choose your particular ‘customising’, then choose the “export” and “SDMX” options, then copy the URL it generates for you. It’s useful for reproducibility purposes (for example, it meant I didn’t need to save a copy of the data anywhere to make the code below reproducible for others).library(ggplot2) library(scales) library(grid) library(dplyr) library(tidyr) library(showtext) # for s library(rsdmx) # for importing OECD data library(ISOcodes) # for merging country codes with names library(rvest) # for screen scraping data about Catholicism .add.google("Poppins", "my") showtext.auto() theme_set(theme_grey(base_family = "my")) #----------Import and mung the death from assault data--------------- # load deaths by assault from OECD.Stat if(!exists("viol_sdmx")){ myUrl <- "http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/HEALTH_STAT/CICDHOCD.TXCMFETF+TXCMHOTH+TXCMILTX.AUS+AUT+BEL+CAN+CHL+CZE+DNK+EST+FIN+FRA+DEU+GRC+HUN+ISL+IRL+ISR+ITA+JPN+KOR+LUX+MEX+NLD+NZL+NOR+POL+PRT+SVK+SVN+ESP+SWE+CHE+TUR+GBR+USA+NMEC+BRA+CHN+COL+CRI+IND+IDN+LVA+LTU+RUS+ZAF/all?startTime=1960&endTime=2014" dataset <- readSDMX(myUrl) # takes about 30 seconds on a slow hotel internet connection viol_sdmx <- as.data.frame(dataset) # takes about 10 seconds on an i5 }
# load Country codes from ISOcodes R package data("ISO_3166_1") # mung: viol <- viol_sdmx %>% # match country codes to country names: left_join(ISO_3166_1[ , c("Alpha_3", "Name")], by = c("COU" = "Alpha_3")) %>% # more friendly names: rename(Country = Name, Year = obsTime, Value = obsValue) %>% # limit to columns we want: select(UNIT, Country, Year, Value) %>% # make graphics-friendly versions of Unit and Year variables: mutate(Unit = ifelse(UNIT == "TXCMILTX", "Deaths per 100 000 population", "Deaths per 100 000 males"), Unit = ifelse(UNIT == "TXCMFETF", "Deaths per 100 000 females", Unit), Unit = factor(Unit, levels = unique(Unit)[c(2, 3, 1)]), Year = as.numeric(Year)) %>% # not enough data for Turkey to be useful so we knock it out: filter(Country != "Turkey")
# create country x Unit summaries viol_sum <- viol %>% filter(Year > 1990) %>% group_by(Country, Unit) %>% summarise(Value = mean(Value)) %>% ungroup() # create country totals (for ordering in charts) totals <- viol_sum %>% group_by(Country) %>% summarise(Value = mean(Value)) %>% arrange(Value) # create wider version, with one column per variable: viol_spread <- viol_sum %>% mutate(Unit = as.character(Unit), Unit = ifelse(grepl("female", Unit), "Female", Unit), Unit = ifelse(grepl(" males", Unit), "Male", Unit), Unit = ifelse(grepl("population", Unit), "Population", Unit)) %>% spread(Unit, Value) %>% mutate(Country = factor(Country, levels = totals$Country))
The actual plots
Now we’re ready to draw some plots. Here’s the code that makes the main two plots:viol_sum %>% mutate(Country = factor(Country, levels = totals$Country)) %>% mutate(Label = ifelse(grepl("population", Unit), as.character(Country), "|")) %>% ggplot(aes(x = Value, y = Country)) + geom_segment(data = viol_spread, aes(y = Country, yend = Country, x = Male, xend = Female), colour = "white", size = 3) + geom_text(size = 4, aes(label = Label, colour = Unit), alpha = 0.8, family = "my") + labs(y = "") + scale_x_log10(("Deaths per 100,000 (logarithmic scale)")) + theme(legend.position = "bottom") + scale_colour_manual("", values = c("red", "grey10", "blue")) + labs(colour = "") + ggtitle("Mean annual deaths from assault 1990 to 2013") viol %>% mutate(Country = factor(Country, levels = totals$Country)) %>% ggplot(aes(x = Year, y = Value, colour = Unit)) + facet_wrap(~Country, scales = "free_y", ncol = 5) + geom_smooth(se = FALSE, method = "loess") + geom_point(alpha = 0.8, size = 1) + scale_colour_manual("", values = c("red", "grey10", "blue")) + theme(legend.position = "bottom") + labs(y = "Deaths per 100,000 per year - note changing vertical scale", title = "Deaths from assault", x = "")
# create a gender summary: viol_gender <- viol %>% filter(!grepl("population", Unit)) %>% select(UNIT, Value, Year, Country) %>% spread(UNIT, Value) %>% mutate(ratio = TXCMHOTH / TXCMFETF) %>% group_by(Country) %>% summarise(ratio = mean(ratio, tr = 0.2)) %>% arrange(ratio) %>% # knock out Luxembourg and Iceland, too many NAs: filter(!is.na(ratio)) %>% mutate(Country = factor(Country, levels = Country)) # plot it: viol_gender %>% ggplot(aes(x = ratio, y = Country)) + geom_point() + labs(x = "Trimmed mean annual ratio of male to female ratesnof deaths from assault over whole period", y = "") # download catholic proportion data cath_page <- read_html("http://www.catholic-hierarchy.org/country/sc1.html") cath_data <- html_table(cath_page)[[1]] names(cath_data) <- gsub(" ", "_", names(cath_data)) # create vectors of country names for use in creating dummy variables: lat_american <- c("Colombia", "Brazil", "Mexico", "Chile", "Costa Rica") american <- c(lat_american, c("Canada", "United States")) # mung: cath_data <- cath_data %>% mutate(Country = gsub("M.+xico", "Mexico", Country), Country = ifelse(Country == "USA", "United States", Country), Country = ifelse(Country == "Great Britain", "United Kingdom", Country), Country = ifelse(Country == "Korea (South)" , "Korea, Republic of", Country)) %>% select(Country, Percent_Catholic) %>% mutate(Percent_Catholic = as.numeric(gsub("%", "", Percent_Catholic)), Continent1 = ifelse(Country %in% american, "Americas", "Other"), Continent2 = ifelse(Country %in% lat_american, "Latin American", "Other")) # join the Catholic data to the assault data: viol_gender_cath <- viol_gender %>% left_join(cath_data, by = "Country") # set up base plot: cath1 <- viol_gender_cath %>% ggplot(aes(x = Percent_Catholic, y = ratio, label = Country)) + geom_smooth(method = "lm") + geom_text(family = "my", size = 3, alpha = 0.8) + labs(x = "Percentage of country reported to be Catholic", y = "Ratio of female to malenassault death rates") + xlim(-5, 110) # and the other two types of plots: cath2 <- cath1 + aes(colour = Continent1) cath3 <- cath1 + aes(colour = Continent2) # draw plots: grid.arrange(cath1, cath2, cath3) # for reference, look explicitly at the underlying statistical models: mod1 <- lm(ratio ~ Percent_Catholic, data = viol_gender_cath) mod2 <- lm(ratio ~ Percent_Catholic * Continent1, data = viol_gender_cath) mod3 <- lm(ratio ~ Percent_Catholic * Continent2, data = viol_gender_cath) summary(mod1) summary(mod2) summary(mod3)
To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.