Machine learning for better homicide counts in Ciudad Juarez
[This article was first published on Diego Valle's Blog, and kindly contributed to R-bloggers].
Photo Credit: Jesús Villaseca Pérez
Mexican and US officials attribute the dramatic increase in violence to a conflict between the Sinaloa and Juárez Cartels. After a new governor took office in October 2010, Ciudad Juárez does seem to have started turning around, but it is still an extremely violent city.
Mortality from ill-defined conditions is quite high in Ciudad Juárez. Deaths of unknown injury intent went from 33 in 2004 to 193 in 2010. It is an open question just how good the homicide records are in places that saw incredible rises in homicides, since one can only assume that forensic services were overwhelmed and that educated professionals like doctors were the first to leave town.
Recently there was a story in the New York Times about how Target figured out a teen girl was pregnant before her father did, based on her shopping patterns. If Target can classify its shoppers into pregnant and not pregnant based on the stuff they buy, I can certainly train a computer to classify deaths in Mexico based on the characteristics of the deaths.
Intuitively, if someone were to ask you to guess the type of death of a 70-year-old woman whose cause of death was transport related, you would probably guess it was an accident. On the other hand, if you had to guess the type of death of a young adult male whose cause of death was a firearm, and the injury took place on a public street in Juárez, you would probably guess it was a homicide.
The Mexican government supposedly keeps a record of all deaths in its mortality database. For my purposes there are four types of violent or injury intent deaths:
- Accidents (unintentional injuries)
- Suicides (self inflicted injuries)
- Homicides (intentional injuries)
- Unknown Intent (cases where forensic or legal experts determined information was not sufficient to make a decision about the injury intent)
There is a fifth type of injury intent death: “Legal intervention, operations of war, military operations, and terrorism” but there are only about 30-40 deaths of this type in Mexico each year. Most of them occur in Puebla. It is very likely that most of the shootouts where the police or military kill someone (or die themselves) are classified as plain old homicides, so I chose to recode this type of death as homicides.
The Mexican vital statistics database assigns every death an International Classification of Diseases (ICD) code, which is used worldwide to categorize diseases, injuries, and external causes of injury. Sadly, it can be a little too specific, since it includes distinctions for “wounds inflicted by macaws” and “wounds inflicted by parrots.” In addition, it combines the injury mechanism (parrot, macaw, drop bear, chupacabras, etc.) with intent (suicide, homicide, etc.). For example, the code for accidental death by handgun is “W32,” and the code for homicide by handgun is “X93”.
Ideally I would have a way of classifying the injury mechanism into meaningful groups separate from intent, something akin to the game of Clue, where there are half a dozen or so weapons like candlestick, knife, rope, gun, etc. that can be used to harm people. Since epidemiology is a real science, we have to replace the names of the weapons with fancy names like “struck by or against,” “cut/pierce,” “suffocation,” “firearm,” etc. Lucky for me, the CDC’s National Center for Health Statistics has already done exactly that with its External Cause of Injury Mortality Matrix for ICD-10.
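To make the idea of separating mechanism from intent concrete, here is a minimal Python sketch (Python rather than the R packages mentioned in this post). The ranges below are a simplified, partial subset based on three-character ICD-10 codes, not the full CDC matrix, and the function and table names are hypothetical. Following the choice described above, legal intervention and war codes are folded into homicides.

```python
# Simplified, partial recode of ICD-10 external-cause codes (assumption:
# only the first three characters matter; the real matrix is more detailed
# and also uses four-character codes and sequelae codes).

def _key(code):
    """Turn a code such as 'X93' or 'X954' into a sortable (letter, number) pair."""
    code = code.strip().upper()
    return code[0], int(code[1:3])

def _between(code, start, end):
    return _key(start) <= _key(code) <= _key(end)

INTENT = [
    ("V01", "X59", "Accident"),
    ("X60", "X84", "Suicide"),
    ("X85", "Y09", "Homicide"),
    ("Y10", "Y34", "Unknown Intent"),
    ("Y35", "Y36", "Homicide"),  # legal intervention / war, recoded as homicide
]

MECHANISM = [  # only a few mechanisms, for illustration
    ("W32", "W34", "Firearm"),
    ("X72", "X74", "Firearm"),
    ("X93", "X95", "Firearm"),
    ("Y22", "Y24", "Firearm"),
    ("W75", "W84", "Suffocation"),
    ("X70", "X70", "Suffocation"),
    ("X91", "X91", "Suffocation"),
    ("Y20", "Y20", "Suffocation"),
    ("V01", "V99", "All Transport"),
]

def recode(icd_code):
    """Return (intent, mechanism) for an external-cause ICD-10 code."""
    intent = next(
        (name for lo, hi, name in INTENT if _between(icd_code, lo, hi)), None
    )
    mechanism = next(
        (name for lo, hi, name in MECHANISM if _between(icd_code, lo, hi)),
        "Other or Unspecified",
    )
    return intent, mechanism

for code in ("W32", "X93", "Y22", "X70"):
    print(code, recode(code))
```

With this split, “W32” and “X93” share the mechanism “Firearm” and differ only in intent, which is what makes the mechanism usable as a predictor.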
Once we have recoded the mortality database to include the mortality matrix we can visualize the different types of violent or injury intent deaths according to injury mechanism:
You can visually notice the resemblance between homicide deaths and deaths of unknown intent since both tend to involve firearms. Deaths by accident are mainly caused by transportation (motor vehicles) and suicides by suffocation (think hanging).
It’s also worth pointing out that accidental deaths by unspecified mechanism increased at precisely the same time homicides shot through the roof, which could imply that there was some leakage of homicides into accidents, but in this post I’ll go with the conservative assumption that all homicides, suicides and accidents were correctly classified by the Mexican health authorities.
The injury mechanism isn’t the only useful information we can use to differentiate the type of death, since victims of homicide also tend to be younger. We can also use the location where the body was found and the year of death. I could certainly use a more complex model that included marital status, the day of the week when the death occurred, and so on, but with the high specificity and sensitivity that resulted from the simple model I used, I saw no need to complicate things further (that and my laptop has 2GB of RAM).
Of course things are never as simple as in textbooks, and the Mexican mortality database has some missing values. I assumed homicides where the year of occurrence was not available occurred in the same year the death was registered. For the rest of the data I used k-Nearest Neighbors to impute the missing values.
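The two clean-up steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`year_occurred`, `year_registered`, `place`, and so on) are hypothetical, the toy records are made up, and the distance is a plain count of mismatching categorical fields, which is not necessarily the distance used for the post.

```python
from collections import Counter

def fill_missing_year(record):
    """Use the registration year when the year of occurrence is missing."""
    if record["year_occurred"] is None:
        return {**record, "year_occurred": record["year_registered"]}
    return record

def knn_impute(records, target, features, k=3):
    """Fill missing `target` values by majority vote of the k most similar complete records."""
    complete = [r for r in records if r[target] is not None]

    def distance(a, b):
        # Number of features on which the two records disagree.
        return sum(a[f] != b[f] for f in features)

    filled = []
    for record in records:
        if record[target] is None:
            nearest = sorted(complete, key=lambda c: distance(record, c))[:k]
            majority = Counter(c[target] for c in nearest).most_common(1)[0][0]
            record = {**record, target: majority}
        filled.append(record)
    return filled

# Made-up toy records, only to show the mechanics.
toy = [
    {"mechanism": "Firearm", "sex": "Male", "place": "Public Street"},
    {"mechanism": "Firearm", "sex": "Male", "place": "Public Street"},
    {"mechanism": "Firearm", "sex": "Male", "place": "Home"},
    {"mechanism": "All Transport", "sex": "Female", "place": "Home"},
    {"mechanism": "Firearm", "sex": "Male", "place": None},
]
imputed = knn_impute(toy, target="place", features=["mechanism", "sex"], k=3)
print(imputed[-1]["place"])
```

The last toy record is closest to the three firearm deaths, two of which happened on a public street, so that is the value it receives.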
The cleaned up dataset looks like this:
EDADVALOR | CAUSE | SEXOtxt | LUGLEStxt | ANIODEF | PRESUNTOtxt |
---|---|---|---|---|---|
7.483315 | All Transport | Male | Home | 2007 | Accident |
5.656854 | Firearm | Male | Home | 2007 | Homicide |
6.164414 | Firearm | Male | Public Street | 2009 | Homicide |
6.082763 | Firearm | Male | Home | 2007 | Homicide |
4.690416 | Poisoning | Male | Public Street | 2004 | Accident |

Where:
- EDADVALOR is the square root of age in years
- CAUSE is the injury mechanism
- SEXOtxt is the sex of the victim
- LUGLEStxt is the place where the lesion occurred
- ANIODEF is the year the death occurred
- PRESUNTOtxt is the injury intent
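As a quick check of the EDADVALOR definition, the short Python snippet below takes the square root of a few ages. The ages (56, 32, 38, 37 and 22) are not given in the post; they are inferred by squaring the EDADVALOR values in the sample rows, and the function name is hypothetical.

```python
import math

def edadvalor(age_years):
    """Square root of age in years, rounded to six decimals as in the sample rows."""
    return round(math.sqrt(age_years), 6)

# Ages inferred by squaring the EDADVALOR column of the sample rows.
for age in (56, 32, 38, 37, 22):
    print(age, edadvalor(age))
```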
I divided the dataset into training (75% of the data) and test (the remaining 25%) sets. I then fit a penalized regression (glmnet package), a support vector machine (with a radial kernel, since it had better performance than a linear one), and a random forest model to the training set and evaluated their accuracy against the test set using the caret package.
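For readers who want to see what a 75/25 split looks like in code, here is a small Python sketch. It keeps the share of each injury intent roughly the same in both sets (a stratified split). That is an assumption made for illustration: the post does not say how the split was drawn, and the function name and toy rows are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, train_fraction=0.75, seed=0):
    """Split rows into train/test sets, sampling each label group separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for group in by_label.values():
        group = group[:]          # copy so the input order is left untouched
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Made-up rows: 80 homicides and 20 accidents.
rows = [{"id": i, "intent": "Homicide"} for i in range(80)]
rows += [{"id": 80 + i, "intent": "Accident"} for i in range(20)]

train, test = stratified_split(rows, "intent")
print(len(train), len(test))
```

Sampling within each class matters here because suicides are rare (under 4% of the test set), so a purely random split could leave very few of them in one of the sets.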
Confusion Matrix and Statistics

Predictions are in rows and reference (true) classes in columns:

Prediction | Accident | Homicide | Suicide |
---|---|---|---|
Accident | 743 | 50 | 5 |
Homicide | 92 | 2083 | 22 |
Suicide | 19 | 33 | 92 |

Overall Statistics

- Accuracy: 0.9296
- 95% CI: (0.9201, 0.9383)
- No Information Rate: 0.69
- P-Value [Acc > NIR]: < 2.2e-16
- Kappa: 0.8422
- Mcnemar's Test P-Value: 4.468e-05

Statistics by Class:

Statistic | Class: Accident | Class: Homicide | Class: Suicide |
---|---|---|---|
Sensitivity | 0.8700 | 0.9617 | 0.77311 |
Specificity | 0.9759 | 0.8828 | 0.98278 |
Pos Pred Value | 0.9311 | 0.9481 | 0.63889 |
Neg Pred Value | 0.9526 | 0.9119 | 0.99098 |
Prevalence | 0.2721 | 0.6900 | 0.03791 |
Detection Rate | 0.2367 | 0.6636 | 0.02931 |
Detection Prevalence | 0.2542 | 0.6999 | 0.04587 |

The random forest had a sensitivity of 96% for homicides, that is, out of the 2166 homicides in our testing sample, 2083 were classified correctly. It had a specificity of 88%, that is, out of the 973 suicides and accidents in our testing sample, 859 were correctly classified as something other than homicide.
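The sensitivity and specificity figures can be recomputed directly from the confusion matrix. The Python sketch below treats homicide as the positive class and everything else as negative; the counts are copied from the output above, while the function names are hypothetical.

```python
# Rows are predictions, columns are reference (true) classes.
MATRIX = {
    "Accident": {"Accident": 743, "Homicide": 50, "Suicide": 5},
    "Homicide": {"Accident": 92, "Homicide": 2083, "Suicide": 22},
    "Suicide": {"Accident": 19, "Homicide": 33, "Suicide": 92},
}

def one_vs_rest(matrix, positive):
    """Collapse the matrix into true/false positives/negatives for one class."""
    tp = matrix[positive][positive]
    fp = sum(matrix[positive][ref] for ref in matrix if ref != positive)
    fn = sum(matrix[pred][positive] for pred in matrix if pred != positive)
    total = sum(sum(row.values()) for row in matrix.values())
    tn = total - tp - fp - fn
    return tp, fp, fn, tn

def sensitivity(matrix, positive):
    tp, fp, fn, tn = one_vs_rest(matrix, positive)
    return tp / (tp + fn)   # share of true homicides that were caught

def specificity(matrix, positive):
    tp, fp, fn, tn = one_vs_rest(matrix, positive)
    return tn / (tn + fp)   # share of non-homicides not labelled homicide

def accuracy(matrix):
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

print(one_vs_rest(MATRIX, "Homicide"))            # (2083, 114, 83, 859)
print(round(sensitivity(MATRIX, "Homicide"), 4))  # 0.9617
print(round(specificity(MATRIX, "Homicide"), 4))  # 0.8828
print(round(accuracy(MATRIX), 4))                 # 0.9296
```

The 114 false positives are the 92 accidents and 22 suicides that the model labelled as homicides, which is why specificity is lower than sensitivity.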
Here's the comparison with the original data:
Year | Injury Intent | Original Deaths | Imputed Deaths |
---|---|---|---|
2004 | Homicide | 208 | 226 |
2005 | Homicide | 275 | 288 |
2006 | Homicide | 221 | 248 |
2007 | Homicide | 202 | 300 |
2008 | Homicide | 1616 | 1679 |
2009 | Homicide | 2397 | 2476 |
2010 | Homicide | 3686* | 3867* |
The imputed number is still not the final number of homicides since (ignoring any statistical error and clandestine graves) homicides are under-counted by around 3% for the last year because the database has a cutoff date for registering deaths of December 31st. Taking this into account and rounding up, the number of imputed homicides in Juárez would be close to 4,000. According to the SNSP there were 3,903 homicides in the state of Chihuahua (yeah right), and according to the criminal rivalry database there were 2,738 drug war-related homicides in Juárez during 2010.
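The adjustment for late registrations is simple arithmetic. The Python lines below show two ways of reading "under-counted by around 3%"; which one was intended is an assumption, but both land just under 4,000.

```python
imputed_2010 = 3867   # imputed homicides for 2010, from the table above
undercount = 0.03     # approximate share missing due to the registration cutoff

# Reading 1: add 3% on top of the registered count.
adjusted_add = imputed_2010 * (1 + undercount)

# Reading 2: the registered count is 97% of the true total.
adjusted_ratio = imputed_2010 / (1 - undercount)

print(round(adjusted_add), round(adjusted_ratio))
```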
Period | Original Deaths | Imputed Deaths |
---|---|---|
Total 2008-2010 | 7699 | 8022 |
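To see how much the imputation changes each year, the Python snippet below recomputes the totals and the relative increase from the yearly homicide counts in the comparison table. The variable names are hypothetical; the numbers are the ones reported in this post.

```python
# year: (original homicides, imputed homicides)
COUNTS = {
    2004: (208, 226),
    2005: (275, 288),
    2006: (221, 248),
    2007: (202, 300),
    2008: (1616, 1679),
    2009: (2397, 2476),
    2010: (3686, 3867),
}

def pct_increase(original, imputed):
    return 100 * (imputed - original) / original

for year, (original, imputed) in COUNTS.items():
    print(year, f"{pct_increase(original, imputed):.1f}%")

total_original = sum(COUNTS[y][0] for y in (2008, 2009, 2010))
total_imputed = sum(COUNTS[y][1] for y in (2008, 2009, 2010))
print(total_original, total_imputed)
```

The jump in 2007 (from 202 to 300, about 48.5%) stands out against the other years, where the imputation adds between roughly 3% and 12%.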
Why did it happen?
The high number of deaths of unknown injury intent that resemble homicides could be due to several factors:
- The takeover of law enforcement functions by the military in 2008, then the federal police in 2010, and yet again the municipal police in 2011 could have played havoc with record keeping.
- The high levels of violence overwhelmed forensic services and legal authorities.
- Some estimates put the number of people who left Juárez as high as 230,000; this surely had some effect on the quality of vital statistics.
- Given that classifying deaths of unknown intent increased the number of homicides by about 50% in 2007, before the violence started, statistical manipulation is certainly worth some consideration. According to WikiLeaks, Juárez had 316 murders in 2007 (though it's not clear if the number refers to the Zona Norte), which is close to the imputed estimate.
Future Research
Juárez is not the only place where deaths of unknown intent increased. It's also worth checking out states like Tamaulipas, Coahuila, San Luis Potosí and Durango.
Sometimes accidental deaths by firearm have interesting patterns
In Michoacán, homicides dropped soon after the start of Operation Michoacán on December 11, 2006, but accidents of unknown injury mechanism rose at the same time. Since transport accidents dropped at the same time as homicides, it is not entirely clear what happened:
P.S. You can download the code and data at my GitHub account