Machine learning for better homicide counts in Ciudad Juarez

[This article was first published on Diego Valle's Blog, and kindly contributed to R-bloggers].
Photo Credit: Jesús Villaseca Pérez
In March 2008 Ciudad Juárez began to register an alarming number of homicides, becoming Mexico’s most violent city. According to the Mexican vital statistics system, Ciudad Juárez (coterminous with the Juárez municipality) went from having just 202 murders in 2007 to 1,616 in 2008, 2,397 in 2009, and 3,686 in 2010.

Mexican and US officials explain the dramatic increase in violence as due to a conflict between the Sinaloa and Juárez Cartels. After a new governor was elected in October 2010 Ciudad Juárez does seem to have started turning around, but it is still an extremely violent city.

Mortality from ill-defined conditions is quite high in Ciudad Juárez. Deaths of unknown injury intent went from 33 in 2004 to 193 in 2010. It is an open question just how good the homicide records are in places that saw incredible rises in homicides, since one can only assume that forensic services were overwhelmed and that educated professionals like doctors were the first to leave town.

Recently there was a story in the New York Times about how Target figured out a teen girl was pregnant before her father did, based on her shopping patterns. If Target can classify its shoppers into pregnant and not pregnant based on the stuff they buy, I can certainly train a computer to classify deaths in Mexico based on the characteristics of the deaths.

Intuitively, if someone were to ask you to guess the type of death of a 70-year-old woman whose cause of death was transport related, you would probably guess it was an accident. On the other hand, if you had to guess the type of death of a young adult male whose cause of death was a firearm, and the injury took place on a public street in Juárez, you would probably guess it was a homicide.

The Mexican government supposedly keeps a record of all deaths in its mortality database. For my purposes there are four types of violent or injury intent deaths:

  1. Accidents (unintentional injuries)
  2. Suicides (self-inflicted injuries)
  3. Homicides (intentional injuries)
  4. Unknown Intent (cases where forensic or legal experts determined information was not sufficient to make a decision about the injury intent)

There is a fifth type of injury intent death: “Legal intervention, operations of war, military operations, and terrorism” but there are only about 30-40 deaths of this type in Mexico each year. Most of them occur in Puebla. It is very likely that most of the shootouts where the police or military kill someone (or die themselves) are classified as plain old homicides, so I chose to recode this type of death as homicides.

The Mexican vital statistics database assigns every death an International Classification of Diseases (ICD) code, which is used worldwide to categorize diseases, injuries, and external causes of injury. Sadly, it can be a little too specific, since it includes distinctions for “wounds inflicted by macaws” and “wounds inflicted by parrots.” In addition, it combines the injury mechanism (parrot, macaw, drop bear, chupacabras, etc.) with intent (suicide, homicide, etc.), e.g., the code for accidental death by handgun is “W32,” and the code for homicide by handgun is “X93”.
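To make the mechanism/intent split concrete, here is a minimal sketch in Python (the original analysis was done in R, so this is an illustration, not the code behind the post). It derives the injury intent from the first three characters of an ICD-10 external cause code using the standard ICD-10 Chapter XX blocks, and folds the legal intervention and war block into homicides as described above. The function name `injury_intent` is hypothetical.

```python
def injury_intent(icd_code):
    """Map an ICD-10 external cause code (e.g. 'X93' or 'W32.0') to an injury intent.

    Uses the ICD-10 Chapter XX blocks:
      V01-X59 accidents, X60-X84 intentional self-harm, X85-Y09 assault,
      Y10-Y34 undetermined intent, Y35-Y36 legal intervention and war.
    Returns None for codes outside those blocks.
    """
    code = icd_code.strip().upper()[:3]
    key = (code[0], int(code[1:]))  # e.g. ('X', 93); tuples compare letter first

    if ("V", 1) <= key <= ("X", 59):
        return "Accident"
    if ("X", 60) <= key <= ("X", 84):
        return "Suicide"
    if ("X", 85) <= key <= ("Y", 9):
        return "Homicide"
    if ("Y", 10) <= key <= ("Y", 34):
        return "Unknown Intent"
    if ("Y", 35) <= key <= ("Y", 36):
        # Assumption taken from the post: legal intervention and war are recoded as homicides
        return "Homicide"
    return None


for code in ["W32", "X72", "X93", "Y22", "Y35"]:
    print(code, injury_intent(code))
```

The handgun codes W32, X72, X93 and Y22 all describe the same mechanism, and only the block they fall in tells the four intents apart.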

Ideally I would have a way of classifying the injury mechanism into meaningful groups separate from intent, something akin to the game of Clue, where there are half a dozen or so weapons (candlestick, knife, rope, gun, etc.) that can be used to harm people. Since epidemiology is a real science, we have to replace the names of the weapons with fancy names like “struck by or against,” “cut/pierce,” “suffocation,” “firearm,” etc. Lucky for me, the CDC’s National Center for Health Statistics has already done exactly that with its External Cause of Injury Mortality Matrix for ICD-10.

Once we have recoded the mortality database to include the mortality matrix we can visualize the different types of violent or injury intent deaths according to injury mechanism:
The resemblance between homicides and deaths of unknown intent is easy to see, since both tend to involve firearms. Deaths by accident are mainly caused by transportation (motor vehicles) and suicides by suffocation (think hanging).

It’s also worth pointing out that accidental deaths by unspecified mechanism increased at precisely the same time homicides shot through the roof, which could imply that there was some leakage of homicides into accidents, but in this post I’ll go with the conservative assumption that all homicides, suicides and accidents were correctly classified by the Mexican health authorities.

The injury mechanism isn’t the only useful information we can use to differentiate the type of death, since victims of homicide also tend to be younger. We can also use the location where the body was found and the year of death. I could certainly use a more complex model that included marital status, the day of the week when the death occurred, and so on, but with the high specificity and sensitivity that resulted from the simple model I used, I saw no need to complicate things further (that and my laptop has 2GB of RAM).

Of course things are never as simple as in textbooks, and the Mexican mortality database has some missing values. I assumed homicides where the year of occurrence was not available occurred in the same year the death was registered. For the rest of the data I used k-Nearest Neighbors to impute the missing values.
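As a rough illustration of the imputation idea, here is a minimal k-Nearest Neighbors sketch in plain Python (the post used R, and its exact distance function and value of k are not stated, so everything below is an assumption). The records are made up, the helper names `distance` and `knn_impute` are hypothetical, and the sketch assumes that only the field being imputed is missing in a given row.

```python
from collections import Counter

# Made-up records for illustration only; None marks a missing value.
records = [
    {"age": 25, "cause": "Firearm", "sex": "Male", "place": "Public Street"},
    {"age": 31, "cause": "Firearm", "sex": "Male", "place": "Public Street"},
    {"age": 28, "cause": "Firearm", "sex": "Male", "place": "Home"},
    {"age": 70, "cause": "All Transport", "sex": "Female", "place": "Public Street"},
    {"age": 66, "cause": "Fall", "sex": "Female", "place": "Home"},
    {"age": 27, "cause": "Firearm", "sex": "Male", "place": None},
]


def distance(a, b, skip):
    """Simple mixed-type distance over every field except `skip`."""
    total = 0.0
    for field in a:
        if field == skip:
            continue
        if field == "age":
            total += abs(a[field] - b[field]) / 100.0  # scale age to roughly 0..1
        else:
            total += 0.0 if a[field] == b[field] else 1.0  # categorical mismatch
    return total


def knn_impute(rows, field, k=3):
    """Fill missing `field` values by majority vote among the k nearest complete rows."""
    complete = [r for r in rows if all(v is not None for v in r.values())]
    filled = []
    for row in rows:
        if row[field] is None:
            nearest = sorted(complete, key=lambda r: distance(row, r, field))[:k]
            vote = Counter(r[field] for r in nearest).most_common(1)[0][0]
            row = {**row, field: vote}  # copy, so the input rows are not modified
        filled.append(row)
    return filled


imputed = knn_impute(records, "place", k=3)
print(imputed[-1]["place"])
```

In this toy data the three nearest neighbors of the incomplete record are the other young male firearm deaths, two of which occurred on a public street, so that is the value that gets filled in.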


Most accidents by unspecified injury mechanism were classified as transportation accidents, which agrees with previous findings that road deaths are under-counted in Mexico.

The cleaned up dataset looks like this:
EDADVALOR         CAUSE SEXOtxt     LUGLEStxt ANIODEF PRESUNTOtxt
7.483315 All Transport    Male          Home    2007    Accident
5.656854       Firearm    Male          Home    2007    Homicide
6.164414       Firearm    Male Public Street    2009    Homicide
6.082763       Firearm    Male          Home    2007    Homicide
4.690416     Poisoning    Male Public Street    2004    Accident
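Where EDADVALOR is the age of the victim (the values shown match the square root of the age in years, e.g. 7.483315 squared is 56), CAUSE is the injury mechanism, SEXOtxt is the sex of the victim, LUGLEStxt is the location (such as home or public street), ANIODEF is the year of death, and PRESUNTOtxt is the injury intent.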

I divided the dataset into training (75% of the data) and test sets. I then fit a penalized regression (glmnet package), a support vector machine (with a radial kernel since it had better performance than a linear one), and a random forest model to the training set and evaluated their accuracy against the test set using the caret package.
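For readers unfamiliar with hold-out evaluation, here is a minimal Python sketch of a 75/25 split that keeps the class proportions roughly equal in both sets. Whether the original R code stratified the split is an assumption on my part; the function name `stratified_split`, the seed, and the class counts are all made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=1):
    """Return (train_idx, test_idx) with each class split at `train_fraction`."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


# Made-up class counts, not the real data
labels = ["Homicide"] * 80 + ["Accident"] * 32 + ["Suicide"] * 8
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class matters here because suicides are rare, and a purely random split could leave the test set with too few of them to evaluate.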
The random forest model had the highest accuracy (though within the margin of error of the SVM) and that’s the algorithm I used to classify the deaths. Here’s its confusion matrix:
Confusion Matrix and Statistics

          Reference
Prediction Accident Homicide Suicide
  Accident      743       50       5
  Homicide       92     2083      22
  Suicide        19       33      92

Overall Statistics
                                          
               Accuracy : 0.9296          
                 95% CI : (0.9201, 0.9383)
    No Information Rate : 0.69            
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8422          
 Mcnemar's Test P-Value : 4.468e-05       

Statistics by Class:

                     Class: Accident Class: Homicide Class: Suicide
Sensitivity                   0.8700          0.9617        0.77311
Specificity                   0.9759          0.8828        0.98278
Pos Pred Value                0.9311          0.9481        0.63889
Neg Pred Value                0.9526          0.9119        0.99098
Prevalence                    0.2721          0.6900        0.03791
Detection Rate                0.2367          0.6636        0.02931
Detection Prevalence          0.2542          0.6999        0.04587

The random forest had a sensitivity of 96% for homicides, that is, out of the 2,166 homicides in our testing sample, 2,083 were classified correctly. It had a specificity of 88%, that is, out of the 973 suicides and accidents in our testing sample, 859 were classified correctly.
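These figures can be recomputed by hand from the confusion matrix. The following Python sketch (an illustration, not part of the original R analysis) copies the counts from the output above, with rows as predictions and columns as reference labels, and derives the homicide sensitivity and specificity, the overall accuracy, and Cohen's kappa. The helper name `class_stats` is hypothetical.

```python
classes = ["Accident", "Homicide", "Suicide"]

# Rows are predictions, columns are reference labels (copied from the output above)
matrix = [
    [743, 50, 5],
    [92, 2083, 22],
    [19, 33, 92],
]

n = len(classes)
total = sum(sum(row) for row in matrix)


def class_stats(matrix, k):
    """Return (sensitivity, specificity) for class index k, one-vs-rest."""
    tp = matrix[k][k]
    fn = sum(matrix[i][k] for i in range(n)) - tp  # rest of the reference column
    fp = sum(matrix[k]) - tp                       # rest of the prediction row
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


sensitivity, specificity = class_stats(matrix, classes.index("Homicide"))

# Overall accuracy: correct predictions lie on the diagonal
accuracy = sum(matrix[i][i] for i in range(n)) / total

# Cohen's kappa: agreement beyond what the row/column totals would give by chance
expected = sum(
    sum(matrix[i]) * sum(matrix[r][i] for r in range(n)) for i in range(n)
) / total**2
kappa = (accuracy - expected) / (1 - expected)

print(f"sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
print(f"accuracy={accuracy:.4f} kappa={kappa:.4f}")
```

The homicide column sums to 2,166 and the other two columns to 973, which is where the denominators quoted in the text come from.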

Here’s the comparison with the original data:
Year  Injury Intent  Original Deaths  Imputed Deaths
2004  Homicide                   208             226
2005  Homicide                   275             288
2006  Homicide                   221             248
2007  Homicide                   202             300
2008  Homicide                  1616            1679
2009  Homicide                  2397            2476
2010  Homicide                 3686*           3867*
* Under-counted by about 3%

The imputed number is still not the final number of homicides since (ignoring any statistical error and clandestine graves) homicides are under-counted by around 3% for the last year, because the database only includes deaths registered by the December 31st cutoff date. Taking this into account and rounding up, the number of imputed homicides in Juárez would be close to 4,000. According to the SNSP there were 3,903 homicides in the state of Chihuahua (yeah right), and according to the criminal rivalry database there were 2,738 drug war-related homicides in Juárez during 2010.

Total 2008-2010  Original Deaths  Imputed Deaths
                            7699            8022
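The 2008-2010 totals and the 2010 adjustment can be checked with a few lines of arithmetic. This Python sketch uses the yearly counts from the comparison table. Applying the 3% under-count as a simple multiplier is my assumption about how the "close to 4,000" figure was reached, and dividing by 0.97 instead gives nearly the same answer.

```python
# Yearly homicide counts for Juárez, copied from the comparison table
original = {2008: 1616, 2009: 2397, 2010: 3686}
imputed = {2008: 1679, 2009: 2476, 2010: 3867}

total_original = sum(original.values())
total_imputed = sum(imputed.values())

# Assumption: late registrations add about 3% to the last year's count
undercount = 0.03
adjusted_2010 = imputed[2010] * (1 + undercount)
adjusted_2010_alt = imputed[2010] / (1 - undercount)  # alternative reading of "3%"

print(total_original, total_imputed)
print(round(adjusted_2010), round(adjusted_2010_alt))
```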

Why did it happen?

The high number of deaths of unknown injury intent that resemble homicides could be due to several factors:


Future Research 

Juárez is not the only place where deaths of unknown intent increased; it’s also worth checking out states like Tamaulipas, Coahuila, San Luis Potosí and Durango.
Sometimes accidental deaths by firearm have interesting patterns.
In Michoacán homicides dropped soon after the start of Operation Michoacán on December 11, 2006, but accidents of unknown injury mechanism rose at the same time. Since transport accidents dropped at the same time as homicides, it is not entirely clear what happened.

P.S. You can download the code and data at my github account