Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
On July 26, 2020, Propublica released a dataset on police discipline records.
I wanted to get a look at the data and do some exploratory analysis.
library(tidyverse) ## ── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ── ## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4 ## ✓ tibble 3.0.3 ✓ dplyr 1.0.0 ## ✓ tidyr 1.1.0 ✓ stringr 1.4.0 ## ✓ readr 1.3.1 ✓ forcats 0.5.0 ## Warning: package 'ggplot2' was built under R version 3.6.2 ## Warning: package 'tibble' was built under R version 3.6.2 ## Warning: package 'tidyr' was built under R version 3.6.2 ## Warning: package 'purrr' was built under R version 3.6.2 ## Warning: package 'dplyr' was built under R version 3.6.2 ## ── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() library(gt) ## Warning: package 'gt' was built under R version 3.6.2 library(here) ## here() starts at /Users/samportnow/sam-portnow-website library(fs) ## Warning: package 'fs' was built under R version 3.6.2 library(skimr) ## Warning: package 'skimr' was built under R version 3.6.2 library(janitor) ## Warning: package 'janitor' was built under R version 3.6.2 ## ## Attaching package: 'janitor' ## The following objects are masked from 'package:stats': ## ## chisq.test, fisher.test
Read in Data
data = read_csv(here::here('content', 'post_data', 'allegations_20200726939.csv')) ## Parsed with column specification: ## cols( ## .default = col_character(), ## unique_mos_id = col_double(), ## complaint_id = col_double(), ## month_received = col_double(), ## year_received = col_double(), ## month_closed = col_double(), ## year_closed = col_double(), ## mos_age_incident = col_double(), ## complainant_age_incident = col_double(), ## precinct = col_double() ## ) ## See spec(...) for full column specifications.
Explore Data
skim(data)
Name | data |
Number of rows | 33358 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
character | 17 |
numeric | 9 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
first_name | 0 | 1.00 | 2 | 10 | 0 | 1217 | 0 |
last_name | 0 | 1.00 | 2 | 18 | 0 | 2835 | 0 |
command_now | 0 | 1.00 | 2 | 7 | 0 | 415 | 0 |
command_at_incident | 1544 | 0.95 | 2 | 7 | 0 | 361 | 0 |
rank_abbrev_incident | 0 | 1.00 | 2 | 3 | 0 | 18 | 0 |
rank_abbrev_now | 0 | 1.00 | 2 | 3 | 0 | 20 | 0 |
rank_now | 0 | 1.00 | 7 | 22 | 0 | 8 | 0 |
rank_incident | 0 | 1.00 | 7 | 22 | 0 | 8 | 0 |
mos_ethnicity | 0 | 1.00 | 5 | 15 | 0 | 5 | 0 |
mos_gender | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
complainant_ethnicity | 4464 | 0.87 | 5 | 15 | 0 | 8 | 0 |
complainant_gender | 4195 | 0.87 | 4 | 21 | 0 | 6 | 0 |
fado_type | 0 | 1.00 | 5 | 18 | 0 | 4 | 0 |
allegation | 1 | 1.00 | 4 | 40 | 0 | 115 | 0 |
contact_reason | 199 | 0.99 | 5 | 58 | 0 | 53 | 0 |
outcome_description | 56 | 1.00 | 12 | 36 | 0 | 23 | 0 |
board_disposition | 0 | 1.00 | 10 | 40 | 0 | 11 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
unique_mos_id | 0 | 1.00 | 18169.91 | 9566.32 | 2 | 9671.00 | 19215 | 25412 | 36374 | ▆▆▇▇▃ |
complaint_id | 0 | 1.00 | 23905.06 | 11954.43 | 517 | 13684.75 | 25132 | 34252 | 43703 | ▅▆▆▇▇ |
month_received | 0 | 1.00 | 6.32 | 3.36 | 1 | 3.00 | 6 | 9 | 12 | ▇▆▅▆▇ |
year_received | 0 | 1.00 | 2010.73 | 6.03 | 1985 | 2007.00 | 2012 | 2016 | 2020 | ▁▁▃▇▇ |
month_closed | 0 | 1.00 | 6.47 | 3.34 | 1 | 4.00 | 6 | 9 | 12 | ▇▆▆▆▇ |
year_closed | 0 | 1.00 | 2011.53 | 6.09 | 1985 | 2008.00 | 2013 | 2016 | 2020 | ▁▁▂▆▇ |
mos_age_incident | 0 | 1.00 | 32.35 | 6.04 | 20 | 28.00 | 31 | 36 | 60 | ▅▇▃▁▁ |
complainant_age_incident | 4812 | 0.86 | 32.48 | 28.41 | -4301 | 23.00 | 30 | 41 | 101 | ▁▁▁▁▇ |
precinct | 24 | 1.00 | 64.37 | 31.45 | 0 | 43.00 | 67 | 81 | 1000 | ▇▁▁▁▁ |
# Officer Ethnicity |
data %>% tabyl(mos_ethnicity) %>% adorn_pct_formatting() %>% gt()
mos_ethnicity | n | percent |
---|---|---|
American Indian | 32 | 0.1% |
Asian | 1178 | 3.5% |
Black | 4924 | 14.8% |
Hispanic | 9150 | 27.4% |
White | 18074 | 54.2% |
It looks like the officer ethnicity matches the overall NYPD demogrphaic breakdown
Complainant Ethnicity
data %>% tabyl(complainant_ethnicity) %>% adorn_pct_formatting() %>% gt()
complainant_ethnicity | n | percent | valid_percent |
---|---|---|---|
American Indian | 64 | 0.2% | 0.2% |
Asian | 532 | 1.6% | 1.8% |
Black | 17114 | 51.3% | 59.2% |
Hispanic | 6424 | 19.3% | 22.2% |
Other Race | 677 | 2.0% | 2.3% |
Refused | 259 | 0.8% | 0.9% |
Unknown | 1041 | 3.1% | 3.6% |
White | 2783 | 8.3% | 9.6% |
NA | 4464 | 13.4% | – |
As for as the complaintant data is concerned, there isn’t an obvious population to compare to, but I think the closest is the arrest data because arrests can give some approximation of contact with the police.
Here, we see that Black people are overrepresented in their complaints compared to the percentage of arrestees that are Black. Also, Hispanic people are markedly underrepresented. I wonder if this is because of a large undocumented population in New York who may try to avoid contact with the government, or perhaps an issue with the missing ethnicity category.
Cross Tab Complaintant Ethnicity and Officer Ethnicity
data %>% tabyl(complainant_ethnicity, mos_ethnicity) %>% adorn_percentages("row") %>% adorn_pct_formatting(digits = 2) %>% adorn_ns() %>% gt()
complainant_ethnicity | American Indian | Asian | Black | Hispanic | White |
---|---|---|---|---|---|
American Indian | 0.00% (0) | 3.12% (2) | 6.25% (4) | 12.50% (8) | 78.12% (50) |
Asian | 0.00% (0) | 11.47% (61) | 10.90% (58) | 18.05% (96) | 59.59% (317) |
Black | 0.11% (18) | 3.26% (558) | 16.63% (2846) | 27.59% (4722) | 52.41% (8970) |
Hispanic | 0.03% (2) | 3.67% (236) | 11.04% (709) | 34.48% (2215) | 50.78% (3262) |
Other Race | 0.89% (6) | 2.81% (19) | 14.48% (98) | 27.62% (187) | 54.21% (367) |
Refused | 0.00% (0) | 6.95% (18) | 14.67% (38) | 22.39% (58) | 55.98% (145) |
Unknown | 0.19% (2) | 2.59% (27) | 17.39% (181) | 29.01% (302) | 50.82% (529) |
White | 0.07% (2) | 4.64% (129) | 13.65% (380) | 22.46% (625) | 59.18% (1647) |
NA | 0.04% (2) | 2.87% (128) | 13.66% (610) | 20.99% (937) | 62.43% (2787) |
The first thing I know is that there doesn’t seem to be much over or underrepresentation in the officer ethnicity relative to the demographic breakdown of the NYPD, with the exception of White, Native American, and Missing Complaitants and White officers. For these groups, which White officers are overrepresnted relative to the demographic breakdown of the NYPD. Actually, it appears that all ethnicity groups overrepresent their officers in complaints when the offending officer is of their same ethnicity, with the exception of Black complaitants. There is a large psychological literature that suggests people on on average display intergroup bias in favor of their own race. Perhaps when someone is mistreated by someone of their own race, it feels like a larger transgression.
Top Level Category of Complaint
data %>% tabyl(fado_type) %>% adorn_pct_formatting() %>% gt()
fado_type | n | percent |
---|---|---|
Abuse of Authority | 20292 | 60.8% |
Discourtesy | 4677 | 14.0% |
Force | 7636 | 22.9% |
Offensive Language | 753 | 2.3% |
data %>% tabyl(allegation) %>% adorn_pct_formatting() %>% arrange(desc(n)) %>% gt()
allegation | n | percent | valid_percent |
---|---|---|---|
Physical force | 4849 | 14.5% | 14.5% |
Word | 3942 | 11.8% | 11.8% |
Stop | 2300 | 6.9% | 6.9% |
Search (of person) | 2047 | 6.1% | 6.1% |
Frisk | 1926 | 5.8% | 5.8% |
Premises entered and/or searched | 1555 | 4.7% | 4.7% |
Refusal to provide name/shield number | 1483 | 4.4% | 4.4% |
Vehicle search | 1405 | 4.2% | 4.2% |
Threat of arrest | 1370 | 4.1% | 4.1% |
Vehicle stop | 1094 | 3.3% | 3.3% |
Threat of force (verbal or physical) | 912 | 2.7% | 2.7% |
Question and/or stop | 696 | 2.1% | 2.1% |
Other | 641 | 1.9% | 1.9% |
Gun Pointed | 628 | 1.9% | 1.9% |
Strip-searched | 556 | 1.7% | 1.7% |
Question | 527 | 1.6% | 1.6% |
Property damaged | 359 | 1.1% | 1.1% |
Pepper spray | 324 | 1.0% | 1.0% |
Action | 316 | 0.9% | 0.9% |
Race | 307 | 0.9% | 0.9% |
Retaliatory summons | 291 | 0.9% | 0.9% |
Entry of Premises | 289 | 0.9% | 0.9% |
Refusal to obtain medical treatment | 276 | 0.8% | 0.8% |
Nightstick as club (incl asp & baton) | 260 | 0.8% | 0.8% |
Chokehold | 244 | 0.7% | 0.7% |
Frisk and/or search | 241 | 0.7% | 0.7% |
Curse | 202 | 0.6% | 0.6% |
Refusal to process civilian complaint | 185 | 0.6% | 0.6% |
Gun Drawn | 182 | 0.5% | 0.5% |
Push/Shove | 178 | 0.5% | 0.5% |
Seizure of property | 178 | 0.5% | 0.5% |
Failure to provide RTKA card | 176 | 0.5% | 0.5% |
Hit against inanimate object | 176 | 0.5% | 0.5% |
Threat to damage/seize property | 171 | 0.5% | 0.5% |
Search of Premises | 159 | 0.5% | 0.5% |
Retaliatory arrest | 147 | 0.4% | 0.4% |
Other – Force | 134 | 0.4% | 0.4% |
Refusal to show search warrant | 125 | 0.4% | 0.4% |
Forcible Removal to Hospital | 120 | 0.4% | 0.4% |
Gender | 115 | 0.3% | 0.3% |
Punch/Kick | 114 | 0.3% | 0.3% |
Interference with recording | 111 | 0.3% | 0.3% |
Threat of summons | 111 | 0.3% | 0.3% |
Refusal to provide shield number | 97 | 0.3% | 0.3% |
Threat to notify ACS | 97 | 0.3% | 0.3% |
Refusal to provide name | 96 | 0.3% | 0.3% |
Person Searched | 94 | 0.3% | 0.3% |
Nasty Words | 88 | 0.3% | 0.3% |
Nonlethal restraining device | 87 | 0.3% | 0.3% |
Ethnicity | 84 | 0.3% | 0.3% |
Sexual orientation | 78 | 0.2% | 0.2% |
Handcuffs too tight | 76 | 0.2% | 0.2% |
Threat of force | 74 | 0.2% | 0.2% |
Other – Abuse | 70 | 0.2% | 0.2% |
Dragged/Pulled | 63 | 0.2% | 0.2% |
Beat | 57 | 0.2% | 0.2% |
Black | 56 | 0.2% | 0.2% |
Gesture | 51 | 0.2% | 0.2% |
Other blunt instrument as a club | 49 | 0.1% | 0.1% |
Threat of Arrest | 39 | 0.1% | 0.1% |
Demeanor/tone | 38 | 0.1% | 0.1% |
Gun fired | 38 | 0.1% | 0.1% |
Vehicle | 35 | 0.1% | 0.1% |
Premise Searched | 31 | 0.1% | 0.1% |
Gun as club | 28 | 0.1% | 0.1% |
Property Damaged | 27 | 0.1% | 0.1% |
Nightstick/Billy/Club | 25 | 0.1% | 0.1% |
Search of recording device | 23 | 0.1% | 0.1% |
Radio as club | 22 | 0.1% | 0.1% |
Vehicle Searched | 22 | 0.1% | 0.1% |
Restricted Breathing | 21 | 0.1% | 0.1% |
Threat re: removal to hospital | 21 | 0.1% | 0.1% |
Mace | 20 | 0.1% | 0.1% |
Detention | 19 | 0.1% | 0.1% |
Flashlight as club | 18 | 0.1% | 0.1% |
Other- Discourtesy | 18 | 0.1% | 0.1% |
Photography/Videography | 15 | 0.0% | 0.0% |
Physical disability | 15 | 0.0% | 0.0% |
Religion | 14 | 0.0% | 0.0% |
Slap | 14 | 0.0% | 0.0% |
Radio As Club | 13 | 0.0% | 0.0% |
Sexual Misconduct (Sexual Humiliation) | 13 | 0.0% | 0.0% |
Threat to Property | 13 | 0.0% | 0.0% |
Gun As Club | 12 | 0.0% | 0.0% |
Electronic device information deletion | 11 | 0.0% | 0.0% |
Other – Ethnic Slur | 11 | 0.0% | 0.0% |
Police shield | 11 | 0.0% | 0.0% |
Sex Miscon (Sexual Harassment, Verbal) | 11 | 0.0% | 0.0% |
Flashlight As Club | 10 | 0.0% | 0.0% |
Refusal to show arrest warrant | 10 | 0.0% | 0.0% |
Gun pointed/gun drawn | 9 | 0.0% | 0.0% |
Hispanic | 9 | 0.0% | 0.0% |
Property Seized | 8 | 0.0% | 0.0% |
White | 8 | 0.0% | 0.0% |
Rude Gesture | 7 | 0.0% | 0.0% |
Arrest/D. A. T. | 6 | 0.0% | 0.0% |
Sh Refuse Cmp | 5 | 0.0% | 0.0% |
Threat of Summons | 5 | 0.0% | 0.0% |
Arrest/Onlooker | 4 | 0.0% | 0.0% |
Gender Identity | 4 | 0.0% | 0.0% |
Improper dissemination of medical info | 4 | 0.0% | 0.0% |
Sex Miscon (Sexual/Romantic Proposition) | 4 | 0.0% | 0.0% |
Animal | 3 | 0.0% | 0.0% |
Body Cavity Searches | 3 | 0.0% | 0.0% |
Gay/Lesbian Slur | 3 | 0.0% | 0.0% |
Gun pointed | 3 | 0.0% | 0.0% |
Profane Gesture | 3 | 0.0% | 0.0% |
Failed to Obtain Language Interpretation | 2 | 0.0% | 0.0% |
Gun Fired | 2 | 0.0% | 0.0% |
Jewish | 2 | 0.0% | 0.0% |
Sex Miscon (Sexual Harassment, Gesture) | 2 | 0.0% | 0.0% |
Oriental | 1 | 0.0% | 0.0% |
Other Asian | 1 | 0.0% | 0.0% |
Questioned immigration status | 1 | 0.0% | 0.0% |
Sexist Remark | 1 | 0.0% | 0.0% |
NA | 1 | 0.0% | – |
Contact Reason
data %>% tabyl(contact_reason) %>% arrange(desc(n)) %>% adorn_pct_formatting() %>% gt()
contact_reason | n | percent | valid_percent |
---|---|---|---|
PD suspected C/V of violation/crime – street | 10078 | 30.2% | 30.4% |
Other | 4104 | 12.3% | 12.4% |
PD suspected C/V of violation/crime – auto | 2981 | 8.9% | 9.0% |
PD suspected C/V of violation/crime – bldg | 2542 | 7.6% | 7.7% |
Moving violation | 1983 | 5.9% | 6.0% |
Other violation of VTL | 1140 | 3.4% | 3.4% |
Report-dispute | 1085 | 3.3% | 3.3% |
Execution of search warrant | 913 | 2.7% | 2.8% |
Report of other crime | 906 | 2.7% | 2.7% |
Execution of arrest/bench warrant | 683 | 2.0% | 2.1% |
Parking violation | 635 | 1.9% | 1.9% |
Report-domestic dispute | 563 | 1.7% | 1.7% |
C/V intervened on behalf of/observed encounter w/3rd party | 540 | 1.6% | 1.6% |
Report-gun possession/shots fired | 492 | 1.5% | 1.5% |
PD suspected C/V of violation/crime – subway | 380 | 1.1% | 1.1% |
Patrol Encounter | 362 | 1.1% | 1.1% |
Report-noise/disturbance | 356 | 1.1% | 1.1% |
Report-possession/sale of narcotics | 318 | 1.0% | 1.0% |
C/V requested investigation of crime | 316 | 0.9% | 1.0% |
EDP aided case | 313 | 0.9% | 0.9% |
Traffic Incidents/Accident/Prk Violation | 219 | 0.7% | 0.7% |
Stop/Question/Frisk | 206 | 0.6% | 0.6% |
Traffic accident | 200 | 0.6% | 0.6% |
NA | 199 | 0.6% | – |
Others | 189 | 0.6% | 0.6% |
C/V telephoned PCT | 175 | 0.5% | 0.5% |
C/V at PCT to obtain information | 167 | 0.5% | 0.5% |
Report of Crime Past/Present | 165 | 0.5% | 0.5% |
Aided case | 154 | 0.5% | 0.5% |
C/V requested info from officer | 112 | 0.3% | 0.3% |
Dispute | 104 | 0.3% | 0.3% |
C/V at PCT to file complaint of crime | 102 | 0.3% | 0.3% |
CV already in custody | 100 | 0.3% | 0.3% |
Arrest/Complainant | 73 | 0.2% | 0.2% |
Vehicle Stop and Check | 51 | 0.2% | 0.2% |
Arrest/Not Complainant | 46 | 0.1% | 0.1% |
Parade/special event | 46 | 0.1% | 0.1% |
C/V at PCT to retrieve property | 40 | 0.1% | 0.1% |
Assist ACS or other agency | 39 | 0.1% | 0.1% |
PD auto checkpoint | 39 | 0.1% | 0.1% |
Summons/Complainant | 39 | 0.1% | 0.1% |
Regulatory inspection | 36 | 0.1% | 0.1% |
Report of Disturbance/Noise Complaint | 36 | 0.1% | 0.1% |
Demonstration/protest | 35 | 0.1% | 0.1% |
PD telephones CV | 18 | 0.1% | 0.1% |
Complainant at Pct. to make a Cmpl/info | 16 | 0.0% | 0.0% |
EDP Aided Cases | 12 | 0.0% | 0.0% |
Telephone Call to Precinct/Command | 11 | 0.0% | 0.0% |
Transit checkpoint | 11 | 0.0% | 0.0% |
Aided Cases | 10 | 0.0% | 0.0% |
Demonstrations | 8 | 0.0% | 0.0% |
Complainant Witnessing Incident | 5 | 0.0% | 0.0% |
No contact | 3 | 0.0% | 0.0% |
Victim Subject of Sex Crime | 2 | 0.0% | 0.0% |
data %>% tabyl(outcome_description) %>% arrange(desc(n)) %>% adorn_pct_formatting() %>% gt()
outcome_description | n | percent | valid_percent |
---|---|---|---|
No arrest made or summons issued | 12822 | 38.4% | 38.5% |
Arrest – other violation/crime | 10196 | 30.6% | 30.6% |
Summons – disorderly conduct | 2118 | 6.3% | 6.4% |
Summons – other violation/crime | 1940 | 5.8% | 5.8% |
Arrest – resisting arrest | 1593 | 4.8% | 4.8% |
Arrest – disorderly conduct | 1013 | 3.0% | 3.0% |
Arrest – assault (against a PO) | 852 | 2.6% | 2.6% |
Moving violation summons issued | 839 | 2.5% | 2.5% |
Arrest – OGA | 649 | 1.9% | 1.9% |
Other VTL violation summons issued | 531 | 1.6% | 1.6% |
Parking summons issued | 279 | 0.8% | 0.8% |
Disorderly-Conduct/Arr/Summons | 137 | 0.4% | 0.4% |
Arrest on Other Charge | 81 | 0.2% | 0.2% |
Traffic Summons Claimed or Issued | 59 | 0.2% | 0.2% |
Juvenile Report | 57 | 0.2% | 0.2% |
NA | 56 | 0.2% | – |
Other Summons Claimed or Issued | 38 | 0.1% | 0.1% |
Assault/Arrested | 34 | 0.1% | 0.1% |
Resisting Arrest/Arrested | 25 | 0.1% | 0.1% |
Arrest – harrassment (against a PO) | 15 | 0.0% | 0.0% |
Obstruct-Govt-Admin/Arrested | 10 | 0.0% | 0.0% |
Harrassment/Arrested/Summons | 8 | 0.0% | 0.0% |
Summons – harrassment (against a PO) | 5 | 0.0% | 0.0% |
Summons – OGA | 1 | 0.0% | 0.0% |
Disposition
data %>% tabyl(board_disposition) %>% adorn_pct_formatting() %>% gt()
board_disposition | n | percent |
---|---|---|
Exonerated | 9609 | 28.8% |
Substantiated (Charges) | 3796 | 11.4% |
Substantiated (Command Discipline A) | 964 | 2.9% |
Substantiated (Command Discipline B) | 789 | 2.4% |
Substantiated (Command Discipline) | 851 | 2.6% |
Substantiated (Command Lvl Instructions) | 454 | 1.4% |
Substantiated (Formalized Training) | 1033 | 3.1% |
Substantiated (Instructions) | 248 | 0.7% |
Substantiated (MOS Unidentified) | 1 | 0.0% |
Substantiated (No Recommendations) | 165 | 0.5% |
Unsubstantiated | 15448 | 46.3% |
So, now I want to know what is the best predictor of having a substantiated claim. I am going to use tidymodels to evaluate a lasso regression to determine the most important predictors. The largely majority of this code is borrowed from Julia Silge
First, I’ll code my outcome
library(stringr) data = data %>% mutate( outcome = if_else(str_detect(board_disposition, 'Substantiated'), 1, 0) )
Next, I’ll select the columns I want.
library(tidymodels) ## ── Attaching packages ─────────────────────────────────────────────────────────────────────── tidymodels 0.1.1 ── ## ✓ broom 0.7.0 ✓ recipes 0.1.13 ## ✓ dials 0.0.8 ✓ rsample 0.0.7 ## ✓ infer 0.5.3 ✓ tune 0.1.1 ## ✓ modeldata 0.0.2 ✓ workflows 0.1.2 ## ✓ parsnip 0.1.2 ✓ yardstick 0.0.7 ## Warning: package 'broom' was built under R version 3.6.2 ## Warning: package 'dials' was built under R version 3.6.2 ## Warning: package 'scales' was built under R version 3.6.2 ## Warning: package 'infer' was built under R version 3.6.2 ## Warning: package 'modeldata' was built under R version 3.6.2 ## Warning: package 'recipes' was built under R version 3.6.2 ## Warning: package 'rsample' was built under R version 3.6.2 ## Warning: package 'tune' was built under R version 3.6.2 ## Warning: package 'yardstick' was built under R version 3.6.2 ## ── Conflicts ────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ── ## x scales::discard() masks purrr::discard() ## x dplyr::filter() masks stats::filter() ## x recipes::fixed() masks stringr::fixed() ## x dplyr::lag() masks stats::lag() ## x yardstick::spec() masks readr::spec() ## x recipes::step() masks stats::step() data = data %>% select(command_now, command_at_incident, year_received, year_closed, rank_abbrev_incident, rank_abbrev_now, mos_ethnicity:contact_reason, outcome)
I want to see what predicts a founded complaint, and the allegation of the complaint seems too proximally related to the complaints, so I am going to drop those variables.
data = data %>% select(- c(fado_type, allegation))
Create a length of investigation variable
data$length_of_investigation = data$year_closed - data$year_received
Change years to factors
data$year_closed = factor(data$year_closed) data$year_received = factor(data$year_received)
Recode character variables to factor
data = data %>% mutate_if(is.character, as.factor)
Recoding missing in factor variables as missing
data = data %>% mutate_if(is.factor, ~ fct_explicit_na(., 'Missing'))
Change outcome to a factor
data$outcome = as.factor(data$outcome)
Look at data again
skim(data)
Name | data |
Number of rows | 33358 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
factor | 12 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
command_now | 0 | 1 | FALSE | 415 | INT: 1412, WAR: 1165, NAR: 692, DB : 683 |
command_at_incident | 0 | 1 | FALSE | 362 | Mis: 1544, 075: 1364, 046: 825, 044: 763 |
year_received | 0 | 1 | FALSE | 36 | 201: 2345, 201: 2312, 201: 2281, 201: 2220 |
year_closed | 0 | 1 | FALSE | 36 | 201: 3251, 201: 2408, 201: 2326, 201: 2322 |
rank_abbrev_incident | 0 | 1 | FALSE | 18 | POM: 19807, SGT: 5701, DT3: 2712, POF: 1398 |
rank_abbrev_now | 0 | 1 | FALSE | 20 | POM: 9454, DT3: 6838, SGT: 6067, LT: 2815 |
mos_ethnicity | 0 | 1 | FALSE | 5 | Whi: 18074, His: 9150, Bla: 4924, Asi: 1178 |
mos_gender | 0 | 1 | FALSE | 2 | M: 31598, F: 1760 |
complainant_ethnicity | 0 | 1 | FALSE | 9 | Bla: 17114, His: 6424, Mis: 4464, Whi: 2783 |
complainant_gender | 0 | 1 | FALSE | 7 | Mal: 24058, Fem: 5021, Mis: 4195, Not: 57 |
contact_reason | 0 | 1 | FALSE | 54 | PD : 10078, Oth: 4104, PD : 2981, PD : 2542 |
outcome | 0 | 1 | FALSE | 2 | 0: 25057, 1: 8301 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mos_age_incident | 0 | 1.00 | 32.35 | 6.04 | 20 | 28 | 31 | 36 | 60 | ▅▇▃▁▁ |
complainant_age_incident | 4812 | 0.86 | 32.48 | 28.41 | -4301 | 23 | 30 | 41 | 101 | ▁▁▁▁▇ |
precinct | 24 | 1.00 | 64.37 | 31.45 | 0 | 43 | 67 | 81 | 1000 | ▇▁▁▁▁ |
length_of_investigation | 0 | 1.00 | 0.80 | 0.59 | 0 | 0 | 1 | 1 | 9 | ▇▁▁▁▁ |
Dropping age at incident be | cause it’s a | lmost always bla | nk. Also | recodin | g precin | ct to | factor |
data = data %>% select(-complainant_age_incident) data$precinct = factor(data$precinct) data$precinct = fct_explicit_na(data$precinct, 'Missing') skim(data)
Name | data |
Number of rows | 33358 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
factor | 13 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
command_now | 0 | 1 | FALSE | 415 | INT: 1412, WAR: 1165, NAR: 692, DB : 683 |
command_at_incident | 0 | 1 | FALSE | 362 | Mis: 1544, 075: 1364, 046: 825, 044: 763 |
year_received | 0 | 1 | FALSE | 36 | 201: 2345, 201: 2312, 201: 2281, 201: 2220 |
year_closed | 0 | 1 | FALSE | 36 | 201: 3251, 201: 2408, 201: 2326, 201: 2322 |
rank_abbrev_incident | 0 | 1 | FALSE | 18 | POM: 19807, SGT: 5701, DT3: 2712, POF: 1398 |
rank_abbrev_now | 0 | 1 | FALSE | 20 | POM: 9454, DT3: 6838, SGT: 6067, LT: 2815 |
mos_ethnicity | 0 | 1 | FALSE | 5 | Whi: 18074, His: 9150, Bla: 4924, Asi: 1178 |
mos_gender | 0 | 1 | FALSE | 2 | M: 31598, F: 1760 |
complainant_ethnicity | 0 | 1 | FALSE | 9 | Bla: 17114, His: 6424, Mis: 4464, Whi: 2783 |
complainant_gender | 0 | 1 | FALSE | 7 | Mal: 24058, Fem: 5021, Mis: 4195, Not: 57 |
precinct | 0 | 1 | FALSE | 80 | 75: 2172, 73: 1163, 44: 1139, 46: 1120 |
contact_reason | 0 | 1 | FALSE | 54 | PD : 10078, Oth: 4104, PD : 2981, PD : 2542 |
outcome | 0 | 1 | FALSE | 2 | 0: 25057, 1: 8301 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mos_age_incident | 0 | 1 | 32.35 | 6.04 | 20 | 28 | 31 | 36 | 60 | ▅▇▃▁▁ |
length_of_investigation | 0 | 1 | 0.80 | 0.59 | 0 | 0 | 1 | 1 | 9 | ▇▁▁▁▁ |
Finally, if any factor has an n less than 10, I went to recode that to ther
recode_to_other = function(.x){ keeps = which(table(.x) > 10) %>% names() fct_other(.x, keep = keeps, other_level = 'other') } data = data %>% mutate_if(is.factor, recode_to_other) ## Warning: Unknown levels in `f`: other ## Warning: Unknown levels in `f`: other ## Warning: Unknown levels in `f`: other ## Warning: Unknown levels in `f`: other ## Warning: Unknown levels in `f`: other
Before fitting the model, I want to know the distribution of outcomes.
prop.table(table(data$outcome)) ## ## 0 1 ## 0.7511541 0.2488459
75% of complains are unfounded. So, if we fit a model without any predictors, we’ll already be at 75% accuracy. I confirm this below with a logistic regression with no predictors. The predict probability of an sustantiated claim is ~25%.
glm.mod = glm(outcome ~ 1, data = data, family = binomial(link = 'logit')) predictions = predict(glm.mod, type = 'response') table(predictions) ## predictions ## 0.248845854068537 ## 33358 data_split <- initial_split(data) data_train <- training(data_split) data_test <- testing(data_split) data_rec <- recipe(outcome~ ., data = data_train) %>% step_zv(all_numeric(), -all_outcomes()) %>% step_dummy(all_predictors(), -all_numeric()) %>% step_normalize(all_numeric(), -all_outcomes()) data_prep <- data_rec %>% prep(strings_as_factors = FALSE) summary(data_prep) ## # A tibble: 805 x 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 mos_age_incident numeric predictor original ## 2 length_of_investigation numeric predictor original ## 3 outcome nominal outcome original ## 4 command_now_X001.PCT numeric predictor derived ## 5 command_now_X005.DET numeric predictor derived ## 6 command_now_X005.PCT numeric predictor derived ## 7 command_now_X006.PCT numeric predictor derived ## 8 command_now_X007.PCT numeric predictor derived ## 9 command_now_X009.DET numeric predictor derived ## 10 command_now_X009.PCT numeric predictor derived ## # … with 795 more rows lasso_spec <- logistic_reg(mode = 'classification', penalty = 0.1, mixture = 1) %>% set_engine("glmnet") wf <- workflow() %>% add_recipe(data_rec) lasso_fit <- wf %>% add_model(lasso_spec) %>% fit(data = data_train) lasso_fit %>% pull_workflow_fit() %>% tidy() ## # A tibble: 30,881 x 5 ## term step estimate lambda dev.ratio ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1 -1.11 0.0295 9.99e-14 ## 2 (Intercept) 2 -1.11 0.0269 1.68e- 3 ## 3 (Intercept) 3 -1.11 0.0245 4.34e- 3 ## 4 (Intercept) 4 -1.11 0.0224 6.95e- 3 ## 5 (Intercept) 5 -1.11 0.0204 9.13e- 3 ## 6 (Intercept) 6 -1.11 0.0186 1.15e- 2 ## 7 (Intercept) 7 -1.11 0.0169 1.38e- 2 ## 8 (Intercept) 8 -1.12 0.0154 1.60e- 2 ## 9 (Intercept) 9 -1.12 0.0140 1.81e- 2 ## 10 (Intercept) 10 -1.12 0.0128 2.01e- 2 ## # … with 30,871 more rows set.seed(1234) data_boot <- bootstraps(data_train) tune_spec <- logistic_reg(mode = 'classification', penalty = tune(), mixture = 1) %>% set_engine("glmnet") lambda_grid <- grid_regular(penalty(), levels = 50) doParallel::registerDoParallel() set.seed(2020) lasso_grid <- tune_grid( wf %>% add_model(tune_spec), resamples = data_boot, grid = lambda_grid )
What results did we get?
lasso_grid %>% collect_metrics() ## # A tibble: 100 x 7 ## penalty .metric .estimator mean n std_err .config ## <dbl> <chr> <chr> <dbl> <int> <dbl> <fct> ## 1 1.00e-10 accuracy binary 0.739 14 0.00104 Model01 ## 2 1.00e-10 roc_auc binary 0.622 14 0.00102 Model01 ## 3 1.60e-10 accuracy binary 0.739 14 0.00104 Model02 ## 4 1.60e-10 roc_auc binary 0.622 14 0.00102 Model02 ## 5 2.56e-10 accuracy binary 0.739 14 0.00104 Model03 ## 6 2.56e-10 roc_auc binary 0.622 14 0.00102 Model03 ## 7 4.09e-10 accuracy binary 0.739 14 0.00104 Model04 ## 8 4.09e-10 roc_auc binary 0.622 14 0.00102 Model04 ## 9 6.55e-10 accuracy binary 0.739 14 0.00104 Model05 ## 10 6.55e-10 roc_auc binary 0.622 14 0.00102 Model05 ## # … with 90 more rows lasso_grid %>% collect_metrics() %>% ggplot(aes(penalty, mean, color = .metric)) + geom_errorbar(aes( ymin = mean - std_err, ymax = mean + std_err ), alpha = 0.5 ) + geom_line(size = 1.5) + facet_wrap(~.metric, scales = "free", nrow = 2) + scale_x_log10() + theme(legend.position = "none")
highest_auc <- lasso_grid %>% select_best("roc_auc", maximize = FALSE) ## Warning: The `maximize` argument is no longer needed. This value was ignored. final_lasso <- finalize_workflow( wf %>% add_model(tune_spec), highest_auc ) library(vip) ## Warning: package 'vip' was built under R version 3.6.2 ## ## Attaching package: 'vip' ## The following object is masked from 'package:utils': ## ## vi lasso_table = final_lasso %>% fit(data_train) %>% pull_workflow_fit() %>% vi(lambda = highest_auc$penalty) %>% mutate( Importance = abs(Importance), Variable = fct_reorder(Variable, Importance) ) lasso_table %>% mutate(Importance = round(Importance, 2)) %>% filter(Importance > .1) %>% arrange(desc(Importance)) %>% gt()
Variable | Importance | Sign |
---|---|---|
year_closed_X2007 | 0.26 | NEG |
year_closed_X2008 | 0.23 | NEG |
year_closed_X2011 | 0.22 | NEG |
complainant_gender_Missing | 0.20 | POS |
contact_reason_Execution.of.search.warrant | 0.19 | NEG |
year_closed_X2009 | 0.18 | NEG |
year_closed_X2015 | 0.16 | POS |
complainant_gender_Male | 0.14 | POS |
command_at_incident_ND.SEQI | 0.13 | POS |
year_closed_X2004 | 0.13 | POS |
year_received_X2006 | 0.12 | POS |
year_received_X2003 | 0.12 | NEG |
year_closed_X2012 | 0.12 | NEG |
year_closed_X1998 | 0.11 | POS |
year_closed_X2013 | 0.11 | NEG |
Probably the most interesting finding here is that the excecution of a search warrant predicts a lower likelihood fo having a substantiated claim, which makes sense.
Next, let’s look at how the model actually performed.
last_fit( final_lasso, data_split ) %>% collect_metrics() ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.748 ## 2 roc_auc binary 0.643
Unfortunately, the model did not perform well at all. We would just as accurate if we had went without any predictors at all.
Conclusion
This looks like a situation where a predictive model doesn’t tell us much of anything, which makes sense. I don’t really know much of anything with regards to substantiated claims of police misconduct, and I am probably missing lots of important variables that can describe the process. A predictive model that is applied to data without any idea on the data generation process is bound to be useless.
However, the descriptive statistics are very interesting here. In particular, I found it interesting that the following categories make up over 50% of all claims of police misconduct:
- PD suspected C/V of violation/crime – street
- Other
- PD suspected C/V of violation/crime – auto
- PD suspected C/V of violation/crime – bldg
- Moving violation
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.