Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Recently, I had a real pleasure to work with various types of data pulled from public APIs, one of them being data.police.uk API. Oh, those hours of pure intellectual exploration it’s given me! I have a soft spot for crime data and I explored it using heat maps in the past. Apart from checking and visualising stats for the new area in London we just moved to, it made me think more about good and better ways of presenting complex and multidimensional information. I’m dedicating this post to my favourite heat maps, so expect some lovely colours (side by side with insightful findings on London crime)!
Project description
I’m going to scrape Wikipedia to get coordinates of all Tube Stations in London. Then I’ll pick a random sample of 20 of them and use their latitude and longitude to pull all crime information for those locations between Jan 2016 and June 2018. Then, I’ll explore crime frequency and crime type per location over time using faceted heatmaps. Finally, I play a bit with leaflet
package to explore best way of visualising this data on geographical heat maps.
Packages
Here’s a sweet collection of packages required to run this analysis:
library(httr) library(rvest) library(purrr) library(dplyr) library(ggplot2) library(jsonlite) library(lubridate) library(stringr) library(viridis) library(leaflet) library(leaflet.extras) library(htmltools)
Scraping Wiki
Let’s first start with scraping a table from Wikipedia website that holds coordinates for all London Tube stations. I used Google Selector tool for it, you can learn more about it here:
### scraping with rvest() #### # Wikipedia linke wiki_link <- "https://wiki.openstreetmap.org/wiki/List_of_London_Underground_stations" # scraping information from the table in the above URL wiki_tbl <- wiki_link %>% read_html() %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[1]') %>% html_table(fill = TRUE) %>% as.data.frame() head(wiki_tbl) ## Name Latitude Longitude Platform...Entrance ## 1 Acton Town 51.502500 -0.278126 Platform ## 2 Acton Central 51.50883531 -0.263033174 Entrance ## 3 Acton Central 51.50856013 -0.262879534 Platform ## 4 Aldgate 51.51394 -0.07537 Aldgate High Street entrance ## 5 Aldgate East 51.51514 -0.07178 Entrance ## 6 Alperton 51.54097 -0.30061 Platform ## Collected.By Collected.On Line Step.free ## 1 User:Gagravarr 24/11/06 District, Piccadilly <NA> ## 2 User:Firefishy 08/05/2007 London Overground <NA> ## 3 User:Firefishy 08/05/2007 London Overground <NA> ## 4 User:Morwen 28/4/2007 Metropolitan No ## 5 User:Parsingphase (2006) District, Hammersmith & City <NA> ## 6 User:Mattwestcott 28/01/2007 Piccadilly <NA>
First thing done. Now, it would take a stupid amount of data to pull all the crime information for extended periods of time for ALL of the Tube stations, so let’s pick a sample of 20 places instead and take it from there:
# pick only coordinates and a station name from Wikipedia table wiki_selected <- wiki_tbl %>% select(name = Name, lat = Latitude, long = Longitude) # pick a random sample of 20 Tube stations set.seed(999) k <- sample(as.numeric(rownames(wiki_selected)), 20) wiki_sample <- wiki_selected[k, ] wiki_sample ## name lat long ## 118 Heathrow Terminals 1-2-3 51.4712899 -0.4527438 ## 176 North Acton 51.52370 -0.26019 ## 29 Boston Manor 51.495371 -0.325573 ## 255 Tooting Broadway 51.42780008 -0.167910811 ## 235 South Woodford 51.59088 +0.02726 ## 36 Brondesbury Park 51.54060 -0.21093 ## 180 North Wembley 51.56258091 -0.304072648 ## 24 Bethnal Green 51.52718 -0.05504 ## 115 Harrow & Wealdstone 51.59205973 -0.334725352 ## 182 Northolt 51.548054 -0.368257 ## 84 Epping 51.69365 +0.11495 ## 161 Manor House 51.57038 -0.09601 ## 9 Archway 51.56536 -0.13474 ## 286 West Hampstead 51.54755518 -0.190999685 ## 30 Bounds Green 51.60700 -0.12418 ## 226 Sloane Square 51.49258474 -0.156090904 ## 48 Chal & Latimer 51.66822 -0.56053 ## 298 Wimbledon 51.42200 -0.20544 ## 38 Burnt Oak 51.60300 -0.26427 ## 20 Bayswater 51.51224 -0.187569
Pulling data from data.police.uk API
Now that we have coordinates, we can feed them into the UK Police API that requires the following parameters: latitude, longitude and year-month which we want to pull crime data for. To make everyone’s life easier, I define a function get_crime()
which will require all three parameters but also a name of the station that coordinates correspond to:
get_crime <- function(lat, long, date, name) { # pull specified data from the API base <- "https://data.police.uk/api/crimes-street/all-crime" police_url <- paste0(base, "?", "lat=", lat, "&lng=", long, "&date=", date) get_police <- GET(police_url) # parse JSON file into a clean data frame police_data_json <- content(get_police, "text") police_data <- fromJSON(police_data_json, flatten = TRUE) %>% mutate(location = name) return(police_data) }
So far so good. However, our brand new get_crime()
function can pull data only for a single location and a single month. Let’s use it in the for loop
together with amazing pmap()
function from purrr
package to get all the data we need. WARNING: the code below needs good 15 minutes to run, so think twice before you execute it!
# create a series of months that we want data for iter_months <- str_sub(seq(ymd('2016-01-01'), ymd('2018-06-20'), by = 'month'), start = 1, end = 7) # pull data from API for all the locations over all the months final_df<-data.frame() for(i in 1:length(iter_months)){ # result will be a list pre_final_list <- pmap(list(lat = wiki_sample$lat, long = wiki_sample$long, name = wiki_sample$name, date = iter_months[i]), get_crime) # turn a list of locations in one month into a data.frame pre_final_df <- bind_rows(pre_final_list) # put all the clean data.frames together final_df <- bind_rows(final_df, pre_final_df) }
Phew! That was hard work, but it was clearly worth it! Let’s have a glimpse()
!
## Observations: 421,496 ## Variables: 14 ## $ category <chr> "anti-social-behaviour", "anti-social-... ## $ location_type <chr> "Force", "BTP", "Force", "Force", "For... ## $ context <chr> "", "", "", "", "", "", "", "", "", ""... ## $ persistent_id <chr> "", "", "", "", "", "", "", "", "", ""... ## $ id <int> 46149941, 64792634, 46150855, 46150854... ## $ location_subtype <chr> "", "LU STATION", "", "", "", "", "", ... ## $ month <chr> "2016-01", "2016-01", "2016-01", "2016... ## $ location.latitude <chr> "51.469708", "51.471600", "51.479579",... ## $ location.longitude <chr> "-0.451842", "-0.452836", "-0.463993",... ## $ location.street.id <int> 944231, 1488721, 944266, 944296, 94426... ## $ location.street.name <chr> "On or near Airport/airfield", "Heathr... ## $ outcome_status.category <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA... ## $ outcome_status.date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA... ## $ location <chr> "Heathrow Terminals 1-2-3", "Heathrow ...
Looks great, but did we really pull data for all the locations and all months?
table(final_df$location, final_df$month) ## ## 2016-01 2016-02 2016-03 2016-04 2016-05 2016-06 ## Archway 831 832 814 809 862 914 ## Bayswater 1377 1198 1365 1495 1524 1617 ## Bethnal Green 1803 1734 1827 1891 2144 2251 ## Boston Manor 291 239 278 323 324 294 ## Bounds Green 523 512 581 509 587 590 ## Brondesbury Park 870 777 805 847 1025 987 ## Burnt Oak 557 478 600 544 643 661 ## Chal & Latimer 17 25 20 26 18 14 ## Epping 68 77 93 107 92 115 ## Harrow & Wealdstone 484 423 498 486 530 541 ## Heathrow Terminals 1-2-3 166 143 152 137 212 209 ## Manor House 821 819 944 926 1045 1057 ## North Acton 349 331 346 356 391 400 ## North Wembley 472 453 506 477 574 590 ## Northolt 307 257 275 326 349 387 ## Sloane Square 1248 1040 1084 1260 1414 1312 ## South Woodford 225 169 261 251 236 217 ## Tooting Broadway 554 561 621 651 743 662 ## West Hampstead 790 784 790 807 973 909 ## Wimbledon 378 346 422 297 348 359 ## ## 2016-07 2016-08 2016-09 2016-10 2016-11 2016-12 ## Archway 1020 918 876 983 816 871 ## Bayswater 1757 2119 1494 1596 1505 1608 ## Bethnal Green 2600 2459 2296 2278 2078 2036 ## Boston Manor 351 358 313 327 277 279 ## Bounds Green 642 550 572 622 567 610 ## Brondesbury Park 1104 1179 1043 1055 891 977 ## Burnt Oak 648 639 628 628 644 563 ## Chal & Latimer 18 23 26 22 20 21 ## Epping 81 92 87 108 108 103 ## Harrow & Wealdstone 608 573 521 526 471 536 ## Heathrow Terminals 1-2-3 208 221 210 211 207 262 ## Manor House 1218 1096 1045 1163 1020 987 ## North Acton 394 461 459 441 405 427 ## North Wembley 669 568 539 597 503 594 ## Northolt 462 393 381 315 313 347 ## Sloane Square 1465 1526 1327 1343 1249 1342 ## South Woodford 247 256 229 255 227 218 ## Tooting Broadway 836 751 771 746 711 653 ## West Hampstead 973 894 888 925 829 956 ## Wimbledon 454 424 394 390 416 362 ## ## 2017-01 2017-02 2017-03 2017-04 2017-05 2017-06 ## Archway 867 851 1004 1021 1186 1187 ## Bayswater 1339 1227 1744 1579 1537 1507 ## Bethnal Green 1867 1991 2090 2126 2217 2089 ## Boston Manor 305 281 337 302 334 367 ## Bounds Green 535 552 601 618 657 582 ## Brondesbury Park 907 869 1050 924 1100 1082 ## Burnt Oak 532 480 597 600 654 581 ## Chal & Latimer 29 23 25 24 34 30 ## Epping 83 83 122 86 130 112 ## Harrow & Wealdstone 485 494 543 494 543 529 ## Heathrow Terminals 1-2-3 226 198 231 209 260 231 ## Manor House 959 1004 1180 1109 1178 1253 ## North Acton 324 370 449 429 432 385 ## North Wembley 448 460 527 501 582 576 ## Northolt 281 271 347 373 342 392 ## Sloane Square 1166 1094 1273 1199 1470 1269 ## South Woodford 254 233 290 231 302 319 ## Tooting Broadway 572 604 710 642 707 733 ## West Hampstead 853 848 1026 956 1078 1026 ## Wimbledon 359 356 414 384 432 378 ## ## 2017-07 2017-08 2017-09 2017-10 2017-11 2017-12 ## Archway 1291 1305 1244 1205 1089 958 ## Bayswater 1772 2134 1526 1592 1573 1531 ## Bethnal Green 2526 2103 2070 2239 2112 1742 ## Boston Manor 351 372 316 365 354 246 ## Bounds Green 635 719 658 729 791 606 ## Brondesbury Park 1164 1181 1140 1182 1085 1007 ## Burnt Oak 705 629 605 638 594 497 ## Chal & Latimer 37 29 26 26 39 26 ## Epping 114 118 132 140 116 146 ## Harrow & Wealdstone 643 575 545 568 562 485 ## Heathrow Terminals 1-2-3 211 248 247 239 206 250 ## Manor House 1410 1188 1183 1240 1097 993 ## North Acton 528 453 407 460 402 396 ## North Wembley 550 534 501 593 574 533 ## Northolt 351 341 312 328 306 288 ## Sloane Square 1533 1484 1322 1305 1351 1424 ## South Woodford 315 323 355 356 330 282 ## Tooting Broadway 711 675 699 801 775 689 ## West Hampstead 1157 1027 1070 1181 1050 946 ## Wimbledon 424 436 392 405 408 371 ## ## 2018-01 2018-02 2018-03 2018-04 2018-05 2018-06 ## Archway 1086 932 956 1038 1008 1160 ## Bayswater 1457 1392 1470 1663 1550 1597 ## Bethnal Green 2033 1668 1902 2160 2254 2315 ## Boston Manor 289 282 303 345 348 338 ## Bounds Green 597 600 612 614 645 666 ## Brondesbury Park 1137 866 992 1068 1106 1136 ## Burnt Oak 515 518 602 634 693 645 ## Chal & Latimer 25 22 24 31 31 35 ## Epping 145 100 113 88 123 98 ## Harrow & Wealdstone 604 478 536 533 618 557 ## Heathrow Terminals 1-2-3 201 204 238 210 208 233 ## Manor House 1032 899 1012 1136 1131 1243 ## North Acton 377 345 390 397 397 405 ## North Wembley 492 515 566 529 547 573 ## Northolt 325 274 332 324 377 328 ## Sloane Square 1304 1331 1385 1260 1393 1414 ## South Woodford 306 258 278 288 248 287 ## Tooting Broadway 723 568 608 687 795 786 ## West Hampstead 1058 823 1016 941 923 1000 ## Wimbledon 366 355 411 376 444 397
Oh, yes we did!
Let’s finish off with some minor data cleaning: API returns coordinates for crimes, but not for the original location of interest (Tube station), let’s fix it:
final_df <- final_df %>% left_join(wiki_sample, by = c("location" = "name")) %>% rename(date = month, search_lat = lat, search_long = long)
Finally, for some heat maps, all time data will be too much to handle. In those cases I’ll focus on one month only, July 2017, and I pick only random 10% of it, to make interactive maps more responsive:
july_data <- final_df %>% filter(date == "2017-07") %>% sample_frac(.1) ## Warning: package 'bindrcpp' was built under R version 3.4.4
Heat-mapping
FACETED HEAT MAPS
Here come the heat maps! Let’s start summarising the frequency of all crimes for each location and month:
# summarise available data and save it in a data.frame police_grid <- final_df %>% unique() %>% group_by(location, date) %>% summarise(n_crimes = n()) ### faceted heat map of all the crimes per location and month! ggplot(police_grid,aes(x=date,y=location, fill=n_crimes))+ #add border white colour of line thickness 0.25 geom_tile(colour="white",size=0.25)+ labs(x="",y="")+ #remove extra space scale_y_discrete(expand=c(0,0))+ #define new breaks on x-axis scale_x_discrete(expand=c(0,0), breaks=c("2016-01","2017-01","2018-01"))+ scale_fill_viridis(option = "B", name = "Number of crimes") + ggtitle("Number of crimes at London Tube stations") + coord_fixed()+ #set a base size for all s theme_grey(base_size=8)+ #theme options theme( # vertical labels on x axis axis.text.x = element_text(), #bold for both axis text axis.text=element_text(face="bold"), #set thickness of axis ticks axis.ticks=element_line(size=0.4), #remove plot background plot.background=element_blank(), #remove plot border panel.border=element_blank() )
So pretty! I’m a big-big fan of viridis
colour pallette and I must say you see it at its best in heat maps. And how informative this heat map is! From the first glimpse you can see straight away which areas tend to have more crime (e.g. Bethnal Green, Bayswater) and when (summer months). We know what and when, what about the type of crime they experiene most? This can be sorted with another faceted heatmap:
# summarise data by location and crime type crime_grid <- final_df %>% group_by(location, category) %>% summarise(n_crimes = n()) # plot above data in faceted heat map ggplot(crime_grid,aes(x=category,y=location, fill=n_crimes))+ geom_tile(colour="white",size=0.25)+ labs(x="",y="")+ scale_y_discrete(expand=c(0,0))+ scale_fill_viridis(option = "B", name = "Number of crimes") + coord_fixed()+ theme_grey(base_size=8)+ theme( axis.text.x = element_text(angle = 45, hjust = 1), axis.text=element_text(face="bold"), axis.ticks=element_line(size=0.4), plot.background=element_blank(), panel.border=element_blank() ) #+
#guides(fill=guide_legend(title="Number of crimes"), # colour = guide_legend(reverse=T))
Another informative beauty! You can see that 3 types of crime dominate: anit-social behaviour, violent crime and vehicle theft. Again, avoid Bethnal Green if you can and make sure you hang out in Epping, instead!
GEOGRAPHIC HEAT MAPS
Facets are brilliant when it comes to quantitative comparison, but where are these areas? are they close/far away from each other? These questions can be only answered by geo heat maps. Lets plot a very basic one using leaflet
package:
july_data %>% leaflet() %>% addProviderTiles(providers$CartoDB.Positron) %>% addHeatmap(lng=~as.numeric(location.longitude), lat=~as.numeric(location.latitude), radius = 8)
Great, it looks like the farther away you are from the city centre, the fewer crimes you see (doh!). Now, we can add information about number of plotted crime-points using clusterOptions
functionality, have a look:
# adding color schemes to crime types color_scheme <- viridis::cividis(n_distinct(july_data$category)) pal = colorFactor(color_scheme, july_data$category) july_data %>% leaflet() %>% addProviderTiles(providers$CartoDB.Positron) %>% addCircleMarkers(~as.numeric(location.longitude), ~as.numeric(location.latitude), fillColor = ~pal(category), stroke = FALSE, fillOpacity = 0.8, clusterOptions = markerClusterOptions(), # adds summary circles popup = ~as.character(category) ) %>% addHeatmap(lng=~as.numeric(location.longitude), lat=~as.numeric(location.latitude), radius = 8)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.