Plotting Driving Routes and Rental Data for Houston | gepaf, gmap, plotly, Leaflet

Abdalla

5 years ago

[This article was first published on r – Recommended Texts, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I started with a new script to scrape rental data for Houston, and i’m curious to see if i could not only plot the posts and the rentals, but also a cluster plot using Leaflet. I’m also eager to try out the gepaf, gmapdistance, and the Google Directions API to get distance data and polyline strings for two given coordinates, and then maybe try to plot those too.

But first, i need data. So i had to come up with another script that’ll scrape rental data for Houston. As always, I’ve changed the site’s URL in the script below so the URL below is not the real deal.


#load packages
library(dplyr)
library(rvest)
library(stringr)

#---get URLs----------------------------------------------

urlsFin <- c()

seq(from=5,to=5900,by=5) -> pages

#start loop to scrape post urls
for(i in 1:196){ 
  
  #url to site's search results page
  #needs to be amended to include search results
  link <- paste('http://www.HoustonRentals.com/Rentals/HoustonTexas/', i, '_p', sep = "")
  
  hLink <- 1
  
  #get HTML
  try(hLink <- html(link))

    try(   
      hLink %>% #scrape the urls leading to the post
        html_nodes(".primaryLink.pdpLink.activeLink") %>% 
        html_attr("href") %>% unique() -> urls
    )
  
  urlsFin <- c(urlsFin, urls)
  
  print(paste("iteration No. ", i))
  
  if(i %in% pages){Sys.sleep(20)}
  
}

#-----------------------------------------------------------
    
      #add prefix to the urls 
      paste("http://www.HoustonRentals.com", urlsFin, sep = "") -> cUrls
    
      #empty dataframe to populate later
      matrix(NA, nrow = length(cUrls), 8) %>% 
      data.frame() -> cDf
    
      #name columns
      names(cDf) <- c("r_Range", "beds", "baths", "desc", "desc_2", "lat", "long", "crime")

#---Start scrape---------------------------------------------------
  
  #navigate to each post and extract details
      for(i in 1:length(cUrls)){
        
        cHLink <- hLink <- r_Range <- beds <- baths <- desc <- desc_2 <- crime <- "N.A."
      
        #get HTML for the post
        try(hLink <- html(cUrls[i]))
      
        if(class(hLink) != "N.A."){ 
        
          #rental range
          try(
            hLink %>% 
              html_nodes("#rentalPdpContactLeadForm > div.backgroundInverseAlternate.phl.pvm.priceModule > div > span > span") %>% 
              html_text() %>% str_trim() -> r_Range)
          
          #bedrooms
          try( 
            hLink %>% 
              html_nodes("#propertyDetails > div.col.cols16.ptm > ul:nth-child(3) > li:nth-child(1)") %>% 
              html_text() %>% str_trim() -> beds)
        
          #bathrooms
          try(     
            hLink %>% 
              html_nodes("#propertyDetails > div.col.cols16.ptm > ul:nth-child(3) > li:nth-child(2)") %>% 
              html_text() %>% str_trim() -> baths)
           
          #residence decription
          try(  
            hLink %>% 
              html_nodes(".mediaBody.h7.ptl.prxl.prxxl > p:nth-child(1)") %>% 
              html_text() %>% str_trim() %>% str_replace_all(pattern = "\n", replacement = "") %>% 
              str_replace_all(pattern = "  ", replacement = " ") -> desc)
          
          #crime
          try(
            hLink %>% html_nodes("#listingHomeDetailsContainer > div:nth-child(5) > div.col.lastCol > div.box.boxBasic.mtn.backgroundLowlight.clickable.crime_glance_details > div.boxBody.media.pvs.h5") %>% 
                html_text() %>% str_replace_all(pattern = "Crime", replacement = "") %>% str_trim() -> crime)
          
            
          #stat details
          try( 
            hLink %>% 
              html_nodes(".mediaBody.h7.ptl.prxl.prxxl > p:nth-child(2)") %>% 
              html_text() %>% str_trim() %>% str_replace_all(pattern = "\n", replacement = "") %>% 
              str_replace_all(pattern = "  ", replacement = " ") -> desc_2)
          
          #convert HTML to class character
          try(hLink %>% as.character() -> cHLink)
          
          #find location of latitude and longitude
          try(cHLink %>% str_locate(pattern = "latitude") %>% as.vector() -> latLoc)
          try(cHLink %>% str_locate(pattern = "longitude") %>% as.vector() -> lonLoc)
          
          #extract coordinates
          try(substr(cHLink, latLoc[2] + 3, lonLoc[1] - 3) -> latitude)
          try(substr(cHLink, lonLoc[2] + 3, lonLoc[2] + 12) -> longitude)
          
          #populate empty dataframe
          try(cDf[i, "r_Range"] <- r_Range)
          try(cDf[i, "beds"] <- beds)
          try(cDf[i, "baths"] <- baths)
          try(cDf[i, "desc"] <- desc)
          try(cDf[i, "desc_2"] <- desc_2)
          try(cDf[i, "lat"] <- latitude)
          try(cDf[i, "long"] <- longitude)
          try(cDf[i, "crime"] <- crime)
          
        }
        
        #progress
        paste("Iteration No. ", i, sep = "") %>% print()
        
        #wait 20 seconds
        if(i %in% pages){Sys.sleep(20)}
        
      }

#---------------------------------------------------------------
  
  write.csv(cDf, "Houston.csv", row.names = FALSE) #backup
  
  #make another copy for cleaning process
  cDf -> CDfCopy

Alright. At this point, i have the raw data that shows:

1. Rental range
2. No. of bedrooms
3. No. of baths
4. Description of the residence
5. Latitude of the residence
6. Longitude of the residence
7. Crime level of the area that the residence is situated

Now it’s time for the cleaning.


#_______Cleaning_________
  
  #Remove incorrect populations
  filter(CDfCopy, !grepl(pattern = "ath", CDfCopy$beds), 
         !grepl(pattern = "ay", CDfCopy$beds),
         !grepl(pattern = "ay", CDfCopy$baths),
         !grepl(pattern = "N.A.", CDfCopy$beds),
         !grepl(pattern = "N.A.", CDfCopy$beds), 
         !is.na(CDfCopy$beds), 
         !is.na(CDfCopy$baths)) -> CDfCopy
  
  
  #1. coordinates

   #unique latitudes
  strsplit(CDfCopy[, "lat"], "") %>% 
    unlist() %>% unique() -> NA_lats
  
   #location of latitudes with special characters
  strsplit(CDfCopy[, "lat"], "") %>% 
    unlist() %>% unique() %>% 
    as.numeric() %>% is.na() -> NA_lats_log
  
   #extract only the special characters
  NA_lats[NA_lats_log] -> NA_lats_fin
  NA_lats_fin[!is.na(NA_lats_fin)] -> NA_lats_fin
  NA_lats_fin[NA_lats_fin != "."] -> NA_lats_fin
 
    #doing the same for longitudes
  strsplit(CDfCopy[, "long"], "") %>% 
    unlist() %>% unique() -> NA_longs
  
  strsplit(CDfCopy[, "long"], "") %>% 
    unlist() %>% unique() %>% 
    as.numeric() %>% is.na() -> NA_longs_log
  
  NA_longs[NA_longs_log] -> NA_longs_fin
  NA_longs_fin[!is.na(NA_longs_fin)] -> NA_longs_fin
  NA_longs_fin[NA_longs_fin != "."] -> NA_longs_fin
  NA_longs_fin[NA_longs_fin != "-"] -> NA_longs_fin
  
#identify all lat/lon with special characters and remove them
for(i in 1:nrow(CDfCopy)){
  
  if(length(NA_lats_fin) != 0){ 
      #replace characters for the latitudes
      for(j in 1:length(NA_lats_fin)){
        
        str_replace_all(string = CDfCopy[i, "lat"], 
                        pattern = NA_lats_fin[j], 
                        replacement = "") -> CDfCopy[i, "lat"]
        
      }
  }
  
  if(length(NA_longs_fin) != 0){ 
      #replace characters for the longitudes
      for(j in 1:length(NA_longs_fin)){
        
        str_replace_all(string = CDfCopy[i, "long"], 
                        pattern = NA_longs_fin[j], 
                        replacement = "") -> CDfCopy[i, "long"]
        
      }
  }
  
}
  
 #rentals, bedrooms, and bathroom data are given as a range
 #need to separate the min and max for each in a different column

 #new empty columns to be populated later
  CDfCopy[, "min_Rental"] <- NA
  CDfCopy[, "max_Rental"] <- NA

  CDfCopy[, "min_Rooms"] <- NA
  CDfCopy[, "max_Rooms"] <- NA

  CDfCopy[, "min_Baths"] <- NA
  CDfCopy[, "max_Baths"] <- NA
  
  
  for(i in 1:nrow(CDfCopy)){
    
    #2. Rentals
 
    CDfCopy[i, 1] %>% str_locate_all(pattern = "\\$") %>% 
      unlist() %>% unique() -> locs
    
    if(length(locs) == 2){ 
      
      substr(CDfCopy[i, "r_Range"], locs[1], locs[2] - 6) %>% 
        str_replace_all(pattern = "\\$", replacement = "") %>% 
        str_replace_all(pattern = ",", replacement = "") %>% 
        as.numeric() -> CDfCopy[i, "min_Rental"]
      
      substr(CDfCopy[i, "r_Range"], locs[2], nchar(CDfCopy[i, "r_Range"])) %>% 
        str_replace_all(pattern = "\\$", replacement = "") %>% 
        str_replace_all(pattern = ",", replacement = "") %>% 
        as.numeric() -> CDfCopy[i, "max_Rental"]
      
    }else{
        
      if(length(locs) == 1){
        
        substr(CDfCopy[i, "r_Range"], locs[1], nchar(CDfCopy[i, "r_Range"])) %>% 
          str_replace_all(pattern = "\\$", replacement = "") %>% 
          str_replace_all(pattern = ",", replacement = "") %>% 
          as.numeric() -> CDfCopy[i, "min_Rental"]
        
      }
      
    }
    
    #3. Bedrooms

    if(grepl(pattern = "-", CDfCopy[i, "beds"])){
      
      CDfCopy[i, "beds"] %>% str_locate(pattern = "-") %>% 
        as.vector() %>% unique() -> locs
      
      substr(CDfCopy[i, "beds"], 0, locs - 2) -> CDfCopy[i, "min_Rooms"]
      
      substr(CDfCopy[i, "beds"], locs + 2, nchar(CDfCopy[i, "beds"])) -> CDfCopy[i, "max_Rooms"]
      
    }else{
      
      CDfCopy[i, "beds"] -> CDfCopy[i, "min_Rooms"]
      
    }
    
    #4. Bathrooms
    if(grepl(pattern = "-", CDfCopy[i, "baths"])){
      
      CDfCopy[i, "baths"] %>% str_locate(pattern = "-") %>% 
        as.vector() %>% unique() -> locs
      
      substr(CDfCopy[i, "baths"], 0, locs - 2) -> CDfCopy[i, "min_Baths"]
      
      substr(CDfCopy[i, "baths"], locs + 2, nchar(CDfCopy[i, "baths"])) -> CDfCopy[i, "max_Baths"]
      
    }else{
      
      CDfCopy[i, "baths"] -> CDfCopy[i, "min_Baths"]
      
    }
    
    
  }
  
  
  CDfCopy$min_Rooms %>% 
    str_replace_all(pattern = " Bedrooms", replacement = "") %>% 
    str_replace_all(pattern = " Bedroom", replacement = "") -> CDfCopy$min_Rooms_clean

Now that i have a nice tidy dataframe, it’s time to add three more columns:

1. Distance between the origin and the destination.
2. Time it takes to drive from the origin to the destination.
3. Encoded polyline of the route.

I’ll be using the University of Houston as the origin, while the destinations will be coordinates for each residency in the data frame. If you’re not too familiar with what encoded polylines are, you can check out the documentation on the Google Maps API page. Basically, it is a way to store a series of coordinates as a string. So instead of rows showing latitudes and longitudes, you can have the entire list of coordinates shown as a string. I would need these encoded polylines so that i could then convert these polylines to actual coordinate data, and then plot it on the Leaflet map.


#load packages
library(gmapsdistance)
library(gepaf)

#___add distance, driving time, and encoded polylines____
  
  CDfCopy$distance <- NA
  CDfCopy$drive_time <- NA
  CDfCopy$poly <- NA
  
  #Coordinates of the University of Houston
  Origin <- paste("29.719471", "-95.342308", sep = "+")
  
  Key <- "GOOGLE_API_KEY_GOES_HERE" #Obviously, i can't share this.
  
 #function to get the encoded polylines using Google API
  gDir <- function(origin, long, lat, key){
    
    link_1 <- "https://maps.googleapis.com/maps/api/directions/json?origin="
    link_2 <- "destination="
    link_3 <- "key="
    
    #concatenate coordinates
      coordinates <- paste(lat, long, sep = "+")
    
    #concatenate url
      url <- paste(paste(link_1, origin, sep = ""), 
                 paste(link_2, coordinates, sep = ""), 
                 key, sep = "&")
    
    #retrieve json
      jsonlite::fromJSON(url) -> jData
    
    #extract polyline
      as.character(jData$routes$overview_polyline) -> Poly
    
    #coerce to character
      as.character(Poly) -> Poly
    
    return(Poly)
    
  }
  
 #NOTE: Google limits requests to 2,500 per day, for free
 #Anything more than 2,500 is charged

  for(i in 1:nrow(CDfCopy)){
    
    gOutput_d <- NA
    gOutput_t <- NA
    
 
    try(
          gmapsdistance(Origin, 
                        paste(CDfCopy$lat[i], 
                              CDfCopy$long[i], sep = "+") , 
                        key = Key, mode = "driving") -> gOutput
    )
    
   #driving time is given in seconds, convert to minutes
    try(gOutput_t <- gOutput$Time / 60)
   #distance is given in meters, converting to KM
    try(gOutput_d <- gOutput$Distance / 1000)
    
    try(CDfCopy$distance[i] <- gOutput_d)
    try(CDfCopy$drive_time[i] <- gOutput_t)
    
    try(CDfCopy$poly[i] <- gDir(Origin, CDfCopy$long[i], CDfCopy$lat[i], Key))
    
    print(paste("Completed iteration No.", i, sep = ""))
    
  }

Alright, now that i got the data, it’s time for some plots. Starting off with taking a look at how the rental data is dispersed for each room type. I’ve used the ggplot2 package to plot data before, but i’ve never combined it with the ggplotly r package:

#Rental Boxplots
ggplotly(

ggplot(CDPlot) + 
        geom_boxplot(aes(min_Rooms, 
                      min_Rental, fill = min_Rooms)) +
        ylim(0, 3000) + 
        scale_x_discrete(limits=c('Studio', 1, 2, 3, 4, 
                                  5, 6), expand = c(0,-3)) + 
	labs(title = "Rentals Rates per Apartment Type", 
		x = "Number of Rooms", 
		y = "Rental") + 
	theme(plot.title = element_text(face="bold"))

         ) %>% 

htmlwidgets::saveWidget(file = "plotly.html", selfcontained = FALSE)

I can’t embed the graph here because it’ll look all funny, but you can check it out by clicking here. Quite natural for the Studio category to have about the same rental rate as the 1-Bedroom, since they are practically the same thing for a lot of people.

It would be interesting to see where most of the rental posts are located using Leaflet’s cool cluster markers. I’ve never actually tried this out before, but here goes.

#cluster map
library(leaflet)

leaflet(CDPlot) %>% 

	addProviderTiles("CartoDB.DarkMatter") %>% 

	addCircleMarkers(

  		lng = ~as.numeric(CDPlot$long), 
		lat = ~as.numeric(CDPlot$lat), 
		clusterOptions = markerClusterOptions(), 
		radius = 3

				) %>% 

htmlwidgets::saveWidget(file = "Leaflet.html", 
				selfcontained = FALSE)

You can try clicking on any of the cluster markers and it will zoom in and break the cluster down further. Pretty neat stuff.

And finally, plotting the polylines. I actually ran in to a little bit of trouble plotting the driving routes that i retrieved from the Google Directions API. The thorn was that plotting a single line on the Leaflet map requires the use of the addPolylines() function, but that was just it. I couldn’t figure out how to plot multiple lines (in this case hundreds) of lines in the most concise way possible. The kept coming back to the ridiculous idea of using hundreds of addPolylines() functions, which just sounds silly.

I was stumped, so i took a sample and asked the nice community at Stack Overflow for some help. The solution was to convert the coordinates decoded from the polylines in to a “Spatial Lines” object. I had never heard of this term before, so i had to read up on it a bit. I worked on the code sample given back by one of the users, and i’ve managed to make it work. Below is the code and the result.


#load package
library(leaflet)

#filter only Studios and 1-bedroom apartments
filter(dfPlot, rooms == c("Studio", "1")) -> dfPlot

#isolate polylines
dfPlot$polylines -> polylines

#__Below is the sample i got from stack over flow, but slightly modified_______
#__comments are mine except where explicitly mentioned_________________________

    #decode all polylines and turn in to list
    lapply(polylines, function(x) gepaf::decodePolyline(x)) -> lPoly
    
    #rowbind into data.table, with unique ID column
    rbindlist(lPoly, idcol = "id") -> dfPoly
    
    #into list with unique IDs
    lst_lines <- lapply(unique(dfPoly$id), function(x){
      #Comment from stack over flow user:
        #"the order of the 'lon' and 'lat' fields is important"
      Lines(Line(dfPoly[id == x, .(lon, lat)]), ID = x)
    })
#______________________________________________________________________________

#Map distances to colors
colorNumeric(c("#00FF15", "#faff00","#FF0000", "#17129e"), 
             domain = dfPlot$distances,
             alpha = FALSE) -> pal

#Map rental rates to colors
colorNumeric(c("#ff00ee", "#00e5ff", "#ffff00"), 
             domain = dfPlot$rentals,
             alpha = FALSE) -> pal_rent

#Convert to spatial lines object
lst_lines %>% 
  SpatialLines() %>% 
  
  #generate leaflet map
    leaflet() %>% 
      
      #use a dark theme
      addProviderTiles('CartoDB.DarkMatterNoLabels') %>% 
  
      #plot the lines with the values mapped
      addPolylines(color = pal(dfPlot$distances), 
                   opacity = 0.3, fillOpacity = 0.1, 
                   weight = 3) %>% 
  
      #add circle markers with colors mapped and...
      #..with popups
      addCircleMarkers(lng = as.numeric(dfPlot$long), 
                       lat = as.numeric(dfPlot$lat), 
                       color = pal_rent(dfPlot$rentals), 
                       radius = 2, popup = dfPlot$desc) %>% 
      
      #legend for the rental amounts on the bottom right
      addLegend("bottomright", pal = pal_rent, 
                values = dfPlot$rentals, 
                title = "Rental Amount", 
                labFormat = labelFormat(prefix = "USD ")) %>% 
  
      #legend for the driving distances on the top right
      addLegend("topright", pal = pal,
                values = dfPlot$distances,
                title = "Driving Distance", 
                labFormat = labelFormat(suffix = " Kilometers")) %>% 
  
      #set the center of the map and the zoom
      setView(lat = "29.83", lng = "-95.368817",zoom = 11) %>% 
  
      #export all the files
      htmlwidgets::saveWidget(file = "Leaflet_Houston.html", 
                              selfcontained = FALSE)

You can click here for a much bigger view. I noticed that when the map is zoomed out, even at a reasonable distance, it looks like a lot of visual noise. But zooming in even a little bit gives more clarity on the information that is visualized. I’ve also added the popup feature on the rental locations, so clicking on any one of the little circles will give you the details of the residency and some relevant information.

It would be great if there was readily available data on recent crime stats in this city which would then enable me to (reasonably) plot the most suitable property to live in. It would be even cooler if we could incorporate within that the locations of nearby amenities, like grocery stores and pharmacies.

Really, this has always been my interest in learning programming about a year or so ago. We all know the benefits of programming/computers to companies and institutions and how it always circles back to us ordinary people in the form of neat public/private services. But i’ve always been interested in how it can help carrying out even the most ordinary of tasks in a more informed manner. Plus, at the risk of sounding a little nerdy, it’s also kind of fun.

To leave a comment for the author, please follow the link and comment on their blog: r – Recommended Texts.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.