Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I started with a new script to scrape rental data for Houston, and i’m curious to see if i could not only plot the posts and the rentals, but also a cluster plot using Leaflet. I’m also eager to try out the gepaf, gmapdistance, and the Google Directions API to get distance data and polyline strings for two given coordinates, and then maybe try to plot those too.
But first, i need data. So i had to come up with another script that’ll scrape rental data for Houston. As always, I’ve changed the site’s URL in the script below so the URL below is not the real deal.
#load packages
library(dplyr)
library(rvest)
library(stringr)
#---get URLs----------------------------------------------
urlsFin <- c()
seq(from=5,to=5900,by=5) -> pages
#start loop to scrape post urls
for(i in 1:196){
#url to site's search results page
#needs to be amended to include search results
link <- paste('http://www.HoustonRentals.com/Rentals/HoustonTexas/', i, '_p', sep = "")
hLink <- 1
#get HTML
try(hLink <- html(link))
try(
hLink %>% #scrape the urls leading to the post
html_nodes(".primaryLink.pdpLink.activeLink") %>%
html_attr("href") %>% unique() -> urls
)
urlsFin <- c(urlsFin, urls)
print(paste("iteration No. ", i))
if(i %in% pages){Sys.sleep(20)}
}
#-----------------------------------------------------------
#add prefix to the urls
paste("http://www.HoustonRentals.com", urlsFin, sep = "") -> cUrls
#empty dataframe to populate later
matrix(NA, nrow = length(cUrls), 8) %>%
data.frame() -> cDf
#name columns
names(cDf) <- c("r_Range", "beds", "baths", "desc", "desc_2", "lat", "long", "crime")
#---Start scrape---------------------------------------------------
#navigate to each post and extract details
for(i in 1:length(cUrls)){
cHLink <- hLink <- r_Range <- beds <- baths <- desc <- desc_2 <- crime <- "N.A."
#get HTML for the post
try(hLink <- html(cUrls[i]))
if(class(hLink) != "N.A."){
#rental range
try(
hLink %>%
html_nodes("#rentalPdpContactLeadForm > div.backgroundInverseAlternate.phl.pvm.priceModule > div > span > span") %>%
html_text() %>% str_trim() -> r_Range)
#bedrooms
try(
hLink %>%
html_nodes("#propertyDetails > div.col.cols16.ptm > ul:nth-child(3) > li:nth-child(1)") %>%
html_text() %>% str_trim() -> beds)
#bathrooms
try(
hLink %>%
html_nodes("#propertyDetails > div.col.cols16.ptm > ul:nth-child(3) > li:nth-child(2)") %>%
html_text() %>% str_trim() -> baths)
#residence decription
try(
hLink %>%
html_nodes(".mediaBody.h7.ptl.prxl.prxxl > p:nth-child(1)") %>%
html_text() %>% str_trim() %>% str_replace_all(pattern = "\n", replacement = "") %>%
str_replace_all(pattern = " ", replacement = " ") -> desc)
#crime
try(
hLink %>% html_nodes("#listingHomeDetailsContainer > div:nth-child(5) > div.col.lastCol > div.box.boxBasic.mtn.backgroundLowlight.clickable.crime_glance_details > div.boxBody.media.pvs.h5") %>%
html_text() %>% str_replace_all(pattern = "Crime", replacement = "") %>% str_trim() -> crime)
#stat details
try(
hLink %>%
html_nodes(".mediaBody.h7.ptl.prxl.prxxl > p:nth-child(2)") %>%
html_text() %>% str_trim() %>% str_replace_all(pattern = "\n", replacement = "") %>%
str_replace_all(pattern = " ", replacement = " ") -> desc_2)
#convert HTML to class character
try(hLink %>% as.character() -> cHLink)
#find location of latitude and longitude
try(cHLink %>% str_locate(pattern = "latitude") %>% as.vector() -> latLoc)
try(cHLink %>% str_locate(pattern = "longitude") %>% as.vector() -> lonLoc)
#extract coordinates
try(substr(cHLink, latLoc[2] + 3, lonLoc[1] - 3) -> latitude)
try(substr(cHLink, lonLoc[2] + 3, lonLoc[2] + 12) -> longitude)
#populate empty dataframe
try(cDf[i, "r_Range"] <- r_Range)
try(cDf[i, "beds"] <- beds)
try(cDf[i, "baths"] <- baths)
try(cDf[i, "desc"] <- desc)
try(cDf[i, "desc_2"] <- desc_2)
try(cDf[i, "lat"] <- latitude)
try(cDf[i, "long"] <- longitude)
try(cDf[i, "crime"] <- crime)
}
#progress
paste("Iteration No. ", i, sep = "") %>% print()
#wait 20 seconds
if(i %in% pages){Sys.sleep(20)}
}
#---------------------------------------------------------------
write.csv(cDf, "Houston.csv", row.names = FALSE) #backup
#make another copy for cleaning process
cDf -> CDfCopy
Alright. At this point, i have the raw data that shows:
1. Rental range
2. No. of bedrooms
3. No. of baths
4. Description of the residence
5. Latitude of the residence
6. Longitude of the residence
7. Crime level of the area that the residence is situated
Now it’s time for the cleaning.
#_______Cleaning_________
#Remove incorrect populations
filter(CDfCopy, !grepl(pattern = "ath", CDfCopy$beds),
!grepl(pattern = "ay", CDfCopy$beds),
!grepl(pattern = "ay", CDfCopy$baths),
!grepl(pattern = "N.A.", CDfCopy$beds),
!grepl(pattern = "N.A.", CDfCopy$beds),
!is.na(CDfCopy$beds),
!is.na(CDfCopy$baths)) -> CDfCopy
#1. coordinates
#unique latitudes
strsplit(CDfCopy[, "lat"], "") %>%
unlist() %>% unique() -> NA_lats
#location of latitudes with special characters
strsplit(CDfCopy[, "lat"], "") %>%
unlist() %>% unique() %>%
as.numeric() %>% is.na() -> NA_lats_log
#extract only the special characters
NA_lats[NA_lats_log] -> NA_lats_fin
NA_lats_fin[!is.na(NA_lats_fin)] -> NA_lats_fin
NA_lats_fin[NA_lats_fin != "."] -> NA_lats_fin
#doing the same for longitudes
strsplit(CDfCopy[, "long"], "") %>%
unlist() %>% unique() -> NA_longs
strsplit(CDfCopy[, "long"], "") %>%
unlist() %>% unique() %>%
as.numeric() %>% is.na() -> NA_longs_log
NA_longs[NA_longs_log] -> NA_longs_fin
NA_longs_fin[!is.na(NA_longs_fin)] -> NA_longs_fin
NA_longs_fin[NA_longs_fin != "."] -> NA_longs_fin
NA_longs_fin[NA_longs_fin != "-"] -> NA_longs_fin
#identify all lat/lon with special characters and remove them
for(i in 1:nrow(CDfCopy)){
if(length(NA_lats_fin) != 0){
#replace characters for the latitudes
for(j in 1:length(NA_lats_fin)){
str_replace_all(string = CDfCopy[i, "lat"],
pattern = NA_lats_fin[j],
replacement = "") -> CDfCopy[i, "lat"]
}
}
if(length(NA_longs_fin) != 0){
#replace characters for the longitudes
for(j in 1:length(NA_longs_fin)){
str_replace_all(string = CDfCopy[i, "long"],
pattern = NA_longs_fin[j],
replacement = "") -> CDfCopy[i, "long"]
}
}
}
#rentals, bedrooms, and bathroom data are given as a range
#need to separate the min and max for each in a different column
#new empty columns to be populated later
CDfCopy[, "min_Rental"] <- NA
CDfCopy[, "max_Rental"] <- NA
CDfCopy[, "min_Rooms"] <- NA
CDfCopy[, "max_Rooms"] <- NA
CDfCopy[, "min_Baths"] <- NA
CDfCopy[, "max_Baths"] <- NA
for(i in 1:nrow(CDfCopy)){
#2. Rentals
CDfCopy[i, 1] %>% str_locate_all(pattern = "\\$") %>%
unlist() %>% unique() -> locs
if(length(locs) == 2){
substr(CDfCopy[i, "r_Range"], locs[1], locs[2] - 6) %>%
str_replace_all(pattern = "\\$", replacement = "") %>%
str_replace_all(pattern = ",", replacement = "") %>%
as.numeric() -> CDfCopy[i, "min_Rental"]
substr(CDfCopy[i, "r_Range"], locs[2], nchar(CDfCopy[i, "r_Range"])) %>%
str_replace_all(pattern = "\\$", replacement = "") %>%
str_replace_all(pattern = ",", replacement = "") %>%
as.numeric() -> CDfCopy[i, "max_Rental"]
}else{
if(length(locs) == 1){
substr(CDfCopy[i, "r_Range"], locs[1], nchar(CDfCopy[i, "r_Range"])) %>%
str_replace_all(pattern = "\\$", replacement = "") %>%
str_replace_all(pattern = ",", replacement = "") %>%
as.numeric() -> CDfCopy[i, "min_Rental"]
}
}
#3. Bedrooms
if(grepl(pattern = "-", CDfCopy[i, "beds"])){
CDfCopy[i, "beds"] %>% str_locate(pattern = "-") %>%
as.vector() %>% unique() -> locs
substr(CDfCopy[i, "beds"], 0, locs - 2) -> CDfCopy[i, "min_Rooms"]
substr(CDfCopy[i, "beds"], locs + 2, nchar(CDfCopy[i, "beds"])) -> CDfCopy[i, "max_Rooms"]
}else{
CDfCopy[i, "beds"] -> CDfCopy[i, "min_Rooms"]
}
#4. Bathrooms
if(grepl(pattern = "-", CDfCopy[i, "baths"])){
CDfCopy[i, "baths"] %>% str_locate(pattern = "-") %>%
as.vector() %>% unique() -> locs
substr(CDfCopy[i, "baths"], 0, locs - 2) -> CDfCopy[i, "min_Baths"]
substr(CDfCopy[i, "baths"], locs + 2, nchar(CDfCopy[i, "baths"])) -> CDfCopy[i, "max_Baths"]
}else{
CDfCopy[i, "baths"] -> CDfCopy[i, "min_Baths"]
}
}
CDfCopy$min_Rooms %>%
str_replace_all(pattern = " Bedrooms", replacement = "") %>%
str_replace_all(pattern = " Bedroom", replacement = "") -> CDfCopy$min_Rooms_clean
Now that i have a nice tidy dataframe, it’s time to add three more columns:
1. Distance between the origin and the destination.
2. Time it takes to drive from the origin to the destination.
3. Encoded polyline of the route.
I’ll be using the University of Houston as the origin, while the destinations will be coordinates for each residency in the data frame. If you’re not too familiar with what encoded polylines are, you can check out the documentation on the Google Maps API page. Basically, it is a way to store a series of coordinates as a string. So instead of rows showing latitudes and longitudes, you can have the entire list of coordinates shown as a string. I would need these encoded polylines so that i could then convert these polylines to actual coordinate data, and then plot it on the Leaflet map.
#load packages
library(gmapsdistance)
library(gepaf)
#___add distance, driving time, and encoded polylines____
CDfCopy$distance <- NA
CDfCopy$drive_time <- NA
CDfCopy$poly <- NA
#Coordinates of the University of Houston
Origin <- paste("29.719471", "-95.342308", sep = "+")
Key <- "GOOGLE_API_KEY_GOES_HERE" #Obviously, i can't share this.
#function to get the encoded polylines using Google API
gDir <- function(origin, long, lat, key){
link_1 <- "https://maps.googleapis.com/maps/api/directions/json?origin="
link_2 <- "destination="
link_3 <- "key="
#concatenate coordinates
coordinates <- paste(lat, long, sep = "+")
#concatenate url
url <- paste(paste(link_1, origin, sep = ""),
paste(link_2, coordinates, sep = ""),
key, sep = "&")
#retrieve json
jsonlite::fromJSON(url) -> jData
#extract polyline
as.character(jData$routes$overview_polyline) -> Poly
#coerce to character
as.character(Poly) -> Poly
return(Poly)
}
#NOTE: Google limits requests to 2,500 per day, for free
#Anything more than 2,500 is charged
for(i in 1:nrow(CDfCopy)){
gOutput_d <- NA
gOutput_t <- NA
try(
gmapsdistance(Origin,
paste(CDfCopy$lat[i],
CDfCopy$long[i], sep = "+") ,
key = Key, mode = "driving") -> gOutput
)
#driving time is given in seconds, convert to minutes
try(gOutput_t <- gOutput$Time / 60)
#distance is given in meters, converting to KM
try(gOutput_d <- gOutput$Distance / 1000)
try(CDfCopy$distance[i] <- gOutput_d)
try(CDfCopy$drive_time[i] <- gOutput_t)
try(CDfCopy$poly[i] <- gDir(Origin, CDfCopy$long[i], CDfCopy$lat[i], Key))
print(paste("Completed iteration No.", i, sep = ""))
}
Alright, now that i got the data, it’s time for some plots. Starting off with taking a look at how the rental data is dispersed for each room type. I’ve used the ggplot2 package to plot data before, but i’ve never combined it with the ggplotly r package:
#Rental Boxplots
ggplotly(
ggplot(CDPlot) +
geom_boxplot(aes(min_Rooms,
min_Rental, fill = min_Rooms)) +
ylim(0, 3000) +
scale_x_discrete(limits=c('Studio', 1, 2, 3, 4,
5, 6), expand = c(0,-3)) +
labs(title = "Rentals Rates per Apartment Type",
x = "Number of Rooms",
y = "Rental") +
theme(plot.title = element_text(face="bold"))
) %>%
htmlwidgets::saveWidget(file = "plotly.html", selfcontained = FALSE)
I can’t embed the graph here because it’ll look all funny, but you can check it out by clicking here. Quite natural for the Studio category to have about the same rental rate as the 1-Bedroom, since they are practically the same thing for a lot of people.
It would be interesting to see where most of the rental posts are located using Leaflet’s cool cluster markers. I’ve never actually tried this out before, but here goes.
#cluster map
library(leaflet)
leaflet(CDPlot) %>%
addProviderTiles("CartoDB.DarkMatter") %>%
addCircleMarkers(
lng = ~as.numeric(CDPlot$long),
lat = ~as.numeric(CDPlot$lat),
clusterOptions = markerClusterOptions(),
radius = 3
) %>%
htmlwidgets::saveWidget(file = "Leaflet.html",
selfcontained = FALSE)
You can try clicking on any of the cluster markers and it will zoom in and break the cluster down further. Pretty neat stuff.
And finally, plotting the polylines. I actually ran in to a little bit of trouble plotting the driving routes that i retrieved from the Google Directions API. The thorn was that plotting a single line on the Leaflet map requires the use of the addPolylines() function, but that was just it. I couldn’t figure out how to plot multiple lines (in this case hundreds) of lines in the most concise way possible. The kept coming back to the ridiculous idea of using hundreds of addPolylines() functions, which just sounds silly.
I was stumped, so i took a sample and asked the nice community at Stack Overflow for some help. The solution was to convert the coordinates decoded from the polylines in to a “Spatial Lines” object. I had never heard of this term before, so i had to read up on it a bit. I worked on the code sample given back by one of the users, and i’ve managed to make it work. Below is the code and the result.
#load package
library(leaflet)
#filter only Studios and 1-bedroom apartments
filter(dfPlot, rooms == c("Studio", "1")) -> dfPlot
#isolate polylines
dfPlot$polylines -> polylines
#__Below is the sample i got from stack over flow, but slightly modified_______
#__comments are mine except where explicitly mentioned_________________________
#decode all polylines and turn in to list
lapply(polylines, function(x) gepaf::decodePolyline(x)) -> lPoly
#rowbind into data.table, with unique ID column
rbindlist(lPoly, idcol = "id") -> dfPoly
#into list with unique IDs
lst_lines <- lapply(unique(dfPoly$id), function(x){
#Comment from stack over flow user:
#"the order of the 'lon' and 'lat' fields is important"
Lines(Line(dfPoly[id == x, .(lon, lat)]), ID = x)
})
#______________________________________________________________________________
#Map distances to colors
colorNumeric(c("#00FF15", "#faff00","#FF0000", "#17129e"),
domain = dfPlot$distances,
alpha = FALSE) -> pal
#Map rental rates to colors
colorNumeric(c("#ff00ee", "#00e5ff", "#ffff00"),
domain = dfPlot$rentals,
alpha = FALSE) -> pal_rent
#Convert to spatial lines object
lst_lines %>%
SpatialLines() %>%
#generate leaflet map
leaflet() %>%
#use a dark theme
addProviderTiles('CartoDB.DarkMatterNoLabels') %>%
#plot the lines with the values mapped
addPolylines(color = pal(dfPlot$distances),
opacity = 0.3, fillOpacity = 0.1,
weight = 3) %>%
#add circle markers with colors mapped and...
#..with popups
addCircleMarkers(lng = as.numeric(dfPlot$long),
lat = as.numeric(dfPlot$lat),
color = pal_rent(dfPlot$rentals),
radius = 2, popup = dfPlot$desc) %>%
#legend for the rental amounts on the bottom right
addLegend("bottomright", pal = pal_rent,
values = dfPlot$rentals,
title = "Rental Amount",
labFormat = labelFormat(prefix = "USD ")) %>%
#legend for the driving distances on the top right
addLegend("topright", pal = pal,
values = dfPlot$distances,
title = "Driving Distance",
labFormat = labelFormat(suffix = " Kilometers")) %>%
#set the center of the map and the zoom
setView(lat = "29.83", lng = "-95.368817",zoom = 11) %>%
#export all the files
htmlwidgets::saveWidget(file = "Leaflet_Houston.html",
selfcontained = FALSE)
You can click here for a much bigger view. I noticed that when the map is zoomed out, even at a reasonable distance, it looks like a lot of visual noise. But zooming in even a little bit gives more clarity on the information that is visualized. I’ve also added the popup feature on the rental locations, so clicking on any one of the little circles will give you the details of the residency and some relevant information.
It would be great if there was readily available data on recent crime stats in this city which would then enable me to (reasonably) plot the most suitable property to live in. It would be even cooler if we could incorporate within that the locations of nearby amenities, like grocery stores and pharmacies.
Really, this has always been my interest in learning programming about a year or so ago. We all know the benefits of programming/computers to companies and institutions and how it always circles back to us ordinary people in the form of neat public/private services. But i’ve always been interested in how it can help carrying out even the most ordinary of tasks in a more informed manner. Plus, at the risk of sounding a little nerdy, it’s also kind of fun.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
