Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I started with a new script to scrape rental data for Houston, and i’m curious to see if i could not only plot the posts and the rentals, but also a cluster plot using Leaflet. I’m also eager to try out the gepaf, gmapdistance, and the Google Directions API to get distance data and polyline strings for two given coordinates, and then maybe try to plot those too.
But first, i need data. So i had to come up with another script that’ll scrape rental data for Houston. As always, I’ve changed the site’s URL in the script below so the URL below is not the real deal.
#load packages library(dplyr) library(rvest) library(stringr) #---get URLs---------------------------------------------- urlsFin <- c() seq(from=5,to=5900,by=5) -> pages #start loop to scrape post urls for(i in 1:196){ #url to site's search results page #needs to be amended to include search results link <- paste('http://www.HoustonRentals.com/Rentals/HoustonTexas/', i, '_p', sep = "") hLink <- 1 #get HTML try(hLink <- html(link)) try( hLink %>% #scrape the urls leading to the post html_nodes(".primaryLink.pdpLink.activeLink") %>% html_attr("href") %>% unique() -> urls ) urlsFin <- c(urlsFin, urls) print(paste("iteration No. ", i)) if(i %in% pages){Sys.sleep(20)} } #----------------------------------------------------------- #add prefix to the urls paste("http://www.HoustonRentals.com", urlsFin, sep = "") -> cUrls #empty dataframe to populate later matrix(NA, nrow = length(cUrls), 8) %>% data.frame() -> cDf #name columns names(cDf) <- c("r_Range", "beds", "baths", "desc", "desc_2", "lat", "long", "crime") #---Start scrape--------------------------------------------------- #navigate to each post and extract details for(i in 1:length(cUrls)){ cHLink <- hLink <- r_Range <- beds <- baths <- desc <- desc_2 <- crime <- "N.A." #get HTML for the post try(hLink <- html(cUrls[i])) if(class(hLink) != "N.A."){ #rental range try( hLink %>% html_nodes("#rentalPdpContactLeadForm > div.backgroundInverseAlternate.phl.pvm.priceModule > div > span > span") %>% html_text() %>% str_trim() -> r_Range) #bedrooms try( hLink %>% html_nodes("#propertyDetails > div.col.cols16.ptm > ul:nth-child(3) > li:nth-child(1)") %>% html_text() %>% str_trim() -> beds) #bathrooms try( hLink %>% html_nodes("#propertyDetails > div.col.cols16.ptm > ul:nth-child(3) > li:nth-child(2)") %>% html_text() %>% str_trim() -> baths) #residence decription try( hLink %>% html_nodes(".mediaBody.h7.ptl.prxl.prxxl > p:nth-child(1)") %>% html_text() %>% str_trim() %>% str_replace_all(pattern = "\n", replacement = "") %>% str_replace_all(pattern = " ", replacement = " ") -> desc) #crime try( hLink %>% html_nodes("#listingHomeDetailsContainer > div:nth-child(5) > div.col.lastCol > div.box.boxBasic.mtn.backgroundLowlight.clickable.crime_glance_details > div.boxBody.media.pvs.h5") %>% html_text() %>% str_replace_all(pattern = "Crime", replacement = "") %>% str_trim() -> crime) #stat details try( hLink %>% html_nodes(".mediaBody.h7.ptl.prxl.prxxl > p:nth-child(2)") %>% html_text() %>% str_trim() %>% str_replace_all(pattern = "\n", replacement = "") %>% str_replace_all(pattern = " ", replacement = " ") -> desc_2) #convert HTML to class character try(hLink %>% as.character() -> cHLink) #find location of latitude and longitude try(cHLink %>% str_locate(pattern = "latitude") %>% as.vector() -> latLoc) try(cHLink %>% str_locate(pattern = "longitude") %>% as.vector() -> lonLoc) #extract coordinates try(substr(cHLink, latLoc[2] + 3, lonLoc[1] - 3) -> latitude) try(substr(cHLink, lonLoc[2] + 3, lonLoc[2] + 12) -> longitude) #populate empty dataframe try(cDf[i, "r_Range"] <- r_Range) try(cDf[i, "beds"] <- beds) try(cDf[i, "baths"] <- baths) try(cDf[i, "desc"] <- desc) try(cDf[i, "desc_2"] <- desc_2) try(cDf[i, "lat"] <- latitude) try(cDf[i, "long"] <- longitude) try(cDf[i, "crime"] <- crime) } #progress paste("Iteration No. ", i, sep = "") %>% print() #wait 20 seconds if(i %in% pages){Sys.sleep(20)} } #--------------------------------------------------------------- write.csv(cDf, "Houston.csv", row.names = FALSE) #backup #make another copy for cleaning process cDf -> CDfCopy
Alright. At this point, i have the raw data that shows:
1. Rental range
2. No. of bedrooms
3. No. of baths
4. Description of the residence
5. Latitude of the residence
6. Longitude of the residence
7. Crime level of the area that the residence is situated
Now it’s time for the cleaning.
#_______Cleaning_________ #Remove incorrect populations filter(CDfCopy, !grepl(pattern = "ath", CDfCopy$beds), !grepl(pattern = "ay", CDfCopy$beds), !grepl(pattern = "ay", CDfCopy$baths), !grepl(pattern = "N.A.", CDfCopy$beds), !grepl(pattern = "N.A.", CDfCopy$beds), !is.na(CDfCopy$beds), !is.na(CDfCopy$baths)) -> CDfCopy #1. coordinates #unique latitudes strsplit(CDfCopy[, "lat"], "") %>% unlist() %>% unique() -> NA_lats #location of latitudes with special characters strsplit(CDfCopy[, "lat"], "") %>% unlist() %>% unique() %>% as.numeric() %>% is.na() -> NA_lats_log #extract only the special characters NA_lats[NA_lats_log] -> NA_lats_fin NA_lats_fin[!is.na(NA_lats_fin)] -> NA_lats_fin NA_lats_fin[NA_lats_fin != "."] -> NA_lats_fin #doing the same for longitudes strsplit(CDfCopy[, "long"], "") %>% unlist() %>% unique() -> NA_longs strsplit(CDfCopy[, "long"], "") %>% unlist() %>% unique() %>% as.numeric() %>% is.na() -> NA_longs_log NA_longs[NA_longs_log] -> NA_longs_fin NA_longs_fin[!is.na(NA_longs_fin)] -> NA_longs_fin NA_longs_fin[NA_longs_fin != "."] -> NA_longs_fin NA_longs_fin[NA_longs_fin != "-"] -> NA_longs_fin #identify all lat/lon with special characters and remove them for(i in 1:nrow(CDfCopy)){ if(length(NA_lats_fin) != 0){ #replace characters for the latitudes for(j in 1:length(NA_lats_fin)){ str_replace_all(string = CDfCopy[i, "lat"], pattern = NA_lats_fin[j], replacement = "") -> CDfCopy[i, "lat"] } } if(length(NA_longs_fin) != 0){ #replace characters for the longitudes for(j in 1:length(NA_longs_fin)){ str_replace_all(string = CDfCopy[i, "long"], pattern = NA_longs_fin[j], replacement = "") -> CDfCopy[i, "long"] } } } #rentals, bedrooms, and bathroom data are given as a range #need to separate the min and max for each in a different column #new empty columns to be populated later CDfCopy[, "min_Rental"] <- NA CDfCopy[, "max_Rental"] <- NA CDfCopy[, "min_Rooms"] <- NA CDfCopy[, "max_Rooms"] <- NA CDfCopy[, "min_Baths"] <- NA CDfCopy[, "max_Baths"] <- NA for(i in 1:nrow(CDfCopy)){ #2. Rentals CDfCopy[i, 1] %>% str_locate_all(pattern = "\\$") %>% unlist() %>% unique() -> locs if(length(locs) == 2){ substr(CDfCopy[i, "r_Range"], locs[1], locs[2] - 6) %>% str_replace_all(pattern = "\\$", replacement = "") %>% str_replace_all(pattern = ",", replacement = "") %>% as.numeric() -> CDfCopy[i, "min_Rental"] substr(CDfCopy[i, "r_Range"], locs[2], nchar(CDfCopy[i, "r_Range"])) %>% str_replace_all(pattern = "\\$", replacement = "") %>% str_replace_all(pattern = ",", replacement = "") %>% as.numeric() -> CDfCopy[i, "max_Rental"] }else{ if(length(locs) == 1){ substr(CDfCopy[i, "r_Range"], locs[1], nchar(CDfCopy[i, "r_Range"])) %>% str_replace_all(pattern = "\\$", replacement = "") %>% str_replace_all(pattern = ",", replacement = "") %>% as.numeric() -> CDfCopy[i, "min_Rental"] } } #3. Bedrooms if(grepl(pattern = "-", CDfCopy[i, "beds"])){ CDfCopy[i, "beds"] %>% str_locate(pattern = "-") %>% as.vector() %>% unique() -> locs substr(CDfCopy[i, "beds"], 0, locs - 2) -> CDfCopy[i, "min_Rooms"] substr(CDfCopy[i, "beds"], locs + 2, nchar(CDfCopy[i, "beds"])) -> CDfCopy[i, "max_Rooms"] }else{ CDfCopy[i, "beds"] -> CDfCopy[i, "min_Rooms"] } #4. Bathrooms if(grepl(pattern = "-", CDfCopy[i, "baths"])){ CDfCopy[i, "baths"] %>% str_locate(pattern = "-") %>% as.vector() %>% unique() -> locs substr(CDfCopy[i, "baths"], 0, locs - 2) -> CDfCopy[i, "min_Baths"] substr(CDfCopy[i, "baths"], locs + 2, nchar(CDfCopy[i, "baths"])) -> CDfCopy[i, "max_Baths"] }else{ CDfCopy[i, "baths"] -> CDfCopy[i, "min_Baths"] } } CDfCopy$min_Rooms %>% str_replace_all(pattern = " Bedrooms", replacement = "") %>% str_replace_all(pattern = " Bedroom", replacement = "") -> CDfCopy$min_Rooms_clean
Now that i have a nice tidy dataframe, it’s time to add three more columns:
1. Distance between the origin and the destination.
2. Time it takes to drive from the origin to the destination.
3. Encoded polyline of the route.
I’ll be using the University of Houston as the origin, while the destinations will be coordinates for each residency in the data frame. If you’re not too familiar with what encoded polylines are, you can check out the documentation on the Google Maps API page. Basically, it is a way to store a series of coordinates as a string. So instead of rows showing latitudes and longitudes, you can have the entire list of coordinates shown as a string. I would need these encoded polylines so that i could then convert these polylines to actual coordinate data, and then plot it on the Leaflet map.
#load packages library(gmapsdistance) library(gepaf) #___add distance, driving time, and encoded polylines____ CDfCopy$distance <- NA CDfCopy$drive_time <- NA CDfCopy$poly <- NA #Coordinates of the University of Houston Origin <- paste("29.719471", "-95.342308", sep = "+") Key <- "GOOGLE_API_KEY_GOES_HERE" #Obviously, i can't share this. #function to get the encoded polylines using Google API gDir <- function(origin, long, lat, key){ link_1 <- "https://maps.googleapis.com/maps/api/directions/json?origin=" link_2 <- "destination=" link_3 <- "key=" #concatenate coordinates coordinates <- paste(lat, long, sep = "+") #concatenate url url <- paste(paste(link_1, origin, sep = ""), paste(link_2, coordinates, sep = ""), key, sep = "&") #retrieve json jsonlite::fromJSON(url) -> jData #extract polyline as.character(jData$routes$overview_polyline) -> Poly #coerce to character as.character(Poly) -> Poly return(Poly) } #NOTE: Google limits requests to 2,500 per day, for free #Anything more than 2,500 is charged for(i in 1:nrow(CDfCopy)){ gOutput_d <- NA gOutput_t <- NA try( gmapsdistance(Origin, paste(CDfCopy$lat[i], CDfCopy$long[i], sep = "+") , key = Key, mode = "driving") -> gOutput ) #driving time is given in seconds, convert to minutes try(gOutput_t <- gOutput$Time / 60) #distance is given in meters, converting to KM try(gOutput_d <- gOutput$Distance / 1000) try(CDfCopy$distance[i] <- gOutput_d) try(CDfCopy$drive_time[i] <- gOutput_t) try(CDfCopy$poly[i] <- gDir(Origin, CDfCopy$long[i], CDfCopy$lat[i], Key)) print(paste("Completed iteration No.", i, sep = "")) }
Alright, now that i got the data, it’s time for some plots. Starting off with taking a look at how the rental data is dispersed for each room type. I’ve used the ggplot2 package to plot data before, but i’ve never combined it with the ggplotly r package:
#Rental Boxplots ggplotly( ggplot(CDPlot) + geom_boxplot(aes(min_Rooms, min_Rental, fill = min_Rooms)) + ylim(0, 3000) + scale_x_discrete(limits=c('Studio', 1, 2, 3, 4, 5, 6), expand = c(0,-3)) + labs(title = "Rentals Rates per Apartment Type", x = "Number of Rooms", y = "Rental") + theme(plot.title = element_text(face="bold")) ) %>% htmlwidgets::saveWidget(file = "plotly.html", selfcontained = FALSE)
I can’t embed the graph here because it’ll look all funny, but you can check it out by clicking here. Quite natural for the Studio category to have about the same rental rate as the 1-Bedroom, since they are practically the same thing for a lot of people.
It would be interesting to see where most of the rental posts are located using Leaflet’s cool cluster markers. I’ve never actually tried this out before, but here goes.
#cluster map library(leaflet) leaflet(CDPlot) %>% addProviderTiles("CartoDB.DarkMatter") %>% addCircleMarkers( lng = ~as.numeric(CDPlot$long), lat = ~as.numeric(CDPlot$lat), clusterOptions = markerClusterOptions(), radius = 3 ) %>% htmlwidgets::saveWidget(file = "Leaflet.html", selfcontained = FALSE)
You can try clicking on any of the cluster markers and it will zoom in and break the cluster down further. Pretty neat stuff.
And finally, plotting the polylines. I actually ran in to a little bit of trouble plotting the driving routes that i retrieved from the Google Directions API. The thorn was that plotting a single line on the Leaflet map requires the use of the addPolylines()
function, but that was just it. I couldn’t figure out how to plot multiple lines (in this case hundreds) of lines in the most concise way possible. The kept coming back to the ridiculous idea of using hundreds of addPolylines()
functions, which just sounds silly.
I was stumped, so i took a sample and asked the nice community at Stack Overflow for some help. The solution was to convert the coordinates decoded from the polylines in to a “Spatial Lines” object. I had never heard of this term before, so i had to read up on it a bit. I worked on the code sample given back by one of the users, and i’ve managed to make it work. Below is the code and the result.
#load package library(leaflet) #filter only Studios and 1-bedroom apartments filter(dfPlot, rooms == c("Studio", "1")) -> dfPlot #isolate polylines dfPlot$polylines -> polylines #__Below is the sample i got from stack over flow, but slightly modified_______ #__comments are mine except where explicitly mentioned_________________________ #decode all polylines and turn in to list lapply(polylines, function(x) gepaf::decodePolyline(x)) -> lPoly #rowbind into data.table, with unique ID column rbindlist(lPoly, idcol = "id") -> dfPoly #into list with unique IDs lst_lines <- lapply(unique(dfPoly$id), function(x){ #Comment from stack over flow user: #"the order of the 'lon' and 'lat' fields is important" Lines(Line(dfPoly[id == x, .(lon, lat)]), ID = x) }) #______________________________________________________________________________ #Map distances to colors colorNumeric(c("#00FF15", "#faff00","#FF0000", "#17129e"), domain = dfPlot$distances, alpha = FALSE) -> pal #Map rental rates to colors colorNumeric(c("#ff00ee", "#00e5ff", "#ffff00"), domain = dfPlot$rentals, alpha = FALSE) -> pal_rent #Convert to spatial lines object lst_lines %>% SpatialLines() %>% #generate leaflet map leaflet() %>% #use a dark theme addProviderTiles('CartoDB.DarkMatterNoLabels') %>% #plot the lines with the values mapped addPolylines(color = pal(dfPlot$distances), opacity = 0.3, fillOpacity = 0.1, weight = 3) %>% #add circle markers with colors mapped and... #..with popups addCircleMarkers(lng = as.numeric(dfPlot$long), lat = as.numeric(dfPlot$lat), color = pal_rent(dfPlot$rentals), radius = 2, popup = dfPlot$desc) %>% #legend for the rental amounts on the bottom right addLegend("bottomright", pal = pal_rent, values = dfPlot$rentals, title = "Rental Amount", labFormat = labelFormat(prefix = "USD ")) %>% #legend for the driving distances on the top right addLegend("topright", pal = pal, values = dfPlot$distances, title = "Driving Distance", labFormat = labelFormat(suffix = " Kilometers")) %>% #set the center of the map and the zoom setView(lat = "29.83", lng = "-95.368817",zoom = 11) %>% #export all the files htmlwidgets::saveWidget(file = "Leaflet_Houston.html", selfcontained = FALSE)
You can click here for a much bigger view. I noticed that when the map is zoomed out, even at a reasonable distance, it looks like a lot of visual noise. But zooming in even a little bit gives more clarity on the information that is visualized. I’ve also added the popup feature on the rental locations, so clicking on any one of the little circles will give you the details of the residency and some relevant information.
It would be great if there was readily available data on recent crime stats in this city which would then enable me to (reasonably) plot the most suitable property to live in. It would be even cooler if we could incorporate within that the locations of nearby amenities, like grocery stores and pharmacies.
Really, this has always been my interest in learning programming about a year or so ago. We all know the benefits of programming/computers to companies and institutions and how it always circles back to us ordinary people in the form of neat public/private services. But i’ve always been interested in how it can help carrying out even the most ordinary of tasks in a more informed manner. Plus, at the risk of sounding a little nerdy, it’s also kind of fun.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.