An Overview of Room Rentals in Sydney | RSelenium, rvest, Leaflet, googleway
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Started another data scraping script similar to the post about rental rates in Houston; except this time i picked Sydney, Australia. The site that i’ve selected uses an awful lot of javascript, so the rvest package won’t be enough in this case. I’m going to have to use RSelenium.
The script for scraping the site is below. I say this every time i post a scraping script, and i’ll say it again: the URL to the actual site has been replaced for obvious reasons. Although i place plenty of comments in my code, this time i’m going to try and break down the code in little chunks and explain each part:
library(dplyr) library(rvest) library(RSelenium) library(stringr) library(tcltk) check_error <- function(code) { tryCatch(code, error = function(c) "error", warning = function(c) "warning", message = function(c) "message" ) } #checkForServer() #Check if server file is available startServer() #Start the server mybrowser <- remoteDriver(browser = "chrome") #Change the browser to chrome mybrowser$open(silent = TRUE) #navigate to url. enter according to whatever is filtered in the site mybrowser$navigate("https://lookingforrooms.au/room/sydney") #run a loop to scroll all the way to the bottom for(i in 1:100){ mybrowser$executeScript("scroll(0, 500000);") Sys.sleep(3) #wait for posts to load print(i) #check progress } #get HTML mybrowser$getPageSource()[[1]] %>% read_html() -> source
The packages dplyr
, rvest
, RSelenium
, and stringr
are all packages that i’ve used before so i’m not going to explain what they’re used for. The only thing that’s different is the tcltk
package. The only reason i loaded this library is so that i can create a nice progress bar for the loops i use.
Unlike many other rental sites, this site uses some kind of javascript function that shows more and more posts as you scroll down the page. So, i would have to somehow simulate this scrolling. That’s what the loop at the end of the script is for. The line mybrowser$executeScript("scroll(0, 500000);")
scrolls down all the way to the bottom of the page while the line Sys.sleep(3)
waits three seconds for the posts to be generated before scrolling down again. This is done 100 times before the HTML of the page is extracted and stored in source
.
The next step is to try and retrieve as much information from the posts that have been generated without actually having to navigate to each post.
#get rentals rates source %>% html_nodes("#search-results") %>% html_nodes(".ribbon.property") %>% html_text() -> rentals #get title of the post source %>% html_nodes("#search-results") %>% html_nodes(".listing-head") %>% html_text() -> title #get the URL to the post source %>% html_nodes("#search-results") %>% html_nodes(".content-column") %>% html_nodes("a") %>% html_attr("href") %>% unique() -> links #remove redundant URLs links[grep("/P", links)] %>% unique() -> links
There is some more information that i need but this data can’t be retrieved without navigating to each post.
#create an empty datafrmae to populated later matrix(NA, length(links), 14) %>% as.data.frame() -> df #headers names(df) <- c("title", "lat", "lon", "room_type", "bathroom", "furnishing", "available", "length_of_stay", "gender", "weekly_rent", "bond", "bills", "internet", "parking") #might need to remove this lstCharges <- list() lstDetails <- list() vecLat <- c() vecLon <- c() lstSource <- list() #total for the progress bar total <- length(links) for(i in 1:length(links)){ if(i == 1){ pb <- tkProgressBar(title = "progress bar", min = 0, max = total, width = 300) } #navigate to the post URL try(mybrowser$navigate(paste("https://lookingforrooms.au", links[i], sep = ""))) Sys.sleep(10) #get HTML try(mybrowser$getPageSource()[[1]] %>% html() -> source) #check if a post has been deleted source %>% as.character() %>% str_locate("longer available") %>% is.na() %>% c() %>% unique() -> logError #if the post has not been deleted... if(!logError){ lat <- 0 #extract latitude try( source %>% html_nodes("#map-canvas") %>% html_attr("data-latitude") -> lat ) if(length(lat) != 0){ #if lat retrieved is not empty vecLat[length(vecLat) + 1] <- lat }else{vecLat[length(vecLat) + 1] <- 0} lon <- 0 #extract longitude try( source %>% html_nodes("#map-canvas") %>% html_attr("data-longitude") -> lon ) if(length(lon) != 0){ #if lon retrieved is not empty vecLon[length(vecLon) + 1] <- lon }else{vecLon[length(vecLon) + 1] <- 0} matDetails <- matrix(NA, 1, 1) #extract table showing Room Details try( source %>% html_nodes("#property-view > div:nth-child(2) > div.property-panel > div.property-description > div:nth-child(4) > div > div > div:nth-child(1) > table") %>% html_table() -> matDetails ) if(check_error(matDetails[[1]]) != "error"){ lstDetails[[length(lstDetails) + 1]] <- matDetails[[1]] }else{ lstDetails[[length(lstDetails) + 1]] <- NA } matCharges <- matrix(NA, 1, 1) #extract table showing Charges for the room try( source %>% html_nodes("#property-view > div:nth-child(2) > div.property-panel > div.property-description > div:nth-child(4) > div > div > div:nth-child(2) > table") %>% html_table() -> matCharges ) if(check_error(matCharges[[1]]) != "error"){ lstCharges[[length(lstCharges) + 1]] <- matCharges[[1]] }else{ lstCharges[[length(lstCharges) + 1]] <- NA } }else{ #if page is empty, populate with NA vecLat[length(vecLat) + 1] <- NA vecLon[length(vecLon) + 1] <- NA lstDetails[[length(lstDetails) + 1]] <- matrix(NA,1,1) lstCharges[[length(lstCharges) + 1]] <- matrix(NA,1,1) } #progress bar setTkProgressBar(pb, i, label=paste(i, " out of ", total, sep = "")) } close(pb) #close progress bar mybrowser$close()#close browser #notification beeps using beepr package beepr::beep() Sys.sleep(0.2) beepr::beep() Sys.sleep(0.2) beepr::beep() Sys.sleep(0.2)
Now for the cleaning of the data. There are quite a number of old posts on the site that did not have any basic details posted, so those need to be removed.
#posts with no details nonNA <- is.na(lstDetails) #remove all extractions with no details lstDet_Clean <- lstDetails[!nonNA] #filter charges table with according to no posts lstCharges_Clean <- lstCharges[!nonNA] #oops! mistakenly removed original charges list #filter lon, lat, and title by no post filter vecLat_Clean <- vecLat[!nonNA] vecLon_Clean <- vecLon[!nonNA] title_Clean <- title[!nonNA] #change dim of original df to accomodate no posts df <- df[1:length(lstCharges_Clean),]
Now it’s just a matter of populating the empty dataframe, using the lists and vectors that were generated from the scraping script:
#start loop to populate empty dataframe for(i in 1:nrow(df)){ #title, lon, and lat df[i, "title"] <- title_Clean[i] df[i, "lat"] <- vecLat_Clean[i] df[i, "lon"] <- vecLon_Clean[i] lstCharges_Clean[[i]] -> tblCharges #rent tolower(tblCharges[,1]) == "weekly rent" -> logRent if(sum(logRent) != 0){ tblCharges[logRent,2] -> df[i, "weekly_rent"] } #internet tolower(tblCharges[,1]) == "internet" -> logRent if(sum(logRent) != 0){ tblCharges[logRent,2] -> df[i, "internet"] } #parking tolower(tblCharges[,1]) == "parking" -> logRent if(sum(logRent) != 0){ tblCharges[logRent,2] -> df[i, "parking"] } #bond tolower(tblCharges[,1]) == "bond" -> logRent if(sum(logRent) != 0){ tblCharges[logRent,2] -> df[i, "bond"] } #bills tolower(tblCharges[,1]) == "bills" -> logRent if(sum(logRent) != 0){ tblCharges[logRent,2] -> df[i, "bills"] } lstDet_Clean[[i]] -> tblDetails #room type tolower(tblDetails[,1]) == "room type" -> logDetails if(sum(logDetails) != 0){ tblDetails[logDetails,2] -> df[i, "room_type"] } #bathroom tolower(tblDetails[,1]) == "bathroom" -> logDetails if(sum(logDetails) != 0){ tblDetails[logDetails,2] -> df[i, "bathroom"] } #furnishing tolower(tblDetails[,1]) == "furnishing" -> logDetails if(sum(logDetails) != 0){ tblDetails[logDetails,2] -> df[i, "furnishing"] } #available tolower(tblDetails[,1]) == "available" -> logDetails if(sum(logDetails) != 0){ tblDetails[logDetails,2] -> df[i, "available"] } #length of stay tolower(tblDetails[,1]) == "length of stay" -> logDetails if(sum(logDetails) != 0){ tblDetails[logDetails,2] -> df[i, "length_of_stay"] } #gender tolower(tblDetails[,1]) == "gender" -> logDetails if(sum(logDetails) != 0){ tblDetails[logDetails,2] -> df[i, "gender"] } }
All the information i needed to examine the rentals rates are in one dataframe. Now it’s a matter of plotting them, starting off with a break down of the number of posts. The next three plots will show:
1. The total number of posts per room type.
2. The total number of posts per room type and bathroom type
3. The total number of posts per room type, bathroom type, and gender.
As a side note, the “gender” field represents what sort of gender does the poster prefer to live with. The choices are male, female, couple, male/female but not couple, or anyone.
#Total number posts per room_type ggplot(df, aes(room_type)) + geom_bar(fill = "dark blue") + theme_minimal() + labs(x = "Room Type", y = "Number of Posts") + ggtitle("Number of Posts by Room Type")
#total number of posts per room type and bathroom type ggplot(df, aes(room_type, fill = bathroom)) + geom_bar(position = "dodge") + theme_minimal() + labs(x = "Room Type", y = "Number of Posts", fill = "Bathroom") + ggtitle("Number of Posts by Room and Bathroom Type")
#total number of posts per room type, bathroom type, furnishing, and gender ggplot(df, aes(room_type, fill = bathroom, color = furnishing)) + geom_bar(width = 0.7, position = position_dodge(width = 0.9), lwd = 0.7)+ scale_color_brewer(palette = "Dark2") + scale_fill_brewer(palette = "Set2") + labs(x = "Room Type", y = "Number of Posts", color = "Furnishing", fill = "Bathroom") + ggtitle("Number of Posts: By Room Type, Bathroom Type, Furnishing, and Gender") + facet_wrap(~gender) + theme( panel.background = element_rect(fill = "grey"), legend.position = c(0.85, 0.25)) + scale_y_continuous(breaks = seq(0, max(df$weekly_rent), by = 100))
Click here to see the third plot a little clearer.
The obvious point you’d have to take away from this is that most people have no real preference when it comes to gender, given how the “Anyone Welcome” category is selected in most posts. If you’re looking to find a room to share with some one else, you might be out of luck due to the fact that a big portion of the total number of posts are about private rooms and not shared rooms, however most posts show that the bathrooms are shared.
The next two plots reveal the frequency distribution of the weekly rentals rates. The first plot will show the frequencies with all the posts in scope, while the second one only shows the weekly rent distribution for only shared room posts.
#Rental distribution ggplot(df, aes(x=weekly_rent)) + geom_histogram(aes(fill =..density..), binwidth = 50) + scale_fill_gradient("Count", low = "green", high = "red") + scale_x_continuous(breaks = seq(0, max(df$weekly_rent), by = 50)) + labs(x = "Weekly Rent", y = "Frequency") + guides(fill = FALSE) + ggtitle("Frequency Distribution of Rentals")
#Rental distribution for != private room ggplot(filter(df, room_type != "Private room"), aes(x=weekly_rent)) + geom_histogram(aes(fill =..density..), binwidth = 20) + scale_fill_gradient("Count", low = "green", high = "red") + scale_x_continuous(breaks = seq(0, max(df$weekly_rent), by = 50)) + labs(x = "Weekly Rent", y = "Count") + guides(fill = FALSE) + ggtitle("Frequency Distribution of Rentals: Excluding Private Rooms")
The weekly rental range for most posts fall under the A$200 to A$300 range. If i isolate only the non-private rooms, the range drops down to A$150 to $200. I think a boxplot should give us a more clearer picture on the median weekly rental rate for each kind of post. The following plots show the median rentals and range in the following order:
1. By room type
2. By room type, bathroom type, and gender
3. By room type, bathroom type, gender, and furnishing
#Average rental per room type. ggplot(df, aes(room_type, weekly_rent)) + geom_boxplot(fill = "yellow") + labs(x = "Room Type", y = "Weekly Rent", title = "Median Rentals by Room Type") + theme(panel.background = element_rect(fill = "grey")) + ylim(0,600)
#Average rental per room type, bathroom, and gender ggplot(df, aes(room_type, weekly_rent, fill = bathroom)) + geom_boxplot() + labs(x = "Room Type", y = "Weekly Rent", title = "Median Rentals: Room Type VS Bathroom Type VS Gender", fill = "Bathroom") + theme(panel.background = element_rect(fill = "grey"), legend.position = c(0.85, 0.25)) + ylim(0,600) + facet_wrap(~gender)
Higher resolution: Median Rentals: Room Type VS Bathroom Type VS Gender.
#Average rental (y) per room type (x), bathroom (fill), gender (facet), and furnishing (color) ggplot(df, aes(room_type, weekly_rent, fill = bathroom, color = furnishing)) + geom_boxplot(lwd = 0.7, fatten = 1, position = position_dodge(1)) + labs(x = "Room Type", y = "Weekly Rent", title = "Median Rentals: Room Type VS Bathroom Type VS Gender VS Furnishing", fill = "Bathroom", color = "Furnishing") + theme(panel.background = element_rect(fill = "#f2f2f2"), legend.position = c(0.85, 0.25)) + ylim(0,600) + facet_wrap(~gender) + scale_color_brewer(palette = "Set1") + scale_fill_brewer(palette = "Pastel2")
Higher resolution: Median Rentals: Room Type VS Bathroom Type VS Gender VS Furnishing.
You might be able to save a decent amount of money if you go for a private room with a shared bathroom rather than a room with an ensuite/private bathroom. If a private room is a little pricy, you can go for a shared room with an ensuite bathroom, since a shared bathroom is going to cost roughly the same amount. Judging from the third plot, the median weekly rental for most private rooms that are furnished where you get to have your own bathroom float around the A$300 to A$400 range. The best deal out there seems to be private room posts where the gender is tagged as “Anyone welcome”, the furnishing is “flexible with furnishing”, and the bathroom is “Own bathroom”.
There is one other point though, and that’s the question of location. I honestly don’t think that the most ideal room will cost the same if you compare between one near the city center and one that is in the outskirts. So i’m guessing there must be some kind of middle ground where you can get what you want for a reasonable rate.
First off, a rough density plot showing where most of the posts are concentrated.
#Density plot, total SYD_map <- ggmap::get_map(location = "Sydney, Australia", source = "google", maptype = "roadmap", crop = FALSE, color = "bw", zoom = 11) ggmap::ggmap(SYD_map) + stat_density2d(data=df, aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), geom = "polygon") + scale_fill_gradient(low = "#3BE819", high = "#B5170B") + theme(axis.text=element_blank(), panel.grid=element_blank(), axis.title=element_blank(), axis.ticks.x=element_blank(), axis.ticks.y=element_blank(), legend.position="none", panel.background=element_blank(), plot.title = element_text(face="bold", colour="black")) + labs(title = "Density plot of posts") + scale_alpha_continuous(range=c(0.1,0.4))
It looks like most of the posts are around the University of Sydney, Univeristy of Technology Sydney, and Surry Hills. However, the issue with a density plot using ggmaps is that i can’t break down this concentration. But you could do that using the Leaflet package.
#Leaflet cluster map leaflet(df) %>% addProviderTiles("CartoDB.DarkMatter") %>% addCircleMarkers( lng = ~as.numeric(df$lon), lat = ~as.numeric(df$lat), clusterOptions = markerClusterOptions(), radius = 3 ) %>% htmlwidgets::saveWidget(file = "Leaflet_Sydney.html", selfcontained = TRUE)
Bigger version: Leaflet Cluster Map.
If you’re not too familiar with a leaflet cluster map, you can perhaps try clicking on any of the clusters. You’ll notice that the clusters are then broken down further. It’s pretty obvious now that most of the posts are concentrated in the polygon shaped section making up Darlinghurst, Paddington, Randwick, Kensington, Mascot, Newtown, and Campterdown. To see if there really is any real difference between two posts with identical details but different locations, it’s best to just plot all the posts but map the weekly rental amount to a spectrum of colors.
#Leaflet circle markers mapped to rentals, plus html popup filter(df, weekly_rent <= 700) -> df_new #create column showing html popup details paste( paste("Location: ", df_new$title, sep = ""), paste("Room Type: ", df_new$room_type, sep = ""), paste("Bathroom: ", df_new$bathroom, sep = ""), paste("Furnishing: ", df_new$furnishing, sep = ""), paste("Internet: ", df_new$internet, sep = ""), paste("Parking: ", df_new$parking, sep = ""), paste("Preference: ", df_new$gender, sep = ""), paste("Bills: ", df_new$bills, sep = ""), paste("Bond: ", df_new$bond, sep = ""), paste("Weekly Rent: ", df_new$weekly_rent, sep = ""), paste("Availability: ", df_new$available, sep = ""), sep = "<br/>") -> df_new$hPop colorNumeric(c("#ff00ee", "#00e5ff", "#ffff00"), domain = df_new$weekly_rent) -> pal leaflet(df_new) %>% addProviderTiles("CartoDB.DarkMatter") %>% #Greyscale map addLegend("bottomright", pal = pal, values = ~weekly_rent, title = "Rental Price", labFormat = labelFormat(prefix = "AUD "), opacity = 1) %>% addCircleMarkers(lng = ~lon, lat = ~lat, #Latitudes and Longitudes radius = ~ifelse(weekly_rent <= 500, 4, 3), #Size of circles dependent on rental amount color = ~pal(weekly_rent), #color mapped to rental amount stroke = FALSE, fillOpacity = 0.5, popup = ~hPop) %>% htmlwidgets::saveWidget(file = "Leaflet_Sydney_Rentals.html", selfcontained = TRUE)
Bigger version: Leaflet Circle Makers with HTML Popup.
It would appear the suspicion that certain locations are more expensive than others could very well be true. If we start from the left and move on to the right towards the city center, the weekly rental amount gradually rises. This information, along with the average plots i made earlier, are normally enough for me to make an informed decision on what sort of posts should i be focusing on. But i’ve recently become more and more interested in plotting polylines. The reason being that it would be interesting to see if driving distance can be added in to the list of variables to be considered when moving to a place.
#Leaflet driving distances Key <- "GOOGLE_API_KEY_GOES_HERE" #Location of company's offices origin_lat <- -33.872795 origin_lon <- 151.207307 #list to be populated later gResponses <- list() #get data for each lat/lon in the dataframe for(i in 1:nrow(df_new)){ googleway::google_directions(origin = c(origin_lat, origin_lon), destination = c(df_new[i, c("lat")], df_new[i, "lon"]), key = Key, mode = "driving", simplify = TRUE) -> gResponses[[i]] } OK_Resp <- function(x){ lstReturn <- list() for(i in 1:length(x)){ if(x[[i]]$status == "OK"){ x[[i]] -> lstReturn[[length(lstReturn) + 1]] } } return(lstReturn) } #extract only responses with a status OK OK_Resp(gResponses) -> gResp_OK #filter the dataframe with status OK df_new[logResp,] -> df_new_plots #extract duration from API response lapply(gResp_OK, function(x){x$routes$legs[[1]]$duration$value/60}) -> duration #extract distance from API response lapply(gResp_OK, function(x){x$routes$legs[[1]]$distance$value/1000}) -> distance #Add distance and duration to the data frame df_new_plots$duration <- unlist(duration) df_new_plots$distance <- unlist(distance) #function to get only the polylines from the response getPoly <- function(list_element){ return(list_element$routes$overview_polyline) } #compile all polylines to one vector lapply(gResp_OK, FUN = getPoly) %>% unlist() %>% as.character() -> lstPoly #decode all polylines and compile to list lapply(lstPoly, FUN = decode_pl) -> lstCoords data.table::rbindlist(lstCoords, idcol = "id") -> dfPoly #into list with unique IDs lst_lines <- lapply(unique(dfPoly$id), function(x){ #Comment from stack over flow user: #"the order of the 'lon' and 'lat' fields is important" sp::Lines(sp::Line(dfPoly[id == x, .(lon, lat)]), ID = x) }) spl_lst <- sp::SpatialLines(lst_lines) filter(df_new_plots, weekly_rent <= 400) -> df_new_plots #Map distances to colors colorNumeric(c("#000000", "#faff00","#FF0000", "#17129e"), domain = df_new_plots$distance, alpha = FALSE) -> pal #Map rental rates to colors colorNumeric(c("#ff00ee", "#00e5ff", "#ffff00"), domain = df_new_plots$weekly_rent, alpha = FALSE) -> pal_rent leaflet(spl_lst) %>% addProviderTiles("CartoDB.DarkMatterNoLabels") %>% addPolylines(opacity = 0.2, weight = 3, color = pal(df_new_plots$distance)) %>% addCircleMarkers(lng = df_new_plots$lon, lat = df_new_plots$lat, color = pal_rent(df_new_plots$weekly_rent), radius = 2, popup = df_new_plots$hPop) %>% #legend for the rental amounts on the bottom right addLegend("bottomright", pal = pal_rent, values = df_new_plots$weekly_rent, title = "Rental Amount", labFormat = labelFormat(prefix = "AUD ")) %>% #legend for the driving distances on the top right addLegend("topright", pal = pal, values = df_new_plots$distance, title = "Driving Distance", labFormat = labelFormat(suffix = " Kilometers")) %>% htmlwidgets::saveWidget(file = "Leaflet_Sydney_Rent_Driving.html", selfcontained = FALSE)
Bigger version: Leaflet Circle Makers with HTML Popup and Polylines.
I still think that the map seems a bit too noisy, although it looks a little better if you zoom in closer. But regardless of how you look at this map or the data on the rental averages in this city, you’ll have to come to one conclusion, especially if you’re someone who’s living in Malaysia like myself. Rent in Sydney is just too damn high! Someone needs to contact Jimmy McMillan to open an Australian chapter of his famously hilarious, but very relatable political party.
As always, if you’d like a copy of the data set, do get in touch and i’ll try to send you a copy of the CSV file.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.