
Interactive Map of Singapore Rentals | rvest, stringr, leaflet, gmailr

[This article was first published on r – Recommended Texts, and kindly contributed to R-bloggers.]

Decided to have a go at plotting another interactive map, this time using fresh data on room rentals in Singapore. I went about extracting the data the same way I normally do; the only difference is that I've just learned about the gmailr package, which lets you send emails from R. That's useful when you're not at your desk but would still like to know how your script is progressing.
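One caveat worth mentioning: gmailr needs a one-time OAuth handshake with the Gmail API before it can send anything. A minimal sketch of the setup, assuming you've registered a project in the Google Developers Console and downloaded its client secret JSON (the filename below is just a placeholder):

library(gmailr)
use_secret_file("client_secret.json") #registers the OAuth client; a browser prompt follows on first use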

Nothing special about the first part of the script: just the loop that scrapes the site, followed by a little cleaning of the results.

library(dplyr)
library(rvest)
library(stringr)
library(leaflet)
library(htmltools)
library(htmlwidgets)
library(gmailr)

TP <- 8929 #Total number of posts on the site (a)
TP_page <- 20 #total number of posts per search page (b)
T_pages <- ceiling(TP/TP_page) #Total number of pages (a/b, rounded up)

matrix(NA, 1,3) %>% data.frame() -> complete #empty data frame to be rbinded later
names(complete) <- c("rentals", "location", "links") #rename columns

for(i in 1:T_pages){
    
    #URL for the search page (not the real site)
    url <- paste("http://www.singaporerentalsite?p=", i, sep = "")
    
    #Get HTML of site (read_html replaces the deprecated html())
    hUrl <- read_html(url)
    
    #extraction of rentals
    hUrl %>% html_nodes(".col-right.col-75.col--right-pad") %>% 
      html_nodes(".listing-img__price.listing-img__price--large") %>% 
      html_text() -> rentals
    
    #extraction of locations
    hUrl %>% html_nodes(".col-right.col-75.col--right-pad") %>% 
      html_nodes("h3") %>% html_text() -> location
    
    #extraction of links
    hUrl %>% html_nodes(".page-container") %>% 
      html_nodes(".listing-meta__desc.listing-meta__desc--tall") %>% 
      html_nodes("a") %>% html_attr("href") -> links
    
    #assign results to a separate dataframe
    data.frame(rentals, location, links, stringsAsFactors = FALSE) -> extract #keep strings as characters for rbind and cleaning
    
    #row bind the empty dataframe with extracted dataframe
    complete <- rbind(complete, extract)
    
    #Just for monitoring the progress
    print(paste("Completed iteration No.", i, sep = ""))

}

#Compose mail
mime() %>%
  to("youremail@someplace.com") %>%
  from("me@somewhere.com") %>%
  text_body("First loop is completed!") -> text_msg

#Send mail
send_message(text_msg)

#delete the first row because it has no values
complete[2:nrow(complete),] -> complete

#isolate rentals
complete[, "rentals"] -> rentals

#remove unwanted characters/symbols and convert to numeric
rentals %>% str_replace_all(pattern = "\\$", replacement = "") %>% 
  str_replace_all(pattern = ",", replacement = "") %>% 
  str_replace_all(pattern = " pcm", replacement = "") %>% as.numeric() -> rentals

#create new column holding the cleaned rental amounts
complete[, "rentals_clean"] <- rentals

#isolate the locations
complete[, "location"] -> locations

#remove trailing and leading whitespaces
locations %>% str_trim() -> complete[, "location_clean"]

#add the site prefix (e.g. http://www.yadayada.com) to the relative links
paste("http://www.blahblah.com", 
      complete[, "links"], sep = "") -> complete[, "ref_links"]

At this point, I have the links to each individual post, among other things. Extracting the latitudes and longitudes from each post is a little trickier for one annoying reason: the coordinates sit inside some JavaScript code.

I noticed that the total length of the coordinates is fixed at a certain number of digits. So I figured it's just a matter of using the str_locate function to find the positions of the words "longitude" and "latitude" in the JavaScript, and then working from there.
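To illustrate the idea on a made-up snippet (the real pages' JavaScript will differ, so the +4 offset and the 16-character width here are only illustrative):

#a made-up snippet of the kind of JavaScript a post might embed
js <- '"latitude": 1.280886054039, "longitude": 103.852615356445'

str_locate(js, "latitude") -> loc   #matrix holding the start and end position of the word
substr(js, loc[2] + 4, loc[2] + 19) #skip past the '": ' and grab a fixed-width chunk
#returns "1.280886054039, " - note the trailing junk that tags along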

#empty columns to be populated later
complete[, "long"] <- NA
complete[, "lat"] <- NA

complete[, "title"] <- NA
complete[, "residents"] <- NA
complete[, "ideal"] <- NA

for(i in 1:nrow(complete)){
  
    #get the HTML for the post   
    hUrl <- read_html(complete[i, "ref_links"])
  
    #location of the latitude
    hUrl %>% as.character() %>% 
      str_locate(pattern = "latitude") -> lat_loc
    
    #location of the longitude
    hUrl %>% as.character() %>% 
      str_locate(pattern = "longitude") -> lon_loc
    
    #assign latitudes and longitudes to the dataframe
    substr(hUrl %>% as.character(), lat_loc[2] + 4, lat_loc[2] + 19) -> complete[i, "lat"]
    substr(hUrl %>% as.character(), lon_loc[2] + 4, lon_loc[2] + 19) -> complete[i, "long"]
    
    #Post title
    hUrl %>% html_nodes("title") %>% html_text() -> complete[i, "title"]
    
    #Who lives there
    hUrl %>% html_nodes(".detail") %>% html_nodes(".detail__row") -> test
    test[2] %>% html_nodes(".detail__text") %>% html_text() -> complete[i, "residents"]
    
    #Ideal Room mates
    test[3] %>% html_nodes(".detail__text") %>% html_text() -> complete[i, "ideal"]

    #Monitoring the progress
    print(paste("Completed iteration No.", i, sep = ""))
}

#Compose mail for completion of second loop
mime() %>%
  to("youremail@someplace.com") %>%
  from("me@somewhere.com") %>%
  text_body("Second loop is completed!") -> text_msg

#Send mail
send_message(text_msg)

The issue with isolating the coordinates this way is that I end up with some special characters at the end of some longitudes and latitudes. Something that looks like this:

> complete[8:10, c("long", "lat")]
               long               lat
8  103.852615356445   1.280886054039",
9  103.852615356445   1.280886054039",
10 103.852867126465  1.28101551532745

Quite a number of them have quotation marks and commas, and I wasn't sure how many other special characters were lurking. I couldn't think of a smart, elegant way of getting rid of them, so I settled for a brute-force approach: combine all the longitudes and latitudes into one long string, split it into individual characters, and flag any character that can't be coerced to numeric; those are the special characters. I then assigned all of those to a single vector and looped through it, removing each special character from the longitude and latitude columns.


#isolate the lat and long
complete[, "lat"] %>% str_trim() -> lats
complete[, "long"] %>% str_trim() -> longs

#function to combine all the elements into one string
cChar <- function(vector){
  
  unique(vector) -> vector
  vec_com <- "" #start from an empty string, not a stray value
  
  for(i in 1:length(vector)){
    
    paste(vec_com, vector[i], sep = "") -> vec_com
    
  }
  
  return(vec_com) #paste(vector, collapse = "") would do the same in one call
  
}

#run function on the latitudes and longitudes
cChar(lats) -> clats
cChar(longs) -> clongs

#split all the elements in each string and turn into one list
strsplit(c(clats, clongs), "") %>% unique() -> com_l

#turn list into vector
unlist(com_l) -> com_l

#decimal points are valid in numbers, so drop them from the special-character search
com_l != "." -> nums_l
com_l[nums_l] -> com_l_d

#get a logical of all the elements that can't be converted into a number
com_l_d %>% as.numeric() %>% is.na() -> sym_l

#isolate the special characters
com_l_d[sym_l] %>% unique() -> excls

for(i in 1:length(lats)){
  
  for(j in 1:length(excls)){
    
    #str_replace_all with fixed() removes every occurrence, not just the first
    str_replace_all(lats[i], fixed(excls[j]), "") -> lats[i]
    str_replace_all(longs[i], fixed(excls[j]), "") -> longs[i]
    
  }
  
}

#convert to numeric
as.numeric(lats) -> lats
as.numeric(longs) -> longs

#add new columns holding cleaned values
complete[, "clean_lats"] <- lats
complete[, "clean_longs"] <- longs

Now that I have the complete dataframe with all the information I need, I can start creating the column that holds the HTML for the popups on the leaflet map.

#concatenate different columns and add HTML, for the circle popups
paste("<b><a href=", complete[, "ref_links"], ">", complete[, "title"], "</a></b>", sep = "") -> hLinks
paste(sep = "<br/>", hLinks, paste("Rental: ", complete[, "rentals"], sep = ""), 
      paste("Tenants: ", complete[, "residents"], sep = ""), 
      paste("Ideal Roommates: ", complete[, "ideal"])) -> hLinks_new
complete[, "hPopup"] <- hLinks_new

Now for the fun part: the interactive map using Leaflet.

#filter only rentals equal to or less than SGD 1500. 
filter(complete, rentals_clean <= 1500) -> df_plot

#color scheme for the legend
colorNumeric(palette = rainbow(3), domain = df_plot$rentals_clean) -> pal

leaflet(df_plot) %>% 
  addProviderTiles("CartoDB.Positron") %>% #Greyscale map
  
  addLegend("bottomright", pal = pal, values = ~rentals_clean,
            title = "Rental Price",
            labFormat = labelFormat(prefix = "SGD "),
            opacity = 1) %>%
  
  addCircleMarkers(lng = ~clean_longs, lat = ~clean_lats, #Latitudes and Longitudes
    radius = ~ifelse(rentals_clean <= 1000, 4, 3), #Size of circles dependent on rental amount
    color = ~pal(rentals_clean), #color mapped to rental amount
    stroke = FALSE, fillOpacity = 0.5, 
    popup = ~hPopup) -> SG_map 

saveWidget(SG_map, file = "SG_map.html", selfcontained = FALSE)

The colors of the circles are mapped to the rental rate, and clicking on any of them pops up a message with the title of the post, a short description of the current residents, and the poster's ideal roommate(s). There were actually many more posts than the ones on the map, but I filtered them out because the range of the rental rates was too large. So the rentals were limited to SGD 1500 or less, in the interest of a more meaningful map. All in all, I think this worked out quite OK.
