Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Decided to have a go at plotting another interactive map, this time using fresh data involving room rentals in Singapore. I’ve gone about extracting the data in the same way as i normally do, the only difference being that i’ve just learned about the gmailr package which allows you to send emails using R. Useful when you’re not at your desk but would still like to know the progress of your script.
Nothing special about the first part of the script. Just the loop that scrapes the data from the site, and a little cleaning of the data.
library(dplyr) library(rvest) library(stringr) library(leaflet) library(htmltools) library(htmlwidgets) library(gmailr) TP <- 8929 #Total number of posts on the site (a) TP_page <- 20 #total number of posts per search page (b) T_pages <- round(TP/TP_page,0) + 1 #Total number of pages (a/b) matrix(NA, 1,3) %>% data.frame() -> complete #empty data frame to be rbinded later names(complete) <- c("rentals", "location", "links") #rename columns for(i in 1:T_pages){ #URL for the site (Not the real site) url <- paste("http://www.singaporerentalsite?p=", i, sep = "") #That's not the real site #Get HTML of site hUrl <- html(url) #extraction of rentals hUrl %>% html_nodes(".col-right.col-75.col--right-pad") %>% html_nodes(".listing-img__price.listing-img__price--large") %>% html_text() -> rentals #extraction of locations hUrl %>% html_nodes(".col-right.col-75.col--right-pad") %>% html_nodes("h3") %>% html_text() -> location #extraction of links hUrl %>% html_nodes(".page-container") %>% html_nodes(".listing-meta__desc.listing-meta__desc--tall") %>% html_nodes("a") %>% html_attr("href") -> links #assign results to a separate dataframe data.frame(rentals, location, links) -> extract #row bind the empty dataframe with extracted dataframe complete <- rbind(complete, extract) #Just for monitoring the progress print(paste("Completed iteration No.", i, sep = "")) } #Compose mail mime() %>% to("youremail@someplace.com") %>% from("me@somewhere.com") %>% text_body("First loop is completed!") -> text_msg #Send mail send_message(text_msg) #delete the first row because it has no values complete[2:nrow(complete),] -> complete #isolate rentals complete[, "rentals"] -> rentals #remove unwanted characters/symbols and convert to numeric rentals %>% str_replace_all(pattern = "\\$", replacement = "") %>% str_replace_all(pattern = ",", replacement = "") %>% str_replace_all(pattern = " pcm", replacement = "") %>% as.numeric() -> rentals #create new column holding the cleaned rental amounts complete[, "rentals_clean"] <- rentals #isolate the locations complete[, "location"] -> locations #remove trailing and leading whitespaces locations %>% str_trim() -> complete[, "location_clean"] #add prefix (e.g.http://www.yadayada.com) to the links paste("http://www.blahblah.com", complete[, "links"], sep = "") -> complete[, "ref_links"]
At this point, i have the links to each individual post, among other things. Extracting the latitudes and longitudes from each post is a little trickier for one annoying reason: the coordinates are in some kind of javascript code.
I noticed that the total length of the coordinates are fixed to a certain number of digits. So i figured that it’s just a matter of using the str_locate
function to find the location of the words “longitude” and “latitude” in the javascript function, and then work from there.
#empty columns to be populated later complete[, "long"] <- NA complete[, "lat"] <- NA complete[, "title"] <- NA complete[, "residents"] <- NA complete[, "ideal"] <- NA for(i in 1:nrow(complete)){ #get the HTML for the post hUrl <- html(complete[i, "ref_links"]) #location of the latitude hUrl %>% as.character() %>% str_locate(pattern = "latitude") -> lat_loc #location of the longitude hUrl %>% as.character() %>% str_locate(pattern = "longitude") -> lon_loc #assign latitides and longitudes to the dataframes substr(hUrl %>% as.character(), lat_loc[2] + 4, lat_loc[2] + 19) -> complete[i, "lat"] substr(hUrl %>% as.character(), lon_loc[2] + 4, lon_loc[2] + 19) -> complete[i, "long"] #Post title hUrl %>% html_nodes("title") %>% html_text() -> complete[i, "title"] #Who lives there hUrl %>% html_nodes(".detail") %>% html_nodes(".detail__row") -> test test[2] %>% html_nodes(".detail__text") %>% html_text() -> complete[i, "residents"] #Ideal Room mates test[3] %>% html_nodes(".detail__text") %>% html_text() -> complete[i, "ideal"] #Monitoring the progress print(paste("Completed iteration No.", i, sep = "")) } #Compose mail for completion of second loop mime() %>% to("youremail@someplace.com") %>% from("me@somewhere.com") %>% text_body("Second loop is completed!") -> text_msg #Send mail send_message(text_msg)
The issue with isolating the coordinates this way is that i’ll end up with some special characters at the end of each longitude and latitude. Something that looks like this:
> complete[8:10, c("long", "lat")]
103.852615356445 1.280886054039",
103.852615356445 1.280886054039",
103.852867126465 1.28101551532745
Quite a number of them have quotation marks and commas, and i wasn’t sure how many other special characters there are. I couldn’t think of any smart, elegant way of getting rid of these characters. I could only think of combining all the longitudes and latitudes into one long string, separating each character in to an individual string, and check if any of these strings can’t be turned in to class numeric; which should identify the special characters. I then assigned all those special characters to a single vector, and looped through it to replace each special character in the longitude and latitude columns.
#islolate the lat and long complete[, "lat"] %>% str_trim() -> lats complete[, "long"] %>% str_trim() -> longs #function to combine all the elements into one string cChar<- function(vector){ unique(vector) -> vector vec <- c(1) for(i in 1:length(vector)){ if(i == 1){paste(vec[1], vector[i], sep = "") -> vec_com}else{ paste(vec_com, vector[i], sep = "") -> vec_com } } return(vec_com) } #run function on the latitudes ad longitudes cChar(lats) -> clats cChar(longs) -> clongs #split all the elements in each string and turn into one list strsplit(c(clats, clongs), "") %>% unique() -> com_l #turn list into vector unlist(com_l) -> com_l #don't need to remove decimals, remove them from cleaning process com_l != "." -> nums_l com_l[nums_l] -> com_l_d #get a logical of all the elements that can't be converted into a number com_l_d %>% as.numeric() %>% is.na() -> sym_l #isolate the special characters com_l_d[sym_l] %>% unique() -> excls for(i in 1:length(lats)){ for(j in 1:length(excls)){ str_replace(lats[i], excls[j], "") -> lats[i] str_replace(longs[i], excls[j], "") -> longs[i] } } #convert to numeric as.numeric(lats) -> lats as.numeric(longs) -> longs #add new columns holding cleaned values complete[, "clean_lats"] <- lats complete[, "clean_longs"] <- longs
Now that i have the complete dataframe with all the information i need, i can start creating the column that holds the HTML for the popups on the leaflet map.
#concatenate different columns and add HTML, for the circle popups paste("<b><a href=", complete[, "ref_links"], ">", complete[, "title"], "</a></b>", sep = "") -> hLinks paste(sep = "<br/>", hLinks, paste("Rental: ", complete[, "rentals"], sep = ""), paste("Tenants: ", complete[, "residents"], sep = ""), paste("Ideal Roommates: ", complete[, "ideal"])) -> hLinks_new complete[, "hPopup"] <- hLinks_new
Now for the fun part: the interactive map using Leaflet.
#filter only rentals equal to or less than SGD 2000. filter(complete, rentals_clean <= 1500) -> df_plot #color scheme for the legend colorNumeric(palette = rainbow(3), domain = df_plot$rentals_clean) -> pal leaflet(df_plot) %>% addProviderTiles("CartoDB.Positron") %>% #Greyscale map addLegend("bottomright", pal = pal, values = ~rentals_clean, title = "Rental Price", labFormat = labelFormat(prefix = "SGD "), opacity = 1) %>% addCircleMarkers(lng = ~clean_longs, lat = ~clean_lats, #Latitudes and Longitudes radius = ~ifelse(rentals_clean <= 1000, 4, 3), #Size of circles dependent on rental amount color = ~pal(rentals_clean), #color mapped to rental amount stroke = FALSE, fillOpacity = 0.5, popup = ~hPopup) -> SG_map saveWidget(SG_map, file = "SG_map.html", selfcontained = FALSE)
The colors of the circles have been mapped to the rental rate, and clicking on any of them will popup a message with the title of the post, short description of the current residents, and the posters ideal roommate(s). There were actually a whole lot more posts than the ones that are in the map, but i filtered them out because the range of the rental rates was too large. So the rentals were limited to <= 1500, in the interest of a more meaningful map. All in all, I think this worked out quite OK.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.