Navigating & Scraping a Job Site | rvest & RSelenium
One of my family members gave me the idea of scraping data from a job site and arranging it in a way that can then easily be filtered and checked using a spreadsheet. I’m actually a little embarrassed that I didn’t think of this idea myself. Needless to say, I was eager to try it out.
I picked a site and started inspecting the HTML code to see how I would get the information I needed from each job posting. Normally, the easiest scrapes (for me) are the ones where the site has two characteristics.
First, it helps if all (or at least most) of the information that I need to extract is on the site’s search results page. For instance, in the context of job postings, if you search for “Data Scientist”, and the search results show the job title, the company that’s hiring, the years of experience required, the location, and a short summary – then there is no real need to navigate to each post and get that data from the post itself.
The second characteristic is that the URL of the search results shows which results page you are currently on – or at least gives some indication of which search result number you are looking at. For instance, google “Data Scientist” and take note of the URL. Scroll down and click the second page, and notice that the URL now ends with “start=10”. Go to the third page and you’ll notice that it now ends with “start=20”. Although it doesn’t mention the page itself, it does indicate that if you were to change those last two digits to anything (go ahead and try), the search results would begin from start + 1; i.e. if start = 10, the results would begin with search result no. 11. If I’m lucky, some websites have clear indications in the URL, like “page=2”, which makes the task even easier.
Now why would these two characteristics make things so much easier? Mainly because you can split the URL into different parts, with only one variable – the page number – and then concatenate the parts back together. After that it’s just a matter of looping through these URLs and picking up the information you need from the HTML source.
If the above two characteristics exist, all I need is the rvest package to make it all work, with dplyr and stringr for some of the “tidying”.
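To illustrate the idea, here is a minimal sketch of that kind of loop. The base URL, the page parameter, and the “.job_title” selector are made-up placeholders rather than the actual site’s, and read_html() is the current rvest equivalent of the html() function used in the code later in this post.

#Minimal sketch of the "URL with a page number" approach
#(base_url and ".job_title" are placeholders, not the real site's)
library(rvest)

base_url = "http://www.example-jobsite.com/search?q=data+scientist&page="
pages = 1:5

all_titles = c()
for(p in pages){
  page = read_html(paste0(base_url, p)) #the page number is the only variable
  page %>% html_nodes(".job_title") %>% html_text() -> titles
  all_titles = c(all_titles, titles)
}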
There are certain instances, however, when neither of these characteristics exists. It’s usually because the site incorporates some JavaScript, and so the URL does not change as you move through the different search pages. This means that in order to make this work, I would actually have to click the page buttons to get the HTML source – and I can’t do that with rvest.
Enter RSelenium, the wonderful R package that allows me to do exactly that.
As always, I started off by loading the packages, assigning the URL for the search results page, and extracting the data for just the first page. You’ll have to excuse me for using the “=” operator. WordPress seems to screw up the formatting if I use the “less than” operator combined with a hyphen, which is sort of annoying.
#Load packages
library(dplyr)
library(rvest)
library(stringr)
library(RSelenium)
library(beepr)

#Manually paste the URL for the search results here
link = "jobsite search results URL here"

#Get the html source of the URL
hlink = html(link)

#Extract Job Title
hlink %>%
  html_nodes("#main_section") %>%
  html_nodes(".tpjob_item") %>%
  html_nodes(".tpjob_title") %>%
  html_text() %>%
  data.frame(stringsAsFactors = FALSE) -> a
names(a) = "Title"

#Extract Recruitment Company
hlink %>%
  html_nodes("#main_section") %>%
  html_nodes(".tpjobwrap") %>%
  html_nodes(".tpjob_cname") %>%
  html_text() %>%
  data.frame(stringsAsFactors = FALSE) -> b
names(b) = "Company"

#Extract Links to Job postings
hlink %>%
  html_nodes("#main_section") %>%
  html_nodes(".tpjob_item") %>%
  html_nodes(".tpjob_lnk") %>%
  html_attr("href") %>%
  data.frame(stringsAsFactors = FALSE) -> c
names(c) = "Links"
At this point I’ve only extracted the job titles, the hiring companies’ names, and the links to the posts. To get the same details for the remaining posts, I would first need to navigate to the next page, which involves clicking the Next button at the bottom of the search results page.
#From RSelenium
checkForServer() #Check if server file is available
startServer()    #Start the server

mybrowser = remoteDriver(browser = "chrome") #Change the browser to chrome
mybrowser$open(silent = TRUE) #Open the browser
Sys.sleep(5) #Wait a few seconds

mybrowser$navigate(link) #Navigate to URL
Sys.sleep(5)

Pages = 16 #Select how many pages to go through

for(i in 1:Pages){

  #Find the "Next" button and click it
  #Note: the assignment inside try() has to use "<-"; with "=" R would read wxbutton as an argument to try()
  try(wxbutton <- mybrowser$findElement(using = 'css selector', "a.pagination_item.next.lft"))
  try(wxbutton$clickElement()) #Click
  Sys.sleep(8)

  hlink = html(mybrowser$getPageSource()[[1]]) #Get the html source from site

  hlink %>% html_text() -> service_check

  #If there is a 503 error, go back
  if(grepl("503 Service", service_check)){
    mybrowser$goBack()
  } else {

    #Job Title
    hlink %>%
      html_nodes("#main_section") %>%
      html_nodes(".tpjob_item") %>%
      html_nodes(".tpjob_title") %>%
      html_text() %>%
      data.frame(stringsAsFactors = FALSE) -> x
    names(x) = "Title"
    a = rbind(a, x) #Add the new job postings to the ones extracted earlier

    #Recruitment Company
    hlink %>%
      html_nodes("#main_section") %>%
      html_nodes(".tpjobwrap") %>%
      html_nodes(".tpjob_cname") %>%
      html_text() %>%
      data.frame(stringsAsFactors = FALSE) -> y
    names(y) = "Company"
    b = rbind(b, y)

    #Links
    hlink %>%
      html_nodes("#main_section") %>%
      html_nodes(".tpjob_item") %>%
      html_nodes(".tpjob_lnk") %>%
      html_attr("href") %>%
      data.frame(stringsAsFactors = FALSE) -> z
    names(z) = "Links"
    c = rbind(c, z)
  }
}
beep()

#Put everything together in one dataframe
compile = cbind(a, b, c)

#Export a copy, for backup
write.csv(compile, "Backup.csv", row.names = FALSE)

#Close server and browser
mybrowser$close()
mybrowser$closeServer()
Now that I have all the links to the posts, I can loop through the previously compiled dataframe and get all the details from each of the URLs.
#Make another copy to loop through
compile_2 = compile

#Create 8 new columns to represent the details to be extracted
compile_2$Location = NA
compile_2$Experience = NA
compile_2$Education = NA
compile_2$Stream = NA
compile_2$Function = NA
compile_2$Role = NA
compile_2$Industry = NA
compile_2$Posted_On = NA

#3 loops, 2 in 1
#First loop to go through the links extracted
for(i in 1:nrow(compile_2)){

  hlink = ""
  link = compile_2$Links[i]
  try(hlink <- html(link)) #"<-" needed inside try(); "=" would be read as an argument to try()

  if(html_text(hlink) != ""){

    hlink %>%
      html_nodes(".jd_infoh") %>%
      html_text() %>%
      data.frame(stringsAsFactors = FALSE) -> a_column

    hlink %>%
      html_nodes(".jd_infotxt") %>%
      html_text() %>%
      data.frame(stringsAsFactors = FALSE) -> l_column

    if(nrow(a_column) != 0){

      #Second loop to check if the details are in the same order in each page
      for(j in nrow(l_column):1){
        if(nchar(str_trim(l_column[j,1])) == 0){
          l_column[-j,] %>% data.frame(stringsAsFactors = FALSE) -> l_column
        }
      }

      if(nrow(a_column) == nrow(l_column)){

        cbind(a_column, l_column) -> comp_column

        #Third loop to update dataframe with all the details from each post
        for(k in 1:nrow(comp_column)){
          if(grepl("Location", comp_column[k,1])){compile_2$Location[i] = comp_column[k,2]}
          if(grepl("Experience", comp_column[k,1])){compile_2$Experience[i] = comp_column[k,2]}
          if(grepl("Education", comp_column[k,1])){compile_2$Education[i] = comp_column[k,2]}
          if(grepl("Stream", comp_column[k,1])){compile_2$Stream[i] = comp_column[k,2]}
          if(grepl("Function", comp_column[k,1])){compile_2$Function[i] = comp_column[k,2]}
          if(grepl("Role", comp_column[k,1])){compile_2$Role[i] = comp_column[k,2]}
          if(grepl("Industry", comp_column[k,1])){compile_2$Industry[i] = comp_column[k,2]}
          if(grepl("Posted", comp_column[k,1])){compile_2$Posted_On[i] = comp_column[k,2]}
        }
      }
    }
  }
}
beep()

#Export a copy for backup
write.csv(compile_2, "Raw_Complete.csv", row.names = FALSE)

#Alert
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.3)
beep(sound = 8) #That one's just me goofing around
Alright, we now have a nice dataframe of 1840 jobs and 11 columns showing:
1. Job Title
2. Company: The hiring company.
3. Links: The URL of the job posting.
4. Location: Where the job is situated.
5. Experience: Level of experience required for the job, shown as a range (e.g. 2-3 years).
6. Education: Minimum educational qualification.
7. Stream: Work stream category.
8. Function: Job function category.
9. Role: The job’s general role.
10. Industry: Which industry the hiring company is involved in.
11. Posted_On: The day the job was originally posted.
As a matter of convenience, I decided to split the 5th column, Experience, into two additional columns:
12. Min: Minimum years of experience required.
13. Max: Maximum years of experience.
The code used to mutate this Experience column was:
com_clean = compile_2

#Logical vector of all the observations with no details extracted because of an error
is.na(com_clean[,4]) -> log_vec

#Place the NA rows in a separate dataframe
com_clean_NA = com_clean[log_vec,]

#Place the remaining rows in another dataframe
com_clean_OK = com_clean[!log_vec,]

com_clean_OK[,"Experience"] -> Exp

#Remove whitespace and the "years" part
str_replace_all(Exp, " ", "") %>%
  str_replace_all(pattern = "years", replacement = "") -> Exp

#Assign the location of the hyphen to a list
str_locate_all(Exp, "-") -> hyphens

#Assign empty vectors to be populated with a loop
Min = c()
Max = c()

for(i in 1:length(Exp)){
  substr(Exp[i], 0, hyphens[[i]][1,1] - 1) %>% as.integer() -> Min[i]
  substr(Exp[i], hyphens[[i]][1,1] + 1, nchar(Exp[i])) %>% as.integer() -> Max[i]
}

#Assign results to new columns
com_clean_OK$Min_Experience = Min
com_clean_OK$Max_Experience = Max

#Rearrange the columns
select(com_clean_OK, 1:4, 12:13, 5:11) -> com_clean_OK

write.csv(com_clean_OK, "Complete_No_NA.csv", row.names = FALSE)
write.csv(com_clean_NA, "Complete_All_NA.csv", row.names = FALSE)
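As an aside, if every Experience entry really does follow the same “X - Y years” pattern (an assumption on my part), the same split could be done without the loop by pulling the two numbers out with stringr’s str_match():

#Sketch of a loop-free alternative, assuming every entry looks like "2 - 3 years"
str_match(com_clean_OK$Experience, "(\\d+)\\s*-\\s*(\\d+)") -> exp_range
com_clean_OK$Min_Experience = as.integer(exp_range[, 2]) #first captured number
com_clean_OK$Max_Experience = as.integer(exp_range[, 3]) #second captured number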
And with that, I have a nice dataframe with all the information I need to go through the posts. I was flirting with the idea of even trying to write some code that would automatically apply for a job if it meets certain criteria: e.g. if the job title equals X, minimum experience is less than Y, and the location is in a list Z, then click this, and so on. Obviously, there is the question of how to get past the Captcha walls, as a colleague once highlighted. In any case, I thought I should leave this idea for a different post. Till then, I’ll be doing some intense googling to see if someone else has actually tried it out (using R, or even Python) and maybe pick up a few things.
Tagged: browser, data, headless, headless browser, programming, r, RSelenium, rstats, scraping, web scraping