How to download image files with RoboBrowser
In a previous post, we showed how RoboBrowser can be used to fill out online forms to get historical weather data from Wunderground. This article shows how to use RoboBrowser to batch download collections of image files from Pexels, a site that offers free stock photos. If you're looking to work with images, or want to build a training set for an image classifier with Python, this post will help you do that.
In the first part of the code, we’ll load the RoboBrowser class from the robobrowser package, create a browser object which acts like a web browser, and navigate to the Pexels website.
# load the RoboBrowser class from robobrowser
from robobrowser import RoboBrowser

# define base site
base = "https://www.pexels.com/"

# create browser object,
# which serves as an invisible web browser
browser = RoboBrowser()

# navigate to pexels.com
browser.open(base)
If you actually go to the website, you’ll see there’s a search box. We can identify this search form using the get_form method. Once we have this form, we can check what fields it contains by printing the fields attribute.
form = browser.get_form()
print(form.fields)
Printing the fields shows us that the name of the field associated with the search box is "s". This means that if we want to programmatically type something into the search box, we need to set that field's value to our search input. Thus, if we want to search for pictures of "water", we can do it like so:
# set search input to water
form["s"] = "water"

# submit form
browser.submit_form(form)
After we've submitted the form, we can see what our current URL is by checking the url attribute.
print(browser.url)
Next, let’s examine the links on the page of results.
# get links on page
links = browser.get_links()

# get the URL of each link by scraping its href attribute
urls = [link.get("href") for link in links]

# filter out URLs from links without href attributes;
# these appear as Python's special None value
urls = [url for url in urls if url is not None]
Now that we have the URLs on the page, let’s filter them to only the ones showing the search result images on the page. We can do this by looking for where “/photo/” appears in each URL.
# filter to what we need -- photo URLs
photos = [url for url in urls if '/photo/' in url]

# get the corresponding photo link objects
photo_links = [link for link in links if link.get("href") in photos]
For our first example, let's click the first photo link (index 0 in Python) using the follow_link method.
browser.follow_link(photo_links[0])
Next, we want to find the link on the page that says "Free Download." This can be done using the browser's get_link method. All we need to do is pass the text of the link we're looking for, i.e. "Free Download" in this case.
# get the download link
download_link = browser.get_link("Free Download")

# hit the download link
browser.open(download_link.get("href"))
Once we've hit the link, we can write the file data to disk using browser.response.content:
# get a file name for the picture from the photo URL's slug
file_name = photos[0].split("/")[-2]

with open(file_name + '.jpeg', "wb") as image_file:
    image_file.write(browser.response.content)
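Deriving the file name from the photo URL is easy to get wrong, so it can help to pull that logic into a small helper. This is a sketch; the function name is mine, not part of the original post:

```python
def file_name_from_photo_url(photo_url, extension=".jpeg"):
    """Turn a Pexels photo URL such as '/photo/water-drop-123456/'
    into a file name such as 'water-drop-123456.jpeg'."""
    # strip any trailing slash, then take the last path segment (the slug)
    slug = photo_url.rstrip("/").split("/")[-1]
    return slug + extension
```

Unlike indexing with split("/")[-2], this works whether or not the URL ends with a trailing slash.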
Suppose we wanted to download all of the images on the results page. We can tweak our code above like this:
# download every image on the first screen of results
for url in photos:
    full_url = "https://www.pexels.com" + url
    try:
        browser.open(full_url)
        download_link = browser.get_link("Free Download")
        browser.open(download_link.get("href"))
        file_name = url.split("/")[-2] + '.jpg'
        with open(file_name, "wb") as image_file:
            image_file.write(browser.response.content)
    except Exception:
        pass
Instead of just downloading the images for one particular search query, we can modify our code above to create a generalized function, which will download the search result images associated with whatever input we choose.
Generalized function
def download_search_results(search_query):
    browser.open(base)
    form = browser.get_form()
    form['s'] = search_query
    browser.submit_form(form)

    links = browser.get_links()
    urls = [link.get("href") for link in links]
    urls = [url for url in urls if url is not None]

    # filter to what we need -- photo URLs
    photos = [url for url in urls if '/photo/' in url]

    for url in photos:
        full_url = "https://www.pexels.com" + url
        try:
            browser.open(full_url)
            download_link = browser.get_link("Free Download")
            browser.open(download_link.get("href"))
            file_name = url.split("/")[-2] + '.jpg'
            with open(file_name, "wb") as image_file:
                image_file.write(browser.response.content)
        except Exception:
            pass
Then, we can call our function like this:
download_search_results("water")
download_search_results("river")
download_search_results("lake")
As mentioned previously, using RoboBrowser to download images from Pexels can also be really useful if you’re looking to build your own image classifier and you need to collect images for a training set. For instance, if you want to build a classifier that predicts if an image contains a dog or not, you could scrape Pexels with our function above to get training images, like this:
download_search_results("dog")
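For a classifier, you generally want each label's images in their own folder. Here is a hedged sketch of how you might wrap a download function to do that; the wrapper name and directory layout are my own, not from the original post, and the download function is passed in as a parameter (e.g. the download_search_results function defined above):

```python
import os

def download_labeled(search_query, download_fn, root="training_data"):
    """Run download_fn(search_query) inside root/search_query/ so each
    label's images land in their own folder -- the layout most
    image-classifier data loaders expect."""
    target = os.path.join(root, search_query)
    os.makedirs(target, exist_ok=True)
    cwd = os.getcwd()
    os.chdir(target)  # the download functions above write to the cwd
    try:
        download_fn(search_query)
    finally:
        os.chdir(cwd)  # restore the working directory even on error
```

For example, download_labeled("dog", download_search_results) would put the dog images under training_data/dog/.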
So far our code has one main drawback: our function pulls just one page's worth of results. If we want to pull additional pages, we need to adjust the search URL. For instance, to get the second page of dog images, we need to hit https://www.pexels.com/search/dog/?page=2. To pull additional pages, our function needs to hit these sequential URLs (?page=2, ?page=3, ...) until either we've downloaded all of the images or we've reached a certain page limit. Given that there might be thousands of images associated with a particular search query, you'll usually want to cap how many pages of search results you hit.
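Constructing those sequential page URLs is plain string work, so it can be sketched on its own first. The helper name is mine, not from the post:

```python
def page_urls(search_url, max_pages):
    """Build the '?page=N' URLs for the first max_pages result pages,
    given the URL of the first page of search results."""
    return [search_url + "?page=" + str(n) for n in range(1, max_pages + 1)]
```

For example, page_urls("https://www.pexels.com/search/dog/", 2) yields the URLs for the first two pages of dog results.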
Generalized function – download all page results up to a limit
def download_all_search_results(search_query, MAX_PAGE_INDEX=30):
    browser.open(base)
    form = browser.get_form()
    form['s'] = search_query
    browser.submit_form(form)

    # URL of the first page of search results
    search_url = browser.url

    # create page index counter
    page_index = 1

    # temporary holder for the photo URLs
    photos = [None]

    # loop will break before MAX_PAGE_INDEX if there are
    # fewer pages of results
    while photos != [] and page_index <= MAX_PAGE_INDEX:
        browser.open(search_url + "?page=" + str(page_index))
        links = browser.get_links()
        urls = [link.get("href") for link in links]
        urls = [url for url in urls if url is not None]

        # filter to what we need -- photo URLs
        photos = [url for url in urls if '/photo/' in url]
        if photos == []:
            break

        for url in photos:
            full_url = "https://www.pexels.com" + url
            try:
                browser.open(full_url)
                download_link = browser.get_link("Free Download")
                browser.open(download_link.get("href"), stream=True)
                file_name = url.split("/")[-2] + '.jpg'
                with open(file_name, "wb") as image_file:
                    image_file.write(browser.response.content)
            except Exception:
                pass

        print("page index ==> " + str(page_index))
        page_index += 1
Now we can call our function similar to above:
# download all the images from the first 30 pages
# of results for 'dog'
download_all_search_results("dog")

# or -- just get the first 10 pages of results
download_all_search_results("dog", 10)
You could also limit the number of downloads directly, i.e. cap the total at 100, 200, 300, or however many images you want to download.
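One simple way to do that is to slice the list of photo URLs before the download loop runs, for example with itertools.islice. This is a sketch of the pattern, not code from the post:

```python
from itertools import islice

def cap_downloads(photo_urls, max_downloads):
    """Return at most max_downloads photo URLs to feed into
    the download loop, ignoring anything beyond the cap."""
    return list(islice(photo_urls, max_downloads))
```

Inside the paginated function above, you would apply this to the accumulated photo URLs and stop once the cap is reached.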
See more Python articles on this site here. If you’re interested in learning more about web scraping with Python, check out the book below, or click here for recommended books on Python and open source programming.
The post How to download image files with RoboBrowser appeared first on Open Source Automation.