Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I love Nottingham Forest and have been trying to find a way to include them in one of my tutorials, as they are in the play-offs for promotion to the top flight. This tutorial shows you how to use Selenium to automate Google Chrome and download images from Google Images.
Can I get a live tutorial?
The live tutorial is here:
Creating the bones of the project
The first stage would be to define the dependencies:
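The import block itself is not shown here. The later steps mention Selenium, the requests package and Pillow, so as a minimal sketch you could confirm those three are importable before going further. The `REQUIRED` list and the `missing_packages` helper are my own names for illustration, not something taken from the repository:

```python
import importlib.util

# Import names (not pip names) assumed from the packages mentioned in this tutorial.
REQUIRED = ["selenium", "requests", "PIL"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Prints the missing names, or a short confirmation when everything is available.
print(missing_packages() or "All packages found")
```

If anything is listed as missing, installing from the requirements.txt in the repository should resolve it.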
Downloading the web driver
The next step is to download ChromeDriver for your copy of Google Chrome. Each ChromeDriver release is built for a particular Chrome version, so pick the one that matches the browser you have installed. To download the driver see here: https://chromedriver.chromium.org/downloads.
The next step is to install the relevant packages. I have made this simple for you with a requirements.txt in the associated repository of this project.
Set the web driver path
Then we would need to set the web driver path to where you have stored this on your local machine, or server:
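As a rough sketch of this step, assuming Selenium 4 (where the driver location is passed through a `Service` object): `DRIVER_PATH`, `resolve_driver_path` and `make_driver` are hypothetical names, and the path shown is only a placeholder for wherever you saved ChromeDriver.

```python
from pathlib import Path

# Placeholder location (an assumption); point this at your own ChromeDriver file.
DRIVER_PATH = Path("chromedriver")


def resolve_driver_path(path):
    """Return the absolute driver path, or raise if no file exists there."""
    driver = Path(path).expanduser().resolve()
    if not driver.is_file():
        raise FileNotFoundError(f"ChromeDriver not found at {driver}")
    return driver


def make_driver(path=DRIVER_PATH):
    """Start Chrome using the driver at the given path."""
    # Imported here so the path check above can run without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    service = Service(executable_path=str(resolve_driver_path(path)))
    return webdriver.Chrome(service=service)
```

Checking the path first gives a clearer error than letting the browser launch fail later on.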
Create the get images function to list URLs
The function uses JavaScript to scroll down the results page and then clicks on each thumbnail. Once a thumbnail has been clicked, the function reads the URL of the full-size image and adds it to a set:
This function:
- Defines a scroll-down helper that runs a line of JavaScript to scroll to the bottom of the Google results page
- Loads the search url in the browser
- Creates an empty set to store the image urls
- Runs a while loop until the set holds the maximum number of images. On each pass it:
  - Scrolls down the page
  - Finds the HTML thumbnail elements by the class name ‘Q4LuWd’
  - Loops through the thumbnails and clicks on each one
  - Finds the full-size image using the class name ‘n3VNCb’ (see the video tutorial on how to inspect these items on the web page)
  - Skips any image with an empty source; otherwise adds the image url to the image_urls set using the add method
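The steps above can be sketched roughly as follows. This is not the repository's code: the function name `get_image_urls`, its parameters and the early exit when no new thumbnails load are my own assumptions. The two class names are the ones quoted in the text and may no longer match Google's current markup. The locator string `"class name"` is the value of Selenium's `By.CLASS_NAME`, written out so the sketch has no import-time dependency on Selenium.

```python
import time

THUMBNAIL_CLASS = "Q4LuWd"    # thumbnail class quoted in the text; may have changed
FULL_IMAGE_CLASS = "n3VNCb"   # full-size image class quoted in the text; may have changed
BY_CLASS_NAME = "class name"  # same string as selenium.webdriver.common.by.By.CLASS_NAME


def get_image_urls(wd, url, max_images, delay=1.0):
    """Collect up to max_images image URLs from a results page loaded in driver wd."""

    def scroll_down():
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(delay)

    wd.get(url)
    image_urls = set()
    seen = 0  # number of thumbnails already processed

    while len(image_urls) < max_images:
        scroll_down()
        thumbnails = wd.find_elements(BY_CLASS_NAME, THUMBNAIL_CLASS)
        if len(thumbnails) <= seen:
            break  # no new thumbnails loaded, so stop rather than loop forever

        for thumb in thumbnails[seen:]:
            seen += 1
            try:
                thumb.click()
                time.sleep(delay)
            except Exception:
                continue  # thumbnail could not be clicked; move on to the next

            for image in wd.find_elements(BY_CLASS_NAME, FULL_IMAGE_CLASS):
                src = image.get_attribute("src")
                # Skip empty sources and inline data: URIs; keep real http(s) links.
                if src and src.startswith("http") and len(image_urls) < max_images:
                    image_urls.add(src)

            if len(image_urls) >= max_images:
                break

    return image_urls
```

Because the result is a set, duplicate URLs are dropped automatically, which is why the loop counts the set size rather than the number of clicks.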
Create the download image routine
Once we have a set of all the image URLs, the next job is to save the images to a directory on our machine. That is exactly what the next routine does:
The download image routine does the following:
- Has parameters down_path, url, file_name, image_type and verbose – the image_type and verbose arguments are optional parameters
- Wraps the work in a try and except statement so that an error is caught and reported if the image cannot be downloaded for some reason
- Gets the current time at the start of the routine
- Uses the requests package to fetch the raw content of the image from its url
- Wraps the image content in an in-memory bytes object (BytesIO)
- Opens the image with the Pillow package
- Sets the file path equal to the download path and file name parameters
- Opens the file path in write binary (‘wb’) mode and saves the image to it
- Prints out where the relevant file was stored
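A minimal sketch of such a routine is below, keeping the parameter names listed above. The request timeout, the status check, the conversion to RGB before saving as JPEG and the return value are my own additions rather than details confirmed by the text.

```python
import io
import os
from datetime import datetime


def download_image(down_path, url, file_name, image_type="JPEG", verbose=True):
    """Fetch one image and save it; return the saved path, or None on failure."""
    try:
        # Imported inside the try so a missing package is reported like any other failure.
        import requests
        from PIL import Image

        started = datetime.now().strftime("%H:%M:%S")

        response = requests.get(url, timeout=10)
        response.raise_for_status()

        # Hold the downloaded bytes in memory and let Pillow decode them.
        image = Image.open(io.BytesIO(response.content))
        if image_type == "JPEG" and image.mode != "RGB":
            # Modes such as RGBA or P cannot be written as JPEG, so convert first.
            image = image.convert("RGB")

        file_path = os.path.join(down_path, file_name)
        with open(file_path, "wb") as fh:
            image.save(fh, image_type)

        if verbose:
            print(f"The image {file_path} was downloaded successfully at {started}.")
        return file_path
    except Exception as exc:
        print(f"Unable to download image from {url}: {exc}")
        return None
```

Returning the path (or None) lets the calling code count how many downloads actually succeeded.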
Working with our functions
Next, we need a Google Images search link with a query string for each player. It should look like this: https://www.google.com/search?q=brennan+johnson&tbm=isch&ved=2ahUKEwi-i4WJ9Ov3AhVE0RoKHc47BZoQ2-cCegQIABAA&oq=brennan+johnson&gs_lcp=CgNpbWcQAzIKCAAQsQMQgwEQQzILCAAQgAQQsQMQgwEyCwgAEIAEELEDEIMBMgUIABCABDIFCAAQgAQyBQgAEIAEMgUIABCABDIFCAAQgAQyBQgAEIAEMgUIABCABFDMBljMBmDnCGgAcAB4AIAB_gGIAc4DkgEDMi0ymAEAoAEBqgELZ3dzLXdpei1pbWfAAQE&sclient=img&ei=82WGYv7FCMSia873lNAJ&bih=858&biw=1745&rlz=1C5CHFA_enGB980GB980. Each link is then paired with the label of the directory to create:
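If you would rather not paste long links by hand, one option is to build a shorter link from the player's name. This sketch assumes that `q` (the search text) and `tbm=isch` (image search) are the only parameters that matter and that the rest of the long link is session and tracking data. The `build_search_url` helper, the `players` list and the underscore-style labels are hypothetical names of my own.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Build a Google Images search link for the given search text."""
    # urlencode turns spaces into '+', matching the q=brennan+johnson form in the link.
    return "https://www.google.com/search?" + urlencode({"q": query, "tbm": "isch"})


players = ["brennan johnson"]

# One search link and one directory label per player, kept in the same order.
google_urls = [build_search_url(player) for player in players]
labels = [player.replace(" ", "_") for player in players]
```

Keeping the two lists in the same order matters, because they are paired up position by position later on.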
Looping through each URL and label
The final step handles the two lists above by looping over each url and label pair and downloading the images for one player at a time:
Step by step:
- This checks that the label and the url lists are the same size, otherwise it raises an error
- Sets the path where you want to store the images of the players
- If the directory doesn’t exist for the relevant player it will be created
- Uses the zip function to iterate through both the url list and the labels and within the loop:
  - Creates a loop to iterate through the urls and create an index using enumerate
  - Uses the download image function and loops through all the player pictures until the maximum number of pictures has been exhausted
- Finally, our web driver object quits and that is the end of the process
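Put together, the loop might look something like the sketch below. To keep it self-contained, the URL-collecting and downloading functions are passed in as the arguments `get_urls` and `download`. Those two names, `scrape_all` and the numbered `.jpg` file names are assumptions for illustration. Unlike the steps listed above, this version creates each directory inside the main loop rather than in a separate pass.

```python
import os


def scrape_all(search_urls, labels, base_path, get_urls, download):
    """Download the images for every (search url, label) pair; return the saved count."""
    if len(search_urls) != len(labels):
        raise ValueError("The url list and the label list must be the same length.")

    saved = 0
    for search_url, label in zip(search_urls, labels):
        # Create the directory for this label if it does not exist yet.
        label_dir = os.path.join(base_path, label)
        os.makedirs(label_dir, exist_ok=True)

        # Number the files 1.jpg, 2.jpg, ... in the order the urls are returned.
        for index, image_url in enumerate(get_urls(search_url), start=1):
            if download(label_dir, image_url, f"{index}.jpg"):
                saved += 1
    return saved


# In a real run you would pass functions that wrap your own driver, and quit the
# browser afterwards even if something fails, for example:
#
#     try:
#         scrape_all(google_urls, labels, "images", my_get_urls, my_download)
#     finally:
#         wd.quit()
```

Passing the two functions in also makes the loop easy to try out with stand-ins before pointing it at a real browser.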
Now you should have some lovely images, in my case of Nottingham Forest players, but you could substitute this for images of cats.
Final note
I do hope you have enjoyed this whistle-stop tour of the power of Selenium for web automation.
These are the examples of the images I scraped for my Deep Learning project to predict players based off their images.
The repository that supports this tutorial contains the full working code.