How to Scrape Images from Google
In my last post, I tried to train a deep neural net to detect brand logos in images. For that I downloaded the Flickr27 dataset, which contains 270 images of 27 different brands. As this dataset is rather small, I turned to Google image search and wrote a small R script to download the first 20 images for each search term.
To use the script you need to download PhantomJS and create a small JavaScript file (see the Gist at the end of this post). If you have the PhantomJS executable, the small JS file and the R file in one folder, you just need to run the following lines to download the images:
gg <- scrapeJSSite(searchTerm = "adidas+logo")
downloadImages(as.character(gg$images), "adidas")
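If you want to collect logos for several brands at once, a simple loop over the search terms does the job. Below is a minimal sketch; the brand names are just placeholders and the two functions are defined in the R script further down:
brands <- c("adidas", "nike", "starbucks")   # placeholder search terms
for (b in brands) {
  gg <- scrapeJSSite(searchTerm = paste0(b, "+logo"))
  downloadImages(as.character(gg$images), b)
}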
Good luck scraping some images from Google. I will use the script to enhance the brand logo dataset, hopefully improving my model with some better image data.
The PhantomJS script, imageScrape.js:
var url ='https://www.google.de/search?q=Yahoo+logo&source=lnms&tbm=isch&sa=X'; // this line gets rewritten by the R script
var page = new WebPage();
var fs = require('fs');
var vWidth = 1080;
var vHeight = 1920;
page.viewportSize = {
  width: vWidth,
  height: vHeight
};
// Scroll through the page so lazily loaded thumbnails get rendered
var s = 0;
var sBase = page.evaluate(function () { return document.body.scrollHeight; });
page.scrollPosition = {
  top: sBase,
  left: 0
};
function sc() {
  var sBase2 = page.evaluate(function () { return document.body.scrollHeight; });
  if (sBase2 != sBase) {
    sBase = sBase2;
  }
  if (s > sBase) {
    page.viewportSize = {width: vWidth, height: vHeight};
    return;
  }
  page.scrollPosition = {
    top: s,
    left: 0
  };
  page.viewportSize = {width: vWidth, height: s};
  s += Math.min(sBase / 20, 400);
  setTimeout(sc, 110);
}
// After a short delay, write the rendered HTML to disk and exit
function just_wait() {
  setTimeout(function () {
    fs.write('1.html', page.content, 'w');
    phantom.exit();
  }, 2500);
}
page.open(url, function (status) {
  sc();
  just_wait();
});
The R script that drives it:
library(plyr)
library(reshape2)
library(rvest)

scrapeJSSite <- function(searchTerm){
  url <- paste0("https://www.google.de/search?q=", searchTerm, "&source=lnms&tbm=isch&sa=X")
  ## Write the search URL into the first line of the PhantomJS script
  lines <- readLines("imageScrape.js")
  lines[1] <- paste0("var url ='", url ,"';")
  writeLines(lines, "imageScrape.js")
  ## Render and download the website
  system("phantomjs imageScrape.js")
  ## Parse the rendered HTML and extract the image sources
  pg <- read_html("1.html")
  files <- pg %>% html_nodes("img") %>% html_attr("src")
  df <- data.frame(images = files, search = searchTerm)
  return(df)
}

downloadImages <- function(files, brand, outPath = "images"){
  for(i in 1:length(files)){
    download.file(files[i], destfile = paste0(outPath, "/", brand, "_", i, ".jpg"), mode = 'wb')
  }
}

### exchange the search term here!
gg <- scrapeJSSite(searchTerm = "Adidas+logo")
downloadImages(as.character(gg$images), "adidas")
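In practice not every src attribute is a plain http URL (the rendered page may embed some thumbnails as base64 data URIs), and single downloads can fail. The following is only a sketch of a more defensive variant, assuming you want to keep http(s) sources only; the filtering and the tryCatch wrapper are my additions, not part of the original script:
gg <- scrapeJSSite(searchTerm = "adidas+logo")
urls <- as.character(gg$images)
urls <- urls[grepl("^https?://", urls)]   # keep only http(s) sources, drop data URIs
for (i in seq_along(urls)) {
  tryCatch(
    download.file(urls[i], destfile = paste0("images/adidas_", i, ".jpg"), mode = "wb"),
    error = function(e) message("Skipping image ", i, ": ", conditionMessage(e))
  )
}
As with downloadImages(), the images folder has to exist before you run this.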