Web scraping with Python – the dark side of data
While searching for information on web scrapers, I found a great presentation given at PyCon 2010 by Asheesh Laroia. I thought it might be a valuable resource for R users looking for ways to gather data from user-unfriendly websites. The presentation can be found here:
http://python.mirocommunity.org/video/1616/pycon-2010-scrape-the-web-stra
Highlights (at least from my perspective)
- Screen scraping is not about regular expressions. Pattern matching is simply too brittle for these tasks: the markup changes regularly, so regex-based scrapers carry a significant maintenance burden.
- BeautifulSoup is the go-to HTML parser for poor-quality source. I have used it in the past and am pleased to hear that I was not too far off the money! (There is a short parsing sketch after this list.)
- Configuration of User-Agent settings is discussed in detail, along with other mechanisms that websites use to stop you from scraping their content. (The mechanize sketch further down sets a browser-like User-Agent.)
- Good description of how to use the Live HTTP Headers add-on for Firefox.
- A thought-provoking discussion about APIs, with comments suggesting that their maintenance and support are often woefully inadequate. I was interested to hear his views, as they imply that scraping may be the only alternative when the data you really need is otherwise inaccessible.
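To give a flavour of the BeautifulSoup point above, here is a minimal sketch. It assumes the era-appropriate BeautifulSoup 3 on Python 2; the URL, tag names and class attribute are hypothetical, not taken from the presentation.

    import urllib2
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; in later versions: from bs4 import BeautifulSoup

    # Fetch a (hypothetical) listing page and parse it, warts and all.
    html = urllib2.urlopen('http://example.com/listing').read()
    soup = BeautifulSoup(html)  # tolerant of unclosed tags and other sloppy markup

    # Navigate the parse tree rather than pattern-matching the raw source:
    # grab the text of every table cell tagged with class="price".
    for cell in soup.findAll('td', {'class': 'price'}):
        print ''.join(cell.findAll(text=True))

This illustrates the first bullet point nicely: if the site reorders attributes or tweaks its layout, a tree query like findAll usually survives, whereas a regular expression would need rewriting.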
The mechanize package features heavily in the presentation's examples. The following link provides some good examples of how to use mechanize to automate forms:
http://wwwsearch.sourceforge.net/mechanize/forms.html
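Along those lines, here is a minimal sketch of form automation with mechanize (Python 2), which also sets a browser-like User-Agent as mentioned in the highlights. The URL, form name and field name are hypothetical.

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # some sites use robots.txt to turn scrapers away
    # Present a browser-like User-Agent instead of the default Python one.
    br.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64)')]

    br.open('http://example.com/search')
    br.select_form(name='searchform')  # select the form by its name attribute
    br['q'] = 'annual rainfall data'   # fill in the text input named "q"
    response = br.submit()             # submit the form just as a browser would
    print response.read()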
There was also some mention of how JavaScript causes problems for web scrapers, although this can be overcome with browser drivers such as Selenium (see http://pypi.python.org/pypi/selenium) and Watir. I have used safari-watir before, and in my experience it can perform many complex data-gathering tasks with relative ease. A sketch of the Selenium approach follows.
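For completeness, a minimal Selenium WebDriver sketch; the page URL is hypothetical. Because a real browser does the rendering, what you scrape is the JavaScript-built DOM rather than the bare HTML the server sent.

    from selenium import webdriver

    driver = webdriver.Firefox()  # launches a real Firefox, so JavaScript runs
    driver.get('http://example.com/dynamic-page')  # hypothetical JS-heavy page

    # page_source holds the rendered DOM, which can then be handed to
    # BeautifulSoup just like the source of any static page.
    html = driver.page_source
    driver.quit()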
Please feel free to post comments about your experiences with screen scraping, and about any other tools you use to collect web data for R.