Web Scraping Exercises
[Before proceeding with these exercises, first read the rvest package help and the SelectorGadget help.]
Answers to the exercises are available here.
Exercise 1
Consider the url ‘http://statbel.fgov.be/en/statistics/figures/economy/indicators/prix_prod_con/’
Extract all the information shown in the table ‘Third Quarter 2016’.
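As a hint rather than an answer, here is a minimal sketch of table extraction with rvest. It runs on a small inline HTML fragment, not on the live page: the table id, column names and values are invented for illustration, and the real page needs its own selector (SelectorGadget helps to find it). The sketch assumes rvest 1.0 or later; older versions use html_node() and html_nodes() instead of html_element() and html_elements().

```r
library(rvest)

# Hypothetical markup standing in for the real page; all values are made up.
page <- read_html('
  <table id="q3">
    <tr><th>Indicator</th><th>Value</th></tr>
    <tr><td>A</td><td>101.5</td></tr>
    <tr><td>B</td><td>99.2</td></tr>
  </table>')

# Select the single table node and convert it to a data frame.
# The <th> cells in the first row become the column names.
q3 <- page %>%
  html_element("table#q3") %>%
  html_table()

print(q3)
```

For the exercise itself, replace the inline fragment with read_html() on the url and adjust the selector.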
Exercise 2
Consider the url ‘http://www2.sas.com/proceedings/sugi30/toc.html’
Extract all the paper names, from 001-30 to 268-30.
Exercise 3
Consider the url ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’
Extract all the options (countries) available in the select (drop-down) element.
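The general pattern for reading a drop-down can be sketched as follows. The markup, the id "country" and the option values are assumptions made up for this example, so the selector will differ on the real page. The sketch assumes rvest 1.0 or later (html_elements(); older versions use html_nodes()).

```r
library(rvest)

# Hypothetical markup: a drop-down similar to the one the exercise describes.
page <- read_html('
  <select id="country">
    <option value="">Choose country</option>
    <option value="SE">Sweden</option>
    <option value="NO">Norway</option>
  </select>')

# One node per <option> inside the select element.
opts <- html_elements(page, "select#country option")

countries <- data.frame(
  value = html_attr(opts, "value"),        # the value sent with the form
  label = html_text(opts, trim = TRUE),    # the text shown to the user
  stringsAsFactors = FALSE
)

# Drop the placeholder entry, which has an empty value in this fragment.
countries <- countries[countries$value != "", ]
print(countries)
```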
Exercise 4
Consider the url ‘http://r-exercises.com/start-here-to-learn-r/’
Extract all the topics listed on that page.
Exercise 5
Consider the url ‘http://www.immobiliare.it/Roma/agenzie_immobiliari_provincia-Roma.html’
Extract the names of all the real estate agencies listed on the first page.
Exercise 6
Consider the url ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’.
Extract the links to the detailed information for each row of the table.
For example, for the first address, Karlbergsvägen 32, 113 27 stockholm, the details are
A.E.N HUND I STAN AB
ADRESS OCH ÖPPETTIDER
Karlbergsvägen 32
113 27 STOCKHOLM
Öppettider:
Telefon: 08-313058
Mail-adress: [email protected]
Hemsida:
The link to those details (reached by clicking on Karlbergsvägen 32, 113 27 stockholm) is http://www.gibbon.se/Retailer/Retailer.aspx?ItemId=45128.
You have to extract all the links available, one per row.
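A possible approach is to read the href attribute of each anchor and resolve it against the page url. The sketch below works on invented markup: the class name "retailers", the relative form of the hrefs and the second ItemId are assumptions, not taken from the site. It assumes rvest 1.0 or later and uses url_absolute() from xml2, which is installed together with rvest.

```r
library(rvest)

base_url <- "http://www.gibbon.se/Retailer/Map.aspx?SectionId=832"

# Hypothetical markup: one link per table row, with relative hrefs.
# ItemId=00000 is a placeholder, not a real identifier.
page <- read_html('
  <table class="retailers">
    <tr><td><a href="Retailer.aspx?ItemId=45128">first address</a></td></tr>
    <tr><td><a href="Retailer.aspx?ItemId=00000">second address</a></td></tr>
  </table>')

# Pull the raw href attribute from every anchor inside a table row.
hrefs <- page %>%
  html_elements("table.retailers tr a") %>%
  html_attr("href")

# Turn relative hrefs into full urls, relative to the page they came from.
links <- xml2::url_absolute(hrefs, base_url)
print(links)
```

If the real hrefs are already absolute, url_absolute() leaves them unchanged, so the last step is harmless either way.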
Exercise 7
Consider the url ‘https://www.bkk-klinikfinder.de/suche/suchergebnis.php?next=1’
Extract the links to the detailed information of each hospital. For example, for the hospital
Krankenhaus Dresden-Friedrichstadt Städtisches Klinikum, the details are available on the link:
https://www.bkk-klinikfinder.de/krankenhaus/index.php?id=26140094900
Exercise 8
Consider the url scraped in Exercise 7.
Extract the links to ‘Details’ for each hospital displayed on the first 4 pages.
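Scraping several pages usually means building one url per page and applying the same extraction function to each. The sketch below assumes that the `next` query parameter is the page index, which the url from Exercise 7 suggests but which should be checked in a browser. The helper detail_links() and the stand-in pages are hypothetical, and the sketch assumes rvest 1.0 or later.

```r
library(rvest)

# Assumption: ?next=1, ?next=2, ... address pages 1 to 4 of the results.
page_urls <- paste0(
  "https://www.bkk-klinikfinder.de/suche/suchergebnis.php?next=", 1:4
)

# Hypothetical helper: href of every link whose visible text is "Details".
detail_links <- function(page) {
  anchors <- html_elements(page, "a")
  is_details <- html_text(anchors, trim = TRUE) == "Details"
  html_attr(anchors[is_details], "href")
}

# Stand-in pages with invented markup so the sketch runs offline.
# For the live site use: pages <- lapply(page_urls, read_html)
pages <- lapply(1:4, function(i) {
  read_html(sprintf(
    '<div><a href="/krankenhaus/index.php?id=%d">Details</a><a href="/other">Map</a></div>',
    i
  ))
})

# Apply the same extraction to every page and flatten the result.
all_links <- unlist(lapply(pages, detail_links))
print(all_links)
```

When fetching live pages in a loop, a short Sys.sleep() between requests is a polite addition.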
Exercise 9
Consider the url ‘http://www.dictionary.com/browse/’ and the words ‘handy’, ‘whisper’, ‘lovely’, ‘scrape’.
Build a data frame where the first variable is “Word” and the second variable is “definitions”. Scrape the definitions from the url.
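One way to organise this is to build one url per word, extract the definitions from each page with a small function, and assemble the results into a data frame. In the sketch below, fake_page() is a hypothetical stand-in for read_html() that returns invented markup, and the class name "def" is an assumption, not the selector used by the site. It assumes rvest 1.0 or later.

```r
library(rvest)

base_url <- "http://www.dictionary.com/browse/"
words <- c("handy", "whisper", "lovely", "scrape")
word_urls <- paste0(base_url, words)   # one url per word

# Hypothetical stand-in for read_html(url), so the sketch runs offline.
fake_page <- function(word) {
  read_html(sprintf(
    '<div><span class="def">first sense of %s</span><span class="def">second sense of %s</span></div>',
    word, word
  ))
}

# Collapse all definitions found on a page into a single string.
get_definitions <- function(page, css = ".def") {
  defs <- html_text(html_elements(page, css), trim = TRUE)
  paste(defs, collapse = "; ")
}

result <- data.frame(
  Word = words,
  definitions = vapply(
    words,
    function(w) get_definitions(fake_page(w)),
    character(1),
    USE.NAMES = FALSE
  ),
  stringsAsFactors = FALSE
)
print(result)
```

For the exercise, replace fake_page(w) with read_html() on the matching entry of word_urls and pass the real selector as css.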
Exercise 10
Consider the url ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’.
Build a data frame with all the information available for each row.
For example, for the first address, Karlbergsvägen 32, 113 27 stockholm, the details are
A.E.N HUND I STAN AB
ADRESS OCH ÖPPETTIDER
Karlbergsvägen 32
113 27 STOCKHOLM
Öppettider:
Telefon: 08-313058
Mail-adress: [email protected]
Hemsida:
For the second row, Inedalsgatan 5, 112 33 stockholm, the details are
ARKENZOO KUNGSHOLMEN A
ADRESS OCH ÖPPETTIDER
Kungs Zoo AB
Inedalsgatan 5
112 33 STOCKHOLM
Öppettider:
Telefon: 08-7248110
Mail-adress: [email protected]
Hemsida: www.arkenzoo.se
These details will be saved in the data.frame, one row per store:
  Website address   Name of store            Phone Number   Email address      City        Country
1 (empty)           A.E.N Hund i Stan AB     08-313058      [email protected]    Stockholm   Sweden
2 www.arkenzoo.se   ArkenZoo Kungsholmen A   08-7248110     [email protected]    Stockholm   Sweden
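Once the detail text of a store has been scraped, the labelled lines ("Telefon:", "Hemsida:") can be split into columns with base R. The sketch below starts from the detail lines quoted above, leaves out the e-mail lines because the addresses are masked, and covers only three of the six columns. The helpers field() and to_row() are hypothetical names introduced for illustration.

```r
# Detail text for the two stores, as quoted in the exercise.
store1 <- c("A.E.N HUND I STAN AB", "ADRESS OCH ÖPPETTIDER",
            "Karlbergsvägen 32", "113 27 STOCKHOLM",
            "Öppettider:", "Telefon: 08-313058", "Hemsida:")
store2 <- c("ARKENZOO KUNGSHOLMEN A", "ADRESS OCH ÖPPETTIDER",
            "Kungs Zoo AB", "Inedalsgatan 5", "112 33 STOCKHOLM",
            "Öppettider:", "Telefon: 08-7248110", "Hemsida: www.arkenzoo.se")

# Return the text after "label:", or NA if the label is absent or empty.
field <- function(lines, label) {
  prefix <- paste0("^", label, ":")
  hit <- grep(prefix, lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  value <- trimws(sub(prefix, "", hit[1]))
  if (nzchar(value)) value else NA_character_
}

# Build a one-row data frame from the detail lines of a single store.
to_row <- function(lines) {
  data.frame(
    Website = field(lines, "Hemsida"),
    Name    = lines[1],   # the store name is the first line of the details
    Phone   = field(lines, "Telefon"),
    stringsAsFactors = FALSE
  )
}

# Stack the one-row data frames into the final table.
stores <- do.call(rbind, lapply(list(store1, store2), to_row))
print(stores)
```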