Web Scraping Exercises

[This article was first published on R-exercises, and kindly contributed to R-bloggers.]

[Before starting these exercises, read the help for the rvest package and for SelectorGadget.]

Answers to the exercises are available here.

Exercise 1

Consider the URL ‘http://statbel.fgov.be/en/statistics/figures/economy/indicators/prix_prod_con/’.
Extract all the information in the table ‘Third Quarter 2016’.
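
As a hint for the general approach, here is a minimal sketch of table extraction with rvest. It parses a small inline HTML string instead of the live page, so the markup, the `table#q3` selector and the values are all assumptions made for illustration; the real page may be structured differently.

```r
library(rvest)

# Toy HTML standing in for the real page (assumed markup, invented values).
html <- '<html><body>
  <table id="q3">
    <tr><th>Indicator</th><th>Value</th></tr>
    <tr><td>A</td><td>1.5</td></tr>
    <tr><td>B</td><td>2.0</td></tr>
  </table>
</body></html>'

page <- read_html(html)

# Select the table node, then convert it to a data frame.
tbl <- html_table(html_node(page, "table#q3"))
print(tbl)
```

On the real page, SelectorGadget can help find a selector that isolates the one table you need.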

Exercise 2

Consider the URL ‘http://www2.sas.com/proceedings/sugi30/toc.html’.
Extract the names of all the papers, from 001-30 to 268-30.
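
One possible way to keep only the entries in the requested range is to filter the scraped text by its paper number. The sketch below assumes (without having checked the real page) that each entry starts with a number such as 001-30 followed by the title; the toy HTML and titles are invented.

```r
library(rvest)

# Toy HTML (assumed structure): each entry is a paper number followed by a title.
html <- '<html><body>
  <p><a href="001-30.pdf">001-30</a> First Toy Title</p>
  <p><a href="268-30.pdf">268-30</a> Second Toy Title</p>
  <p><a href="300-30.pdf">300-30</a> Out Of Range Title</p>
  <p><a href="index.html">Home</a> Not a paper</p>
</body></html>'

page <- read_html(html)
entries <- html_text(html_nodes(page, "p"), trim = TRUE)

# Keep entries that start with a paper number of the form NNN-30.
papers <- entries[grepl("^[0-9]{3}-30", entries)]

# Restrict to the range 001-30 to 268-30 using the numeric prefix.
num <- as.integer(substr(papers, 1, 3))
papers <- papers[num >= 1 & num <= 268]

# Drop the number and keep only the title.
titles <- trimws(sub("^[0-9]{3}-30", "", papers))
print(titles)
```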

Exercise 3

Consider the URL ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’.
Extract all the options (countries) available in the select drop-down.
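
The entries of a drop-down are usually `option` elements nested in a `select` element, so their text can be read like any other node. A minimal sketch on toy HTML follows; the `id`, the placeholder entry and the country list are assumptions, not the site's actual markup.

```r
library(rvest)

# Toy HTML (assumed markup): a select element with a placeholder and two countries.
html <- '<html><body>
  <select id="country">
    <option value="">Choose country</option>
    <option value="SE">Sweden</option>
    <option value="NO">Norway</option>
  </select>
</body></html>'

page <- read_html(html)
opts <- html_nodes(page, "select#country option")

# Read the visible label and the value attribute of every option.
labels <- html_text(opts, trim = TRUE)
values <- html_attr(opts, "value")

# Drop the placeholder, identified here by its missing or empty value.
countries <- labels[!is.na(values) & nzchar(values)]
print(countries)
```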

Exercise 4

Consider the URL ‘http://r-exercises.com/start-here-to-learn-r/’.
Extract all the topics listed on that page.

Exercise 5

Consider the URL ‘http://www.immobiliare.it/Roma/agenzie_immobiliari_provincia-Roma.html’.
Extract the names of all the real estate agencies (agenzie immobiliari) listed on the first page.

Exercise 6

Consider the URL ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’.
Extract the links to the detailed information for each row of the table.
For example, for the first address, Karlbergsvägen 32, 113 27 stockholm, the details are:
A.E.N HUND I STAN AB
ADRESS OCH ÖPPETTIDER
Karlbergsvägen 32
113 27 STOCKHOLM
Öppettider:
Telefon: 08-313058
Mail-adress: info@hundistan.eu
Hemsida:
The link to those details (reached by clicking on Karlbergsvägen 32, 113 27 stockholm) is http://www.gibbon.se/Retailer/Retailer.aspx?ItemId=45128.
You have to extract all the available links, one per row.
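
Links are stored in the `href` attribute of `a` elements, and relative links can be resolved against the page URL. The sketch below uses toy HTML: the table markup, the relative form of the `href` values and the second `ItemId` are assumptions; only the first link is taken from the example above.

```r
library(rvest)

base_url <- "http://www.gibbon.se/Retailer/Map.aspx?SectionId=832"

# Toy HTML (assumed markup): one link per table row, with relative hrefs.
# ItemId=2 is a made-up placeholder.
html <- '<html><body><table>
  <tr><td><a href="Retailer.aspx?ItemId=45128">First address</a></td></tr>
  <tr><td><a href="Retailer.aspx?ItemId=2">Second address</a></td></tr>
</table></body></html>'

page <- read_html(html)

# Read the href attribute of every link inside the table.
hrefs <- html_attr(html_nodes(page, "table a"), "href")

# Turn relative hrefs into absolute URLs.
links <- xml2::url_absolute(hrefs, base_url)
print(links)
```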

Exercise 7

Consider the URL ‘https://www.bkk-klinikfinder.de/suche/suchergebnis.php?next=1’.
Extract the links to the detailed information for each hospital. For example, for the hospital
Krankenhaus Dresden-Friedrichstadt Städtisches Klinikum, the details are available at the link:
https://www.bkk-klinikfinder.de/krankenhaus/index.php?id=26140094900

Exercise 8

Consider the URL scraped in Exercise 7.
Extract the links to ‘Details’ for each hospital displayed on the first four pages.
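
Scraping several result pages usually means building one URL per page and applying the same extraction function to each. The sketch below assumes that the `next` query parameter is the page number, which has not been verified against the site; `extract_detail_links` is a hypothetical helper, and the demonstration runs offline on toy HTML.

```r
library(rvest)

# ASSUMPTION: the `next` query parameter is the page number (pages 1 to 4).
base <- "https://www.bkk-klinikfinder.de/suche/suchergebnis.php?next="
page_urls <- paste0(base, 1:4)

# Hypothetical helper: collect the href of every link whose text is "Details".
extract_detail_links <- function(page) {
  anchors <- html_nodes(page, "a")
  is_details <- html_text(anchors, trim = TRUE) == "Details"
  html_attr(anchors[is_details], "href")
}

# Offline demonstration on a toy page (invented markup and ids).
toy <- read_html('<html><body>
  <a href="/krankenhaus/index.php?id=1">Details</a>
  <a href="/other">Other</a>
  <a href="/krankenhaus/index.php?id=2">Details</a>
</body></html>')
detail_links <- extract_detail_links(toy)
print(detail_links)

# With network access, the same helper could be applied to every page:
# all_links <- unlist(lapply(page_urls, function(u) extract_detail_links(read_html(u))))
```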

Exercise 9

Consider the URL ‘http://www.dictionary.com/browse/’ and the words ‘handy’, ‘whisper’, ‘lovely’, ‘scrape’.
Build a data frame where the first variable is “Word” and the second variable is “definitions”. Scrape the definitions from the URL.
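
A common pattern for this kind of task is to paste each word onto the base URL, extract one value per page, and assemble the results into a data frame. In the sketch below, `first_definition` is a hypothetical helper and `.definition` is a made-up selector; the pages are offline stand-ins with placeholder text, not real dictionary entries.

```r
library(rvest)

base_url <- "http://www.dictionary.com/browse/"
words <- c("handy", "whisper", "lovely", "scrape")

# One URL per word; on the live site these would be passed to read_html().
word_urls <- paste0(base_url, words)

# Hypothetical helper: text of the first node matching `css`.
# ASSUMPTION: ".definition" is invented; find the real selector with SelectorGadget.
first_definition <- function(page, css = ".definition") {
  html_text(html_node(page, css), trim = TRUE)
}

# Offline stand-ins for the downloaded pages (invented markup, placeholder text).
toy_pages <- lapply(words, function(w) {
  read_html(sprintf(
    '<html><body><div class="definition">toy definition of %s</div></body></html>', w
  ))
})

result <- data.frame(
  Word = words,
  definitions = vapply(toy_pages, first_definition, character(1)),
  stringsAsFactors = FALSE
)
print(result)
```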

Exercise 10

Consider the URL ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’.
Build a data frame with all the information available for each row.
For example, for the first address, Karlbergsvägen 32, 113 27 stockholm, the details are:
A.E.N HUND I STAN AB

ADRESS OCH ÖPPETTIDER
Karlbergsvägen 32
113 27 STOCKHOLM
Öppettider:
Telefon: 08-313058
Mail-adress: info@hundistan.eu
Hemsida:
For the second row, Inedalsgatan 5, 112 33 stockholm, the details are
ARKENZOO KUNGSHOLMEN A
ADRESS OCH ÖPPETTIDER
Kungs Zoo AB
Inedalsgatan 5
112 33 STOCKHOLM
Öppettider:
Telefon: 08-7248110
Mail-adress: kungsholmen@arkenzoo.se
Hemsida: www.arkenzoo.se

These details will be saved as rows of the data.frame, one row per store:
  Website address  Name of store           Phone Number  Email address            City       Country
1                  A.E.N Hund i Stan AB    08-313058     info@hundistan.eu        Stockholm  Sweden
2 www.arkenzoo.se  ArkenZoo Kungsholmen A  08-7248110    kungsholmen@arkenzoo.se  Stockholm  Sweden
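
Once the details text of a store has been scraped, the labelled fields can be pulled out with regular expressions. The sketch below works on a subset of the lines shown above for the first store; `field` is a hypothetical helper, and the postal-line pattern (three digits, two digits, city) is an assumption based on the two examples only. The country is not part of the details text and would have to come from elsewhere, for instance the select drop-down.

```r
# A subset of the details lines shown above for the first store.
details <- c(
  "A.E.N HUND I STAN AB",
  "113 27 STOCKHOLM",
  "Telefon: 08-313058",
  "Mail-adress: info@hundistan.eu",
  "Hemsida:"
)

# Hypothetical helper: value after "<label>:", or NA if missing or empty.
field <- function(lines, label) {
  pattern <- paste0("^", label, ":")
  hit <- grep(pattern, lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  value <- trimws(sub(pattern, "", hit[1]))
  if (nzchar(value)) value else NA_character_
}

# ASSUMPTION: the postal line looks like "NNN NN CITY".
postal <- grep("^[0-9]{3} [0-9]{2} ", details, value = TRUE)

store <- data.frame(
  Website = field(details, "Hemsida"),
  Name    = details[1],
  Phone   = field(details, "Telefon"),
  Email   = field(details, "Mail-adress"),
  City    = sub("^[0-9]{3} [0-9]{2} ", "", postal[1]),
  stringsAsFactors = FALSE
)
print(store)
```

Rows built this way for each store can then be combined with `do.call(rbind, ...)`.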

