Soccer data sparring: Scraping, merging and analyzing exercises
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
While understanding and spending time improving specific techniques, and strengthening indvidual muscles is important, occasionally it is necessary to do some rounds of actual sparring to see your flow and spot weaknesses. This exercise sets forces you to use all that you have practiced: to scrape links, download data, regular expressions, merge data and then analyze it.
We will download data from the website football-data.co.uk that has data on some football/soccer leagues results and odds quoted by bookmakers where you can bet on the results.
Answers are available here.
Exercise 1
Use R to scan the German section on football-data.co.uk for any links and save them in a character vector called all_links
. There are many ways to accomplish this.
Exercise 2
Among the links you found should be a number pointing to comma-separated values files with data on Bundesliga 1 and 2 separated by season. Now update all_links
vector so that only links to csv
files remain. Use regular expressions.
Exercise 3
Again, update all_links
so that only links to csv tables ‘from Bundesliga 1 from Season 1993/1994 to 2013/2014 inclusive’ remain.
Exercise 4
Import to a list in your workspace all the 21 remaining csv files in all_links
, each one as a data.frame
. Use read.csv
, with the url and na.strings = c("", "NA")
. Not that you might need to add a prefix for them, so the links are complete.
Exercise 5
Take the list and generate a one big data.frame with all the data.frame
s previously imported. One way to do this is using rbind.fill
function from a well-known package. Name the new data.frame
as bundesl
.
Exercise 6
Take a good look at the new dataset. Our read.csv
did not work perfectly on this data: it turns out that there are some empty rows and empty columns, identify and count them. Update the bundesl
so it no longer has empty rows m nor columns.
Exercise 7
Format the Date column so R understands using as.Date()
.
Exercise 8
Remove all columns which are not 100% complete, and the variable Div
as well.
Exercise 9
Which are the top 3 teams in terms of numbers of wins in Bundesliga 1 for our period? You are free to use base-R functions or any package. Be warned that his task is not as simple as it seems due the nature in the data and small inconsitency in the data.
Exercise 10
Which team has held the longest winning streak in our data?
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.