Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Now that Spain has won the World Cup, it’s interesting to go back and look at some metrics from the matches and see if we can tease out what characteristics made for a winning Cup team this time around. Fortunately, the Guardian’s Data Blog has made a wealth of World Cup statistics available, with data on every player of every team (position, shots at goal, passes, tackles made, and saves), plus aggregate statistics for each team (goals, % shots on target, fouls, and much more). The data are ripe for analysis in R, especially given that you can download the data directly from the cloud as an R object with the following commands:
players <- read.csv("http://spreadsheets.google.com/pub?key=tOM2qREmPUbv76waumrEEYg&single=true&gid=1&range=A1%3AH596&output=csv")
teams <- read.csv("http://spreadsheets.google.com/pub?key=tOM2qREmPUbv76waumrEEYg&single=true&gid=0&range=A1%3AAG15&output=csv")
The method I’ve described before for accessing a Google Spreadsheet from R didn’t quite apply here, as those instructions assume you own the document (and have access to the Publish menu). But some experimentation and tweaking of the spreadsheet URL made it work: the key parameters seem to be "&gid=" (the sheet number), "&range=" (the cell range, with %3A encoding the colon) and "&output=csv" (to download in CSV format). It would be nice if Google published the specs for forming URLs like these, but as far as I know they don’t.
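To make the URL structure a little more concrete, here is a minimal sketch of a helper that assembles the same two URLs from their parts. The function name sheet_csv_url is my own invention for illustration (it is not part of any package), and the parameter meanings are only what the experimentation above suggests, not a documented Google API.

```r
# Hypothetical helper (not from any package): build a CSV export URL
# from the spreadsheet key, the sheet number (gid) and a cell range.
sheet_csv_url <- function(key, gid, range) {
  paste0(
    "http://spreadsheets.google.com/pub",
    "?key=", key,
    "&single=true",
    "&gid=", gid,
    # URLencode() with reserved = TRUE turns the colon into %3A
    "&range=", URLencode(range, reserved = TRUE),
    "&output=csv"
  )
}

key <- "tOM2qREmPUbv76waumrEEYg"
players_url <- sheet_csv_url(key, gid = 1, range = "A1:H596")
teams_url   <- sheet_csv_url(key, gid = 0, range = "A1:AG15")

# The resulting strings can then be passed to read.csv(), e.g.
# players <- read.csv(players_url)
```

The two strings built here should match the URLs used in the read.csv calls above, which makes it easier to point the same code at a different sheet or cell range.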
Anyway, a couple of bloggers have used these data to great effect to express the results of the World Cup visually using R graphics. For example, the R Charts blog used ggplot2 to look at the number of fouls committed by each team during the tournament:
(Personally, I would have sorted the rows by descending number of fouls, rather than alphabetically.) Interesting to see that Cup champions Spain are in the middle of the pack on fouls, whereas runners-up Netherlands lead this table (boosted heavily by their performance in the final).
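For anyone who wants to try the sorted version, one common way to do it is to reorder the factor levels before plotting. The sketch below uses toy data with placeholder team names and made-up counts (not the Guardian figures), and the column names team and fouls are assumptions; it is not the R Charts blog's code.

```r
# Toy data: placeholder names and made-up counts, NOT the Guardian figures.
fouls <- data.frame(
  team  = c("Team A", "Team B", "Team C", "Team D"),
  fouls = c(42, 97, 63, 58)
)

# reorder() sorts the factor levels by foul count (ascending).
# In a flipped bar chart the last level is drawn at the top,
# so the team with the most fouls ends up in the first row.
fouls$team <- reorder(fouls$team, fouls$fouls)

# Only build the chart if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(fouls, aes(x = team, y = fouls)) +
    geom_col() +
    coord_flip()
  # print(p) draws the chart
}
```

With the real data, the same reorder() call on the team column should be all that is needed to switch from alphabetical to sorted rows.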
Blogger Jason Priem also took a look at the data, this time with a scatterplot of goals per game by fouls per game, related to how far each team advanced in the competition:
(Download Jason’s code for this chart here.) Again it’s interesting to see the positions of the two finalists here, with Netherlands on the extreme frontier for both fouls and goals, while Spain is moderate on goals per game and near the lowest on fouls per game.
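If you would like to experiment with a chart along these lines, the per-game rates are simple to derive. This is only a rough sketch in base R with toy data: the team names, totals and column names (games, goals, fouls, rounds_advanced) are all placeholders of my own, and it is not Jason’s code.

```r
# Toy data: placeholder teams and made-up totals, NOT the Guardian figures.
# All column names are assumptions; the real sheet's headers may differ.
toy <- data.frame(
  team            = c("Team A", "Team B", "Team C"),
  games           = c(6, 5, 4),
  goals           = c(9, 10, 4),
  fouls           = c(60, 90, 60),
  rounds_advanced = c(3, 2, 1)
)

# Convert tournament totals to per-game rates.
toy$goals_per_game <- toy$goals / toy$games
toy$fouls_per_game <- toy$fouls / toy$games

# Scatterplot: point size reflects how far each team advanced.
plot(toy$fouls_per_game, toy$goals_per_game,
     cex = toy$rounds_advanced, pch = 19,
     xlab = "Fouls per game", ylab = "Goals per game")
text(toy$fouls_per_game, toy$goals_per_game, labels = toy$team, pos = 3)
```

Dividing by games played matters here because teams that advance further play more matches, so raw totals would overstate both their goals and their fouls.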
It’s a rich dataset and I’m sure other Revolutions readers could come up with some equally interesting visualizations. If you do, tell us about it in the comments.
Guardian Data Blog: World Cup 2010 statistics: every match and every player in data