[1] [2] [3] Over the last 40-50 years, the international spread of the passion for football has proved to be one of the most pervasive social phenomena. What was considered a fun form of national craziness typical of Brazilian, British and Italian people in the 1960s and ’70s is now shared by a vast majority of the Earth’s population (including orbiting astronauts, who are regularly kept informed of match results).
As I am an absolute outsider to that trend, I randomly scraped the web[4] in search of results and scores, and ended up with a dataset of approx. 400,000 first-league matches (381,257 after a bit of cleaning), which I don’t really have a precise idea of what to do with. A clear advantage of this outsider position is that I can dig deeper into the topic without an ounce of positive or negative preconceived ideas. However, a clear disadvantage is that I may not even think of analytical approaches that would be obvious to a football fan or expert.
1. Dataset
The dataset is very international, comprising matches from 60 different countries spread over 6 continents and representing all FIFA regions.
Continent | Matches | FIFA region | Matches | Top countries | Matches
---|---|---|---|---|---
Africa | 16748 | AFC | 28065 | united kingdom | 66454
Asia | 25062 | CAF | 16748 | france | 26417
Europe | 292733 | CONCACAF | 13853 | italy | 24716
North America | 10649 | CONMEBOL | 25475 | spain | 23140
Oceania | 7386 | OFC | 560 | netherlands | 17790
South America | 28679 | UEFA | 296556 | germany | 15854
From 65 matches in 1888 to more than 15,000 matches per annum from 2006 onwards,[5] the dataset shows roughly exponential growth in the number of matches logged annually (with the exception of the two World Wars). This is due not only to an overall trend in the football industry but also to the way the original data sources I tapped are fed.
In fact, since the 1950s a growing number of countries have had an official championship whose results are made available (by their respective federations or fan communities).
2. Ingenuous findings
Summary statistics confirm what most fans and non-fans say:
a. Football is a low-scoring sport. The mean total number of goals per match is 2.77, with an average difference in scoring between home and visiting teams of only 0.57 goals. To put it differently, there was 1 goal roughly every 32 minutes 30 seconds across the dataset.
b. The pattern of match results is quite predictable, with about twice as many home wins as either visiting wins or draws.
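The pace figure can be rechecked with a couple of lines of arithmetic. This is a minimal sketch in Python (not the R code offered for download), and it assumes a regulation 90-minute match with stoppage and extra time ignored, which the post does not state explicitly.

```python
# Assumption: every match lasts the regulation 90 minutes.
MATCH_MINUTES = 90
mean_goals_per_match = 2.77  # figure quoted in the text

# Average time elapsed between two goals.
minutes_per_goal = MATCH_MINUTES / mean_goals_per_match
whole_minutes = int(minutes_per_goal)
seconds = round((minutes_per_goal - whole_minutes) * 60)

print(f"One goal every {whole_minutes} min {seconds} s on average")
```

The result lands within a second or so of the 32 minutes 30 seconds quoted above; the small gap comes from rounding the mean to 2.77.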
Win | Frequency |
---|---|
Home | 189886 |
Draw | 97563 |
Visiting | 93808 |
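As a quick check of the "twice as many" claim, the shares and ratios can be derived directly from the three counts in the table. This is a small Python sketch; the variable names are illustrative and not taken from the author's code.

```python
# Outcome counts copied from the table above.
outcomes = {"Home": 189886, "Draw": 97563, "Visiting": 93808}

total = sum(outcomes.values())
shares = {name: count / total for name, count in outcomes.items()}

for name, share in shares.items():
    print(f"{name:<8} {share:6.1%}")

# How many home wins there are for each visiting win and for each draw.
home_to_visiting = outcomes["Home"] / outcomes["Visiting"]
home_to_draw = outcomes["Home"] / outcomes["Draw"]
print(f"home/visiting = {home_to_visiting:.2f}, home/draw = {home_to_draw:.2f}")
```

The three counts add up to the 381,257 matches of the cleaned dataset, so the shares cover every match.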
However, the continent where matches are played seems to affect the distribution of home wins, draws and visiting wins. The over-representation of Europe in the dataset (76.8% of all cases) calls for caution when comparing subsets; for example, the under-representation of visiting wins in Africa compared to the rest contributes strongly to the ChiSquare even though these matches are only a very small proportion (0.87%) of the whole dataset.
Continent | Home | Draw | Visiting
---|---|---|---
Africa | 8736 | 4706 | 3306 |
Asia | 11141 | 6828 | 7093 |
Europe | 148413 | 73557 | 70763 |
North America | 4976 | 2693 | 2980 |
Oceania | 3394 | 1755 | 2237 |
South America | 13226 | 8024 | 7429 |
ChiSquare: 985.26 — df: 10 — p.value: 2.797101e-205
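The statistic can be reproduced from the contingency table alone. The sketch below is a plain-Python Pearson chi-square test of independence, offered as an illustration rather than the author's R code. The p-value helper uses the closed form of the chi-square upper tail that holds only for an even number of degrees of freedom (it happens to apply here, with df = 10).

```python
import math

# Rows: continent; columns: home wins, draws, visiting wins (from the table above).
table = {
    "Africa":        [8736, 4706, 3306],
    "Asia":          [11141, 6828, 7093],
    "Europe":        [148413, 73557, 70763],
    "North America": [4976, 2693, 2980],
    "Oceania":       [3394, 1755, 2237],
    "South America": [13226, 8024, 7429],
}
columns = ["Home", "Draw", "Visiting"]

def chi_square_contributions(table):
    """Return {(row, column index): (observed - expected)^2 / expected}."""
    row_totals = {row: sum(counts) for row, counts in table.items()}
    col_totals = [sum(col) for col in zip(*table.values())]
    n = sum(row_totals.values())
    contributions = {}
    for row, counts in table.items():
        for j, observed in enumerate(counts):
            expected = row_totals[row] * col_totals[j] / n
            contributions[(row, j)] = (observed - expected) ** 2 / expected
    return contributions

def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total

contributions = chi_square_contributions(table)
statistic = sum(contributions.values())
df = (len(table) - 1) * (len(columns) - 1)
p_value = chi2_upper_tail_even_df(statistic, df)

largest = max(contributions, key=contributions.get)
print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.3e}")
print(f"largest cell: {largest[0]} / {columns[largest[1]]}")
```

Listing the per-cell contributions makes it easy to see which continent and outcome drive the statistic, which is the point made about visiting wins in Africa.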
c. On average, more goals are scored by home teams than by visiting teams. Overall, 636,034 goals in the dataset were scored by home teams and 419,775 by visiting teams. Not only are the totals significantly different,[6] but so is the shape of the distribution.
d. As a further confirmation of the perception of football as a low-scoring sport, approximately two thirds of all the results in the dataset (67.8%) are within a 2:2 score (i.e., 0:0, 1:0, 2:0, 0:1, 0:2, 1:1, 2:1, 1:2, 2:2), and 86.4% if we consider all matches with a score up to 3:3.[7]
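The per-match averages implied by these totals follow from dividing by the 381,257 matches of the cleaned dataset. This is a short Python sketch of that arithmetic; the variable names are illustrative.

```python
# Totals quoted in the text.
matches = 381_257
home_goals = 636_034
visiting_goals = 419_775

home_mean = home_goals / matches
visiting_mean = visiting_goals / matches
total_mean = (home_goals + visiting_goals) / matches
gap = home_mean - visiting_mean

print(f"home: {home_mean:.2f}, visiting: {visiting_mean:.2f} goals per match")
print(f"total: {total_mean:.2f}, home advantage: {gap:.2f} goals per match")
```

The total matches the 2.77 goals per match reported in point a, and the gap between home and visiting averages comes out at the 0.57 figure quoted there.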
3. Some details
a. I have often heard football fans and, above all, newsreaders and commentators state that football is more offensive and more goals are scored in some specific countries, which makes the games more entertaining overall. According to the same commentators, other countries have a mainly defensive football tradition, characterized by a lower number of goals per match and, ultimately, less fun watching the games. Brazil and Italy were always mentioned as typical examples of these 2 extreme ways of playing football: dynamic and high-scoring in Brazil, chilly and defensive in Italy. Well, the data tell a different story! Football in New Zealand, the Scandinavian countries, Germany, Holland, Canada and the UK offers its fans more goals overall, while Brazil and Italy both belong to a lower-scoring category, with an average total number of goals per match in the 2.51 to 3.00 range.[8]
b. From a historical perspective, the average total number of goals per match tends to decrease over time. From the dataset, I filtered out all the countries with fewer than 50 years (football seasons) of data and was left with 10 countries for a total of 226,671 matches. Plotting the average total number of goals per match against the age of the championship (in seasons) shows that “younger championships” generate more goals but go through quite a steep fall over the initial 25 years, then a slower decrease over the following 50 years.[9]
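The filtering and averaging described in point b can be sketched as follows. This is a hypothetical Python illustration, not the author's R code: the record layout `(country, season, total_goals)`, the function name and the toy data are all assumptions, and "championship age" is taken to mean the number of seasons since the first season recorded for that country.

```python
from collections import defaultdict

def mean_goals_by_championship_age(matches, min_seasons=50):
    """Average total goals per match, grouped by championship age.

    matches: list of (country, season, total_goals) tuples (assumed layout).
    Countries with fewer than `min_seasons` distinct seasons are dropped.
    """
    seasons_by_country = defaultdict(set)
    for country, season, _ in matches:
        seasons_by_country[country].add(season)

    # Keep only long-running championships and note when each one started.
    first_season = {
        country: min(seasons)
        for country, seasons in seasons_by_country.items()
        if len(seasons) >= min_seasons
    }

    goals = defaultdict(int)
    played = defaultdict(int)
    for country, season, total_goals in matches:
        if country in first_season:
            age = season - first_season[country]
            goals[age] += total_goals
            played[age] += 1

    return {age: goals[age] / played[age] for age in sorted(played)}

# Made-up records, only to show the call; the threshold is lowered to fit them.
toy = [("A", 1900, 4), ("A", 1900, 2), ("A", 1901, 1), ("B", 1950, 5)]
print(mean_goals_by_championship_age(toy, min_seasons=2))
```

Aligning countries on championship age rather than calendar year is what lets "younger" and "older" championships be compared on the same axis.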

To download R code and dataset, click here (4.0 MB).
Notes
[1] This document is the result of an analysis conducted by the author and exclusively reflects the author’s positions. Therefore, this document does not involve, either directly or indirectly, any of the author’s employers, past or present. The author also declares that he has no conflict of interest with companies, institutions, organizations or authorities related to the football ecosystem.
[2] Contact: salvino [dot] salvaggio [at] gmail [dot] com
[3] In this document, football refers to the European definition, known as soccer in the USA.
[4] Sites such as http://www.calciostoria.it/ or http://www.calcio.com/
[5] The current football season is still ongoing, which explains the substantial drop in the number of matches for the last available year in the dataset.
[6] p-value of t.test < 2.2e-16
[7] If no colored tile is shown in the graph, it means no matches in the dataset ended with that score. If a colored tile reporting 0% is shown, it means that less than 0.005% (but more than 0) of all the matches ended with that score.
[8] p-value of one-way ANOVA < 2e-16.
[9] Stabilization in the average total number of goals per match after the 75th year does not mean a lot in this case because only one national football championship, the UK, has such longevity.