Most soccer matches are within one goal of 1-0

[This article was first published on R – Decision Science News, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

ANALYSIS OF A LOW-SCORING SPORT

Most_Common_Scores

Recently, our friend and co-author Sid Suri predicted that a soccer match would end 2-1. This got us thinking. Suppose you didn’t say which team would score 2 and which would score 1. And suppose people considered you “close” if you got within one goal of 2-1 (so 1-0, 1-1, 2-1, 3-1 or 2-2 would be close). How often would you be close by guessing 2-1?

We’ll save the answer for the end of this post.

To figure this out, we went to http://www.football-data.co.uk and pulled every English soccer match from 1993 to 2016 (so far), for the following leagues: Premier, Championship, League 1, League 2, Conference, Division 1, Division 2, and Division 3. In all: 52,017 matches.

First, let’s establish that soccer is a low-scoring sport. We see below that a two-goal match is the most likely outcome in soccer. The average number of goals per match is 2.6 (median 2).

Total_Points_Scored

This leads to certain low scores (where a score is a high-low pair) being quite common. The graph at top of this post shows that 1-0 (or 0-1, which is the same thing by our definition) is the most common outcome of a football match there is. Nearly 20% of matches end 1-0.

Now to our key question. What score is just one goal away from the most soccer matches’ scores? It’s close, but the winner is 2-1. 52.9% of matches end within one point of that score. So if you can be vague about who you predict will win, just confidently proclaim that it will end 2-1 and you’ll be right most of the time. Same deal for saying 2-0 or 1-0. All three of these scores are within one point of most soccer matches’ outcomes.

Within_One_Point

One cute math-y thing about this is that some scores are one point away from more possible scores than others. See below. 2-1 is within one point of five scores (including itself), but 2-0 is only within one point of four possible scores. Same with 1-0. To consider why, realize that scores can’t be negative. Despite being neighbors with fewer scores, 2-0 and 1-0 are within one goal of roughly the same number of matches as 2-1 is. Even 1-1 does quite well with just two neighbors beyond itself.

Breakdown

The figure above shows the counts of each of the neighbors of 2-1 (top), 2-0 (middle), and 1-0 (bottom).

Want to play around more? Here’s R code to play around. Thanks to https://twitter.com/hadleywickham for ggplot, dplyr, httr

library(dplyr)
library(ggplot2)
library(httr)
setwd("C:/Dropbox/Projects/20160206_Soccer_Scores")
if (!file.exists("20160206_Soccer_Scores.csv.gz")) {
cat("must reread")
}
if (!file.exists("20160206_Soccer_Scores.csv.gz"))
{
cat("Reading data from server\n")
STARTYEAR = 1993
STOPYEAR = 2015 #Stop at 2015 to get 2015-2016 data
yearstrings = vector(mode = "character", length = STOPYEAR - STARTYEAR +
1)
for (i in STARTYEAR:STOPYEAR) {
yearstrings[i - STARTYEAR + 1] =
paste(
substr(as.character(i), start = 3, stop = 4),
substr(as.character(i + 1), start = 3, stop = 4),
sep = ""
)
}
reslist = vector(mode = "list", length = 1000)
i = 1
for (yearstring in yearstrings) {
for (confstring in c("E0", "E1", "E2", "E3", "EC")) {
aurl = paste(
"http://www.football-data.co.uk/mmz4281/",
yearstring,
"/",
confstring,
".csv",
sep = ""
)
if (http_status(GET(aurl))$category == "Success")
{
reslist[[i]] = read.csv(url(aurl))
i = i + 1
} else {
cat("Failed: ", aurl, "\n")
}
}
}
if (exists("reslist2")) {
rm(reslist2)
}
reslist2 = vector(mode = "list", length = 50)
for (i in 1:(5 * (STOPYEAR - STARTYEAR + 1))) {
reslist2[[i]] = reslist[[i]][, c(1:6)]
}
df = do.call(rbind, reslist2)
df = subset(df, !is.na(FTHG))
write.csv(df,
file = gzfile("20160206_Soccer_Scores.csv.gz"),
row.names = FALSE)
}
get_neighbors = function(low, high) {
#returns own score plus neighbors
if (high < low)
stop("high > low in get_neighbors")
rbind(expand.grid(a = (low - 1):(low + 1), b = high),
expand.grid(a = low, b = (high - 1):(high + 1))) %>%
filter(a >= 0 &
b >= 0 &
b >= a) %>%
mutate(neighbor = paste(b, a, sep = "-"),
neighbor_group = paste(high, low, sep = "-")) %>%
group_by(neighbor_group, neighbor) %>%
summarise()
}
#create the neighbor data frame ndf
curr = 1
reslist = vector(mode = "list", length = 1000)
for (i in 0:11) {
for (j in i:11) {
reslist[[curr]] = get_neighbors(i, j)
curr = curr + 1
}
}
ndf = do.call('rbind', reslist)
ndf_summ = ndf %>% group_by(neighbor_group) %>% summarise(count = length(neighbor)) %>% arrange(-count) %>% ungroup()
df = read.csv("20160206_Soccer_Scores.csv.gz")
df = df %>%
mutate(
Low = ifelse(FTHG < FTAG, FTHG, FTAG),
High = ifelse(FTHG < FTAG, FTAG, FTHG),
Total = FTHG + FTAG
)
#Total points scored hist
p = ggplot(df, aes(x = Total))
p = p + geom_bar(
stat = "bin",
binwidth = 1,
fill = "green",
color = "black",
width = 2,
alpha = .2
)
p = p + scale_x_continuous(breaks = seq(-.5, 12.5, by = 1),
labels = c("", 0:12))
p = p + coord_cartesian(xlim = c(0.1, 12))
p = p + labs(x = "Total Goals", y = "Matches", title = "Distribution of Total Goals / Match \n(N = 52,017 Matches)")
p = p + theme(panel.grid = element_blank())
p
ggsave(
plot = p,
file = "Total_Points_Scored.png",
width = 4,
height = 4
)
matcher =
rbind(expand.grid(1, 1:3),
expand.grid(0:2, 2))[-5, ] %>%
mutate(match = TRUE)
names(matcher) = c("Low", "High", "Match")
df = left_join(df, matcher) %>%
mutate(Score = paste(High, Low, sep = "-"))
df_summ = df %>% group_by(Score) %>%
summarise(Count = n(),
Goals = Low[1] + High[1]) %>% ungroup()
df_summ = df_summ %>%
arrange(-Count) %>%
mutate(CumProp = cumsum(Count) / sum(Count),
Prop = Count / sum(Count))
df_summ2 = df_summ %>%
group_by(Goals) %>%
summarise(Proportion = sum(Prop)) %>%
ungroup() %>%
mutate(TotalProp = cumsum(Proportion))
p = ggplot(df_summ %>% filter(CumProp < .95), aes(y = 100 * Prop, x = reorder(Score, Prop))) +
geom_bar(
stat = 'identity',
fill = "green",
color = "black",
alpha = .2
) +
coord_flip() +
labs(x = "Score", y = "Percentage of Games", title = "Most Common Final Scores") +
theme(legend.position = "bottom")
p = p + theme(panel.grid.major = element_blank())
p
ggsave(
plot = p,
file = "Most_Common_Scores.png",
width = 4,
height = 4
)
###NOW JUST JOIN df_summ and NDF by x=Score y=neighbor
#Games within 1 point hist
plot_data = left_join(ndf, df_summ, by = c("neighbor" = "Score")) %>% select(-CumProp) %>%
group_by(neighbor_group) %>%
summarise(
group_count = sum(Count, na.rm = TRUE),
group_prop = sum(Prop, na.rm = TRUE),
obs = n()
) %>%
arrange(-group_count) %>%
ungroup() %>%
filter(group_count > 0) %>%
filter(group_prop > .1)
plot_data$score = factor(plot_data$neighbor_group, levels = as.character(plot_data$neighbor_group))
p = ggplot(plot_data, aes(x = score, y = 100 * group_prop))
p = p + geom_bar(
stat = "identity",
fill = "green",
color = "black",
alpha = .2
)
p = p + labs(x = "Score", y = "% of Matches Within 1 Point of Score",
title = "% of Matches Within 1 Point \nof Various Scores")
p = p + scale_y_continuous(breaks = seq(0, 55, by = 5))
p = p + theme(panel.grid.major = element_blank())
p
ggsave(
plot = p,
file = "Within_One_Point.png",
width = 4,
height = 4
)
#show all the games within 1 of a given score
plot_data = left_join(ndf, df_summ, by = c("neighbor" = "Score"))
plot_data = subset(plot_data, neighbor_group %in% c("2-1", "2-0", "1-0", "1-1"))
plot_data$neighbor_group = factor(plot_data$neighbor_group, levels = c("2-1", "2-0", "1-0", "1-1"))
p = ggplot(plot_data, aes(x = neighbor, y = Count))
p = p + geom_bar(
stat = "identity",
fill = "green",
color = "black",
alpha = .2
)
p = p + labs(x = "Scores in Score Group", y = "Number of Matches", title =
"Frequency of Scores Within Score Groups")
p = p + facet_grid(neighbor_group ~ .)
p
ggsave(
plot = p,
file = "Breakdown.png",
width = 4,
height = 4
)
view raw soccer.R hosted with ❤ by GitHub

The post Most soccer matches are within one goal of 1-0 appeared first on Decision Science News.

To leave a comment for the author, please follow the link and comment on their blog: R – Decision Science News.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)