Quick off the mark
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
With none of the top teams overimpressing this season, Alan Pardew’s performance with Newcastle – especially in the transfer market -is likely to see him receive coniderable recognition in the Manager of the Year award
Recent acquisition, Papiss Cissé, has proved particularly fruitful with a brace against Swansea last time out taking him to nine goals in just eight outings. At half-time, a question posed in the Guardian was whether or not he was the fastest in EPL careers to reach the eight strikes mark. Sounded like a good excuse to haul Rs. If you are only interested in results check out the tables below
The data I need to utilize is on a MSSQL database so first thing to do is load the RODBC package and execute a query
The query extracts every match played by all players (the Where statement ensures that bench-only appearances are not included) and the goals scored.
library(RODBC) channel <- odbcConnect("eplR") goalGames <- sqlQuery(channel,paste( " SELECT soccer.tblTeam_Names.TEAMNAME as team, soccer.tblPlayers.PLAYERID, CASE when soccer.tblPlayers.FIRSTNAME is null then soccer.tblPlayers.LASTNAME else soccer.tblPlayers.FIRSTNAME + ' ' + soccer.tblPlayers.LASTNAME end as name, soccer.tblMatch.DATE as gameDate, soccer.tblPlayer_Match.GOALS as goals FROM soccer.tblPlayer_Match INNER JOIN soccer.tblPlayers INNER JOIN soccer.tblPlayerClub ON soccer.tblPlayers.PLAYERID = soccer.tblPlayerClub.PLAYERID ON soccer.tblPlayer_Match.PLAYER_TEAM = soccer.tblPlayerClub.PLAYER_TEAM INNER JOIN soccer.tblMatchTeam ON soccer.tblPlayer_Match.TEAMMATCHID = soccer.tblMatchTeam.TEAMMATCHID INNER JOIN soccer.tblMatch ON soccer.tblMatchTeam.MATCHID = soccer.tblMatch.MATCHID INNER JOIN soccer.tblTeam_Names ON soccer.tblMatchTeam.TEAMID = soccer.tblTeam_Names.TEAMID where (soccer.tblPlayer_Match.START + soccer.tblPlayer_Match.[ON]) > 0 ORDER BY soccer.tblMatch.DATE " )); odbcClose(channel) |
I want to add columns to show both the game order – there are no double-headers in the EPL so the date reflect the game – and also a cumulative sum of the goals scored. To assist in this I utilize the popular, plyr package
library(plyr) # PLAYERID is unique, name is user-friendly goalGames <- ddply(goalGames,c("PLAYERID","name"), transform, games = 1:NROW(piece),cumGoals = cumsum(goals)) # this what the dataframe now looks like head(goalGames,1) team PLAYERID name gameDate goals games cumGoals 1 Ipswich T ABIDALN Nabil Abidallah 2001-02-24 0 1 0 |
The original question posed was whether or not Cisse was the quickest to reach the eight goal mark. The following code shows that his impressive goal per game mark still did not hack it
# Note the inequality as a player might score more than one goal in a game goalCount <- 8 minGames <- min(subset(goalGames,cumGoals>=goalCount)$games) # Ans 5 # Now find the rows that fit the 8 goals in 5 games criteria # and reduce the answer to relevant columns only fastest <-subset(goalGames,cumGoals>=goalCount&games==minGames)[,c(2,3,6,7)] print(fastest) # Uh-Oh we have a tie PLAYERID name games cumGoals 1215 AGUEROS Sergio Aguero 5 8 149644 QUINNM Mick Quinn 5 8 |
In order to determine exactly who takes the biscuit, we need to obtain data on the actual time of goals scored, apply a ddply
to this result and then filter th dataframe for the players, games and goals under consideration
channel <- odbcConnect("eplR") goalTimes <- sqlQuery(channel,paste( " SELECT soccer.tblTeam_Names.TEAMNAME as team, soccer.tblPlayers.PLAYERID, CASE when soccer.tblPlayers.FIRSTNAME is null then soccer.tblPlayers.LASTNAME else soccer.tblPlayers.FIRSTNAME + ' ' + soccer.tblPlayers.LASTNAME end as name, soccer.tblMatch.DATE as gameDate, soccer.tblGoals.time FROM soccer.tblPlayer_Match INNER JOIN soccer.tblPlayers INNER JOIN soccer.tblPlayerClub ON soccer.tblPlayers.PLAYERID = soccer.tblPlayerClub.PLAYERID ON soccer.tblPlayer_Match.PLAYER_TEAM = soccer.tblPlayerClub.PLAYER_TEAM INNER JOIN soccer.tblMatchTeam ON soccer.tblPlayer_Match.TEAMMATCHID = soccer.tblMatchTeam.TEAMMATCHID INNER JOIN soccer.tblMatch ON soccer.tblMatchTeam.MATCHID = soccer.tblMatch.MATCHID INNER JOIN soccer.tblTeam_Names ON soccer.tblMatchTeam.TEAMID = soccer.tblTeam_Names.TEAMID INNER JOIN soccer.tblGoals ON soccer.tblPlayer_Match.PLAYER_MATCH = soccer.tblGoals.PLAYER_MATCH WHERE (soccer.tblPlayer_Match.START + soccer.tblPlayer_Match.[ON] > 0) ORDER BY soccer.tblMatch.DATE,soccer.tblGoals.time " )); odbcClose(channel) # obtain ID's of tied game players ties <- fastest$PLAYERID # add cumulative goals to the goalTimes dataframe goalTimes <- ddply(goalTimes,c("PLAYERID","name"), transform,sumGoals = 1:NROW(piece) ) final <- arrange(subset(goalTimes,PLAYERID %in% ties&sumGoals==goalCount),time) print(final[1,3,5,6]) # and ladies and gentleman the winner is... team name time sumGoals 1 Man. City Sergio Aguero 46 8 2 Coventry C Mick Quinn 73 8 |
That’s a lot of work for one result. It is a pretty simple matter to extend the result to show data for fastest player to one, two, three goals etc. Open code box below to see full details
library(RODBC) channel <- odbcConnect("eplR") goalGames <- sqlQuery(channel,paste( " SELECT soccer.tblTeam_Names.TEAMNAME as team, soccer.tblPlayers.PLAYERID, CASE when soccer.tblPlayers.FIRSTNAME is null then soccer.tblPlayers.LASTNAME else soccer.tblPlayers.FIRSTNAME + ' ' + soccer.tblPlayers.LASTNAME end as name, soccer.tblMatch.DATE as gameDate, soccer.tblPlayer_Match.GOALS as goals FROM soccer.tblPlayer_Match INNER JOIN soccer.tblPlayers INNER JOIN soccer.tblPlayerClub ON soccer.tblPlayers.PLAYERID = soccer.tblPlayerClub.PLAYERID ON soccer.tblPlayer_Match.PLAYER_TEAM = soccer.tblPlayerClub.PLAYER_TEAM INNER JOIN soccer.tblMatchTeam ON soccer.tblPlayer_Match.TEAMMATCHID = soccer.tblMatchTeam.TEAMMATCHID INNER JOIN soccer.tblMatch ON soccer.tblMatchTeam.MATCHID = soccer.tblMatch.MATCHID INNER JOIN soccer.tblTeam_Names ON soccer.tblMatchTeam.TEAMID = soccer.tblTeam_Names.TEAMID where (soccer.tblPlayer_Match.START + soccer.tblPlayer_Match.[ON]) > 0 ORDER BY soccer.tblMatch.DATE " )); goalTimes <- sqlQuery(channel,paste( " SELECT soccer.tblTeam_Names.TEAMNAME as team, soccer.tblPlayers.PLAYERID, CASE when soccer.tblPlayers.FIRSTNAME is null then soccer.tblPlayers.LASTNAME else soccer.tblPlayers.FIRSTNAME + ' ' + soccer.tblPlayers.LASTNAME end as name, soccer.tblMatch.DATE as gameDate, soccer.tblGoals.time FROM soccer.tblPlayer_Match INNER JOIN soccer.tblPlayers INNER JOIN soccer.tblPlayerClub ON soccer.tblPlayers.PLAYERID = soccer.tblPlayerClub.PLAYERID ON soccer.tblPlayer_Match.PLAYER_TEAM = soccer.tblPlayerClub.PLAYER_TEAM INNER JOIN soccer.tblMatchTeam ON soccer.tblPlayer_Match.TEAMMATCHID = soccer.tblMatchTeam.TEAMMATCHID INNER JOIN soccer.tblMatch ON soccer.tblMatchTeam.MATCHID = soccer.tblMatch.MATCHID INNER JOIN soccer.tblTeam_Names ON soccer.tblMatchTeam.TEAMID = soccer.tblTeam_Names.TEAMID INNER JOIN soccer.tblGoals ON soccer.tblPlayer_Match.PLAYER_MATCH = soccer.tblGoals.PLAYER_MATCH WHERE (soccer.tblPlayer_Match.START + soccer.tblPlayer_Match.[ON] > 0) ORDER BY soccer.tblMatch.DATE,soccer.tblGoals.time " )); odbcClose(channel) goalGames <- ddply(goalGames,c("PLAYERID","name"), transform, games = 1:NROW(piece), cumGoals = cumsum(goals)) goalTimes <- ddply(goalTimes,c("PLAYERID","name"), transform, sumGoals = 1:NROW(piece) ) # create df to hold results. count is number of players reaching each goal mark myTable <- data.frame(player=character(),goals=integer(), game=integer(),time=integer(),count=integer()) for(goalCount in 1:200) { #goalCount <- 8 minGames <- min(subset(goalGames,cumGoals>=goalCount)$games) playerCount <- subset(goalGames,cumGoals>=goalCount) count <-length(unique(playerCount$name)) fastest <-subset(goalGames,cumGoals>=goalCount&games==minGames)[,c(2,3,6,7)] if (nrow(fastest) > 1) { ties <- fastest$PLAYERID final <- head(arrange(subset(goalTimes,PLAYERID %in% ties&sumGoals==goalCount),time),1) answer <- data.frame(player=final$name,goals=goalCount,game=minGames,minute=final$time,count=count) } else { answer <- data.frame(player=fastest$name,goals=goalCount,game=minGames,minute="",count=count) } # add number of players who have achieved it myTable <- rbind(myTable,answer) } print(myTable) |
All time Leaders
An abbreviated result is shown below
There are several points of interest
- Nasri was only 21 when he converted a penalty in his debut. Brian Deane scored the very first goal in the EPL five minutes in
- Ravanelli is the only player to score a hat-trick on his debut
- Pogrebybak was fastest to five goals earlier this season
- Cole and Shearer vied for the lead at around the 50 goal mark. The last occasion where a time tie break is required was
when both of them scored their 43rd goal in 52 games. Shearer left it late but his 82nd minute strike topped Cole by four minutes - van Nistelrooy grabbed his 56th goal in his 74th appearance but failed to score in his subsequent four outings. Shearer nabbed his 55th in game 74, powered in eight more in the next five and holds the record for all subsequent goals reached
Individual Club
Further analysis can be done by club by simply extending the ddply to include team
Here are Man U’s figures
- 10 different Man U players scored on their debut
- van Nistelrooy played for the team in his prime. Players like Ronaldo and Rooney started much younger
- Uniteds two other 100 goal scorers, Scholes and Giggs, converted at half the rate Rooney has
Individual Season
To look at this season’s data I need to take the bin the game date data I have into its appropriate EPL season. For this I use the cut function and then apply a new ddply function
years <- 1992:2012 goalGames$season <- cut(goalGames$gameDate, breaks=as.POSIXct(paste(years,"-08-01",sep="")), labels=paste(years[-length(years)],years[-length(years)]+1,sep="/")) goalTimes$season <- cut(goalTimes$gameDate, breaks=as.POSIXct(paste(years,"-08-01",sep="")), labels=paste(years[-length(years)],years[-length(years)]+1,sep="/")) |
- Suarez actually tied Djibrill Cisse on the, surprisingly late, 12 minute mark of the season
- Dzeko’s early season success was not maintained. typifying Man City’s campaign
- Rooney has played in less games so his scoring has lagged van Persie. However, they both reached the 20 goal mark in their 24th game
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.