In this article, we analyse the strike rates of the top batsmen in the Indian Premier League. We will use the tidyverse packages for the analysis, primarily dplyr and ggplot2.
The code for all the data processing and analysis can be found in this GitHub repo.
We will be using data from Cricsheet. Unfortunately, the data is only available till 2017 and does not include last year’s edition. The analysis will be updated once the 2018 data is available. We have already looked at how to process the data in the articles here and here. In these two articles, we wrote functions to process the YAML file for one match and used functions from the package purrr to create a data frame comprising the information for all matches. In this article, it is assumed that the data frame match_deliveries is available in the global environment. To create this data frame, ensure that all packages mentioned in 00_libs.R are installed, then run this code followed by 01_read_data.R.
To recall, the match_deliveries data frame consists of the following information –
- the innings number
- over
- ball
- batsman
- non-striker
- bowler
- runs attributed to the batsman, extras and total respectively
- a flag to indicate if it was a non-boundary (so the total runs may be 4 but the non-boundary flag may be 1 indicating that the batsmen ran 4 runs)
- a flag to indicate if a wicket was taken in that delivery, the kind of wicket, the player who got out and the fielders who were involved
- the type of extras (in case there are extras in that delivery)
We start by looking at the number of batsmen who faced at least 1 ball in any match. The function distinct from the dplyr package can be used to obtain the distinct values in a column.
batsmen_striker <- match_deliveries %>%
  distinct(delivery_batsman)

nrow(batsmen_striker)
## [1] 461
There are 461 batsmen. Many of them will have played in very few matches, so we first look at some summary statistics on the number of matches played by each batsman. To do this, recall that the column id contains the unique id for the match. We first get the unique combinations of delivery_batsman and id, and then count the number of distinct id values for each batsman.
batsmen_matches <- match_deliveries %>%
  distinct(delivery_batsman, id) %>%
  group_by(delivery_batsman) %>%
  summarise(n_matches = n()) %>%
  ungroup() %>%
  arrange(desc(n_matches))

summary(batsmen_matches$n_matches)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.00    8.00   20.64   23.00  157.00
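As a possible alternative to the distinct-then-count approach, the same per-batsman match count could be obtained in one step with n_distinct from dplyr. The sketch below uses a small made-up data frame (toy_deliveries is hypothetical, not the article's data) purely to illustrate the idea.

```r
library(dplyr)

# Toy stand-in for match_deliveries: one row per delivery (values are made up).
toy_deliveries <- tibble(
  id = c(1, 1, 1, 2, 2, 3),
  delivery_batsman = c("A", "A", "B", "A", "B", "B")
)

# Count the distinct match ids per batsman in a single summarise call.
toy_matches <- toy_deliveries %>%
  group_by(delivery_batsman) %>%
  summarise(n_matches = n_distinct(id)) %>%
  arrange(desc(n_matches))

toy_matches
```

Here batsman A appears in matches 1 and 2, and batsman B in matches 1, 2 and 3, so the counts should be 2 and 3 respectively.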
The distribution is skewed, with a median value of 8 and mean above 20. There are 636 unique matches in the data, and since we are primarily interested in the top batsmen, we will only consider those who have played at least 20 matches. So the next step is to restrict the data to these batsmen – we first filter to obtain the names of the batsmen who have played at least 20 matches, and then use inner_join from dplyr to restrict the data.
batsmen_matches <- batsmen_matches %>%
  filter(n_matches >= 20)

batsmen_match_details <- match_deliveries %>%
  inner_join(batsmen_matches, by = c("delivery_batsman" = "delivery_batsman"))
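If the n_matches column is not needed downstream, a filtering join is another way to express the same restriction. The sketch below uses semi_join from dplyr on made-up data (toy_deliveries and toy_regulars are hypothetical names); it keeps only the rows of the left table that have a match in the right table, without adding any columns.

```r
library(dplyr)

# Toy deliveries and a toy lookup of "regular" batsmen (values are made up).
toy_deliveries <- tibble(
  delivery_batsman = c("A", "A", "B", "C"),
  delivery_runs_batsman = c(1, 4, 0, 6)
)
toy_regulars <- tibble(
  delivery_batsman = c("A", "C"),
  n_matches = c(25, 40)
)

# Keep deliveries faced by batsmen present in toy_regulars; no columns are added.
toy_filtered <- toy_deliveries %>%
  semi_join(toy_regulars, by = "delivery_batsman")

toy_filtered
```

The inner_join used in the article additionally carries n_matches into the result, which may be preferable if that column is wanted later.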
As we are interested in analysing strike rates – which is the number of runs scored divided by balls faced – it is important to correctly calculate the runs scored by a batsman. For example, we should not be including wides, byes or leg-byes as runs scored. However, while wides do not count as a ball faced, we need to correctly account for no-balls, byes and leg-byes in the denominator of the strike rate calculation.
batsmen_match_details %>%
  distinct(delivery_extras_type)
## # A tibble: 6 x 1
##   delivery_extras_type
##   <chr>
## 1 <NA>
## 2 wides
## 3 legbyes
## 4 noballs
## 5 byes
## 6 penalty

noballs <- batsmen_match_details %>%
  filter(delivery_extras_type == "noballs", delivery_runs_total > 1)
nrow(noballs)
## [1] 318

byes <- batsmen_match_details %>%
  filter(delivery_extras_type == "byes", delivery_runs_batsman > 1)
nrow(byes)
## [1] 0

legbyes <- batsmen_match_details %>%
  filter(delivery_extras_type == "legbyes", delivery_runs_batsman > 1)
nrow(legbyes)
## [1] 0

wides <- batsmen_match_details %>%
  filter(delivery_extras_type == "wides", delivery_runs_batsman > 1)
nrow(wides)
## [1] 0

penalty <- batsmen_match_details %>%
  filter(delivery_extras_type == "penalty")
nrow(penalty)
## [1] 2
This shows that, among deliveries with extras, the column delivery_runs_batsman records runs for the batsman only in the case of no-balls. The first row in the noballs data is from a match between Sunrisers Hyderabad and Royal Challengers Bangalore, played in 2017. On the first ball of the ninth over, Moises Henriques bowled to Kedar Jadhav, who scored one run off a no-ball. The column delivery_runs_batsman holds the value 1, while the column delivery_runs_extras also holds the value 1, for a total of 2 runs for the delivery. We verify from the commentary section of the actual scorecard that this was indeed the case. There are also two rows with the value penalty. These are not verifiable from the match details available on ESPN Cricinfo; as there are only two such cases, they will not visibly affect this analysis, so we exclude them from the data.
Given the above observations, we are now in a position to filter the data and calculate the number of balls faced and the runs scored by a batsman up to each delivery.
batsmen_match_details <- batsmen_match_details %>%
  filter(!(delivery_extras_type %in% c("wides", "penalty")))

batsmen_match_cumruns <- batsmen_match_details %>%
  select(delivery_batsman, id, delivery_over, delivery_ball, delivery_runs_batsman) %>%
  arrange(delivery_batsman, id, delivery_over, delivery_ball) %>%
  group_by(delivery_batsman, id) %>%
  mutate(ball_number = 1:n(),
         cum_runs = cumsum(delivery_runs_batsman))
Note that we sort the data using the arrange function from dplyr. This is necessary because a batsman may not face consecutive deliveries in an over. For example, if the batsman runs 1 off the first ball and next faces the fourth ball, which he hits for 4, then his ball_number should increment only on the fourth ball and not on the second or third. Also, using a grouped data frame ensures that the functions n and cumsum calculate the statistics within each group (each combination of delivery_batsman and id).
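The per-batsman ball counter and running total described above can be checked on a tiny example. The sketch below uses a made-up over (toy_deliveries is hypothetical) in which batsman A faces the first and fifth deliveries, with a wide in between; it uses row_number() in place of 1:n(), which gives the same numbering within each group.

```r
library(dplyr)

# One made-up over: A faces ball 1, B faces balls 2-3, ball 4 is a wide, A faces ball 5.
toy_deliveries <- tibble(
  id = 1,
  delivery_batsman = c("A", "B", "B", "A", "A"),
  delivery_over = 0,
  delivery_ball = 1:5,
  delivery_runs_batsman = c(1, 0, 1, 0, 4),
  delivery_extras_type = c(NA, NA, NA, "wides", NA)
)

toy_cumruns <- toy_deliveries %>%
  # NA %in% c(...) is FALSE, so deliveries without extras are kept.
  filter(!(delivery_extras_type %in% c("wides", "penalty"))) %>%
  arrange(delivery_batsman, id, delivery_over, delivery_ball) %>%
  group_by(delivery_batsman, id) %>%
  mutate(ball_number = row_number(),
         cum_runs = cumsum(delivery_runs_batsman)) %>%
  ungroup()

toy_cumruns
```

After the wide is dropped, batsman A should have ball numbers 1 and 2 with cumulative runs 1 and 5, even though those deliveries were the first and fifth of the over.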
Our next step is to calculate the strike rate at the end of each ball faced by a batsman. Before that, we note that in some innings the batsman will have faced very few balls. We therefore restrict the data to innings of at least 10 balls faced, which should allow for a reasonable comparison among batsmen. Again, the summarise function in the code below is applied to a grouped data frame, so total_balls is calculated for each innings of a batsman.
batsmen_match_ballsfaced <- batsmen_match_cumruns %>%
  summarise(total_balls = max(ball_number)) %>%
  filter(total_balls >= 10)

batsmen_match_cumruns <- batsmen_match_cumruns %>%
  inner_join(batsmen_match_ballsfaced,
             by = c("delivery_batsman" = "delivery_batsman", "id" = "id")) %>%
  arrange(delivery_batsman, id, ball_number)

batsmen_match_cumruns <- batsmen_match_cumruns %>%
  mutate(strike_rate = cum_runs / ball_number)
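Note that strike_rate as defined here is in runs per ball. Scorecards conventionally quote strike rate as runs per 100 balls, which is just this value multiplied by 100. A minimal sketch on a made-up four-ball innings (toy_innings and strike_rate_per_100 are hypothetical names):

```r
library(dplyr)

# A made-up innings of four balls.
toy_innings <- tibble(
  ball_number = 1:4,
  delivery_runs_batsman = c(0, 1, 4, 1)
) %>%
  mutate(
    cum_runs = cumsum(delivery_runs_batsman),
    strike_rate = cum_runs / ball_number,      # runs per ball, as in the article
    strike_rate_per_100 = 100 * strike_rate    # conventional scorecard scale
  )

toy_innings
```

After four balls the batsman has 6 runs, so the runs-per-ball value is 1.5 and the conventional strike rate is 150.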
As Virat Kohli is one of the hottest stars in the cricketing world today, we start by looking at how his strike rate changes with the progress of his innings.
kohli_match_cumruns <- batsmen_match_cumruns %>%
  filter(delivery_batsman == "V Kohli")

ggplot(kohli_match_cumruns, aes(ball_number, strike_rate, colour = as.factor(id))) +
  geom_line() +
  ggtitle("Virat Kohli - Cumulative strike rate by ball") +
  xlab("Ball Number") +
  ylab("Strike Rate") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    legend.position = "none",
    plot.title = element_text(size = 20, hjust = 0.5, vjust = 0.5),
    axis.title = element_text(size = 15),
    axis.text = element_text(size = 15)
  )
There is too much noise due to the sheer number of innings he has played, so we restrict the data to his top 10 innings. At this point, it is best to write a function which will take a player name and create the above graph for their top 10 innings. The number of innings can also be an argument to the function, with a default value of 10. We use geom_smooth to obtain a smooth curve of the strike rates by ball; the original data is also shown but with a very low alpha value to make it almost transparent.
player_cumruns_topn <- function(delivery_batsman_name, top_n = 10) {
  player_match_cumruns <- batsmen_match_cumruns %>%
    filter(delivery_batsman == delivery_batsman_name)

  player_match_top10 <- player_match_cumruns %>%
    group_by(id) %>%
    summarise(total_runs = sum(delivery_runs_batsman)) %>%
    arrange(desc(total_runs))

  player_match_cumruns <- player_match_cumruns %>%
    inner_join(filter(player_match_top10, row_number() <= top_n), by = c("id" = "id"))

  ggplot(player_match_cumruns, aes(ball_number, strike_rate, colour = as.factor(id))) +
    geom_line(alpha = 0.1) +
    geom_smooth(se = FALSE, alpha = 0.5) +
    ggtitle(paste0(delivery_batsman_name, " - Cumulative strike rate by ball"),
            subtitle = paste0("Top ", top_n, " matches")) +
    xlab("Ball Number") +
    ylab("Strike Rate") +
    theme_bw() +
    theme(
      panel.border = element_blank(),
      legend.position = "none",
      plot.title = element_text(size = 20, hjust = 0.5, vjust = 0.5),
      plot.subtitle = element_text(size = 15, hjust = 0.5, vjust = 0.5),
      axis.title = element_text(size = 15),
      axis.text = element_text(size = 15)
    )
}

player_cumruns_topn("V Kohli")
We also look at similar curves for David Warner and AB de Villiers.
player_cumruns_topn(delivery_batsman_name = "DA Warner")
player_cumruns_topn(delivery_batsman_name = "AB de Villiers")
While Kohli and de Villiers tend to accelerate as the number of balls they face increases, Warner tends to accelerate quickly at the beginning of his innings and then maintain a consistent strike rate. We only look at these three players in this article, but the data and the function written above can be used to analyse your favourite player from the IPL.
Our next step is to analyse who accelerates the best in IPL matches. While much more complex analyses are possible, here we look at a simple one: we calculate the difference in strike rate between the first and second half of the innings. Since we already have the number of balls played in an innings, the halfway point is simply the number of balls faced divided by 2. Also recall that we have restricted the data to innings of at least 10 balls, so each half comprises at least 5 balls.
batsmen_match_halves <- batsmen_match_cumruns %>%
  mutate(half_01 = ball_number / total_balls > 0.5) %>%
  group_by(delivery_batsman, id, half_01) %>%
  filter(row_number() == n())
Again, this is relatively simple to define using dplyr. The first and second halves are distinguished by dividing ball_number by total_balls. Also note the condition in the filter function: row_number gives the row index within each group, so the condition keeps only the last row of each half. Since half_01 is TRUE for the second half, and the rows remain sorted by batsman, match and ball number, we can use the functions lag and if_else from dplyr to calculate the difference in strike rates between the two rows. Because strike_rate is cumulative, the second-half row holds the strike rate for the whole innings, so the difference measures how far the end-of-innings rate has moved from the rate at the halfway point.
batsmen_match_halves2 <- batsmen_match_halves %>%
  ungroup() %>%
  mutate(
    diff_rate = if_else(half_01, strike_rate - lag(strike_rate), 0)
  ) %>%
  filter(half_01)
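Since strike_rate is cumulative, diff_rate above compares the end-of-innings rate with the rate at the halfway point. If one instead wanted the strike rate within each half, a possible variant is to sum runs and balls per half directly. The sketch below does this on one made-up 10-ball innings (toy_innings and the column names are hypothetical) and shows both measures side by side.

```r
library(dplyr)

# One made-up innings of 10 balls: slow first half, fast second half.
toy_innings <- tibble(
  delivery_batsman = "A",
  id = 1,
  ball_number = 1:10,
  total_balls = 10,
  delivery_runs_batsman = c(0, 1, 0, 1, 0, 4, 1, 6, 0, 4)
)

toy_halves <- toy_innings %>%
  mutate(second_half = ball_number / total_balls > 0.5) %>%
  group_by(delivery_batsman, id) %>%
  summarise(
    first_half_rate  = sum(delivery_runs_batsman[!second_half]) / sum(!second_half),
    second_half_rate = sum(delivery_runs_batsman[second_half]) / sum(second_half),
    overall_rate     = sum(delivery_runs_batsman) / n(),
    .groups = "drop"
  ) %>%
  mutate(
    diff_between_halves = second_half_rate - first_half_rate,  # within-half rates
    diff_cumulative     = overall_rate - first_half_rate       # what lag() on cumulative rates gives
  )

toy_halves
```

In this toy innings the batsman scores 2 runs off the first 5 balls and 15 off the last 5, so the within-half difference is 2.6 runs per ball while the cumulative difference is 1.3. Both measures rank accelerating innings similarly, but their magnitudes differ.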
For each batsman in our data, we then calculate the total number of runs across all matches, as well as the average and median difference in strike rates between the two halves.
batsmen_acceleration <- batsmen_match_halves2 %>%
  group_by(delivery_batsman) %>%
  summarise(
    total_runs = sum(cum_runs),
    avg_diff_strike_rate = mean(diff_rate),
    median_diff_strike_rate = median(diff_rate)
  ) %>%
  arrange(desc(median_diff_strike_rate))
The final filter restricts the data to batsmen who have scored at least 500 runs in the IPL.
batsmen_acceleration <- batsmen_acceleration %>%
  filter(total_runs >= 500)
We then look at the batsmen with the highest median difference in strike rates between the first and second half. We ensure that the bars are sorted in descending order of median difference; this is achieved by using reorder as part of the aesthetics specification in ggplot2.
ggplot(batsmen_acceleration,
       aes(reorder(delivery_batsman, median_diff_strike_rate), median_diff_strike_rate)) +
  geom_col(fill = "lightblue", colour = "black") +
  scale_y_continuous(labels = scales::percent_format()) +
  ggtitle("Difference in strike rate between 1st and 2nd half of innings") +
  xlab("Batsman") +
  ylab("Median difference in strike rate") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    legend.position = "none",
    plot.title = element_text(size = 20, hjust = 0.5, vjust = 0.5),
    axis.title = element_text(size = 15),
    axis.text = element_text(size = 15)
  ) +
  coord_flip()
Albie Morkel takes the honours: the median difference in strike rate between the two halves of all innings he has played in the IPL is almost 35%. MS Dhoni, who has the anecdotal reputation of starting slowly and then accelerating quickly (at least in the last few years), also appears in the top 10.
Hopefully, this article has given you a flavour of the kind of analyses possible once you have ball-by-ball data available. We have also tried to use the tidyverse functions as much as possible. In the future, there will be more articles on this website using the same data.
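To read off a top 10 such as the one mentioned above directly from a summary table, one option is slice_max from dplyr. The sketch below runs on made-up values (toy_acceleration is hypothetical and does not reproduce the article's results).

```r
library(dplyr)

# Twelve made-up batsmen with evenly spaced median differences (0.01, 0.04, ..., 0.34).
toy_acceleration <- tibble(
  delivery_batsman = paste0("Player ", 1:12),
  median_diff_strike_rate = seq(0.01, 0.34, length.out = 12)
)

# Keep the ten rows with the largest median difference, largest first.
top10 <- toy_acceleration %>%
  slice_max(median_diff_strike_rate, n = 10)

top10
```

The same call on batsmen_acceleration would give the ten batsmen at the top of the chart, assuming that data frame is available as built earlier.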