This is the second article in a series that analyses IPL cricket matches using data from cricsheet. The first article in the series can be found here.
The first article showed how to load the match-level information into five main tables. This article focuses on loading the ball-by-ball information. Since we are restricted to IPL matches, we are free to make certain assumptions. For example, the data will contain at most two innings per match. Also, there are no matches where penalty runs were awarded, and the cases where a bowler was replaced in the middle of an over are too few to matter. Thus, the code does not try to parse and load this information.
The format of the data is described on this page. YAML data has a tree-like structure. The R package yaml loads a YAML file and converts it into a deeply nested list. The user-defined function cricsheet_ipl_load_innings then converts that list into a set of tables. It relies on a helper function, process_delivery, which extracts all the information for a single delivery. The following tables (all of them linked via a common match id) are created:
- match_innings – captures the innings number and the team playing that innings
- match_deliveries – captures the innings number, over, ball, batsman, non-striker and bowler; the runs attributed to the batsman, the extras and the total; a non-boundary flag (for example, the total may be 4 while the flag is 1, meaning the batsmen ran all four runs rather than hitting a boundary); a flag to indicate whether a wicket was taken on that delivery, along with the kind of wicket, the player who got out and the fielders involved; and finally the type of extras (if there were extras on that delivery)
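To make the nested list structure concrete, here is a minimal sketch that parses a small hand-written YAML fragment with yaml.load(). The team and player names are made-up placeholders rather than real cricsheet data, and the fragment only mimics the parts of the layout that the code in this article relies on.

```r
library(yaml)

# Hand-written fragment mimicking the cricsheet layout (placeholder names)
sample_yaml <- "
innings:
  - 1st innings:
      team: Team A
      deliveries:
        - 0.1:
            batsman: Player X
            bowler: Player Z
            non_striker: Player Y
            runs:
              batsman: 0
              extras: 1
              total: 1
            extras:
              legbyes: 1
"

parsed <- yaml.load(sample_yaml)

# Each innings is a one-element list named after the innings
first_innings <- parsed$innings[[1]]$`1st innings`
first_innings$team

# Each delivery is a one-element list whose name encodes over and ball
delivery <- first_innings$deliveries[[1]]
names(delivery)
str(delivery[[1]])
```

Inspecting a real file with str() in the same way is a quick way to confirm which fields are present before writing the extraction code.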
I have also assumed that an RStudio project has been set up and the input files are all stored in a folder called data inside the project. The code starts by loading the required packages and defining the functions described above.
    library(tidyverse)
    library(yaml)
    library(purrr)
    library(lubridate)

    # Extract all the fields of a single delivery into a flat named list
    process_delivery <- function(delivery) {
      # Each delivery is a one-element list whose name encodes over and ball
      delivery_name <- names(delivery)
      details <- delivery[[delivery_name]]

      delivery_double <- as.double(delivery_name)
      delivery_over <- trunc(delivery_double) + 1
      # round() guards against floating-point error in the fractional part
      delivery_ball <- as.integer(round((delivery_double - trunc(delivery_double)) * 10))

      delivery_batsman <- details$batsman
      delivery_non_striker <- details$non_striker
      delivery_bowler <- details$bowler

      delivery_runs_batsman <- as.integer(details$runs$batsman)
      delivery_runs_extras <- as.integer(details$runs$extras)
      delivery_runs_total <- as.integer(details$runs$total)
      delivery_runs_non_boundary <- ifelse("non_boundary" %in% names(details$runs), 1, 0)

      delivery_wicket <- ifelse("wicket" %in% names(details), 1, 0)
      delivery_wicket_kind <- NA_character_
      delivery_wicket_player_out <- NA_character_
      delivery_wicket_fielders <- NA_character_
      if (delivery_wicket == 1) {
        delivery_wicket_kind <- details$wicket$kind
        delivery_wicket_player_out <- details$wicket$player_out
        if ("fielders" %in% names(details$wicket)) {
          delivery_wicket_fielders <- paste(details$wicket$fielders, collapse = ",")
        }
      }

      # Only the first type of extras is recorded
      delivery_extras_type <- NA_character_
      if ("extras" %in% names(details)) {
        delivery_extras_type <- names(details$extras)[1]
      }

      return(list(delivery_over = delivery_over,
                  delivery_ball = delivery_ball,
                  delivery_batsman = delivery_batsman,
                  delivery_non_striker = delivery_non_striker,
                  delivery_bowler = delivery_bowler,
                  delivery_runs_batsman = delivery_runs_batsman,
                  delivery_runs_extras = delivery_runs_extras,
                  delivery_runs_total = delivery_runs_total,
                  delivery_runs_non_boundary = delivery_runs_non_boundary,
                  delivery_wicket = delivery_wicket,
                  delivery_wicket_kind = delivery_wicket_kind,
                  delivery_wicket_player_out = delivery_wicket_player_out,
                  delivery_wicket_fielders = delivery_wicket_fielders,
                  delivery_extras_type = delivery_extras_type))
    }

    cricsheet_ipl_load_innings <- function(input_file) {
      # Assign the match id based on the file name (not the folder name)
      match_id <- str_extract(basename(input_file), "[0-9]+")
      match_id <- parse_integer(match_id)
      writeLines(as.character(match_id))

      # Load the input file
      input_data <- yaml.load_file(input_file)

      # Innings table
      innings <- input_data$innings
      number_of_innings <- length(innings)

      # Ignore absent_hurt, penalty_runs, declared
      i1 <- innings[[1]]$`1st innings`
      i2 <- NULL
      if (number_of_innings > 1) {
        i2 <- innings[[2]]$`2nd innings`
        teams <- c(i1$team, i2$team)
      } else {
        teams <- c(i1$team, NA)
      }
      match_innings <- tibble(
        id = rep(match_id, 2),
        innings_num = c(1L, 2L),
        innings_team = teams
      )

      # Deliveries table
      # Ignore replacements
      i1_deliveries <- i1$deliveries
      i1_delivery_list <- map(i1_deliveries, process_delivery)
      i1_deliveries <- bind_rows(i1_delivery_list)
      temp <- tibble(id = rep(match_id, nrow(i1_deliveries)),
                     innings_num = rep(1L, nrow(i1_deliveries)))
      i1_deliveries <- bind_cols(temp, i1_deliveries)

      if (number_of_innings > 1) {
        i2_deliveries <- i2$deliveries
        i2_delivery_list <- map(i2_deliveries, process_delivery)
        i2_deliveries <- bind_rows(i2_delivery_list)
        temp <- tibble(id = rep(match_id, nrow(i2_deliveries)),
                       innings_num = rep(2L, nrow(i2_deliveries)))
        i2_deliveries <- bind_cols(temp, i2_deliveries)
        match_deliveries <- bind_rows(i1_deliveries, i2_deliveries)
      } else {
        match_deliveries <- i1_deliveries
      }

      # Return a list of tables
      retlist <- list(match_innings = match_innings,
                      match_deliveries = match_deliveries)
      return(retlist)
    }
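The over and ball arithmetic used in process_delivery can be checked in isolation. The sketch below uses a hypothetical helper, decode_ball_key, which is not part of the original code. It assumes that a delivery key has the form over.ball with overs counted from zero, and it rounds the fractional part as one way to guard against floating-point error.

```r
# Hypothetical helper: split a delivery key such as "19.3" into over and ball
decode_ball_key <- function(key) {
  x <- as.double(key)
  over <- trunc(x) + 1  # keys count overs from 0, so add 1
  ball <- as.integer(round((x - trunc(x)) * 10))
  list(over = over, ball = ball)
}

decode_ball_key("0.1")
decode_ball_key("19.3")

# Without round(), the fractional part may carry a tiny floating-point error
(19.3 - 19) * 10 == 3
```

Note that this arithmetic assumes a single-digit ball number: a key such as "0.10" converts to the same number as "0.1", so it would decode as ball 1.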
Once cricsheet_ipl_load_innings is defined, it is simply a matter of mapping it over all the file names.
    # Read all the IPL data
    # (pattern is a regular expression, not a shell glob)
    filenames <- list.files("data", pattern = "\\.yaml$", full.names = TRUE)
    ipl_data <- map(filenames, cricsheet_ipl_load_innings)
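Since list.files() interprets its pattern argument as a regular expression rather than a shell glob, it can be worth checking which names a pattern actually selects. The following is a small sketch using made-up file names; glob2rx() from the utils package converts a glob into the corresponding regular expression.

```r
# Hypothetical file names, used only to illustrate the matching rules
candidates <- c("335982.yaml", "notes.yaml.bak", "README.txt")

# A regular expression anchored at the end keeps only names ending in .yaml
is_yaml <- grepl("\\.yaml$", candidates)
candidates[is_yaml]

# glob2rx() translates a shell-style glob into a regular expression
glob2rx("*.yaml")
```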
The call to map returns a large list, each element of which holds the two tables described above for one match. The following code combines them into two tables that hold the complete information.
    # Store all the data as individual data frames
    ret_table <- function(x, table) {
      return(x[[table]])
    }
    temp <- map(ipl_data, ret_table, "match_innings")
    match_innings <- bind_rows(temp)
    temp <- map(ipl_data, ret_table, "match_deliveries")
    match_deliveries <- bind_rows(temp)

    # Clean up
    rm(temp)
    rm(ipl_data)
    rm(filenames)
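As a possible alternative to a helper like ret_table, purrr's map() can extract a named element directly when given a string. The sketch below runs on a toy stand-in for ipl_data with placeholder teams and invented run values, so the numbers mean nothing; it only illustrates the combining step and a quick sanity check on the result.

```r
library(dplyr)
library(purrr)

# Toy stand-in for ipl_data: one list per match, each holding two tables
toy_data <- list(
  list(
    match_innings = tibble(id = 1L, innings_num = 1:2,
                           innings_team = c("Team A", "Team B")),
    match_deliveries = tibble(id = 1L, innings_num = c(1L, 1L, 2L),
                              delivery_runs_total = c(1L, 4L, 6L))
  ),
  list(
    match_innings = tibble(id = 2L, innings_num = 1:2,
                           innings_team = c("Team C", "Team D")),
    match_deliveries = tibble(id = 2L, innings_num = c(1L, 2L),
                              delivery_runs_total = c(0L, 2L))
  )
)

# map() with a string extracts the element of that name from each match
all_innings <- bind_rows(map(toy_data, "match_innings"))
all_deliveries <- bind_rows(map(toy_data, "match_deliveries"))

# Quick sanity check: total runs per match and innings
runs_by_innings <- all_deliveries %>%
  group_by(id, innings_num) %>%
  summarise(runs = sum(delivery_runs_total), .groups = "drop")
runs_by_innings
```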
The complete code is available on GitHub.