In a series of articles, I will be analysing Indian Premier League (IPL) cricket matches using data from Cricsheet and the R programming language. Cricsheet is an excellent website which provides ball-by-ball data for a large number of cricket matches. The IPL is a professional Twenty20 cricket league in India. I chose the IPL because the complete data for all seasons are available.
The data used is the ipl.zip file as downloaded in December 2017. The data are provided in YAML format and require some processing before they can be used for further analysis. I prefer to convert the data into multiple tables, which makes it easier to query and summarise the data at various levels. In this article, I will only be looking at reading match-level information; a subsequent article will cover the use of ball-by-ball information.
The format of the data is described on this page. YAML data has a tree-like structure. The R package yaml loads a YAML file and converts it into a deeply nested list structure. The user-defined function cricsheet_ipl_load_meta converts the data into a set of tables. The following tables (all of them linked via a common match id) are created:
- metadata – match id, file version, revision and date of creation
- match_info – captures the city, match date, player of the match, venue and a flag to indicate if the venue was neutral
- match_teams – captures the two teams which played the match
- match_toss – captures the team which won the toss and the decision to bat or field
- match_umpires – captures the two on-field umpires for the match
- match_outcome – captures the result of the match and the number of runs or wickets by which the match was won; if the match was decided via a super-over, then the result is recorded as a tie and a flag is set to indicate that an eliminator over was used
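To make the nested list structure concrete, here is a minimal sketch that parses a small YAML string with yaml.load. The sample is hand-written and heavily trimmed for illustration; it only mimics the shape of a Cricsheet file, and the meta values in particular are placeholders.

```r
library(yaml)

# Hand-written, trimmed-down sample (an assumption, not a real Cricsheet file)
sample_yaml <- "
meta:
  data_version: 0.9
  revision: 1
info:
  city: Bangalore
  teams:
    - Royal Challengers Bangalore
    - Kolkata Knight Riders
  toss:
    winner: Royal Challengers Bangalore
    decision: field
  outcome:
    by:
      runs: 140
    winner: Kolkata Knight Riders
"

match <- yaml.load(sample_yaml)

# Nested YAML keys become nested list elements
toss_decision <- match$info$toss$decision
margin_runs <- match$info$outcome$by$runs
teams <- unlist(match$info$teams)

str(match$info, max.level = 2)
```

Each table listed above is built by picking values out of this list, one branch at a time.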
I assume that an RStudio project has been set up and that all the input files are stored in a folder called data inside the project. The code starts by loading the required packages and defining the function described above.
```r
library(tidyverse)
library(yaml)
library(purrr)
library(lubridate)

cricsheet_ipl_load_meta <- function(input_file) {
  # Assign the match id based on the file name
  match_id <- str_extract(input_file, "[0-9]+")
  match_id <- parse_integer(match_id)
  writeLines(as.character(match_id))  # Print the id to show progress

  # Load the input file
  input_data <- yaml.load_file(input_file)

  # Metadata table
  meta_version <- input_data$meta$data_version
  meta_created <- ymd(input_data$meta$created)
  meta_revision <- input_data$meta$revision
  metadata <- tibble(
    id = match_id,
    version = meta_version,
    created = meta_created,
    revision = meta_revision
  )

  # Match information table
  info <- input_data$info
  info_city <- ifelse("city" %in% names(info), info$city, NA)
  info_date <- ymd(info$dates)  # Assume an IPL match is played on a single day
  info_player_of_match <- ifelse("player_of_match" %in% names(info),
                                 info$player_of_match, NA)
  info_venue <- ifelse("venue" %in% names(info), info$venue, NA)
  info_neutral_venue <- ifelse("neutral_venue" %in% names(info),
                               info$neutral_venue, 0)
  # Ignore competition, gender, overs
  match_info <- tibble(
    id = match_id,
    city = info_city,
    date = info_date,
    player_of_match = info_player_of_match,
    venue = info_venue,
    neutral_venue = info_neutral_venue
  )

  # Match teams table
  info_teams <- info$teams
  match_teams <- tibble(
    id = rep(match_id, 2),
    teams = info_teams
  )

  # Match toss table
  info_toss_winner <- info$toss$winner
  info_toss_decision <- info$toss$decision
  match_toss <- tibble(
    id = match_id,
    winner = info_toss_winner,
    decision = info_toss_decision
  )

  # Match umpires
  info_umpires <- info$umpires
  match_umpires <- tibble(
    id = rep(match_id, 2),
    umpires = info_umpires
  )

  # Match outcomes
  info_outcome <- input_data$info$outcome
  info_winner <- NA
  info_result <- NA
  info_result_margin <- NA
  info_eliminator <- NA
  if ("winner" %in% names(info_outcome)) {
    # Regular result: won by runs or by wickets
    info_winner <- info_outcome$winner
    info_eliminator <- "N"
    info_result <- ifelse("runs" %in% names(info_outcome$by), "runs", "wickets")
    info_result_margin <- ifelse("runs" %in% names(info_outcome$by),
                                 info_outcome$by$runs, info_outcome$by$wickets)
  } else if ("eliminator" %in% names(info_outcome)) {
    # Tie decided by an eliminator (super) over
    info_winner <- info_outcome$eliminator
    info_eliminator <- "Y"
    info_result <- info_outcome$result
  }
  info_method <- ifelse("method" %in% names(info_outcome), info_outcome$method, NA)
  match_outcome <- tibble(
    id = match_id,
    winner = info_winner,
    result = info_result,
    result_margin = info_result_margin,
    eliminator = info_eliminator,
    method = info_method
  )

  # Return a list of tables
  retlist <- list(metadata = metadata, match_info = match_info,
                  match_teams = match_teams, match_toss = match_toss,
                  match_umpires = match_umpires, match_outcome = match_outcome)
  return(retlist)
}
```
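The function guards every optional field with a check of the form `"city" %in% names(info)`. As an alternative sketch (not what the function above does), the same fallback can be written with the `%||%` operator exported by purrr, which returns its left-hand side unless that is NULL. The two info lists below are hypothetical stand-ins for the parsed YAML.

```r
library(purrr)  # exports the %||% null-default operator

# Hypothetical parsed info blocks; the second lacks the optional "city" key
info_full <- list(city = "Mumbai", venue = "Wankhede Stadium")
info_partial <- list(venue = "Wankhede Stadium")

# [[ ]] matches names exactly and returns NULL for a missing key,
# so %||% can supply the default
city_full <- info_full[["city"]] %||% NA_character_
city_partial <- info_partial[["city"]] %||% NA_character_
neutral_venue <- info_partial[["neutral_venue"]] %||% 0
```

Using NA_character_ rather than a bare NA keeps the column type the same whether or not the key is present, which can make the later row-binding step more predictable.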
Once the above function is loaded, it is a simple job of mapping it over all the file names.
```r
# Read all the IPL data
# Note: pattern is a regular expression, not a glob
filenames <- list.files("data", pattern = "\\.yaml$", full.names = TRUE)
ipl_data <- map(filenames, cricsheet_ipl_load_meta)
```
The call to map returns a large list with one element per match, each of which holds the six tables described above. The following code combines them into six individual tables which hold the complete information.
```r
# Store all the data as individual data frames
ret_table <- function(x, table) {
  return(x[[table]])
}

temp <- map(ipl_data, ret_table, "metadata")
metadata <- bind_rows(temp)

temp <- map(ipl_data, ret_table, "match_info")
match_info <- bind_rows(temp)

temp <- map(ipl_data, ret_table, "match_teams")
match_teams <- bind_rows(temp)

temp <- map(ipl_data, ret_table, "match_toss")
match_toss <- bind_rows(temp)

temp <- map(ipl_data, ret_table, "match_umpires")
match_umpires <- bind_rows(temp)

temp <- map(ipl_data, ret_table, "match_outcome")
match_outcome <- bind_rows(temp)

# Clean up
rm(temp)
rm(ipl_data)
rm(filenames)
```
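The extraction step can also be written more compactly, because map accepts a name in place of a function and then pulls that element out of every list entry. The sketch below uses a tiny hypothetical stand-in for ipl_data (two matches, two tables, made-up team names) and finishes with a join on the shared id column to show how the tables link together.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical stand-in for ipl_data: two matches, two tables each
ipl_data <- list(
  list(
    match_info = tibble(id = 1L, city = "Mumbai"),
    match_outcome = tibble(id = 1L, winner = "Team A")
  ),
  list(
    match_info = tibble(id = 2L, city = "Chennai"),
    match_outcome = tibble(id = 2L, winner = "Team B")
  )
)

# For each table name, extract that table from every match and stack the rows
table_names <- c("match_info", "match_outcome")
tables <- map(set_names(table_names), function(nm) bind_rows(map(ipl_data, nm)))

# The common match id links the tables
combined <- inner_join(tables$match_info, tables$match_outcome, by = "id")
print(combined)
```

The result is a named list of tables rather than six separate objects, which is a design choice: it keeps the workspace tidy, at the cost of writing tables$match_info instead of match_info.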
Continue to the second part. The complete code is available on GitHub.