MLB Baseball Pitching Matchups ~ grabbing pitcher and/or batter codes by specifying the game date, using R and XML
MLB Gameday stores its game data in XML format, with the players identified by ID numbers. To find out who is who, look in the pitchers or batters directory of each game, which holds one XML file per player ID.
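To make the ID lookup concrete, here is a minimal sketch of reading one per-player file. The XML string is a hand-written stand-in that carries only the five attributes the script below reads; real Gameday files are assumed to hold more. It is parsed from text so the example runs without a network connection.

```r
library(XML)

# Illustrative stand-in for one per-player file (e.g. pitchers/453311.xml).
# Assumption: real files carry more attributes than the five shown here.
player.xml <- '<Player team="sfn" id="453311" type="pitcher" first_name="Tim" last_name="Lincecum"/>'

# Parse the string rather than a URL so the example works offline
XML.player <- xmlInternalTreeParse(player.xml, asText = TRUE)

# Read each attribute of the <Player> node into a named character vector
fields <- c("team", "id", "type", "first_name", "last_name")
info.player <- sapply(fields, function(f) {
  xpathSApply(XML.player, "//Player", xmlGetAttr, f)
})

print(info.player)
```

The result is a named character vector, so info.player["id"] gives the code for the player named in the file.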
My DownloadPitchFX.R script downloads the ID numbers, but it doesn’t resolve them to player names because the lookup adds processing time. To use the data (say, in RMySQL), it helps to have a separate script that finds the ID number for any player.
The following script (GetPitcherBatterCodes.R) takes the last and/or first name of the player, the team he plays on, and a specific date on which he is assumed to have played. It returns a data frame with every matching name and its ID number. You can also set just.player = FALSE to return all of the pitchers or batters listed for that game (the script downloads every player's file either way, so this only changes what is returned).
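The "and/or" matching can be sketched on a small hand-built matrix laid out the way the script stores player info (one column per player). The helper name match_players is hypothetical; the three players and IDs are taken from the sample output further down.

```r
# Small stand-in for info.players: rows are fields, columns are players
info.players <- rbind(
  id         = c("453311", "430912", "217096"),
  first_name = c("Tim", "Matt", "Barry"),
  last_name  = c("Lincecum", "Cain", "Zito")
)

# Hypothetical helper: keep any player whose last OR first name matches
match_players <- function(info, last_name, first_name) {
  hit <- info["last_name", ] == last_name | info["first_name", ] == first_name
  data.frame(id         = info["id", hit],
             first_name = info["first_name", hit],
             last_name  = info["last_name", hit],
             stringsAsFactors = FALSE)
}

# A last name from one player and a first name from another returns both
matched <- match_players(info.players, last_name = "Lincecum", first_name = "Matt")
print(matched)
```

Because either name is enough for a match, a common first name can return several rows, which is why the result is a data frame rather than a single ID.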
The input for the team name is fairly flexible. You can use the codes that Gameday uses (“SF”, “sfn”), the team's city (“San Francisco”), or its team name (“Giants”).
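A minimal sketch of why any of those forms works: the script checks whether the team string appears anywhere in a game's row of scoreboard attributes. The matrix below is a hand-made stand-in for the parsed scoreboard; the gameday_link values and the second game's teams are placeholders, and find_game_link is a hypothetical helper name.

```r
# Stand-in for the parsed scoreboard: one row per game, one column per attribute.
# Assumption: link values and the second game are placeholders, not real data.
scoreboard <- matrix(
  c("placeholder_link_1", "SF", "SD", "sfn", "sdn",
    "San Francisco", "San Diego", "Giants", "Padres",
    "placeholder_link_2", "AA", "BB", "aaa", "bbb",
    "City A", "City B", "Team A", "Team B"),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("gameday_link", "away_name_abbrev", "home_name_abbrev",
                          "away_code", "home_code", "away_team_city",
                          "home_team_city", "away_team_name", "home_team_name"))
)

# Hypothetical helper: return the link of the first game whose row contains `team`
find_game_link <- function(scoreboard, team) {
  hit <- apply(scoreboard, 1, function(row) team %in% row)
  scoreboard[hit, "gameday_link"][1]  # [1] keeps only the first game of a doubleheader
}

print(find_game_link(scoreboard, "Giants"))
print(find_game_link(scoreboard, "sfn"))
print(find_game_link(scoreboard, "giants"))  # exact match only, so this finds nothing
```

Since %in% compares whole strings exactly, the team has to be spelled and capitalized the way Gameday stores it.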
## GetPitcherBatterCodes.R
## get pitcher batter codes for pitch f/x

library(XML)

# -- Outputs
# data frame of all matching names, OR
# data frame of all batters or pitchers in game
# -- Inputs
# game.date ~ game date player plays in, default POSIXlt format, e.g. "2009-05-20"
# is.pitcher ~ TRUE for pitcher, FALSE for batter
# last_name ~ a character vector for the last name
# first_name ~ a char vector for first name,
#   have to spell correctly but don't need both first and last names..
# team ~ denote team that player plays in,
#   use any of the following code within quotes.. example for SF Giants, or SD Padres:
#   away_name_abbrev="SF" home_name_abbrev="SD"
#   away_code="sfn" away_file_code="sf" away_team_city="San Francisco" away_team_name="Giants"
#   home_code="sdn" home_file_code="sd" home_team_city="San Diego" home_team_name="Padres"
# just.player ~ TRUE to get ID for player, FALSE to grab all pitchers OR batters in game

GetPitcherBatterCodes <- function(game.date = "2009-05-20",
                                  is.pitcher = TRUE,
                                  last_name = "Lincecum",
                                  first_name = "Tim",
                                  team = "sfn",
                                  just.player = TRUE,
                                  URL.base = "http://gd2.mlb.com/components/game/mlb/") {

  # extract date
  game.date <- as.POSIXlt(game.date)
  year <- game.date$year + 1900
  month <- game.date$mon + 1
  day <- game.date$mday
  URL.date <- paste(URL.base, "year_", year, "/",
                    ifelse(month >= 10, "month_", "month_0"), month, "/",
                    ifelse(day >= 10, "day_", "day_0"), day, "/", sep = "")

  # extract miniscoreboard.xml
  URL.scoreboard <- paste(URL.date, "miniscoreboard.xml", sep = "")
  XML.scoreboard <- xmlInternalTreeParse(URL.scoreboard)
  parse.scoreboard <- sapply(c("gameday_link",
                               "away_name_abbrev", "home_name_abbrev",
                               "away_code", "home_code",
                               "away_file_code", "home_file_code",
                               "away_team_city", "home_team_city",
                               "away_team_name", "home_team_name"),
                             function(x) xpathSApply(XML.scoreboard, "//game[@*]", xmlGetAttr, x))

  # get game URL of specified team
  team.index <- apply(parse.scoreboard, 1, function(x) team %in% x)
  team.URL <- parse.scoreboard[team.index, 1][1]  # protect from double headers
  URL.game <- paste(URL.date, "gid_", team.URL, "/", sep = "")

  # get player data
  URL.players <- ifelse(is.pitcher,
                        paste(URL.game, "pitchers/", sep = ""),
                        paste(URL.game, "batters/", sep = ""))
  HTML.players <- htmlParse(URL.players)
  codes.players <- xpathSApply(HTML.players, "//a[@*]", xmlGetAttr, "href")[-1]

  # loop through player codes to match last AND/OR first name
  info.players <- sapply(codes.players, function(x) {
    URL.player <- paste(URL.players, x, sep = "")
    XML.player <- xmlInternalTreeParse(URL.player)
    print(x)
    info.player <- sapply(c("team", "id", "type", "first_name", "last_name"),
                          function(x) xpathSApply(XML.player, "//Player[@*]", xmlGetAttr, x))
  })

  # get results and match player names if necessary
  if (just.player == TRUE) {
    last.index <- last_name == info.players["last_name",]
    first.index <- first_name == info.players["first_name",]
    matched.index <- as.logical(last.index + first.index)
    matched.players <- data.frame(id = info.players["id", matched.index],
                                  first_name = info.players["first_name", matched.index],
                                  last_name = info.players["last_name", matched.index])
    return(matched.players)
  } else return(info.players)
}
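The script zero-pads the month and day by hand with ifelse(). As an alternative sketch, not the author's method, base R's format() can produce the same date path in one call. The helper name build_date_url is hypothetical, and the default base URL is the one from the script.

```r
# Hypothetical helper: build the dated Gameday directory URL for a game date
build_date_url <- function(game.date,
                           URL.base = "http://gd2.mlb.com/components/game/mlb/") {
  # %Y, %m and %d are zero-padded, so no ifelse() padding is needed;
  # the other characters in the format string are copied through literally
  paste0(URL.base, format(as.Date(game.date), "year_%Y/month_%m/day_%d/"))
}

print(build_date_url("2009-05-20"))
print(build_date_url("2009-10-03"))
```

This only builds the string; whether the server still answers at that address is a separate question.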
Some output from GetPitcherBatterCodes(), first with the default arguments and then with just.player = FALSE:
> aho <- GetPitcherBatterCodes()
> aho
      id first_name last_name
1 453311        Tim  Lincecum
> aho2 <- GetPitcherBatterCodes(just.player = FALSE)
> aho2
           116615.xml  133982.xml  217096.xml  277405.xml  346793.xml  408241.xml
team       "sfn"       "sfn"       "sfn"       "sfn"       "sfn"       "sdn"
id         "116615"    "133982"    "217096"    "277405"    "346793"    "408241"
type       "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"
first_name "Randy"     "Bob"       "Barry"     "Justin"    "Jeremy"    "Jake"
last_name  "Johnson"   "Howry"     "Zito"      "Miller"    "Affeldt"   "Peavy"
           425514.xml  429718.xml  429723.xml  429781.xml  429985.xml  430161.xml
team       "sdn"       "sdn"       "sfn"       "sdn"       "sdn"       "sfn"
id         "425514"    "429718"    "429723"    "429781"    "429985"    "430161"
type       "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"
first_name "Heath"     "Shawn"     "Merkin"    "Kevin"     "Chad"      "Noah"
last_name  "Bell"      "Hill"      "Valdez"    "Correia"   "Gaudin"    "Lowry"
           430606.xml  430650.xml  430657.xml  430665.xml  430912.xml  432934.xml
team       "sdn"       "sdn"       "sdn"       "sfn"       "sfn"       "sdn"
id         "430606"    "430650"    "430657"    "430665"    "430912"    "432934"
type       "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"
first_name "Mike"      "Edwin"     "Cha Seung" "Brandon"   "Matt"      "Chris"
last_name  "Adams"     "Moreno"    "Baek"      "Medders"   "Cain"      "Young"
           435619.xml  445995.xml  446207.xml  448592.xml  450312.xml  450527.xml
team       "sfn"       "sdn"       "sdn"       "sdn"       "sdn"       "sfn"
id         "435619"    "445995"    "446207"    "448592"    "450312"    "450527"
type       "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"
first_name "Pat"       "Arturo"    "Josh"      "Cla"       "Mark"      "Alex"
last_name  "Misch"     "Lopez"     "Geer"      "Meredith"  "Worrell"   "Hinshaw"
           450832.xml  451216.xml  452724.xml  453281.xml  453311.xml  456043.xml
team       "sfn"       "sfn"       "sfn"       "sdn"       "sfn"       "sfn"
id         "450832"    "451216"    "452724"    "453281"    "453311"    "456043"
type       "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"
first_name "Jesse"     "Brian"     "Billy"     "Wade"      "Tim"       "Jonathan"
last_name  "English"   "Wilson"    "Sadler"    "LeBlanc"   "Lincecum"  "Sanchez"
           457117.xml  457566.xml  458155.xml  459987.xml  460044.xml  464351.xml
team       "sdn"       "sdn"       "sfn"       "sdn"       "sdn"       "sfn"
id         "457117"    "457566"    "458155"    "459987"    "460044"    "464351"
type       "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"
first_name "Ernesto"   "Greg"      "Joe"       "Cesar"     "Cesar"     "Kelvin"
last_name  "Frieri"    "Burke"     "Martinez"  "Ramos"     "Carrillo"  "Pichardo"
           464400.xml  465629.xml  466412.xml  467683.xml  471183.xml  477581.xml
team       "sfn"       "sdn"       "sdn"       "sfn"       "sfn"       "sdn"
id         "464400"    "465629"    "466412"    "467683"    "471183"    "477581"
type       "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"   "pitcher"
first_name "Henry"     "Edward"    "Luis"      "Osiris"    "Waldis"    "Walter"
last_name  "Sosa"      "Mujica"    "Perdomo"   "Matos"     "Joaquin"   "Silva"
           489265.xml  491159.xml  502381.xml  503355.xml
team       "sfn"       "sdn"       "sdn"       "sdn"
id         "489265"    "491159"    "502381"    "503355"
type       "pitcher"   "pitcher"   "pitcher"   "pitcher"
first_name "Sergio"    "Joe"       "Luke"      "Jackson"
last_name  "Romo"      "Thatcher"  "Gregerson" "Quezada"