[This article was first published on NYC Data Science Academy » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Slides: Taipei opendata 20141209 from Summit Suen
What comes up with you while talking about SPORTS DATA ANALYSIS?!
Hot, Hot, Hot: Big Data & Sports (1/2)
Hot, Hot, Hot: Big Data & Sports (2/2)
In fact, works analyzing baseball data with statistical techniques has already exist: Sabermatrics
There are plenty of excellent packages that could help us analyzing baseball data.
A simple demo with r-packages: dplyr & lahman
library(<span class="pl-vo">Lahman</span>) library(<span class="pl-vo">dplyr</span>) <span class="pl-vo">totalRS</span> <span class="pl-k"><-</span> <span class="pl-vo">Teams</span> %<span class="pl-k">></span>% select(<span class="pl-vo">yearID</span>, <span class="pl-vo">R</span>, <span class="pl-vo">G</span>) %<span class="pl-k">></span>% mutate(<span class="pl-v">AvgRperG</span> <span class="pl-k">=</span> <span class="pl-vo">R</span><span class="pl-k">/</span><span class="pl-vo">G</span>) %<span class="pl-k">></span>% group_by(<span class="pl-vo">yearID</span>) %<span class="pl-k">></span>% summarise(sum(<span class="pl-vo">AvgRperG</span>)) names(<span class="pl-vo">totalRS</span>) <span class="pl-k"><-</span> c(<span class="pl-s1"><span class="pl-pds">"</span>yearID<span class="pl-pds">"</span></span>, <span class="pl-s1"><span class="pl-pds">"</span>RUN<span class="pl-pds">"</span></span>) head(<span class="pl-vo">totalRS</span>)
## Source: local data frame [6 x 2] ## ## yearID RUN ## 1 1871 93.12897 ## 2 1872 95.21474 ## 3 1873 73.15998 ## 4 1874 58.55903 ## 5 1875 70.08774 ## 6 1876 47.01267
And one could easily observe the trend of scoring in MLB history via lahman + ggplot2
library(<span class="pl-vo">ggplot2</span>) ggplot(<span class="pl-v">data</span> <span class="pl-k">=</span> <span class="pl-vo">totalRS</span>, aes(<span class="pl-v">x</span> <span class="pl-k">=</span> <span class="pl-vo">yearID</span>, <span class="pl-v">y</span> <span class="pl-k">=</span> <span class="pl-vo">RUN</span>)) <span class="pl-k">+</span> stat_smooth() <span class="pl-k">+</span> geom_line()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Even admire the career numbers of Derek Jeter! LOL
library(<span class="pl-vo">Lahman</span>) library(<span class="pl-vo">dplyr</span>) head(filter(<span class="pl-vo">Batting</span>, <span class="pl-vo">playerID</span> <span class="pl-k">==</span> <span class="pl-s1"><span class="pl-pds">"</span>jeterde01<span class="pl-pds">"</span></span>))
## playerID yearID stint teamID lgID G G_batting AB R H X2B X3B HR ## 1 jeterde01 1995 1 NYA AL 15 15 48 5 12 4 1 0 ## 2 jeterde01 1996 1 NYA AL 157 157 582 104 183 25 6 10 ## 3 jeterde01 1997 1 NYA AL 159 159 654 116 190 31 7 10 ## 4 jeterde01 1998 1 NYA AL 149 149 626 127 203 25 8 19 ## 5 jeterde01 1999 1 NYA AL 158 158 627 134 219 37 9 24 ## 6 jeterde01 2000 1 NYA AL 148 148 593 119 201 31 4 15 ## RBI SB CS BB SO IBB HBP SH SF GIDP G_old ## 1 7 0 0 3 11 0 0 0 0 0 15 ## 2 78 14 7 48 102 1 9 6 9 13 157 ## 3 70 23 12 74 125 0 10 8 2 14 159 ## 4 84 30 6 57 119 1 5 3 3 13 149 ## 5 102 19 8 91 116 5 12 3 6 12 158 ## 6 73 22 4 68 99 4 12 3 3 14 148
Here comes another awesome package: pitchRx
Which offers APIs accessing daily MLB logs and 2D/3D animations.
install.packages(<span class="pl-s1"><span class="pl-pds">"</span>pitchRx<span class="pl-pds">"</span></span>) library(<span class="pl-vo">pitchRx</span>)
OpenWAR is a open project for calculating WAR – Wins Above Replacement
WAR offers an estimate to answer the question, “If this player got injured and their team had to replace them with a freely available minor leaguer or a AAAA player from their bench, how much value would the team be losing?”
install.packages(<span class="pl-s1"><span class="pl-pds">"</span>Sxslt<span class="pl-pds">"</span></span>, <span class="pl-v">repos</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">"</span>http://www.omegahat.org/R<span class="pl-pds">"</span></span>, <span class="pl-v">type</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">"</span>source<span class="pl-pds">"</span></span>) require(<span class="pl-vo">devtools</span>) install_github(<span class="pl-s1"><span class="pl-pds">"</span>openWAR<span class="pl-pds">"</span></span>, <span class="pl-s1"><span class="pl-pds">"</span>beanumber<span class="pl-pds">"</span></span>)
After the amazing tour of useful packages, we need to write our OWN crawler/analyzer!!!
Data Source
Crawler in R (1/2)
<span class="pl-c">## RSelenium + phantomJS</span> library(<span class="pl-vo">XML</span>) library(<span class="pl-vo">RSelenium</span>) <span class="pl-vo">pJS</span> <span class="pl-k"><-</span> phantom() Sys.sleep(<span class="pl-c1">5</span>) <span class="pl-vo">remDr</span> <span class="pl-k"><-</span> remoteDriver(<span class="pl-v">browserName</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">'</span>phantomjs<span class="pl-pds">'</span></span>) <span class="pl-vo">remDr</span><span class="pl-k">$</span>open() <span class="pl-vo">url</span> <span class="pl-k"><-</span> <span class="pl-s1"><span class="pl-pds">'</span>http://www.cpbl.com.tw/stats_hr.aspx<span class="pl-pds">'</span></span> <span class="pl-vo">remDr</span><span class="pl-k">$</span>navigate(<span class="pl-vo">url</span>) <span class="pl-vo">hr_tables</span> <span class="pl-k"><-</span> <span class="pl-st">list</span>()
Crawler in R (2/2)
<span class="pl-vo">year</span> <span class="pl-k"><-</span> seq(<span class="pl-c1">1990</span>, <span class="pl-c1">2014</span>) <span class="pl-k">for</span> (<span class="pl-vo">i</span> <span class="pl-k">in</span> <span class="pl-c1">1</span><span class="pl-k">:</span>length(<span class="pl-vo">year</span>)) { <span class="pl-vo">option</span> <span class="pl-k"><-</span> <span class="pl-vo">remDr</span><span class="pl-k">$</span>findElement(<span class="pl-v">using</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">'</span>xpath<span class="pl-pds">'</span></span>, paste0(<span class="pl-s1"><span class="pl-pds">"</span>//*/option[@value = '<span class="pl-pds">"</span></span>,<span class="pl-vo">year</span>[<span class="pl-vo">i</span>],<span class="pl-s1"><span class="pl-pds">"</span>']<span class="pl-pds">"</span></span>)) <span class="pl-vo">option</span><span class="pl-k">$</span>clickElement() <span class="pl-vo">src</span> <span class="pl-k"><-</span> <span class="pl-vo">option</span><span class="pl-k">$</span>getPageSource() <span class="pl-vo">hr_tables</span>[<span class="pl-vo">i</span>] <span class="pl-k"><-</span> readHTMLTable(<span class="pl-vo">src</span>[[<span class="pl-c1">1</span>]], <span class="pl-v">stringAsFactors</span> <span class="pl-k">=</span> <span class="pl-c1">FALSE</span>, <span class="pl-v">encoding</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">"</span>utf8<span class="pl-pds">"</span></span>) } <span class="pl-c"># View(hr_tables)</span> <span class="pl-vo">remDr</span><span class="pl-k">$</span>close() <span class="pl-vo">pJS</span><span class="pl-k">$</span>stop()
Let’s PLAY DATA!
load(<span class="pl-s1"><span class="pl-pds">"</span>hr_tables.RData<span class="pl-pds">"</span></span>) head(<span class="pl-vo">hr_tables</span>[[<span class="pl-c1">25</span>]])
## NUMBER YEAR GID DATE STADIUM BATTER BATTERTEAM PITCHER ## 1 # YEAR GAME NO. DATE STADIUM PLAYER PLAYER TEAM PITCHER ## 2 1 2014 2 2014/3/23 天母 林 威 助 中信兄弟 費 古 洛 ## 3 2 2014 3 2014/3/23 澄清湖 詹 智 堯 Lamigo 林 正 豐 ## 4 3 2014 4 2014/3/25 新莊 張 志 豪 中信兄弟 林 晨 樺 ## 5 4 2014 5 2014/3/26 嘉義市 林 泓 育 Lamigo 林 岳 平 ## 6 5 2014 6 2014/3/27 新莊 張 志 豪 中信兄弟 黃 勝 雄 ## PITCHERTEAM RBI REMARK ## 1 PITCHER TEAM RBI REMARK ## 2 統一7-ELEVEn 1 ## 3 義大 1 ## 4 義大 1 ## 5 統一7-ELEVEn 1 ## 6 義大 1
Draw some pictures (1/2)
par(<span class="pl-v">family</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">'</span>Heiti TC Light<span class="pl-pds">'</span></span>) <span class="pl-vo">m</span> <span class="pl-k"><-</span> ggplot(<span class="pl-vo">hr_table</span>, aes(<span class="pl-v">x</span> <span class="pl-k">=</span> as.numeric(<span class="pl-vo">GID</span>))) <span class="pl-vo">m</span> <span class="pl-k">+</span> geom_density(aes(<span class="pl-v">fill</span> <span class="pl-k">=</span> <span class="pl-st">factor</span>(<span class="pl-vo">BatterTeam</span>), <span class="pl-v">alpha</span> <span class="pl-k">=</span> <span class="pl-c1">0.01</span>)) <span class="pl-k">+</span> labs(<span class="pl-v">title</span> <span class="pl-k">=</span> paste0(<span class="pl-s1"><span class="pl-pds">"</span>HR Distribution of <span class="pl-pds">"</span></span>, <span class="pl-vo">year</span>[<span class="pl-vo">i</span>],<span class="pl-s1"><span class="pl-pds">"</span> Season 中華職棒<span class="pl-pds">"</span></span>, <span class="pl-vo">year</span>[<span class="pl-vo">i</span>], <span class="pl-s1"><span class="pl-pds">"</span>賽季逐場全壘打分布<span class="pl-pds">"</span></span>)) <span class="pl-k">+</span> theme(<span class="pl-v">text</span> <span class="pl-k">=</span> element_text(<span class="pl-v">family</span><span class="pl-k">=</span><span class="pl-s1"><span class="pl-pds">"</span>Heiti TC Light<span class="pl-pds">"</span></span>), <span class="pl-v">plot.title</span> <span class="pl-k">=</span> element_text(<span class="pl-v">size</span> <span class="pl-k">=</span> <span class="pl-c1">26</span>), <span class="pl-v">legend.text</span> <span class="pl-k">=</span> element_text(<span class="pl-v">size</span> <span class="pl-k">=</span> <span class="pl-c1">20</span>))
Keep going – further analysis
load(<span class="pl-s1"><span class="pl-pds">"</span>sc_tables.RData<span class="pl-pds">"</span></span>) <span class="pl-c"># 賽伯計量學的畢達哥拉斯定理</span> <span class="pl-en">win_pc</span> <span class="pl-k"><-</span> <span class="pl-k">function</span>(<span class="pl-vo">RS</span>, <span class="pl-vo">RA</span>) { <span class="pl-vo">perc</span> <span class="pl-k"><-</span> <span class="pl-vo">RS</span><span class="pl-k">^</span><span class="pl-c1">2</span> <span class="pl-k">/</span> (<span class="pl-vo">RS</span><span class="pl-k">^</span><span class="pl-c1">2</span> <span class="pl-k">+</span> <span class="pl-vo">RA</span><span class="pl-k">^</span><span class="pl-c1">2</span>) <span class="pl-k">return</span>(<span class="pl-vo">perc</span>) }
Draw some pictures (2/2)
levels(<span class="pl-vo">win_prob</span><span class="pl-k">$</span><span class="pl-vo">team</span>)
## [1] "統一7-ELEVEn" "義大" "中信兄弟" "Lamigo"
<span class="pl-c"># fig <- ggplot(data = win_prob, aes(x = team, y = real, fill = factor(half)))</span> <span class="pl-c"># fig + geom_bar(stat = "identity", position="dodge")</span> ggplot(<span class="pl-v">data</span> <span class="pl-k">=</span> <span class="pl-vo">win_prob</span>, aes(<span class="pl-v">x</span> <span class="pl-k">=</span> <span class="pl-vo">expd</span>, <span class="pl-v">y</span> <span class="pl-k">=</span> <span class="pl-vo">real</span>, <span class="pl-v">color</span> <span class="pl-k">=</span> <span class="pl-st">factor</span>(<span class="pl-vo">team</span>))) <span class="pl-k">+</span> stat_smooth() <span class="pl-k">+</span> geom_point() <span class="pl-k">+</span> coord_fixed()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
An example demonstrating the relationship between “Real Value” v.s. “Market Value”
Advertising
To leave a comment for the author, please follow the link and comment on their blog: NYC Data Science Academy » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.