Cricket data analysis
[This article was first published on Enterprise Software Doesn't Have to Suck, and kindly contributed to R-bloggers].
Cricket World Cup 2011 is approaching, and I’m interested in analyzing One Day International cricket data to predict some results and share interesting information about cricket.
For the analysis, I needed cricket data and tried several things to get it:
- Personal research: I explored the web but couldn’t find aggregated cricket data anywhere. There are many statistics-oriented cricket websites, but none of them was useful (except Cricinfo, my favorite cricket website).
- Reached out to my network: I asked my friends for advice last month and received many emails with information and offers to help compile the data.
- Reached out to sports data companies: I contacted Opta Sports to buy this data. Although they have the data, it was too expensive for my personal experiment.
I’m happy to report that I now have a robust Ruby script that can download all One Day International cricket data (3000+ matches) into 3 handy files:
1) Win-Loss data:
Match_ID | Team1 | Team2 | Winner | Margin | First.Innings.Total | Second.Innings.Total | Ground | Matchdate | Ground_Country | Ground_Latitude | Ground_Longitude | Series |
ODI no. 1 | Sri Lanka | New Zealand | no result | | 203 | | Dambulla | Aug 19, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series |
ODI no. 2 | Sri Lanka | India | Sri Lanka | 8 wickets | 103 | 104 | Dambulla | Aug 22, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series |
ODI no. 3 | India | New Zealand | India | 105 runs | 223 | 118 | Dambulla | Aug 25, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series |
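As a rough illustration of how a file in this layout could be consumed, here is a minimal Ruby sketch that splits pipe-delimited rows into hashes and tallies wins per team. The helper names (`parse_row`, `win_counts`) are hypothetical and not part of the original script, and the sketch assumes that fields never contain a `|` character themselves.

```ruby
# Column names copied from the Win-Loss header.
WIN_LOSS_COLUMNS = %w[
  Match_ID Team1 Team2 Winner Margin First.Innings.Total Second.Innings.Total
  Ground Matchdate Ground_Country Ground_Latitude Ground_Longitude Series
].freeze

# Split one pipe-delimited line into a Hash keyed by column name.
# Assumption: "|" is used only as a separator, never inside a field.
def parse_row(line, columns)
  values = line.split("|", -1).map(&:strip)
  columns.zip(values).to_h
end

# Count wins per team, skipping matches that have no winner.
def win_counts(rows)
  rows.each_with_object(Hash.new(0)) do |row, counts|
    winner = row["Winner"]
    next if winner.nil? || winner.empty? || winner == "no result"
    counts[winner] += 1
  end
end

# Two of the sample rows from the Win-Loss data.
sample = [
  "ODI no. 2 | Sri Lanka | India | Sri Lanka | 8 wickets | 103 | 104 | Dambulla | Aug 22, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series |",
  "ODI no. 3 | India | New Zealand | India | 105 runs | 223 | 118 | Dambulla | Aug 25, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series |"
]

rows = sample.map { |line| parse_row(line, WIN_LOSS_COLUMNS) }
p win_counts(rows)
```

Keeping every value as a string at this stage is deliberate: fields such as Margin ("8 wickets", "105 runs") mix numbers and units, so any numeric conversion is better done per column later.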
2) Batting data:
Match_ID | Inning | Player | Country | Out | Runs | Minutes | Balls | Fours | Sixes | Scorerate |
ODI no. 1 | 1 | V Sehwag | India | lbw b Kulasekara | 12 | | 12 | 2 | 0 | 100 |
ODI no. 1 | 1 | RG Sharma | India | lbw b Mathews | 11 | | 21 | 2 | 0 | 52.38 |
ODI no. 1 | 1 | Yuvraj Singh | India | lbw b Malinga | 38 | | 64 | 5 | 1 | 59.37 |
ODI no. 1 | 1 | SK Raina | India | c Sangakkara b Perera | 8 | | 16 | 1 | 0 | 50 |
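The Scorerate values in these sample rows look consistent with the usual strike-rate definition, runs per 100 balls faced, if the two numbers after the dismissal are read as Runs and Balls (with Minutes blank). That reading is inferred from the numbers, not taken from the script. The sample also appears to truncate to two decimals rather than round (38 off 64 balls is 59.375, shown as 59.37), so this sketch uses `floor(2)`; `strike_rate` is a hypothetical helper name.

```ruby
# Strike rate: runs scored per 100 balls faced, truncated to two decimals.
# Returns nil when no balls were faced, to avoid dividing by zero.
def strike_rate(runs, balls)
  return nil if balls.zero?
  (runs * 100.0 / balls).floor(2)
end

# [player, runs, balls] from the sample batting rows,
# assuming the blank column in each row is Minutes.
sample_batting = [
  ["V Sehwag", 12, 12],
  ["RG Sharma", 11, 21],
  ["Yuvraj Singh", 38, 64],
  ["SK Raina", 8, 16]
]

sample_batting.each do |player, runs, balls|
  puts "#{player}: #{strike_rate(runs, balls)}"
end
```

Recomputing a derived column like this is a cheap sanity check on scraped data: a row whose stored Scorerate disagrees with its Runs and Balls probably had its columns shifted during extraction.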
3) Bowling data:
Match_ID | Inning | Player | Country | Overs | Maidens | Runs | Wickets | Economy |
ODI no. 1 | 1 | SL Malinga | Sri Lanka | 9 | 1 | 21 | 2 | 2.33 |
ODI no. 1 | 1 | KMDN Kulasekara | Sri Lanka | 9 | 2 | 31 | 2 | 3.44 |
ODI no. 1 | 1 | AD Mathews | Sri Lanka | 8 | 3 | 20 | 1 | 2.5 |
ODI no. 1 | 1 | NLTC Perera | Sri Lanka | 7.4 | 1 | 28 | 5 | 3.65 |
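One detail worth noting when working with the Overs column: cricket notation such as 7.4 means 7 complete overs plus 4 balls, not 7.4 decimal overs. The sketch below shows how the Economy values in the sample rows can be reproduced under that reading. It assumes six-ball overs, and the helper names (`overs_to_balls`, `economy`) are hypothetical rather than taken from the original script.

```ruby
# Convert cricket overs notation to balls bowled.
# "7.4" means 7 complete overs plus 4 balls.
# Assumption: six-ball overs.
def overs_to_balls(overs)
  whole, part = overs.to_s.split(".")
  whole.to_i * 6 + part.to_i
end

# Economy rate: runs conceded per six-ball over, to two decimals.
# Returns nil when no balls were bowled.
def economy(runs, overs)
  balls = overs_to_balls(overs)
  return nil if balls.zero?
  (runs * 6.0 / balls).round(2)
end

# [bowler, overs, runs] from the sample bowling rows.
sample_bowling = [
  ["SL Malinga", "9", 21],
  ["KMDN Kulasekara", "9", 31],
  ["AD Mathews", "8", 20],
  ["NLTC Perera", "7.4", 28]
]

sample_bowling.each do |bowler, overs, runs|
  puts "#{bowler}: #{economy(runs, overs)}"
end
```

Treating 7.4 as a plain decimal would give 28 / 7.4, roughly 3.78, instead of the 3.65 shown in the sample, which is why the overs value is passed as a string and split on the dot.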
This Ruby script takes about 40 minutes on a fast internet connection to collect the data. It took me about 40 hours to write and fine-tune the script. Most of that time was spent dealing with the typical data issues associated with web scraping and making the script generic enough to handle Test cricket and T20 cricket scorecards as well.