Cricket data analysis

[This article was first published on Enterprise Software Doesn't Have to Suck, and kindly contributed to R-bloggers.]

The 2011 Cricket World Cup is approaching, and I want to analyze One Day International (ODI) cricket data to predict some results and share interesting information about cricket.


 
For the analysis I need cricket data, and I tried several ways to get it:
  • Personal research: I explored the web but couldn’t find aggregated cricket data anywhere. There are many statistics-oriented cricket websites, but none of them was useful (except Cricinfo, my favorite cricket website).
  • Reaching out to my network: Last month I asked my friends for advice and received many emails with information and offers to help compile the data.
  • Reaching out to sports data companies: I contacted Opta Sports to buy the data. They have it, but it was too expensive for a personal experiment.
So I decided to collect the data myself by scraping cricket scorecards from the web. I first tried R libraries for the scraping but found them lacking, so I switched to Ruby, which has a great web-scraping library, Hpricot (thanks to Satty for getting me started and to Amit and Thomas for solving my newbie issues).

I’m happy to report that I now have a robust Ruby script that downloads all One Day International cricket data (3000+ matches) into three handy files:

1) Win-Loss data:

Match_ID | Team1 | Team2 | Winner | Margin | First.Innings.Total | Second.Innings.Total | Ground | Matchdate | Ground_Country | Ground_Latitude | Ground_Longitude | Series
ODI no. 1 | Sri Lanka | New Zealand | no result | | 203 | | Dambulla | Aug 19, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series
ODI no. 2 | Sri Lanka | India | Sri Lanka | 8 wickets | 103 | 104 | Dambulla | Aug 22, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series
ODI no. 3 | India | New Zealand | India | 105 runs | 223 | 118 | Dambulla | Aug 25, 2010 | Sri Lanka | 7.8566667 | 80.6491667 | Sri Lanka Triangular Series
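Scraped rows like these are worth sanity-checking. The following is a minimal Ruby sketch of one such check, not part of the scraping script itself: for a win by runs, the Margin should equal the gap between the two innings totals. The helper names `parse_margin` and `run_margin_consistent?` are hypothetical, and the check assumes an uninterrupted match (rain-adjusted targets would break it).

```ruby
# Parse a margin string such as "105 runs" or "8 wickets".
# Returns nil for anything else (for example an empty margin on a "no result").
def parse_margin(text)
  m = text.to_s.strip.match(/\A(\d+)\s+(runs?|wickets?)\z/)
  return nil unless m
  { value: m[1].to_i, unit: m[2].start_with?("run") ? :runs : :wickets }
end

# For a win by runs, compare the margin with the difference in innings totals.
# Returns true/false for run margins and nil when the check does not apply.
def run_margin_consistent?(margin_text, first_total, second_total)
  margin = parse_margin(margin_text)
  return nil unless margin && margin[:unit] == :runs
  margin[:value] == first_total - second_total
end

# Values taken from the sample rows above.
puts run_margin_consistent?("105 runs", 223, 118).inspect   # win by runs
puts run_margin_consistent?("8 wickets", 103, 104).inspect  # not applicable
```

Rows where the check returns false are candidates for a closer look rather than proof of a scraping error.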


2) Batting data:


Match_ID | Inning | Player | Country | Out | Runs | Minutes | Balls | Fours | Sixes | Scorerate
ODI no. 1 | 1 | V Sehwag | India | lbw b Kulasekara | 12 | | 12 | 2 | 0 | 100
ODI no. 1 | 1 | RG Sharma | India | lbw b Mathews | 11 | | 21 | 2 | 0 | 52.38
ODI no. 1 | 1 | Yuvraj Singh | India | lbw b Malinga | 38 | | 64 | 5 | 1 | 59.37
ODI no. 1 | 1 | SK Raina | India | c Sangakkara b Perera | 8 | | 16 | 1 | 0 | 50
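The Scorerate column can be cross-checked from Runs and Balls. Here is a minimal Ruby sketch, assuming Scorerate means runs per 100 balls faced and that values are truncated rather than rounded to two decimals. The truncation is an inference from the sample row of 38 runs off 64 balls, listed as 59.37 where the exact value is 59.375. The helper name `strike_rate` is mine, not taken from the scraping script.

```ruby
# Strike rate = runs scored per 100 balls faced.
# Assumption: truncate (do not round) to two decimals, inferred from the
# sample row 38 runs / 64 balls => 59.37.
def strike_rate(runs, balls)
  return nil if balls.zero?          # no balls faced: rate is undefined
  (runs * 10_000 / balls) / 100.0    # integer division performs the truncation
end

# Runs and balls from the sample batting rows.
sample = [
  ["V Sehwag", 12, 12],
  ["RG Sharma", 11, 21],
  ["Yuvraj Singh", 38, 64],
  ["SK Raina", 8, 16]
]

sample.each do |player, runs, balls|
  puts "#{player}: #{strike_rate(runs, balls)}"
end
```

Using integer arithmetic for the truncation avoids floating-point surprises when a value lands just below a two-decimal boundary.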


3) Bowling data:

Match_ID | Inning | Player | Country | Overs | Maidens | Runs | Wickets | Economy
ODI no. 1 | 1 | SL Malinga | Sri Lanka | 9 | 1 | 21 | 2 | 2.33
ODI no. 1 | 1 | KMDN Kulasekara | Sri Lanka | 9 | 2 | 31 | 2 | 3.44
ODI no. 1 | 1 | AD Mathews | Sri Lanka | 8 | 3 | 20 | 1 | 2.5
ODI no. 1 | 1 | NLTC Perera | Sri Lanka | 7.4 | 1 | 28 | 5 | 3.65
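One thing to watch in the bowling data is that Overs is not a decimal number: 7.4 means 7 overs and 4 balls. The Ruby sketch below converts that notation to a ball count and recomputes Economy as runs conceded per over. It assumes six-ball overs and rounding to two decimals (the sample rows fit either rounding or truncation). The helper names `overs_to_balls` and `economy` are hypothetical.

```ruby
# Convert cricket overs notation to balls bowled: "7.4" => 7 overs + 4 balls.
# Assumption: six balls per over.
def overs_to_balls(overs)
  whole, part = overs.to_s.split(".")
  whole.to_i * 6 + part.to_i   # part is nil for "9", and nil.to_i is 0
end

# Economy = runs conceded per six-ball over.
def economy(runs, overs)
  balls = overs_to_balls(overs)
  return nil if balls.zero?    # nothing bowled: economy is undefined
  (runs * 6.0 / balls).round(2)
end

# Runs and overs from the sample bowling rows.
sample = [
  ["SL Malinga", "9", 21],
  ["KMDN Kulasekara", "9", 31],
  ["AD Mathews", "8", 20],
  ["NLTC Perera", "7.4", 28]
]

sample.each do |player, overs, runs|
  puts "#{player}: #{economy(runs, overs)}"
end
```

Treating 7.4 as a plain decimal would give 28 / 7.4 ≈ 3.78 instead of the listed 3.65, which is why the conversion to balls comes first.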


This Ruby script takes about 40 minutes on a fast internet connection to collect the data. It took me about 40 hours to write and fine-tune it. Most of that time went into dealing with the typical data issues of web scraping and into making the script generic enough to handle Test and T20 cricket scorecards as well.
