Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
“To grasp how different a million is from a billion, think about it like this: A million seconds is a little under two weeks; a billion seconds is about thirty-two years.”
“One of the pleasures of looking at the world through mathematical eyes is that you can see certain patterns that would otherwise be hidden.”
Steven Strogatz, Prof at Cornell University
Introduction
Within the last two weeks, two of my friends introduced me to Benford’s Law. I first looked it up on Google and was quite intrigued by the law. Subsequently, another friend asked me to watch the ‘Digits’ episode of the “Connected” series on Netflix by Latif Nasser, which I strongly recommend you watch.
Benford’s Law, also called the Newcomb–Benford law, the law of anomalous numbers, or the First Digit Law, states that in many naturally occurring collections of numbers, the frequency of each digit in the first significant place follows a logarithmic distribution. For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30.1% of the time, the number 2 about 17.6%, the number 3 about 12.5%, and so on down to the number 9 at about 4.6%. This interesting logarithmic pattern is observed in many natural datasets, such as population densities, river lengths, heights of skyscrapers and tax returns. What is really curious about this law is that it holds regardless of the units of measurement: the lengths of rivers would obey the law whether we measure them in meters, feet or miles. There is something almost mystical about this law.
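The unit-independence claim can be checked numerically. The sketch below does not use real river data: it assumes synthetic lengths drawn log-uniformly over six orders of magnitude (a distribution that follows Benford’s Law by construction), and `first_digit()` and `digit_freq()` are hypothetical helper names introduced only for this illustration.

```r
set.seed(42)

# Leading significant digit of each non-zero number, read off its
# scientific-notation representation (e.g. "3.28e+02" -> 3)
first_digit <- function(x) {
  x <- abs(x[is.finite(x) & x != 0])
  as.integer(substr(sprintf("%.15e", x), 1, 1))
}

# Proportion of values whose leading digit is 1, 2, ..., 9
digit_freq <- function(x) {
  d <- first_digit(x)
  tabulate(d, nbins = 9) / length(d)
}

# Synthetic "river lengths" in meters (an assumption, not real data)
lengths_m <- 10^runif(100000, min = 0, max = 6)

comparison <- data.frame(
  digit   = 1:9,
  benford = log10(1 + 1 / (1:9)),
  meters  = digit_freq(lengths_m),
  feet    = digit_freq(lengths_m * 3.28084),
  miles   = digit_freq(lengths_m / 1609.344)
)
print(round(comparison, 3))
```

The three unit columns should each stay close to the `benford` column, since changing units only multiplies every value by a constant.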
The law has also been used widely to detect financial fraud, manipulation of tax statements, bots on Twitter, fake accounts in social networks, image manipulation etc. In this age of deep fakes, the ability to detect fake images will assume paramount importance. While deviations from Benford’s Law do not always signify fraud, to a large extent they point to an aberration. Prof Nigrini, of Cape Town, used this law to identify financial discrepancies in Enron’s financial statements, which resulted in the infamous scandal. Also, the 2009 Iranian election was suspected to be fraudulent, as the first-digit percentages did not conform to those specified by Benford’s Law.
While it cannot be said with absolute certainty, marked deviations from Benford’s law could indicate that natural processes have been manipulated. Possibly, Benford’s law could be used to detect large-scale match-fixing in cricket tournaments. However, we cannot look at this in isolation, and other statistical and forensic methods may be required to determine whether there is fraud. Here is an interesting paper: Promises and perils of Benford’s law.
A set of numbers is said to satisfy Benford’s law if the leading digit d (d ∈ {1, …, 9}) occurs with probability
$$P(d) = \log_{10}(d + 1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right)$$
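This law also works for numbers in other bases. In base b ≥ 2, the leading digit d (d ∈ {1, …, b − 1}) occurs with probability
$$P(d) = \log_{b}(d + 1) - \log_{b}(d) = \log_{b}\left(1 + \frac{1}{d}\right)$$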
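Benford’s Law assigns a leading digit d in base b the probability log_b(1 + 1/d), and this is easy to evaluate in base R. The snippet below is a minimal sketch; `benford_prob()` is a hypothetical helper name, not a function from any package.

```r
# Expected probability that the leading digit is d in the given base.
# benford_prob() is a hypothetical helper written for this illustration.
benford_prob <- function(d, base = 10) {
  stopifnot(base >= 2, all(d >= 1), all(d <= base - 1))
  log(1 + 1 / d, base = base)
}

p10 <- benford_prob(1:9)
print(round(100 * p10, 1))
# 30.1 17.6 12.5  9.7  7.9  6.7  5.8  5.1  4.6

# The probabilities over all possible leading digits add up to 1
print(sum(p10))

# The same formula in base 8, where the leading digit runs from 1 to 7
p8 <- benford_prob(1:7, base = 8)
print(round(100 * p8, 1))
```

The base 10 values reproduce the percentages quoted earlier for the digits 1, 2, 3 and 9.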
Interestingly, this law also applies to sports, for example to the number of points scored in basketball. I was curious to see if it applied to cricket. Using my R package yorkr, I had previously converted all T20 and ODI data from Cricsheet; the converted data is available at yorkrData2020. I wanted to check whether Benford’s Law holds for the runs scored, or the deliveries faced, by batsmen at a team level or at a tournament level (IPL, Intl. T20 or ODI).
Thankfully, R has a package, benford.analysis, that checks whether data behaves in accordance with Benford’s Law, and I have used this package in this post.
This post is also available in RPubs as Benford’s Law meets IPL, Intl. T20 and ODI
library(data.table)
library(reshape2)
library(dplyr)
library(benford.analysis)
library(yorkr)
In this post, I have checked a few datasets at random against Benford’s law. The fully converted dataset is available in yorkrData2020, which I have mentioned above. You can try this on any dataset, including ODI (men, women), Intl. T20 (men, women), IPL, BBL, PSL, NTB and WBB.
1. Check the runs distribution by Royal Challengers Bangalore
We can see that the behaviour is as expected under Benford’s law, with minor deviations.
load("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails/Royal Challengers Bangalore-BattingDetails.RData")
rcbRunsTrends = benford(battingDetails$runs, number.of.digits = 1, discrete = TRUE, sign = "positive")
rcbRunsTrends
##
## Benford object:
##
## Data: battingDetails$runs
## Number of observations used = 1205
## Number of obs. for second order = 99
## First digits analysed = 1
##
## Mantissa:
##
##    Statistic  Value
##         Mean  0.458
##          Var  0.091
##  Ex.Kurtosis -1.213
##     Skewness -0.025
##
##
## The 5 largest deviations:
##
##   digits absolute.diff
## 1      1         14.26
## 2      7         13.88
## 3      9          8.14
## 4      6          5.33
## 5      4          4.78
##
## Stats:
##
##  Pearson's Chi-squared test
##
## data:  battingDetails$runs
## X-squared = 5.2091, df = 8, p-value = 0.735
##
##
##  Mantissa Arc Test
##
## data:  battingDetails$runs
## L2 = 0.0022852, df = 2, p-value = 0.06369
##
## Mean Absolute Deviation (MAD): 0.004941381
## MAD Conformity - Nigrini (2012): Close conformity
## Distortion Factor: -18.8725
##
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
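To make the figures in the output above less of a black box, here is a base R sketch of how the absolute differences, the chi-squared statistic and the MAD can be computed from first-digit counts. It reflects my reading of these standard definitions rather than the package’s source code, and it runs on synthetic scores (an assumption) because the RData files are not bundled with this post.

```r
set.seed(1)

# Synthetic stand-in for battingDetails$runs (an assumption, not real data)
runs <- rgeom(2000, prob = 0.05)
runs <- runs[runs > 0]                 # keep positive values only

# Leading digit of each score
digits <- as.integer(substr(as.character(runs), 1, 1))

n          <- length(digits)
observed   <- tabulate(digits, nbins = 9)   # observed count per leading digit
expected_p <- log10(1 + 1 / (1:9))          # Benford proportions
expected   <- n * expected_p                # Benford counts

# Absolute difference between observed and expected counts, per digit
absolute_diff <- abs(observed - expected)

# Pearson's chi-squared statistic with 9 - 1 = 8 degrees of freedom
chi_squared <- sum((observed - expected)^2 / expected)
p_value     <- pchisq(chi_squared, df = 8, lower.tail = FALSE)

# Mean Absolute Deviation between observed and expected proportions
mad <- mean(abs(observed / n - expected_p))

print(data.frame(digit = 1:9, observed,
                 expected = round(expected, 2),
                 absolute_diff = round(absolute_diff, 2)))
cat("X-squared =", chi_squared, " p-value =", p_value, "\n")
cat("MAD =", mad, "\n")
```

The MAD is then compared against conformity bands attributed to Nigrini (2012); for first digits these are commonly quoted as up to 0.006 for close conformity, 0.006 to 0.012 for acceptable, 0.012 to 0.015 for marginally acceptable, and nonconformity above that, which is consistent with the labels printed in this post.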
1a. Plot trends
Note: The Digits Distribution plot is the plot of interest. The Second Order Digits Distribution test is relatively new; it is based on sorting the data and examining the digits of the differences between successive values. The test can be applied to any dataset, and nonconformity usually signals an unusual issue related to data integrity. Benford’s Law itself applies only to the first Digits Distribution plot. For a deeper analysis, the other plots, along with other statistical tests, may be required. There are other approaches to detecting anomalies; I would assume an easy way is to start with Benford’s law and progressively dig deeper.
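The idea behind the second-order test (sort the values, take the differences between neighbours, then look at the leading digits of those differences) can be sketched in a few lines of base R. This is an illustration of the idea on synthetic integer scores (an assumption), not the package’s implementation.

```r
set.seed(7)

# Synthetic integer scores standing in for runs (an assumption, not real data)
runs <- rgeom(2000, prob = 0.05)
runs <- runs[runs > 0]

# Sort, take successive differences, and keep only the non-zero gaps
sorted_runs <- sort(runs)
gaps <- diff(sorted_runs)
gaps <- gaps[gaps > 0]

# Leading-digit proportions of the gaps
gap_digits <- as.integer(substr(as.character(gaps), 1, 1))
gap_freq   <- tabulate(gap_digits, nbins = 9) / length(gap_digits)

print(length(gaps))        # number of observations left for the second-order test
print(round(gap_freq, 3))
```

With integer-valued data such as runs, ties collapse to zero and only one gap survives per distinct score, so far fewer observations remain than in the first-digit test. That would be consistent with the small “Number of obs. for second order” values reported by the package in this post.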
plot(rcbRunsTrends)
2. Check the ‘balls played’ distribution by Royal Challengers Bangalore
load("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails/Royal Challengers Bangalore-BattingDetails.RData")
rcbBallsPlayedTrends = benford(battingDetails$ballsPlayed, number.of.digits = 1, discrete = TRUE, sign = "positive")
plot(rcbBallsPlayedTrends)
3. Check the runs distribution by Chennai Super Kings
The trend seems to deviate from the expected behaviour to some extent in the counts for the leading digits 5 and 7.
load("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails/Chennai Super Kings-BattingDetails.RData")
cskRunsTrends = benford(battingDetails$runs, number.of.digits = 1, discrete = TRUE, sign = "positive")
cskRunsTrends
##
## Benford object:
##
## Data: battingDetails$runs
## Number of observations used = 1054
## Number of obs. for second order = 94
## First digits analysed = 1
##
## Mantissa:
##
##    Statistic  Value
##         Mean  0.466
##          Var  0.081
##  Ex.Kurtosis -1.100
##     Skewness -0.054
##
##
## The 5 largest deviations:
##
##   digits absolute.diff
## 1      5         27.54
## 2      2         18.40
## 3      1         17.29
## 4      9         14.23
## 5      7         14.12
##
## Stats:
##
##  Pearson's Chi-squared test
##
## data:  battingDetails$runs
## X-squared = 22.862, df = 8, p-value = 0.003545
##
##
##  Mantissa Arc Test
##
## data:  battingDetails$runs
## L2 = 0.002376, df = 2, p-value = 0.08173
##
## Mean Absolute Deviation (MAD): 0.01309597
## MAD Conformity - Nigrini (2012): Marginally acceptable conformity
## Distortion Factor: -17.90664
##
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
3a. Plot the trends
plot(cskRunsTrends)
3b. Check details of suspicious behaviour
Interestingly, the package benford.analysis has functions to get the details of the data that cause the deviation. The getSuspects() function returns the data that are ‘suspicious’. We probably need to take these aberrations with a pinch of salt. A look at the statistical distribution, along with other investigation, would be needed to determine the cause.
suspects <- getSuspects(cskRunsTrends, battingDetails)
suspects
##              batsman ballsPlayed fours sixes runs strikeRate        bowler
##   1:       JA Morkel          18     1     2   29     161.11        nobody
##   2:        MS Dhoni          31     2     0   23      74.19 Shahid Afridi
##   3:        MS Dhoni          22     1     1   22     100.00  Shoaib Ahmed
##   4:        SK Raina          19     1     2   25     131.58        nobody
##   5:        MS Dhoni          37     6     1   58     156.76        nobody
##  ---
## 311:        MS Dhoni          12     3     1   25     208.33        nobody
## 312:        SK Raina          43     5     2   54     125.58        nobody
## 313: Harbhajan Singh           8     0     0    2      25.00        nobody
## 314:        SK Raina          13     4     0   22     169.23        nobody
## 315:       AT Rayudu          21     2     0   25     119.05        nobody
##      wicketFielder        wicketKind wicketPlayerOut       date
##   1:        nobody            notOut          notOut 2008-05-06
##   2: Shahid Afridi            caught        MS Dhoni 2008-05-06
##   3:  Shoaib Ahmed            caught        MS Dhoni 2009-04-27
##   4:        nobody caught and bowled        SK Raina 2009-04-27
##   5:        nobody            notOut          notOut 2009-05-04
##  ---
## 311:        nobody            notOut          notOut 2018-04-22
## 312:        nobody            notOut          notOut 2018-04-22
## 313:        nobody            notOut          notOut 2018-05-22
## 314:        nobody            bowled        SK Raina 2018-05-22
## 315:        nobody            notOut          notOut 2019-04-17
##                                          venue          opposition
##   1:           MA Chidambaram Stadium, Chepauk     Deccan Chargers
##   2:           MA Chidambaram Stadium, Chepauk     Deccan Chargers
##   3:                                 Kingsmead     Deccan Chargers
##   4:                                 Kingsmead     Deccan Chargers
##   5:                              Buffalo Park     Deccan Chargers
##  ---
## 311: Rajiv Gandhi International Stadium, Uppal Sunrisers Hyderabad
## 312: Rajiv Gandhi International Stadium, Uppal Sunrisers Hyderabad
## 313:                          Wankhede Stadium Sunrisers Hyderabad
## 314:                          Wankhede Stadium Sunrisers Hyderabad
## 315: Rajiv Gandhi International Stadium, Uppal Sunrisers Hyderabad
##                   winner result
##   1:     Deccan Chargers     NA
##   2:     Deccan Chargers     NA
##   3:     Deccan Chargers     NA
##   4:     Deccan Chargers     NA
##   5: Chennai Super Kings     NA
##  ---
## 311: Chennai Super Kings     NA
## 312: Chennai Super Kings     NA
## 313: Chennai Super Kings     NA
## 314: Chennai Super Kings     NA
## 315: Sunrisers Hyderabad     NA
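Every score listed in the output above starts with 2 or 5, the two digits with the largest absolute differences for Chennai Super Kings. That suggests the selection can be pictured as “keep the rows whose leading digit belongs to the worst-fitting digit groups”. The base R sketch below mimics that idea on made-up records; the data, the names and the `how_many` variable are assumptions for illustration, and this is not the package’s own code.

```r
set.seed(3)

# Hypothetical batting records (names and scores are made up)
batting <- data.frame(
  batsman = sample(c("Batter A", "Batter B", "Batter C"), 500, replace = TRUE),
  runs    = rgeom(500, prob = 0.05)
)
batting <- batting[batting$runs > 0, ]

# Leading digit of each score
batting$digit <- as.integer(substr(as.character(batting$runs), 1, 1))

# Observed versus Benford-expected counts per leading digit
n             <- nrow(batting)
observed      <- tabulate(batting$digit, nbins = 9)
expected      <- n * log10(1 + 1 / (1:9))
absolute_diff <- abs(observed - expected)

# Pick the digits that deviate most and keep the rows that start with them
how_many     <- 2
worst_digits <- order(absolute_diff, decreasing = TRUE)[seq_len(how_many)]
suspects     <- batting[batting$digit %in% worst_digits, ]

print(worst_digits)
print(head(suspects))
```

Rows selected this way are simply members of an over- or under-represented digit group, which is why they should be read as pointers for further investigation rather than as evidence of anything on their own.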
4. Check runs distribution in all of Indian Premier League (IPL)
battingDF <- NULL
teams <- c("Chennai Super Kings", "Deccan Chargers", "Delhi Daredevils",
           "Kings XI Punjab", 'Kochi Tuskers Kerala', "Kolkata Knight Riders",
           "Mumbai Indians", "Pune Warriors", "Rajasthan Royals",
           "Royal Challengers Bangalore", "Sunrisers Hyderabad", "Gujarat Lions",
           "Rising Pune Supergiants")
setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails")
for (team in teams) {
  battingDetails <- NULL
  val <- paste(team, "-BattingDetails.RData", sep = "")
  print(val)
  tryCatch(load(val),
           error = function(e) {
             print("No data1")
             setNext = TRUE
           }
  )
  details <- battingDetails
  battingDF <- rbind(battingDF, details)
}
## [1] "Chennai Super Kings-BattingDetails.RData"
## [1] "Deccan Chargers-BattingDetails.RData"
## [1] "Delhi Daredevils-BattingDetails.RData"
## [1] "Kings XI Punjab-BattingDetails.RData"
## [1] "Kochi Tuskers Kerala-BattingDetails.RData"
## [1] "Kolkata Knight Riders-BattingDetails.RData"
## [1] "Mumbai Indians-BattingDetails.RData"
## [1] "Pune Warriors-BattingDetails.RData"
## [1] "Rajasthan Royals-BattingDetails.RData"
## [1] "Royal Challengers Bangalore-BattingDetails.RData"
## [1] "Sunrisers Hyderabad-BattingDetails.RData"
## [1] "Gujarat Lions-BattingDetails.RData"
## [1] "Rising Pune Supergiants-BattingDetails.RData"
trends = benford(battingDF$runs, number.of.digits = 1, discrete = TRUE, sign = "positive")
trends
##
## Benford object:
##
## Data: battingDF$runs
## Number of observations used = 10129
## Number of obs. for second order = 123
## First digits analysed = 1
##
## Mantissa:
##
##    Statistic   Value
##         Mean  0.4521
##          Var  0.0856
##  Ex.Kurtosis -1.1570
##     Skewness -0.0033
##
##
## The 5 largest deviations:
##
##   digits absolute.diff
## 1      2        159.37
## 2      9        121.48
## 3      7         93.40
## 4      8         83.12
## 5      1         61.87
##
## Stats:
##
##  Pearson's Chi-squared test
##
## data:  battingDF$runs
## X-squared = 78.166, df = 8, p-value = 1.143e-13
##
##
##  Mantissa Arc Test
##
## data:  battingDF$runs
## L2 = 5.8237e-05, df = 2, p-value = 0.5544
##
## Mean Absolute Deviation (MAD): 0.006627966
## MAD Conformity - Nigrini (2012): Acceptable conformity
## Distortion Factor: -20.90333
##
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
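The loading loop is repeated for each tournament, so it may be worth wrapping it in a function. The following is a sketch under stated assumptions: `load_batting_details()` is a hypothetical helper (not part of yorkr), it assumes each file contains an object named `battingDetails` as in the code above, and the demo uses tiny made-up files in a temporary directory instead of the real yorkrData2020 files.

```r
# Hypothetical helper: load "<team>-BattingDetails.RData" for every team found
# in `dir` and stack the battingDetails data frames into a single data frame.
load_batting_details <- function(teams, dir) {
  per_team <- lapply(teams, function(team) {
    path <- file.path(dir, paste0(team, "-BattingDetails.RData"))
    if (!file.exists(path)) {
      message("No data for ", team)
      return(NULL)
    }
    env <- new.env()
    load(path, envir = env)        # keep loaded objects out of the global env
    env$battingDetails
  })
  per_team <- Filter(Negate(is.null), per_team)   # drop teams without a file
  do.call(rbind, per_team)
}

# Self-contained demo with two tiny made-up files in a temporary directory
demo_dir <- tempfile("yorkr-demo-")
dir.create(demo_dir)
for (team in c("Team A", "Team B")) {
  battingDetails <- data.frame(batsman = paste(team, "batter", 1:3),
                               runs = c(12, 5, 48))
  save(battingDetails,
       file = file.path(demo_dir, paste0(team, "-BattingDetails.RData")))
}

# "Team C" has no file and is skipped with a message
battingDF <- load_batting_details(c("Team A", "Team B", "Team C"), demo_dir)
print(battingDF)
```

Because the function builds a fresh data frame on every call and needs no `setwd()`, rows from one tournament cannot leak into the analysis of another.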
4a. Plot trends for all of IPL
We can see that the trend follows Benford’s curve quite closely for all of IPL.
plot(trends)
5. Check Benford’s law in India matches
setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/t20/t20BattingBowlingDetails")
load("India-BattingDetails.RData")
indiaTrends = benford(battingDetails$runs, number.of.digits = 1, discrete = TRUE, sign = "positive")
plot(indiaTrends)
6. Check Benford’s law in all of Intl. T20
setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/t20/t20BattingBowlingDetails")
teams <- c("Australia", "India", "Pakistan", "West Indies", 'Sri Lanka',
           "England", "Bangladesh", "Netherlands", "Scotland", "Afghanistan",
           "Zimbabwe", "Ireland", "New Zealand", "South Africa", "Canada",
           "Bermuda", "Kenya", "Hong Kong", "Nepal", "Oman", "Papua New Guinea",
           "United Arab Emirates", "Namibia", "Cayman Islands", "Singapore",
           "United States of America", "Bhutan", "Maldives", "Botswana", "Nigeria",
           "Denmark", "Germany", "Jersey", "Norway", "Qatar", "Malaysia", "Vanuatu",
           "Thailand")
# Caution: battingDF is not reset to NULL before this loop. If section 4 was run
# in the same session, its IPL rows are still in battingDF and are carried over.
for (team in teams) {
  battingDetails <- NULL
  val <- paste(team, "-BattingDetails.RData", sep = "")
  print(val)
  tryCatch(load(val),
           error = function(e) {
             print("No data1")
             setNext = TRUE
           }
  )
  details <- battingDetails
  battingDF <- rbind(battingDF, details)
}
intlT20Trends = benford(battingDF$runs, number.of.digits = 1, discrete = TRUE, sign = "positive")
intlT20Trends
##
## Benford object:
##
## Data: battingDF$runs
## Number of observations used = 21833
## Number of obs. for second order = 131
## First digits analysed = 1
##
## Mantissa:
##
##    Statistic  Value
##         Mean  0.447
##          Var  0.085
##  Ex.Kurtosis -1.158
##     Skewness  0.018
##
##
## The 5 largest deviations:
##
##   digits absolute.diff
## 1      2        361.40
## 2      9        276.02
## 3      1        264.61
## 4      7        210.14
## 5      8        198.81
##
## Stats:
##
##  Pearson's Chi-squared test
##
## data:  battingDF$runs
## X-squared = 202.29, df = 8, p-value < 2.2e-16
##
##
##  Mantissa Arc Test
##
## data:  battingDF$runs
## L2 = 5.3983e-06, df = 2, p-value = 0.8888
##
## Mean Absolute Deviation (MAD): 0.007821098
## MAD Conformity - Nigrini (2012): Acceptable conformity
## Distortion Factor: -24.11086
##
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
6a. Plot trends
plot(intlT20Trends)
7. Check Benford’s law in ODI
This plot also nicely follows Benford’s predicted curve.
setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/odi/odiBattingBowlingDetails")
teams <- c("Australia", "India", "Pakistan", "West Indies", 'Sri Lanka',
           "England", "Bangladesh", "Netherlands", "Scotland", "Afghanistan",
           "Zimbabwe", "Ireland", "New Zealand", "South Africa", "Canada",
           "Bermuda", "Kenya", "Hong Kong", "Nepal", "Oman", "Papua New Guinea",
           "United Arab Emirates", "Namibia", "Cayman Islands", "Singapore",
           "United States of America", "Bhutan", "Maldives", "Botswana", "Nigeria",
           "Denmark", "Germany", "Jersey", "Norway", "Qatar", "Malaysia", "Vanuatu",
           "Thailand")
battingDF <- NULL
for (team in teams) {
  battingDetails <- NULL
  val <- paste(team, "-BattingDetails.RData", sep = "")
  print(val)
  tryCatch(load(val),
           error = function(e) {
             print("No data1")
             setNext = TRUE
           }
  )
  details <- battingDetails
  battingDF <- rbind(battingDF, details)
}
odiTrends = benford(battingDF$runs, number.of.digits = 1, discrete = TRUE, sign = "positive")
odiTrends
##
## Benford object:
##
## Data: battingDF$runs
## Number of observations used = 23766
## Number of obs. for second order = 179
## First digits analysed = 1
##
## Mantissa:
##
##    Statistic  Value
##         Mean  0.468
##          Var  0.089
##  Ex.Kurtosis -1.204
##     Skewness -0.069
##
##
## The 5 largest deviations:
##
##   digits absolute.diff
## 1      5        240.18
## 2      4        190.84
## 3      9        177.47
## 4      8        157.69
## 5      1         66.28
##
## Stats:
##
##  Pearson's Chi-squared test
##
## data:  battingDF$runs
## X-squared = 100.07, df = 8, p-value < 2.2e-16
##
##
##  Mantissa Arc Test
##
## data:  battingDF$runs
## L2 = 0.002845, df = 2, p-value < 2.2e-16
##
## Mean Absolute Deviation (MAD): 0.004609365
## MAD Conformity - Nigrini (2012): Close conformity
## Distortion Factor: -14.92332
##
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
Plot trends
plot(odiTrends)
The data for other formats is available at yorkrData2020. Feel free to try it out yourself.
Conclusion
Maths rules our lives, more than we are aware, more than we like to admit. It is there in all of nature. Whether it is the recursive patterns of Mandelbrot sets, the intrinsic notion of beauty through the golden ratio, the murmuration of starlings, the synchronous blinking of fireflies or the near universality of Benford’s law on natural datasets, mathematics governs us.
Isn’t it strange that, while we humans pride ourselves on free will, the runs scored by batsmen in particular formats conform to Benford’s rule for the first digits? It almost looks as if the runs that will be scored are, to some extent, predetermined to fall within specified ranges obeying Benford’s law. So much for choice.
Something to be pondered over!
Also see
- Introducing GooglyPlusPlus!!!
- Deconstructing Convolutional Neural Networks with Tensorflow and Keras
- Going deeper into IBM’s Quantum Experience!
- Experiments with deblurring using OpenCV
- Big Data 6: The T20 Dance of Apache NiFi and yorkpy
- Deep Learning from first principles in Python, R and Octave – Part 4
- Practical Machine Learning with R and Python – Part 4
- Re-introducing cricketr! : An R package to analyze performances of cricketers
- Bull in a china shop – Behind the scenes in Android