Benford’s Law… Is it magic?
Gaetan “Guy” Lion
July 2010
What is the probability that the population figure of any country starts with a given first digit: 1, 2, 3, 4, 5, 6, 7, 8, or 9?
One would guess that each first digit is equally likely, with a probability of 1/9 = 11.1%...
[Chart: Countries population. Frequency of first digit. X axis: First digit (1 to 9). Y axis: Frequency.]
… The Correct Answer
First digit   Frequency
1             28.4%
2             14.9%
3             13.5%
4             9.9%
5             9.0%
6             9.0%
7             5.4%
8             6.3%
9             3.6%
Total         100.0%
[Chart: Countries population. Frequency of first digit. X axis: First digit (1 to 9). Y axis: Frequency. Series: Actual, Speculated.]
Countries’ populations follow Benford’s Law
Chi Square P value for the hypothesis that the two distributions are the same: 0.8
[Chart: Countries population. Frequency of first digit. X axis: First digit (1 to 9). Y axis: Frequency. Series: Population, Benford.]
Benford’s Law
Benford’s Law states that in lists of numbers from many real-world data sources, the frequency of the first digit is given by this equation: Log(1+1/First Digit), where Log is the base-10 logarithm. This results in the frequency distribution shown below, which differs from a uniform distribution.
Benford's Law distribution: LOG(1+1/First Digit)
First Digit   Frequency
1             30.1%
2             17.6%
3             12.5%
4             9.7%
5             7.9%
6             6.7%
7             5.8%
8             5.1%
9             4.6%
Total         100.0%
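As a minimal sketch of the formula above, the Python snippet below evaluates Log(1+1/First Digit) with a base-10 logarithm for each first digit. The function name benford_first_digit is only illustrative and does not come from the original slides.

```python
import math

def benford_first_digit(d):
    """Expected share of numbers whose first digit is d (1 to 9)."""
    return math.log10(1 + 1 / d)

# Expected frequency for each first digit 1..9.
benford = {d: benford_first_digit(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}  {p:6.1%}")

# The nine terms telescope to log10(10), so the shares add up to 1.
print(f"Total {sum(benford.values()):.1%}")
```

The shares add up to exactly 100% because the sum of log10((d+1)/d) for d = 1 to 9 collapses to log10(10).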
[Chart: Benford's Law vs Uniform distribution. X axis: First Digit (1 to 9). Y axis: Frequency. Series: Benford, Uniform.]
When does this law work?
The law works when the data crosses at least one scale (order of magnitude), as shown below:
Scale     Range
Scale 1   1 to 9
Scale 2   10 to 99
Scale 3   100 to 999
Scale 4   1,000 to 9,999
Etc…      Etc…
Preferably, the sample should contain more than 100 observations.
Demographic data follows Benford’s Law very closely
The U.S. has over 3,000 counties. All the demographic measures shown follow Benford’s Law pretty closely. Such a large sample makes the Chi Square goodness-of-fit test very (if not excessively) rigorous.
U.S. Census 2000 of counties population
First digit         Benford  Population  Births  Deaths  Natural increase  International migration  Domestic migration  Net migration
1                   30.1%    30.9%       30.3%   28.6%   31.2%             35.2%                    29.8%               29.8%
2                   17.6%    17.9%       16.5%   16.7%   17.6%             18.7%                    17.8%               18.4%
3                   12.5%    12.6%       13.7%   13.0%   12.1%             13.0%                    13.0%               13.0%
4                   9.7%     9.8%        10.4%   10.0%   9.2%              8.2%                     9.6%                9.5%
5                   7.9%     6.7%        7.8%    9.0%    7.6%              6.6%                     8.3%                7.9%
6                   6.7%     6.6%        6.3%    7.6%    6.5%              5.9%                     6.7%                7.2%
7                   5.8%     5.4%        5.8%    5.6%    6.5%              5.9%                     6.5%                5.5%
8                   5.1%     5.5%        4.6%    4.9%    4.8%              3.7%                     4.4%                4.5%
9                   4.6%     4.6%        4.6%    4.6%    4.5%              2.9%                     4.0%                4.1%
Chi Square P value           0.37        0.24    0.08    0.41              0.00                     0.30                0.48
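To illustrate why a very large sample makes the test so demanding, the sketch below runs a chi-square goodness-of-fit test on the International migration percentages from the table, under two assumed sample sizes. The sizes 100 and 3,000 are illustrative assumptions (the slide only says "over 3,000 counties"), and chi2_sf_8df is a hypothetical helper that uses the closed-form tail of a chi-square distribution with 8 degrees of freedom (9 digits minus 1).

```python
import math

# First-digit shares for International migration, copied from the table above.
observed_share = [0.352, 0.187, 0.130, 0.082, 0.066, 0.059, 0.059, 0.037, 0.029]
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

def chi2_sf_8df(x):
    """P(chi-square with 8 degrees of freedom > x); closed form valid for even df."""
    h = x / 2
    return math.exp(-h) * (1 + h + h**2 / 2 + h**3 / 6)

def p_value(n):
    """Goodness-of-fit p-value if the same shares came from n observations."""
    # With shares instead of counts, the statistic is n * sum((o - e)^2 / e).
    stat = n * sum((o - e) ** 2 / e for o, e in zip(observed_share, benford))
    return chi2_sf_8df(stat)

p_small = p_value(100)    # assumed small sample
p_large = p_value(3000)   # assumed sample close to the number of counties
print(f"n = 100:  p = {p_small:.2f}")
print(f"n = 3000: p = {p_large:.2g}")
```

The same percentage gaps give a comfortable p-value with the small sample and a p-value near zero with the large one, because the test statistic grows in proportion to the sample size.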
NYSE Stocks volume
This captures the first-digit frequency of the trading volume of over 2,000 NYSE stocks on June 21st. The fit is excellent both visually and statistically.
First Digit   Benford   NYSE Volume
1             30.1%     30.8%
2             17.6%     16.4%
3             12.5%     13.6%
4             9.7%      9.8%
5             7.9%      8.0%
6             6.7%      6.4%
7             5.8%      5.6%
8             5.1%      5.2%
9             4.6%      4.2%
Chi Square P value      0.73
[Chart: NYSE Stocks' Volume on June 21. X axis: First digit (1 to 9). Y axis: First digit frequency. Series: Benford, Volume.]
PG&E SmartMeter test
First Digit   Benford   Analog   SmartMeter
1             30.1%     33.0%    33.0%
2             17.6%     22.0%    22.0%
3             12.5%     12.1%    12.1%
4             9.7%      9.9%     9.9%
5             7.9%      4.4%     4.4%
6             6.7%      5.5%     5.5%
7             5.8%      5.5%     5.5%
8             5.1%      3.3%     3.3%
9             4.6%      4.4%     4.4%
Chi Square P value      0.90     0.90
This captures 91 observations between April and July 2010 of analog vs SmartMeter kWh consumption readings. Both the visual and statistical fit are pretty good.
[Chart: Benford vs PG&E kWh meters. X axis: First digit (1 to 9). Y axis: Frequency of first digit. Series: Benford, PG&E.]
Tennis pros ATP points
[Chart: ATP points. X axis: First Digit (1 to 9). Series: Benford, ATP.]
The ATP point totals of the first 1,600 professional tennis players closely follow Benford’s Law. Because of the large sample, the associated P value is small.
Even when it is not supposed to work… It kind of does.
I investigated Bernie Madoff’s monthly returns vs those of his closest competitor (GATEX). Although those data sets were not suited to Benford’s Law, the visual fit was surprisingly good.
[Chart: Benford's Law test: Madoff vs GATEX and S&P 500 distribution of monthly returns first digit. X axis: First Digit of monthly return (1 to 9). Y axis: Frequency. Series: Benford, Madoff, GATEX.]
Is Benford’s Law magic?
[Illustration: Bacteria > …]
No, a simple rule is that there are more small things than large things in the universe…
… a simple explanation…
The general principle is that there are more small observations than large ones. There are probably nearly twice as many 1s as 2s, three times as many 1s as 3s, and so on. Applying this principle throughout gives a frequency distribution that is close to Benford’s Law.
[Chart: First Digit frequency, Benford's Law vs Simple rule. X axis: First digit (1 to 9). Y axis: First digit frequency. Series: Benford, Simple.]
         Benford                    Simple rule
Digit    log(1+1/d)   Simple rule   proportion 1/d
1        30.1%        35.3%         1.00
2        17.6%        17.7%         0.50
3        12.5%        11.8%         0.33
4        9.7%         8.8%          0.25
5        7.9%         7.1%          0.20
6        6.7%         5.9%          0.17
7        5.8%         5.0%          0.14
8        5.1%         4.4%          0.13
9        4.6%         3.9%          0.11
Total                               2.83
A sample of more than 1,000 observations would be needed to conclude, at the 0.05 significance level, that those two distributions are different.
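The Simple rule column can be reproduced with a short sketch: weight each digit by 1/d, then divide by the sum of the weights (about 2.83) so the shares add up to 100%. The variable names below are illustrative only.

```python
import math

digits = range(1, 10)

# Simple rule: weight each digit by 1/d, then normalise so the shares sum to 1.
weights = {d: 1 / d for d in digits}
total_weight = sum(weights.values())   # about 2.83, the total shown in the table
simple = {d: w / total_weight for d, w in weights.items()}

# Benford's Law for comparison.
benford = {d: math.log10(1 + 1 / d) for d in digits}

for d in digits:
    print(f"{d}  Benford {benford[d]:6.1%}  Simple {simple[d]:6.1%}")
```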
Extending Benford’s Law beyond first digit
Benford’s Law is not limited to the first digit. You can use as many leading digits as you want with the formula: Log(1+1/Digits). For instance, the frequency of numbers that start with 367 is Log(1+1/367) = 0.12%.
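A minimal sketch of the multi-digit formula, assuming a base-10 logarithm; the function name benford_leading is hypothetical.

```python
import math

def benford_leading(digits):
    """Expected share of numbers whose leading digits equal `digits`
    (a positive integer such as 3, 36 or 367)."""
    return math.log10(1 + 1 / digits)

print(f"{benford_leading(367):.2%}")   # 0.12%

# Sanity check: the shares of all two-digit starts (10 to 99) add up to 1.
two_digit_total = sum(benford_leading(k) for k in range(10, 100))
print(f"{two_digit_total:.4f}")
```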
Benford vs Simple rule for first two digits
When dealing with first two digits (10 – 99), Benford’s Law and the Simple Rule have indistinguishable distributions. You would need samples > 700,000 to reach statistical significance at the 0.05 level that the two distributions are different.
[Chart: 1st two Digits distribution, Benford's Law vs Simple rule. X axis: First two digits (10 to 98). Y axis: First two digits frequency. Series: Benford, Simple.]
Time series growing by 2% per period
A time series growing by 2% per period over 116 periods replicates Benford’s Law frequency distribution almost exactly. This makes sense. Going from 1 to 2 is a 100% increase, whereas going from 2 to 3 is only a 50% increase, and so on. With a constant growth rate, the series therefore spends many more periods on values starting with 1 than on any other digit.
First digit frequencies
              Benford    Actual
First digit   Expected   Observed
1             30.1%      30.2%
2             17.6%      17.2%
3             12.5%      12.9%
4             9.7%       9.5%
5             7.9%       7.8%
6             6.7%       6.9%
7             5.8%       6.0%
8             5.1%       4.3%
9             4.6%       5.2%
Chi Square               1.00
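The growth experiment can be sketched in a few lines. The 2% growth rate and the 116 periods come from the slide; the starting value of 1 is an assumption, since the slide does not state it. Under that assumption the shares come out close to the Observed column.

```python
from collections import Counter

GROWTH = 0.02    # 2% growth per period, as on the slide
PERIODS = 116    # number of periods, as on the slide
START = 1.0      # assumption: the slide does not state the starting value

def first_digit(x):
    """First significant digit of a positive number, read off its scientific notation."""
    return int(f"{x:.10e}"[0])

# Value of the series at the end of each period 1..116.
series = [START * (1 + GROWTH) ** k for k in range(1, PERIODS + 1)]
counts = Counter(first_digit(v) for v in series)

for d in range(1, 10):
    print(f"{d}  {counts[d] / PERIODS:6.1%}")
```

116 periods of 2% growth multiply the starting value by roughly 10, so the series sweeps almost exactly one order of magnitude.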
[Chart: Difference between one digit and the next. X axis: First Digit (1 to 9). Data labels, from 1 to 2 through 8 to 9: 100%, 50%, 33%, 25%, 20%, 17%, 14%, 12%.]
Math properties of Benford’s Law
• Scale invariance: if a set of numbers closely follows Benford’s Law (BL), multiplying the numbers by any nonzero constant creates another set of numbers that also follows Benford’s Law. See the “Ones Scaling Test” on the next slide.
• Base invariance: if a set of numbers follows BL, expressing those numbers in a different number base (base 8, base 16, etc…) creates another set of numbers that also follows BL, with the logarithm taken in the new base.
The Ones Scaling Test
Looking at tax return numbers that followed BL closely, someone used the Ones Scaling Test to see if the share of numbers starting with “1” would remain the same after multiplying by a constant. In this case, they multiplied the set of numbers by 1.01 and did that 696 times. This corresponds to multiplying the numbers progressively up to a factor of about 1,000, as 1.01^696 ≈ 1,000.
As shown, across all iterations the share of 1s remained very stable around the BL predicted level of 30.1%.
Source: “The Scientist and Engineer’s Guide to Digital Signal Processing.” Steve Smith, PhD.
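The test can be sketched on synthetic data. The tax return numbers are not available here, so the snippet assumes a stand-in data set that is log-uniform over six orders of magnitude (which follows BL closely), and it works on base-10 logs so that multiplying by 1.01 becomes adding a constant. All names are illustrative.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumption: synthetic stand-in data, log-uniform over six orders of magnitude.
logs = [random.uniform(0, 6) for _ in range(5000)]   # log10 of each number

STEP = math.log10(1.01)   # multiplying by 1.01 adds this to every log10
LOG2 = math.log10(2)

def share_of_ones(shift):
    """Share of numbers whose first digit is 1 after multiplying by 10**shift."""
    # The first digit is 1 exactly when the fractional part of log10 is below log10(2).
    return sum(((x + shift) % 1.0) < LOG2 for x in logs) / len(logs)

# Share of 1s after 0, 1, 2, ..., 696 successive multiplications by 1.01.
shares = [share_of_ones(k * STEP) for k in range(697)]
print(f"min {min(shares):.3f}  max {max(shares):.3f}")
```

For data that follows BL, the minimum and maximum should both stay near 30.1%; a data set that does not follow BL would typically show the share of 1s drifting as the multiplier changes.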
What can we do with Benford’s Law?
Quite a bit, it turns out!
A few Benford’s Law applications…
• Investigating political elections integrity;
• Checking tax returns for fraud;
• Uncovering accounting fraud;
• Detecting false insurance claims.
Iran Election
Mahmoud Ahmadinejad's vote totals have more '2s' and fewer '1s' than expected. Roukema speculates Iranian officials replaced 1s by 2s. So, for instance, in some town where he received 1,954 votes, they would report his having received 2,954 votes.
Source: Nate Silver. fivethirtyeight.com
Franken Vote count
“…This hugely violates Benford's Law -- there are not nearly enough totals beginning in 1 and too many beginning in numbers like 5, 6 and 7. The odds of these anomalies having occurred by chance are greater than a quadrillion to one against… the reason this pattern emerges is because precinct sizes in Minnesota are not truly random. There is a large number of precincts in Minnesota that are designed to serve between 1,000 and 2,000 voters; since Franken won about 42 percent of the votes statewide, this leads to a relatively high number of instances where his vote totals are in the high single digits (672, 704, 588, etc.)”
Source: Nate Silver. fivethirtyeight.com
Inspector Clouseau demonstrates how to run a fraud investigation
Detecting fraud (an example). Step 1
              Checks: 483          Checks: 522
First Digit   Benford   09 Q4      Benford   10 Q1
1             145       155        157       146
2             85        76         92        78
3             60        57         65        67
4             47        51         51        52
5             38        36         41        40
6             32        27         35        60
7             28        30         30        28
8             25        27         27        25
9             22        25         24        26
Total         483       483        522       522
Chi Square              0.84                 0.06
A company issued 483 checks in 2009 Q4; these were audited and everything checked out. It also issued 522 checks in 2010 Q1. A fraud investigator notes that the 09 Q4 pattern fits Benford’s Law very closely (P value 0.84), and that the fit deteriorated in 2010 Q1 (P value 0.06).
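As a rough sketch of this comparison, the snippet below runs a chi-square goodness-of-fit test on the first-digit counts from the table. The helper benford_p_value is hypothetical, and it computes expected counts directly from the Benford formula, so the p-values it returns need not match the slide's figures, which may rest on a different tool or setup.

```python
import math

# First-digit counts of the checks, copied from the table above.
q4_2009 = [155, 76, 57, 51, 36, 27, 30, 27, 25]
q1_2010 = [146, 78, 67, 52, 40, 60, 28, 25, 26]

def benford_p_value(counts):
    """Chi-square goodness-of-fit p-value of first-digit counts against Benford's Law."""
    n = sum(counts)
    expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]
    stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    h = stat / 2
    # Closed-form tail of a chi-square with 8 degrees of freedom (9 digits minus 1).
    return math.exp(-h) * (1 + h + h**2 / 2 + h**3 / 6)

p_q4 = benford_p_value(q4_2009)
p_q1 = benford_p_value(q1_2010)
print(f"2009 Q4: p = {p_q4:.2f}")
print(f"2010 Q1: p = {p_q1:.3f}")
```

The point of the comparison is the direction of the change: the audited quarter fits Benford’s Law comfortably, while the following quarter fits much worse.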
Step 2. Focus on the difference
[Chart: Benford vs 2010 Q1. X axis: first digit (1 to 9). Series: Benford, 10 Q1.]
As shown, the company issued many more checks starting with the digit ‘6’ than expected (60 vs 35 under BL).
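One way to locate the digit that drives the gap, sketched below, is to look at each digit's contribution to the chi-square statistic. The counts are those of the Step 1 table; ranking digits by contribution is an illustrative choice, not necessarily the method used on the slide.

```python
# Expected (Benford) and actual 2010 Q1 check counts by first digit, from the Step 1 table.
expected = [157, 92, 65, 51, 41, 35, 30, 27, 24]
actual = [146, 78, 67, 52, 40, 60, 28, 25, 26]

# Each digit's contribution to the chi-square statistic: (actual - expected)^2 / expected.
contribution = {
    d: (a - e) ** 2 / e
    for d, (e, a) in enumerate(zip(expected, actual), start=1)
}

worst_digit = max(contribution, key=contribution.get)
for d, c in contribution.items():
    flag = "  <-- largest gap" if d == worst_digit else ""
    print(f"{d}  actual {actual[d - 1]:3d}  expected {expected[d - 1]:3d}  contribution {c:5.2f}{flag}")
```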
Step 3. Focus on the 6s: first two digits
We have 28 checks out of 522 starting with the two digits 66 vs 3.4 expected per Benford’s Law. This calls for further investigation.
Checks: 522 (# of checks)
First 2 dig.   Benford   10 Q1
60             3.7       3
61             3.7       4
62             3.6       5
63             3.6       2
64             3.5       5
65             3.5       4
66             3.4       28
67             3.4       4
68             3.3       3
69             3.3       2
Total          35        60
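The drill-down can be sketched by computing the expected count for each two-digit start, 522 × Log(1+1/Digits), and comparing it with the actual count. The ratio threshold of 3 is a hypothetical cut-off chosen for illustration only.

```python
import math

N_CHECKS = 522   # total checks in 2010 Q1

# Actual number of checks by first two digits, copied from the table above.
actual = {60: 3, 61: 4, 62: 5, 63: 2, 64: 5, 65: 4, 66: 28, 67: 4, 68: 3, 69: 2}

RATIO_THRESHOLD = 3.0   # hypothetical cut-off for "worth a closer look"

def expected_count(leading, n):
    """Expected number of values starting with `leading` under Benford's Law."""
    return n * math.log10(1 + 1 / leading)

flagged = []
for leading, count in actual.items():
    exp = expected_count(leading, N_CHECKS)
    ratio = count / exp
    if ratio > RATIO_THRESHOLD:
        flagged.append(leading)
    print(f"{leading}  expected {exp:4.1f}  actual {count:2d}  ratio {ratio:4.1f}")

print("Flagged:", flagged)
```

The same function works for the three-digit step by passing three-digit starts such as 666 or 668.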
[Chart: 1st two Digits distribution, Benford's Law vs Simple rule. X axis: First two digits (10 to 98). Y axis: First two digits frequency. Series: Benford, Simple.]
Step 4. Focus on the 66s: first three digits
Carrying this analysis to the first three digits, we see an unusual number of checks starting with ‘666’ and ‘668’. Later, we find that the checks starting with ‘666’ were legitimate ones that four employees wrote to pay for a monthly service that cost $5.95 per month plus tax, or $6.66 with tax. Meanwhile, 9 of the 10 checks starting with ‘668’ were fraudulent.
# of checks
First 3 dig.   Benford   10 Q1
660            0.3       1
661            0.3       1
662            0.3       1
663            0.3       0
664            0.3       0
665            0.3       1
666            0.3       12
667            0.3       0
668            0.3       10
669            0.3       2
Total          3.4       28
Replicating Clouseau’s success
• The NY District Attorney’s Office applied the same methodology to uncover 103 checks out of 784 that were not authentic;
• The State of Arizona uncovered a $2 million check fraud in 1993;
• The State of North Carolina uncovered a $4.8 million procurement fraud over 2002 – 2005.
The Key
• Benford’s Law helps you find “the needle in the haystack” within the data;
• This does not mean all anomalies are fraudulent, but it helps in finding the ones that are.