Selection of bin width and bin number in histograms


Introduction

We often face problems in which the density of the variable of interest is unknown. One popular method of estimating the unknown density is the histogram estimator.

Often the decision on bin number or bin width in a histogram is made arbitrarily or subjectively, but it need not be. Here we review the statistical procedures that have been proposed in the literature for choosing the optimum bin width and bin number.

Aspects of the Study

We shall review the methods prevalent in the statistical literature for determining the optimal number of bins and the bin width in a histogram.

We shall also try to present a comparative analysis so as to determine which methods are more efficient.

The measure we use to compare the various methods of optimal binning is sup_x |ĥ(x) − f(x)|, where ĥ(x) is the histogram density estimator at x and f(x) is the true density at the point x.

Proposed methods of interest for optimal binning

• Sturges’ rule and Doane’s modification

• Scott’s rule and the Freedman-Diaconis modification

• Bayesian optimal binning

• Optimal binning by Hellinger risk minimization

• Penalized maximum log-likelihood method with penalty A and the Hogg penalty

• Stochastic complexity or Kolmogorov complexity method

Sturges’ Rule

Suppose one constructs a frequency distribution with k bins, each of width 1 and centered on the points i = 0, 1, ..., k−1, and chooses the bin count of the i-th bin to be the binomial coefficient $\binom{k-1}{i}$. As k increases, this ideal frequency histogram assumes the shape of a normal density with mean (k−1)/2 and variance (k−1)/4.

According to Sturges' rule, the total sample size for this ideal histogram is

$$n = \sum_{i=0}^{k-1} \binom{k-1}{i} = 2^{k-1},$$

where k is the number of bins to be used. Solving for k gives the optimum number of bins,

$$k = 1 + \log_2 n.$$

We split the sample range into k such bins of equal length, so Sturges' rule gives us a regular histogram.
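A minimal Python sketch of the rule, assuming the standard form k = 1 + log2(n). The function names are illustrative, and the value is returned unrounded because the rounding convention is not fixed here.

```python
import math

def ideal_bin_counts(k):
    """Binomial bin counts C(k-1, i), i = 0..k-1, of the ideal histogram."""
    return [math.comb(k - 1, i) for i in range(k)]

def sturges_bins(n):
    """Sturges' number of bins, 1 + log2(n), left unrounded."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 + math.log2(n)

counts = ideal_bin_counts(5)
print(counts, sum(counts))   # the counts add up to 2**(k-1)
print(sturges_bins(1000))    # round up or down as preferred
```

In practice the result is rounded to a whole number of bins; whether to round up or down is a choice left to the user.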

Conceptual Fallacy of the Sturges rule

• There is a conceptual fallacy in the derivation of Sturges’ rule. Instead of choosing n = 2^(k−1), one could have chosen any n, with the individual cell frequencies scaled to match.

• m(i), the number of observations in the i-th cell, could equally well have been taken to be

$m(i) = n \binom{k-1}{i} / 2^{k-1}$, which sums to n for any choice of n and k.

• So, intuitively, there is no reason for choosing this particular n given the motivation we employ in Sturges’ rule.

Doane’s law

For skewed or kurtotic distributions, additional bins may be required. Doane proposed increasing the number of bins by log2(1 +ŷ ) where ŷ is the standardized skewness coefficient.
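The correction can be sketched in Python as below. The source does not spell out how the skewness is standardized. This sketch assumes the usual form of Doane's formula: the absolute sample skewness g1 divided by its standard error sqrt(6(n−2)/((n+1)(n+3))). The function name is illustrative.

```python
import math

def doane_extra_bins(data):
    """Extra bins log2(1 + |g1| / sigma_g1) added to Sturges' count.

    Assumption: g1 is the moment-based sample skewness and sigma_g1 is
    its standard error under normality.
    """
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    if m2 == 0:
        return 0.0  # constant data: no skewness correction
    g1 = m3 / m2 ** 1.5
    sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return math.log2(1 + abs(g1) / sigma_g1)

print(doane_extra_bins([1, 2, 3, 4, 5]))    # symmetric data
print(doane_extra_bins([1, 1, 1, 2, 10]))   # right-skewed data
```

For symmetric data the correction vanishes and the bin count reduces to Sturges' rule.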

Scott’s rule and Freedman-Diaconis modification

• We get an optimum bin width by minimizing the asymptotic expected L2 norm of the estimation error. The histogram estimator is given by

• ĥ(x) = V_k/(nh), where h is the bin width, n is the total number of observations, and V_k is the number of observations lying in the k-th bin.

• The optimum bin width is given by

$h^*(x) = \left[ \frac{f(x_k)}{2 \gamma^2 n} \right]^{1/3}$, where x_k is some point lying in the k-th bin and γ is the Lipschitz continuity factor.

For the normal density, we observe that $h^* = 3.5\, \mathrm{sd}(x)\, n^{-1/3}$ in the regular case.

The Freedman-Diaconis modification for non-normal data is given by $h^* = 2\, (\mathrm{IQ})\, n^{-1/3}$, where IQ is the interquartile range of the sample.
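Both bin widths can be computed with the standard library alone. The sketch below assumes the sample standard deviation for sd(x) and the default quartile convention of `statistics.quantiles`. The function names and the simulated sample are illustrative.

```python
import math
import random
import statistics

def scott_width(data):
    """Scott's bin width 3.5 * sd * n^(-1/3)."""
    return 3.5 * statistics.stdev(data) * len(data) ** (-1 / 3)

def fd_width(data):
    """Freedman-Diaconis bin width 2 * IQR * n^(-1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) * len(data) ** (-1 / 3)

def bins_from_width(data, h):
    """Number of equal-width bins of width h needed to cover the range."""
    return math.ceil((max(data) - min(data)) / h)

rng = random.Random(1)
sample = [rng.gauss(0, 1) for _ in range(1000)]
for rule in (scott_width, fd_width):
    h = rule(sample)
    print(rule.__name__, h, bins_from_width(sample, h))
```

For normal data the interquartile range is about 1.35 standard deviations, so the Freedman-Diaconis width is narrower than Scott's and yields more bins.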

Hellinger risk minimization

• The Hellinger risk between the histogram density estimator ĥ(x), for a regular histogram with a given bin width, and the true density f(x) is defined as

$H = E\left[ \int \left( \sqrt{\hat{h}(x)} - \sqrt{f(x)} \right)^2 dx \right]$

• We try to minimize this quantity over different choices of the bin width or bin number.

• If the true f is known, we have no problem in dealing with this integral. But if the true f is not known, one may estimate it by bootstrapping over repeated samples.
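When the true density is known, the Hellinger loss for one sample can be approximated numerically. The sketch below takes the loss to be ∫(√ĥ − √f)² dx, which is an assumption about the normalization. It integrates over the sample range only, using a midpoint rule, so the small tail mass of f outside the range is ignored. All names are illustrative.

```python
import math
import random

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def histogram_heights(data, k):
    """Density heights of a regular k-bin histogram over the sample range."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    return [c / (len(data) * width) for c in counts], lo, width

def hellinger_loss(data, k, true_pdf, grid=2000):
    """Midpoint-rule approximation of the integral of (sqrt(h) - sqrt(f))^2."""
    heights, lo, width = histogram_heights(data, k)
    step = k * width / grid
    total = 0.0
    for g in range(grid):
        x = lo + (g + 0.5) * step
        j = min(int((x - lo) / width), k - 1)
        total += (math.sqrt(heights[j]) - math.sqrt(true_pdf(x))) ** 2 * step
    return total

rng = random.Random(2)
sample = [rng.gauss(0, 1) for _ in range(1000)]
losses = {k: hellinger_loss(sample, k, normal_pdf) for k in range(1, 41)}
best_k = min(losses, key=losses.get)
print(best_k, losses[best_k])
```

Averaging the loss over many simulated samples gives an estimate of the risk for each bin number.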

Bayesian model for optimal binning

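Given the parameters M (the number of bins) and the vector π = (π1, ..., πM) of bin probabilities, the likelihood of the data is

$P(d \mid \pi, M, I) = \left( \frac{M}{V} \right)^N \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_{M-1}^{n_{M-1}} \pi_M^{n_M}$

where N is the total number of observations, n_k is the number of observations in the k-th bin, V = Mv, and v is the bin width.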

Assume that the prior densities are defined as follows:

$P(M \mid I) = 1/C$, where C is the maximum number of bins taken into account, and

$P(\pi \mid M) = \frac{\Gamma(M/2)}{\Gamma(1/2)^M} \left[ \pi_1 \pi_2 \cdots \pi_M \right]^{-1/2}$,

which is a Dirichlet distribution with all M parameters equal to ½, the conjugate prior of the multinomial distribution.

The joint posterior P(π, M | d, I) = k · P(π | M) P(M | I) P(d | π, M), where k is a normalizing constant, is obtained and integrated over π to get the marginal posterior distribution of M, which when maximized yields the optimal value of M.
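Integrating π out of the likelihood and Dirichlet prior above gives, up to an additive constant, log P(M | d, I) = N log M + log Γ(M/2) − M log Γ(1/2) − log Γ(N + M/2) + Σ_k log Γ(n_k + 1/2). A hedged Python sketch of the search over M follows. The function names, the cap `max_bins` (playing the role of C) and the simulated sample are illustrative.

```python
import math
import random

def log_posterior_bins(data, m):
    """Log marginal posterior of m equal-width bins, up to a constant."""
    n = len(data)
    lo, hi = min(data), max(data)
    width = (hi - lo) / m
    counts = [0] * m
    for x in data:
        counts[min(int((x - lo) / width), m - 1)] += 1
    return (n * math.log(m)
            + math.lgamma(m / 2)
            - m * math.lgamma(0.5)
            - math.lgamma(n + m / 2)
            + sum(math.lgamma(c + 0.5) for c in counts))

def bayesian_optimal_bins(data, max_bins=50):
    """Value of M in 1..max_bins that maximizes the log posterior."""
    return max(range(1, max_bins + 1),
               key=lambda m: log_posterior_bins(data, m))

rng = random.Random(3)
sample = [rng.gauss(0, 1) for _ in range(1000)]
best_m = bayesian_optimal_bins(sample)
print(best_m)
```

Because the prior on M is flat, it contributes only a constant and does not affect the location of the maximum.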

Maximum penalized loglikelihood method

In this case we maximize the log-likelihood of the multinomial distribution corresponding to a histogram, but with a penalty term included. The penalized log-likelihood is thus of the form

$Pl = \log L(\hat{h}; x_1, x_2, \ldots, x_n) - \mathrm{pen}_n(I)$, where I is the partition of the sample range into disjoint intervals. Note that these bins need not be of equal length, i.e. the histogram may be irregular.

There are various choices of the penalty; our two choices, for a histogram with D bins, are

penA=

The first penalty is applicable to both regular and irregular cases.

penB (Hogg or Akaike penalty) = D − 1
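For a regular histogram with the penalty D − 1, the criterion can be sketched as follows. The sketch restricts attention to equal-width bins over the sample range, which is an assumption. The function names and the simulated sample are illustrative.

```python
import math
import random

def penalized_loglik(data, d):
    """Histogram log-likelihood with d equal-width bins minus (d - 1)."""
    n = len(data)
    lo, hi = min(data), max(data)
    width = (hi - lo) / d
    counts = [0] * d
    for x in data:
        counts[min(int((x - lo) / width), d - 1)] += 1
    # Each observation in a bin with count c has density c / (n * width).
    loglik = sum(c * math.log(c / (n * width)) for c in counts if c > 0)
    return loglik - (d - 1)

def best_bin_count(data, max_bins=50):
    """Bin count in 1..max_bins maximizing the penalized log-likelihood."""
    return max(range(1, max_bins + 1),
               key=lambda d: penalized_loglik(data, d))

rng = random.Random(4)
sample = [rng.gauss(0, 1) for _ in range(1000)]
best_d = best_bin_count(sample)
print(best_d)
```

Other penalties can be tried by replacing the `(d - 1)` term.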

Stochastic complexity method

• This is based on the idea of encoding the data with the minimum number of bits. It is a form of penalized maximum likelihood with the number of bits, or description length, as the penalty.

• If P(X|Ө) is the distribution of the data with Ө unknown, and σi(Ө) is the standard deviation of the best estimator of the i-th coordinate of Ө, then the description length is given by

- log2 (P(X|Ө))+∑ log2 ( )

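We define the stochastic complexity as −log2 ∫ P(X|Ө) π(Ө) dӨ. If we take a uniform prior for Ө and take P(X|Ө) to be the multinomial distribution, the integral evaluates to

$l = \frac{(m-1)!\, N_1!\, N_2! \cdots N_m!}{(m+n-1)!}$

where N_j is the number of observations in the j-th bin. Maximize l with respect to m (equivalently, minimize the stochastic complexity −log2 l) to get the number of bins.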

Simulation design

In order to compare the various methods of binning, we run simulation experiments on 3 reference distributions, namely Chi-square(2), Normal(0,1) and Uniform(1,10).

We compute the statistic T = sup_x |ĥ(x) − f(x)| for each method and compare how small the value of T is on average.

For each reference distribution, we simulate 1000 observations, compute the T statistic for the simulated sample, and repeat this experiment 200 times to get a distribution of T.
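A scaled-down Python sketch of one such experiment, for Normal(0,1) with a fixed bin number, is shown below. The supremum is approximated on a fine grid over the sample range. The grid size, the bin number 10 and the 20 repetitions (instead of 200) are illustrative assumptions.

```python
import math
import random
import statistics

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def sup_error(data, k, true_pdf, grid=2000):
    """Grid approximation of T = sup |h_hat(x) - f(x)| for k regular bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    heights = [c / (len(data) * width) for c in counts]
    worst = 0.0
    for g in range(grid + 1):
        x = lo + (hi - lo) * g / grid
        j = min(int((x - lo) / width), k - 1)
        worst = max(worst, abs(heights[j] - true_pdf(x)))
    return worst

rng = random.Random(0)
t_values = []
for _ in range(20):                       # the study uses 200 runs
    sample = [rng.gauss(0, 1) for _ in range(1000)]
    t_values.append(sup_error(sample, 10, normal_pdf))

print(statistics.mean(t_values), statistics.variance(t_values))
```

Replacing the fixed bin number with the output of a binning rule gives the mean and variance of T for that rule.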

Mean and variance of T for chi-square(2)

Method             Mean no. of bins    Mean(T)    Variance(T)
Sturges            10                  0.1364     0.00031
Doane              15                  0.1028     0.00018
Scott              19                  0.0874     0.00015
Hellinger          13                  0.1151     0.00022
FD                 32                  0.0747     0.00023
Kolmogorov         10                  0.2744     0.01288
Bayesian           12                  0.1177     0.00037
Hogg               18                  0.0948     0.000194
Irregular (penA)   6                   0.1134     0.00028


Analysis of chi-square simulation

For Chi-square(2), the Freedman-Diaconis and Scott rules have performed very well in terms of a smaller mean value of T.

Kolmogorov’s complexity method has the maximum spread in the T values. The distribution of T under Sturges’ rule dominates that under the Freedman-Diaconis and Scott rules.

The irregular histogram method under penA gives far fewer bins than the other methods.

Mean and variance of T for N(0,1)

Method             Mean no. of bins    Mean(T)    Variance(T)
Sturges            10                  0.0909     0.00013
Scott              18                  0.08377    0.00025
Hellinger          20                  0.08309    0.00022
FD                 25                  0.08687    0.00029
Kolmogorov         13                  0.2243     0.0137
Bayesian           13                  0.0912     0.00022
Hogg               12                  0.0855     0.00011
Irregular (penA)   6                   0.1984     0.00113

T distribution for N(0,1) family


Analysis of normal simulation

For Normal(0,1), we left out Doane's modification as it is meant for non-normal or skewed distributions.

Sturges’ rule and Scott’s rule have performed very well in the normal case, which is expected given that they are designed under normality assumptions.

The Scott, Freedman-Diaconis and Sturges rules are very close to one another in terms of the distribution of T.

The penalized log-likelihood with penalty A has a distribution of T that dominates the T distribution under the other methods.

The T distributions under stochastic complexity and Hellinger distance have the maximum spread. The minimum spread is due to Sturges’ rule.

Mean and variance of T for U(1,10)

Method      Mean no. of bins    Mean(T)    Variance(T)
Sturges     10                  0.1298     0.00036
Scott       9                   0.1288     0.00035
Doane       11                  0.1308     0.00051
FD          9                   0.1283     0.00032
Bayesian    9                   0.1274     0.000361


Analysis under U(1,10) distribution

Most of the other methods give only 1 or 2 bins in the uniform case, so they cannot be compared with the methods tabulated here, which are more stable in nature.

However, the Scott, Freedman-Diaconis and Sturges rules have performed well, with small values of T and small variation in the values of T under repeated simulations.

Multivariate distribution simulation

As in the univariate case, we extend our comparison to bivariate distributions.

Here we simulate observations from a bivariate normal distribution with mean (0,0), ρ = 0.5 and σ² = 1.

The methods we use are the multivariate extension of the Bayesian optimal binning and the multivariate Scott's rule.

Multivariate Scott’s rule application

In the same vein as in the univariate case, the multivariate Scott’s rule is determined by minimizing the asymptotic expected L2 error.

The multivariate Scott’s choice of bin width along the k-th coordinate is given by $h_k^* = 3.5\, \sigma_{x_k}\, n^{-1/(2+d)}$, where d is the dimension of the dataset, n is the number of observations, and σ_{x_k} is the standard deviation along the k-th coordinate.
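A small Python sketch of the multivariate rule, assuming the usual exponent −1/(2+d) and the sample standard deviation along each coordinate. The function name and the way the correlated bivariate normal sample is generated are illustrative.

```python
import math
import random
import statistics

def scott_widths_multivariate(rows):
    """Per-coordinate bin widths 3.5 * sd_k * n^(-1/(2+d))."""
    n = len(rows)
    d = len(rows[0])
    factor = n ** (-1 / (2 + d))
    return [3.5 * statistics.stdev([r[k] for r in rows]) * factor
            for k in range(d)]

# Bivariate normal with mean (0, 0), unit variances and correlation 0.5.
rng = random.Random(5)
rho = 0.5
rows = []
for _ in range(1000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    rows.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))

widths = scott_widths_multivariate(rows)
print(widths)
```

With d = 1 the exponent becomes −1/3 and the rule reduces to the univariate Scott width.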

The 3-d histogram obtained for T statistic distribution under Scott rule

Distribution of T-statistic for Bivariate Normal under Scott’s rule

Bayesian optimal binning for multivariate normal case

In this case, we select Mx bins along the X axis and My bins along the Y axis and define M = Mx My. The joint likelihood in this case is given by

h(x, y, Mx, My) =

which is quite analogous to the univariate case. We again take a rectangular (uniform) prior for (Mx, My) and, as the prior for π, a Dirichlet distribution of M dimensions with each parameter equal to ½.

Bivariate normal histogram under Bayesian optimal binning

T distribution under Bayesian rule for bivariate normal

Further

We have dealt only with histogram estimators in this paper. However, one may apply a smoothing parameter to make the estimator more efficient and analyze the values of the T statistic for various smoothing parameters.

We have used only the Bayesian and Scott multivariate extensions. However, one may try to generalize the other methods to the multivariate case.

One may also use other forms of penalty and observe which penalty yields the most efficient estimator.

Conclusion

From all three univariate simulation experiments we infer that the Scott and Freedman-Diaconis methods have been the most efficient in reducing the values of T.

However, no method is uniformly best under all scenarios. For the bivariate normal case, comparing Scott’s rule and Bayesian optimal binning, we find that the T value is smaller on average under Scott’s rule than under Bayesian optimal binning.

Presented by Group 4:

Kushal Kumar Dey, Saswati Saha, Raka Mondol, Avijit Kumar Dutta, Nilanjan Chatterjee
