This was performed as a Group project as part of coursework at Indian Statistical Institute.
Selection of bin width and bin number for histograms
Introduction
We often face the problem that the density of the variable of interest is unknown. One popular method of estimating the unknown density is the histogram estimator.
The number of bins or the bin width of a histogram is often chosen arbitrarily or subjectively, but it need not be. Here we review the literature on statistical procedures that have been proposed for choosing the optimal bin width and number of bins.
Aspects of the Study
We shall review the methods prevalent in the statistical literature for determining the optimal number of bins and the bin width of a histogram.
We shall also present a comparative analysis to determine which methods are more efficient.
The measure we use to compare the methods of optimal binning is $\sup_x |\hat{h}(x) - f(x)|$, where $\hat{h}(x)$ is the histogram density estimator at x and f(x) is the true density at x.
Proposed methods of interest for optimal binning
• Sturges' rule and Doane modification
• Scott rule and Freedman-Diaconis modification
• Bayesian optimal binning
• Optimal binning by Hellinger risk minimization
• Penalized maximum log-likelihood method with penalty A and Hogg penalty
• Stochastic complexity or Kolmogorov complexity method
Sturges’ Rule
Consider a frequency distribution with k bins, each of width 1 and centered on the points i = 0, 1, ..., k-1. Choose the bin count of the i-th bin to be the binomial coefficient $\binom{k-1}{i}$. As k increases, this ideal frequency histogram assumes the shape of a normal density with mean (k-1)/2 and variance (k-1)/4.
Under Sturges' rule, the total sample size corresponding to this ideal histogram is
$n = \sum_{i=0}^{k-1} \binom{k-1}{i} = 2^{k-1}$,
where k is the number of bins to be used. Solving for k gives the optimum number of bins,
$k = 1 + \log_2 n$.
We split the sample range into k such bins of equal length, so Sturges' rule gives a regular histogram.
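As a small illustration, Sturges' rule in its usual form, k = 1 + log2(n), can be sketched as below. Rounding up to an integer is an assumption made here (other rounding conventions give slightly different counts), and the function names are hypothetical.

```python
import math


def sturges_bins(n: int) -> int:
    """Number of bins from Sturges' rule, k = 1 + log2(n), rounded up (assumed convention)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.ceil(1 + math.log2(n))


def equal_width_edges(data, k):
    """Split the sample range [min, max] into k bins of equal length."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def ideal_sample_size(k: int) -> int:
    """Sum of the binomial bin counts C(k-1, i) of the ideal histogram; equals 2**(k-1)."""
    return sum(math.comb(k - 1, i) for i in range(k))
```

The last helper only checks the identity behind the rule: the binomial bin counts of the ideal histogram add up to 2^(k-1).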
Conceptual Fallacy of the Sturges rule
• There is a conceptual fallacy in the derivation of Sturges' rule. Instead of choosing $n = 2^{k-1}$, one could have chosen any n for which the individual cell frequencies are proportional to the binomial coefficients.
• m(i), the number of observations in the i-th cell, could well have been taken to be
$m(i) = n \binom{k-1}{i} / 2^{k-1}$.
• So, intuitively, there is no reason for choosing this particular n given the motivation employed in Sturges' rule.
Doane’s law
For skewed or kurtotic distributions, additional bins may be required. Doane proposed increasing the number of bins by $\log_2(1 + \hat{\gamma})$, where $\hat{\gamma}$ is the standardized skewness coefficient.
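A minimal sketch of Doane's modification follows. It assumes the standardized skewness is the absolute sample skewness g1 divided by its standard error under normality, sqrt(6(n-2)/((n+1)(n+3))), which is the form in which Doane's formula is usually stated. The function names and the rounding up are assumptions.

```python
import math


def sample_skewness(data):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


def doane_bins(data):
    """Sturges' count plus Doane's extra bins log2(1 + |g1| / se(g1)), rounded up."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    g1 = sample_skewness(data)
    # Standard error of g1 under normality (assumed standardization).
    se_g1 = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    extra = math.log2(1 + abs(g1) / se_g1)
    return math.ceil(1 + math.log2(n) + extra)
```

For perfectly symmetric data the extra term vanishes and the count reduces to Sturges' rule.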
Scott rule and Freedman-Diaconis modification
• We get an optimal bin width by minimizing the asymptotic expected L2 error. The histogram estimator is given by $\hat{h}(x) = V_k/(nh)$, where h is the bin width, n is the total number of observations and $V_k$ is the number of observations lying in the k-th bin.
• The optimal bin width is given by $h^*(x) = [f(x_k)/(2\gamma^2 n)]^{1/3}$, where $x_k$ is some point lying in the k-th bin and γ is the Lipschitz continuity factor.
• For the normal density, this gives $h^* = 3.5\,\mathrm{sd}(x)\,n^{-1/3}$ for the regular case.
• The Freedman-Diaconis modification for non-normal data is $h^* = 2\,(\mathrm{IQ})\,n^{-1/3}$, where IQ is the interquartile range.
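The two bin-width rules above can be sketched as follows. The conversion from a bin width to a bin count (sample range divided by width, rounded up) and the quartile convention of `statistics.quantiles` are assumptions; the function names are hypothetical.

```python
import math
import statistics


def scott_width(data):
    """Scott's normal-reference bin width: 3.5 * sd * n^(-1/3)."""
    n = len(data)
    return 3.5 * statistics.stdev(data) * n ** (-1.0 / 3.0)


def freedman_diaconis_width(data):
    """Freedman-Diaconis bin width: 2 * IQR * n^(-1/3)."""
    n = len(data)
    q1, _, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' quartiles
    return 2.0 * (q3 - q1) * n ** (-1.0 / 3.0)


def bins_from_width(data, h):
    """Number of equal-width bins needed to cover the sample range with width h."""
    return max(1, math.ceil((max(data) - min(data)) / h))
```

Because the interquartile range is insensitive to the tails, the Freedman-Diaconis width is less affected by outliers than the standard-deviation-based Scott width.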
Hellinger risk minimization
• The Hellinger risk between the histogram density estimator $\hat{h}(x)$ of a regular histogram with a given bin width (equivalently, a given number of bins k) and the true density f(x) is defined as
$H = E\left[\int \left(\sqrt{\hat{h}(x)} - \sqrt{f(x)}\right)^2 dx\right]$.
• We try to minimize this quantity over different choices of the bin width or number of bins.
• If the true f is known, the integral can be evaluated directly. If the true f is not known, one may estimate it by bootstrapping over repeated samples.
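For the case where the true f is known, a rough sketch of the minimization is given below. It assumes f is N(0,1), uses a regular histogram on the sample range, and approximates the integral of (sqrt(ĥ) - sqrt(f))² by a midpoint rule on [-6, 6]. It evaluates the Hellinger loss for a single sample; the risk would average this over repeated samples. Conventions for the Hellinger distance differ by a constant factor, which does not change the minimizing k. All names are hypothetical.

```python
import math
import random


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def histogram_density(data, k):
    """Regular k-bin histogram density on [min, max], returned as a function of x."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    n = len(data)

    def density(x):
        if x < lo or x > hi:
            return 0.0
        return counts[min(int((x - lo) / width), k - 1)] / (n * width)

    return density


def hellinger_sq(est, true_pdf, a, b, steps=4000):
    """Midpoint-rule approximation of the integral of (sqrt(est) - sqrt(f))^2 over [a, b]."""
    dx = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * dx
        total += (math.sqrt(est(x)) - math.sqrt(true_pdf(x))) ** 2
    return total * dx


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
loss_by_k = {k: hellinger_sq(histogram_density(sample, k), normal_pdf, -6.0, 6.0)
             for k in range(2, 41)}
best_k = min(loss_by_k, key=loss_by_k.get)
```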
Bayesian model for optimal binning
Given the parameters M (the number of bins) and the vector π = (π1, ..., πM) of bin probabilities, the likelihood of the data is
$P(d \mid \pi, M, I) = (M/V)^N \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_{M-1}^{n_{M-1}} \pi_M^{n_M}$,
where N is the total number of observations, $n_k$ is the count in the k-th bin, V = Mv and v is the bin width.
Assume that the prior densities are defined as follows:
$P(M \mid I) = 1/C$, where C is the maximum number of bins taken into account, and
$P(\pi \mid M) = \frac{\Gamma(M/2)}{\Gamma(1/2)^M} [\pi_1 \pi_2 \cdots \pi_M]^{-1/2}$,
which is a Dirichlet distribution with all M parameters equal to ½, the conjugate prior of the multinomial distribution.
The posterior $P(\pi, M \mid d, I) \propto P(\pi \mid M)\,P(M \mid I)\,P(d \mid \pi, M, I)$ is obtained and integrated over π to get the marginal posterior distribution of M, which when maximized yields the optimal value of M.
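Integrating the product of the multinomial likelihood and the Dirichlet(½, ..., ½) prior over π has a closed form, since the integral of the product of π_k^(n_k - 1/2) over the simplex equals the product of Γ(n_k + ½) divided by Γ(N + M/2). A minimal sketch of the resulting log posterior of M, up to an additive constant (the sample range V and the constant C do not depend on M), is given below. The binning on [min, max] and the function names are assumptions.

```python
import math
import random


def bin_counts(data, m):
    """Counts in m equal-width bins covering [min, max]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / m
    counts = [0] * m
    for x in data:
        counts[min(int((x - lo) / width), m - 1)] += 1
    return counts


def log_posterior_bins(data, m):
    """Log marginal posterior of M = m, up to a constant not depending on m."""
    n = len(data)
    counts = bin_counts(data, m)
    return (n * math.log(m)
            + math.lgamma(m / 2.0)
            - m * math.lgamma(0.5)
            - math.lgamma(n + m / 2.0)
            + sum(math.lgamma(c + 0.5) for c in counts))


def optimal_bins(data, max_bins):
    """Value of M in 1..max_bins maximizing the log posterior."""
    return max(range(1, max_bins + 1), key=lambda m: log_posterior_bins(data, m))


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
best_m = optimal_bins(sample, 50)
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma functions for sample sizes such as N = 1000.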
Maximum penalized loglikelihood method
In this case we maximize the log-likelihood of the multinomial distribution corresponding to a histogram, but with a penalty function subtracted. The penalized log-likelihood is thus of the form
$Pl = \log L(\hat{h}; x_1, x_2, \ldots, x_n) - \mathrm{pen}_n(I)$, where I is the partition of the sample range into disjoint intervals. Note that these bins need not be of equal length, i.e. the histogram may be irregular.
There are various choices of the penalty; our two choices, for a partition with D bins, are
penA=
The first penalty is applicable to both regular and irregular cases.
penB (Hogg or Akaike penalty) = D - 1
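A minimal sketch of the penalized criterion with the penalty D - 1 is given below, restricted (as an assumption) to regular histograms on the sample range. The log-likelihood of a regular histogram with counts N_j and bin width w is the sum of N_j log(N_j / (n w)) over non-empty bins. The function names are hypothetical.

```python
import math
import random


def penalized_loglik(data, d):
    """Histogram log-likelihood for d equal-width bins minus the penalty d - 1."""
    n = len(data)
    lo, hi = min(data), max(data)
    width = (hi - lo) / d
    counts = [0] * d
    for x in data:
        counts[min(int((x - lo) / width), d - 1)] += 1
    # Empty bins contribute nothing to the log-likelihood.
    loglik = sum(c * math.log(c / (n * width)) for c in counts if c > 0)
    return loglik - (d - 1)


def best_bin_count(data, max_bins):
    """Bin count in 1..max_bins maximizing the penalized log-likelihood."""
    return max(range(1, max_bins + 1), key=lambda d: penalized_loglik(data, d))


rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
best_d = best_bin_count(sample, 50)
```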
Stochastic complexity method
• This is based on the idea of encoding the data with the minimum number of bits. It is a form of penalized maximum likelihood with the number of bits, or description length, as the penalty.
• If P(X|θ) is the distribution of the data with θ unknown, and σi(θ) is the standard deviation of the best estimator of the i-th coordinate of θ, then the description length is given by
- log2 P(X|θ) + ∑ log2 ( )
We define the stochastic complexity as $-\log_2 \int P(X \mid \theta)\,\pi(\theta)\,d\theta$. If we take a uniform prior for θ and take P(X|θ) to be the multinomial distribution, the integral evaluates to
$l = (m-1)!\,N_1!\,N_2! \cdots N_m! / (m+n-1)!$,
where m is the number of bins and $N_j$ is the count in the j-th bin. Maximizing l with respect to m, which minimizes the stochastic complexity, gives the number of bins.
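A sketch of this criterion on the log scale is given below, assuming the factorial form l = (m-1)! N_1! ... N_m! / (m+n-1)!. Read literally as the probability of the ordered bin labels, l never exceeds 1 and equals 1 at m = 1, so the sketch also offers a density-scale version with the extra factor m^n that arises when P(X|θ) is taken to be the histogram density on a fixed sample range. That factor is an assumption not shown in the formula above, and the function names are hypothetical.

```python
import math
import random


def bin_counts(data, m):
    """Counts in m equal-width bins covering [min, max]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / m
    counts = [0] * m
    for x in data:
        counts[min(int((x - lo) / width), m - 1)] += 1
    return counts


def log_marginal(data, m, density_scale=True):
    """log of (m-1)! * prod(N_j!) / (m+n-1)!, plus n*log(m) if density_scale (assumed)."""
    n = len(data)
    counts = bin_counts(data, m)
    value = (math.lgamma(m)                       # log (m-1)!
             + sum(math.lgamma(c + 1) for c in counts)  # log prod N_j!
             - math.lgamma(m + n))                # log (m+n-1)!
    if density_scale:
        value += n * math.log(m)  # bin-width term, an assumption of this sketch
    return value


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
best_m = max(range(1, 51), key=lambda m: log_marginal(sample, m))
literal_best = max(range(1, 51), key=lambda m: log_marginal(sample, m, density_scale=False))
```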
Simulation design
To compare the various methods of binning, we run simulation experiments on 3 reference distributions, namely Chi-square(2), Normal(0,1) and Uniform(1,10).
We compute the statistic $T = \sup_x |\hat{h}(x) - f(x)|$ for each method and compare how small the value of T is on average.
We simulated 1000 observations from each of the reference distributions, computed the T statistic for each simulated run, and carried out this experiment 200 times to get a distribution of T.
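The design can be sketched as follows for the Normal(0,1) reference distribution and a fixed number of bins k. The supremum is approximated by a maximum over a fine grid on [-5, 5], which is an assumption of this sketch, as are the function names and the seed. It is an illustration of the procedure, not the code used for the reported results.

```python
import math
import random
import statistics


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def histogram_density(data, k):
    """Regular k-bin histogram density on [min, max], returned as a function of x."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    n = len(data)

    def density(x):
        if x < lo or x > hi:
            return 0.0
        return counts[min(int((x - lo) / width), k - 1)] / (n * width)

    return density


def sup_distance(est, true_pdf, a, b, grid=2000):
    """Grid approximation of sup |est(x) - f(x)| over [a, b]."""
    step = (b - a) / grid
    return max(abs(est(a + i * step) - true_pdf(a + i * step)) for i in range(grid + 1))


def simulate_T(k, n=1000, runs=200, seed=0):
    """Mean and variance of T over repeated samples of size n from N(0,1)."""
    rng = random.Random(seed)
    values = []
    for _ in range(runs):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        values.append(sup_distance(histogram_density(sample, k), normal_pdf, -5.0, 5.0))
    return statistics.mean(values), statistics.variance(values)
```

To compare binning rules, k would be recomputed from each simulated sample by the rule under study instead of being held fixed.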
Mean and variance of T for Chi-square(2)
Method             Mean no. of bins   Mean(T)   Variance(T)
Sturges            10                 0.1364    0.00031
Doane              15                 0.1028    0.00018
Scott              19                 0.0874    0.00015
Hellinger          13                 0.1151    0.00022
FD                 32                 0.0747    0.00023
Kolmogorov         10                 0.2744    0.01288
Bayesian           12                 0.1177    0.00037
Hogg               18                 0.0948    0.000194
Irregular (penA)   6                  0.1134    0.00028
Analysis of chi square simulation
For Chi-square(2), the Freedman-Diaconis and Scott rules have performed very well in terms of a smaller mean value of T.
Kolmogorov's complexity method has the maximum spread in the T values. The distribution of T under Sturges' rule dominates that under the Freedman-Diaconis and Scott rules, i.e. it tends to give larger values of T.
The irregular histogram method under penA gives far fewer bins than the others.
Mean and variance of T for N(0,1)
Method             Mean no. of bins   Mean(T)   Variance(T)
Sturges            10                 0.0909    0.00013
Scott              18                 0.08377   0.00025
Hellinger          20                 0.08309   0.00022
FD                 25                 0.08687   0.00029
Kolmogorov         13                 0.2243    0.0137
Bayesian           13                 0.0912    0.00022
Hogg               12                 0.0855    0.00011
Irregular (penA)   6                  0.1984    0.00113
T distribution for N(0,1) family
Analysis of normal simulation
For Normal(0,1), we left out Doane's modification, as it is meant for non-normal or skewed distributions.
Sturges' rule and Scott's rule have performed very well in the normal case, which is expected given that they are derived under normality assumptions.
The Scott, Freedman-Diaconis and Sturges rules are very close to one another in terms of the distribution of T.
The penalized log-likelihood with penalty A has a distribution of T that dominates the T distribution under the other methods.
The T distributions under the stochastic complexity and Hellinger distance methods have the maximum spread. The minimum spread is due to Sturges' rule.
Mean and variance of T for U(1,10)
Method             Mean no. of bins   Mean(T)   Variance(T)
Sturges            10                 0.1298    0.00036
Scott              9                  0.1288    0.00035
Doane              11                 0.1308    0.00051
FD                 9                  0.1283    0.00032
Bayesian           9                  0.1274    0.000361
Analysis under U(1,10) distribution
Most of the methods give only 1 or 2 bins in the uniform case, so they cannot be compared with the others, which are more stable.
However, the Scott, Freedman-Diaconis and Sturges rules have performed well, with small values of T and small variation in T under repeated simulations.
Multivariate distribution simulation
As in the univariate case, we try to generalize our methods to bivariate distributions.
Here we simulate observations from a bivariate normal distribution with mean (0,0), ρ = 0.5 and σ² = 1.
The methods we use are the multivariate extension of Bayesian optimal binning and the multivariate Scott's rule.
Multivariate Scott’s rule application
As in the univariate case, the multivariate Scott's rule is determined by minimizing the asymptotic expected L2 error.
The multivariate Scott's choice of bin width along the k-th coordinate is given by $h_k^* = 3.5\,\sigma_{x_k}\,n^{-1/(2+d)}$, where d is the dimension of the dataset and $\sigma_{x_k}$ is the standard deviation along the k-th coordinate.
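A minimal sketch of the multivariate rule is given below, assuming the usual form h_k = 3.5 σ_k n^(-1/(2+d)), which reduces to the univariate n^(-1/3) rule when d = 1. The sampler for the bivariate normal with unit variances and correlation ρ, and all function names, are hypothetical helpers for illustration.

```python
import math
import random
import statistics


def scott_widths(rows):
    """Per-coordinate bin widths h_k = 3.5 * sd_k * n^(-1/(2+d)) for d-dimensional rows."""
    n = len(rows)
    d = len(rows[0])
    factor = n ** (-1.0 / (2 + d))
    return [3.5 * statistics.stdev([r[k] for r in rows]) * factor for k in range(d)]


def bivariate_normal_sample(n, rho, seed=0):
    """n draws from a bivariate normal with mean (0, 0), unit variances, correlation rho."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return out


sample = bivariate_normal_sample(1000, 0.5)
widths = scott_widths(sample)
```

With unit variances and n = 1000, both widths should be close to 3.5 × 1000^(-1/4).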
The 3-d histogram obtained for T statistic distribution under Scott rule
Distribution of T-statistic for Bivariate Normal under Scott’s rule
Bayesian optimal binning for multivariate normal case
In this case, we select Mx bins along the X axis and My bins along the Y axis and define M = Mx My. The joint likelihood in this case is given by
h(x, y, Mx, My) =
which is quite analogous to the univariate case. Again we take a rectangular prior for (Mx, My) and a Dirichlet distribution of M dimensions with each parameter ½ as the prior for π.
Bivariate normal histogram under Bayesian optimal binning
T distribution under Bayesian rule for bivariate normal
Further
We have dealt only with histogram estimators in this paper. However, one may apply a smoothing parameter to make the estimator more efficient and analyze the values of the T statistic for various smoothing parameters.
We have used only the Bayesian and Scott multivariate extensions. However, one may try to generalize the other methods to the multivariate case.
One may use other forms of penalty and observe for which penalty the resulting estimator is most efficient.
Conclusion
From all three univariate simulation experiments we infer that the Scott and Freedman-Diaconis methods have been the most efficient in reducing the values of T.
No method, however, is uniformly best under all scenarios. For the bivariate normal case, using Scott's rule and Bayesian optimal binning, we find that the T value is smaller on average under Scott's rule than under Bayesian optimal binning.
Presented by Group 4:
Kushal Kumar Dey
Saswati Saha
Raka Mondol
Avijit Kumar Dutta
Nilanjan Chatterjee