Robust Series Expansions for Probability Density Estimation

Max Welling
Dept. of Electrical Engineering
California Institute of Technology 136-93
Pasadena, CA 91125
[email protected]

Keywords: density estimation, robust statistics, cumulants, projection pursuit

Abstract

A complete and orthogonal series expansion for robust probability density estimation is introduced. The expansion generalizes the well known Gram-Charlier and Edgeworth expansions, which suffer from instabilities in the presence of outliers. Robust cumulants are defined which inherit most of the nice properties associated with the classical cumulants, but in addition are B-robust. We present simulation results showing the improved behaviour of the novel density estimator in L2 norm in the presence of outliers, for both sound data and artificially generated data. Robustness, bias and efficiency properties are explored through the calculation of the influence function. As an application, we define novel (robust) projection pursuit contrast functions.

1 Introduction

Robust estimation and robust statistics are maturing fields of research nowadays (Huber, 1981; Hampel et al., 1986). Their aim is to devise theoretical tools and estimation techniques that are less vulnerable to deviations from idealized assumptions. For instance, if we have a parametric model of a density function, robust statistics asks how the estimation of the parameters behaves if this parametric model is not exactly correct. It also deals with the issue of how stable an estimate is against the presence of outliers: should we expect large or small differences when we add or move a particular data point? Estimation techniques which are relatively insensitive to such small changes are called robust. Most of the work in the literature deals with robust estimation of location and scale. Less attention has been paid to the robust estimation of higher order statistical quantities like skewness or kurtosis.

Famous, non-robust techniques for density estimation are the Edgeworth and Gram-Charlier expansions. These series expansions are popular tools because the expansion coefficients enjoy some desirable properties, like covariant transformation under rotations, while being easily expressible in terms of moments. Although moments are efficiently calculated using sample averages, they are very sensitive to adding or shifting data points at large distances from the mean. This effect gets worse for higher order moments, because distant data points dominate the sample estimate of $\langle x^n \rangle$ for large $n$ (where $\langle\cdot\rangle$ denotes the average over the pdf).

In this paper we define novel series expansions which are closely related to the Edgeworth and Gram-Charlier expansions, but are shown to be much more robust against the presence of outliers. Moreover, the expansion coefficients inherit the nice properties associated with the classical expansions, and are easily expressible in terms of robust moments, $\langle (\alpha x)^n e^{\frac{1}{2}(1-\alpha^2)x^2} \rangle$, which contain a decreasing exponential. Here $\alpha$ is a parameter that can be tuned to interpolate between the classical case, $\alpha^2 = 1$, and the more robust side of the spectrum, $\alpha^2 \to 2$. Clearly, adding or shifting data at large distances has little effect on the estimate, due to the presence of the decreasing exponential. This makes the new expansion robust, while it remains complete and orthogonal.
Completeness implies that the effect of outliers is redistributed over the different terms of the expansion, making the estimates less vulnerable to sample fluctuations and outliers.

After a review of the classical Edgeworth and Gram-Charlier expansions, we define the robust generalized expansions. It is shown that through the incorporation of the "probability constraint" we can ensure that the pdf integrates to unity. Through the introduction of the influence function we can study the robustness properties of the expansion and show that it is "B-robust". Asymptotic variance and bias are also studied with the help of the influence function. It is shown that the asymptotic variance of the robust expansion coefficients is smaller than that of their classical counterparts. Also, the bias is shown to decrease for super-Gaussian distributions. As an application we derive new robust projection pursuit contrast functions. Simulations on real world data (sound data) show that while the classical expansion becomes unstable, the generalized expansion still provides a reliable estimate of the pdf. Using artificially generated data, we can explore which value of $\alpha$ provides the best results; we found that $\alpha^2 \approx 1.9$ gives the best performance.
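To see the mechanism at work before any formalism, the following minimal Python sketch (our own illustration; the sample size and the outlier position are arbitrary assumptions) contrasts a classical fourth-order sample moment with its robust counterpart at $\alpha^2 = 1.9$:

```python
# Illustration (assumed setup): one outlier versus classical and robust moments.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)       # clean zero-mean, unit-variance sample
x_out = np.append(x, 8.0)           # same sample plus a single outlier at x = 8

def classical_m4(x):
    # ordinary fourth moment <x^4>; dominated by distant points
    return np.mean(x ** 4)

def robust_m4(x, alpha2=1.9):
    # robust moment <(alpha x)^4 phi(alpha x)/phi(x)>; the decaying factor
    # alpha * exp(-0.5 (alpha^2 - 1) x^2) suppresses distant points
    a = np.sqrt(alpha2)
    w = a * np.exp(-0.5 * (alpha2 - 1.0) * x ** 2)
    return np.mean((a * x) ** 4 * w)

print(classical_m4(x), classical_m4(x_out))  # the outlier shifts <x^4> noticeably
print(robust_m4(x), robust_m4(x_out))        # the robust moment barely moves
```

The weight attached to the outlier at $x = 8$ is of order $e^{-0.45\cdot 64} \approx 10^{-13}$, so the robust moment is essentially unchanged.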

2 Classical Density Estimation

In this section we summarize the classical Gram-Charlier and Edgeworth expansions for density estimation.

Consider the situation where we want to estimate a zero-mean, unit-variance pdf. There is no loss of generality in the zero-mean and unit-variance assumptions, as we can always remove the mean by subtraction and normalize the variance by scaling. Furthermore, assume that the pdf can be written as a perturbation of a Gaussian pdf, $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. In other words, the zeroth order estimate of the pdf is a Gaussian with zero mean and unit variance. The Gram-Charlier expansion can now be written as

$$p(x) = \sum_{n=0}^{\infty} c_n (-1)^n \frac{d^n}{dx^n}\phi(x) = \sum_{n=0}^{\infty} c_n H_n(x)\,\phi(x), \qquad (1)$$

where $H_n(x)$ are normalized Hermite polynomials of order $n$. In appendix A one can find some useful properties of the Hermite polynomials $H_n(x)$. It can be shown (Kendall and Stuart, 1963) that these polynomials are orthogonal with respect to the measure $d\nu = \phi(x)\,dx$,

$$\int_{-\infty}^{\infty} H_n(x)H_m(x)\,d\nu = n!\,\delta_{nm}, \qquad (2)$$

where $\delta_{nm}$ denotes the Kronecker delta. Using this result we can easily calculate the coefficients $c_n$ in the Gram-Charlier expansion (1),

$$c_n = \frac{1}{n!}\int_{-\infty}^{\infty} p(x)\,H_n(x)\,dx. \qquad (3)$$

In practice, one has a number $N$ of data points from which one has to estimate the coefficients $c_n$ by replacing the integral by a sample average,

$$\hat c_n \approx \frac{1}{N}\frac{1}{n!}\sum_{A=1}^{N} H_n(x_A), \qquad (4)$$

where capital indices $(A, B, \ldots)$ will always index the samples. From this equation it is clear that for higher order polynomials the result of this estimation becomes highly sensitive to outliers: as the outliers are raised to the power $n$, they tend to dominate the estimate of the coefficients. Using equation (71) from appendix A, we can express the expansion coefficients $c_n$ in terms of the moments,

$$c_n = \frac{1}{n!}\left(\mu_n - \frac{n^{[2]}}{2\cdot 1!}\mu_{n-2} + \frac{n^{[4]}}{2^2\cdot 2!}\mu_{n-4} - \frac{n^{[6]}}{2^3\cdot 3!}\mu_{n-6} + \ldots\right), \qquad (5)$$

where $n^{[m]} \equiv n!/(n-m)!$. If we insert this equation back into the expansion (1) and break it off after a fixed number of terms (e.g. four), we find an approximation for the pdf in terms of its moments:

$$p(x) = \left\{1 + \tfrac{1}{6}\mu_3 H_3(x) + \tfrac{1}{24}(\mu_4 - 3)H_4(x)\right\}\phi(x). \qquad (6)$$

As mentioned earlier, sample-average estimates of moments higher than four are in general unreliable in the presence of outliers. The zeroth moment represents the integral over the pdf and is one by definition of a pdf. The first moment vanishes because it represents the mean, which we removed. The second moment equals the variance and is one because we scaled the data to unit variance.
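To make the procedure concrete, here is a minimal sketch of the truncated classical estimator of eqs. (1) and (4); the implementation is ours, and it relies on NumPy's `hermite_e` module, which provides the probabilists' Hermite polynomials obeying the orthogonality relation (2):

```python
# Classical Gram-Charlier fit: sample coefficients (4) and truncated pdf (1).
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval  # probabilists' Hermite H_n

def gram_charlier_fit(samples, order):
    # c_n = (1/n!) (1/N) sum_A H_n(x_A), for n = 0..order
    return np.array([np.mean(hermeval(samples, np.eye(order + 1)[n]))
                     / factorial(n) for n in range(order + 1)])

def gram_charlier_pdf(x, c):
    # p(x) ~ sum_n c_n H_n(x) phi(x): the expansion (1) truncated at len(c) - 1
    phi = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
    return hermeval(x, c) * phi
```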

An alternative formulation of this expansion exists, called the Edgeworth expansion. Since it is also an expansion in terms of Hermite polynomials, it is really only the definition of its expansion coefficients, the cumulants, which is distinct from the Gram-Charlier expansion. However, the convention for breaking the series off at a fixed number of terms is often different. The definition of the Edgeworth expansion is given by

$$p(x) = e^{\left(\sum_{n=3}^{\infty}\frac{1}{n!}\kappa_n(-1)^n\frac{d^n}{dx^n}\right)}\phi(x), \qquad (7)$$

where the relation (70) from appendix A and the Taylor expansion of the exponential function may be used to convert this into a series of Hermite polynomials. The cumulants $\kappa_n$ have some interesting properties, which we will list after we introduce the multivariate expansion. The cumulants can also be expressed in terms of the moments by using the following relation,

$$\sum_{n=1}^{\infty}\frac{1}{n!}\kappa_n (it)^n = \ln\!\left(\sum_{m=0}^{\infty}\frac{1}{m!}\mu_m (it)^m\right). \qquad (8)$$

Expanding the natural logarithm and equating all powers of $(it)$, we find¹ the cumulants in terms of the moments. Using this we can rewrite the estimate of the pdf (6) in terms of the first cumulants:

$$p(x) = \left\{1 + \tfrac{1}{6}\kappa_3 H_3(x) + \tfrac{1}{24}\kappa_4 H_4(x)\right\}\phi(x). \qquad (9)$$

The first and second cumulants drop out of the equation ($\kappa_1 = 0$, $\kappa_2 = 1$) because we assumed zero-mean and unit-variance distributions. The third cumulant is called skewness and tells us something about the asymmetry of the pdf. The fourth cumulant is called kurtosis and measures the "peakiness" of the pdf. Distributions with large kurtosis (super-Gaussian) have a lot of mass near the center and long tails. Distributions with small kurtosis are more uniformly distributed.

In the following, we briefly treat the multivariate case; for more details we refer to standard textbooks on advanced statistics (Kendall and Stuart, 1963). Firstly, we assume again a preprocessing stage where we sphere the data. This means that we subtract the mean and scale the variances in every eigendirection of the covariance matrix to unity. Next we introduce the multivariate, unit-variance Gaussian, $\phi(\mathbf{x}) = \frac{1}{(2\pi)^{M/2}}e^{-\frac{1}{2}\mathbf{x}^T\mathbf{x}}$, as an obvious generalization of the one dimensional Gaussian. The Gram-Charlier expansion now becomes

$$p(\mathbf{x}) = \sum_{n=0}^{\infty}\sum_{i_1=1}^{M}\ldots\sum_{i_n=1}^{M} c_{i_1\ldots i_n}(-1)^n\frac{d}{dx_{i_1}}\ldots\frac{d}{dx_{i_n}}\phi(\mathbf{x}). \qquad (10)$$

The index $n$ counts the number of derivatives being taken, and every index $i_k$ runs over the different random variables. If we collect all derivatives with respect to one random variable and take the derivatives using equation (70), we can again replace the derivatives by Hermite polynomials. From (10) we can see that the coefficients $c_{i_1\ldots i_n}$ transform as tensors with respect to rotations. Given the fact that $p(\mathbf{x})$ is invariant with respect to rotations (as the Jacobian of a rotation is one), and because the $c_{i_1\ldots i_n}$ form an inner product with an array of derivatives which transform as tensors, we deduce that the $c_{i_1\ldots i_n}$ transform as tensors themselves,

$$c_{i_1\ldots i_n} \to O_{i_1 j_1}\ldots O_{i_n j_n}\,c_{j_1\ldots j_n}, \qquad (11)$$

where $O_{ij}$ is a rotation matrix. The multivariate generalization of (7) is given by

$$p(\mathbf{x}) = e^{\left(\sum_{n=3}^{\infty}\sum_{i_1=1}^{M}\ldots\sum_{i_n=1}^{M}\frac{1}{n!}\kappa_{i_1\ldots i_n}(-1)^n\frac{d}{dx_{i_1}}\ldots\frac{d}{dx_{i_n}}\right)}\phi(\mathbf{x}). \qquad (12)$$

Notice that by the same reasoning we can readily observe from this equation that the cumulants transform as tensors, i.e. they obey the same transformation rule (11) as the coefficients $c_{i_1\ldots i_n}$. Fortunately, the relation between cumulants and moments also readily generalizes,

$$\sum_{n=1}^{\infty}\sum_{i_1=1}^{M}\ldots\sum_{i_n=1}^{M}\frac{1}{n!}\kappa_{i_1\ldots i_n}(it_{i_1})\ldots(it_{i_n}) = \ln\!\left(\sum_{m=0}^{\infty}\sum_{j_1=1}^{M}\ldots\sum_{j_m=1}^{M}\frac{1}{m!}\mu_{j_1\ldots j_m}(it_{j_1})\ldots(it_{j_m})\right). \qquad (13)$$

Edgeworth studied expansion (12) in the context of elementary errors and found that in the multivariate case (say with $M$ random variables), the cumulants are of the order $M^{(2-n)/2}$, where $n$ is the order of the cumulant. Notice that this observation is only valuable for multiple dependent random variables ($M > 1$). If the $M$ random variables were independent, the expansion would reduce to $M$ independent one dimensional expansions, for which $M^{(2-n)/2} = 1$. This does not imply that the Gram-Charlier and one dimensional Edgeworth expansions are worthless. In general, the series converge rapidly and are well defined.¹

¹The $i$ is there for reasons of convergence. The series $\sum_{m=0}^{\infty}\frac{1}{m!}\mu_m(it)^m$ is defined to be the characteristic function of the pdf; it is equal to $\int_{-\infty}^{\infty} e^{itx}p(x)\,dx$. Had we omitted the $i$, the convergence of the integral would be much worse.
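Equating powers of $(it)$ in eq. (8) yields the standard recursion $\mu_n = \sum_{m=1}^{n}\binom{n-1}{m-1}\kappa_m\mu_{n-m}$, which, solved for $\kappa_n$, gives the cumulants in terms of the moments. A small helper (our sketch, not taken from the paper) makes this explicit:

```python
# Moments -> cumulants: the relation implied by eq. (8).
from math import comb

def cumulants_from_moments(mu):
    # mu[n] is the n-th raw moment with mu[0] = 1; returns kappa[0..len(mu)-1]
    kappa = [0.0] * len(mu)
    for n in range(1, len(mu)):
        kappa[n] = mu[n] - sum(comb(n - 1, m - 1) * kappa[m] * mu[n - m]
                               for m in range(1, n))
    return kappa

# For zero-mean, unit-variance data: kappa_3 = mu_3 (skewness) and
# kappa_4 = mu_4 - 3 (kurtosis), matching the coefficients in eq. (9).
print(cumulants_from_moments([1.0, 0.0, 1.0, 0.5, 4.0]))  # kappa_4 = 1.0
```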

We have seen that both the Gram-Charlier and the Edgeworth expansion are approximations around a unit-variance, zero-mean Gaussian density. This means that for such a Gaussian all parameters except one vanish: $c_0 = 1$, $\kappa_{ij} = \delta_{ij}$. Let us call this property I. Another nice property both sets of coefficients share (II) is the fact that they transform as tensors with respect to rotations (11). A third property (III) distinguishes the cumulants from other expansion coefficients: for independent random variables, "cross-cumulants" vanish. Cross-cumulants are defined as cumulants $\kappa_{i_1\ldots i_n}$ with not all values of the indices $i_1\ldots i_n$ equal. Because we will need to prove the same property for the robust cumulants in the next section, we briefly sketch the proof in the following.

The second term in (13) can be rewritten as $\ln E\{e^{i\mathbf{x}^T\mathbf{t}}\}$. If the random variables are independent, the exponential factorizes into a product of averages over the independent variables separately. The logarithm then turns this into a sum of terms, each depending on only one random variable, $\sum_{j=1}^{M}\ln E\{e^{ix_j t_j}\}$. Because the "cross-cumulants" in the first term of (13) are precisely the terms that mix different $t_j$, they must vanish.

3 Robust Estimation using Generalized Series Expansions

In this section we define the robust expansions which generalize the Gram-Charlier and Edgeworth expansions. Let us return to the one dimensional case and determine the location and scale of the pdf. Robust estimates exist in the literature (see for instance (Hampel et al., 1986), (Huber, 1981)). One possibility would be to use the median as a location estimate and the standardized median of the absolute deviations from the median (MAD) as the scale estimate,

$$\mathrm{Location} = \mathrm{med}_A(x_A), \quad A = 1\ldots N, \qquad (14)$$
$$\mathrm{Scale} = 1.483\,\mathrm{med}_A\big(|x_A - \mathrm{med}_B(x_B)|\big), \quad A, B = 1\ldots N. \qquad (15)$$

The constant 1.483 makes the scale estimate unbiased for a Gaussian pdf. In the following we will assume that the location has been calculated and subtracted from the data, while the scale has been calculated and used to rescale the data to unit variance.

Next, consider a Hermite polynomial with a scaled argument, $H_n(\alpha x)$, where $\alpha$ is a real number, usually chosen between 1 and $\sqrt{2}$ (the details of those boundaries are explained in due course). We will now show that we can expand the pdf in terms of these scaled Hermite polynomials in the following way,

$$p(x) = \frac{\phi(x)}{\phi(\alpha x)}\sum_{n=0}^{\infty} d_n(-1)^n\frac{d^n}{d(\alpha x)^n}\phi(\alpha x) = \sum_{n=0}^{\infty} d_n H_n(\alpha x)\,\phi(x), \qquad (16)$$

where we introduced the scaled Gaussian distribution $\phi(\alpha x) = \frac{\alpha}{\sqrt{2\pi}}e^{-\frac{1}{2}\alpha^2 x^2}$. To see that this expansion is orthogonal, we remark that the scaled Hermite polynomials are orthogonal with respect to a different inner product. If we introduce the measure

$$d\nu_\alpha = \phi(\alpha x)\,dx \equiv \frac{\alpha}{\sqrt{2\pi}}e^{-\frac{1}{2}\alpha^2 x^2}\,dx, \qquad (17)$$

we can write

$$\int_{-\infty}^{\infty} H_n(\alpha x)H_m(\alpha x)\,d\nu_\alpha = n!\,\delta_{nm}. \qquad (18)$$

The measure (17) is a generalization of the measure defined in section 2, which is regained in the limit $\alpha = 1$; likewise, (18) is the generalization of (2). Using (16) and (18) we can find the coefficients $d_n$,

$$d_n = \frac{1}{n!}\int_{-\infty}^{\infty} p(x)\,H_n(\alpha x)\,\phi^{-1}(x)\,d\nu_\alpha. \qquad (19)$$

The advantage of this expansion becomes apparent if we estimate these coefficients using the sample average,

$$\hat d_n = \frac{1}{N}\frac{1}{n!}\sum_{A=1}^{N}\frac{\phi(\alpha x_A)}{\phi(x_A)}\,H_n(\alpha x_A). \qquad (20)$$

Because we chose the factor $\alpha$ to be larger than one, the factor $\frac{\phi(\alpha x_A)}{\phi(x_A)} = \alpha\,e^{-\frac{1}{2}(\alpha^2-1)x_A^2}$ suppresses the effect of outliers on the estimate of $d_n$. This is intuitively the reason why this expansion is more stable for heavy tailed pdfs. Recall also that the expansion is complete, so that asymptotically no information is thrown away; the influence of the outliers is smeared out over higher order coefficients. In practice, we break the expansion off after some fixed order, which implies that we take less information from the outliers into account. Notice that we can tune this influence by adjusting the parameter $\alpha$.
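A direct transcription of eq. (20) shows how little extra work robustness costs; this is our own sketch, again reusing NumPy's probabilists' Hermite polynomials:

```python
# Robust coefficient estimates (20) and the truncated robust expansion (16).
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

def robust_fit(samples, order, alpha2=1.9):
    # d_n = (1/N)(1/n!) sum_A [phi(alpha x_A)/phi(x_A)] H_n(alpha x_A)
    a = np.sqrt(alpha2)
    w = a * np.exp(-0.5 * (alpha2 - 1.0) * samples ** 2)  # squeezing factor
    return np.array([np.mean(w * hermeval(a * samples, np.eye(order + 1)[n]))
                     / factorial(n) for n in range(order + 1)])

def robust_pdf(x, d, alpha2=1.9):
    # p(x) ~ sum_n d_n H_n(alpha x) phi(x): eq. (16) truncated at len(d) - 1
    phi = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
    return hermeval(np.sqrt(alpha2) * x, d) * phi
```

For $\alpha^2 = 1$ the weight reduces to one and the classical estimator (4) is recovered.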

Following the line of reasoning in section 2 further, we would like to be able to express the coefficients $d_n$ in terms of some generalized moments. If we generalize the moments in the following way,

$$\mu_n^{(\alpha)} = \int_{-\infty}^{\infty} (\alpha x)^n\,p(x)\,\frac{\phi(\alpha x)}{\phi(x)}\,dx, \qquad (21)$$

we acquire the same functional dependence as in (5). The suppressing factor $\frac{\phi(\alpha x)}{\phi(x)}$ renders these generalized moments again robust against outliers. In the limit $\alpha = 1$ we recover the usual definition of the moments. With these definitions, the approximation of the series by a finite number of terms using moments instead of the coefficients $d_n$ is almost the same as (6); however, we should replace $H_n(x) \to H_n(\alpha x)$. Another important difference is that the zeroth order moment does not represent the integral over the pdf for generic $\alpha$. For the same reason, $\mu_1^{(\alpha)}$ and $\mu_2^{(\alpha)}$ do not represent the mean and variance for generic $\alpha$, and therefore do not equal 0 and 1 respectively. If we cut the series off after a fixed number of terms, we may find that the pdf does not exactly integrate to one and is not exactly zero-mean and unit-variance. In practice, we find that these constraints are satisfied to good approximation though. It is nevertheless desirable that the probability constraint, $\int p(x)\,dx = 1$, is exactly satisfied. We will therefore introduce an extra term in the expansion in the next section to enforce this "probability conservation".

The approximation of the pdf in terms of robust moments up to fourth order is given by

$$p(x) = \Big\{\mu_0^{(\alpha)} + \mu_1^{(\alpha)}H_1(\alpha x) + \tfrac{1}{2}\big(\mu_2^{(\alpha)} - \mu_0^{(\alpha)}\big)H_2(\alpha x) + \tfrac{1}{6}\big(\mu_3^{(\alpha)} - 3\mu_1^{(\alpha)}\big)H_3(\alpha x) + \tfrac{1}{24}\big(\mu_4^{(\alpha)} - 6\mu_2^{(\alpha)} + 3\mu_0^{(\alpha)}\big)H_4(\alpha x)\Big\}\phi(x). \qquad (22)$$

Notice that we could go up to much higher orders in the general case; in simulations we went up to order twenty.

Next we define the generalized cumulants in terms of the generalized moments by the same equation as (8), i.e. we define the generalized cumulants such that equation (8) holds for all $\alpha$. In appendix B we compute explicitly the relation between the moments and cumulants up to order 4. Using that information one can rewrite the Gram-Charlier expansion (22) in terms of the generalized cumulants; we will not write it out here, as the final answer is somewhat complicated. In appendix D we prove the following generalized cumulant expansion,

$$p(x) = \frac{\phi(x)}{\phi(\alpha x)}\,e^{\left(\sum_{n=0}^{\infty}\frac{1}{n!}\tilde\kappa_n^{(\alpha)}(-1)^n\frac{d^n}{d(\alpha x)^n}\right)}\phi(\alpha x). \qquad (23)$$

By $\kappa_n^{(\alpha)}$ we denote the generalized cumulants, defined through relation (8). The $\tilde\kappa_n^{(\alpha)}$ are equal to the generalized cumulants for all $n \neq 2$; for $n = 2$ we have $\tilde\kappa_2^{(\alpha)} = \kappa_2^{(\alpha)} - 1$. Intuitively, the derivatives produce the $H_n(\alpha x)$ times $\phi(\alpha x)$, and the prefactor exchanges the $\phi(\alpha x)$ for $\phi(x)$. Expanding this expression and using the relation between generalized cumulants and generalized moments in appendix B, one can verify that the first few terms do indeed match (22).
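The generalized moments (21) are equally easy to estimate by sample averages; the sketch below (ours, an assumed implementation) assembles them into the fourth-order approximation (22):

```python
# Robust moments (21) from samples and the fourth-order density (22).
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def robust_moments(samples, alpha2=1.9, n_max=4):
    # mu_n^(alpha) ~ (1/N) sum_A (alpha x_A)^n phi(alpha x_A)/phi(x_A)
    a = np.sqrt(alpha2)
    w = a * np.exp(-0.5 * (alpha2 - 1.0) * samples ** 2)
    return np.array([np.mean(w * (a * samples) ** n) for n in range(n_max + 1)])

def pdf_fourth_order(x, mu, alpha2=1.9):
    m0, m1, m2, m3, m4 = mu
    # coefficients of H_0(alpha x) .. H_4(alpha x), as in eq. (22)
    c = np.array([m0, m1, (m2 - m0) / 2.0,
                  (m3 - 3.0 * m1) / 6.0, (m4 - 6.0 * m2 + 3.0 * m0) / 24.0])
    phi = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
    return hermeval(np.sqrt(alpha2) * x, c) * phi
```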
Generalizations to higher dimensions are straightforward. For the multivariate robust Gram-Charlier expansion we find

$$p(\mathbf{x}) = \frac{\phi(\mathbf{x})}{\phi(\alpha\mathbf{x})}\sum_{n=0}^{\infty}\sum_{i_1=1}^{M}\ldots\sum_{i_n=1}^{M} d_{i_1\ldots i_n}(-1)^n\frac{d}{d(\alpha x)_{i_1}}\ldots\frac{d}{d(\alpha x)_{i_n}}\phi(\alpha\mathbf{x}). \qquad (24)$$

By using (70), we can again replace derivatives by scaled Hermite polynomials. Finally, we can combine (12) and (23) into the multivariate generalized Edgeworth expansion,

$$p(\mathbf{x}) = \frac{\phi(\mathbf{x})}{\phi(\alpha\mathbf{x})}\,e^{\left(\sum_{n=0}^{\infty}\sum_{i_1=1}^{M}\ldots\sum_{i_n=1}^{M}\frac{1}{n!}\tilde\kappa^{(\alpha)}_{i_1\ldots i_n}(-1)^n\frac{d}{d(\alpha x)_{i_1}}\ldots\frac{d}{d(\alpha x)_{i_n}}\right)}\phi(\alpha\mathbf{x}), \qquad (25)$$

where we again have the relation $\tilde\kappa^{(\alpha)}_{i_1 i_2} = \kappa^{(\alpha)}_{i_1 i_2} - \delta_{i_1 i_2}$, with all other cumulants unchanged.

With these definitions we can now check whether the generalized expansion coefficients still enjoy properties I through III. Property I is easily checked, since the zeroth order approximation is precisely $\phi(\mathbf{x})$ (and the expansions are complete and orthogonal); this implies that the only nonzero coefficients in the case of a zero-mean, unit-variance Gaussian are again given by $d_0 = 1$ and $\kappa_{ij} = \delta_{ij}$. From the definitions (24) and (25) we also check property II: the coefficients in both expansions form contractions with an array of derivatives which transform as tensors with respect to rotations. Because the pdf remains invariant with respect to rotations, we find that the coefficients also transform as tensors. Property III can be proved as follows. Consider again relation (13). The same relation holds for the robust cumulants and moments (we defined the robust cumulants by these equations and derived (25) from them). The second term can be written as $\ln E\{e^{(\alpha\mathbf{x})^T(i\mathbf{t})}\}$, where $E\{\cdot\}$ denotes the expression $\int p(\mathbf{x})\,\frac{\phi(\alpha\mathbf{x})}{\phi(\mathbf{x})}\,\{\cdot\}\,d\mathbf{x}$. If the random variables $\mathbf{x}$ are independent, the exponential factorizes into a product of expectations, which is converted into a sum by the logarithm, $\sum_{j=1}^{M}\ln E\{e^{(\alpha x_j)(it_j)}\}$. We can again observe that all terms mixing distinct $t_j$ have vanished from the expansion of the above expression. Because the cross-cumulants in (13) are precisely those terms, we must conclude they also vanish.

So we notice that all properties (I, II, III) defined in section 2 are preserved under the generalization process. It should be noted though that the expansion (25) does not seem to be an expansion in $M^{\frac{2-n}{2}}$ ($n$ being the order of the cumulant). But as noted before, this need not be a serious drawback for many applications.

4 Including the Probability Constraint

In the Gram-Charlier expansion, the constraint $\int_{-\infty}^{\infty} p(x)\,dx = 1$ is automatically conserved even if we only take a finite number of terms into account. This is true since the zeroth order Hermite polynomial is simply the function 1, and therefore its expansion coefficient is given by (3, 4),

$$c_0 = \int_{-\infty}^{\infty} p(x)\,dx \approx \frac{1}{N}\sum_{A=1}^{N} 1 = 1. \qquad (26)$$

Moreover, all subsequent polynomials are orthogonal to 1 and therefore do not interfere with this result, i.e.

$$\int_{-\infty}^{\infty} p(x)\,dx = \int_{-\infty}^{\infty}\Big(1 + \sum_{n=1}^{\infty} c_n H_n(x)\Big)\phi(x)\,dx = \langle 1, 1\rangle + \sum_{n=1}^{\infty} c_n\langle H_n, 1\rangle = \langle 1, 1\rangle = 1, \qquad (27)$$

where $\langle\cdot,\cdot\rangle$ denotes the inner product under the measure $d\nu$. This implies that the probability integrates to 1 exactly, even if we approximate the pdf using a finite sample and a finite number of terms. However, it might happen that the approximation becomes negative in the tails; in such a case one needs to set the probability to zero there, and consequently conservation of probability is lost.

In the generalized expansion the above result is no longer necessarily true. Can we add a function to the expansion that enforces this constraint on the probability? Let us call this function $\Psi(x)$ and consequently add a term $\psi\,\Psi(x)\,\phi(x)$ to the expansion,

$$p_R(x) = \Big\{\sum_{n=0}^{R} d_n H_n(\alpha x) + \psi\,\Psi(x)\Big\}\phi(x), \qquad (28)$$

where the index $R$ indicates that we are approximating the pdf by taking $R$ orders of Hermite polynomials into account. This term should compensate the first term such that the probability always integrates to one.

That is, we require

$$\int_{-\infty}^{\infty}\psi\,\Psi(x)\,\phi(x)\,dx = 1 - \sum_{n=0}^{R} d_n\int_{-\infty}^{\infty} H_n(\alpha x)\,\phi(x)\,dx \qquad (29)$$
$$= 1 - \sum_{n=0}^{R} n!\,d_n a_n, \qquad (30)$$

where

$$a_n = \frac{(n-1)!!}{n!}\,(\alpha^2 - 1)^{n/2}\,\delta_{n,2k}, \qquad k \in \{0, 1, 2, 3, \ldots\}, \qquad (31)$$

and $(n-1)!!$ denotes the double factorial of $(n-1)$, defined by $1\cdot 3\cdot 5\cdots(n-1)$. This, however, does not constrain $\Psi(x)$ very much; it just implies that it should be $L_1$ integrable when multiplied by a Gaussian. Secondly, we want $\Psi(x)$ to be orthogonal to all Hermite polynomials $H_n(\alpha x)$, $n = 1\ldots R$, under the measure $d\nu_\alpha$. For reasons that become clear later, we will consider $\Psi(x)$ as the $(R+K)$'th term in our expansion, where $K = 1$ if $R$ is odd and $K = 2$ if $R$ is even (so $R + K$ is always even). To be compatible with the orthogonality relation (18) we set

$$\int_{-\infty}^{\infty}\Psi(x)^2\,d\nu_\alpha = (R+K)!\,. \qquad (32)$$

Using this, we may calculate $\psi$ as

$$\psi = \frac{1}{(R+K)!}\int_{-\infty}^{\infty} p(x)\,\Psi(x)\,\phi^{-1}(x)\,d\nu_\alpha. \qquad (33)$$

Thirdly, we would like to impose that the calculation of $\psi$ does not involve taking another sample average apart from the $d_n$ that need to be determined anyway. This narrows our choice for $\Psi(x)$ down to a linear combination of the scaled Hermite polynomials $H_n(\alpha x)$ together with the function $\frac{\phi(x)}{\phi(\alpha x)}$. This last function may be considered as the analogue of "1" in the generalized expansion, since inserting it in (33) produces precisely the probability constraint. If we use as an ansatz an arbitrary linear combination of these functions, and try to fulfil all constraints discussed above, we quickly realize that $\Psi(x)$ has to be given by

$$\Psi(x) = \frac{\sqrt{(R+K)!}}{\sqrt{\frac{1}{\alpha\sqrt{2-\alpha^2}} - \sum_{n=0}^{R} n!\,a_n^2}}\left(\frac{\phi(x)}{\phi(\alpha x)} - \sum_{n=0}^{R} a_n H_n(\alpha x)\right). \qquad (34)$$

For $\psi$ we find

$$\psi = \frac{1}{\sqrt{(R+K)!}}\;\frac{1 - \sum_{n=0}^{R} n!\,a_n d_n}{\sqrt{\frac{1}{\alpha\sqrt{2-\alpha^2}} - \sum_{n=0}^{R} n!\,a_n^2}}, \qquad (35)$$

where we can use the estimates $\hat d_n$ for $d_n$.² Note that $\Psi(x)$ is only square integrable under the measure $d\nu_\alpha$ for $\alpha^2 < 2$. This is the reason that we chose the values of $\alpha^2$ to lie between 1 and 2 in the previous section. One might wonder what happens to this correction term in the limit $\alpha \to 1$. In appendix C we prove the following property:³

$$\lim_{\alpha\to 1}\psi = c_{R+K}, \qquad \lim_{\alpha\to 1}\Psi(x) = H_{R+K}(x).$$

This important fact tells us that the correction term becomes the $(R+K)$'th term of the usual Gram-Charlier expansion when $\alpha \to 1$. So alternatively, we can take the point of view that $\psi$ is an independent parameter to be estimated from the data; $d_0$ can then be calculated from the parameters $(d_1, \ldots, d_R, \psi)$. In the limit $\alpha = 1$, $d_0$ becomes exactly unity.

²In these derivations we made use of the following integral: $\int_{-\infty}^{\infty}\frac{\phi^2(x)}{\phi(\alpha x)}\,dx = \frac{1}{\alpha\sqrt{2-\alpha^2}}$. (36)

³Notice that $\Psi(x)$ is even, which is a reason why it could never equal an odd Hermite polynomial in the limit $\alpha\to 1$.
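For completeness, here is a sketch (an assumed implementation, with the symbols $\Psi$, $\psi$ as reconstructed above) of the two ingredients of the correction term: the coefficients $a_n$ of eq. (31) and the normalization $\psi$ of eq. (35):

```python
# The probability-constraint correction: a_n of eq. (31), psi of eq. (35).
import numpy as np
from math import factorial

def double_factorial(n):
    # n!! with the convention (-1)!! = 1
    return 1 if n <= 0 else n * double_factorial(n - 2)

def a_coeffs(R, alpha2):
    # a_n = (n-1)!!/n! (alpha^2 - 1)^{n/2} for even n, zero for odd n
    return np.array([double_factorial(n - 1) / factorial(n)
                     * (alpha2 - 1.0) ** (n // 2) if n % 2 == 0 else 0.0
                     for n in range(R + 1)])

def psi_coefficient(d, alpha2):
    # eq. (35); d holds the estimated d_0..d_R, and alpha^2 must lie in (1, 2)
    R = len(d) - 1
    K = 1 if R % 2 == 1 else 2
    a = a_coeffs(R, alpha2)
    nfact = np.array([float(factorial(n)) for n in range(R + 1)])
    denom = 1.0 / np.sqrt(alpha2 * (2.0 - alpha2)) - np.sum(nfact * a ** 2)
    return (1.0 - np.sum(nfact * a * d)) / (np.sqrt(float(factorial(R + K)))
                                            * np.sqrt(denom))
```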

The derivation of $\Psi(x)$ could also have been presented in a slightly different way: first realize that $\frac{\phi(x)}{\phi(\alpha x)}$ is the analogue of "1", and subsequently use a Gram-Schmidt orthogonalization procedure to find $\Psi(x)$. This point of view illustrates what the correction term looks like in the multivariate case: we would need to orthogonalize $\frac{\phi(\mathbf{x})}{\phi(\alpha\mathbf{x})}$ with respect to all combinations of Hermite polynomials for which the orders add up to $R$, e.g. $H_2(x)H_4(y)H_1(z)\ldots$. Because the derivation is not fundamentally different (no extra formulas are needed) and the bookkeeping is somewhat messy, we will not pursue it here.

5 Robustness of the Generalized Estimators

In this section we examine the robustness of our generalized expansion in the context of robust statistics. Many definitions in this section are taken from (Hampel et al., 1986). First we establish that the estimators $\hat d_n$ are Fisher consistent, a property required for the calculation of the influence function. This influence function measures the sensitivity of the estimates to adding an additional observation to the dataset. From there, we can calculate various robustness measures and establish that our generalized estimators have become "B-robust". The relation to so-called M-estimators is also explored.

First let us recall our model density, $p_R(x)$, given by equation (28). It depends on the finite set of parameters $d_0 \ldots d_R$ (or equivalently $d_1, \ldots, d_R, \psi$). Also recall that we have defined estimators $\hat d_n$ by (20). We will show that these estimators are Fisher consistent, which is to say that at the model density the estimator asymptotically ($N \to \infty$) measures the right quantity. This definition is similar to the notion of asymptotic unbiasedness. Fisher consistency is defined by

$$d_n^{R,\infty}(p_R) = d_n. \qquad (37)$$

In our case this gives

$$d_n^{R,\infty}(p_R) = \frac{1}{n!}\int p_R(x)\,H_n(\alpha x)\,\frac{\phi(\alpha x)}{\phi(x)}\,dx \qquad (38)$$
$$= \frac{1}{n!}\int\Big(\psi\,\Psi(x) + \sum_{m=0}^{R} d_m H_m(\alpha x)\Big)H_n(\alpha x)\,d\nu_\alpha \qquad (39)$$
$$= d_n, \qquad (40)$$

where the last line follows from the fact that the functions $\{H_n(\alpha x), \Psi(x)\}$ are orthogonal with respect to the measure $d\nu_\alpha$. This proves that the estimator is indeed Fisher consistent.⁴

Next we turn to the calculation of the influence function $IF(x; d_n, p_R)$ (see (Hampel et al., 1986)). This function measures the sensitivity of the estimators to adding one more observation to the data set. It is defined by

$$IF(x; d_n, p_R) = \lim_{t\to 0}\frac{d_n\big((1-t)p_R + t\Delta_x\big) - d_n(p_R)}{t}. \qquad (41)$$

The probability density $\Delta_x(y)$ is defined by $\Delta_x(y) = \delta(x - y)$, i.e. it is the probability density with all its mass at position $x$. In our case we find for the influence function

$$IF(x; d_n, p_R) = \frac{1}{n!}H_n(\alpha x)\,\frac{\phi(\alpha x)}{\phi(x)} - d_n. \qquad (42)$$

If this function is finite everywhere, we call the estimator $\hat d_n$ "B-robust". It means that adding one more observation at any location $x$ will not disturb the estimate beyond control. This property is expressed by the "gross-error sensitivity",

$$\gamma^*_n = \sup_x |IF(x; d_n, p_R)|. \qquad (43)$$

The crucial point to note is that for $\alpha = 1$ we find $\gamma^* = \infty$, because $H_n(x)$ blows up as $x \to \infty$. For $\alpha > 1$, however, the "squeezing factor" $\frac{\phi(\alpha x)}{\phi(x)}$ prevents this, and renders the estimator B-robust.

⁴We use a hat (e.g. $\hat d_n$) to denote an estimate from a finite sample. Since we have shown that the asymptotic estimator $d_n^{R,\infty}$ equals $d_n$, we will simply drop the hat and superscripts for asymptotic estimators in the following.
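The qualitative difference between $\alpha = 1$ and $\alpha > 1$ is easy to verify numerically; the sketch below (ours) evaluates the influence function (42) on a wide grid and approximates the gross-error sensitivity (43):

```python
# Influence function (42) and a grid approximation of gamma*_n, eq. (43).
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

def influence(x, n, d_n, alpha2):
    a = np.sqrt(alpha2)
    w = a * np.exp(-0.5 * (alpha2 - 1.0) * x ** 2)   # phi(alpha x)/phi(x)
    return hermeval(a * x, np.eye(n + 1)[n]) * w / factorial(n) - d_n

x = np.linspace(-50.0, 50.0, 200001)
print(np.max(np.abs(influence(x, 4, 0.0, 1.9))))  # bounded sup: B-robust
print(np.max(np.abs(influence(x, 4, 0.0, 1.0))))  # grows with the grid: gamma* = inf
```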

Another useful function is the "local-shift sensitivity", defined by

$$\lambda^*_n = \sup_{x\neq y}\frac{|IF(y; d_n, p_R) - IF(x; d_n, p_R)|}{|x - y|}. \qquad (44)$$

This function measures the sensitivity of the estimator to moving one observation from location $x$ to location $y$. If we consider the infinitesimal version of this, we are calculating the derivative of the influence function. By the same reasoning as above, we find $\lambda^* = \infty$ for $\alpha = 1$ and a finite $\lambda^*$ for $\alpha > 1$. Again, the problems arise for very large $x$, i.e. for outliers. The fact that our generalized expansion performs better than the classical expansion is thus explained by its greater stability in the presence of outliers.

There is yet another interesting quantity, called the "rejection point" $\rho^*$: all observations farther away than this point are thrown away. It is defined as the point beyond which the influence function vanishes everywhere. In our case this quantity is infinite, stating that no outlier is rejected completely; however, the influence of an observation on the estimator becomes less and less important for larger $x$.

Finally, we remark that our estimation problem can easily be cast into the framework of "M-estimators". This class of estimation problems is a generalization of the well known maximum likelihood estimation. Its objective is to find the $d_n$ that minimize the expression

$$\sum_{A=1}^{N}\rho(x_A; d_n). \qquad (45)$$

We take the obvious choice,

$$\rho(x_A; d_n) = \frac{1}{2}\left(\frac{1}{n!}H_n(\alpha x_A)\,\frac{\phi(\alpha x_A)}{\phi(x_A)} - d_n\right)^2. \qquad (46)$$

Taking the derivative with respect to $d_n$, we have to solve

$$\sum_{A=1}^{N}\psi(x_A; d_n) = 0, \qquad (47)$$

with $\psi(x_A; d_n) = d_n - \frac{1}{n!}H_n(\alpha x_A)\,\frac{\phi(\alpha x_A)}{\phi(x_A)}$ in our case. This $\psi$-function also has the interpretation of a kind of influence function on the data. In general there is a relationship between $\psi$ and $IF$; in our case it is easy to show, using the general formula in (Hampel et al., 1986), page 101, that the $IF$-function calculated from $\psi$ is indeed equal to (42). Usually, an M-estimation problem is solved iteratively, sweeping across all the data many times. In the present case we know the solution explicitly, $\hat d_n = \frac{1}{N}\frac{1}{n!}\sum_{A=1}^{N}H_n(\alpha x_A)\frac{\phi(\alpha x_A)}{\phi(x_A)}$, displaying the fact that the calculation needs to consider the data only once.

The multivariate case is again a straightforward generalization of the one dimensional case. The conclusion, namely that the estimates of the coefficients of the generalized multivariate expansions are B-robust for $\alpha > 1$, is equally true in this case.

6 Bias, Asymptotic Variance and Fisher Information

In this section we answer the question whether the robustification of the Gram-Charlier series introduces a larger bias or impairs the efficiency of the pdf estimation. First we express the mean square error (MSE) in terms of the bias and the asymptotic covariance matrix, which can be written in terms of the influence function. We then define a variation on the mean integrated square error (MISE) and define a quantity that is easily compared with the Fisher information to get an estimate of the efficiency of the estimation.

In (Silverman, 1986) we can find the definition of the pointwise mean square error, $MSE_x$,

$$MSE_x\big(p_R^{(N)}(x)\big) = E\big\{\big(p_R^{(N)}(x) - p(x)\big)^2\big\}. \qquad (48)$$

In this equation $p_R^{(N)}(x)$ represents a particular instance of the estimate $p_R(x)$ when using a data set of $N$ samples. We assume infinitely many of those $N$-sample data sets and define the expectation of $p_R^{(N)}$ over those data sets as $\bar p_R(x)$,

$$\bar p_R(x) = E\big\{p_R^{(N)}(x)\big\} = \Big\{\bar\psi\,\Psi(x) + \sum_{n=0}^{R}\bar d_n H_n(\alpha x)\Big\}\phi(x). \qquad (49)$$

The MSE can easily be rewritten as

$$MSE_x = E\big\{\big(p_R^{(N)}(x) - \bar p_R(x)\big)^2\big\} + \big(\bar p_R(x) - p(x)\big)^2. \qquad (50)$$

The second term is interpreted as the square of the bias. This term is independent of the sample size $N$ and can therefore not be improved upon by choosing more data points; it depends on how well we can describe a certain pdf using a projection on our first $R$ scaled Hermite polynomials. The first term is the variance of the estimate $p_R^{(N)}(x)$ at every point $x$ around its mean $\bar p_R(x)$. This term will be seen to vanish as $\frac{1}{N}$ for large $N$. Using the following formula (Hampel et al., 1986),

$$\hat d_n^{(N)} - d_n \approx \frac{1}{N}\sum_{A=1}^{N} IF(x_A; d_n, p_R), \qquad (51)$$

we can express the first term in (50) in terms of the influence function,

$$E\big\{\big(p_R^{(N)}(x) - \bar p_R(x)\big)^2\big\} = \frac{1}{N}\sum_{n,m=0}^{R}\Sigma(d_n, d_m)\,H_n(\alpha x)H_m(\alpha x)\,\phi^2(x), \qquad (52)$$

where we defined

$$\Sigma(d_n, d_m) = \int_{-\infty}^{\infty} p(x)\,IF(x; d_n)\,IF(x; d_m)\,dx. \qquad (53)$$

In the derivation of (52) we used that the samples are independent, i.e.

$$E\{IF(x_A; d_n)\,IF(x_B; d_m)\} = E\{IF(x_A; d_n)\}\,E\{IF(x_B; d_m)\} \quad \text{if } A \neq B. \qquad (54)$$

Finally, we used that

$$E\{IF(x_A; d_n)\} = \int_{-\infty}^{\infty} dx\,p(x)\,IF(x; d_n) = 0. \qquad (55)$$

So we may conclude that the MSE at every point $x$ is determined by an $N$-independent bias term and a $\frac{1}{N}$ covariance term. The covariance term is completely determined by the asymptotic covariance matrix, which is written in terms of the influence function calculated in the previous section. A convenient measure for the size of $\Sigma$ is the total variation $\mathrm{Trace}(\Sigma)$. If we calculate the integral of the MSE using the measure $\frac{\phi(\alpha x)}{\phi(x)^2}\,dx$, then (52) is seen to be precisely proportional to this total variation.

Next we recall an important theorem from estimation theory.

Cramér-Rao inequality: The covariance matrix of a set of Fisher consistent estimators, $\Sigma$, is lower bounded by the (matrix) inverse of the Fisher information matrix $J$: $\Sigma(d_n, d_m) \geq J^{-1}(d_n, d_m)$, where the probability density $p_R(x)$ depends on the parameters $d_0, \ldots, d_R$ (or equivalently $d_1, \ldots, d_R, \psi$).

The definition of the Fisher information is given by⁵

$$J(d_n, d_m) = E\left\{\frac{1}{p(x)}\frac{d}{dd_n}p_R(x)\;\frac{1}{p(x)}\frac{d}{dd_m}p_R(x)\right\}_p \qquad (56)$$
$$= \int_{-\infty}^{\infty}\frac{H_n(\alpha x)\,H_m(\alpha x)\,\phi(x)^2}{p(x)}\,dx. \qquad (57)$$

⁵Actually, the asymptotic covariance and Fisher information are usually defined slightly differently, by replacing all $p(x)$ in (53) and (57) by $p_R(x)$. That the matrix inverse of (57) constitutes a lower bound on (53) is proven by using $s(x; d_n) = \frac{1}{p(x)}\frac{d}{dd_n}p_R(x)$ as a modified score function in the usual proof. Notice that we have $E\{s(x)\}_p = 0$ and $E\{s(x; d_n)\,\hat d_m\}_p = \delta_{nm}$, where we used $E\{\hat d_m\}_p = d_m$.
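Both traces in the Cramér-Rao bound can be evaluated by numerical quadrature for any test density; the sketch below is our own setup, using a unit-variance Laplace density as an assumed test pdf, with $\Sigma$ from eqs. (42) and (53) and $J$ from eq. (57):

```python
# Trace(Sigma) versus Trace(J^{-1}) by quadrature, eqs. (53) and (57).
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval
from scipy.integrate import quad

alpha2, R = 1.5, 4
a = np.sqrt(alpha2)

p = lambda x: np.exp(-np.sqrt(2.0) * np.abs(x)) / np.sqrt(2.0)  # Laplace, var 1
phi = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
H = lambda x, n: hermeval(a * x, np.eye(R + 1)[n])
w = lambda x: a * np.exp(-0.5 * (alpha2 - 1.0) * x ** 2)        # phi(ax)/phi(x)

d = [quad(lambda x: p(x) * H(x, n) * w(x), -np.inf, np.inf)[0] / factorial(n)
     for n in range(R + 1)]                                     # eq. (19)
IF = lambda x, n: H(x, n) * w(x) / factorial(n) - d[n]          # eq. (42)

Sigma = [[quad(lambda x: p(x) * IF(x, n) * IF(x, m), -np.inf, np.inf)[0]
          for m in range(R + 1)] for n in range(R + 1)]         # eq. (53)
J = [[quad(lambda x: H(x, n) * H(x, m) * phi(x) ** 2 / p(x), -np.inf, np.inf)[0]
      for m in range(R + 1)] for n in range(R + 1)]             # eq. (57)

print(np.trace(np.array(Sigma)), np.trace(np.linalg.inv(np.array(J))))
# Cramer-Rao: the first number is bounded below by the second.
```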

[Figure 1: Bias as a function of $\alpha^2$ for the exponential density with (a) a = 1.5 and (b) a = 4.]
[Figure 2: Asymptotic variance (solid line) and inverse Fisher information (dashed line) as a function of $\alpha^2$ for the exponential density with (a) a = 1.5 and (b) a = 4.]

Using this theorem we can trivially show that

$$\mathrm{Trace}(\Sigma) \geq \mathrm{Trace}(J^{-1}). \qquad (58)$$

In figures 1 and 2 we have plotted the bias, the asymptotic variance and the inverse Fisher information as a function of $\alpha^2$ for two different pdfs. Both densities are shown in figures 3c and 4a, and are explained in more detail in section 8. One pdf ($a = 1.5$) has positive kurtosis (super-Gaussian) and long tails; the other ($a = 4$) has negative kurtosis (sub-Gaussian) and is more uniformly distributed. In both cases we observe that the bias and the asymptotic variance decrease. The inverse Fisher information serves as a lower bound on the asymptotic variance. The decrease in bias and variance is more dramatic for the super-Gaussian density, in accordance with the general conclusion that the new expansion is more robust. The conclusion from this section must not be that the bias and asymptotic variance will decrease for all probability densities. However, we do expect that for pdfs with many outliers the generalized expansion performs superior to the classical expansion. Also, all considerations in this section are "asymptotic" in the sense that $N \to \infty$. In the next section we will see that when we try to estimate a pdf from a finite dataset, the classical expansion is often unstable in the presence of outliers. Also in this sense, the robust expansion is a significant improvement over the classical expansion.

7 Projection Pursuit and Independent Component Analysis

The objective of projection pursuit (PP) (Friedman, 1987) is to find "interesting" low-dimensional projections in high dimensional data sets, where "interesting" is defined as different from normally distributed. Many measures are available to define the distance between a pdf and a Gaussian density.

In (Huber, 1981) we can find measures based on cumulants, the L2-norm (see also (Cook et al., 1993)), standardized Fisher information, and negentropy (see also (Hyvärinen, 1997)). Any measure (or contrast function) should be positive definite, minimized by a Gaussian distribution, and maximally discriminative (i.e. ideally, no two different densities should produce the same value). In the following we argue that the generalized cumulants $\tilde\kappa_n^{(\alpha)}$, or the coefficients $d_n$, are well suited for building robust contrast functions. This is mainly due to the fact that these coefficients are both robust and have easy transformation properties with respect to rotations.

Suppose we want to find a one dimensional projection of the data which maximizes a contrast function. We rotate the data in this high dimensional space and check its contrast on some arbitrary axis (say the x-axis). This implies that we project the data onto this axis, estimate the marginal probability density, and calculate the distance between the normal distribution and this marginal distribution using our favorite measure. Imagine that we have calculated all coefficients $d_{i_1\ldots i_n}$ up to a fixed order $n$. The marginal distribution along the x-axis can be estimated using the coefficients $d_x, d_{xx}, d_{xxx}, \ldots$ up to order $n$. For convenience we denote these marginal coefficients by $d_{x^n}$ for the $n$'th order coefficient along the x-axis. To align the direction of largest contrast with the x-axis, we need to be able to rotate these indices. But we have already observed that the $d_{i_1\ldots i_n}$ form an invariant subspace under rotations for any fixed $n$, so it is relatively easy to calculate the new set of coefficients $d'_{x\ldots x}$ after a rotation,

$$d'_{x\ldots x} = O_{x,j_1}\ldots O_{x,j_n}\,d_{j_1\ldots j_n}. \qquad (59)$$

Of course, the number of coefficients grows exponentially with every new level $n$. We may however use theorem 17 from Comon's paper on independent component analysis (Comon, 1994) to show that we can restrict the rotations to two dimensional subspaces and perform an efficient search using a Jacobi algorithm. This greatly reduces the number of coefficients to be calculated. From the above we can conclude that it is advantageous to search for a contrast function defined in terms of the coefficients $d_{x^n}$.

Let us define the following index using the L2 norm,

$$I = \int_{-\infty}^{\infty}\left(\frac{p(x)}{\phi(x)} - 1\right)^2 d\nu_\alpha. \qquad (60)$$

Clearly, this contrast is positive definite and vanishes for $p(x) = \phi(x)$. It can be rewritten as

$$I = \int_{-\infty}^{\infty}\Big(\sum_{n=0}^{\infty} d_{x^n} H_n(\alpha x) - 1\Big)^2 d\nu_\alpha \qquad (61)$$
$$= (d_{x^0} - 1)^2 + \sum_{n=1}^{\infty} n!\,(d_{x^n})^2. \qquad (62)$$

For practical purposes we need to break the series off at a certain order $R$. Also, the first term is strictly scalar and will not change when we rotate the dataset; it is just an additive constant which can be removed from the contrast function. We can also weight the different orders by a weighting factor $w_n \geq 0$. This is possible because for a Gaussian all orders have to vanish independently (since the contrast is a sum of positive squares there can be no cancellations). Assigning different weights to different orders gives one the flexibility to stress a certain quality in the pursuit: if one is interested, for instance, in skewness, one might want to stress $d_{x^3}$ more than the other coefficients. The contrast function that we propose is thus given by

$$I_1 = \sum_{n=1}^{R} n!\,w_n\,(d_{x^n})^2, \qquad w_n \geq 0. \qquad (63)$$

Because the cumulants share the tensorial property (prop. II), they are equally suitable for defining robust contrast functions (see (Huber, 1981)). Using our generalized cumulants, we can define the following contrast function,

$$I_2 = \sum_{n=1}^{R} w_n\,\big(\tilde\kappa_n^{(\alpha)}\big)^2, \qquad w_n \geq 0, \qquad (64)$$

where we wrote $\tilde\kappa_n^{(\alpha)}$ for $\tilde\kappa^{(\alpha)}_{x\ldots x}$.
Recall that the only difference between $\tilde\kappa_n^{(\alpha)}$ and $\kappa_n^{(\alpha)}$ lies in the second order cumulant: $\tilde\kappa^{(\alpha)}_{xx} = \kappa^{(\alpha)}_{xx} - 1$. Property II again allows us to perform an efficient search for the optimal projection using a Jacobi algorithm. Property I guarantees that the contrast vanishes for a Gaussian distribution. Moreover, the contrast is evidently positive definite. The weighting factors allow for a more directed search towards predefined qualities (e.g. a search for high or low kurtosis would imply a large $w_4$).

A related issue in data analysis is to find combinations of random variables that are as independent as possible (independent component analysis, or ICA). It can be shown that after sphering the data, only a rotation remains to find the ICA solution. This makes the coefficients $d_{i_1\ldots i_n}$ and the generalized cumulants again potentially interesting building blocks for contrast functions. In his original treatment of the problem, Comon used a cumulant based expansion of the marginal pdfs to estimate the mutual information between the random variables (Comon, 1994). The main idea behind almost all ICA algorithms is to minimize this mutual information (directly or indirectly). Comon noticed that one serious drawback of the cumulant based expansion is that beyond fourth order the cumulants, as estimated from samples, become unstable due to the large influence of outliers. Any contrast function built from our robust cumulants would not suffer from this shortcoming. In the following we exploit the third property (III) proved for the generalized cumulants, namely that for independent random variables all cross-cumulants vanish. A first guess would therefore be to define a contrast function (to be minimized) as the sum over all cross-cumulants. Unfortunately, the number of cross-cumulants grows exponentially with the order $n$ of the expansion. However, if we observe that $\sum_{i_1=1}^{M}\ldots\sum_{i_n=1}^{M}\big(\tilde\kappa^{(\alpha)}_{i_1\ldots i_n}\big)^2$ is a scalar, we can derive that maximizing the sum of the marginal cumulants is equivalent to minimizing the sum of the cross-cumulants. We therefore define

$$I_2' = \sum_{n=1}^{R}\sum_{i=1}^{M} w_n\,\big(\tilde\kappa^{(\alpha)}_{i\ldots i}\big)^2, \qquad w_n \geq 0. \qquad (65)$$

This contrast is obviously positive definite. Gaussian directions give a zero contribution to (65), due to property I. Property II can again be exploited to perform an efficient search over all possible rotations using a Jacobi algorithm. The weights $w_n$ are included to be able to emphasize certain qualities in the solution; notice that including the $w_n$ does not destroy any of the above arguments, because all cumulants of a given order $n$ form an invariant subspace with respect to rotations.

The similarity between the contrast functions $I_2$ and $I_2'$ is quite evident. Indeed, the only difference between projection pursuit and independent component analysis really lies in the number of dimensions one is interested in: projection pursuit searches for one dimensional subspaces, while independent component analysis tries to find a new basis in which the data are represented as independently as possible. Hybrid situations are thinkable, where we look for a $D$ dimensional subspace of interesting projections or independent components. In that case we propose to use the same contrast function as (65), but with $M$ replaced by $D$.
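As an illustration of the projection pursuit contrast in the simplest setting, the following sketch (ours; in 2-D an exhaustive angle scan replaces the Jacobi algorithm, and all weights are set to one) maximizes $I_1$ of eq. (63) over one dimensional projections:

```python
# Toy 2-D projection pursuit with the robust contrast I_1 of eq. (63).
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

def robust_d(x, n, alpha2=1.9):
    a = np.sqrt(alpha2)
    w = a * np.exp(-0.5 * (alpha2 - 1.0) * x ** 2)
    return np.mean(w * hermeval(a * x, np.eye(n + 1)[n])) / factorial(n)

def contrast_I1(x, R=4, alpha2=1.9):
    # I_1 = sum_n n! w_n d_n^2 with all weights w_n = 1 here
    return sum(factorial(n) * robust_d(x, n, alpha2) ** 2 for n in range(1, R + 1))

def best_direction(data, n_angles=180):
    # data: (N, 2) sphered samples; scan projections onto rotated x-axes
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    scores = [contrast_I1(data @ np.array([np.cos(t), np.sin(t)])) for t in thetas]
    return thetas[int(np.argmax(scores))]

# Example: a super-Gaussian source hidden along a 30-degree direction.
rng = np.random.default_rng(1)
s = np.column_stack([rng.laplace(size=4000) / np.sqrt(2.0),
                     rng.standard_normal(4000)])
t0 = np.pi / 6.0
Rot = np.array([[np.cos(t0), -np.sin(t0)], [np.sin(t0), np.cos(t0)]])
print(best_direction(s @ Rot.T))   # should come out close to pi/6
```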

[Figure 3: Probability densities from the exponential family with (a) a = 0.5, (b) a = 1 and (c) a = 1.5. All densities are super-Gaussian. Notice that the smaller the value of a, the more outliers are present.]

[Figure 4: (a) Exponential density with a = 4 (sub-Gaussian). Mixture of Gaussians with (b) μ = 0.3, c = 3, d = 0 and (c) μ = 0.5, c = 3, d = 2.]

8 Experiments

In this section we discuss two experiments which were done to show the improved performance of the robust expansion over the classical expansion. For the first experiment we used sound data (figure 6c), downloaded from the web.⁶ In the second experiment we used artificial data; access to the true density allowed us to evaluate the performance of the robust expansion in L2 norm. The following probability densities were used.

Exponential density (figures 3a,b,c and 4a):

$$p_{ex}(x) = \frac{a\,b^{1/a}}{2\,\Gamma(1/a)}\,e^{-b|x|^a}. \qquad (66)$$

The parameter $a$ determines the shape of the pdf. For values of $a$ smaller than 2 we find super-Gaussian distributions, with progressively more outliers as $a$ approaches 0. For $a = 2$ we reproduce a Gaussian, and for $a > 2$ we find sub-Gaussian distributions with shorter tails. The parameter $b$ determines the variance. In our simulations we assume zero-mean and unit-variance pdfs, so we choose $b = 1$.

Mixture of two Gaussians (figures 4b,c):

$$p_{mog}(x) = \mu\,a\,\phi(ax + b) + c\,(1 - \mu)\,\phi(cx + d). \qquad (67)$$

Here $\mu$ determines which fraction of the points is assigned to the first mixture component and which fraction to the second. The parameters $a, b, c, d$ determine location and scale of the Gaussians. Fixing the overall mean to be zero and the variance to be unity, we can express two parameters in terms of the two others,

$$a = \frac{\mu c}{\sqrt{\mu c^2 - (1-\mu)d^2 - \mu(1-\mu)}}, \qquad (68)$$
$$b = \frac{-a\,d\,(1-\mu)}{c\,\mu}. \qquad (69)$$

Experiment 1

In the first experiment we used sound recordings to show that the classical expansion can become unstable in the presence of outliers. It is well known that natural sound is usually distributed according to a super-Gaussian density. A closer look at figure 6c reveals many data points located far away from the mean. The claim is that these outliers "confuse" the classical expansion, making it unstable. This effect is depicted in figure 5. Plot (a) presents the values of the expansion coefficients, $c_n$, estimated from the 5000 samples. It is evident that the series diverges, i.e. it has become unstable. Plot (b) is the estimate for the density after the fourth order term has been taken into account. We can see that the instability already sets in at this order, because the pdf becomes negative (which is of course not allowed). Needless to say, higher orders only make the estimate worse.

⁶The recordings can be found at http://sweat.cs.unm.edu/bap/demos.html
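For reproducibility, samplers for the two test families are straightforward; the sketch below is our own (the gamma-transformation trick for eq. (66) and the component-wise construction for eq. (67) are assumptions of this illustration, not described in the paper):

```python
# Samplers for the test densities (66) and (67).
import numpy as np

def sample_exponential_family(N, a, rng=None):
    # p(x) proportional to exp(-|x|^a) (eq. (66) with b = 1): |x|^a ~ Gamma(1/a)
    rng = rng or np.random.default_rng(0)
    g = rng.gamma(shape=1.0 / a, scale=1.0, size=N)
    sign = np.where(rng.random(N) < 0.5, -1.0, 1.0)
    return sign * g ** (1.0 / a)

def sample_mog(N, mu, c, d, rng=None):
    # eq. (67); a and b are fixed by zero mean / unit variance, eqs. (68)-(69)
    rng = rng or np.random.default_rng(0)
    a = mu * c / np.sqrt(mu * c ** 2 - (1.0 - mu) * d ** 2 - mu * (1.0 - mu))
    b = -a * d * (1.0 - mu) / (c * mu)
    z = rng.standard_normal(N)
    pick = rng.random(N) < mu
    # component k contributes a Gaussian with mean -b/a (-d/c) and std 1/a (1/c)
    return np.where(pick, (z - b) / a, (z - d) / c)
```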

[Figure 5: Density estimation of sound data using the classical expansion (α = 1). In (a) we can see that the coefficients grow with the order of the approximation, i.e. the series diverges. In (b) we show the estimate after four orders have been taken into account. The negative tails signal the onset of the divergence.]

[Figure 6: Density estimation of sound data using the robust expansion at α² = 1.8. In (a) we can see that the coefficients decrease with the order of the approximation, i.e. the series converges. In (b) we show the estimate after ten orders. (c) shows the histogram of the sound data (5000 samples).]

In figure 6 we used the value $\alpha^2 = 1.8$ in the robust expansion. Because the values of the expansion coefficients $d_n$ decrease, we may conclude that the series converges. The estimate of the pdf after ten orders is depicted in plot (b). The downweighting of the outliers has made the series expansion stable.

Experiment 2

In the second experiment we measured the L2 distance between the true pdf and the estimated pdf as a function of the tuning parameter $\alpha^2$ (figures 7, 8). We plotted both the L2 deviation for the highest order estimate (solid line) and for the best estimate (dashed line). In all experiments we used 5000 samples, but the qualitative results do not depend much on the precise number of data points. In most experiments we used 20 Hermite polynomials, except for the exponential density at $a = 1.5$ and $a = 4$, where we used 10 Hermite polynomials to be compatible with the results on asymptotic bias and variance in section 6. The main conclusion from these experiments is: for long-tailed (super-Gaussian) distributions, the generalized (robust) series expansion performs superior to the classical expansion ($\alpha = 1$). The best value for $\alpha^2$ is around 1.9. For the exponential density at $a = 0.5$ and $a = 1$, the classical expansion does not converge; the same phenomenon is observed for the mixture of Gaussians at $\mu = 0.3$, $c = 3$, $d = 0$. The explanation is that the series becomes unstable due to the presence of outliers. To show this in more detail, we plotted the L2 difference between the estimated and true density as a function of the order of the estimate in figure 9, for the exponential density at $a = 1$.
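A compact version of this experiment fits in a few lines; the sketch below is our own re-run under assumed settings (the $a = 1$ member of (66) is the Laplace density, here standardized to unit variance), contrasting the classical and robust fits in L2 norm:

```python
# Miniature of experiment 2: L2 error of the fit at alpha^2 = 1 versus 1.9.
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(0)
x = rng.laplace(scale=1.0 / np.sqrt(2.0), size=5000)     # a = 1, unit variance

def fit_pdf(samples, R, alpha2):
    a = np.sqrt(alpha2)
    w = a * np.exp(-0.5 * (alpha2 - 1.0) * samples ** 2)
    d = np.array([np.mean(w * hermeval(a * samples, np.eye(R + 1)[n]))
                  / factorial(n) for n in range(R + 1)])
    return lambda t: hermeval(a * t, d) * np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

t = np.linspace(-8.0, 8.0, 4001)
truth = np.exp(-np.sqrt(2.0) * np.abs(t)) / np.sqrt(2.0)
for alpha2 in (1.0, 1.9):
    err = np.sum((fit_pdf(x, 10, alpha2)(t) - truth) ** 2) * (t[1] - t[0])
    print(alpha2, err)   # the robust fit should give the smaller L2 error
```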

[Figure 7: L2 distance between true and estimated density: best estimate over all orders (dashed line) and highest order estimate (solid line), as functions of α². The pdfs are depicted in figure 3. In (a) the exponential density at a = 0.5 with 20 orders in the expansion; in (b) a = 1 with 20 orders; in (c) a = 1.5 with 10 orders. In all cases 5000 samples were used. Notice that the improvement is most dramatic at low values of a, i.e. highly kurtotic distributions.]

[Figure 8: L2 distance between true and estimated density: best estimate over all orders (dashed line) and highest order estimate (solid line), as functions of α². The pdfs are depicted in figure 4. In (a) the exponential density at a = 4 with 10 orders in the expansion; in (b) the mixture of Gaussians with μ = 0.3, c = 3, d = 0 and 20 orders; in (c) the mixture of Gaussians with μ = 0.5, c = 3, d = 2 and 20 orders. In all cases 5000 samples were used.]

[Figure 9: In (a) and (b) we plot the L2 distance between the true and estimated density for the exponential density at a = 1. In (a) we used the classical expansion (α = 1); the series has difficulty converging. In (b) we used the robust expansion at α² = 1.9 and observe that the series converges. (c) plots the total integrated probability of the estimated density (a = 1) as a function of α². The uncorrected density (dashed line) deviates significantly from 1 as we move away from α² = 1. The correction term solves this problem between α² = 1 and α² ≈ 1.9; beyond that, the correction term becomes unreliable (see section 4).]

When $\alpha^2 = 1$, i.e. the classical case, the series fails to converge, while for $\alpha^2 = 1.9$ the series converges nicely. This effect is even worse for $a = 0.5$, in which case the series diverges exponentially for $\alpha^2 = 1$. We also show the total integrated probability as a function of $\alpha^2$, which should equal one by definition. One can observe that, without the correction term discussed in section 4, the total probability starts to deviate from unity as we move away from the classical case, $\alpha^2 = 1$. The correction term becomes unreliable close to $\alpha^2 = 2$.

In figure 10 we compare the final densities estimated using the classical expansion on the one hand, and the robust expansion on the other. The figure shows the best estimate (in L2 norm) over all orders for the exponential density at $a = 1$. From figure 9 we can see that the best classical estimate is found after four orders, while the best robust estimate is found after twenty orders. Using more orders in the expansion obviously results in a more precise estimate of the pdf.

9 Conclusions

Moments and cumulants are well known statistical tools with a wide range of applications. Their popularity is mainly due to their simplicity and intuitive interpretation. Moreover, cumulants have some nice properties, discussed in sections 2 and 3, the most prominent one being an easy transformation rule with respect to rotations (property II). Although less known, the coefficients of the Gram-Charlier expansion share many of these properties with cumulants. All these statistical objects, however, have one definite drawback: their calculation is dominated by samples far away from the mean. This causes the Gram-Charlier and the Edgeworth expansion to become unreliable after approximately four orders. This sensitivity to outliers is often called non-robustness. The aim of this paper was to show that there is indeed a generalization of both the Gram-Charlier and Edgeworth expansions that is less sensitive to outliers and does not destabilize after four orders. Most importantly, the expansion coefficients of these robust series expansions still enjoy the nice properties associated with their classical counterparts.

After an introduction to the classical Gram-Charlier and Edgeworth expansions in section 2, we defined the generalized expansion in section 3. In section 4 we included an extra term in the expansion to enforce conservation of probability, i.e. to ensure that the total probability integrates to one. In section 5 we proceeded to show that the coefficients of the generalized expansion are B-robust, a measure from robust statistics evaluating sensitivity to outliers. Other quantities, like the "gross-error sensitivity" and the "local-shift sensitivity", were also calculated and discussed in that section, and a connection with M-estimators was explained. Section 6 studied the mean square error of the new estimators.
It is shown that the MSE can be decomposedinto a bias and a asymptotic variance term, and that for long tailed distribution we can expect both the biasand the asymptotic variance to decrease. For more uniform distributions the e�ect is much less pronounced.As an application, we derive novel (robust) projection pursuit and independent component analysis contrast17


[Figure 10 about here: two panels plotting p(x) against x on [-10, 10]; panel (a) is titled "Best estimate for exponential density (a = 1, α² = 1)" and panel (b) "Best estimate for exponential density (a = 1, α² = 1.9)".]

Figure 10: Best estimates for the exponential density at a = 1. In (a) we plot the best classical estimate, which is found when four orders of Hermite polynomials are taken into account (i.e. H_0(x), ..., H_4(x)). For higher orders the series becomes unstable, i.e. the calculation of the expansion coefficients is too sensitive to sample fluctuations; outliers in particular disturb the estimation. The best estimate from the robust expansion is depicted in (b). In that case the best estimate is found when all twenty orders are taken into account.

The final section (8) contains two experiments to convince the reader of the usefulness of the generalized expansion. The first experiment uses sound data to show that the classical series diverges after three orders; the generalized expansion, in contrast, is able to accurately estimate the pdf. In the second series of experiments we used artificial data to assess the performance of the generalized expansion as a function of the parameter α². The L2 norm was used to measure the difference between the true pdf and the estimated pdf. Further evidence was found that the robust expansions perform favourably compared with the classical expansion when the pdf has long tails. The best value for α² was found to lie around 1.9.

In the following we list the advantages and disadvantages of the generalized expansion.

adv. a: The expansion is relatively insensitive to the presence of outliers. It therefore performs well for probability densities with long tails.

adv. b: The generalized cumulants share some nice properties with the classical ones, namely:
  I: For a Gaussian density all cumulants, except the second order one, vanish.
  II: For independent random variables all cross-cumulants vanish.
  III: Cumulants transform as tensors with respect to rotations.

adv. c: The sample estimates of the expansion coefficients are non-iterative, i.e. they are evaluated by sweeping through the data only once (see the sketch below).

disadv. a: The estimated pdf is not automatically positive everywhere (true for both the classical and the generalized expansion).

disadv. b: We have no proof that the generalized expansion is an expansion in M^{(2-n)/2}, where M is the number of dimensions and n the order of the expansion.

We would like to emphasize that in many of the current applications of moments and cumulants we can use the robust counterparts defined in this paper. This is due to the fact that most properties of the moments and cumulants are preserved in the generalization. For instance, because the robust cumulants transform easily with respect to rotations, we can define invariants with respect to translation, scale and rotation with relative ease (see footnote 7). Another application was discussed in section 7, where novel robust contrast functions for projection pursuit and independent component analysis were constructed. In general, we suspect that the robust cumulants can be useful in many interesting applications.

Footnote 7: These invariants could be constructed as follows. Remove the mean and variance to accomplish invariance with respect to translation and scale. Then define contractions of the higher order cumulants with Kronecker delta functions (δ_ij), Levi-Civita tensors (ε_ijk) or other cumulants, to generate invariants with respect to rotations. The advantage of using our generalized cumulants is obviously that we generate robust invariants.
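To make adv. c concrete, the following is a minimal sketch of the truncated robust expansion in Python. It is ours, not part of the paper: the function and parameter names are our own, it assumes the data have already been standardized to zero mean and unit variance, it omits the probability-conserving correction term of section 4, and the factor α in the coefficients is our normalization reconstructed from the orthogonality relation (87). SciPy's hermitenorm gives the probabilists' Hermite polynomials of appendix A.

```python
import numpy as np
from math import factorial
from scipy.special import hermitenorm  # probabilists' Hermite polynomials H_n

def robust_gc_density(samples, alpha2=1.9, order=20):
    """Truncated robust expansion p(x) ~ phi(x) sum_n d_n H_n(alpha x).

    Each coefficient d_n is a single weighted sample average, so the
    data are swept through only once per coefficient (non-iterative).
    """
    x = np.asarray(samples, dtype=float)
    alpha = np.sqrt(alpha2)
    # Robust weight phi(alpha x)/phi(x): decays for alpha^2 > 1, so samples
    # far from the mean barely influence the coefficients.
    w = np.exp(-0.5 * (alpha2 - 1.0) * x**2)
    # The factor alpha follows from
    # int H_n(a x) H_m(a x) phi(a x) dx = (n!/a) delta_{nm}.
    d = [alpha * np.mean(hermitenorm(n)(alpha * x) * w) / factorial(n)
         for n in range(order + 1)]

    def pdf(t):
        t = np.asarray(t, dtype=float)
        phi = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)
        return phi * sum(d[n] * hermitenorm(n)(alpha * t)
                         for n in range(order + 1))

    return pdf
```

Evaluating robust_gc_density(data, alpha2=1.9, order=20) on a grid allows L2 comparisons in the spirit of figures 9 and 10; with alpha2 = 1.0 the same code reduces to the classical Gram-Charlier estimate.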


A Hermite Polynomials

This appendix contains some properties of the Hermite polynomials. The Hermite polynomials are defined by

(-1)^n \frac{d^n}{dx^n} \phi(x) = H_n(x) \phi(x). \quad (70)

In terms of the variable x they are written as

H_n(x) = x^n - \frac{n^{[2]}}{2 \cdot 1!} x^{n-2} + \frac{n^{[4]}}{2^2 \cdot 2!} x^{n-4} - \frac{n^{[6]}}{2^3 \cdot 3!} x^{n-6} + \ldots, \quad (71)

where n^{[m]} \equiv n!/(n-m)!. This gives explicitly, for the polynomials up to order ten,

H_0(x) = 1 \quad (72)
H_1(x) = x \quad (73)
H_2(x) = x^2 - 1 \quad (74)
H_3(x) = x^3 - 3x \quad (75)
H_4(x) = x^4 - 6x^2 + 3 \quad (76)
H_5(x) = x^5 - 10x^3 + 15x \quad (77)
H_6(x) = x^6 - 15x^4 + 45x^2 - 15 \quad (78)
H_7(x) = x^7 - 21x^5 + 105x^3 - 105x \quad (79)
H_8(x) = x^8 - 28x^6 + 210x^4 - 420x^2 + 105 \quad (80)
H_9(x) = x^9 - 36x^7 + 378x^5 - 1260x^3 + 945x \quad (81)
H_{10}(x) = x^{10} - 45x^8 + 630x^6 - 3150x^4 + 4725x^2 - 945. \quad (82)

Differentiating the identity

e^{tx - \frac{1}{2}t^2} = \sum_{i=0}^{\infty} \frac{t^i H_i(x)}{i!} \quad (83)

n times with respect to x gives the property

\frac{d^n}{dx^n} H_m(x) = m^{[n]} H_{m-n}(x). \quad (84)

Differentiating (83) once with respect to t produces a relation between the different orders,

H_n(x) - x H_{n-1}(x) + (n-1) H_{n-2}(x) = 0, \quad n \geq 2. \quad (85)

Combining (85) and (84) gives a differential equation for the Hermite polynomials,

\frac{d^2}{dx^2} H_n(x) - x \frac{d}{dx} H_n(x) + n H_n(x) = 0, \quad n \geq 2. \quad (86)

Finally, the Hermite polynomials are orthogonal in the following sense,

\int_{-\infty}^{\infty} H_n(x) H_m(x) \phi(x) \, dx = n! \, \delta_{n,m}, \quad (87)

where \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}.
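As a quick sanity check on (85) and (87), the snippet below verifies the recursion and the orthogonality relation numerically with NumPy's probabilists' Hermite module (hermite_e). This is an illustrative check we added; it is not part of the original derivation.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt, pi

def Hn(n, x):
    # Probabilists' Hermite polynomial H_n, matching (72)-(82).
    c = np.zeros(n + 1)
    c[n] = 1.0
    return He.hermeval(x, c)

# Orthogonality (87): int H_n H_m phi dx = n! delta_{nm}, with
# phi(x) = exp(-x^2/2)/sqrt(2 pi). hermegauss supplies Gauss quadrature
# nodes and weights for the weight function exp(-x^2/2).
nodes, weights = He.hermegauss(40)
for n in range(6):
    for m in range(6):
        val = np.sum(weights * Hn(n, nodes) * Hn(m, nodes)) / sqrt(2.0 * pi)
        expected = float(factorial(n)) if n == m else 0.0
        assert abs(val - expected) < 1e-8

# Recursion (85): H_n(x) - x H_{n-1}(x) + (n-1) H_{n-2}(x) = 0 for n >= 2.
xs = np.linspace(-3.0, 3.0, 7)
for n in range(2, 11):
    resid = Hn(n, xs) - xs * Hn(n - 1, xs) + (n - 1) * Hn(n - 2, xs)
    assert np.allclose(resid, 0.0)
```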


B Generalized Moments and Cumulants

This appendix contains the definition of the cumulants in terms of the moments, and vice versa, for general \alpha. For notational convenience we have not denoted \alpha explicitly in the following.

\kappa_0 = \ln \mu_0 \quad (88)
\kappa_1 = \frac{\mu_1}{\mu_0} \quad (89)
\kappa_2 = \frac{\mu_2}{\mu_0} - \left(\frac{\mu_1}{\mu_0}\right)^2 \quad (90)
\kappa_3 = \frac{\mu_3}{\mu_0} - \frac{3\mu_1\mu_2}{\mu_0^2} + 2\left(\frac{\mu_1}{\mu_0}\right)^3 \quad (91)
\kappa_4 = \frac{\mu_4}{\mu_0} - 3\left(\frac{\mu_2}{\mu_0}\right)^2 - \frac{4\mu_1\mu_3}{\mu_0^2} + \frac{12\mu_1^2\mu_2}{\mu_0^3} - 6\left(\frac{\mu_1}{\mu_0}\right)^4 \quad (92)

\mu_0 = e^{\kappa_0} \quad (93)
\frac{\mu_1}{\mu_0} = \kappa_1 \quad (94)
\frac{\mu_2}{\mu_0} = \kappa_2 + \kappa_1^2 \quad (95)
\frac{\mu_3}{\mu_0} = \kappa_3 + 3\kappa_2\kappa_1 + \kappa_1^3 \quad (96)
\frac{\mu_4}{\mu_0} = \kappa_4 + 4\kappa_3\kappa_1 + 3\kappa_2^2 + 6\kappa_2\kappa_1^2 + \kappa_1^4 \quad (97)
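The maps (88)-(92) and (93)-(97) are easy to mechanize. The following is a small sketch with helper names of our own choosing; it also round-trip checks one map against the other.

```python
import numpy as np

def cumulants_from_moments(mu):
    # kappa_0..kappa_4 from mu_0..mu_4, following (88)-(92);
    # alpha is implicit in the (generalized) moments.
    m = [mu_n / mu[0] for mu_n in mu]   # normalized moments mu_n / mu_0
    k0 = np.log(mu[0])
    k1 = m[1]
    k2 = m[2] - m[1]**2
    k3 = m[3] - 3.0*m[1]*m[2] + 2.0*m[1]**3
    k4 = (m[4] - 3.0*m[2]**2 - 4.0*m[1]*m[3]
          + 12.0*m[1]**2*m[2] - 6.0*m[1]**4)
    return [k0, k1, k2, k3, k4]

def moments_from_cumulants(k):
    # Inverse map, following (93)-(97).
    mu0 = np.exp(k[0])
    m1 = k[1]
    m2 = k[2] + k[1]**2
    m3 = k[3] + 3.0*k[2]*k[1] + k[1]**3
    m4 = (k[4] + 4.0*k[3]*k[1] + 3.0*k[2]**2
          + 6.0*k[2]*k[1]**2 + k[1]**4)
    return [mu0, m1*mu0, m2*mu0, m3*mu0, m4*mu0]

# Round-trip check on arbitrary input (mu_0 must be positive).
mu = [0.8, 0.3, 1.2, 0.5, 4.0]
assert np.allclose(moments_from_cumulants(cumulants_from_moments(mu)), mu)
```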


C Limiting Value of the Correction Term as \alpha \to 1

In this appendix we prove the claim made in section 4 that

\lim_{\alpha \to 1} \Psi = c_{R+K} \qquad \text{and} \qquad \lim_{\alpha \to 1} \psi(x) = H_{R+K}(x),

where \Psi and \psi(x) denote the correction-term quantities of equation (35), and K = 1 if R is odd and K = 2 if R is even. First we remind the reader that

\frac{p(x)}{\phi(x)} = \sum_{n=0}^{\infty} d_n H_n(\alpha x) \quad (98)
\frac{\phi(x)}{\phi(\alpha x)} = \sum_{n=0}^{\infty} a_n H_n(\alpha x) \quad (99)
d_n = \frac{1}{n!} \left\langle \frac{p(x)}{\phi(x)}, H_n(\alpha x) \right\rangle_\alpha \quad (100)
a_n = \frac{1}{n!} \left\langle \frac{\phi(x)}{\phi(\alpha x)}, H_n(\alpha x) \right\rangle_\alpha, \quad (101)

where \langle \cdot , \cdot \rangle_\alpha denotes the inner product with respect to the measure d\lambda_\alpha = \phi(\alpha x)\,dx. Using Parseval's theorem we can now prove

1 = \left\langle \frac{p(x)}{\phi(x)}, \frac{\phi(x)}{\phi(\alpha x)} \right\rangle_\alpha \quad (102)
  = \left\langle \sum_{n=0}^{\infty} d_n H_n(\alpha x), \sum_{m=0}^{\infty} a_m H_m(\alpha x) \right\rangle_\alpha \quad (103)
  = \sum_{n=0}^{\infty} n! \, d_n a_n. \quad (104)

In the last line the orthogonality (87) of the Hermite polynomials was used. In a similar fashion we can prove

\frac{1}{\alpha \sqrt{2 - \alpha^2}} = \left\langle \frac{\phi(x)}{\phi(\alpha x)}, \frac{\phi(x)}{\phi(\alpha x)} \right\rangle_\alpha = \sum_{n=0}^{\infty} n! \, a_n^2. \quad (105)

From (31) we find

a_n = \frac{(n-1)!!}{n!} (\alpha^2 - 1)^{n/2} \, \delta_{n,2k}, \qquad k \in \{0, 1, 2, 3, \ldots\}. \quad (106)

Because we are going to take the limit \alpha \to 1, we define \varepsilon = \alpha^2 - 1 \ll 1. Next, we consider the expression (35) for \Psi,

\Psi = \frac{1}{\sqrt{(R+K)!}} \, \frac{1 - \sum_{n=0}^{R} n! \, a_n d_n}{\sqrt{\frac{1}{\alpha\sqrt{2-\alpha^2}} - \sum_{n=0}^{R} n! \, a_n^2}} \quad (107)
     = \frac{1}{\sqrt{(R+K)!}} \, \frac{\sum_{n=R+K}^{\infty} n! \, a_n d_n}{\sqrt{\sum_{n=R+K}^{\infty} n! \, a_n^2}}. \quad (108)

Because a_n = 0 for odd n, the first nonzero term in the series is of even order. Expanding in terms of \varepsilon gives

\Psi = \frac{1}{\sqrt{(R+K)!}} \, \frac{(R+K-1)!! \, \varepsilon^{\frac{R+K}{2}} \, d_{R+K} + O(\varepsilon^{\frac{R+K+2}{2}})}{\frac{(R+K-1)!!}{\sqrt{(R+K)!}} \, \varepsilon^{\frac{R+K}{2}} + O(\varepsilon^{\frac{R+K+2}{2}})} \quad (109)
     \to d_{R+K} \quad \text{as } \varepsilon \to 0. \quad (110)

At \alpha = 1 the coefficient d_{R+K} reduces to the classical coefficient c_{R+K}, which is the claimed limit. For \psi(x) we follow the same reasoning,

\psi(x) = \frac{\sqrt{(R+K)!}}{\sqrt{\frac{1}{\alpha\sqrt{2-\alpha^2}} - \sum_{n=0}^{R} n! \, a_n^2}} \left( \frac{\phi(x)}{\phi(\alpha x)} - \sum_{n=0}^{R} a_n H_n(\alpha x) \right) \quad (111)
        = \frac{\sqrt{(R+K)!}}{\sqrt{\sum_{n=R+K}^{\infty} n! \, a_n^2}} \sum_{n=R+K}^{\infty} a_n H_n(\alpha x) \quad (112)
        = \sqrt{(R+K)!} \, \frac{\frac{(R+K-1)!!}{(R+K)!} \, \varepsilon^{\frac{R+K}{2}} \, H_{R+K}(\alpha x) + O(\varepsilon^{\frac{R+K+2}{2}})}{\frac{(R+K-1)!!}{\sqrt{(R+K)!}} \, \varepsilon^{\frac{R+K}{2}} + O(\varepsilon^{\frac{R+K+2}{2}})} \quad (113)
        \to H_{R+K}(x) \quad \text{as } \varepsilon \to 0. \quad (114)

This concludes the proof.
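As an illustration of (105) and (106), one can check numerically that the coefficients a_n sum as claimed for an admissible α² < 2. The sketch below is ours; the value α² = 1.6 is an arbitrary choice, and the summand uses the elementary simplification n! a_n² = ((n−1)!!/n!!) (α²−1)ⁿ for even n, which follows from n! = n!! (n−1)!!.

```python
from math import sqrt

def double_factorial(n):
    # (-1)!! = 1 by convention, as needed for a_0 in (106).
    return 1 if n <= 0 else n * double_factorial(n - 2)

alpha2 = 1.6                  # any value with 1 <= alpha2 < 2 will do
eps = alpha2 - 1.0

# sum_n n! a_n^2 over even n = 2k, using n! a_n^2 = ((n-1)!!/n!!) eps^n.
series = sum(double_factorial(2*k - 1) / double_factorial(2*k) * eps**(2*k)
             for k in range(100))

# Closed form from (105): 1 / (alpha * sqrt(2 - alpha^2)).
closed_form = 1.0 / (sqrt(alpha2) * sqrt(2.0 - alpha2))
assert abs(series - closed_form) < 1e-10
```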


D Proof of Equation (23)

The characteristic function of a pdf is defined by

\psi(t) = \int_{-\infty}^{\infty} e^{ixt} p(x) \, dx. \quad (115)

If we Taylor expand e^{ixt} we find that this is equal to

\psi(t) = \sum_{n=0}^{\infty} \frac{1}{n!} \mu_n (it)^n. \quad (116)

For arbitrary \alpha we define

\psi^{(\alpha)}(t) = \int_{-\infty}^{\infty} e^{i\alpha xt} \, p(x) \frac{\phi(\alpha x)}{\phi(x)} \, dx. \quad (117)

Using the definition of the generalized moments (21), it is easily seen that this is equal to (116) with \mu_n replaced by \mu_n^{(\alpha)}. With this definition, \psi^{(\alpha)} is also the moment generating function of p(x). Observe that (117) is the Fourier transform, with respect to the variable \alpha x, of the following function,

\psi^{(\alpha)}(t) = \mathcal{F}\left[ \frac{p(x)\,\phi(\alpha x)}{\alpha\,\phi(x)} \right]. \quad (118)

This implies that we can find an expression for p(x) by inverting the Fourier transform,

p(x) = \frac{\alpha\,\phi(x)}{\phi(\alpha x)} \, \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\alpha xt} \, \psi^{(\alpha)}(t) \, dt. \quad (119)

Next, we use the relation (8) between the cumulants and the moments to write

p(x) = \frac{\alpha\,\phi(x)}{\phi(\alpha x)} \, \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\alpha xt} \, e^{\sum_{n=0}^{\infty} \frac{1}{n!} \kappa_n^{(\alpha)} (it)^n} \, dt. \quad (120)

By defining \tilde{\kappa}_n^{(\alpha)} = \kappa_n^{(\alpha)} - \delta_{n,2} we can separate a Gaussian factor \phi(t) inside the integral,

p(x) = \frac{\alpha\,\phi(x)}{\phi(\alpha x)} \, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-i\alpha xt} \, e^{\sum_{n=0}^{\infty} \frac{1}{n!} \tilde{\kappa}_n^{(\alpha)} (it)^n} \, \phi(t) \, dt. \quad (121)

Finally, we will need the result

\mathcal{F}^{-1}[(it)^n \phi(t)] = \sqrt{2\pi} \, (-1)^n \frac{d^n}{d(\alpha x)^n} \phi(\alpha x). \quad (122)

If we expand the exponent containing the cumulants in a Taylor series, perform the inverse Fourier transform on every term separately, and then recombine the terms in an exponent, we find the desired result (23).
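To see (116) and (117) in action, the following sketch (ours; the Laplace density and α² = 1.5 are arbitrary illustrative choices) numerically confirms that the generalized characteristic function equals its moment series, with the generalized moments read off from (117) as μ_n^{(α)} = ∫ (αx)ⁿ p(x) φ(αx)/φ(x) dx.

```python
import numpy as np
from math import factorial

# Grid approximation of the integrals in (117).
alpha2 = 1.5
alpha = np.sqrt(alpha2)
x = np.linspace(-30.0, 30.0, 200001)
dx = x[1] - x[0]
p = 0.5 * np.exp(-np.abs(x))                    # Laplace density
ratio = np.exp(-0.5 * (alpha2 - 1.0) * x**2)    # phi(alpha x)/phi(x)

# Left-hand side: psi^(alpha)(t) as defined in (117).
t = 0.4
lhs = np.sum(np.exp(1j * alpha * x * t) * p * ratio) * dx

# Right-hand side: the series (116) built from the generalized moments.
mu = [np.sum((alpha * x)**n * p * ratio) * dx for n in range(30)]
rhs = sum(mu[n] * (1j * t)**n / factorial(n) for n in range(30))

assert abs(lhs - rhs) < 1e-8
```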
