Mahalanobis distances on factor structured data

Deliang Dai

Department of Economics and Statistics, Linnaeus University, Växjö

August 24, 2014


Outline

1 Introduction
    Backgrounds
    Advantages of Mahalanobis Distances
    Factor model
    MD on factor structured data

2 Definitions and assumptions

3 Distributional properties

4 Contaminated data

5 Empirical application


Background of Mahalanobis distance

Mahalanobis distance (MD) was proposed by Mahalanobis (1936). It measures the distance between two random vectors, or between a random vector and its mean vector. It is widely used in areas such as cluster analysis; discriminant analysis (Fisher, 1940); assessing multivariate normality (Mardia, 1974; Mardia et al., 1980; Mitchell and Krzanowski, 1985; Holgersson and Shukur, 2001); and hypothesis testing and outlier detection (Wilks, 1963; Mardia et al., 1980).


Definition of Mahalanobis distance

The definition of Mahalanobis distance is given as follows,

Definition 1

Let $X_i$ be a $p \times 1$ random vector such that $E[X_i] = \mu_{p \times 1}$ and $E[(X_i - \mu)(X_i - \mu)'] = \Sigma_{p \times p}$ for $i = 1, \dots, n$. Then we make the following definition:

\[
D_{ii} := (X_i - \mu)' \Sigma^{-1} (X_i - \mu). \tag{1}
\]
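As a minimal sketch of equation (1), assuming NumPy and a known mean and covariance matrix (the function name `mahalanobis_sq` and the toy numbers are illustrative and not taken from the slides):

```python
import numpy as np

def mahalanobis_sq(X, mu, Sigma):
    """Return D_ii = (X_i - mu)' Sigma^{-1} (X_i - mu) for each row X_i of X."""
    diff = np.atleast_2d(X) - mu
    # Solve Sigma z = diff' instead of forming the inverse explicitly.
    z = np.linalg.solve(Sigma, diff.T)
    return np.einsum("ij,ji->i", diff, z)

# Toy example with p = 2 and a diagonal covariance matrix.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.0], [0.0, 0.5]])
X = np.array([[2.0, 0.0], [0.0, 1.0]])
d = mahalanobis_sq(X, mu, Sigma)
```

Both toy observations lie at the same Mahalanobis distance from the mean although their Euclidean distances differ, which is the effect of weighting by $\Sigma^{-1}$.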


Advantages of MD

The Mahalanobis distance, which incorporates the covariance matrix, is invariant under nonsingular linear transformations of the data.
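One way to see the invariance, sketched under the assumption that the transformation is $Y_i = A X_i + b$ with $A$ a nonsingular $p \times p$ matrix (the symbols $A$, $b$ and $Y_i$ are introduced only for this illustration): $Y_i$ has mean $A\mu + b$ and covariance matrix $A \Sigma A'$, so its Mahalanobis distance equals that of $X_i$ in Definition 1.

```latex
\begin{aligned}
(Y_i - A\mu - b)' (A \Sigma A')^{-1} (Y_i - A\mu - b)
  &= (X_i - \mu)' A' (A')^{-1} \Sigma^{-1} A^{-1} A (X_i - \mu) \\
  &= (X_i - \mu)' \Sigma^{-1} (X_i - \mu) = D_{ii}.
\end{aligned}
```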


Background of factor model

Factor models are widely applied in research involving unobserved measurements. For example, the arbitrage pricing theory (Ross, 1976) is based on the assumption that asset returns follow a factor structure. In consumer theory, factor analysis shows that utility maximization can be described by at most three factors (Gorman et al., 1981; Lewbel, 1991). In disaggregate business cycle analysis, the factor model is used to identify the common and specific shocks that drive a country's economy (Gregory and Head, 1999; Bai, 2003).


The combination of the factor model and the Mahalanobis distance makes it possible to approximate the covariance structure and to measure the distance between observations at the same time. The balance between simplicity of structure and retained information makes it a good method for data with a dependence structure.

In this presentation, we are concerned only with outlier detection.


Definition of factor model

Definition 2

Let $x \sim N(\mu, \Sigma)$ be a $p \times 1$ random observation with mean $\mu$ and covariance matrix $\Sigma$. The factor model can be represented as

\[
x - \mu_{(p \times 1)} = L_{(p \times m)} F_{(m \times 1)} + \varepsilon_{(p \times 1)},
\]

where $x$ is the observed data ($p > m$), $\mu$ the corresponding expectation, $L$ the factor loadings and $F$ an $m \times 1$ vector of factor scores.


Assumptions of factor model

Besides the definition of the factor model above, we need some further assumptions, as follows.

Definition 3

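Let $E(F) = 0$, $\mathrm{Cov}(F) = I_{m \times m}$ and $E(\varepsilon) = 0$. As defined above, assume that $\varepsilon \sim N(0_{p \times 1}, \Psi)$, where $\Psi$ is a diagonal matrix, and $F \sim N(0_{m \times 1}, I)$ are distributed independently ($\mathrm{Cov}(\varepsilon, F) = 0$).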
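Under these assumptions the covariance matrix of $x$ decomposes as $\Sigma = LL' + \Psi$, because $\mathrm{Cov}(LF + \varepsilon) = L\,\mathrm{Cov}(F)\,L' + \mathrm{Cov}(\varepsilon)$. The following is a minimal simulation sketch of that decomposition, assuming NumPy; the loadings `L` and the diagonal of `Psi` are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 4, 2, 200_000

# Hypothetical loadings L (p x m) and diagonal error covariance Psi.
L = np.array([[0.9, 0.0],
              [0.7, 0.3],
              [0.2, 0.8],
              [0.0, 0.6]])
Psi = np.diag([0.3, 0.4, 0.2, 0.5])

F = rng.standard_normal((n, m))                             # F ~ N(0, I_m)
eps = rng.standard_normal((n, p)) * np.sqrt(np.diag(Psi))   # eps ~ N(0, Psi)
X = F @ L.T + eps                                           # rows are (x_i - mu)'

Sigma_implied = L @ L.T + Psi
Sigma_sample = X.T @ X / n    # mean is known (zero), so no centring
max_abs_err = np.abs(Sigma_sample - Sigma_implied).max()
```

With this many draws the sample covariance matrix should agree with $LL' + \Psi$ up to simulation noise.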


Definitions of MDs on factor model

Definition 4

The MD on the $F$ part is

\[
D(F_i) = F_i' \,\mathrm{cov}(F_i)^{-1} F_i = F_i' F_i,
\]

while for the $\varepsilon$ part, the MD is

\[
D(\varepsilon_i) = \varepsilon_i' \Psi^{-1} \varepsilon_i.
\]


Definition 5

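The MDs for the estimated factor scores and errors are

\[
D(\hat{F}_i) = \hat{F}_i' \,\mathrm{cov}(\hat{F}_i)^{-1} \hat{F}_i
\quad \text{and} \quad
D(\hat{\varepsilon}_i) = \hat{\varepsilon}_i' \hat{\Psi}^{-1} \hat{\varepsilon}_i,
\]

where $\hat{F}_i = \hat{L}' S^{-1} (x_i - \bar{x})$ is the estimated factor score from the regression method (Johnson and Wichern, 2002).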
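The computation in Definition 5 can be sketched as follows, assuming NumPy and assuming that estimates of the loadings and of $\Psi$ are already available from some factor-analysis fit. Several choices here are assumptions rather than the author's method: the function name `factor_mds`, the plugged-in values of `L_hat` and `Psi_hat`, the residual definition $x_i - \bar{x} - \hat{L}\hat{F}_i$, and the use of the sample covariance of the scores for $\mathrm{cov}(\hat{F}_i)$.

```python
import numpy as np

def factor_mds(X, L_hat, Psi_hat):
    """MDs of regression-method factor scores and of the residuals."""
    Xc = X - X.mean(axis=0)                       # x_i - x_bar
    S = Xc.T @ Xc / X.shape[0]                    # sample covariance (divisor n)
    # Rows are F_hat_i' = (x_i - x_bar)' S^{-1} L_hat.
    F_hat = np.linalg.solve(S, Xc.T).T @ L_hat
    eps_hat = Xc - F_hat @ L_hat.T                # assumed residual definition
    cov_F = np.atleast_2d(np.cov(F_hat, rowvar=False))
    d_F = np.einsum("ij,ij->i", F_hat, np.linalg.solve(cov_F, F_hat.T).T)
    d_eps = np.einsum("ij,ij->i", eps_hat / np.diag(Psi_hat), eps_hat)
    return d_F, d_eps

rng = np.random.default_rng(1)
n, p, m = 500, 4, 2
L_hat = np.array([[0.9, 0.0], [0.7, 0.3], [0.2, 0.8], [0.0, 0.6]])  # hypothetical
Psi_hat = np.diag([0.3, 0.4, 0.2, 0.5])                               # hypothetical
X = (rng.standard_normal((n, m)) @ L_hat.T
     + rng.standard_normal((n, p)) * np.sqrt(np.diag(Psi_hat)))
d_F, d_eps = factor_mds(X, L_hat, Psi_hat)
```

Solving linear systems instead of inverting $S$ and $\mathrm{cov}(\hat{F}_i)$ is a numerical convenience; the quantities computed are the same quadratic forms.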


Joint distribution of separated MDs

Proposition 1

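The joint distribution of $D(F_i)$ and $D(\varepsilon_i)$ is

\[
Z = \begin{bmatrix} D(F_i) \\ D(\varepsilon_i) \end{bmatrix}
\sim \begin{bmatrix} \chi^2_{(m)} \\ \chi^2_{(p)} \end{bmatrix},
\qquad
\mathrm{cov}(Z) = \begin{bmatrix} 2m & 0 \\ 0 & 2p \end{bmatrix}.
\]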
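A quick Monte Carlo check of Proposition 1 can be sketched as follows, assuming NumPy; the dimensions and the diagonal `psi` of $\Psi$ are arbitrary illustrative values. The sample means should be close to $(m, p)$ and the sample covariance matrix close to $\mathrm{diag}(2m, 2p)$, with the zero off-diagonal reflecting the independence of $F$ and $\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, n = 2, 5, 200_000
psi = np.array([0.3, 0.4, 0.2, 0.5, 0.6])   # hypothetical diagonal of Psi

F = rng.standard_normal((n, m))                   # F_i ~ N(0, I_m)
eps = rng.standard_normal((n, p)) * np.sqrt(psi)  # eps_i ~ N(0, Psi)

D_F = np.sum(F**2, axis=1)             # F_i' F_i
D_eps = np.sum(eps**2 / psi, axis=1)   # eps_i' Psi^{-1} eps_i

Z = np.vstack([D_F, D_eps])            # 2 x n, one row per distance
mean_Z = Z.mean(axis=1)
cov_Z = np.cov(Z)
```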


Proposition 2

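The asymptotic distribution of the MDs under the factor model with an unknown factor structure is

\[
Y = \begin{bmatrix} D(\hat{F}_i) \\ D(\hat{\varepsilon}_i) \end{bmatrix}
\xrightarrow{d} \begin{bmatrix} \chi^2_{(m)} \\ \chi^2_{(p)} \end{bmatrix},
\qquad
\mathrm{cov}(Y) = \begin{bmatrix} 2m & 0 \\ 0 & 2p \end{bmatrix}.
\]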


Definitions of contaminated data

Definition 6

Two kinds of outliers are of interest: (i) an additive outlier in $\varepsilon_i$ and (ii) an additive outlier in $F_i$ ($m \times 1$). The corresponding definitions are given as follows (assuming known mean):

(i) $X_i = L F_i + \varepsilon_i + \delta_i$, with $\delta_i = 0$ for all $i \neq k$, $\delta_i : p \times 1$ and $\delta_k' \delta_k = 1$.

(ii) $X_i = L (F_i + \delta_i) + \varepsilon_i$, with $\delta_i = 0$ for all $i \neq k$, $\delta_k : m \times 1$.
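The two contamination schemes can be illustrated with a small simulation, sketched here under the assumption of NumPy and made-up values for `L`, `psi`, the contaminated index `k` and the directions of $\delta_k$. The sketch only constructs the contaminated samples; it does not impose any further conditions on $\delta_k$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m, k = 100, 4, 2, 10            # k indexes the contaminated observation
L = np.array([[0.9, 0.0], [0.7, 0.3], [0.2, 0.8], [0.0, 0.6]])  # hypothetical
psi = np.array([0.3, 0.4, 0.2, 0.5])                              # hypothetical

F = rng.standard_normal((n, m))
eps = rng.standard_normal((n, p)) * np.sqrt(psi)
X_clean = F @ L.T + eps               # rows are (L F_i + eps_i)', known zero mean

# Case (i): additive outlier in eps_k, scaled so that delta_k' delta_k = 1.
delta_eps = np.zeros((n, p))
delta_eps[k] = np.ones(p) / np.sqrt(p)
X_case1 = X_clean + delta_eps

# Case (ii): additive outlier in F_k (direction and size chosen arbitrarily).
delta_F = np.zeros((n, m))
delta_F[k] = np.array([1.0, 0.0])
X_case2 = (F + delta_F) @ L.T + eps
```

In case (ii) the shift enters the observation through the loadings, so row `k` of `X_case2` differs from the clean row by $L\delta_k$.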


Contaminated data

Let $d_{\delta_j}$, $j = 1, 2$, be the MDs for the two kinds of outliers, and let $S = n^{-1} \sum_{i=1}^{n} (L F_i + \varepsilon_i)(L F_i + \varepsilon_i)'$ be the sample covariance matrix with known mean. To simplify calculations, let $\delta_k \perp F_i$ and $\delta_k \perp \varepsilon_i$ by assumption, so that they are not only uncorrelated but also orthogonal. The following expressions separate the contaminated and uncontaminated parts of the MD:


Proposition 3

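For case (i),

\[
d_{\delta_1} =
\begin{cases}
d_{ii}, & i \neq k, \\[4pt]
d_{kk} - \dfrac{n \left(1 - x_k' n^{-1} S^{-1} \delta_k\right)^2}{1 + \delta_k' n^{-1} S^{-1} \delta_k} + n, & i = k.
\end{cases}
\]

For case (ii),

\[
d_{\delta_2} =
\begin{cases}
d_{ii}, & i \neq k, \\[4pt]
d_{kk} - \dfrac{n \left(1 - x_k' n^{-1} S^{-1} L \delta_k\right)^2}{1 + \delta_k' L' n^{-1} S^{-1} L \delta_k} + n, & i = k,
\end{cases}
\]

where $d_{ii} = n x_i' W^{-1} x_i = x_i' S^{-1} x_i$.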


Dataset

We apply the results to a data set consisting of the monthly closing stock prices of ten companies.

The companies are Exxon Mobil Corp., Chevron Corp., Apple Inc., General Electric Co., Wells Fargo & Co., Procter & Gamble Co., Microsoft Corp., Johnson & Johnson, Google Inc. and Johnson Controls Inc. The variables follow this order.


We transformed the closing prices into monthly rates of return ($r$),

\[
r = \frac{y_t - y_{t-1}}{y_{t-1}},
\]

where $y_t$ stands for the closing price in the $t$th month. The whole analysis is based on the returns of the stock prices.
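The return transformation can be sketched as follows, assuming NumPy and prices stored as a $T \times p$ array with one column per stock. The function name `simple_returns` and the prices are made up for illustration and are not the data used in the presentation.

```python
import numpy as np

def simple_returns(prices):
    """r_t = (y_t - y_{t-1}) / y_{t-1} for each column of a (T x p) price array."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1]

# Made-up closing prices for two stocks over three months.
y = np.array([[100.0, 50.0],
              [110.0, 45.0],
              [ 99.0, 54.0]])
r = simple_returns(y)
```

The result has one row fewer than the price array, since the first month has no preceding price.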


Boxplot

Figure: The boxplot of the original MDs.


Figure: The boxplot of error terms.


Figure: The boxplot of factor scores.


Bibliography

Bai, J. (2003). Inferential theory for factor models of large dimensions, Econometrica 71(1): 135–171.

Fisher, R. A. (1940). The precision of discriminant functions, Annals of Human Genetics 10(1): 422–429.

Gorman, N., McConnell, I. and Lachmann, P. (1981). Characterisation of the third component of canine and feline complement, Veterinary Immunology and Immunopathology 2(4): 309–320.

Gregory, A. W. and Head, A. C. (1999). Common and country-specific fluctuations in productivity, investment, and the current account, Journal of Monetary Economics 44(3): 423–451.

Holgersson, H. and Shukur, G. (2001). Some aspects of non-normality tests in systems of regression equations, Communications in Statistics - Simulation and Computation 30(2): 291–310.

Johnson, R. A. and Wichern, D. W. (2002). Applied multivariate statistical analysis, Vol. 5, Prentice Hall, Upper Saddle River, NJ.

Lewbel, A. (1991). The rank of demand systems: theory and nonparametric estimation, Econometrica: Journal of the Econometric Society pp. 711–730.

Mahalanobis, P. (1936). On the generalized distance in statistics, Proceedings of the National Institute of Sciences of India, Vol. 2, New Delhi, pp. 49–55.

Mardia, K. (1974). Applications of some measures of multivariate skewness and kurtosis in testing normality and robustness studies, Sankhya: The Indian Journal of Statistics, Series B pp. 115–128.

Mardia, K., Kent, J. and Bibby, J. (1980). Multivariate analysis.

Mitchell, A. and Krzanowski, W. (1985). The Mahalanobis distance and elliptic distributions, Biometrika 72(2): 464–467.

Ross, S. A. (1976). The arbitrage theory of capital asset pricing, Journal of Economic Theory 13(3): 341–360.

Wilks, S. (1963). Multivariate statistical outliers, Sankhya: The Indian Journal of Statistics, Series A pp. 407–426.