11
?Joural of Business &Economic Statistics, Vol.2, No. 1, January 1984 Robust Methods of Building Regression Models-An Application to the Housing Sector Daniel Penia Escuela de lingenieros Industriales, Universidad Politecnica de Madrid, Madrid, Spain Javier Ruiz-Castillo Departmento de Teoria Economica, Universidad Complutense de Madrid, Madrid, Spain Thisarticle studies robustification strategies for the linear modelin the presence of outliers. The advantages of an internal analysis of the robustness of least squares for a given sample are pointed out. The application of this methodology is illustrated by building an explicit model of the determinants of rental housing values in the Madrid Metropolitan Area. KEY WORDS: Outliers; Influential observations; Robust regression; Cook distance; Hedonic price function; Housing market. 1. INTRODUCTION The question of introducing some degree of objectiv- ity in the rejection of outlying observations has been the subject of considerableresearch in the statistics literature. This is a fundamental problem with cross- section samples whereone typically has a large body of data on numerous variables. A few atypical observa- tions may make the data distribution nonnormal, de- stroying the optimality of the least squares estimation procedure, which could become very inefficient. In this article we consider the problem fromthe point of view of building an explanatory model of market rental values in terms of the observed traits of each housing unit in an urban area. Hence, this exercise belongs to the vastliterature on hedonic price functions in urban economics, which has been reviewed by Gril- iches (1971), Ball (1973), and Quigley (1979). The microeconomic underpinnings of the empirical work in this area can be foundin Rosen (1974), who provides a model of price determination of a differentiated and indivisible product under competitive conditions. The restof the article is organized as follows.Section 2 summarizes the effects of outliers in the context of maximum likelihood estimation of the linear model. Section 3 briefly surveys the possible solutions and showsthe advantages of a robustification of the model's construction methodology, consisting of an internal sensitivity analysis of a model estimated by least squares with a particular sample. Its empirical application is illustrated in Section 4, where we present a model of the determinants of housing rentalvalues for the Mad- rid Metropolitan Area. The final section, Section 5, containssome concluding comments. 2. THE EFFECTS OF OUTLIERS We begin by brieflyreviewing for later reference the maximumlikelihoodestimationof the linearmodel Y= X + U (1) whereY is a (n x 1) vector of responses, X is a (n x k) matrix of predetermined variables with rank k, , is a (k x 1) vector of parameters, and U is a (n x 1) vector of disturbances. Let f be the density function of U, and assume that E[U] = 0 and E[UU'] = a2 I. The maximum likelihood estimationof (1) leadsto n n max In f(ei) = min ~ -g(ej), i=1 i=l (2) where -g = In f and e, = yi - x/ # are the sample residuals. Iff is differentiable, the maximum likelihood esti- mator of 6 is the solution (assumed unique) to the system Z I(ei)x; = O', i=l (3) where / is the firstderivative of g, and xi is the ith row 10 of Business & Economic Statistics, Vol. 2, No. 1, January 1984 Robust Methods of Building Regression Models-An Application to the Housing Sector Daniel Peña Escuela de Iingenieros Industriales, Universidad Politécnica de Madrid. Madrid, Spain Javier Ruiz-Castillo Departmento de Teoría Económica, Universidad Complutense de Madrid, Madrid, Spain This article studies robustification strategies for the linear model in the presence of outliers. The advantages of an intemal analysis of the robustness of least squares for a given sample are pointed out. The application of this tnethodology is iIIustrated by building an explicit model of the determinants of rental housing values in the Madrid Metropolitan Area. KEY WORDS: Outliers; Influential observations; Robust regression; Cook distance; Hedonic price function; Housing market. where 1/; is the first derivative of g, and is the ith row n n max L In f(e¡) = min L -g(e¡), (2) ¡=I ¡=I 2. THE EFFECTS OF OUTLlERS We begin by briefly reviewing for later reference the maximum likelihood estimation ofthe linear model illustrated in Section 4, where we present a model of the determinants of housing rental values for the Mad- rid Metropolitan Area. The final section, Section 5, contains sorne coneluding comments. (1) (3) Y=x,B+U n L 1/;(e¡)xí = O', ¡=I where Y is a (n x 1) vector ofresponses, X is a (n x k) matrix of predetermined variables with rank k, P is a (k x 1) vector ofparameters, and U is a (n x 1) vector of disturbances. Let fbe the density function of U, and assume that E[U] = Oand E[UU'] = u 2 I. The maximum likelihood estimation of (1) leads to where -g = In f and = - Pare the sample residuals. If f is differentiable, the maximum likelihood esti- mator of P is the solution (assumed unique) to the system 1. INTRODUCTION The question of introducing sorne degree of objectiv- ity in the rejection of out1ying observations has been the subject of considerable research in the statistics literature. This is a fundamental problem with cross- section samples where one typically has a large body of data on numerous variables. A few atypical observa- tions may make the data distribution nonnormal, de- stroying the optimality of the least squares estimation procedure, which could become very inefficient. In this artiele we consider the problem from the point of view of building an explanatory model of market rental values in terms of the observed traits of each housing unit in an urban area. Hence, this exercise belongs to the vast literature on hedonic price functions in urban economics, which has been reviewed by Gril- iches (1971), Ball (1973), and Quigley (1979). The microeconomic underpinnings of the empirical work in this area can be found in Rosen (1974), who provides a model of price determination of a differentiated and indivisible product under competitive conditions. The rest of the artiele is organized as follows. Section 2 summarizes the effects of outliers in the context of maximum likelihood estimation of the linear model. Section 3 briefly surveys the possible solutions and shows the advantages of a robustification of the model's construction methodology, consisting of an internal sensitivity analysis of a model estimated by least squares with a particular sample. Its empirical application is 10 of Business & Economic Statistics, Vol. 2, No. 1, January 1984 Robust Methods of Building Regression Models-An Application to the Housing Sector Daniel Peña Escuela de Iingenieros Industriales, Universidad Politécnica de Madrid. Madrid, Spain Javier Ruiz-Castillo Departmento de Teoría Económica, Universidad Complutense de Madrid, Madrid, Spain This article studies robustification strategies for the linear model in the presence of outliers. The advantages of an intemal analysis of the robustness of least squares for a given sample are pointed out. The application of this tnethodology is iIIustrated by building an explicit model of the determinants of rental housing values in the Madrid Metropolitan Area. KEY WORDS: Outliers; Influential observations; Robust regression; Cook distance; Hedonic price function; Housing market. where 1/; is the first derivative of g, and is the ith row n n max L In f(e¡) = min L -g(e¡), (2) ¡=I ¡=I 2. THE EFFECTS OF OUTLlERS We begin by briefly reviewing for later reference the maximum likelihood estimation ofthe linear model illustrated in Section 4, where we present a model of the determinants of housing rental values for the Mad- rid Metropolitan Area. The final section, Section 5, contains sorne coneluding comments. (1) (3) Y=x,B+U n L 1/;(e¡)xí = O', ¡=I where Y is a (n x 1) vector ofresponses, X is a (n x k) matrix of predetermined variables with rank k, P is a (k x 1) vector ofparameters, and U is a (n x 1) vector of disturbances. Let fbe the density function of U, and assume that E[U] = Oand E[UU'] = u 2 I. The maximum likelihood estimation of (1) leads to where -g = In f and = - Pare the sample residuals. If f is differentiable, the maximum likelihood esti- mator of P is the solution (assumed unique) to the system 1. INTRODUCTION The question of introducing sorne degree of objectiv- ity in the rejection of out1ying observations has been the subject of considerable research in the statistics literature. This is a fundamental problem with cross- section samples where one typically has a large body of data on numerous variables. A few atypical observa- tions may make the data distribution nonnormal, de- stroying the optimality of the least squares estimation procedure, which could become very inefficient. In this artiele we consider the problem from the point of view of building an explanatory model of market rental values in terms of the observed traits of each housing unit in an urban area. Hence, this exercise belongs to the vast literature on hedonic price functions in urban economics, which has been reviewed by Gril- iches (1971), Ball (1973), and Quigley (1979). The microeconomic underpinnings of the empirical work in this area can be found in Rosen (1974), who provides a model of price determination of a differentiated and indivisible product under competitive conditions. The rest of the artiele is organized as follows. Section 2 summarizes the effects of outliers in the context of maximum likelihood estimation of the linear model. Section 3 briefly surveys the possible solutions and shows the advantages of a robustification of the model's construction methodology, consisting of an internal sensitivity analysis of a model estimated by least squares with a particular sample. Its empirical application is 10

Robust Methods of Building Regression Models—An Application to thè Housing Sector

Embed Size (px)

Citation preview

?Joural of Business & Economic Statistics, Vol. 2, No. 1, January 1984

Robust Methods of Building Regression Models-An Application to the

Housing Sector

Daniel Penia Escuela de lingenieros Industriales, Universidad Politecnica de Madrid, Madrid, Spain

Javier Ruiz-Castillo Departmento de Teoria Economica, Universidad Complutense de Madrid, Madrid, Spain

This article studies robustification strategies for the linear model in the presence of outliers. The advantages of an internal analysis of the robustness of least squares for a given sample are pointed out. The application of this methodology is illustrated by building an explicit model of the determinants of rental housing values in the Madrid Metropolitan Area.

KEY WORDS: Outliers; Influential observations; Robust regression; Cook distance; Hedonic price function; Housing market.

1. INTRODUCTION

The question of introducing some degree of objectiv- ity in the rejection of outlying observations has been the subject of considerable research in the statistics literature. This is a fundamental problem with cross- section samples where one typically has a large body of data on numerous variables. A few atypical observa- tions may make the data distribution nonnormal, de- stroying the optimality of the least squares estimation procedure, which could become very inefficient.

In this article we consider the problem from the point of view of building an explanatory model of market rental values in terms of the observed traits of each housing unit in an urban area. Hence, this exercise belongs to the vast literature on hedonic price functions in urban economics, which has been reviewed by Gril- iches (1971), Ball (1973), and Quigley (1979). The microeconomic underpinnings of the empirical work in this area can be found in Rosen (1974), who provides a model of price determination of a differentiated and indivisible product under competitive conditions.

The rest of the article is organized as follows. Section 2 summarizes the effects of outliers in the context of maximum likelihood estimation of the linear model. Section 3 briefly surveys the possible solutions and shows the advantages of a robustification of the model's construction methodology, consisting of an internal sensitivity analysis of a model estimated by least squares with a particular sample. Its empirical application is

illustrated in Section 4, where we present a model of the determinants of housing rental values for the Mad- rid Metropolitan Area. The final section, Section 5, contains some concluding comments.

2. THE EFFECTS OF OUTLIERS We begin by briefly reviewing for later reference the

maximum likelihood estimation of the linear model

Y= X + U (1) where Y is a (n x 1) vector of responses, X is a (n x k) matrix of predetermined variables with rank k, , is a (k x 1) vector of parameters, and U is a (n x 1) vector of disturbances.

Let f be the density function of U, and assume that E[U] = 0 and E[UU'] = a2 I. The maximum likelihood estimation of (1) leads to

n n

max In f(ei) = min ~ -g(ej), i=1 i=l

(2)

where -g = In f and e, = yi - x/ # are the sample residuals.

Iff is differentiable, the maximum likelihood esti- mator of 6 is the solution (assumed unique) to the system

Z I(ei)x; = O', i=l

(3)

where / is the first derivative of g, and xi is the ith row

10

~oumal of Business & Economic Statistics, Vol. 2, No. 1, January 1984

Robust Methods of Building RegressionModels-An Application to theHousing Sector

Daniel PeñaEscuela de Iingenieros Industriales, Universidad Politécnica de Madrid. Madrid, Spain

Javier Ruiz-CastilloDepartmento de Teoría Económica, Universidad Complutense de Madrid, Madrid, Spain

This article studies robustification strategies for the linear model in the presence of outliers. Theadvantages of an intemal analysis of the robustness of least squares for a given sample arepointed out. The application of this tnethodology is iIIustrated by building an explicit model of thedeterminants of rental housing values in the Madrid Metropolitan Area.

KEY WORDS: Outliers; Influential observations; Robust regression; Cook distance; Hedonicprice function; Housing market.

where 1/; is the first derivative ofg, and X¡ is the ith row

n n

max L In f(e¡) = min L -g(e¡), (2)¡=I ¡=I

2. THE EFFECTS OF OUTLlERS

We begin by briefly reviewing for later reference themaximum likelihood estimation ofthe linear model

illustrated in Section 4, where we present a model ofthe determinants of housing rental values for the Mad­rid Metropolitan Area. The final section, Section 5,contains sorne coneluding comments.

(1)

(3)

Y=x,B+U

n

L 1/;(e¡)xí = O',¡=I

where Y is a (n x 1) vector ofresponses, X is a (n x k)matrix of predetermined variables with rank k, P is a(k x 1) vector ofparameters, and U is a (n x 1) vectorof disturbances.

Let fbe the density function of U, and assume thatE[U] = Oand E[UU'] = u 2 I. The maximum likelihoodestimation of (1) leads to

where -g = In f and e¡ = Y¡ - xí Pare the sampleresiduals.

If f is differentiable, the maximum likelihood esti­mator of P is the solution (assumed unique) to thesystem

1. INTRODUCTION

The question of introducing sorne degree of objectiv­ity in the rejection of out1ying observations has beenthe subject of considerable research in the statisticsliterature. This is a fundamental problem with cross­section samples where one typically has a large body ofdata on numerous variables. A few atypical observa­tions may make the data distribution nonnormal, de­stroying the optimality of the least squares estimationprocedure, which could become very inefficient.

In this artiele we consider the problem from the pointof view of building an explanatory model of marketrental values in terms of the observed traits of eachhousing unit in an urban area. Hence, this exercisebelongs to the vast literature on hedonic price functionsin urban economics, which has been reviewed by Gril­iches (1971), Ball (1973), and Quigley (1979). Themicroeconomic underpinnings of the empirical workin this area can be found in Rosen (1974), who providesa model of price determination of a differentiated andindivisible product under competitive conditions.

The rest of the artiele is organized as follows. Section2 summarizes the effects of outliers in the context ofmaximum likelihood estimation of the linear model.Section 3 briefly surveys the possible solutions andshows the advantages ofa robustification ofthe model'sconstruction methodology, consisting of an internalsensitivity analysis ofa model estimated by least squareswith a particular sample. Its empirical application is

10

~oumal of Business & Economic Statistics, Vol. 2, No. 1, January 1984

Robust Methods of Building RegressionModels-An Application to theHousing Sector

Daniel PeñaEscuela de Iingenieros Industriales, Universidad Politécnica de Madrid. Madrid, Spain

Javier Ruiz-CastilloDepartmento de Teoría Económica, Universidad Complutense de Madrid, Madrid, Spain

This article studies robustification strategies for the linear model in the presence of outliers. Theadvantages of an intemal analysis of the robustness of least squares for a given sample arepointed out. The application of this tnethodology is iIIustrated by building an explicit model of thedeterminants of rental housing values in the Madrid Metropolitan Area.

KEY WORDS: Outliers; Influential observations; Robust regression; Cook distance; Hedonicprice function; Housing market.

where 1/; is the first derivative ofg, and X¡ is the ith row

n n

max L In f(e¡) = min L -g(e¡), (2)¡=I ¡=I

2. THE EFFECTS OF OUTLlERS

We begin by briefly reviewing for later reference themaximum likelihood estimation ofthe linear model

illustrated in Section 4, where we present a model ofthe determinants of housing rental values for the Mad­rid Metropolitan Area. The final section, Section 5,contains sorne coneluding comments.

(1)

(3)

Y=x,B+U

n

L 1/;(e¡)xí = O',¡=I

where Y is a (n x 1) vector ofresponses, X is a (n x k)matrix of predetermined variables with rank k, P is a(k x 1) vector ofparameters, and U is a (n x 1) vectorof disturbances.

Let fbe the density function of U, and assume thatE[U] = Oand E[UU'] = u 2 I. The maximum likelihoodestimation of (1) leads to

where -g = In f and e¡ = Y¡ - xí Pare the sampleresiduals.

If f is differentiable, the maximum likelihood esti­mator of P is the solution (assumed unique) to thesystem

1. INTRODUCTION

The question of introducing sorne degree of objectiv­ity in the rejection of out1ying observations has beenthe subject of considerable research in the statisticsliterature. This is a fundamental problem with cross­section samples where one typically has a large body ofdata on numerous variables. A few atypical observa­tions may make the data distribution nonnormal, de­stroying the optimality of the least squares estimationprocedure, which could become very inefficient.

In this artiele we consider the problem from the pointof view of building an explanatory model of marketrental values in terms of the observed traits of eachhousing unit in an urban area. Hence, this exercisebelongs to the vast literature on hedonic price functionsin urban economics, which has been reviewed by Gril­iches (1971), Ball (1973), and Quigley (1979). Themicroeconomic underpinnings of the empirical workin this area can be found in Rosen (1974), who providesa model of price determination of a differentiated andindivisible product under competitive conditions.

The rest of the artiele is organized as follows. Section2 summarizes the effects of outliers in the context ofmaximum likelihood estimation of the linear model.Section 3 briefly surveys the possible solutions andshows the advantages ofa robustification ofthe model'sconstruction methodology, consisting of an internalsensitivity analysis ofa model estimated by least squareswith a particular sample. Its empirical application is

10

Pefa and Ruiz-Castillo: Robust Methods of Building Regression Models

of X. Another way of writing (3) is

n

~ ex Wi = 0', i=

(4)

where wi = e(ei)/ei. Thus, the maximum likelihood estimation of the

linear model can be interpreted as (a) the minimization of a certain function g of the sample residuals; (b) the choice of a function k of sample residuals, whose com- ponents are orthogonal to the linear space generated by the columns of X; and (c) as weighted least squares with weights wi determined iteratively.

Iff is symmetric, we may assume that it belongs to the potential exponential family-a general form sug- gested by Diananda (1949) and Box (1953), and studied by Box and Tiao (1973). In this case

f(u) = kl(a)a-expl-k2(a) I u/a 12/(l+a)3,

-l < a< 1, a < , -oo u < oo, (5)

where a is the standard deviation, and a indicates the kurtosis of the distribution.

For a = 0, the distribution is the normal; for a = 1, it is the Laplace distribution; and as a approaches -1, one obtains in the limit the uniform distribution. More- over, expression (5) includes leptokurtic distributions with tails wider than the normal when a > 0, and platokurtic distributions when a < 0.

Taking a as known, the maximization of the likeli- hood of a linear model with disturbances given by (5) leads to

n 2/( +a)

min yi-x,#3 i=l

This includes as particular cases the minimization of absolute deviations (a = 1), least squares (a = 0), and the minimization of the maximum deviation (as a -- -1). Therefore, the decision on an adequate estimation criterion strongly depends on the specific characteristics of the distribution with which one is working.

In this context, the problem with least squares is that it may become very inefficient in the presence of a few atypical data that make the distribution leptokurtic. To see this, assume that the disturbances in the linear model are N(0, a2) but there exists an unknown small proportion E of atypical observations. This fact can be modeled, following, among others, Tukey (1960), Box and Tiao (1968, 1973), and Guttman (1973), by assum- ing that these anomalous observations come from a normal distribution with zero mean and variance h a2 with h > 1. Then the density function will be

f(u) = (1 - )fN(UI 0, U2) + efN(u 0, h U2). (6)

It is immediate that

var(u) = a2(1 + (h - 1))

YP

_ _-

_

MP

_ _ -

.".0 w o

Xp

(A)

Y,

YP

Xp

(B) Figure 1. TWO TYPES OF OUTLIERS. In case (A), the anomalous

value of the response leads to a vertical displacement of the regres- sion line and a large residual. In case (B), the atypical value of the explicative variables determines the slope of the regression line but leads to a small residual.

and f will be symmetric with kurtosis I + (h2 - 1) )

= 3((1 + (h- 1))2 -I =3 (b 1)

with a > 1. Therefore, the distribution will be leptokur-

x

x

11

,

ofX. Another way ofwriting (3) is

Peña and Ruiz-Castillo: Robust Methods of Building Regression Models 11

y

where W¡ = 1/;(e¡)/e¡.Thus, the maximum likelihood estimation of the

linear model can be interpreted as (a) the minimizationof a eertain funetion g of the sample residuals; (b) thechoice of a funetion 1/; of sample residuals, whose eom­ponents are orthogonal to the linear spaee generated bythe eolumns ofX; and (e) as weighted least squares withweights Wi determined iteratively.

If! is symmetrie, we may assume that it belongs tothe potential exponential family-a general form s~

gested by Diananda (1949) and Box (1953), and studiedby Box and Tiao (1973). In this case

!(u) = k l (a)u- 1expl-k2(a) I u/u 12/(I+a)l,

n

¿ e¡x[ W¡ = O',¡=I

(4)p

Yp -.- _.- - _.-.~--~

.. --­------• ••1 :.,._.,,_ ........

__ .::. -::¡t· .: ..*: .-----:-..... :: ..: " :-.,e .• : •••• :."e.. • •• '." . .• . •". ". e." .. . '.' .~ : .... .. ' .. . '" .

" : ....".: .••II

x-1<a:sl, u<oo, -oou<oo, (5)

where u is the standard deviation, and a indieates thekurtosis of the distribution.

For a = O, the distribution is the normal; for a = 1,it is the Laplaee distribution; and as a approaehes -1,one obtains in the limit the uniform distribution. More­over, expression (5) ineludes leptokurtie distributionswith tails wider than the normal when a > O, andplatokurtie distributions when a < O.

Taking a as known, the maximization of the likeli­hood of a linear model with disturbanees given by (5)leads to

n I /2/(I+a)min¿ Yi - x[{j .

/=1

This ineludes as particular cases the minimization ofabsolute deviations (a = 1), least squares (a = O), andthe minimization of the maximum deviation (as a -+-1). Therefore, the decision on an adequate estimationcriterion strongly depends on the speeifie eharaeteristiesof the distribution with which one is working.

In this eontext, the problem with least squares is thatit may beeome very ineffieient in the presence of a fewatypieal data that make the distribution leptokurtie. Tosee this, assume that the disturbanees in the linearmodel are N(O, u 2

) but there exists an unknown smallproportion f of atypieal observations. This faet can bemodeled, following, among others, Tukey (1960), Boxand Tiao (1968, 1973), and Guttman (1973), by assum­ing that these anomalous observations come from anormal distribution with zero mean and variance h u 2

with h > l. Then the density funetion will be

f(u) = (l - f)!N(U IO, u 2) + f !N(U IO, h u 2). (6)

It is immediate that

var(u) = u 2(1 + f(h - 1»

(A)

y

p ,,,"~p --_._-----------------------,.,.. ".. ..

",. ..... l.. . "•• ••• a "••••• ','

~ :: .~ ... :,:,:'.:.. ' '. ... '... .,,''. : ", :" ..•• " #/IfI#" • ,,' •;.,::.. :.." ..

" " -: '" ." " ." '.. .

x

(8)Figure 1. TWO TYPES OF OUTLlERS. In case (A), the anomalous

value of the response leads to a vertical displacement of the regres­sion line and a large residual. In case (B), the atypical value of theexplicative variables determines the slope of the regression line butleads to a small residual.

and ! will be symmetrie with kurtosis

= 3( 1 + f(h2

- 1) _ 1) = 3 (ó - 1)'Y (l + f(h - 1»2

with ó > l. Therefore, the distribution will be leptokur-

ofX. Another way ofwriting (3) is

Peña and Ruiz-Castillo: Robust Methods of Building Regression Models 11

y

where W¡ = 1/;(e¡)/e¡.Thus, the maximum likelihood estimation of the

linear model can be interpreted as (a) the minimizationof a eertain funetion g of the sample residuals; (b) thechoice of a funetion 1/; of sample residuals, whose eom­ponents are orthogonal to the linear spaee generated bythe eolumns ofX; and (e) as weighted least squares withweights Wi determined iteratively.

If! is symmetrie, we may assume that it belongs tothe potential exponential family-a general form s~

gested by Diananda (1949) and Box (1953), and studiedby Box and Tiao (1973). In this case

!(u) = k l (a)u- 1expl-k2(a) I u/u 12/(I+a)l,

n

¿ e¡x[ W¡ = O',¡=I

(4)p

Yp -.- _.- - _.-.~--~

.. --­------• ••1 :.,._.,,_ ........

__ .::. -::¡t· .: ..*: .-----:-..... :: ..: " :-.,e .• : •••• :."e.. • •• '." . .• . •". ". e." .. . '.' .~ : .... .. ' .. . '" .

" : ....".: .••II

x-1<a:sl, u<oo, -oou<oo, (5)

where u is the standard deviation, and a indieates thekurtosis of the distribution.

For a = O, the distribution is the normal; for a = 1,it is the Laplaee distribution; and as a approaehes -1,one obtains in the limit the uniform distribution. More­over, expression (5) ineludes leptokurtie distributionswith tails wider than the normal when a > O, andplatokurtie distributions when a < O.

Taking a as known, the maximization of the likeli­hood of a linear model with disturbanees given by (5)leads to

n I /2/(I+a)min¿ Yi - x[{j .

/=1

This ineludes as particular cases the minimization ofabsolute deviations (a = 1), least squares (a = O), andthe minimization of the maximum deviation (as a -+-1). Therefore, the decision on an adequate estimationcriterion strongly depends on the speeifie eharaeteristiesof the distribution with which one is working.

In this eontext, the problem with least squares is thatit may beeome very ineffieient in the presence of a fewatypieal data that make the distribution leptokurtie. Tosee this, assume that the disturbanees in the linearmodel are N(O, u 2

) but there exists an unknown smallproportion f of atypieal observations. This faet can bemodeled, following, among others, Tukey (1960), Boxand Tiao (1968, 1973), and Guttman (1973), by assum­ing that these anomalous observations come from anormal distribution with zero mean and variance h u 2

with h > l. Then the density funetion will be

f(u) = (l - f)!N(U IO, u 2) + f !N(U IO, h u 2). (6)

It is immediate that

var(u) = u 2(1 + f(h - 1»

(A)

y

p ,,,"~p --_._-----------------------,.,.. ".. ..

",. ..... l.. . "•• ••• a "••••• ','

~ :: .~ ... :,:,:'.:.. ' '. ... '... .,,''. : ", :" ..•• " #/IfI#" • ,,' •;.,::.. :.." ..

" " -: '" ." " ." '.. .

x

(8)Figure 1. TWO TYPES OF OUTLlERS. In case (A), the anomalous

value of the response leads to a vertical displacement of the regres­sion line and a large residual. In case (B), the atypical value of theexplicative variables determines the slope of the regression line butleads to a small residual.

and ! will be symmetrie with kurtosis

= 3( 1 + f(h2

- 1) _ 1) = 3 (ó - 1)'Y (l + f(h - 1»2

with ó > l. Therefore, the distribution will be leptokur-

12 Joural of Business & Economic Statistics, January 1984

tic. In this case, least squares is no longer optimal. Also, since the variances of the parameter estimators depend directly.on the error variance, which is greater than a2, such estimates will be unreliable and very unstable in different samples.

Finally, it is worthwhile to note that there are two types of possible outliers. If we consider sample points ( i, x/), one may find an anomalous value of y, for the corresponding xi, as in Figure 1 (A). The residual for point P will be large, and its effect will be a vertical displacement of the regression line. Alternatively, we may have an atypical value of the vector of explanatory variables that may not be associated with an atypical response, as in Figure 1 (B). Here, point P alone essen- tially determines the slope of the regression line, so that in spite of the anomalous nature of the situation the residual may be very small or even zero.

3. DIFFERENT APPROACHES TO SOLVE THE PROBLEM

The practical approaches to deal with the problems posed by outliers can be summarized as follows:

1. Appeal to the central limit theorem to justify the normality hypothesis in order to use least squares. Once the model has been estimated, use residual plots against the estimated values or the explanatory variables to detect possible outliers.

2. Use a Bayesian approach that involves building a formal model which incorporates the a priori expected deviation with respect to the standard linear model by means of parameters in an extended model.

3. Reject least squares in favor of a robust estimation procedure by selecting a function g that yields reason- ably efficient estimates under the normality assumption without suffering the instability of least squares in the presence of outliers.

4. Robustify, rather than the estimation criterion, the methodology followed in the construction of the linear model.

This requires checking at each stage that decisions are not determined by a small group of anomalous obser- vations. Hence, least squares is not abandoned, but instead the estimation process is supplemented by a battery of diagnostic checks that permit detection of potentially influential observations, measurements of their effects on estimated coefficients, and tests of whether they are significantly atypical.

In the next section we briefly review these ap- proaches.

3.1 The Use of Residual Plots

This is the alternative suggested by the vast majority of statistics and econometric textbooks. Its main limi- tation is that, at best, residual plots can only serve to detect outliers of type A in Figure 1. However, in the

context of a large sample of data on numerous variables, residual plots by themselves are not very helpful for detecting atypical multivariant values with several co- ordinates far from the mean values of the explanatory variables. Unfortunately, these outliers of type B in Figure 1 may have a great influence on the regression results and are, therefore, particularly damaging.

3.2 The Bayesian Approach

This has been used by Jeffreys (1961), Box and Tiao (1968), Chen and Box (1979 a, b, c), Box (1979, 1980) and others. It is possibly the most general and thorough approach to the problem, but we have been unable to implement it because of its computational complica- tions and the requirements of adequate software for its efficient application. Thus, we abstain here from further comments on it.

3.3 Robust Regression Estimates

The shortcomings of the least squares approach al- ready mentioned have led in the last 20 years to an extensive literature that aims to overcome these diffi- culties. Books by Mosteller and Tukey (1977), Huber (1981), and Barnett and Lewis (1978) present the prob- lem and contain numerous references.

The instability of least squares in the presence of outliers is due to the form of the functions g and V in expressions (2) and (3). In this case, g(u) = u2, {(u) = u, and wi(u) = i/(u)/u = 1. Therefore, since all obser- vations are given equal weight, those data with a large residual in absolute value carry the least squares equa- tion towards them-an obviously undesirable effect. It is intuitively clear that a function g that grows more slowly when u is large will give a smaller weight to such atypical observations, leading, consequently, to more robust estimates. This solution has been advocated by Huber (1964) and others (see Stigler 1973 for historical comments). Hogg (1979) and Huber (1981) present a good summary of this approach. See also Jeffreys (1961, p. 214 ff.).

These robust procedures are subject to three types of criticisms. First, the heuristic nature of the functions g or 4 introduce a certain arbitrariness in the formulation. Second, the small-sample properties of the estimates are unknown. Third, these methods are useful in deal- ing with outliers of type A in Figure 1, but they do not solve the problem posed by atypical values with small residuals.

With respect to the first criticism, Chen and Box (1979a) have established that the functions g and A

suggested in the literature are optimal for particular types of contamination. For instance, Huber's function g is optimal for a normal distribution with Laplace tails, which can be closely approximated by the contami- nated normal model presented in (6). Therefore, it can be argued that the methodology we use should depend

12 Joumal of Business & Economic Statistics, January 1984

tic. In this case, least squares is no longer optimal. AIso,since the variances of the parameter estimators dependdirectly,on the error variance, which is greater than u 2,

sueb,estimates will be unreliable and very unstable indifTerent samples.

Finally, it is worthwhile to note that there are twotypes of possible outliers. If we consider sample points(Yi, xI), one may find an anomalous value ofYi for thecorresponding Xi, as in Figure 1 (A). The residual forpoint P will be large, and its efTect will be a verticaldisplacement of the regression lineo Altematively, wemay have an atypical value ofthe vector ofexplanatoryvariables that may not be associated with an atypicalresponse, as in Figure 1 (B). Here, point P alone essen­tially determines the slope ofthe regression line, so thatin spite of the anomalous nature of the situation theresidual may be very small or even zero.

3. DIFFERENT APPROACHES TO SOLVETHE PROBLEM

The practical approaches to deal with the problemsposed by outliers can be summarized as follows:

l. Appeal to the centrallimit theorem to justify thenormality hypothesis in order to use least squares. Oncethe model has been estimated, use residual plots againstthe estimated values or the explanatory variables todetect possible outliers.

2. Use a Bayesian approach that involves building aformal model which incorporates the a priori expecteddeviation with respect to the standard linear model bymeans of parameters in an extended model.

3. Reject least squares in favor ofa robust estimationprocedure by selecting a function g that yields reason­ably efficient estimates under the normality assumptionwithout sufTering the instability of least squares in thepresence of outliers.

4. Robustify, rather than the estimation criterion,the methodology followed in the construction of thelinear model.

This requires checking at each stage that decisions arenot determined by a small group of anomalous obser­vations. Hence, least squares is not abandoned, butinstead the estimation process is supplemented by abattery of diagnostic checks that permit detection ofpotentially influential observations, measurements oftheir efTects on estimated coefficients, and tests ofwhether they are significantly atypical.

In the next section we briefly review these ap­proaches.

3.1 The Use of Residual Plots

This is the altemative suggested by the vast majorityof statistics and econometric textbooks. Its main limi­tation is that, at best, residual plots can only serve todetect outliers of type A in Figure l. However, in the

context ofa large sample ofdata on numerous variables,residual pl01s by themselves are not very helpful fordetecting atypical multivariant values with several co­ordinates far from the mean values of the explanatoryvariables. Unfortunately, these outliers of type B inFigure 1 may have a great influence on the regressionresults and are, therefore, particularly damaging.

3.2 The Bayesian Approach

This has been used by JefTreys (1961), Box and Tiao(1968), Chen and Box (1979 a, b, c), Box (1979, 1980)and others. It is possibly the most general and thoroughapproach to the problem, but we have been unable toimplement it because of its computational complica­tions and the requirements of adequate software for itsefficient application. Thus, we abstain here from furthercomments on it.

3.3 Robust Regression Estimates

The shortcomings of the least squares approach al­ready mentioned have led in the last 20 years to anextensive literature that aims to overcome these diffi­culties. Books by Mosteller and Tukey (1977), Huber(1981), and Bamett and Lewis (1978) present the prob­lem and contain numerous references.

The instability of least squares in the presence ofoutliers is due to the form of the functions g and \f; inexpressions (2) and (3). In this case, g(u) = u2

, 1/I(u) =u, and Wi(U) = 1/I(u)/u = l. Therefore, since all obser­vations are given equal weight, those data with a largeresidual in absolute value carry the least squares equa­tion towards them-an obviously undesirable efTect. Itis intuitively clear that a function g that grows moreslowly when u is large will give a smaller weight to suchatypical observations, leading, consequently, to morerobust estimates. This solution has been advocated byHuber (1964) and others (see Stigler 1973 for historicalcomments). Hogg (1979) and Huber (1981) present agood summary ofthis approach. See also JefTreys (1961,p. 214 fT.).

These robust procedures are subject to three types ofcriticisms. First, the heuristic nature of the functions gor 1/1 introduce a certain arbitrariness in the formulation.Second, the small-sample properties of the estimatesare unknown. Third, these methods are useful in deal­ing with outliers of type A in Figure 1, but they do notsolve the problem posed by atypical values with smallresiduals.

With respect to the first criticism, Chen and Box(1979a) have established that the functions g and 1/1suggested in the literature are optimal for particulartypes of contamination. For instance, Huber's functiongis optimal for a normal distribution with Laplace tails,which can be closely approximated by the contami­nated normal model presented in (6). Therefore, it canbe argued that the methodology we use should depend

12 Joumal of Business & Economic Statistics, January 1984

tic. In this case, least squares is no longer optimal. AIso,since the variances of the parameter estimators dependdirectly,on the error variance, which is greater than u 2,

sueb,estimates will be unreliable and very unstable indifTerent samples.

Finally, it is worthwhile to note that there are twotypes of possible outliers. If we consider sample points(Yi, xI), one may find an anomalous value ofYi for thecorresponding Xi, as in Figure 1 (A). The residual forpoint P will be large, and its efTect will be a verticaldisplacement of the regression lineo Altematively, wemay have an atypical value ofthe vector ofexplanatoryvariables that may not be associated with an atypicalresponse, as in Figure 1 (B). Here, point P alone essen­tially determines the slope ofthe regression line, so thatin spite of the anomalous nature of the situation theresidual may be very small or even zero.

3. DIFFERENT APPROACHES TO SOLVETHE PROBLEM

The practical approaches to deal with the problemsposed by outliers can be summarized as follows:

l. Appeal to the centrallimit theorem to justify thenormality hypothesis in order to use least squares. Oncethe model has been estimated, use residual plots againstthe estimated values or the explanatory variables todetect possible outliers.

2. Use a Bayesian approach that involves building aformal model which incorporates the a priori expecteddeviation with respect to the standard linear model bymeans of parameters in an extended model.

3. Reject least squares in favor ofa robust estimationprocedure by selecting a function g that yields reason­ably efficient estimates under the normality assumptionwithout sufTering the instability of least squares in thepresence of outliers.

4. Robustify, rather than the estimation criterion,the methodology followed in the construction of thelinear model.

This requires checking at each stage that decisions arenot determined by a small group of anomalous obser­vations. Hence, least squares is not abandoned, butinstead the estimation process is supplemented by abattery of diagnostic checks that permit detection ofpotentially influential observations, measurements oftheir efTects on estimated coefficients, and tests ofwhether they are significantly atypical.

In the next section we briefly review these ap­proaches.

3.1 The Use of Residual Plots

This is the altemative suggested by the vast majorityof statistics and econometric textbooks. Its main limi­tation is that, at best, residual plots can only serve todetect outliers of type A in Figure l. However, in the

context ofa large sample ofdata on numerous variables,residual pl01s by themselves are not very helpful fordetecting atypical multivariant values with several co­ordinates far from the mean values of the explanatoryvariables. Unfortunately, these outliers of type B inFigure 1 may have a great influence on the regressionresults and are, therefore, particularly damaging.

3.2 The Bayesian Approach

This has been used by JefTreys (1961), Box and Tiao(1968), Chen and Box (1979 a, b, c), Box (1979, 1980)and others. It is possibly the most general and thoroughapproach to the problem, but we have been unable toimplement it because of its computational complica­tions and the requirements of adequate software for itsefficient application. Thus, we abstain here from furthercomments on it.

3.3 Robust Regression Estimates

The shortcomings of the least squares approach al­ready mentioned have led in the last 20 years to anextensive literature that aims to overcome these diffi­culties. Books by Mosteller and Tukey (1977), Huber(1981), and Bamett and Lewis (1978) present the prob­lem and contain numerous references.

The instability of least squares in the presence ofoutliers is due to the form of the functions g and \f; inexpressions (2) and (3). In this case, g(u) = u2

, 1/I(u) =u, and Wi(U) = 1/I(u)/u = l. Therefore, since all obser­vations are given equal weight, those data with a largeresidual in absolute value carry the least squares equa­tion towards them-an obviously undesirable efTect. Itis intuitively clear that a function g that grows moreslowly when u is large will give a smaller weight to suchatypical observations, leading, consequently, to morerobust estimates. This solution has been advocated byHuber (1964) and others (see Stigler 1973 for historicalcomments). Hogg (1979) and Huber (1981) present agood summary ofthis approach. See also JefTreys (1961,p. 214 fT.).

These robust procedures are subject to three types ofcriticisms. First, the heuristic nature of the functions gor 1/1 introduce a certain arbitrariness in the formulation.Second, the small-sample properties of the estimatesare unknown. Third, these methods are useful in deal­ing with outliers of type A in Figure 1, but they do notsolve the problem posed by atypical values with smallresiduals.

With respect to the first criticism, Chen and Box(1979a) have established that the functions g and 1/1suggested in the literature are optimal for particulartypes of contamination. For instance, Huber's functiongis optimal for a normal distribution with Laplace tails,which can be closely approximated by the contami­nated normal model presented in (6). Therefore, it canbe argued that the methodology we use should depend

Peha and Ruiz-Castillo: Robust Methods of Building Regression Models

on the specific structure of each particular sample. The third criticism leads to generalized M-estimates in which the weights wi in (4) depend not only on the residual but also on the observation's influence meas- ured by its distance to the center of the scatter of points as in Krasker and Welsch (1982). Although this ap- proach partially solves the problem, the solution re- mains heuristic and obtaining sampling properties of estimates is difficult.

3.4 Robustification of the Methodology

The main reason for constructing robust estimation methods is to guarantee that our results will not be fundamentally dependent on a few anomalous obser- vations. However, the fact that an estimate might be very sensitive to a small set of outliers does not mean that it is inefficient in every conceivable case. Before rejecting an estimation procedure, it is reasonable to investigate whether its good properties are preserved in each particular sample.

Therefore, given a data set susceptible to being treated by means of a linear model, it is pertinent to ask the following questions: (a) Does this sample contain ob- servations whose a priori influence is much greater than the rest in the construction of the model? (b) Is it possible to measure the actual influence that each in- dividual observation has a posteriori on the parameter estimates? (c) Does there exist a test to determine whether an observation constitutes an outlier?

We now review the answers that have been given to these questions. The first issue has been approached with the help of the "hat" matrix, whose properties have been discussed by Huber (1975), Hoaglin and Welsch (1978), Cook (1977, 1979), Belsley, Kuh, and Welsch (1980), and Weisberg (1980).

The "hat" matrix V projects the vector Y on the linear space generated by the columns of X:

Y = V Y, V = X(X'X)-'X'. (8)

The matrix V is symmetric and idempotent. Its impor- tance for our purpose lies in the fact that e = (I - V)U = (I - V)Y, from which one obtains

var(ei) a2(1 - vii) (9)

with vii = (xi - )'(X'X)')(x - x), where X is the centered matrix of the observation, and l/n (X'X) is the variance and covariance matrix for the explanatory variables.

Therefore, except for a constant term, vii represents the Mahalanobis distance of an observation xi to the center of gravity of the scatter of points, X. If a point xi is very far from X, its vii will be large and the variance of the corresponding residual will be small, as (9) indi- cates. In the limit, if vii = 1, the variance will be zero, which means that the point's position relative to the rest forces the regression equation to go through it, irrespective of the observed value for yi.

It can be concluded that sample points with high vii are, potentially, influential. Since V is a projection matrix, 0 < vii < 1. Moreover, since the trace of an idempotent matrix is equal to its rank, E= I vii = k where k is the rank of X. Consequently, the average value of the vii's is kin. Following Belsley, Kuh, and Welsch (1980), in practice an observation is considered potentially influential if vii> 2k/n.

Huber (1981) has suggested another interesting inter- pretation for the vii terms. Since V is idempotent, vii = j= v2. Thus, taking (8) into account

n

var( i) = S v0 var(yj) = 2vii. j=1

Therefore, recalling that the sample mean of h inde- pendent observations with common variance a2 has variance a2/h, it is clear that 1/vii can be interpreted as the number of equivalent observations used to compute Yi. If vii = 1, then y, is computed with a single observa- tion, its residual is zero (see Equation 9).

In an alternative approach to determine a priori influential observations, Andrews and Pregibon (1978) use the change of "volume" of the scatter of points when one eliminates a subset of observations. However, Draper and John (1981) have established that a measure of a single point's influence in this approach is precisely 1 - vii.

The second issue is how to determine the actual influence on the model of each observation in a given sample. There are several ways of doing this based on the empirical influence function IEA = SA - 6, where AA is the estimate obtained after eliminating the subset

A of observations, and f is the estimate with the full sample (see Cook and Weisberg 1980).

A simple way of obtaining a scalar measure of A's influence is to consider the distance between AA and A in a metric with statistical meaning. Cook (1977) intro- duced such a measure by

DA = (A - )(X'X)( - )/ks2

where s2 is the regression residual variance and (X'X)-'s2 is an estimate of the variance covariance matrix for ,.

Using the subindex (i) to indicate that a referred-to characteristic has been calculated without the ith obser- vation, the Cook distance can easily be obtained from

= ((i) - A)' (X'X)()(i - 6)/ks2

= e vii/ks2(1 - vii)2.

It is interesting to note that Di can also be written as

Di = (Y(i - Y)' (Y(, - Y)/ks2.

indicating that Di measures the Euclidean distance in which the prediction vector Y is translated after elimi- nating the ith observation from the regression.

Finally, the construction of tests to determine the

13 Peña and Ruiz-Castillo: Robust Methods of Building Regression Models 13

on the specific structure of each particular sample. Thethird criticism leads to generalized M-estimates inwhich the weights W¡ in (4) depend not only on theresidual but also on the observation's influence meas­ured by its distance to the center of the scatter of pointsas in Krasker and Welsch (1982). Although this ap­proach partiaIly solves the problem, the solution re­mains heuristic and obtaining sampling properties ofestimates is difficult.

3.4 Robustification of the Methodology

The main reason for constructing robust estimationmethods is to guarantee that our results will not befundamentaIly dependent on a few anomalous obser­vations. However, the fact that an estimate might bevery sensitive to a smaIl set of outliers does not meanthat it is inefficient in every conceivable case. Beforerejecting an estimation procedure, it is reasonable toinvestigate whether its good properties are preserved ineaeh particular sample.

Therefore, given a data set susceptible to being treatedby means of a linear model, it is pertinent to ask thefoIlowing questions: (a) Does this sample eontain ob­servations whose a priori influence is mueh greater thanthe rest in the construction of the model? (b) Is itpossible to measure the actual influence that eaeh in­dividual observation has a posteriori on the parameterestimates? (e) Does there exist a test to determinewhether an observation constitutes an outlier?

We now review the answers that have been given tothese questions. The first issue has been approachedwith the help ofthe "hat" matrix, whose properties havebeen discussed by Huber (1975), Hoaglin and Welseh(1978), Cook (1977, 1979), Belsley, Kuh, and Welseh(1980), and Weisberg (1980).

The "hat" matrix V projects the vector Y on thelinear spaee generated by the eolumns of X:

V= V Y, V = X(X'XtIX'. (8)

The matrix V is symmetrie and idempotent. Its impor­tanee for our purpose lies in the fact that e = (1 - V)U= (1 - V)Y, from which one obtains

var(e¡) = 0"2(1 - Vii) (9)

with Vii = (x¡ - x)'(X'Xtl(x¡ - x), where X is thecentered matrix of the observation, and l/n (X'X) isthe varianee and covariance matrix for the explanatoryvariables.

Therefore, except for a constant term, Vii representsthe Mahalanobis distanee of an observation Xi to thecenter ofgravity of the seatter of points, X. If a point Xi

is very far from X, its Vii will be large and the varianceof the eorresponding residual will be smaIl, as (9) indi­cates. In the limit, if Vi¡ = 1, the varianee will be zero,which means that the point's position relative to therest forees the regression equation to go through it,irrespective ofthe observed value for Yi.

It can be concluded that sample points with high Vii

are, potentially, influential. Since V is a projectionmatrix, O < Vii ~ l. Moreover, since the trace of anidempotent matrix is equal to its rank, r7= 1 Vii = k,where k is the rank of X. Consequently, the averagevalue of the v¡¡'s is kln. FoIlowing Belsley, Kuh, andWelsch (1980), in praetice an observation is eonsideredpotentiaIly influential if Vii> 2kln.

Huber (1981) has suggested another interesting inter­pretation for the Vii terms. Sinee V is idempotent, Vii =

r1=1 V7j. Thus, taking (8) into aceountn

var(y¡)= L v'f;var(Yj)=0"2Vii •j=l

Therefore, recalling that the sample mean of h inde­pendent observations with common varianee 0"2 hasvariance O" 21h, it is clear that 1IV¡i can be interpreted asthe number ofequivalent observations used to computeYi. If Vii = 1, then Yi is computed with a single observa­tion, its residual is zero (see Equation 9).

In an alternative approach to determine a prioriinfluential observations, Andrews and Pregibon (1978)use the change of "volume" of the scatter of pointswhen one eliminates a subset ofobservations. However,Draper and John (1981) have established that a measureofa single point's influenee in this approaeh is precisely1 - Vii.

The second issue is how to determine the actualinfluence on the model of eaeh observation in a givensample. There are several ways of doing this based onthe empirical influence function lEA = IJA - {J, whereIJA is the estimate obtained after eliminating the subsetA of observations, and {J is the estimate with the fuIlsample (see Cook and Weisberg 1980).

A simple way of obtaining a sealar measure of A'sinfluenee is to eonsider the distance between IJA and {Jin a metric with statistieal meaning. Cook (1977) intro­duced such a measure by

DA = (IJA - {J)'(X'X)(IJA - {J)/ks 2,

where S2 is the regression residual varianee and(X'xt 1S2 is an estimate of the variance eovariancematrix for {J.

Using the subindex (i) to indicate that a referred-toeharacteristic has been calculated without the ith obser­vation, the Cook distanee can easily be obtained from

D¡ = (IJU) - {J)' (X'X)(IJ(i) - {J)/ks 2

= eh;;/ks2(1 - Vii)2.

It is interesting to note that D¡ can also be written as

D¡ = (VU) - Y)' (V(/) - Y)/ks 2•

indicating that D¡ measures the Euclidean distanee inwhieh the prediction vector Y is translated after elimi­nating the ith observation from the regression.

Finally, the construetion of tests to determine the

Peña and Ruiz-Castillo: Robust Methods of Building Regression Models 13

on the specific structure of each particular sample. Thethird criticism leads to generalized M-estimates inwhich the weights W¡ in (4) depend not only on theresidual but also on the observation's influence meas­ured by its distance to the center of the scatter of pointsas in Krasker and Welsch (1982). Although this ap­proach partiaIly solves the problem, the solution re­mains heuristic and obtaining sampling properties ofestimates is difficult.

3.4 Robustification of the Methodology

The main reason for constructing robust estimationmethods is to guarantee that our results will not befundamentaIly dependent on a few anomalous obser­vations. However, the fact that an estimate might bevery sensitive to a smaIl set of outliers does not meanthat it is inefficient in every conceivable case. Beforerejecting an estimation procedure, it is reasonable toinvestigate whether its good properties are preserved ineaeh particular sample.

Therefore, given a data set susceptible to being treatedby means of a linear model, it is pertinent to ask thefoIlowing questions: (a) Does this sample eontain ob­servations whose a priori influence is mueh greater thanthe rest in the construction of the model? (b) Is itpossible to measure the actual influence that eaeh in­dividual observation has a posteriori on the parameterestimates? (e) Does there exist a test to determinewhether an observation constitutes an outlier?

We now review the answers that have been given tothese questions. The first issue has been approachedwith the help ofthe "hat" matrix, whose properties havebeen discussed by Huber (1975), Hoaglin and Welseh(1978), Cook (1977, 1979), Belsley, Kuh, and Welseh(1980), and Weisberg (1980).

The "hat" matrix V projects the vector Y on thelinear spaee generated by the eolumns of X:

V= V Y, V = X(X'XtIX'. (8)

The matrix V is symmetrie and idempotent. Its impor­tanee for our purpose lies in the fact that e = (1 - V)U= (1 - V)Y, from which one obtains

var(e¡) = 0"2(1 - Vii) (9)

with Vii = (x¡ - x)'(X'Xtl(x¡ - x), where X is thecentered matrix of the observation, and l/n (X'X) isthe varianee and covariance matrix for the explanatoryvariables.

Therefore, except for a constant term, Vii representsthe Mahalanobis distanee of an observation Xi to thecenter ofgravity of the seatter of points, X. If a point Xi

is very far from X, its Vii will be large and the varianceof the eorresponding residual will be smaIl, as (9) indi­cates. In the limit, if Vi¡ = 1, the varianee will be zero,which means that the point's position relative to therest forees the regression equation to go through it,irrespective ofthe observed value for Yi.

It can be concluded that sample points with high Vii

are, potentially, influential. Since V is a projectionmatrix, O < Vii ~ l. Moreover, since the trace of anidempotent matrix is equal to its rank, r7= 1 Vii = k,where k is the rank of X. Consequently, the averagevalue of the v¡¡'s is kln. FoIlowing Belsley, Kuh, andWelsch (1980), in praetice an observation is eonsideredpotentiaIly influential if Vii> 2kln.

Huber (1981) has suggested another interesting inter­pretation for the Vii terms. Sinee V is idempotent, Vii =

r1=1 V7j. Thus, taking (8) into aceountn

var(y¡)= L v'f;var(Yj)=0"2Vii •j=l

Therefore, recalling that the sample mean of h inde­pendent observations with common varianee 0"2 hasvariance O" 21h, it is clear that 1IV¡i can be interpreted asthe number ofequivalent observations used to computeYi. If Vii = 1, then Yi is computed with a single observa­tion, its residual is zero (see Equation 9).

In an alternative approach to determine a prioriinfluential observations, Andrews and Pregibon (1978)use the change of "volume" of the scatter of pointswhen one eliminates a subset ofobservations. However,Draper and John (1981) have established that a measureofa single point's influenee in this approaeh is precisely1 - Vii.

The second issue is how to determine the actualinfluence on the model of eaeh observation in a givensample. There are several ways of doing this based onthe empirical influence function lEA = IJA - {J, whereIJA is the estimate obtained after eliminating the subsetA of observations, and {J is the estimate with the fuIlsample (see Cook and Weisberg 1980).

A simple way of obtaining a sealar measure of A'sinfluenee is to eonsider the distance between IJA and {Jin a metric with statistieal meaning. Cook (1977) intro­duced such a measure by

DA = (IJA - {J)'(X'X)(IJA - {J)/ks 2,

where S2 is the regression residual varianee and(X'xt 1S2 is an estimate of the variance eovariancematrix for {J.

Using the subindex (i) to indicate that a referred-toeharacteristic has been calculated without the ith obser­vation, the Cook distanee can easily be obtained from

D¡ = (IJU) - {J)' (X'X)(IJ(i) - {J)/ks 2

= eh;;/ks2(1 - Vii)2.

It is interesting to note that D¡ can also be written as

D¡ = (VU) - Y)' (V(/) - Y)/ks 2•

indicating that D¡ measures the Euclidean distanee inwhieh the prediction vector Y is translated after elimi­nating the ith observation from the regression.

Finally, the construetion of tests to determine the

14 Joural of Business & Economic Statistics, January 1984

atypical data in regression models has used numerous approaches (see Barnett and Lewis 1978 for a survey of this topic). When the problem is considered within a likelihood ratio testing approach, the resulting test sta- tistic is a monotonic function of the Studentized resid- uals

ri= eils - vii, (10)

where the least squares residual has been divided by its estimated standard deviation.

A shortcoming of this approach is that the distribu- tion of ri under the normality assumption is not Student t, because numerator and denominator are not inde- pendent. However, the substitution of s(i) for s in (10) yields a Student t distribution with n - k - 1 degrees of freedom. For computational reasons (see Weisberg 1980), it is convenient to express it as

ti = ri V(n - k - l)/(n - k- r),

where ri is given by (10). The relevant distribution for obtaining the test's significance level is that of the maximum value of a sample of t statistics with n - k - 1 degrees of freedom, which value is unknown. How- ever, approximate critical values have been tabulated using the Bonferroni inequality (see Miller 1977, and Cook and Prescott 1981).

In conclusion, the statistics vii, Di, and t, constitute the basis for the methodological robustification of the linear model. The vii terms depend only on the prede- termined variables and measure the potential influence of each observation taking into account its position relative to the rest of the sample. We would have a robust design if all points had analogous vii values. The Cook Di statistic captures the actual influence of each observation on the estimated parameters of the predic- tion vector Y. The statistic is interesting because it indicates the practical irrelevance of worrying about sample observations that, although anomalous, have little influence on the model. Finally, the t statistic summarizes both features and is used as a formal test of whether a single observation is an outlier. The next section contains an application of this way of attacking the problems posed by outliers.

4. A MODEL FOR THE DETERMINANTS OF RENTAL HOUSING VALUES IN THE

MADRID METROPOLITAN AREA

4.1 The Problem and the Data

In Spain, government intervention in the rental hous- ing sector takes two forms. First, several public insti- tutions promote-directly or indirectly-the construc- tion of public housing at rents below the market level. Second, since 1920 the government has enforced com- pulsory lease renewal and rent controls in the private sector. In 1964 rents were liberalized on new contracts.

Therefore, the rental market sector includes only pri- vate housing units occupied after 1964.

Our problem in this section is to build an explanatory model of market rental values in terms of the observed traits of dwelling units in the Madrid Metropolitan Area (MMA hereafter). This is not a behavioral relationship but a function that gives the rent resulting from the interaction of supply and demand for each variety of the differentiated product. The partial derivatives of that function are interpreted as the implicit or hedonic prices of the corresponding characteristics.

The final aim is to use the estimated model to assess the economic advantages and the distributional conse-

Table 1. Structural Housing Traits Name Description

AGE

OCUP

M2 ROOMI NFL DET

Age of building AXIX

A4164 A6574

a. Continuous Variables Building age in years since

its construction Years of occupancy of the

housing unit Space in squared meters Space in number of rooms Number of floors Deterioration state of the

building

b. Dummy Variables

Built in XIX century Built in 1900-1940 Built in 1941-1964 Built in 1965-1974

Mean 21.9

3.6

68.0 3.6 5.1 4.6

Type of building MAGL "Marginal" housing (in bad con-

dition) CHTW Chalet or townhouse APT Detached apartment building

Other apartment building

Type of promotion PRla Private firm

User's cooperative, particular in- dividual, selfconstruction

UNKa Unknown

Hygienic services LESS Less than a full bathroom

-One full bathroom TWOM Two or more bathrooms

Payments of utilities HOTWa Hot water bill included in other

concept HEAT" Heating bill included in other

concept BEXP8 Building expenditures included

in other concept Other variables

TELPH With telephone CHEAT With central heating GAR With garage JANa With a janitor FURN With furniture FTENa First tenancy ' Nonsignificant variables in the exploratory analysis.

Standard Deviation

23.0

' 2.3

42.8 1.2 2.8

17.5

Percent

7 19 21 53

100

3

4 38 55

100

27 45

28 100

19 70 11

100

47

38

19

36 20 7

34 15 18

14 Joumal of Business & Economic Statistics, January 1984

4. A MODEL FOR THE DETERMINANTS OFRENTAL HOUSING VALUES IN THE

MADRID METROPOLITAN AREA

atypical data in regression models has used numerousapproaches (see Barnett and Lewis 1978 for a survey ofthis topic). When the problem is considered within alikelihood ratio testing approach, the resulting test sta­tistic is a monotonic function of the Studentized resid­uals

4.1 The Problem and the Data

In Spain, government intervention in the rental hous­ing sector takes two forms. First, several public insti­tutions promote-directly or indirectly-the construc­tion of public housing at rents below the market level.Second, since 1920 the government has enforced com­pulsory lease renewal and rent controls in the privatesector. In 1964 rents were liberalized on new contracts.

where the least squares residual has been divided by itsestimated standard deviation.

A shortcoming of this approach is that the distribu­tion of r¡ under the normality assumption is not Studentt, because numerator and denominator are not inde­pendent. However, the substitution of s(i) for s in (10)yields a Student t distribution with n-k - 1 degreesof freedom. For computational reasons (see Weisberg1980), it is convenient to express it as

3620

7341518

7192153

100

3

43855

100

2745

28100

197011

100

47

38

19

Percent

Table 1. Structural Housing Traits

Name Description

Standarda. Continuous Variables Mean Deviation

AGE Building age in years since 21.9 23.0its construction

OCUP Years of occupancy of the 3.6 • 2.3housing unit

M2 Space in squared meters 68.0 42.8ROOM" Space in number of rooms 3.6 1.2NFL Number of f100rs 5.1 2.8DET Deterioration state of the 4.6 17.5

building

Age of buildingAXIX Built in XIX century

Builtin 1900-1940A4164 Built in 1941-1964A6574 Built in 1965-1974

b. Dummy Variables

• Nonsignificant variables in the exploratory analysis.

Therefore, the renta! market sector ineludes only pri­vate housing units occupied after 1964.

Our problem in this section is to build an explanatorymodel of market rental values in terms of the observedtraits ofdwelling units in the Madrid Metropolitan Area(MMA hereafter). This is not a behavioral relationshipbut a function that gives the rent resulting from theinteraction of supply and demand for each variety ofthe differentiated product. The partial derivatives ofthat function are interpreted as the implicit or hedonicprices of the corresponding characteristics.

The final aim is to use the estimated model to assessthe economic advantages and the distributional conse-

Type of buildingMAGL "Marginal" housing (in bad con-

dition)CHTW Chalet or townhouseAPT Detached apartment building

Other apartment building

Type of promotionPRla Private firm

User's cooperative, particular in­dividual, selfconstruction

UNKa Unknown

Hygienic servicesLESS Less than a full bathroom

009 full bathroomTWOM Two or more bathrooms

Payrllents of utilitiesHOTWa Hot water bill included in other

conceptHEAra Heating bill included in other

conceptBEXpa Building expenditures included

in other conceptOther variables

TELPH With telephoneCHEAT With central heatingGAR With garageJANa With a janitorFURN With furnitureFTEW First tenancy

(10)r¡=e¡/s~,

t¡ = r¡ .J(n - k - l)/(n - k - r7),

where r¡ is given by (10). The relevant distribution forobtaining the test's significance level is that of themaximum value of a sample of t statistics with n-k ­1 degrees of freedom, which value is unknown. How­ever, approximate critical values have been tabulatedusing the Bonferroni inequality (see Miller 1977, andCook and Prescott 1981).

In conelusion, the statistics Vii, D¡, and t¡ constitutethe basis for the methodological robustification of thelinear model. The V¡¡ terms depend only on the prede­termined variables and measure the potential influenceof each observation taking into account its positionrelative to the rest of the sample. We would have arobust design if aH points had analogous V¡¡ values. TheCook Di statistic captures the actual influence of eachobservation on the estimated parameters of the predic­tion vector Y. The statistic is interesting because itindicates the practical irrelevance of worrying aboutsample observations that, although anomalous, havelittle influence on the model. FinaHy, the t statisticsummarizes both features and is used as a formal testof whether a single observation is an outlier. The nextsection contains an application of this way of attackingthe problems posed by outliers.

14 Joumal of Business & Economic Statistics, January 1984

4. A MODEL FOR THE DETERMINANTS OFRENTAL HOUSING VALUES IN THE

MADRID METROPOLITAN AREA

atypical data in regression models has used numerousapproaches (see Barnett and Lewis 1978 for a survey ofthis topic). When the problem is considered within alikelihood ratio testing approach, the resulting test sta­tistic is a monotonic function of the Studentized resid­uals

4.1 The Problem and the Data

In Spain, government intervention in the rental hous­ing sector takes two forms. First, several public insti­tutions promote-directly or indirectly-the construc­tion of public housing at rents below the market level.Second, since 1920 the government has enforced com­pulsory lease renewal and rent controls in the privatesector. In 1964 rents were liberalized on new contracts.

where the least squares residual has been divided by itsestimated standard deviation.

A shortcoming of this approach is that the distribu­tion of r¡ under the normality assumption is not Studentt, because numerator and denominator are not inde­pendent. However, the substitution of s(i) for s in (10)yields a Student t distribution with n-k - 1 degreesof freedom. For computational reasons (see Weisberg1980), it is convenient to express it as

3620

7341518

7192153

100

3

43855

100

2745

28100

197011

100

47

38

19

Percent

Table 1. Structural Housing Traits

Name Description

Standarda. Continuous Variables Mean Deviation

AGE Building age in years since 21.9 23.0its construction

OCUP Years of occupancy of the 3.6 • 2.3housing unit

M2 Space in squared meters 68.0 42.8ROOM" Space in number of rooms 3.6 1.2NFL Number of f100rs 5.1 2.8DET Deterioration state of the 4.6 17.5

building

Age of buildingAXIX Built in XIX century

Builtin 1900-1940A4164 Built in 1941-1964A6574 Built in 1965-1974

b. Dummy Variables

• Nonsignificant variables in the exploratory analysis.

Therefore, the renta! market sector ineludes only pri­vate housing units occupied after 1964.

Our problem in this section is to build an explanatorymodel of market rental values in terms of the observedtraits ofdwelling units in the Madrid Metropolitan Area(MMA hereafter). This is not a behavioral relationshipbut a function that gives the rent resulting from theinteraction of supply and demand for each variety ofthe differentiated product. The partial derivatives ofthat function are interpreted as the implicit or hedonicprices of the corresponding characteristics.

The final aim is to use the estimated model to assessthe economic advantages and the distributional conse-

Type of buildingMAGL "Marginal" housing (in bad con-

dition)CHTW Chalet or townhouseAPT Detached apartment building

Other apartment building

Type of promotionPRla Private firm

User's cooperative, particular in­dividual, selfconstruction

UNKa Unknown

Hygienic servicesLESS Less than a full bathroom

009 full bathroomTWOM Two or more bathrooms

Payrllents of utilitiesHOTWa Hot water bill included in other

conceptHEAra Heating bill included in other

conceptBEXpa Building expenditures included

in other conceptOther variables

TELPH With telephoneCHEAT With central heatingGAR With garageJANa With a janitorFURN With furnitureFTEW First tenancy

(10)r¡=e¡/s~,

t¡ = r¡ .J(n - k - l)/(n - k - r7),

where r¡ is given by (10). The relevant distribution forobtaining the test's significance level is that of themaximum value of a sample of t statistics with n-k ­1 degrees of freedom, which value is unknown. How­ever, approximate critical values have been tabulatedusing the Bonferroni inequality (see Miller 1977, andCook and Prescott 1981).

In conelusion, the statistics Vii, D¡, and t¡ constitutethe basis for the methodological robustification of thelinear model. The V¡¡ terms depend only on the prede­termined variables and measure the potential influenceof each observation taking into account its positionrelative to the rest of the sample. We would have arobust design if aH points had analogous V¡¡ values. TheCook Di statistic captures the actual influence of eachobservation on the estimated parameters of the predic­tion vector Y. The statistic is interesting because itindicates the practical irrelevance of worrying aboutsample observations that, although anomalous, havelittle influence on the model. FinaHy, the t statisticsummarizes both features and is used as a formal testof whether a single observation is an outlier. The nextsection contains an application of this way of attackingthe problems posed by outliers.

Peia and Ruiz-Castillo: Robust Methods of Building Regression Models

quences of public housing and rent control policies in Spain. The results of such an assessment will be re- ported elsewhere in Peina and Ruiz-Castillo (1983).

Our data come from a 1974 survey of 4,067 housing units in the MMA (or .4% of the total number for that area). The sample used here consists of 460 private rental dwellings occupied between 1964 and 1974. Such data will be made available by the authors upon request.

Hedonic price functions for urban areas usually in- volve two types of explanatory variables: a set of traits characterizing dwelling units of the buildings to which they belong, and a set of locational characteristics. Tables 1 and 2 describe the variables of either type that we could measure.

Some comments on measurement problems are in order:

1. In many cases, to get a value for a building's age, we had to consider the midpoint of the known construc- tion interval. This led to certain discontinuities for this variable. When we only knew that the building dated from the 19th century, the AGE variable was assigned the value 85, which implies that the construction date was assumed to be 1880.

2. For each building, interviewers recorded a number of points for each of eight types of observed defects. Thus, the greater the value of the state of deterioration variable (DET), the worse the condition of the corre- sponding building.

3. The accessibility variable is a weighted index of transportation times from each of the 85 zones that make up the MMA to six centrally located neighbor- hoods where 25% of all employment is concentrated. The weights reflect the relative importance of employ- ment in each of those neighborhoods relative to total employment in the six. Accessibility is measured in minutes of private and public transportation, weighted by the utilization rate of both modes for all transpor- tation purposes in the metropolitan area.

4. The index HIGH is the first principal component

Table 2. Locational Variablesa Standard

Name Description Mean Deviatio

ACC Accessibility index in 41.2 17.6 minutes of transportation time

POPD Population density in 19,950 20,683 inhabitants per squared kilometer

RENTb Average monthly family 18,826 4,764 income in pesetas

HIGH Socioeconomic index .06 .88 OLD Buildings age index .13 1.14 INFRAb Index of housing in bad -.07 .78

conditions SCHOOLb Primary and secondary 7,351 4,345

school enrollments *All variables take values in the 85 zones that make up the Madrid Metropolitan Area. b Nonsignificant variables in the explanatory analysis.

explaining 34% of the variance of a set of 12 variables representing different socioeconomic aspects of the 85 zones in the MMA.

5. We also applied principal components analysis to 5 variables describing several buildings' characteristics in each of the 85 zones. The first principal component (OLD), explaining 38% of the variance, was dominated by the average age of buildings in each zone. The second component (INFRA), explaining an additional 21% of the variance, was interpreted as an index of the impor- tance of housing in bad condition in each zone.

6. Of all the variables that might conceivably repre- sent local public sector activities, we could only meas- ure primary and secondary school enrollment (SCHOOL).

7. It is always difficult to ascertain whether the rental amount in monthly expenditures in surveys of this type reflects gross or net rent. In our case, we only knew whether the payments for certain utilities were included or not in other measures, but did not know whether such a measure was the rent bill itself. We tried to account for these effects by constructing three dummy variables that take the value 1 if the person interviewed declared, respectively, that payments for hot water, heating, or building expenditures were included in the other measure.

4.2 The Selection of the Functional Form

To decide on the best functional form, we followed an iterative process that began with an exploratory analysis of the data to obtain a reasonable first repre- sentation. Next, we concentrated on the identification of possible outliers. Finally, we carried out the maxi- mum likelihood estimation of the dependent variable's best transformation and performed various checks to find an adequate metric for the predetermined varia- bles.

For the initial exploratory analysis, we used three types of tools: bivariate plots of the response variable with respect to each explanatory variable, the empirical distribution of each variable, and residual plots from preliminary regressions with several sets of explanatory variables. The results were the following:

1. The rent variable required transformation, possi- bly a logarithmic transformation. Plots of ei = F( yi) for the untransformed y variable showed curvature and heteroskedasticity. Moreover, the distribution of both the rent variable and the regression residuals had a strong positive asymmetry. Finally, the logarithm has a clear economic interpretation, indicating that each trait's effect depends on the level reached by the other housing attributes.

2. To obtain linearity for the response, once rents were expressed in logs, the logarithmic transformation was also applied to the continuous variables OCUP,

15 Peña and Ruiz-Castillo: Robust Methods of Building Regression Models 15

quences of public housing and rent control policies inSpain. The results of such an assessment wiIl be re­ported elsewhere in Peña and Ruiz-Castillo (1983).

Our data come from a 1974 survey of 4,067 housingunits in the MMA (or .4% ofthe total number for thatarea). The sample used here consists of 460 privaterental dwellings occupied between 1964 and 1974. Suchdata wiIl be made available by the authors upon request.

Hedonic price functions for urban areas usuaIly in­volve two types of explanatory variables: a set of traitscharacterizing dwelling units of the buildings to whichthey belong, and a set of locational characteristics.Tables 1 and 2 describe the variables of either type thatwe could measure.

Sorne comments on measurement problems are inorder:

l. In many cases, to get a value for a building's age,we had to consider the midpoint ofthe known construc­tion interval. This led to certain discontinuities for thisvariable. When we only knew that the building datedfrom the 19th century, the AGE variable was assignedthe value 85, which implies that the construction datewas assumed to be 1880.

2. For each building, interviewers recorded a numberof points for each of eight types of observed defects.Thus, the greater the value of the state of deteriorationvariable (DET), the worse the condition of the corre­sponding building.

3. The accessibility variable is a weighted index oftransportation times from each of the 85 zones thatmake up the MMA to six centraIly located neighbor­hoods where 25% of aIl employment is concentrated.The weights reflect the relative importance of employ­ment in each of those neighborhoods relative to totalemployment in the six. Accessibility is measured inminutes of private and public transportation, weightedby the utilization rate of both modes for aIl transpor­tation purposes in the metropolitan area.

4. The index HIGH is the first principal component

Table 2. Locational Variables8

Name Description MeanStandardDeviation

ACC Accessibility index in 41.2 17.6minutes of transportationtime

POPO Population density in 19,950 20,683inhabitants per squaredkilometer

RENr' Average monthly family 18,826 4,764income in pesetas

HIGH Socioeconomic index .06 .88OLO Buildings age index .13 1.14INFRAb Index of housing in bad -.07 .78

conditionsSCHOOLb Primary and secondary 7,351 4,345

school enrollments

• AII variables take values in !he 85 zones !hat make up !he Madrid Metropolitan Area.b Nonsignilicant variables in !he explanatory analysis.

explaining 34% ofthe variance of a set of 12 variablesrepresenting different socioeconomic aspects of the 85zones in the MMA.

5. We also applied principal components analysis to5 variables describing several buildings' characteristicsin each of the 85 zones. The first principal component(OLD), explaining 38% ofthe variance, was dominatedby the average age ofbuildings in each zone. The secondcomponent (INFRA), explaining an additional21 % ofthe variance, was interpreted as an index of the impor­tance of housing in bad condition in each zone.

6. Of aIl the variables that might conceivably repre­sent local public sector activities, we could only meas­ure primary and secondary school enroIlment(SCHOOL).

7. It is always difficult to ascertain whether the rentalamount in monthly expenditures in surveys ofthis typereflects gross or net rento In our case, we only knewwhether the payments for certain utilities were includedor not in other measures, but did not know whethersuch a measure was the rent bill itself. We tried toaccount for these effects by constructing three dummyvariables that take the value 1 if the person intervieweddeclared, respectively, that payments for hot water,heating, or building expenditures were included in theother measure.

4.2 The Selection of the Functional Forrn

To decide on the best functional form, we foIlowedan iterative process that began with an exploratoryanalysis of the data to obtain a reasonable first repre­sentation. Next, we concentrated on the identificationof possible outliers. FinaIly, we carried out the maxi­mum likelihood estimation ofthe dependent variable'sbest transformation and performed various checks tofind an adequate metric for the predetermined varia­bles.

For the initial exploratory analysis, we used threetypes of tools: bivariate plots of the response variablewith respect to each explanatory variable, the empiricaldistribution of each variable, and residual plots frompreliminary regressions with several sets of explanatoryvariables. The results were the foIlowing:

l. The rent variable required transformation, possi­bly a logarithmic transformation. Plots ofe¡ = F( y¡) forthe untransformed y variable showed curvature andheteroskedasticity. Moreover, the distribution of boththe rent variable and the regression residuals had astrong positive asymmetry. FinaIly, the logarithm has aclear economic interpretation, indicating that eachtrait's effect depends on the level reached by the otherhousing attributes.

2. To obtain linearity for the response, once rentswere expressed in logs, the logarithmic transformationwas also applied to the continuous variables OCUP,

Peña and Ruiz-Castillo: Robust Methods of Building Regression Models 15

quences of public housing and rent control policies inSpain. The results of such an assessment wiIl be re­ported elsewhere in Peña and Ruiz-Castillo (1983).

Our data come from a 1974 survey of 4,067 housingunits in the MMA (or .4% ofthe total number for thatarea). The sample used here consists of 460 privaterental dwellings occupied between 1964 and 1974. Suchdata wiIl be made available by the authors upon request.

Hedonic price functions for urban areas usuaIly in­volve two types of explanatory variables: a set of traitscharacterizing dwelling units of the buildings to whichthey belong, and a set of locational characteristics.Tables 1 and 2 describe the variables of either type thatwe could measure.

Sorne comments on measurement problems are inorder:

l. In many cases, to get a value for a building's age,we had to consider the midpoint ofthe known construc­tion interval. This led to certain discontinuities for thisvariable. When we only knew that the building datedfrom the 19th century, the AGE variable was assignedthe value 85, which implies that the construction datewas assumed to be 1880.

2. For each building, interviewers recorded a numberof points for each of eight types of observed defects.Thus, the greater the value of the state of deteriorationvariable (DET), the worse the condition of the corre­sponding building.

3. The accessibility variable is a weighted index oftransportation times from each of the 85 zones thatmake up the MMA to six centraIly located neighbor­hoods where 25% of aIl employment is concentrated.The weights reflect the relative importance of employ­ment in each of those neighborhoods relative to totalemployment in the six. Accessibility is measured inminutes of private and public transportation, weightedby the utilization rate of both modes for aIl transpor­tation purposes in the metropolitan area.

4. The index HIGH is the first principal component

Table 2. Locational Variables8

Name Description MeanStandardDeviation

ACC Accessibility index in 41.2 17.6minutes of transportationtime

POPO Population density in 19,950 20,683inhabitants per squaredkilometer

RENr' Average monthly family 18,826 4,764income in pesetas

HIGH Socioeconomic index .06 .88OLO Buildings age index .13 1.14INFRAb Index of housing in bad -.07 .78

conditionsSCHOOLb Primary and secondary 7,351 4,345

school enrollments

• AII variables take values in !he 85 zones !hat make up !he Madrid Metropolitan Area.b Nonsignilicant variables in !he explanatory analysis.

explaining 34% ofthe variance of a set of 12 variablesrepresenting different socioeconomic aspects of the 85zones in the MMA.

5. We also applied principal components analysis to5 variables describing several buildings' characteristicsin each of the 85 zones. The first principal component(OLD), explaining 38% ofthe variance, was dominatedby the average age ofbuildings in each zone. The secondcomponent (INFRA), explaining an additional21 % ofthe variance, was interpreted as an index of the impor­tance of housing in bad condition in each zone.

6. Of aIl the variables that might conceivably repre­sent local public sector activities, we could only meas­ure primary and secondary school enroIlment(SCHOOL).

7. It is always difficult to ascertain whether the rentalamount in monthly expenditures in surveys ofthis typereflects gross or net rento In our case, we only knewwhether the payments for certain utilities were includedor not in other measures, but did not know whethersuch a measure was the rent bill itself. We tried toaccount for these effects by constructing three dummyvariables that take the value 1 if the person intervieweddeclared, respectively, that payments for hot water,heating, or building expenditures were included in theother measure.

4.2 The Selection of the Functional Forrn

To decide on the best functional form, we foIlowedan iterative process that began with an exploratoryanalysis of the data to obtain a reasonable first repre­sentation. Next, we concentrated on the identificationof possible outliers. FinaIly, we carried out the maxi­mum likelihood estimation ofthe dependent variable'sbest transformation and performed various checks tofind an adequate metric for the predetermined varia­bles.

For the initial exploratory analysis, we used threetypes of tools: bivariate plots of the response variablewith respect to each explanatory variable, the empiricaldistribution of each variable, and residual plots frompreliminary regressions with several sets of explanatoryvariables. The results were the foIlowing:

l. The rent variable required transformation, possi­bly a logarithmic transformation. Plots ofe¡ = F( y¡) forthe untransformed y variable showed curvature andheteroskedasticity. Moreover, the distribution of boththe rent variable and the regression residuals had astrong positive asymmetry. FinaIly, the logarithm has aclear economic interpretation, indicating that eachtrait's effect depends on the level reached by the otherhousing attributes.

2. To obtain linearity for the response, once rentswere expressed in logs, the logarithmic transformationwas also applied to the continuous variables OCUP,

16 Joural of Business & Economic Statistics, January 1984

M2, NFL, DET, ACC, and POPD. Since the case for OCUP and NFL was not clear, the decision to trans- form them was maintained only provisionally.

3. Variables denoted by a in Tables 1 and 2 were initially rejected because they did not supply additional information.

4. The building age variable showed a complex and highly nonlinear influence, probably because it captures very different effects and acts as a proxy for other variables. Moreover, as already indicated, its construc- tion was not free of difficulties. To identify nonlinear effects, its range was broken down into several intervals represented by a set of dummy variables. The results

were that both 19th-century and very moder housing showed rents significantly higher, while housing from 1940 was the least expensive. In a first approximation, we represented this effect by a second degree polyno- mial. To avoid the expected multicollinearity, the fol- lowing variables were defined: AGDM = AGE - AGE, and AGDM2 = AGDM2, where AGE is the mean age for all housing.

With these decisions made, the resulting model ap- pears in column (1) of Table 3. The residuals' distri- bution is asymmetric, with asymmetry and kurtosis coefficients equal to -1.95 and 7.5. The Kolmogorov- Smirnov test leads to the rejection of the residuals'

Table 3. Regression Results

Coefficients (and standard errors)

Variables Other (1) (2) (3) (4) (5) Alternative

Variables

CONSTANT

AGDM

AGDM2

OCUPa

M2a

NFLa

DET8

MAGL

CHTW

APT

LESS

TWOM

TELPH

CHEAT

GAR

FURN

BEXP

ACC8

POPD8

HIGH

OLD

R2

Standard error

Number of observations a In logarithms.

5.77 (.77)

-.012 (.002) .0002

(.00005) -.25 (.04) .46

(.07) .26

(.06) -.06 (.03) .11

(.15) .66

(.15) -.08 (.06)

-.19 (.08) .06

(.09) .05

(.06) .11

(.07) .28

(.10) .33

(.07)

-.15 (.13) .04

(.02) .16

(.04) -.06 (.03)

.71

.48

460

7.14 (.60)

-.010 (.002) .0002

(.00004) -.25 (.03) .42

(.05) .20

(.05) -.06 (.02)

-.0008 (.11) .53

(.12) .02

(.05) -.23 (.07) .07

(.07) .14

(.05) .09

(.06) .29

(.08) .29

(.05)

-.33 (.10) .0009

(.017) .10

(.03) -.05 (.02)

.79

.37

451

7.57 (.35)

-.012 (.002) .0002

(.00004) -.25 (.03) .42

(.05) .20

(.04) -.06 (.02)

.54 (.11)

-.23 (.07) .08

(.07) .14

(.05) .09

(.06) .30

(.08) .29

(.05) .06

(.05) -.38 (.08)

.10 (.03)

-0.7 (.03)

.79

.37

451

7.81 (.31)

-.008 (.002) .0001

(.0003) -.25 (.02) .40

(.05) .18

(.04) -.07 (.02)

.48 (.10)

-.23 (.06) .18

(.06) .14

(.04) .11

(.05) .26

(.07) .24

(.05) .11

(.05) -.41 (.07)

.09 (.03)

-.06 (.02)

.82

.33

443

8.13 (.33)

-.09 (.03) .12

(.08) -.08 (.007) .39

(.05) .19

(.04) -.08 (.02)

AGEa

AXIX

OCUP

.49 (.10)

-.25 (.06) .19

(.06) .15

(.04) .13

(.05) .25

(.07) .26

(.05) .12

(.04) .41

(.07)

.08 (.03)

-.07 (.02)

.82

.32

443

16 Joumal of Business & Economic Statistics, January 1984

M2, NFL, DET, ACC, and POPD. Since the case forOCUP and NFL was not clear, the decision to trans­form them was maintained only provisionally.

3. Variables denoted by a in Tables 1 and 2 wereinitially rejected because they did not supply additionalinformation.

4. The building age variable showed a complex andhighly nonlinear influence, probably because it capturesvery different effects and acts as a proxy for othervariables. Moreover, as already indicated, its construc­tion was not free of difficulties. To identify nonlineareffects, its range was broken down into several intervalsrepresented by a set of dummy variables. The results

were that both 19th-century and very modem housingshowed rents significantly higher, while housing from1940 was the least expensive. In a first approximation,we represented this effect by a second degree polyno­mial. To avoid the expected multicollinearity, the fol­lowing variables were defined: AGDM = AGE - AGE,and AGDM2 = AGDM2

, where AGE is the mean agefor all housing.

With these decisions made, the resulting model ap­pears in column (1) of Table 3. The residuals' distri­bution is asymmetric, with asymmetry and kurtosiscoefficients equal to -1.95 and 7.5. The Kolmogorov­Smimov test leads to the rejection of the residuals'

Table 3. Regression Results

Coefficients (and standard errors)

Variables Other(1) (2) (3) (4) (5) Alternative

Variables

CONSTANT 5.77 7.14 7.57 7.S1 S.13(.77) (.60) (.35) (.31) (.33)

AGDM -.012 -.010 -.012 -.OOS -.09 AGEa

(.002) (.002) (.002) (.002) (.03)AGDM2 .0002 .0002 .0002 .0001 .12 AXIX

(.00005) (.00004) (.00004) (.0003) (.OS)OCUpa -.25 -.25 -.25 -.25 -.OS OCUP

(.04) (.03) (.03) (.02) (.007)M2a .46 .42 .42 .40 .39

(.07) (.05) (.05) (.05) (.05)NFLa .26 .20 .20 .1S .19

(.06) (.05) (.04) (.04) (.04)DETa -.06 -.06 -.06 -.07 -.OS

(.03) (.02) (.02) (.02) (.02)MAGL .11 -.OOOS

(.15) (.11 )CHTW .66 .53 .54 .4S .49

(.15) (.12) (.11) (.10) (.10)APT -.OS .02

(.06) (.05)LESS -.19 -.23 -.23 -.23 -.25

(.OS) (.07) (.07) (.06) (.06)TWOM .06 .07 .OS .1S .19

(.09) (.07) (.07) (.06) (.06)TELPH .05 .14 .14 .14 .15

(.06) (.05) (.05) (.04) (.04)CHEAT .11 .09 .09 .11 .13

(.07) (.06) (.06) (.05) (.05)GAR .2S .29 .30 .26 .25

(.10) (.OS) (.OS) (.07) (.07)FURN .33 .29 .29 .24 .26

(.07) (.05) (.05) (.05) (.05)BEXP .06 .11 .12

(.05) (.05) (.04)ACCa -.15 -.33 -.3S -.41 .41

(.13) (.10) (.OS) (.07) (.07)POPDa .04 .0009

(.02) (.017)HIGH .16 .10 .10 .09 .OS

(.04) (.03) (.03) (.03) (.03)OLD -.06 -.05 -0.7 -.06 -.07

(.03) (.02) (.03) (.02) (.02)

R2 .71 .79 .79 .S2 .S2Standard

error .4S .37 .37 .33 .32Numberof

observations 460 451 451 443 443

• In Iogarilhms.

16 Joumal of Business & Economic Statistics, January 1984

M2, NFL, DET, ACC, and POPD. Since the case forOCUP and NFL was not clear, the decision to trans­form them was maintained only provisionally.

3. Variables denoted by a in Tables 1 and 2 wereinitially rejected because they did not supply additionalinformation.

4. The building age variable showed a complex andhighly nonlinear influence, probably because it capturesvery different effects and acts as a proxy for othervariables. Moreover, as already indicated, its construc­tion was not free of difficulties. To identify nonlineareffects, its range was broken down into several intervalsrepresented by a set of dummy variables. The results

were that both 19th-century and very modem housingshowed rents significantly higher, while housing from1940 was the least expensive. In a first approximation,we represented this effect by a second degree polyno­mial. To avoid the expected multicollinearity, the fol­lowing variables were defined: AGDM = AGE - AGE,and AGDM2 = AGDM2

, where AGE is the mean agefor all housing.

With these decisions made, the resulting model ap­pears in column (1) of Table 3. The residuals' distri­bution is asymmetric, with asymmetry and kurtosiscoefficients equal to -1.95 and 7.5. The Kolmogorov­Smimov test leads to the rejection of the residuals'

Table 3. Regression Results

Coefficients (and standard errors)

Variables Other(1) (2) (3) (4) (5) Alternative

Variables

CONSTANT 5.77 7.14 7.57 7.S1 S.13(.77) (.60) (.35) (.31) (.33)

AGDM -.012 -.010 -.012 -.OOS -.09 AGEa

(.002) (.002) (.002) (.002) (.03)AGDM2 .0002 .0002 .0002 .0001 .12 AXIX

(.00005) (.00004) (.00004) (.0003) (.OS)OCUpa -.25 -.25 -.25 -.25 -.OS OCUP

(.04) (.03) (.03) (.02) (.007)M2a .46 .42 .42 .40 .39

(.07) (.05) (.05) (.05) (.05)NFLa .26 .20 .20 .1S .19

(.06) (.05) (.04) (.04) (.04)DETa -.06 -.06 -.06 -.07 -.OS

(.03) (.02) (.02) (.02) (.02)MAGL .11 -.OOOS

(.15) (.11 )CHTW .66 .53 .54 .4S .49

(.15) (.12) (.11) (.10) (.10)APT -.OS .02

(.06) (.05)LESS -.19 -.23 -.23 -.23 -.25

(.OS) (.07) (.07) (.06) (.06)TWOM .06 .07 .OS .1S .19

(.09) (.07) (.07) (.06) (.06)TELPH .05 .14 .14 .14 .15

(.06) (.05) (.05) (.04) (.04)CHEAT .11 .09 .09 .11 .13

(.07) (.06) (.06) (.05) (.05)GAR .2S .29 .30 .26 .25

(.10) (.OS) (.OS) (.07) (.07)FURN .33 .29 .29 .24 .26

(.07) (.05) (.05) (.05) (.05)BEXP .06 .11 .12

(.05) (.05) (.04)ACCa -.15 -.33 -.3S -.41 .41

(.13) (.10) (.OS) (.07) (.07)POPDa .04 .0009

(.02) (.017)HIGH .16 .10 .10 .09 .OS

(.04) (.03) (.03) (.03) (.03)OLD -.06 -.05 -0.7 -.06 -.07

(.03) (.02) (.03) (.02) (.02)

R2 .71 .79 .79 .S2 .S2Standard

error .4S .37 .37 .33 .32Numberof

observations 460 451 451 443 443

• In Iogarilhms.

Peha and Ruiz-Castillo: Robust Methods of Building Regression Models

Table 4. Possible Outliers Observation

numberDit 1 .03 .06 -6.6 2 .04 .05 -5.3 3 .06 .08 -5.4 4 .03 .02 -4.1 5 .04 .03 -3.7 6 .05 .03 -3.7 7 .06 .04 -3.7 8 .03 .02 -3.6 9 .05 .03 -3.3

10 .05 .02 -3.0 11 .04 .02 -3.0 12 .11 .04 -2.8 13 .03 .01 -2.8 14 .03 .01 -2.5 15 .03 .00 2.4 16 .04 .01 -2.4 17 .12 .04 2.4 18 .15 .00 -0.7 19 .15 .00 .07

normality with a = .01. The distribution appears to be a normal contaminated by a small number of negative values, since it is symmetric around the median (whose value is .07) and reasonably normal in the range .07 + 1.5 a, where a is the residuals standard deviation.

The internal analysis of the model's robustness yielded 19 observations worthy of attention, which are included in Table 4. The last two (numbers 18 and 19) are the potentially more influential with the largest vi values, although their actual influence is negligible ac- cording to the Di statistic. The first 17 observations with the largest ti values were carefully reviewed, with the result that the first 9 appeared to suffer from data transcription errors (omission of a zero in the rent figure). Since several of the next 8 observations were open to doubt, we decided to maintain them provision- ally. Consequently, we estimated a new regression with 451 data points with results summarized in column (2) of Table 3. As can be seen, the elimination of the first 9 observations improves the results without changing them substantially. The coefficients of most variables remain essentially constant with the following excep- tions: (a) the coefficients of MAGL and APT become practically zero, suggesting that they should be elimi- nated from the model; (b) the influence of POPD appears to be captured now by the accessibility index; and (c) the coefficients of TELPH and ACC increase, making them significant.

In view of this information, we estimated a new model without the variables MAGL, APT, and POPD, but introducing the variables previously rejected for the model with 460 data points. As a result, BEXP was provisionally included in the model because, although not significant (it has a t value of 1.2), it appears to be potentially important. Column (3) of Table 3 summa- rizes the final model fitted with 451 data points.

Next, we repeated the robustness analysis for this model in order to detect new anomalous data. We confirmed the atypical nature of the 8 observations previously commented upon, although their actual in- fluence appeared to be generally small judging by their Di values. At any rate, we reestimated the model with- out these 8 observations, obtaining the results presented in column (4) of Table 3. We should remark that (a) the variables TWOM, CHEAT, and BEXP, which were not formally significant with a = .05, become signifi- cant without any doubt; (b) the rest of the coefficients are not significantly altered; and (c) the Kolmogorov- Smirov test, as well as tests on the asymmetry and kurtosis, lead to the acceptance of the hypothesis of the residuals' normality with a = .10.

In conclusion, if we compare the latter with the initial model, it can be observed that after eliminating the 18 observations that we considered as outliers (3.7% of the total), the residual variance has diminished by 55%, the proportion of explained variability has increased by 17%, and we can reasonably accept the hypothesis that the residuals are normally distributed. Most coefficients have changed very slightly, and when this is not the case and the model becomes more compatible with a priori economic information: the distance to the center of Madrid measured by the logarithm of the accessibil- ity index, and the fact that a housing unit has two or more bathrooms, telephone, central heating, and build- ing expenditures included in another measure, become significant variables in the model presumably free from atypical values.

To test the former specification we have performed the maximum likelihood estimation of the Box-Cox transformation parameter X for the rent variable. This has been done for the models with 460, 451, and 443 data. The results, which are practically insensitive to different specifications of the continuous explanatory variables, are presented in Table 5.

As we keep eliminating atypical observations, the maximum of the likelihood function for X gradually

Table 5. Maximum Values of the Likelihood Function for Different Specifications of the Box-Cox Parameter and the Sample Size

x n -.2 -.1 .0 .1 .2 .3 .4 .5

460 -4866 -4816 -4775 -4745 -4727 -4721 -4729 -4750 451 -4635 -4605 -4585 -4574 -4574 -4584 -4605 -4637 443 -4493 -4469 -4454 -4447 -4450 -4464 -4489 -4524

17 Peña and Ruiz-Castillo: Robust Methods of Building Regression Models 17

Table 4. Possible Outliers

Observatían v, DI tinumber

1 .03 .06 -6.62 .04 .05 -5.33 .06 .08 -5.44 .03 .02 -4.15 .04 .03 -3.76 .05 .03 -3.77 .06 .04 -3.78 .03 .02 -3.69 .05 .03 -3.3

10 .05 .02 -3.011 .04 .02 -3.012 .11 .04 -2.813 .03 .01 -2.814 .03 .01 -2.515 .03 .00 2.416 .04 .01 -2.417 .12 .04 2.418 .15 .00 -0.719 .15 .00 .07

normality with a = .0 l. The distribution appears to bea normal contaminated by a small number of negativevalues, since it is symmetric around the median (whosevalue is .07) and reasonably normal in the range .07 ±1.5 q, where q is the residuals standard deviation.

The internal analysis of the model's robustnessyielded 19 observations worthy of attention, which areincluded in Table 4. The last two (numbers 18 and 19)are the potentially more influential with the largest Vii

values, although their actual influence is negligible ac­cording to the Di statistic. The first 17 observationswith the largest ti values were carefully reviewed, withthe result that the first 9 appeared to suffer from datatranscription errors (omission of a zero in the rentfigure). Since several of the next 8 observations wereopen to doubt, we decided to maintain them provision­ally. Consequently, we estimated a new regression with451 data points with results summarized in column (2)of Table 3. As can be seen, the elimination of the first9 observations improves the results without changingthem substantially. The coefficients of most variablesremain essentially constant with the following excep­tions: (a) the coefficients of MAGL and APT becomepractically zero, suggesting that they should be elimi­nated from the model; (b) the influence of POPDappears to be captured now by the accessibility index;and (c) the coefficients of TELPH and ACC increase,making them significant.

In view of this information, we estimated a newmodel without the variables MAGL, APT, and POPD,but introducing the variables previously rejected for themodel with 460 data points. As a result, BEXP wasprovisionally included in the model because, althoughnot significant (it has a t value of 1.2), it appears to bepotentially important. Column (3) of Table 3 summa­rizes the final model fitted with 451 data points.

Next, we repeated the robustness analysis for thismodel in order to detect new anomalous data. Weconfirmed the atypical nature of the 8 observationspreviously commented upon, although their actual in­fluence appeared to be generally small judging by theirDi values. At any rate, we reestimated the model with­out these 8 observations, obtaining the results presentedin column (4) of Table 3. We should remark that (a)the variables TWOM, CHEAT, and BEXP, which werenot formally significant with a = .05, become signifi­cant without any doubt; (b) the rest of the coefficientsare not significantly altered; and (c) the Kolmogorov­Smirnov test, as well as tests on the asymmetry andkurtosis, lead to the acceptance of the hypothesis of theresiduals' normality with a = .10.

In conclusion, ifwe compare the latter with the initialmodel, it can be observed that after eliminating the 18observations that we considered as outliers (3.7% ofthetotal), the residual variance has diminished by 55%, theproportion of explained variability has increased by17%, and we can reasonably accept the hypothesis thatthe residuals are normally distributed. Most coefficientshave changed very slightly, and when this is not thecase and the model becomes more compatible with apriori economic information: the distance to the centerof Madrid measured by the logarithm of the accessibil­ity index, and the fact that a housing unit has two ormore bathrooms, telephone, central heating, and build­ing expenditures included in another measure, becomesignificant variables in the model presumably free fromatypical values.

To test the former specification we have performedthe maximum likelihood estimation of the Box-Coxtransformation parameter Afor the rent variable. Thishas been done for the models with 460,451, and 443data. The results, which are practically insensitive todifferent specifications of the continuous explanatoryvariables, are presented in Table 5.

As we keep eliminating atypical observations, themaximum of the likelihood function for A gradually

Table 5. Maximum Values of the Likelihood Function for Different Specifications of the Box-CoxParameter and the Sample Size

n

460451443

-.2

-4866-4635-4493

-.1

-4816-4605-4469

.0-4775-4585-4454

.1

-4745-4574-4447

.2

-4727-4574-4450

.3

-4721-4584-4464

.4

-4729-4605-4489

.5

-4750-4637-4524

Peña and Ruiz-Castillo: Robust Methods of Building Regression Models 17

Table 4. Possible Outliers

Observatían v, DI tinumber

1 .03 .06 -6.62 .04 .05 -5.33 .06 .08 -5.44 .03 .02 -4.15 .04 .03 -3.76 .05 .03 -3.77 .06 .04 -3.78 .03 .02 -3.69 .05 .03 -3.3

10 .05 .02 -3.011 .04 .02 -3.012 .11 .04 -2.813 .03 .01 -2.814 .03 .01 -2.515 .03 .00 2.416 .04 .01 -2.417 .12 .04 2.418 .15 .00 -0.719 .15 .00 .07

normality with a = .0 l. The distribution appears to bea normal contaminated by a small number of negativevalues, since it is symmetric around the median (whosevalue is .07) and reasonably normal in the range .07 ±1.5 q, where q is the residuals standard deviation.

The internal analysis of the model's robustnessyielded 19 observations worthy of attention, which areincluded in Table 4. The last two (numbers 18 and 19)are the potentially more influential with the largest Vii

values, although their actual influence is negligible ac­cording to the Di statistic. The first 17 observationswith the largest ti values were carefully reviewed, withthe result that the first 9 appeared to suffer from datatranscription errors (omission of a zero in the rentfigure). Since several of the next 8 observations wereopen to doubt, we decided to maintain them provision­ally. Consequently, we estimated a new regression with451 data points with results summarized in column (2)of Table 3. As can be seen, the elimination of the first9 observations improves the results without changingthem substantially. The coefficients of most variablesremain essentially constant with the following excep­tions: (a) the coefficients of MAGL and APT becomepractically zero, suggesting that they should be elimi­nated from the model; (b) the influence of POPDappears to be captured now by the accessibility index;and (c) the coefficients of TELPH and ACC increase,making them significant.

In view of this information, we estimated a newmodel without the variables MAGL, APT, and POPD,but introducing the variables previously rejected for themodel with 460 data points. As a result, BEXP wasprovisionally included in the model because, althoughnot significant (it has a t value of 1.2), it appears to bepotentially important. Column (3) of Table 3 summa­rizes the final model fitted with 451 data points.

Next, we repeated the robustness analysis for thismodel in order to detect new anomalous data. Weconfirmed the atypical nature of the 8 observationspreviously commented upon, although their actual in­fluence appeared to be generally small judging by theirDi values. At any rate, we reestimated the model with­out these 8 observations, obtaining the results presentedin column (4) of Table 3. We should remark that (a)the variables TWOM, CHEAT, and BEXP, which werenot formally significant with a = .05, become signifi­cant without any doubt; (b) the rest of the coefficientsare not significantly altered; and (c) the Kolmogorov­Smirnov test, as well as tests on the asymmetry andkurtosis, lead to the acceptance of the hypothesis of theresiduals' normality with a = .10.

In conclusion, ifwe compare the latter with the initialmodel, it can be observed that after eliminating the 18observations that we considered as outliers (3.7% ofthetotal), the residual variance has diminished by 55%, theproportion of explained variability has increased by17%, and we can reasonably accept the hypothesis thatthe residuals are normally distributed. Most coefficientshave changed very slightly, and when this is not thecase and the model becomes more compatible with apriori economic information: the distance to the centerof Madrid measured by the logarithm of the accessibil­ity index, and the fact that a housing unit has two ormore bathrooms, telephone, central heating, and build­ing expenditures included in another measure, becomesignificant variables in the model presumably free fromatypical values.

To test the former specification we have performedthe maximum likelihood estimation of the Box-Coxtransformation parameter Afor the rent variable. Thishas been done for the models with 460,451, and 443data. The results, which are practically insensitive todifferent specifications of the continuous explanatoryvariables, are presented in Table 5.

As we keep eliminating atypical observations, themaximum of the likelihood function for A gradually

Table 5. Maximum Values of the Likelihood Function for Different Specifications of the Box-CoxParameter and the Sample Size

n

460451443

-.2

-4866-4635-4493

-.1

-4816-4605-4469

.0-4775-4585-4454

.1

-4745-4574-4447

.2

-4727-4574-4450

.3

-4721-4584-4464

.4

-4729-4605-4489

.5

-4750-4637-4524

18 Joural of Business & Economic Statistics, January 1984

approaches zero. The maximum is reached for X = .3 with the full sample, .2 with 451 data, and .1 with 443 data. In the latter case, a 95% confidence interval does not include the logarithmic transformation (X = 0). Although this suggests that the model might still contain further outliers, we did accept the logarithmic transfor- mation as adequate because it is reasonable from an economic point of view and is not dramatically rejected by the empirical evidence (see Atkinson 1982 for a recent analysis of the transformation parameters' sen- sitivity to outliers).

As regards the explanatory variables, we have already pointed out that the first exploratory models did not indicate whether to express the OCUP and NFL varia- bles with or without a logarithmic transformation. To decide this issue, we performed the following 2 x 2 factorial experiment in which the sums of squared residuals are presented for each possible specification:

OCUP NFL 45.34

In NFL 44.22

In OCUP 46.25

45.14

The results suggest that the number of floors should be in logs, while the years of occupancy should not be transformed.

The last variable whose specification was open to doubt was the building age. We searched for the best nonlinear specification using the procedure suggested by Box and Tidwell (1962), but unfortunately the com- putation algorithm did not converge. Finally, we ap- plied the following criterion. First, among the plausible transformations, choose the one generating the smallest sum of square residuals. Next, study the possibility of supplementing that specification with one or more of the dummy variables AXIX, A4164, or A6574.

This procedure leads to the logarithmic transforma- tion, corrected by the dummy AXIX. Since the coeffi- cient of the log of AGE was negative and that for AXIX was positive, this formulation is consistent with our information on the relationship's pattern: ceteris pari- bus, the greater the building age, the smaller the housing rent except for the 19th-century buildings, whose solid construction (or other unobserved characteristics) re- quire an upward correction.

4.3 The Selection of the Final Model

Since we want to predict market rents for housing units whose rents are government controlled, a relevant criterion to choose the number of regressors is the mean squared prediction error. An estimate of this error serving to compare different models is the Mallows statistic: Cp = (SSRp/a2) + 2p - n, where SSRp is the sum of squared residuals with p regressors, a2 is an unbiased estimator of the residual variance in the model with the largest number of variables, and n is the sample size. The Cp statistic permits the selection of the subset

that maximizes the model's predictive capacity (or min- imizes the mean squared error).

This criterion did not lead to the inclusion of new variables to our previous list. The best model is pre- sented in column (5) of Table 3, and has a Cp of 12.6 with 17 explanatory variables. Once this model was selected, we repeated the internal analysis of each ob- servation and searched for other sources of specification errors. The results were as follows:

1. The maximum value for the t statistic for the Studentized residuals was 3.5. The two next values were 3.1, while the rest of the data presented no problems. These three observations are close to the explanatory variables' center of gravity, so that their influence on parameter estimates is small. At any rate, there was no observation with a high Di value. Therefore, we con- cluded that the final model is robust to outliers.

2. Residual plots did not show any evidence of spec- ification errors. The residual distribution is normal according to the Kolmogorov-Smirov test with a = .05.

3. Finally, the estimation situation is adequate with- out multicollinearity problems: the condition index of the X'X matrix was only 8.8 (see Belsley, Kuh, and Welsch 1980).

4.4 The Economic Interpretation In the first place, the above analysis indicates that

82% of rent differences for market housing in the MMA can be explained by the 17 characteristics that were empirically relevant. While the information on struc- tural traits is rather rich, data on attributes referring to housing location in the 85 zones of the MMA were very poor. Thus, it is not surprising that the latter, the accessibility index ACC, the socioeconomic index ALTA, and the buildings age index OLD explained only 4% of the observed variability, while the 14 struc- tural characteristics explained the remaining 78%. If we had data on local public goods levels, pollution of different types, and the distribution of nonresidential land uses, we should expect that the location variables' relative importance would have been greater.

In the second place, all variables appear with the expected algebraic sign. As for the coefficients' inter- pretation, the following comments are in order:

1. For the variables in logarithms AGE, M2, NFL, DET, and ACC the coefficients measure the elasticity directly. Thus, a 10% increase in housing size measured in squared meters leads to a 4% increase in rents, which indicates that there are decreasing returns to scale in this variable. The -.4 elasticity for the accessibility index is somewhat low; two dwellings identical in every respect, except for a difference of 50% on transportation time to the Central Business District, would have a 20% difference in rent. The .19 elasticity for number of

NFLInNFL

18 Joumal of Business & Economic Statistics. January 1984

approaches zero. The maximum is reached for A = .3with the full sample, .2 with 451 data, and .1 with 443data. In the latter case, a 95% confidence interval doesnot inelude the logarithmic transformation (A = O).Although this suggests that the model might still containfurther outliers, we did accept the logarithmic transfor­mation as adequate because it is reasonable from aneconomic point ofview and is not dramatically rejectedby the empirical evidence (see Atkinson 1982 for arecent analysis of the transformation parameters' sen­sitivity to outliers).

As regards the explanatory variables, we have alreadypointed out that the first exploratory models did notindicate whether to express the OCUP and NFL varia­bles with or without a logarithmic transformation. Todecide this issue, we performed the following 2 x 2factorial experiment in which the sums of squaredresiduals are presented for each possible specification:

OCUP In OCUP45.34 46.2544.22 45.14

The results suggest that the number of floors should bein logs, while the years of occupancy should not betransformed.

The last variable whose specification was open todoubt was the building age. We searched for the bestnonlinear specification using the procedure suggestedby Box and Tidwell (1962), but unfortunately the com­putation algorithm did not converge. Finally, we ap­plied the following criterion. First, among the plausibletransformations, choose the one generating the smallestsum of square residuals. Next, study the possibility ofsupplementing that specification with one or more ofthe dummy variables AXIX, A4164, or A6574.

This procedure leads to the logarithmic transforma­tion, corrected by the dummy AXIX. Since the coeffi­cient of the log ofAGE was negative and that for AXIXwas positive, this formulation is consistent with ourinformation on the relationship's pattern: ceteris pari­bus, the greater the building age, the smaller the housingrent except for the 19th-century buildings, whose solidconstruction (or other unobserved characteristics) re­quire an upward correction.

4.3 The Selection of the Final Model

Since we want to predict market rents for housingunits whose rents are government controlled, a relevantcriterion to choose the number of regressors is the meansquared prediction error. An estimate of this errorserving to compare different models is the Mallowsstatistic: Cp = (SSRp/u 2

) + 2p - n, where SSRp is thesum of squared residuals with p regressors, u2 is anunbiased estimator ofthe residual variance in the modelwith the largest number ofvariables, and n is the samplesize. The Cp statistic permits the selection of the subset

that maximizes the model's predictive capacity (or min­imizes the mean squared error).

This criterion did not lead to the inelusion of newvariables to our previous list. The best model is pre­sented in column (5) of Table 3, and has a Cp of 12.6with 17 explanatory variables. Once this model wasselected, we repeated the internal analysis of each ob­servation and searched for other sources ofspecificationerrors. The results were as follows:

1. The maximum value for the t statistic for theStudentized residuals was 3.5. The two next values were3.1, while the rest of the data presented no problems.These three observations are elose to the explanatoryvariables' center of gravity, so that their influence onparameter estimates is small. At any rate, there was noobservation with a high Di value. Therefore, we con­eluded that the final model is robust to outliers.

2. Residual plots did not show any evidence of spec­ification errors. The residual distribution is normalaccording to the Kolmogorov-Smirnov test with a =.05.

3. Finally, the estimation situation is adequate with­out multicollinearity problems: the condition index ofthe X'X matrix was only 8.8 (see Belsley, Kuh, andWelsch 1980).

4.4 The Economic Interpretation

In the first place, the aboye analysis indicates that82% of rent differences for market housing in the MMAcan be explained by the 17 characteristics that wereempirically relevant. While the information on struc­tural traits is rather rich, data on attributes referring tohousing location in the 85 zones of the MMA were verypoor. Thus, it is not surprising that the latter, theaccessibility index ACe, the socioeconomic indexALTA, and the buildings age index OLD explainedonly 4% ofthe observed variability, while the 14 struc­tural characteristics explained the remaining 78%. Ifwehad data on local public goods levels, pollution ofdifferent types, and the distribution of nonresidentialland uses, we should expect that the location variables'relative importance would have been greater.

In the second place, all variables appear with theexpected algebraic signo As for the coefficients' inter­pretation, the following comments are in order:

l. For the variables in logarithms AGE, M2, NFL,DET, and ACC the coefficients measure the elasticitydirectly. Thus, a 10% increase in housing size measuredin squared meters leads to a 4% increase in rents, whichindicates that there are decreasing returns to scale inthis variable. The -.4 elasticity for the accessibilityindex is somewhat low; two dwellings identical in everyrespect, except for a difference of50% on transportationtime to the Central Business District, would have a 20%difference in rento The .19 elasticity for number of

NFLInNFL

18 Joumal of Business & Economic Statistics. January 1984

approaches zero. The maximum is reached for A = .3with the full sample, .2 with 451 data, and .1 with 443data. In the latter case, a 95% confidence interval doesnot inelude the logarithmic transformation (A = O).Although this suggests that the model might still containfurther outliers, we did accept the logarithmic transfor­mation as adequate because it is reasonable from aneconomic point ofview and is not dramatically rejectedby the empirical evidence (see Atkinson 1982 for arecent analysis of the transformation parameters' sen­sitivity to outliers).

As regards the explanatory variables, we have alreadypointed out that the first exploratory models did notindicate whether to express the OCUP and NFL varia­bles with or without a logarithmic transformation. Todecide this issue, we performed the following 2 x 2factorial experiment in which the sums of squaredresiduals are presented for each possible specification:

OCUP In OCUP45.34 46.2544.22 45.14

The results suggest that the number of floors should bein logs, while the years of occupancy should not betransformed.

The last variable whose specification was open todoubt was the building age. We searched for the bestnonlinear specification using the procedure suggestedby Box and Tidwell (1962), but unfortunately the com­putation algorithm did not converge. Finally, we ap­plied the following criterion. First, among the plausibletransformations, choose the one generating the smallestsum of square residuals. Next, study the possibility ofsupplementing that specification with one or more ofthe dummy variables AXIX, A4164, or A6574.

This procedure leads to the logarithmic transforma­tion, corrected by the dummy AXIX. Since the coeffi­cient of the log ofAGE was negative and that for AXIXwas positive, this formulation is consistent with ourinformation on the relationship's pattern: ceteris pari­bus, the greater the building age, the smaller the housingrent except for the 19th-century buildings, whose solidconstruction (or other unobserved characteristics) re­quire an upward correction.

4.3 The Selection of the Final Model

Since we want to predict market rents for housingunits whose rents are government controlled, a relevantcriterion to choose the number of regressors is the meansquared prediction error. An estimate of this errorserving to compare different models is the Mallowsstatistic: Cp = (SSRp/u 2

) + 2p - n, where SSRp is thesum of squared residuals with p regressors, u2 is anunbiased estimator ofthe residual variance in the modelwith the largest number ofvariables, and n is the samplesize. The Cp statistic permits the selection of the subset

that maximizes the model's predictive capacity (or min­imizes the mean squared error).

This criterion did not lead to the inelusion of newvariables to our previous list. The best model is pre­sented in column (5) of Table 3, and has a Cp of 12.6with 17 explanatory variables. Once this model wasselected, we repeated the internal analysis of each ob­servation and searched for other sources ofspecificationerrors. The results were as follows:

1. The maximum value for the t statistic for theStudentized residuals was 3.5. The two next values were3.1, while the rest of the data presented no problems.These three observations are elose to the explanatoryvariables' center of gravity, so that their influence onparameter estimates is small. At any rate, there was noobservation with a high Di value. Therefore, we con­eluded that the final model is robust to outliers.

2. Residual plots did not show any evidence of spec­ification errors. The residual distribution is normalaccording to the Kolmogorov-Smirnov test with a =.05.

3. Finally, the estimation situation is adequate with­out multicollinearity problems: the condition index ofthe X'X matrix was only 8.8 (see Belsley, Kuh, andWelsch 1980).

4.4 The Economic Interpretation

In the first place, the aboye analysis indicates that82% of rent differences for market housing in the MMAcan be explained by the 17 characteristics that wereempirically relevant. While the information on struc­tural traits is rather rich, data on attributes referring tohousing location in the 85 zones of the MMA were verypoor. Thus, it is not surprising that the latter, theaccessibility index ACe, the socioeconomic indexALTA, and the buildings age index OLD explainedonly 4% ofthe observed variability, while the 14 struc­tural characteristics explained the remaining 78%. Ifwehad data on local public goods levels, pollution ofdifferent types, and the distribution of nonresidentialland uses, we should expect that the location variables'relative importance would have been greater.

In the second place, all variables appear with theexpected algebraic signo As for the coefficients' inter­pretation, the following comments are in order:

l. For the variables in logarithms AGE, M2, NFL,DET, and ACC the coefficients measure the elasticitydirectly. Thus, a 10% increase in housing size measuredin squared meters leads to a 4% increase in rents, whichindicates that there are decreasing returns to scale inthis variable. The -.4 elasticity for the accessibilityindex is somewhat low; two dwellings identical in everyrespect, except for a difference of50% on transportationtime to the Central Business District, would have a 20%difference in rento The .19 elasticity for number of

Peha and Ruiz-Castillo: Robust Methods of Building Regression Models

floors does not have an immediate interpretation; per- haps taller buildings are more desirable on average because they possess some characteristic not reflected in our survey. Finally, the -.08 elasticity for housing deterioration state seems reasonable.

2. For the continuous untransformed variables OCUP, ALTA, and ANTIG the coefficients, multiplied by 100, represent the percentage in rent increase attrib- utable to a unit increase in the corresponding charac- teristic. The 8% for occupancy years can be interpreted as the annual rate of rent inflation during the 1965- 1974 period. The market premiums for location in more moder zones or for better socioeconomic con- ditions are, respectively, 7% to 9%.

3. When the dependent variable appears in logs, the expression (exp 13j - 1)- 100, where 1fj is the coefficient of a dummy variable, is interpreted as the percentage change in rents due to the presence of the attribute in question. For the 9 significant dummy variables, such effects, expressed in percent, are as follows:

FURN AXIX CHTW LESS TWOM 29.0 12.7 62.9 -21.8 20.5

TELF CHEAT GAR BEXP 16.0 14.0 28.3 13.3

In conclusion, the goodness of fit is very satisfactory, and the economic explanation of rent differences in terms of the final model's 17 significant variables is, on balance, quite reasonable.

5. CONCLUSIONS

To prevent least squares' great sensitivity to outliers, an internal study of the model's robustness was rec- ommended in Section 3 highlighting the potentially influential observations, as well as those that, in fact, clearly affect the estimation results. The diagonal terms of the projection matrix V is a good indication of the former, while the Cook Di statistic is adequate to meas- ure the latter. Moreover, a t statistic serves to determine which observations can be considered atypical.

Our empirical application of this methodology could be placed in the context of Chapter 4 of Belsley, Kuh, and Welsch (1980). These authors analyze the Harrison and Rubinfeld (1978) data on market values and char- acteristics of 506 owner-occupied dwelling units in the Boston Metropolitan Area. In the first place, they detect the nonnormality of OLS residuals in such a regression. Then they compute M-estimators, observe that some coefficients are considerably modified, and verify that the weighted Studentized residuals follow a normal distribution. On the other hand, using some statistics different from ours, they detect that 10% of the sample consists of influential observations, and informally ana- lyze the consequences of applying OLS after deleting 5

of outliers on the functional form. In Section 4, we also found nonnormality of OLS

residuals in an exploratory model. However, when we apply the robustification strategy we recommend, we find that (a) decisions regarding the model's functional form and the variables to include in it can be consid- erably affected by a few outliers; (b) the residuals' lack of normality can be attributed to some identified data coding errors and other anomalous observations; and (c) the elimination of outliers improves the statistical model of rent housing values in the MMA, enhancing its economic meaning.

ACKNOWLEDGMENTS

This work is part of a larger project financed by the Spanish Ministerio de Economia y Comercio. The au- thors wish to acknowledge the cooperation of Jose Antonio Quintero, who was responsible for the com- putational aspects of this work, as well as helpful com- ments by this journal's referees.

[Received September 1982. Revised July 1983.]

REFERENCES

ANDREWS, P. F., and PREGIBON, D. (1978), "Finding the Outliers that Matter," Journal of the Royal Statistical Society, Ser. B, 40, 85-93.

ATKINSON, A. C. (1982), "Regression Diagnostics, Transformations and Constructed Variables," Journal of the Royal Statistical Soci- ety, Ser. B, 44, 1-36.

BALL, M. J. (1973), "Recent Empirical Work on the Determinants of Relative House Prices," Urban Studies, 10, 213-231.

BARNETT, V., and LEWIS, T. (1978), Outliers in Statistical Data, John Wiley.

BELSLEY, D. A., KUH, E., and WELSCH, R. E. (1980), Regression Diagnostics, John Wiley.

BOX, G. E. P. (1953), "Non-normality and Tests on Variances," Biometrika, 40, 318-335.

(1979), "Robustness and Modeling," in Robustness in Statis- tics, eds. R. L. Launer and G. N. Wilkinson, Academic Press.

(1980), "Sampling and Bayes' Inference in Scientific Model- ling and Robustness" (with discussion), Journal of the Royal Sta- tistical Society, Ser. A, 383-430.

BOX, G. E. P., and TIAO, C. G. (1968), "A Bayesian Approach to Some Outlier Problems," Biometrika, 55, 119-129.

(1973), Bayesian Inference in Statistical Analysis, Reading, Mass.: Addison-Wesley.

BOX, G. E. P., and TIDWELL, P. W. (1962), "Transformation of the Independent Variables," Technometrics, 4, 531-550.

CHEN, G. G., and BOX, G. E. P. (1979a), "Implied Assumptions for Some Proposed Robust Estimators," Technical Report No. 568, University of Wisconsin, Madison, Dept. of Statistics.

(1979b), "A Study of Real Data," Technical Report No. 569, University of Wisconsin, Madison, Dept. of Statistics.

(1979c), "Further Study of Robustification Via a Bayesian Approach," Technical Report No. 570, University of Wisconsin, Madison, Dept. of Statistics.

COOK, R. D. (1977), "Detection of Influential Observations in Linear Regression," Technometrics, 19, 15-18.

(1979), "Influential Observations in Linear Regression," Jour- nal of the American Statistical Association, 74, 169-17.

COOK, R. D., and PRESCOTT, P. (1981), "On the Accuracy of

19

data points. In neither case do they investigate the effect

Peña and Ruiz-Castillo: Robust Methods of Building Regression Modefs 19

5. CONCLUSIONS

In conclusion, the goodness offit is very satisfactory,and the economic explanation of rent differences interms ofthe final model's 17 significant variables is, onbalance, quite reasonable.

floors does not have an immediate interpretation; per­haps taller buildings are more desirable on averagebecause they possess sorne characteristic not reflectedin our survey. Finally, the -.08 elasticity for housingdeterioration state seems reasonable.

2. For the continuous untransformed variablesOCUP, ALTA, and ANTIG the coefficients, multipliedby 100, represent the percentage in rent increase attrib­utable to a unit increase in the corresponding charac­teristic. The 8% for occupancy years can be interpretedas the annual rate of rent inflation during the 1965­1974 periodo The market premiums for location inmore modern zones or for better socioeconomic con­ditions are, respectively, 7% to 9%.

3. When the dependent variable appears in logs, theexpression (exp (jj - 1). 100, where (jj is the coefficientof a dummy variable, is interpreted as the percentagechange in rents due to the presence of the attribute inquestion. For the 9 significant dummy variables, sucheffects, expressed in percent, are as follows:

To prevent least squares' great sensitivity to outliers,an internal study of the model's robustness was rec­ommended in Section 3 highlighting the potentiallyinfluential observations, as well as those that, in fact,clearly affect the estimation results. The diagonal termsof the projection matrix V is a good indication of theformer, while the Cook Di statistic is adequate to meas­ure the latter. Moreover, a t statistic serves to determinewhich observations can be considered atypical.

Our empirical application of this methodology couldbe placed in the context of Chapter 4 of Belsley, Kuh,and Welsch (1980). These authors analyze the Harrisonand Rubinfeld (1978) data on market values and char­acteristics of 506 owner-occupied dwelling units in theBoston Metropolitan Area. In the first place, they detectthe nonnormality ofOLS residuals in such a regression.Then they compute M-estimators, observe that sornecoefficients are considerably modified, and verify thatthe weighted Studentized residuals follow a normaldistribution. On the other hand, using sorne statisticsdifferent from ours, they detect that 10% of the sampleconsists ofinfluential observations, and informally ana­lyze the consequences of applying OLS after deleting 5data points. In neither case do they investigate the effect

REFERENCES

of outliers on the functional form.In Section 4, we also found nonnormality of OLS

residuals in an exploratory model. However, when weapply the robustification strategy we recommend, wefind that (a) decisions regarding the model's functionalform and the variables to include in it can be consid­erably affected by a few outliers; (b) the residuals' lackof normality can be attributed to sorne identified datacoding errors and other anomalous observations; and(c) the elimination of outliers improves the statisticalmodel of rent housing values in the MMA, enhancingits economic meaning.

ACKNOWLEDGMENTS

This work is part of a larger project financed by theSpanish Ministerio de Economía y Comercio. The au­thors wish to acknowledge the cooperation of JoséAntonio Quintero, who was responsible for the com­putational aspects of this work, as well as helpful com­ments by this journal's referees.

[Received September 1982. Revised July 1983.]

ANDREWS, P. F., and PREGIBON, D. (1978), "Finding the Outliersthat Matter," Journal 01 the Royal Statistical Society. Ser. B, 40,85-93.

ATKINSON, A. C. (1982), "Regression Diagnostics, Transfonnationsand Constructed Variables," Journal 01the Royal Statistical Saci­ety, Ser. B, 44, 1-36.

BALL, M. J. (1973), "Recent Empirical Work on the DetenninantsofRelative House Prices," Urban Studies, 10,213-231.

BARNETT, V., and LEWIS, T. (1978), Outliers in Statistical Data,John Wiley.

BELSLEY, D. A., KUH, E., and WELSCH, R. E. (1980), RegressionDiagnostics, John Wiley.

BOX, G. E. P. (1953), "Non-nonnality and Tests on Variances,"Biometrika. 40, 318-335.-- (1979), "Robustness and Modeling," in Robustness in Statis­

tics, eds. R. L. Launer and G. N. Wilkinson, Academic Press.-- (1980), "Sampling and Bayes' Inference in Scientific Model­

Iing and Robustness" (with discussion), Journal 01 the Royal Sta­tistical Society, Ser. A, 383-430.

BOX, G. E. P., and TIAO, C. G. (1968), "A Bayesian Approach toSorne Outlier Problems," Biometrika, 55, 119-129.-- (1973), Bayesian Inference in Statistical Analysis, Reading,

Mass.: Addison-Wesley.BOX, G. E. P., and TIDWELL, P. W. (1962), "Transfonnation of

the Independent Variables," Technometrics, 4,531-550.CHEN, G. G., and BOX, G. E. P. (l979a), "Implied Assumptions

for Sorne Proposed Robust Estimators," Technical Report No. 568,University of Wisconsin, Madison, Dept. of Statistics.

--(1 979b), "A Study of Real Data," Technical Report No. 569,University ofWisconsin, Madison, Dept. ofStatistics.-- (l979c), "Further Study of Robustification Via a Bayesian

Approach," Technical Report No. 570, University of Wisconsin,Madison, Dept. of Statistics.

COOK, R. D. (1977), "Detection ofInfluential Observations in LinearRegression," Technometrics, 19, 15-18.--(1979), "In!luential Observations in Linear Regression," Jour­

nal olthe American Statistical Association, 74, 169-17.COOK, R. D., and PRESCOTT, P. (1981), "On the Accuracy of

BEXP13.3

CHTW LESS TWOM62.9 -21.8 20.5

TELF CHEAT GAR16.0 14.0 28.3

FURN AXIX29.0 12.7

Peña and Ruiz-Castillo: Robust Methods of Building Regression Modefs 19

5. CONCLUSIONS

In conclusion, the goodness offit is very satisfactory,and the economic explanation of rent differences interms ofthe final model's 17 significant variables is, onbalance, quite reasonable.

floors does not have an immediate interpretation; per­haps taller buildings are more desirable on averagebecause they possess sorne characteristic not reflectedin our survey. Finally, the -.08 elasticity for housingdeterioration state seems reasonable.

2. For the continuous untransformed variablesOCUP, ALTA, and ANTIG the coefficients, multipliedby 100, represent the percentage in rent increase attrib­utable to a unit increase in the corresponding charac­teristic. The 8% for occupancy years can be interpretedas the annual rate of rent inflation during the 1965­1974 periodo The market premiums for location inmore modern zones or for better socioeconomic con­ditions are, respectively, 7% to 9%.

3. When the dependent variable appears in logs, theexpression (exp (jj - 1). 100, where (jj is the coefficientof a dummy variable, is interpreted as the percentagechange in rents due to the presence of the attribute inquestion. For the 9 significant dummy variables, sucheffects, expressed in percent, are as follows:

To prevent least squares' great sensitivity to outliers,an internal study of the model's robustness was rec­ommended in Section 3 highlighting the potentiallyinfluential observations, as well as those that, in fact,clearly affect the estimation results. The diagonal termsof the projection matrix V is a good indication of theformer, while the Cook Di statistic is adequate to meas­ure the latter. Moreover, a t statistic serves to determinewhich observations can be considered atypical.

Our empirical application of this methodology couldbe placed in the context of Chapter 4 of Belsley, Kuh,and Welsch (1980). These authors analyze the Harrisonand Rubinfeld (1978) data on market values and char­acteristics of 506 owner-occupied dwelling units in theBoston Metropolitan Area. In the first place, they detectthe nonnormality ofOLS residuals in such a regression.Then they compute M-estimators, observe that sornecoefficients are considerably modified, and verify thatthe weighted Studentized residuals follow a normaldistribution. On the other hand, using sorne statisticsdifferent from ours, they detect that 10% of the sampleconsists ofinfluential observations, and informally ana­lyze the consequences of applying OLS after deleting 5data points. In neither case do they investigate the effect

REFERENCES

of outliers on the functional form.In Section 4, we also found nonnormality of OLS

residuals in an exploratory model. However, when weapply the robustification strategy we recommend, wefind that (a) decisions regarding the model's functionalform and the variables to include in it can be consid­erably affected by a few outliers; (b) the residuals' lackof normality can be attributed to sorne identified datacoding errors and other anomalous observations; and(c) the elimination of outliers improves the statisticalmodel of rent housing values in the MMA, enhancingits economic meaning.

ACKNOWLEDGMENTS

This work is part of a larger project financed by theSpanish Ministerio de Economía y Comercio. The au­thors wish to acknowledge the cooperation of JoséAntonio Quintero, who was responsible for the com­putational aspects of this work, as well as helpful com­ments by this journal's referees.

[Received September 1982. Revised July 1983.]

ANDREWS, P. F., and PREGIBON, D. (1978), "Finding the Outliersthat Matter," Journal 01 the Royal Statistical Society. Ser. B, 40,85-93.

ATKINSON, A. C. (1982), "Regression Diagnostics, Transfonnationsand Constructed Variables," Journal 01the Royal Statistical Saci­ety, Ser. B, 44, 1-36.

BALL, M. J. (1973), "Recent Empirical Work on the DetenninantsofRelative House Prices," Urban Studies, 10,213-231.

BARNETT, V., and LEWIS, T. (1978), Outliers in Statistical Data,John Wiley.

BELSLEY, D. A., KUH, E., and WELSCH, R. E. (1980), RegressionDiagnostics, John Wiley.

BOX, G. E. P. (1953), "Non-nonnality and Tests on Variances,"Biometrika. 40, 318-335.-- (1979), "Robustness and Modeling," in Robustness in Statis­

tics, eds. R. L. Launer and G. N. Wilkinson, Academic Press.-- (1980), "Sampling and Bayes' Inference in Scientific Model­

Iing and Robustness" (with discussion), Journal 01 the Royal Sta­tistical Society, Ser. A, 383-430.

BOX, G. E. P., and TIAO, C. G. (1968), "A Bayesian Approach toSorne Outlier Problems," Biometrika, 55, 119-129.-- (1973), Bayesian Inference in Statistical Analysis, Reading,

Mass.: Addison-Wesley.BOX, G. E. P., and TIDWELL, P. W. (1962), "Transfonnation of

the Independent Variables," Technometrics, 4,531-550.CHEN, G. G., and BOX, G. E. P. (l979a), "Implied Assumptions

for Sorne Proposed Robust Estimators," Technical Report No. 568,University of Wisconsin, Madison, Dept. of Statistics.

--(1 979b), "A Study of Real Data," Technical Report No. 569,University ofWisconsin, Madison, Dept. ofStatistics.-- (l979c), "Further Study of Robustification Via a Bayesian

Approach," Technical Report No. 570, University of Wisconsin,Madison, Dept. of Statistics.

COOK, R. D. (1977), "Detection ofInfluential Observations in LinearRegression," Technometrics, 19, 15-18.--(1979), "In!luential Observations in Linear Regression," Jour­

nal olthe American Statistical Association, 74, 169-17.COOK, R. D., and PRESCOTT, P. (1981), "On the Accuracy of

BEXP13.3

CHTW LESS TWOM62.9 -21.8 20.5

TELF CHEAT GAR16.0 14.0 28.3

FURN AXIX29.0 12.7

20 Joural of Business & Economic Statistics, January 1984

Bonferroni Significance Levels for Detecting Outliers in Linear Models," Technometrics, 23, 59-63.

COOK, R. D., and WEISBERG, S. (1980), "Characterization of an Empirical Influence Function for Detecting Influential Cases in Regression," Technometrics, 22, 495-508.

DIANANDA, P. H. (1949), "Note on Some Properties of Maximum Likelihood Estimates," Proceedings of the Cambridge Philosophical Society, 45, 536-544.

DRAPER, N. R., and JOHN, J. A. (1981), "Influential Observations and Outliers in Regression," Technometrics, 23, 21-26.

GRILLICHES, Z. (ed.) (1971), Price Indexes and Quality Change, Harvard University Press.

GUTTMAN, I. (1973), "Premium and Protection of Several Proce- dures for Dealing With Outliers When Sample Sizes are Moderate to Large," Technometrics, 15, 385-404.

HARRISON, D., and RUBINFELD, D. L. (1978), "Hedonic Housing Prices and the Demand for Clean Air," Journal of Environmental Economics and Management, 5, 81-102.

HOAGLIN, D. C., and WELSCH, R. E. (1978), "The Hat Matrix in Regression and ANOVA," The American Statistician, 32, 17-22.

HOGG, R. V. (1979), "An Introduction to Robust Estimation," in Robustness in Statistics, eds. R. L. Launer and G. N. Wilkinson, Academic Press.

HUBER, P. J. (1964), "Robust Estimation of a Location Parameter," Annals of Mathematical Statistics, 135, 73-101.

(1975), "Robustness and Designs," in A Survey of Statistical Design and Linear Models, ed. J. N. Srirastarra, North-Holland.

(1981), Robust Statistics, John Wiley.

JEFFREYS, H. (1961), Theory of Probability, Oxford Clarendon Press.

KRASKER, W. S., and WELSCH, R. E. (1982), "Efficient Bounded- influence Regression Estimation," Journal of the American Statis- tical Association, 77, 595-604.

MILLER, R. G. (1977), "Developments in Multiple Comparisons 1966-1976," Journal of the American Statistical Association, 72, 779-788.

MOSTELLER, F., and TUKEY, J. W. (1977), Data Analysis and Regression, Addison-Weslsy.

PENA, D., and RUIZ-CASTILLO, J. (1983), "Distributional Aspects of Public Rental Housing and Rent Control Policies in Spain," Journal of Urban Economics, forthcoming.

QUIGLEY, J. M. (1979), "What Have We Learned about Urban Housing Markets?," in Current Issues in Urban Economics, eds. P. Mieszkowski and M. Straszheim, The Johns Hopkins University Press.

ROSEN, S. (1974), "Hedonic Prices and Implicit Markets: Product Differentiation in Pure Competition," Journal of Political Econom- ics, 82, 34-55.

STIGLER, S. M. (1973), "Simon Newcomb, Perey Daniel, and the History of Robust Estimation 1855-1920," Journal ofthe American, Statistical Association, 63, 872-879.

TUKEY, J. W. (1960), "A Survey of Sampling from Contaminated Distributions, in Contributions to Probability and Statistics, ed. J. Olkin, Stanford University Press.

WEISBERG, S. (1980), Applied Linear Regression, John Wiley.

20 Joumal of Business &Economic Statistics, January 1984

Bonferroni Significance Levels for Detecting Outliers in LinearModels," Technometrics, 23, 59-63.

COOK, R. D., and WEISBERG, S. (1980), "Characterization of anEmpirical Inlluence Function for Detecting Inlluential Cases inRegression," Technometrics, 22, 495-508.

DlANANDA, P. H. (1949), "Note on Sorne Properties ofMaximumLikelihood Estimates," Proceedings olthe Cambridge PhilosophicalSaciety, 45, 536-544.

DRAPER, N. R., and JOHN, J. A. (1981), "Influential Observationsand Outliers in Regression," Technometrics, 23, 21-26.

GRILLlCHES, Z. (oo.) (1971), Price Indexes and Quality Change,Harvard University Press.

GUTTMAN, 1. (1973), "Premium and Protection of Several Proce­dures for Dealing With Outliers When Sample Sizes are Moderateto Large," Technometrics, 15,385-404.

HARRISON, D., and RUBINFELD, D. L. (1978), "HOOonic HousingPrices and the Demand for C1ean Air," Journal 01EnvironmentalEconomics and Management, 5, 81-102.

HOAGLlN, D. c., and WELSCH, R. E. (1978), "The Hat Matrix inRegression and ANOVA," The American Statistician, 32, 17-22.

HOGG, R. V. (1979), "An Introduction to Robust Estimation," inRobustness in Statistics, eds. R. L. Launer and G. N. Wilkinson,Academic Press.

HUBER, P. J. (1964), "Robust Estimation ofa Location Parameter,"Annals 01Mathematical Statistics, 135, 73-101.-- (1975), "Robustness and Designs," in A Survey olStatistical

Design and Linear Models, ed. J. N. Srirastarra, North-HoIland.-- (1981), Robust Statistics, John Wiley.

JEFFREYS, H. (1961), Theory 01 Probability, Oxford ClarendonPress.

KRASKER, W. S., and WELSCH, R. E. (1982), "Efficient BoundOO­influence Regression Estimation," Journal olthe American Statis­tical Association, 77, 595-604.

MILLER, R. G. (1977), "Developments in MuItiple Comparisons1966-1976," Journal 01 the American Statistical Association, 72,779-788.

MOSTELLER, F., and TUKEY, J. W. (1977), Data Analysis andRegression, Addison-Weslsy.

PEÑA, D., and RUIZ-CASTlLLO, J. (1983), "Distributional Aspectsof Public Rental Housing and Rent Control Policies in Spain,"Journal 01Urban Economics, forthcoming.

QUlGLEY, J. M. (1979), "What Have We LearnOO about UrbanHousing Markets?," in Current Issues in Urban Economics, eds.P. Mieszkowski and M. Straszheim, The Johns Hopkins UniversityPress.

ROSEN, S. (1974), "Hedonic Prices and Implicit Markets: ProductDitTerentiation in Pure Competition," JournalolPolitical Econom­ics, 82, 34-55.

STIGLER, S. M. (1973), "Simon Newcomb, Perey Daniel, and theHistory ofRobust Estimation 1855-1920," Journal oltheAmerican,Statistical Association, 63, 872-879.

TUKEY, J. W. (1960), "A Survey ofSampling from ContaminatedDistributions, in Contributions to Probability and Statistics, oo.J. Olkin, Stanford University Press.

WElSBERG, S. (1980), Applied Linear Regression, John Wilay.

20 Joumal of Business &Economic Statistics, January 1984

Bonferroni Significance Levels for Detecting Outliers in LinearModels," Technometrics, 23, 59-63.

COOK, R. D., and WEISBERG, S. (1980), "Characterization of anEmpirical Inlluence Function for Detecting Inlluential Cases inRegression," Technometrics, 22, 495-508.

DlANANDA, P. H. (1949), "Note on Sorne Properties ofMaximumLikelihood Estimates," Proceedings olthe Cambridge PhilosophicalSaciety, 45, 536-544.

DRAPER, N. R., and JOHN, J. A. (1981), "Influential Observationsand Outliers in Regression," Technometrics, 23, 21-26.

GRILLlCHES, Z. (oo.) (1971), Price Indexes and Quality Change,Harvard University Press.

GUTTMAN, 1. (1973), "Premium and Protection of Several Proce­dures for Dealing With Outliers When Sample Sizes are Moderateto Large," Technometrics, 15,385-404.

HARRISON, D., and RUBINFELD, D. L. (1978), "HOOonic HousingPrices and the Demand for C1ean Air," Journal 01EnvironmentalEconomics and Management, 5, 81-102.

HOAGLlN, D. c., and WELSCH, R. E. (1978), "The Hat Matrix inRegression and ANOVA," The American Statistician, 32, 17-22.

HOGG, R. V. (1979), "An Introduction to Robust Estimation," inRobustness in Statistics, eds. R. L. Launer and G. N. Wilkinson,Academic Press.

HUBER, P. J. (1964), "Robust Estimation ofa Location Parameter,"Annals 01Mathematical Statistics, 135, 73-101.-- (1975), "Robustness and Designs," in A Survey olStatistical

Design and Linear Models, ed. J. N. Srirastarra, North-HoIland.-- (1981), Robust Statistics, John Wiley.

JEFFREYS, H. (1961), Theory 01 Probability, Oxford ClarendonPress.

KRASKER, W. S., and WELSCH, R. E. (1982), "Efficient BoundOO­influence Regression Estimation," Journal olthe American Statis­tical Association, 77, 595-604.

MILLER, R. G. (1977), "Developments in MuItiple Comparisons1966-1976," Journal 01 the American Statistical Association, 72,779-788.

MOSTELLER, F., and TUKEY, J. W. (1977), Data Analysis andRegression, Addison-Weslsy.

PEÑA, D., and RUIZ-CASTlLLO, J. (1983), "Distributional Aspectsof Public Rental Housing and Rent Control Policies in Spain,"Journal 01Urban Economics, forthcoming.

QUlGLEY, J. M. (1979), "What Have We LearnOO about UrbanHousing Markets?," in Current Issues in Urban Economics, eds.P. Mieszkowski and M. Straszheim, The Johns Hopkins UniversityPress.

ROSEN, S. (1974), "Hedonic Prices and Implicit Markets: ProductDitTerentiation in Pure Competition," JournalolPolitical Econom­ics, 82, 34-55.

STIGLER, S. M. (1973), "Simon Newcomb, Perey Daniel, and theHistory ofRobust Estimation 1855-1920," Journal oltheAmerican,Statistical Association, 63, 872-879.

TUKEY, J. W. (1960), "A Survey ofSampling from ContaminatedDistributions, in Contributions to Probability and Statistics, oo.J. Olkin, Stanford University Press.

WElSBERG, S. (1980), Applied Linear Regression, John Wilay.