
Nonlinear Regression Analysis and Its Applications

    Second edition

    Douglas M. Bates and Donald G. Watts

    A Wiley-Interscience Publication

    JOHN WILEY & SONS, INC.

    New York / Chichester / Weinheim / Brisbane / Singapore / Toronto


    Preface

    text...

    This is a preface section

    text...



    Acknowledgments

    To Mary Ellen and Valery


Contents

Preface
Acknowledgments

1 Review of Linear Regression
  1.1 The Linear Regression Model
    1.1.1 The Least Squares Estimates
    1.1.2 Sampling Theory Inference Results
    1.1.3 Likelihood Inference Results
    1.1.4 Bayesian Inference Results
    1.1.5 Comments
  1.2 The Geometry of Linear Least Squares
    1.2.1 The Expectation Surface
    1.2.2 Determining the Least Squares Estimates
    1.2.3 Parameter Inference Regions
    1.2.4 Marginal Confidence Intervals
    1.2.5 The Geometry of Likelihood Results
  1.3 Assumptions and Model Assessment
    1.3.1 Assumptions and Their Implications
    1.3.2 Model Assessment
    1.3.3 Plotting Residuals
    1.3.4 Stabilizing Variance
    1.3.5 Lack of Fit
  Problems

2 Nonlinear Regression
  2.1 The Nonlinear Regression Model
    2.1.1 Transformably Linear Models
    2.1.2 Conditionally Linear Parameters
    2.1.3 The Geometry of the Expectation Surface
  2.2 Determining the Least Squares Estimates
    2.2.1 The Gauss-Newton Method
    2.2.2 The Geometry of Nonlinear Least Squares
    2.2.3 Convergence
  2.3 Nonlinear Regression Inference Using the Linear Approximation
    2.3.1 Approximate Inference Regions for Parameters
    2.3.2 Approximate Inference Bands for the Expected Response
  2.4 Nonlinear Least Squares via Sums of Squares
    2.4.1 The Linear Approximation
    2.4.2 Overshoot
  Problems

Appendix A Data Sets Used in Examples
  A.1 PCB
  A.2 Rumford
  A.3 Puromycin
  A.4 BOD
  A.5 Isomerization
  A.6 α-Pinene
  A.7 Sulfisoxazole
  A.8 Lubricant
  A.9 Chloride
  A.10 Ethyl Acrylate
  A.11 Saccharin
  A.12 Nitrite Utilization
  A.13 s-PMMA
  A.14 Tetracycline
  A.15 Oil Shale
  A.16 Lipoproteins

References


1 Review of Linear Regression

Non sunt multiplicanda entia praeter necessitatem.
(Entities are not to be multiplied beyond necessity.)

William of Ockham

We begin with a brief review of linear regression, because a thorough grounding in linear regression is fundamental to understanding nonlinear regression. For a more complete presentation of linear regression see, for example, Draper and Smith (1981), Montgomery and Peck (1982), or Seber (1977). Detailed discussion of regression diagnostics is given in Belsley, Kuh and Welsch (1980) and Cook and Weisberg (1982), and the Bayesian approach is discussed in Box and Tiao (1973).

Two topics which we emphasize are modern numerical methods and the geometry of linear least squares. As will be seen, attention to efficient computing methods increases understanding of linear regression, while the geometric approach provides insight into the methods of linear least squares and the analysis of variance, and subsequently into nonlinear regression.

    1.1 THE LINEAR REGRESSION MODEL

Linear regression provides estimates and other inferential results for the parameters $\beta = (\beta_1, \beta_2, \ldots, \beta_P)^T$ in the model

$$Y_n = \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_P x_{nP} + Z_n = f(x_{n1}, \ldots, x_{nP}, \beta) + Z_n \qquad (1.1)$$


In this model, the random variable $Y_n$, which represents the response for case $n$, $n = 1, 2, \ldots, N$, has a deterministic part and a stochastic part. The deterministic part, $f(x_{n1}, \ldots, x_{nP}, \beta)$, depends upon the parameters $\beta$ and upon the predictor or regressor variables $x_{np}$, $p = 1, 2, \ldots, P$. The stochastic part, represented by the random variable $Z_n$, is a disturbance which perturbs the response for that case. The superscript T denotes the transpose of a matrix.

The model for $N$ cases can be written

$$Y = X\beta + Z \qquad (1.2)$$

where $Y$ is the vector of random variables representing the data we may get, $X$ is the $N \times P$ matrix of regressor variables,

$$X = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1P} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2P} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & x_{N3} & \cdots & x_{NP} \end{bmatrix}$$

and $Z$ is the vector of random variables representing the disturbances. (We will use boldface italic letters for vectors of random variables.)

The deterministic part, $X\beta$, a function of the parameters and the regressor variables, gives the mathematical model or the model function for the responses. Since a nonzero mean for $Z_n$ can be incorporated into the model function, we assume that

$$E[Z] = 0 \qquad (1.3)$$

or, equivalently,

$$E[Y] = X\beta$$

We therefore call $X\beta$ the expectation function for the regression model. The matrix $X$ is called the derivative matrix, since the $(n, p)$th term is the derivative of the $n$th row of the expectation function with respect to the $p$th parameter.

Note that for linear models, derivatives with respect to any of the parameters are independent of all the parameters.

If we further assume that $Z$ is normally distributed with

$$\mathrm{Var}[Z] = E[ZZ^T] = \sigma^2 I \qquad (1.4)$$

where $I$ is an $N \times N$ identity matrix, then the joint probability density function for $Y$, given $\beta$ and the variance $\sigma^2$, is

$$p(y \mid \beta, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left( \frac{-(y - X\beta)^T(y - X\beta)}{2\sigma^2} \right) = (2\pi\sigma^2)^{-N/2} \exp\left( \frac{-\|y - X\beta\|^2}{2\sigma^2} \right) \qquad (1.5)$$


Fig. 1.1 Plot of PCB concentration versus age for lake trout.

where the double vertical bars denote the length of a vector. When provided with a derivative matrix $X$ and a vector of observed data $y$, we wish to make inferences about $\sigma^2$ and the $P$ parameters $\beta$.

Example:

As a simple example of a linear regression model, we consider the concentration of polychlorinated biphenyls (PCBs) in Lake Cayuga trout as a function of age (Bache, Serum, Youngs and Lisk, 1972). The data set is described in Appendix A, Section A.1. A plot of the PCB concentration versus age, Figure 1.1, reveals a curved relationship between PCB concentration and age. Furthermore, there is increasing variance in the PCB concentration as the concentration increases. Since the assumption (1.4) requires that the variance of the disturbances be constant, we seek a transformation of the PCB concentration which will stabilize the variance (see Section 1.3.2). Plotting the PCB concentration on a logarithmic scale, as in Figure 1.2a, nicely stabilizes the variance and produces a more nearly linear relationship. Thus, a linear expectation function of the form

$$\ln(\mathrm{PCB}) = \beta_1 + \beta_2\, \mathrm{age}$$

could be considered appropriate, where ln denotes the natural logarithm (logarithm to the base e). Transforming the regressor variable (Box and Tidwell, 1962) can produce an even straighter plot, as shown in Figure 1.2b, where we use the cube root of age.


Fig. 1.2 Plot of PCB concentration versus age for lake trout. The concentration, on a logarithmic scale, is plotted versus age in part a and versus $\sqrt[3]{\mathrm{age}}$ in part b.

Thus a simple expectation function to be fitted is

$$\ln(\mathrm{PCB}) = \beta_1 + \beta_2 \sqrt[3]{\mathrm{age}}$$

(Note that the methods of Chapter 2 can be used to fit models of the form

$$f(x, \beta, \alpha) = \beta_0 + \beta_1 x_1^{\alpha_1} + \beta_2 x_2^{\alpha_2} + \cdots + \beta_P x_P^{\alpha_P}$$

by simultaneously estimating the conditionally linear parameters $\beta$ and the transformation parameters $\alpha$. The powers $\alpha_1, \ldots, \alpha_P$ are used to transform the factors so that a simple linear model in $x_1^{\alpha_1}, \ldots, x_P^{\alpha_P}$ is appropriate. In this book we use the power $\alpha = 0.33$ for the age variable even though, for the PCB data, the optimal value is $\alpha = 0.20$.)

    1.1.1 The Least Squares Estimates

The likelihood function, or more simply, the likelihood, $l(\beta, \sigma \mid y)$, for $\beta$ and $\sigma$ is identical in form to the joint probability density (1.5) except that $l(\beta, \sigma \mid y)$ is regarded as a function of the parameters conditional on the observed data, rather than as a function of the responses conditional on the values of the parameters. Suppressing the constant $(2\pi)^{-N/2}$, we write

$$l(\beta, \sigma \mid y) \propto \sigma^{-N} \exp\left( \frac{-\|y - X\beta\|^2}{2\sigma^2} \right) \qquad (1.6)$$

The likelihood is maximized with respect to $\beta$ when the residual sum of squares

$$S(\beta) = \|y - X\beta\|^2 = \sum_{n=1}^{N} \left( y_n - \sum_{p=1}^{P} x_{np}\beta_p \right)^2 \qquad (1.7)$$

is a minimum. Thus the maximum likelihood estimate $\hat\beta$ is the value of $\beta$ which minimizes $S(\beta)$. This is called the least squares estimate and can be written

$$\hat\beta = (X^TX)^{-1}X^Ty \qquad (1.8)$$

Least squares estimates can also be derived by using sampling theory, since the least squares estimator is the minimum variance unbiased estimator for $\beta$, or by using a Bayesian approach with a noninformative prior density on $\beta$ and $\sigma$. In the Bayesian approach, $\hat\beta$ is the mode of the marginal posterior density function for $\beta$.
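As a computational aside, the calculation in (1.7) and (1.8) can be sketched in a few lines of Python. The ages and responses below are made-up values used only to show the mechanics, and, as discussed in Section 1.2.2, a QR decomposition is preferred to working with $(X^TX)^{-1}$ in serious computation.

```python
import numpy as np

# Hypothetical ages and ln(PCB) responses, for illustration only.
age = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 12.0])
y = np.array([0.1, 0.6, 1.0, 1.6, 2.1, 2.6])

# Derivative (design) matrix for the model ln(PCB) = beta1 + beta2 * age^(1/3).
X = np.column_stack([np.ones_like(age), np.cbrt(age)])

# Least squares estimate from the normal equations (1.8).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual sum of squares S(beta_hat) and the variance estimate s^2.
resid = y - X @ beta_hat
s2 = resid @ resid / (len(y) - X.shape[1])
print(beta_hat, s2)
```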

All three of these methods of inference, the likelihood approach, the sampling theory approach, and the Bayesian approach, produce the same point estimates for $\beta$. As we will see shortly, they also produce similar regions of reasonable parameter values. First, however, it is important to realize that the least squares estimates are only appropriate when the model (1.2) and the assumptions on the disturbance term, (1.3) and (1.4), are valid. Expressed in another way, in using the least squares estimates we assume:

    1. The expectation function is correct.

    2. The response is expectation function plus disturbance.

    3. The disturbance is independent of the expectation function.

    4. Each disturbance has a normal distribution.

    5. Each disturbance has zero mean.

    6. The disturbances have equal variances.

    7. The disturbances are independently distributed.

When these assumptions appear reasonable and have been checked using diagnostic plots such as those described in Section 1.3.2, we can go on to make further inferences about the regression model.

Looking in detail at each of the three methods of statistical inference, we can characterize some of the properties of the least squares estimates.

    1.1.2 Sampling Theory Inference Results

The least squares estimator $\hat\beta$ has a number of desirable properties as shown, for example, in Seber (1977):

1. The least squares estimator $\hat\beta$ is normally distributed. This follows because the estimator is a linear function of $Y$, which in turn is a linear function of $Z$. Since $Z$ is assumed to be normally distributed, $\hat\beta$ is normally distributed.

2. $E[\hat\beta] = \beta$: the least squares estimator is unbiased.

3. $\mathrm{Var}[\hat\beta] = \sigma^2(X^TX)^{-1}$: the covariance matrix of the least squares estimator depends on the variance of the disturbances and on the derivative matrix $X$.

4. A $1-\alpha$ joint confidence region for $\beta$ is the ellipsoid

   $$(\beta - \hat\beta)^T X^TX (\beta - \hat\beta) \le P s^2 F(P, N-P; \alpha) \qquad (1.9)$$

   where

   $$s^2 = \frac{S(\hat\beta)}{N-P}$$

   is the residual mean square or variance estimate based on $N-P$ degrees of freedom, and $F(P, N-P; \alpha)$ is the upper $\alpha$ quantile for Fisher's $F$ distribution with $P$ and $N-P$ degrees of freedom.

5. A $1-\alpha$ marginal confidence interval for the parameter $\beta_p$ is

   $$\hat\beta_p \pm \mathrm{se}(\hat\beta_p)\, t(N-P; \alpha/2) \qquad (1.10)$$

   where $t(N-P; \alpha/2)$ is the upper $\alpha/2$ quantile for Student's $T$ distribution with $N-P$ degrees of freedom and the standard error of the parameter estimator is

   $$\mathrm{se}(\hat\beta_p) = s\sqrt{\left\{(X^TX)^{-1}\right\}_{pp}} \qquad (1.11)$$

   with $\{(X^TX)^{-1}\}_{pp}$ equal to the $p$th diagonal term of the matrix $(X^TX)^{-1}$.

6. A $1-\alpha$ confidence interval for the expected response at $x_0$ is

   $$x_0^T\hat\beta \pm s\sqrt{x_0^T(X^TX)^{-1}x_0}\; t(N-P; \alpha/2) \qquad (1.12)$$

7. A $1-\alpha$ confidence interval for the expected response at $x_0$ is

   $$x_0^T\hat\beta \pm s\sqrt{x_0^T(X^TX)^{-1}x_0}\; t(N-P; \alpha/2) \qquad (1.13)$$

8. A $1-\alpha$ confidence band for the response function at any $x$ is given by

   $$x^T\hat\beta \pm s\sqrt{x^T(X^TX)^{-1}x}\; \sqrt{P\, F(P, N-P; \alpha)} \qquad (1.14)$$

The expressions (1.13) and (1.14) differ because (1.13) concerns an interval at a single specific point, whereas (1.14) concerns the band produced by the intervals at all the values of $x$ considered simultaneously.
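The following sketch shows how the intervals (1.10) to (1.14) can be evaluated numerically. It uses the summary quantities reported for the ln(PCB) versus cube root of age fit in the example later in this section; the evaluation point x0 is chosen arbitrarily for illustration.

```python
import numpy as np
from scipy import stats

# Summary quantities for the ln(PCB) vs cube-root-of-age fit (see the example below).
beta_hat = np.array([-2.391, 2.300])
XtX = np.array([[28.000, 46.941],
                [46.941, 83.367]])
s2, N, P = 0.246, 28, 2

XtX_inv = np.linalg.inv(XtX)

# Standard errors (1.11) and marginal 95% intervals (1.10).
se = np.sqrt(s2 * np.diag(XtX_inv))
tcrit = stats.t.ppf(1 - 0.05 / 2, N - P)
marginal = np.column_stack([beta_hat - tcrit * se, beta_hat + tcrit * se])

# Interval (1.12) for the expected response at an arbitrary point x0 = (1, cbrt(5)).
x0 = np.array([1.0, np.cbrt(5.0)])
half_point = tcrit * np.sqrt(s2 * x0 @ XtX_inv @ x0)

# Half-width of the simultaneous band (1.14) at the same x.
fcrit = stats.f.ppf(1 - 0.05, P, N - P)
half_band = np.sqrt(P * fcrit) * np.sqrt(s2 * x0 @ XtX_inv @ x0)
print(marginal, x0 @ beta_hat, half_point, half_band)
```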


    1.1.3 Likelihood Inference Results

The likelihood $l(\beta, \sigma \mid y)$, equation (1.6), depends on $\beta$ only through $\|y - X\beta\|$, so likelihood contours are of the form

$$\|y - X\beta\|^2 = c \qquad (1.15)$$

where $c$ is a constant. A likelihood region bounded by the contour for which

$$c = S(\hat\beta)\left[ 1 + \frac{P}{N-P}\, F(P, N-P; \alpha) \right]$$

is identical to a $1-\alpha$ joint confidence region from the sampling theory approach. The interpretation of a likelihood region is quite different from that of a confidence region, however.

    1.1.4 Bayesian Inference Results

As shown in Box and Tiao (1973), the Bayesian marginal posterior density for $\beta$, assuming a noninformative prior density for $\beta$ and $\sigma$ of the form

$$p(\beta, \sigma) \propto \sigma^{-1} \qquad (1.16)$$

is

$$p(\beta \mid y) \propto \left[ 1 + \frac{(\beta - \hat\beta)^T X^TX (\beta - \hat\beta)}{\nu s^2} \right]^{-(\nu+P)/2} \qquad (1.17)$$

which is in the form of a $P$-variate Student's $T$ density with location parameter $\hat\beta$, scaling matrix $s^2(X^TX)^{-1}$, and $\nu = N-P$ degrees of freedom. Furthermore, the marginal posterior density for a single parameter $\beta_p$, say, is a univariate Student's $T$ density with location parameter $\hat\beta_p$, scale parameter $s^2\{(X^TX)^{-1}\}_{pp}$, and $N-P$ degrees of freedom. The marginal posterior density for the mean of $y$ at $x_0$ is a univariate Student's $T$ density with location parameter $x_0^T\hat\beta$, scale parameter $s^2 x_0^T(X^TX)^{-1}x_0$, and $N-P$ degrees of freedom.

A highest posterior density (HPD) region of content $1-\alpha$ is defined (Box and Tiao, 1973) as a region $R$ in the parameter space such that $\Pr\{\beta \in R\} = 1-\alpha$ and, for $\beta_1 \in R$ and $\beta_2 \notin R$, $p(\beta_1 \mid y) \ge p(\beta_2 \mid y)$. For linear models with a noninformative prior, an HPD region is therefore given by the ellipsoid defined in (1.9). Similarly, the marginal HPD regions for $\beta_p$ and $x_0^T\beta$ are numerically identical to the sampling theory regions (1.11, 1.12, and 1.13).

    1.1.5 Comments

Although the three approaches to statistical inference differ considerably, they lead to essentially identical inferences. In particular, since the joint confidence, likelihood, and Bayesian HPD regions are identical, we refer to them all as inference regions.

In addition, when referring to standard errors or correlations, we will use the Bayesian term "the standard error of $\beta_p$" when, for the sampling theory or likelihood methods, we should more properly say "the standard error of the estimate of $\beta_p$".

For linear least squares, any of the approaches can be used. For nonlinear least squares, however, the likelihood approach has the simplest and most direct geometrical interpretation, and so we emphasize it.

Example:

The PCB data can be used to determine parameter estimates and joint and marginal inference regions. In this linear situation, the regions can be summarized using $\hat\beta$, $s^2$, $X^TX$, and $\nu = N-P$. For the ln(PCB) data with $\sqrt[3]{\mathrm{age}}$ as the regressor, we have $\hat\beta = (-2.391, 2.300)^T$, $s^2 = 0.246$ on $\nu = 26$ degrees of freedom, and

$$X^TX = \begin{bmatrix} 28.000 & 46.941 \\ 46.941 & 83.367 \end{bmatrix} \qquad (X^TX)^{-1} = \begin{bmatrix} 0.6374 & -0.3589 \\ -0.3589 & 0.2141 \end{bmatrix}$$

The joint 95% inference region is then

$$28.00(\beta_1 + 2.391)^2 + 93.88(\beta_1 + 2.391)(\beta_2 - 2.300) + 83.37(\beta_2 - 2.300)^2 = 2(0.246)(3.37) = 1.66$$

the marginal 95% inference interval for the parameter $\beta_1$ is

$$-2.391 \pm (0.496)\sqrt{0.6374}\,(2.056)$$

or $-3.21 \le \beta_1 \le -1.58$, and the marginal 95% inference interval for the parameter $\beta_2$ is

$$2.300 \pm (0.496)\sqrt{0.2141}\,(2.056)$$

or $1.83 \le \beta_2 \le 2.77$.

The 95% inference band for the ln(PCB) value at any $\sqrt[3]{\mathrm{age}} = x$ is

$$-2.391 + 2.300x \pm (0.496)\sqrt{(0.637 - 0.718x + 0.214x^2)\,2(3.37)}$$

    These regions are plotted in Figure 1.3.

While it is possible to give formal expressions for the least squares estimators and the regression summary quantities in terms of the matrices $X^TX$ and $(X^TX)^{-1}$, the use of these matrices for computing the estimates is not recommended. Superior computing methods are presented in Section 1.2.2.


Fig. 1.3 Inference regions for the model $\ln(\mathrm{PCB}) = \beta_1 + \beta_2\sqrt[3]{\mathrm{age}}$. Part a shows the least squares estimates (+), the parameter joint 95% inference region (solid line), and the marginal 95% inference intervals (dotted lines). Part b shows the fitted response (solid line) and the 95% inference band (dotted lines).

Finally, the assumptions which lead to the use of the least squares estimates should always be examined when using a regression model. Further discussion on assumptions and their implications is given in Section 1.3.

    1.2 THE GEOMETRY OF LINEAR LEAST SQUARES

The model (1.2) and assumptions (1.3) and (1.4) lead to the use of the least squares estimate (1.8), which minimizes the residual sum of squares (1.7). As implied by (1.7), $S(\beta)$ can be regarded as the square of the distance from the data vector $y$ to the expected response vector $X\beta$. This links the subject of linear regression to Euclidean geometry and linear algebra. The assumption of a normally distributed disturbance term satisfying (1.3) and (1.4) indicates that the appropriate scale for measuring the distance between $y$ and $X\beta$ is the usual Euclidean distance between vectors. In this way the Euclidean geometry of the $N$-dimensional response space becomes statistically meaningful. This connection between geometry and statistics is exemplified by the use of the term spherical normal for the normal distribution with the assumptions (1.3) and (1.4), because then contours of constant probability are spheres.

Note that when we speak of the linear form of the expectation function $X\beta$, we are regarding it as a function of the parameters $\beta$, and that when determining parameter estimates we are only concerned with how the expected response depends on the parameters, not with how it depends on the variables. In the PCB example we fit the response to $\sqrt[3]{\mathrm{age}}$ using linear least squares because the parameters enter the model linearly.


    1.2.1 The Expectation Surface

The process of calculating $S(\beta)$ involves two steps:

1. Using the $P$-dimensional parameter vector $\beta$ and the $N \times P$ derivative matrix $X$ to obtain the $N$-dimensional expected response vector $\eta(\beta) = X\beta$, and

2. Calculating the squared distance from $\eta(\beta)$ to the observed response $y$, $\|y - \eta(\beta)\|^2$.

The possible expected response vectors $\eta(\beta)$ form a $P$-dimensional expectation surface in the $N$-dimensional response space. This surface is a linear subspace of the response space, so we call it the expectation plane when dealing with a linear model.

Example:

To illustrate the geometry of the expectation surface, consider just three cases from the ln(PCB) versus $\sqrt[3]{\mathrm{age}}$ data:

    cube root of age    ln(PCB)
        1.26              0.92
        1.82              2.15
        2.22              2.52

The matrix $X$ is then

$$X = \begin{bmatrix} 1 & 1.26 \\ 1 & 1.82 \\ 1 & 2.22 \end{bmatrix}$$

which consists of two column vectors $x_1 = (1, 1, 1)^T$ and $x_2 = (1.26, 1.82, 2.22)^T$. These two vectors in the 3-dimensional response space are shown in Figure 1.4b, and correspond to the points $\beta = (1, 0)^T$ and $\beta = (0, 1)^T$ in the parameter plane, shown in Figure 1.4a. The expectation function $\eta(\beta) = X\beta$ defines a 2-dimensional expectation plane in the 3-dimensional response space. This is shown in Figure 1.4c, where the parameter lines corresponding to the lines $\beta_1 = -3, \ldots, 5$ and $\beta_2 = -2, \ldots, 2$, shown in Figure 1.4a, are given. A parameter line is associated with the parameter which is varying, so the lines corresponding to $\beta_1 = -3, \ldots, 5$ (dot-dashed lines) are called $\beta_2$ lines.

Note that the parameter lines in the parameter plane are straight, parallel, and equispaced, and that their images on the expectation plane are also straight, parallel, and equispaced. Because the vector $x_1$ is shorter than $x_2$ ($\|x_1\| = \sqrt{3}$ while $\|x_2\| = \sqrt{9.83}$), the spacing between the lines of constant $\beta_1$ on the expectation plane is less than that between the lines of constant $\beta_2$. Also, the vectors $x_1$ and $x_2$ are not orthogonal. The angle $\theta$ between them can be calculated from

$$\cos\theta = \frac{x_1^T x_2}{\|x_1\|\,\|x_2\|} = \frac{5.30}{\sqrt{(3)(9.83)}} = 0.98$$

to be about 11°, so the parameter lines on the expectation plane are not at right angles as they are on the parameter plane.

Fig. 1.4 Expectation surface for the 3-case PCB example. Part a shows the parameter plane with $\beta_1$ parameter lines (dashed) and $\beta_2$ parameter lines (dot-dashed). Part b shows the vectors $x_1$ (dashed line) and $x_2$ (dot-dashed line) in the response space. The end points of the vectors correspond to $\beta = (1, 0)^T$ and $\beta = (0, 1)^T$ respectively. Part c shows a portion of the expectation plane (shaded) in the response space, with $\beta_1$ parameter lines (dashed) and $\beta_2$ parameter lines (dot-dashed).

As a consequence of the unequal length and nonorthogonality of the vectors, unit squares on the parameter plane map to parallelograms on the expectation plane. The area of the parallelogram is

$$\|x_1\|\,\|x_2\|\sin\theta = \|x_1\|\,\|x_2\|\sqrt{1 - \cos^2\theta} = \sqrt{(x_1^Tx_1)(x_2^Tx_2) - (x_1^Tx_2)^2} = |X^TX|^{1/2} \qquad (1.18)$$

That is, the Jacobian determinant of the transformation from the parameter plane to the expectation plane is a constant equal to $|X^TX|^{1/2}$. Conversely, the ratio of areas in the parameter plane to those on the expectation plane is $|X^TX|^{-1/2}$.

The simple linear mapping seen in the above example is true for all linear regression models. That is, for linear models, straight parallel equispaced lines in the parameter space map to straight parallel equispaced lines on the expectation plane in the response space. Consequently, rectangles in one plane map to parallelepipeds in the other plane, and circles or spheres in one plane map to ellipses or ellipsoids in the other plane. Furthermore, the Jacobian determinant, $|X^TX|^{1/2}$, is a constant for linear models, and so regions of fixed size in one plane map to regions of fixed size in the other, no matter where they are on the plane. These properties, which make linear least squares especially simple, are discussed further in Section 1.2.3.
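A short numerical check of the quantities appearing in the example above can be written as follows; exact arithmetic gives values that differ slightly from the two-decimal rounding quoted in the text.

```python
import numpy as np

# Columns of the 3-case derivative matrix from the example above.
x1 = np.array([1.0, 1.0, 1.0])
x2 = np.array([1.26, 1.82, 2.22])
X = np.column_stack([x1, x2])

# Angle between the columns: cos(theta) = x1'x2 / (||x1|| ||x2||).
cos_theta = (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
theta_deg = np.degrees(np.arccos(cos_theta))

# Area of the parallelogram that a unit square maps to, eq. (1.18),
# equal to the Jacobian determinant |X'X|^(1/2).
area = np.sqrt(np.linalg.det(X.T @ X))
print(cos_theta, theta_deg, area)
```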

    1.2.2 Determining the Least Squares Estimates

The geometric representation of linear least squares allows us to formulate a very simple scheme for determining the parameter estimates $\hat\beta$. Since the expectation surface is linear, all we must do to determine the point $\hat\eta$ on the surface which is closest to the point $y$ is to project $y$ onto the expectation plane. This gives us $\hat\eta$, and $\hat\beta$ is then simply the value of $\beta$ corresponding to $\hat\eta$.

One approach to defining this projection is to observe that, after the projection, the residual vector $y - \hat\eta$ will be orthogonal, or normal, to the expectation plane. Equivalently, the residual vector must be orthogonal to all the columns of the $X$ matrix, so

$$X^T(y - X\hat\beta) = 0$$


which is to say that the least squares estimate $\hat\beta$ satisfies the normal equations

$$X^TX\hat\beta = X^Ty \qquad (1.19)$$

Because of (1.19) the least squares estimates are often written $\hat\beta = (X^TX)^{-1}X^Ty$ as in (1.8). However, another way of expressing the estimate, and a more stable way of computing it, involves decomposing $X$ into the product of an orthogonal matrix and an easily inverted matrix. Two such decompositions are the QR decomposition and the singular value decomposition (Dongarra, Bunch, Moler and Stewart, 1979, Chapters 9 and 11). We use the QR decomposition, where

$$X = QR$$

with the $N \times N$ matrix $Q$ and the $N \times P$ matrix $R$ constructed so that $Q$ is orthogonal (that is, $Q^TQ = QQ^T = I$) and $R$ is zero below the main diagonal. Writing

$$R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$$

where $R_1$ is $P \times P$ and upper triangular, and

$$Q = [Q_1 \mid Q_2]$$

with $Q_1$ the first $P$ columns and $Q_2$ the last $N-P$ columns of $Q$, we have

$$X = QR = Q_1R_1 \qquad (1.20)$$

Performing a QR decomposition is straightforward, as is shown in Appendix 2.

Geometrically, the columns of $Q$ define an orthonormal, or orthogonal, basis for the response space with the property that the first $P$ columns span the expectation plane. Projection onto the expectation plane is then very easy if we work in the coordinate system given by $Q$. For example, we transform the response vector to

$$w = Q^Ty \qquad (1.21)$$

with components

$$w_1 = Q_1^Ty \qquad (1.22)$$

and

$$w_2 = Q_2^Ty \qquad (1.23)$$

The projection of $w$ onto the expectation plane is then simply

$$\begin{bmatrix} w_1 \\ 0 \end{bmatrix}$$

in the $Q$ coordinates and

$$\hat\eta = Q\begin{bmatrix} w_1 \\ 0 \end{bmatrix} = Q_1w_1 \qquad (1.24)$$

in the original coordinates.


Fig. 1.5 Expectation surface for the 3-case PCB example. Part a shows a portion of the expectation plane (shaded) in the response space with $\beta_1$ parameter lines (dashed) and $\beta_2$ parameter lines (dot-dashed), together with the response vector $y$. Also shown are the orthogonal unit vectors $q_1$ and $q_2$ in the expectation plane, and $q_3$ orthogonal to the plane. Part b shows the response vector $w$, and a portion of the expectation plane (shaded) in the rotated coordinates given by $Q$.

Example:

As shown in Appendix 2, the QR decomposition (1.20) of the matrix

$$X = \begin{bmatrix} 1 & 1.26 \\ 1 & 1.82 \\ 1 & 2.22 \end{bmatrix}$$

for the 3-case PCB example is

$$Q = \begin{bmatrix} 0.5774 & -0.7409 & 0.3432 \\ 0.5774 & 0.0732 & -0.8132 \\ 0.5774 & 0.6677 & 0.4700 \end{bmatrix} \qquad R = \begin{bmatrix} 1.7321 & 3.0600 \\ 0 & 0.6820 \\ 0 & 0 \end{bmatrix}$$

which gives [equation (1.21)]

$$w = \begin{bmatrix} 3.23 \\ 1.16 \\ -0.24 \end{bmatrix}$$

In Figure 1.5a we show the expectation plane and observation vector in the original coordinate system. We also show the vectors $q_1$, $q_2$, $q_3$,


which are the columns of $Q$. It can be seen that $q_1$ and $q_2$ lie in the expectation plane and $q_3$ is orthogonal to it. In Figure 1.5b we show, in the transformed coordinates, the observation vector and the expectation plane, which is now horizontal. Note that projecting $w$ onto the expectation plane is especially simple, since it merely requires replacing the last element in $w$ by zero.

To determine the least squares estimate we must find the value $\hat\beta$ corresponding to $\hat\eta$. Since

$$\hat\eta = X\hat\beta$$

using (1.24) and (1.20),

$$R_1\hat\beta = w_1 \qquad (1.25)$$

and we solve for $\hat\beta$ by back-substitution (Stewart, 1973).

Example:

For the complete ln(PCB), $\sqrt[3]{\mathrm{age}}$ data set,

$$R_1 = \begin{bmatrix} 5.29150 & 8.87105 \\ 0 & 2.16134 \end{bmatrix}$$

and $w_1 = (7.7570, 4.9721)^T$, so $\hat\beta = (-2.391, 2.300)^T$.
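A minimal sketch of the computations (1.20) to (1.25) for the 3-case data, using numpy's QR routine, is given below. The signs of the computed $Q$ and $R$ may be flipped relative to the values printed above; the estimates are unaffected.

```python
import numpy as np
from scipy.linalg import solve_triangular

# Three-case PCB illustration: columns (1, cbrt(age)), responses ln(PCB).
X = np.array([[1.0, 1.26],
              [1.0, 1.82],
              [1.0, 2.22]])
y = np.array([0.92, 2.15, 2.52])
N, P = X.shape

# Full QR decomposition X = QR with R zero below the main diagonal (1.20).
Q, R = np.linalg.qr(X, mode="complete")
R1 = R[:P, :]                     # P x P upper triangular block

# Rotated response w = Q'y (1.21), split into w1 (first P) and w2 (last N - P).
w = Q.T @ y
w1, w2 = w[:P], w[P:]

# Back-substitution for beta_hat in R1 beta_hat = w1 (1.25).
beta_hat = solve_triangular(R1, w1)

# The orthogonal component w2 gives the residual sum of squares directly.
s2 = (w2 @ w2) / (N - P)
print(beta_hat, s2)
```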

    1.2.3 Parameter Inference Regions

Just as the least squares estimates have informative geometric interpretations, so do the parameter inference regions (1.9), (1.10), (1.15) and those derived from (1.17). Such interpretations are helpful for understanding linear regression, and are essential for understanding nonlinear regression. (The geometric interpretation is less helpful in the Bayesian approach, so we discuss only the sampling theory and likelihood approaches.)

The main difference between the likelihood and sampling theory geometric interpretations is that the likelihood approach centers on the point $y$ and the length of the residual vector at $\eta(\beta)$ compared to the shortest residual vector, while the sampling theory approach focuses on possible values of $\eta(\beta)$ and the angle that the resulting residual vectors could make with the expectation plane.

1.2.3.1 The Geometry of Sampling Theory Results. To develop the geometric basis of linear regression results from the sampling theory approach, we transform to the $Q$ coordinate system. The model for the random variable $W = Q^TY$ is

$$W = R\beta + Q^TZ$$

or

$$U = W - R\beta \qquad (1.26)$$

where $U = Q^TZ$.


The spherical normal distribution of $Z$ is not affected by the orthogonal transformation, so $U$ also has a spherical normal distribution. This can be established on the basis of the geometry, since the spherical probability contours will not be changed by a rigid rotation or reflection, which is what an orthogonal transformation must be. Alternatively, this can be established analytically because $Q^TQ = I$, so the determinant of $Q$ is $\pm 1$ and $\|Qx\| = \|x\|$ for any $N$-vector $x$. Now the joint density for the random variables $Z = (Z_1, \ldots, Z_N)^T$ is

$$p_Z(z) = (2\pi\sigma^2)^{-N/2} \exp\left( \frac{-z^Tz}{2\sigma^2} \right)$$

and, after transformation, the joint density for $U = Q^TZ$ is

$$p_U(u) = (2\pi\sigma^2)^{-N/2}|Q| \exp\left( \frac{-u^TQ^TQu}{2\sigma^2} \right) = (2\pi\sigma^2)^{-N/2} \exp\left( \frac{-u^Tu}{2\sigma^2} \right)$$

From (1.26), the form of $R$ leads us to partition $U$ into two components:

$$U = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix}$$

where $U_1$ consists of the first $P$ elements of $U$, and $U_2$ the remaining $N-P$ elements. Each of these components has a spherical normal distribution of the appropriate dimension. Furthermore, independence of elements in the original disturbance vector $Z$ leads to independence of the elements of $U$, so the components $U_1$ and $U_2$ are independent.

The dimensions $\nu_i$ of the components $U_i$, called the degrees of freedom, are $\nu_1 = P$ and $\nu_2 = N - P$. The sum of squares of the coordinates of a $\nu$-dimensional spherical normal vector has a $\sigma^2\chi^2_\nu$ distribution on $\nu$ degrees of freedom, so

$$\|U_1\|^2 \sim \sigma^2\chi^2_P \qquad \|U_2\|^2 \sim \sigma^2\chi^2_{N-P}$$

where the symbol $\sim$ is read "is distributed as". Using the independence of $U_1$ and $U_2$, we have

$$\frac{\|U_1\|^2/P}{\|U_2\|^2/(N-P)} \sim F(P, N-P) \qquad (1.27)$$

since the scaled ratio of two independent $\chi^2$ random variables is distributed as Fisher's $F$ distribution.

The distribution (1.27) gives a reference distribution for the ratio of the squared component lengths or, equivalently, for the angle that the disturbance vector makes with the horizontal plane. We may therefore use (1.26) and (1.27) to test the hypothesis that $\beta$ equals some specific value, say $\beta^0$, by calculating the residual vector $u^0 = Q^Ty - R\beta^0$ and comparing the lengths of the components $u_1^0$ and $u_2^0$ as in (1.27). The reasoning here is that a large $\|u_1^0\|$ compared to $\|u_2^0\|$ suggests that the vector $y$ is not very likely to have been generated by the model (1.2) with $\beta = \beta^0$, since $u^0$ has a suspiciously large component in the $Q_1$ plane.

Note that

$$\frac{\|u_2^0\|^2}{N-P} = \frac{S(\hat\beta)}{N-P} = s^2$$

and

$$\|u_1^0\|^2 = \|R_1\beta^0 - w_1\|^2 \qquad (1.28)$$

and so the ratio (1.27) becomes

$$\frac{\|R_1\beta^0 - w_1\|^2}{P s^2} \qquad (1.29)$$
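A sketch of this test in Python is given below; the inputs are hypothetical placeholders chosen only to show the calling pattern, with $R_1$, $w_1$, and $s^2$ obtained from a QR fit as in Section 1.2.2.

```python
import numpy as np
from scipy import stats

def beta_test(R1, w1, s2, beta0, N, P):
    """Ratio (1.29) and its tail probability for H0: beta = beta0."""
    numerator = np.sum((R1 @ beta0 - w1) ** 2)   # ||R1 beta0 - w1||^2
    ratio = numerator / (P * s2)
    return ratio, stats.f.sf(ratio, P, N - P)

# Hypothetical inputs purely for illustration.
R1 = np.array([[2.0, 1.0],
               [0.0, 1.5]])
w1 = np.array([3.0, 2.0])
ratio, p = beta_test(R1, w1, s2=0.5, beta0=np.array([1.0, 1.2]), N=20, P=2)
print(ratio, p)
```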

    Example:

    We illustrate the decomposition of the residual u for testing the nullhypothesis

    H0 : = (2.0, 2.0)Tversus the alternative

    HA : = (2.0, 2.0)T

    for the full PCB data set in Figure 1.6. Even though the rotateddata vector w and the expectation surface for this example are in a28-dimensional space, the relevant distances can be pictured in the 3-dimensional space spanned by the expectation surface (vectors q1 andq2

    ) and the residual vector. The scaled lengths of the components u1and u2 are compared to determine if the point

    0 = (2.0, 2.0)T isreasonable.

    The numerator in (1.29) is 5.291500

    8.871052.16134

    2.02.0

    7.75704.9721

    2 = 0.882The ratio is then 0.882/(2 0.246) = 1.79, which corresponds to a tailprobability (or p value) of 0.19 for an F distribution with 2 and 26degrees of freedom. Since the probability of obtaining a ratio at leastas large as 1.79 is 19%, we do not reject the null hypothesis.

A $1-\alpha$ joint confidence region for the parameters $\beta$ consists of all those values for which the above hypothesis test is not rejected at level $\alpha$. Thus, a value $\beta^0$ is within a $1-\alpha$ confidence region if

$$\frac{\|u_1^0\|^2/P}{\|u_2^0\|^2/(N-P)} \le F(P, N-P; \alpha)$$


Fig. 1.6 A geometric interpretation of the test $H_0: \beta = (-2.0, 2.0)^T$ for the full PCB data set. We show the projections of the response vector $w$ and a portion of the expectation plane projected into the 3-dimensional space given by the tangent vectors $q_1$ and $q_2$, and the orthogonal component of the response vector, $w_2$. For the test point $\beta^0$, the residual vector $u^0$ is decomposed into a tangential component $u_1^0$ and an orthogonal component $u_2^0$.


Since $s^2$ does not depend on $\beta^0$, the points inside the confidence region form a disk on the expectation plane defined by

$$\|u_1\|^2 \le P s^2 F(P, N-P; \alpha)$$

Furthermore, from (1.25) and (1.28) we have

$$\|u_1\|^2 = \|R_1(\beta - \hat\beta)\|^2$$

so a point on the boundary of the confidence region in the parameter space satisfies

$$R_1(\beta - \hat\beta) = \sqrt{P s^2 F(P, N-P; \alpha)}\; d$$

where $\|d\| = 1$. That is, the confidence region is given by

$$\left\{ \beta = \hat\beta + \sqrt{P s^2 F(P, N-P; \alpha)}\; R_1^{-1}d \;\middle|\; \|d\| = 1 \right\} \qquad (1.30)$$

Thus the region of reasonable parameter values is a disk centered at $R_1\hat\beta$ on the expectation plane and is an ellipse centered at $\hat\beta$ in the parameter space.

Example:

For the ln(PCB) versus $\sqrt[3]{\mathrm{age}}$ data, $\hat\beta = (-2.391, 2.300)^T$ and $s^2 = 0.246$ based on 26 degrees of freedom, so the 95% confidence disk on the transformed expectation surface is

$$R_1\beta = \begin{bmatrix} 7.7570 \\ 4.9721 \end{bmatrix} + 1.288\begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix}$$

where $0 \le \phi \le 2\pi$. The disk is shown in the expectation plane in Figure 1.7a, and the corresponding ellipse

$$\beta = \begin{bmatrix} -2.391 \\ 2.300 \end{bmatrix} + 1.288\begin{bmatrix} 0.18898 & -0.77566 \\ 0 & 0.46268 \end{bmatrix}\begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix}$$

is shown in the parameter plane in Figure 1.7b.
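The mapping (1.30) from the disk on the expectation plane to the ellipse in the parameter plane can be traced numerically. The sketch below uses the quantities reported in the example above; the grid of directions is arbitrary.

```python
import numpy as np
from scipy import stats

# Quantities from the example: beta_hat, s^2 on 26 df, and R1^{-1}.
beta_hat = np.array([-2.391, 2.300])
R1_inv = np.array([[0.18898, -0.77566],
                   [0.0,      0.46268]])
s2, N, P = 0.246, 28, 2

# Radius of the disk on the expectation plane: sqrt(P s^2 F(P, N-P; 0.05)).
radius = np.sqrt(P * s2 * stats.f.ppf(0.95, P, N - P))

# Map unit direction vectors d through R1^{-1} to get the ellipse boundary (1.30).
phi = np.linspace(0.0, 2.0 * np.pi, 200)
d = np.vstack([np.cos(phi), np.sin(phi)])            # 2 x 200 unit directions
ellipse = beta_hat[:, None] + radius * (R1_inv @ d)  # boundary points in the parameter plane
print(radius, ellipse[:, :3])
```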

    1.2.4 Marginal Confidence Intervals

We can create a marginal confidence interval for a single parameter, say $\beta_1$, by inverting a hypothesis test of the form

$$H_0: \beta_1 = \beta_1^0$$

versus

$$H_A: \beta_1 \ne \beta_1^0$$

Any $\beta_1^0$ for which $H_0$ is not rejected at level $\alpha$ is included in the $1-\alpha$ confidence interval.


Fig. 1.7 The 95% confidence disk and parameter confidence region for the PCB data. Part a shows the response vector $w$ and a portion of the expectation plane projected into the 3-dimensional space given by the tangent vectors $q_1$ and $q_2$, and the orthogonal component of the response vector, $w_2$. The 95% confidence disk (shaded) in the expectation plane (part a) maps to the elliptical confidence region (shaded) in the parameter plane (part b).

To perform the hypothesis test, we choose any parameter vector with $\beta_1 = \beta_1^0$, say $(\beta_1^0, \beta_2^{0\,T})^T$, calculate the transformed residual vector $u^0$, and divide it into three components: the first component $u_1^0$, of dimension $P-1$ and parallel to the hyperplane defined by $\beta_1 = \beta_1^0$; the second component $u_2^0$, of dimension 1 and in the expectation plane but orthogonal to the $\beta_1^0$ hyperplane; and the third component $u_3^0$, of length $\sqrt{(N-P)s^2}$ and orthogonal to the expectation plane. The component $u_2^0$ is the same for any parameter $\beta$ with $\beta_1 = \beta_1^0$, and, assuming that the true $\beta_1$ is $\beta_1^0$, the scaled ratio of the corresponding random variables $U_2$ and $U_3$ has the distribution

$$\frac{\|U_2\|^2/1}{\|U_3\|^2/(N-P)} \sim F(1, N-P)$$

Thus we reject $H_0$ at level $\alpha$ if

$$\frac{\|u_2^0\|^2}{s^2} \ge F(1, N-P; \alpha)$$

Example:

To test the null hypothesis

$$H_0: \beta_1 = -2.0$$

versus the alternative

$$H_A: \beta_1 \ne -2.0$$

for the complete PCB data set, we decompose the transformed residual vector at $\beta^0 = (-2.0, 2.2)^T$ into three components as shown in Figure 1.8 and calculate the ratio

$$\frac{\|u_2^0\|^2}{s^2} = \frac{0.240}{0.246} = 0.97$$

Fig. 1.8 A geometric interpretation of the test $H_0: \beta_1 = -2.0$ for the full PCB data set. We show the response vector $w$, and a portion of the expectation plane projected into the 3-dimensional space given by the tangent vectors $q_1$ and $q_2$, and the orthogonal component of the response vector, $w_2$. For a representative point on the line $\beta_1 = -2$ the residual vector $u^0$ is decomposed into a tangential component $u_1^0$ along the line, a tangential component $u_2^0$ perpendicular to the line, and an orthogonal component $u_3^0$.

This corresponds to a $p$ value of 0.33, and so we do not reject the null hypothesis.

We can create a $1-\alpha$ marginal confidence interval for $\beta_1$ as all values for which

$$\|u_2^0\|^2 \le s^2 F(1, N-P; \alpha)$$

or, equivalently,

$$|u_2^0| \le s\, t(N-P; \alpha/2) \qquad (1.31)$$

Since $|u_2^0|$ is the distance from the point $R_1\hat\beta$ to the line corresponding to $\beta_1 = \beta_1^0$ on the transformed parameter plane, the confidence interval will include all values $\beta_1^0$ for which the corresponding parameter line intersects the disk

$$\left\{ R_1\hat\beta + s\, t(N-P; \alpha/2)\, d \;\middle|\; \|d\| = 1 \right\} \qquad (1.32)$$

Instead of determining the value of $|u_2^0|$ for each $\beta_1^0$, we take the disk (1.32) and determine the minimum and maximum values of $\beta_1$ for points on the disk. Writing $r^1$ for the first row of $R_1^{-1}$, the values of $\beta_1$ corresponding to points on the expectation plane disk are

$$r^1\left( R_1\hat\beta + s\, t(N-P; \alpha/2)\, d \right) = \hat\beta_1 + s\, t(N-P; \alpha/2)\, r^1d$$

and the minimum and maximum occur for the unit vectors in the direction of $r^1$; that is, $d = \pm r^{1T}/\|r^1\|$. This gives the confidence interval

$$\hat\beta_1 \pm s\,\|r^1\|\, t(N-P; \alpha/2)$$

In general, a marginal confidence interval for parameter $\beta_p$ is

$$\hat\beta_p \pm s\,\|r^p\|\, t(N-P; \alpha/2) \qquad (1.33)$$

where $r^p$ is the $p$th row of $R_1^{-1}$. The quantity

$$\mathrm{se}(\hat\beta_p) = s\,\|r^p\| \qquad (1.34)$$

is called the standard error for the $p$th parameter. Since

$$(X^TX)^{-1} = (R_1^TR_1)^{-1} = R_1^{-1}R_1^{-T}$$

$\|r^p\|^2 = \{(X^TX)^{-1}\}_{pp}$, so the standard error can be written as in equation (1.11).

A convenient summary of the variability of the parameter estimates can be obtained by factoring $R_1^{-1}$ as

$$R_1^{-1} = \mathrm{diag}(\|r^1\|, \|r^2\|, \ldots, \|r^P\|)\, L \qquad (1.35)$$


where $L$ has unit length rows. The diagonal matrix provides the parameter standard errors, while the correlation matrix

$$C = LL^T \qquad (1.36)$$

gives the correlations between the parameter estimates.

Example:

For the ln(PCB) data, $\hat\beta = (-2.391, 2.300)^T$, $s^2 = 0.246$ with 26 degrees of freedom, and

$$R_1^{-1} = \begin{bmatrix} 5.29150 & 8.87105 \\ 0 & 2.16134 \end{bmatrix}^{-1} = \begin{bmatrix} 0.18898 & -0.77566 \\ 0 & 0.46268 \end{bmatrix} = \begin{bmatrix} 0.798 & 0 \\ 0 & 0.463 \end{bmatrix}\begin{bmatrix} 0.237 & -0.972 \\ 0 & 1 \end{bmatrix}$$

which gives standard errors of $0.798\sqrt{0.246} = 0.396$ for $\beta_1$ and $0.463\sqrt{0.246} = 0.230$ for $\beta_2$. Also

$$C = \begin{bmatrix} 1 & -0.972 \\ -0.972 & 1 \end{bmatrix}$$

so the correlation between $\hat\beta_1$ and $\hat\beta_2$ is $-0.97$. The 95% confidence intervals for the parameters are given by $-2.391 \pm 2.056(0.396)$ and $2.300 \pm 2.056(0.230)$, which are plotted in Figure 1.3a.
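The factorization (1.35) and correlation matrix (1.36) are easy to reproduce from $R_1$; a sketch using the values printed above is:

```python
import numpy as np

# R1 from the full ln(PCB) fit, as printed above, and s^2.
R1 = np.array([[5.29150, 8.87105],
               [0.0,     2.16134]])
s2 = 0.246

R1_inv = np.linalg.inv(R1)
row_norms = np.linalg.norm(R1_inv, axis=1)   # ||r^p|| for p = 1, ..., P

# Standard errors (1.34): se(beta_p) = s ||r^p||.
se = np.sqrt(s2) * row_norms

# Factor R1^{-1} = diag(||r^p||) L, then C = L L' is the correlation matrix (1.36).
L = R1_inv / row_norms[:, None]
C = L @ L.T
print(se, C)
```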

Marginal confidence intervals for the expected response at a design point $x_0$ can be created by determining which hyperplanes formed by constant $x_0^T\beta$ intersect the disk (1.32). Using the same argument as was used to derive (1.33), we obtain a standard error for the expected response at $x_0$ as $s\|x_0^TR_1^{-1}\|$, so the confidence interval is

$$x_0^T\hat\beta \pm s\,\|x_0^TR_1^{-1}\|\, t(N-P; \alpha/2) \qquad (1.37)$$

Similarly, a confidence band for the response function is

$$x^T\hat\beta \pm s\,\|x^TR_1^{-1}\|\, \sqrt{P\, F(P, N-P; \alpha)} \qquad (1.38)$$

Example:

A plot of the fitted expectation function and the 95% confidence bands for the PCB example was given in Figure 1.3b.
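A sketch of the band computation (1.38) over a grid of $x = \sqrt[3]{\mathrm{age}}$ values (the grid itself is arbitrary) is:

```python
import numpy as np
from scipy import stats

# Summary quantities for the ln(PCB) vs cube-root-of-age fit, as above.
beta_hat = np.array([-2.391, 2.300])
R1_inv = np.array([[0.18898, -0.77566],
                   [0.0,      0.46268]])
s, N, P = np.sqrt(0.246), 28, 2

# Fitted line and 95% inference band (1.38) on a grid of x = cbrt(age).
x = np.linspace(1.0, 2.3, 50)
X_grid = np.column_stack([np.ones_like(x), x])      # rows x' = (1, x)
fit = X_grid @ beta_hat
half = s * np.linalg.norm(X_grid @ R1_inv, axis=1) * np.sqrt(P * stats.f.ppf(0.95, P, N - P))
band = np.column_stack([x, fit - half, fit, fit + half])
print(band[:3])
```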

Ansley (1985) gives derivations of other sampling theory results in linear regression using the QR decomposition, which, as we have seen, is closely related to the geometric approach to regression.

    1.2.5 The Geometry of Likelihood Results

The likelihood function indicates the plausibility of values of $\eta$ relative to $y$, and consequently has a simple geometrical interpretation. If we allow $\eta$ to take on any value in the $N$-dimensional response space, the likelihood contours are spheres centered on $y$. Values of $\eta$ of the form $\eta = X\beta$ generate a $P$-dimensional expectation plane, and so the intersection of the plane with the likelihood spheres produces disks.

Analytically, the likelihood function (1.6) depends on $\eta$ through

$$\|\eta - y\|^2 = \|Q^T(\eta - y)\|^2 = \|Q_1^T(\eta - y)\|^2 + \|Q_2^T(\eta - y)\|^2 = \|w(\beta) - w_1\|^2 + \|w_2\|^2 \qquad (1.39)$$

where $w(\beta) = Q_1^T\eta$ and $Q_2^T\eta = 0$. A constant value of the total sum of squares specifies a disk of the form

$$\|w(\beta) - w_1\|^2 = c$$

on the expectation plane. Choosing

$$c = P s^2 F(P, N-P; \alpha)$$

produces the disk corresponding to a $1-\alpha$ confidence region. In terms of the total sum of squares, the contour is

$$S(\beta) = S(\hat\beta)\left[ 1 + \frac{P}{N-P}\, F(P, N-P; \alpha) \right] \qquad (1.40)$$

As shown previously, and illustrated in Figure 1.7, this disk transforms to an ellipsoid in the parameter space.

    1.3 ASSUMPTIONS AND MODEL ASSESSMENT

The statistical assumptions which lead to the use of the least squares estimates encompass several different aspects of the regression model. As with any statistical analysis, if the assumptions on the model and data are not appropriate, the results of the analysis will not be valid.

Since we cannot guarantee a priori that the different assumptions are all valid, we must proceed in an iterative fashion as described, for example, in Box, Hunter and Hunter (1978). We entertain a plausible statistical model for the data, analyze the data using that model, then go back and use diagnostics such as plots of the residuals to assess the assumptions. If the diagnostics indicate failure of assumptions in either the deterministic or stochastic components of the model, we must modify the model or the analysis and repeat the cycle.

It is important to recognize that the design of the experiment and the method of data collection can affect the chances of assumptions being valid in a particular experiment. In particular, randomization can be of great help in ensuring the appropriateness of all the assumptions, and replication allows greater ability to check the appropriateness of specific assumptions.


    1.3.1 Assumptions and Their Implications

    The assumptions, as listed in Section 1.1.1, are:

1. The expectation function is correct. Ensuring the validity of this assumption is, to some extent, the goal of all science. We wish to build a model with which we can predict natural phenomena. It is in building the mathematical model for the expectation function that we frequently find ourselves in an iterative loop. We proceed as though the expectation function were correct, but we should be prepared to modify it as the data and the analyses dictate. In almost all linear, and in many nonlinear, regression situations we do not know the true model, but we choose a plausible one by examining the situation, looking at data plots and cross-correlations, and so on. As the analysis proceeds we can modify the expectation function and the assumptions about the disturbance term to obtain a more sensible and useful answer. Models should be treated as just models, and it must be recognized that some will be more appropriate or adequate than others. Nevertheless, assumption (1) is a strong one, since it implies that the expectation function includes all the important predictor variables in precisely the correct form, and that it does not include any unimportant predictor variables. A useful technique to enable checking the adequacy of a model function is to include replications in the experiment. It is also important to actually manipulate the predictor variables and randomize the order in which the experiments are done, to ensure that causation, not correlation, is being determined (Box, 1960).

2. The response is expectation function plus disturbance. This assumption is important theoretically, since it allows the probability density function for the random variable $Y$ describing the responses to be simply calculated from the probability density function for the random variable $Z$ describing the disturbances. Thus,

$$p_Y(y \mid \beta, \sigma^2) = p_Z(y - X\beta \mid \sigma^2)$$

In practice, this assumption is closely tied to the assumption of constant variance of the disturbances. It may be the case that the disturbances can be considered as having constant variance, but as entering the model multiplicatively, since in many phenomena, as the level of the "signal" increases, the level of the "noise" increases. This lack of additivity of the disturbance will manifest itself as a nonconstant variance in the diagnostic plots. In both cases, the corrective action is the same: either use weighted least squares or take a transformation of the response as was done in Example PCB 1.

3. The disturbance is independent of the expectation function. This assumption is closely related to assumption (2), since they both relate to appropriateness of the additive model. One of the implications of this assumption is that the control or predictor variables are measured perfectly. Also, as a converse to the implication in assumption (1) that all important variables are included in the model, this assumption implies that any important variables which are not included are not systematically related to the response. An important technique to improve the chances that this is true is to randomize the order in which the experiments are done, as suggested by Fisher (1935). In this way, if an important variable has been omitted, its effect may be manifested as a disturbance (and hence simply inflate the variability of the observations) rather than being confounded with one of the predictor effects (and hence bias the parameter estimates). And, of course, it is important to actually manipulate the predictor variables, not merely record their values.

4. Each disturbance has a normal distribution. The assumption of normality of the disturbances is important, since this dictates the form of the sampling distribution of the random variables describing the responses, and through this, the likelihood function for the parameters. This leads to the criterion of least squares, which is enormously powerful because of its mathematical tractability. For example, given a linear model, it is possible to write down the analytic solution for the parameter estimators and to show [Gauss's theorem (Seber, 1977)] that the least squares estimates are the best both individually and in any linear combination, in the sense that they have the smallest mean square error of any linear estimators. The normality assumption can be justified by appealing to the central limit theorem, which states that the resultant of many disturbances, no one of which is dominant, will tend to be normally distributed. Since most experiments involve many operations to set up and measure the results, it is reasonable to assume, at least tentatively, that the disturbances will be normally distributed. Again, the assumption of normality will be more likely to be appropriate if the order of the experiments is randomized. The assumption of normality may be checked by examining the residuals.

5. Each disturbance has zero mean. This assumption is primarily a simplifying one, which reduces the number of unknown parameters to a manageable level. Any nonzero mean common to all observations can be accommodated by introducing a constant term in the expectation function, so this assumption is unimportant in linear regression. It can be important in nonlinear regression, however, where many expectation functions occur which do not include a constant. The main implication of this assumption is that there is no systematic bias in the disturbances such as could be caused by an unsuspected influential variable. Hence, we see again the value of randomization.


6. The disturbances have equal variances. This assumption is more important practically than theoretically, since a solution exists for the least squares estimation problem for the case of unequal variances [see, e.g., Draper and Smith (1981) concerning weighted least squares]. Practically, however, one must describe how the variances vary, which can only be done by making further assumptions, or by using information from replications and incorporating this into the analysis through generalized least squares, or by transforming the data. When the variance is constant, the likelihood function is especially simple, since the parameters $\beta$ can be estimated independently of the nuisance parameter $\sigma^2$. The main implication of this assumption is that all data values are equally unreliable, and so the simple least squares criterion can be used. The appropriateness of this assumption can sometimes be checked after a model has been fitted by plotting the residuals versus the fitted values, but it is much better to have replications. With replications, we can check the assumption before even fitting a model, and can in fact use the replication averages and variances to determine a suitable variance-stabilizing transformation; see Section 1.3.2. Transforming to constant variance often has the additional effect of making the disturbances behave more normally. This is because a constant variance is necessarily independent of the mean (and anything else, for that matter), and this independence property is fundamental to the normal density.

7. The disturbances are distributed independently. The final assumption is that the disturbances in different experiments are independent of one another. This is an enormously simplifying assumption, because then the joint probability density function for the vector $Y$ is just the product of the probability densities for the individual random variables $Y_n$, $n = 1, 2, \ldots, N$. The implication of this assumption is that the disturbances on separate runs are not systematically related, an assumption which can usually be made to be more appropriate by randomization. Nonindependent disturbances can be treated by generalized least squares, but, as in the case where there is nonconstant variance, modifications to the model must be made either through information gained from the data, or by additional assumptions as to the nature of the interdependence.

    1.3.2 Model Assessment

In this subsection we present some simple methods for verifying the appropriateness of assumptions, especially through plots of residuals. Further discussion on regression diagnostics for linear models is given in Hocking (1983), and in the books by Belsley et al. (1980), Cook and Weisberg (1982), and Draper and Smith (1981). In Chapter 3 we discuss model assessment for nonlinear models.


    1.3.3 Plotting Residuals

A simple, effective method for checking the adequacy of a model is to plot the studentized residuals, $z_n/(s\sqrt{1 - h_{nn}})$, versus the predictor variables and any other possibly important lurking variables (Box, 1960; Joiner, 1981). The term $h_{nn}$ is the $n$th diagonal term of the hat matrix $H = X(X^TX)^{-1}X^T = Q_1Q_1^T$, and $z_n$ is the residual for the $n$th case,

$$z_n = y_n - \hat{y}_n$$

A relationship between the residuals and any variable then suggests that there is an effect due to that variable which has not been accounted for. Features to look for include systematic linear or curvilinear behavior of the residuals with respect to a variable. Important common lurking variables include time or the order number of the experiment; if a plot of residuals versus time shows suspicious behavior, such as runs of residuals of the same sign, then the assumption of independence of the disturbances may be inappropriate.

Plotting residuals versus the fitted values $\hat{y}_n$ is also useful, since such plots can reveal outliers or general inadequacy in the form of the expectation function. It is also a very effective plot for revealing whether the assumption of constant variance is appropriate. The most common form of nonconstant variance is an increase in the variability in the responses when the level of the response changes. This behavior was noticed in the original PCB data. If a regression model is fitted to such data, the plot of the studentized residuals versus the fitted values tends to have a wedge-shaped pattern.

When residual plots or the data themselves give an indication of nonconstant variance, the estimation procedure should be modified. Possible changes include transforming the data as was done with the PCB data or using weighted least squares.
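The studentized residuals defined above are straightforward to compute once the model has been fitted; the following sketch uses simulated data purely for illustration.

```python
import numpy as np

# Simulated data, for illustration only; X is the N x P derivative matrix.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 2.2, 20)
X = np.column_stack([np.ones_like(x), x])
y = -2.4 + 2.3 * x + rng.normal(scale=0.5, size=x.size)

# Fit via the QR decomposition; the hat matrix is H = Q1 Q1' = X (X'X)^{-1} X'.
Q1, R1 = np.linalg.qr(X)            # reduced QR: Q1 is N x P
h = np.sum(Q1 * Q1, axis=1)         # diagonal terms h_nn of H

fitted = X @ np.linalg.solve(R1, Q1.T @ y)
z = y - fitted                      # ordinary residuals z_n = y_n - yhat_n
s = np.sqrt(z @ z / (len(y) - X.shape[1]))

# Studentized residuals z_n / (s sqrt(1 - h_nn)), to be plotted against the
# fitted values, predictor variables, and lurking variables such as run order.
studentized = z / (s * np.sqrt(1.0 - h))
print(studentized[:5])
```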

A quantile-quantile plot (Chambers, Cleveland, Kleiner and Tukey, 1983) of the studentized residuals versus a normal distribution gives a direct check on the assumption of normality. If the expectation function is correct and the assumption of normality is appropriate, such a normal probability plot of the residuals should be a fairly straight line. Departures from a straight line therefore suggest inappropriateness of the normality assumption, although, as demonstrated in Daniel and Wood (1980), considerable variability can be expected in normal plots. Normal probability plots are also good for revealing outliers.

Example:

Plots of residuals are given in Figure 1.9 for the fit of ln(PCB) to $\sqrt[3]{\mathrm{age}}$. Since the fitted values are a linear function of the regressor variable $\sqrt[3]{\mathrm{age}}$, the form of the plot of the studentized residuals versus $\hat{y}$ will be the same as that versus $\sqrt[3]{\mathrm{age}}$, so we only display the former. The plot versus $\hat{y}$ and the quantile-quantile plot are well behaved. Neither plot reveals outliers.


Fig. 1.9 Studentized residuals for the PCB data plotted versus fitted values in part a and versus normal quantiles in part b.

    1.3.4 Stabilizing Variance

An experiment which includes replications allows further tests to be made on the appropriateness of assumptions. For example, even before an expectation function has been proposed, it is possible to check the assumption of constant variance by using an analysis of variance to get averages and variances for each set of replications and plotting the variances and standard deviations versus the averages. If the plots show systematic relationships, then one can use a variance-stabilizing procedure to transform to constant variance.

One procedure is to try a range of power transformations in the form (Box and Cox, 1964)

$$y^{(\lambda)} = \begin{cases} y^{\lambda} & \lambda \ne 0 \\ \ln y & \lambda = 0 \end{cases}$$

We calculate and plot variances versus averages for $y^{(\lambda)}$, $\lambda = 0, \pm 0.5, \pm 1, \ldots$ and select that value of $\lambda$ for which the variance appears to be most stable. Alternatively, for a random variable $Y$, if there is a power relationship between the standard deviation and the mean such that $\sigma \propto \mu^{\alpha}$, it can be shown (Draper and Smith, 1981; Montgomery and Peck, 1982; Box et al., 1978) that the variance of the transformed random variable $Y^{1-\alpha}$ will be approximately constant.
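A small sketch of this procedure, with made-up replicated responses, is shown below; the idea is simply to tabulate replication averages and standard deviations for each candidate power.

```python
import numpy as np

def power_transform(y, lam):
    """Power transform y^(lambda), taken as ln y when lambda = 0 (Box and Cox, 1964)."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else y ** lam

# Hypothetical replicated responses: design point -> replicate values.
replicates = {
    1.0: [0.8, 1.1, 0.9],
    4.0: [3.5, 4.6, 4.1],
    9.0: [7.9, 10.2, 9.0],
}

# For each candidate lambda, compare replication standard deviations with averages;
# a value of lambda showing no systematic trend is a reasonable choice.
for lam in (1.0, 0.5, 0.0):
    groups = [power_transform(v, lam) for v in replicates.values()]
    means = [float(np.mean(g)) for g in groups]
    sds = [float(np.std(g, ddof=1)) for g in groups]
    print(lam, np.round(means, 3), np.round(sds, 3))
```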

Variance-stabilizing transformations usually have the additional benefit of making the distribution of the disturbances appear more nearly normal, as discussed in Section 1.3.1. Alternatively, one can use the replication information to assist in choosing a form of weighting for weighted least squares.

    Example:


Fig. 1.10 Replication standard deviations plotted versus replication averages for the PCB data in part a and for the ln(PCB) data in part b.

A plot of the standard deviations versus the averages for the original PCB data is given in Figure 1.10a. It can be seen that there is a good straight line relationship between s and ȳ, and so the variance-stabilizing technique leads to the logarithmic transformation. In Figure 1.10b we plot the standard deviations versus the averages for the ln(PCB) data. This plot shows no systematic relationship, and hence substantiates the effectiveness of the logarithmic transformation in stabilizing the variance.

    1.3.5 Lack of Fit

When the data set includes replications, it is also possible to perform tests for lack of fit of the expectation function. Such analyses are based on an analysis of variance in which the residual sum of squares S(β̂) with N − P degrees of freedom is decomposed into the replication sum of squares Sr (equal to the total sum of squares of deviations of the replication values about their averages) with, say, νr degrees of freedom, and the lack of fit sum of squares Sl = S(β̂) − Sr, with νl = N − P − νr degrees of freedom. We then compare the ratio of the lack of fit mean square to the replication mean square with the appropriate value in the F table. That is, we compare

    (Sl/νl) / (Sr/νr)   with   F(νl, νr; α)

to determine whether there is significant lack of fit at level α. The geometric justification for this analysis is that the replication subspace is always orthogonal to the subspace containing the averages and the expectation function.
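This decomposition is straightforward to compute. The following sketch (Python, using numpy and scipy) assumes the responses, fitted values, replicate group labels, and the number of parameters are already available; the variable names are illustrative only.

    import numpy as np
    from scipy.stats import f

    def lack_of_fit_test(y, fitted, groups, p):
        """Lack-of-fit F test from replicated data (illustrative sketch)."""
        y, fitted, groups = map(np.asarray, (y, fitted, groups))
        n = len(y)
        s_res = np.sum((y - fitted) ** 2)          # S(beta-hat), n - p df
        # Replication sum of squares: deviations about the group averages.
        s_rep = sum(np.sum((y[groups == g] - y[groups == g].mean()) ** 2)
                    for g in np.unique(groups))
        nu_rep = n - len(np.unique(groups))
        s_lof = s_res - s_rep
        nu_lof = (n - p) - nu_rep
        f_ratio = (s_lof / nu_lof) / (s_rep / nu_rep)
        return f_ratio, f.sf(f_ratio, nu_lof, nu_rep)   # F ratio and p value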


Table 1.1 Lack of fit analysis of the model fitted to the PCB data

    Source         Degrees of   Sum of    Mean     F Ratio   p Value
                   Freedom      Squares   Square
    Lack of fit         9        1.923     0.214    0.812      0.61
    Replication        17        4.475     0.263
    Residuals          26        6.398     0.246

If no lack of fit is found, then the lack of fit analysis of variance has served its purpose, and the estimate of σ² should be based on the residual mean square. That is, the replication and lack of fit sums of squares and degrees of freedom should be recombined to give an estimate with the largest number of degrees of freedom, so as to provide the most reliable parameter and expected value confidence regions. If lack of fit is found, the analyst should attempt to discover why, and modify the expectation function accordingly. Further discussion on assessing the fit of a model and on modifying and comparing models is given in Sections 3.7 and 3.10.

Example:

For the ln(PCB) versus ∛age data, the lack of fit analysis is presented in Table 1.1. Because the p value suggests no lack of fit, we combine the lack of fit and replication sums of squares and degrees of freedom and take as our estimate of σ² the residual mean square of 0.246 based on 26 degrees of freedom. If there had been lack of fit, we would have had to modify the model: in either situation, we do not simply use the replication mean square as an estimate of the variance.

    Problems

1.1 Write a computer routine in a language of your choice to perform a QR decomposition of a matrix using Householder transformations.

1.2 Draw a picture to show the Householder transformation of a vector y = (y1, y2)ᵀ to the x axis. Use both forms of the vector u corresponding to equations (A2.1) and (A2.2). Hint: Draw a circle of radius ‖y‖.

1.3 Perform a QR decomposition of the matrix X from Example PCB 3,

    X = [ 1   1.26
          1   1.82
          1   2.22 ]

using u as in equation (A2.2). Compare the result with that in Appendix 2.

    1.4


    1.4.1. Perform a QR decomposition of the matrix

    D =

    00

    010

    11

    111

and obtain the matrix Q1. This matrix is used in Example α-pinene 6, Section 4.3.4.

1.4.2. Calculate Q2ᵀy, where y = (50.4, 32.9, 6.0, 1.5, 9.3)ᵀ, without explicitly solving for Q2.

    1.5

1.5.1. Fit the model ln(PCB) = β1 + β2 age to the PCB data and perform a lack of fit analysis of the model. What do you conclude about the adequacy of this model?

1.5.2. Plot the residuals versus age, and assess the adequacy of the model. Now what do you conclude about the adequacy of the model?

1.5.3. Fit the model ln(PCB) = β1 + β2 age + β3 age² to the PCB data and perform a lack of fit analysis of the model. What do you conclude about the adequacy of this model?

1.5.4. Perform an extra sum of squares analysis to determine whether the quadratic term is a useful addition.

1.5.5. Explain the difference between your answers in 1.5.1, 1.5.2, and 1.5.4.


2 Nonlinear Regression: Iterative Estimation and Linear Approximations

Although this may seem a paradox, all exact science is dominated by the idea of approximation.

Bertrand Russell

Linear regression is a powerful method for analyzing data described by models which are linear in the parameters. Often, however, a researcher has a mathematical expression which relates the response to the predictor variables, and these models are usually nonlinear in the parameters. In such cases, linear regression techniques must be extended, which introduces considerable complexity.

    2.1 THE NONLINEAR REGRESSION MODEL

    A nonlinear regression model can be written

    Yn = f(xn, θ) + Zn                                    (2.1)

where f is the expectation function and xn is a vector of associated regressor variables or independent variables for the nth case. This model is of exactly the same form as (1.1) except that the expected responses are nonlinear functions of the parameters. That is, for nonlinear models, at least one of the derivatives of the expectation function with respect to the parameters depends on at least one of the parameters.


To emphasize the distinction between linear and nonlinear models, we use θ for the parameters in a nonlinear model. As before, we use P for the number of parameters.

When analyzing a particular set of data we consider the vectors xn, n = 1, 2, . . . , N, as fixed and concentrate on the dependence of the expected responses on θ. We create the N-vector η(θ) with nth element

    ηn(θ) = f(xn, θ),   n = 1, . . . , N

and write the nonlinear regression model as

    Y = η(θ) + Z                                          (2.2)

with Z assumed to have a spherical normal distribution. That is,

    E[Z] = 0
    Var(Z) = E[ZZᵀ] = σ²I

as in the linear model.

Example:

Count Rumford of Bavaria was one of the early experimenters on the physics of heat. In 1798 he performed an experiment in which a cannon barrel was heated by grinding it with a blunt bore. When the cannon had reached a steady temperature of 130°F, it was allowed to cool and temperature readings were taken at various times. The ambient temperature during the experiment was 60°F, so [under Newton's law of cooling, which states that df/dt = −θ(f − T0), where T0 is the ambient temperature] the temperature at time t should be

    f(t, θ) = 60 + 70e^(−θt)

Since ∂f/∂θ = −70t e^(−θt) depends on the parameter θ, this model is nonlinear. Rumford's data are presented in Appendix A, Section A.2.

    Example:

The Michaelis-Menten model for enzyme kinetics relates the initial velocity of an enzymatic reaction to the substrate concentration x through the equation

    f(x, θ) = θ1 x / (θ2 + x)                             (2.3)

In Appendix A, Section A.3 we present data from Treloar (1974) on the initial rate of a reaction for which the Michaelis-Menten model is believed to be appropriate. The data, for an enzyme treated with Puromycin, are plotted in Figure 2.1.

Differentiating f with respect to θ1 and θ2 gives

    ∂f/∂θ1 = x / (θ2 + x)                                 (2.4)

    ∂f/∂θ2 = −θ1 x / (θ2 + x)²


Fig. 2.1 Plot of reaction velocity versus substrate concentration for the Puromycin data.

and since both these derivatives involve at least one of the parameters, the model is recognized as nonlinear.
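As a small concrete check (a Python sketch; the parameter values passed in are arbitrary, and the x values are the substrate concentrations from the Puromycin data), the model function and its two parameter derivatives can be coded directly, making it easy to verify numerically that the derivatives change with θ:

    import numpy as np

    def mm_model(x, theta):
        t1, t2 = theta
        return t1 * x / (t2 + x)            # Michaelis-Menten expectation function

    def mm_derivs(x, theta):
        t1, t2 = theta
        d1 = x / (t2 + x)                   # df/d(theta1)
        d2 = -t1 * x / (t2 + x) ** 2        # df/d(theta2)
        return np.column_stack((d1, d2))

    x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.10])
    print(mm_derivs(x, (200.0, 0.08)))      # derivatives depend on theta,
    print(mm_derivs(x, (100.0, 0.05)))      # so the model is nonlinear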

    2.1.1 Transformably Linear Models

The Michaelis-Menten model (2.3) can be transformed to a linear model by expressing the reciprocal of the velocity as a function of the reciprocal substrate concentration,

    1/f = 1/θ1 + (θ2/θ1)(1/x)                             (2.5)
        = β1 + β2 u

We call such models transformably linear. Some authors use the term intrinsically linear, but we reserve the term intrinsic for a special geometric property of nonlinear models, as discussed in Chapter 7. As will be seen in Chapter 3, transformably linear models have some advantages in nonlinear regression because it is easy to get starting values for some of the parameters.

It is important to understand, however, that a transformation of the data involves a transformation of the disturbance term too, which affects the assumptions on it. Thus, if we assume the model function (2.2) with an additive, spherical normal disturbance term is an appropriate representation of the experimental situation, then these same assumptions will not be appropriate for


Fig. 2.2 Plot of inverse velocity versus inverse substrate concentration for the Puromycin experiment with the linear regression line (dashed) in part a, and the corresponding fitted curve (dashed) in the original scale in part b.

the transformed data. Hence we should use nonlinear regression on the original data, or else weighted least squares on the transformed data. Sometimes, of course, transforming a data set to induce constant variance also produces a linear expectation function, in which case linear regression can be used on the transformed data.

Example:

Because there are replications in the Puromycin data set, it is easy to see from Figure 2.1 that the variance of the original data is constant, and hence that nonlinear regression should be used to estimate the parameters. However, the reciprocal data, plotted in Figure 2.2a, while showing a simple straight line relationship, also show decidedly nonconstant variance.

If we use linear regression to fit the model (2.5) to these data, we obtain the estimates

    β̂ = (0.005107, 0.0002472)ᵀ

corresponding to

    θ = (195.8, 0.04841)ᵀ

The fitted curve is overlaid with the data in the original scale in Figure 2.2b, where we see that the predicted asymptote is too small. Because the variance of the replicates has been distorted by the transformation, the cases with low concentration (high reciprocal concentration) dominate the determination of the parameters and the curve does not fit the data well at high concentrations.
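The linearized fit quoted above is easy to reproduce. The following sketch (Python; ordinary, unweighted linear regression on the reciprocal data, exactly as in the example) uses the Puromycin observations listed later in Table 2.1 and should give approximately the estimates quoted above:

    import numpy as np

    # Puromycin data (substrate concentration, velocity) as in Table 2.1.
    x = np.array([0.02, 0.02, 0.06, 0.06, 0.11, 0.11,
                  0.22, 0.22, 0.56, 0.56, 1.10, 1.10])
    y = np.array([76, 47, 97, 107, 123, 139,
                  159, 152, 191, 201, 207, 200], dtype=float)

    # Linear regression of 1/velocity on 1/concentration (model 2.5).
    u = 1.0 / x
    b2, b1 = np.polyfit(u, 1.0 / y, 1)       # slope, intercept
    theta1 = 1.0 / b1                        # asymptote
    theta2 = b2 / b1                         # half-velocity concentration
    print(b1, b2, theta1, theta2)

Such estimates are useful as starting values for nonlinear least squares, even though, as noted above, they are not a good final fit.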

This example demonstrates two important features. First, it emphasizes the value of replications, because without replications it may not be possible to detect either the constant variance in the original data or the nonconstant


variance in the transformed data; and second, it shows that while transforming can produce simple linear behavior, it also affects the disturbances.

    2.1.2 Conditionally Linear Parameters

The Michaelis-Menten model is also an example of a model in which there is a conditionally linear parameter, θ1. It is conditionally linear because the derivative of the expectation function with respect to θ1 does not involve θ1. We can therefore estimate θ1, conditional on θ2, by a linear regression of velocity on x/(θ2 + x). Models with conditionally linear parameters enjoy some advantageous properties, which can be exploited in nonlinear regression.
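For instance, the conditional estimate can be computed by a no-intercept linear regression, as in the following sketch (Python; the arrays are replicate averages derived from Table 2.1, and the trial value of θ2 is arbitrary):

    import numpy as np

    def theta1_given_theta2(x, y, theta2):
        """Least squares estimate of theta1 with theta2 held fixed:
        regress velocity on x / (theta2 + x) through the origin."""
        v = x / (theta2 + x)
        return float(v @ y / (v @ v))

    x = np.array([0.02, 0.06, 0.11, 0.22, 0.56, 1.10])
    y = np.array([61.5, 102.0, 131.0, 155.5, 196.0, 203.5])  # replicate averages
    print(theta1_given_theta2(x, y, 0.08))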

    2.1.3 The Geometry of the Expectation Surface

The assumption of a spherical normal distribution for the disturbance term Z leads us to consider the Euclidean geometry of the N-dimensional response space, because again we will be interested in the least squares estimates of the parameters. The N-vectors η(θ) define a P-dimensional surface called the expectation surface in the response space, and the least squares estimates correspond to the point on the expectation surface,

    η̂ = η(θ̂)

which is closest to y. That is, θ̂ minimizes the residual sum of squares

    S(θ) = ‖y − η(θ)‖²

    Example:

To illustrate the geometry of nonlinear models, consider the two cases t = 4 and t = 41 for the Rumford data. Under the assumption that Newton's law of cooling holds for these data, the expected responses are

    η(θ) = [ 60 + 70e^(−4θ)
             60 + 70e^(−41θ) ],   θ ≥ 0

Substituting values for θ in these equations and plotting the points in a 2-dimensional response space gives the 1-dimensional expectation surface (curve) shown in Figure 2.3.

Note that the expectation surface is curved and of finite extent, which is in contrast to the linear model in which the expectation surface is a plane of infinite extent. Note, too, that points with equal spacing on the parameter line θ map to points with unequal spacing on the expectation surface.
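The unequal spacing is easy to verify numerically. A brief sketch (Python; the θ values are simply a few of those marked in Figure 2.3) evaluates η(θ) for the 2-case Rumford data:

    import numpy as np

    def eta(theta):
        # Expected responses for the 2-case Rumford data (t = 4 and t = 41).
        return np.array([60 + 70 * np.exp(-4 * theta),
                         60 + 70 * np.exp(-41 * theta)])

    # Equally spaced parameter values map to unequally spaced points
    # on the expectation surface (curve).
    for theta in (0.0, 0.05, 0.10, 0.5, 1.0):
        print(theta, eta(theta).round(2))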

Example:

As another example, consider the three cases from Example Puromycin 2.1: x = 1.10, x = 0.56, and x = 0.22. Under the assumption that the


Fig. 2.3 Plot of the expectation surface (solid line) in the response space for the 2-case Rumford data. The points corresponding to θ = 0, 0.01, 0.02, . . . , 0.1, 0.2, . . . , 1, are marked.

expectation function (2.3) is the correct one, the expected responses for these substrate values are

    η(θ) = [ θ1(1.10)/(θ2 + 1.10)
             θ1(0.56)/(θ2 + 0.56)
             θ1(0.22)/(θ2 + 0.22) ],   θ1, θ2 ≥ 0

and so we can plot the expectation surface by substituting values for θ in these equations. A portion of the 2-dimensional expectation surface for these x values is shown in Figure 2.4. Again, in contrast to the linear model, this expectation surface is not an infinite plane, and in general, straight lines in the parameter plane do not map to straight lines on the expectation surface. It is also seen that unit squares in the parameter plane map to irregularly shaped areas on the expectation surface and that the sizes of these areas vary. Thus, the Jacobian determinant is not constant, which can be seen analytically, of course, because the derivatives (2.4) depend on θ.

For this model, there are straight lines on the expectation surface in Figure 2.4 corresponding to the θ1 parameter lines (lines with θ2 held constant), reflecting the fact that θ1 is conditionally linear. However, the θ1 parameter lines are neither parallel nor equispaced. The θ2 lines are not straight, parallel, or equispaced.


Fig. 2.4 Expectation surface for the 3-case Puromycin example. We show a portion of the expectation surface (shaded) in the expectation space with θ1 parameter lines (dashed) and θ2 parameter lines (dot-dashed).


As can be seen from these examples, for nonlinear models with P parameters, it is generally true that:

1. the expectation surface, η(θ), is a P-dimensional curved surface in the N-dimensional response space;

2. parameter lines in the parameter space map to curves on the curved expectation surface;

3. the Jacobian determinant, which measures how large unit areas in θ become in η(θ), is not constant.

We explore these interesting and important aspects of the expectation surface later, but first we discuss how to obtain the least squares estimates for the parameters θ. Nonlinear least squares estimation from the point of view of sum of squares contours is given in Section 2.4.

    2.2 DETERMINING THE LEAST SQUARES ESTIMATES

The problem of finding the least squares estimates can be stated very simply geometrically: given a data vector y, an expectation function f(xn, θ), and a set of design vectors xn, n = 1, . . . , N,

(1) find the point η̂ on the expectation surface which is closest to y, and then

(2) determine the parameter vector θ̂ which corresponds to the point η̂.

For a linear model, step (1) is straightforward because the expectation surface is a plane of infinite extent, and we may write down an explicit expression for the point on that plane which is closest to y,

    face is a plane of infinite extent, and we may write down an explicit expressionfor the point on that plane which is closest to y,

    = Q1QT1 y

    For a linear model, step (2) is also straightforward because the P-dimensionalparameter plane maps linearly and invertibly to the expectation plane, soonce we know where we are on one plane we can easily find the correspondingpoint on the other. Thus

    = R11 QT1

    In the nonlinear case, however, the two steps are very difficult: the firstbecause the expectation surface is curved and often of finite extent (or, atleast, has edges) so that it is difficult even to find , and the second becausewe can map points easily only in one directionfrom the parameter plane tothe expectation surface. That is, even if we know , it is extremely difficultto determine the parameter plane coordinates corresponding to that point.To overcome these difficulties, we use iterative methods to determine the leastsquares estimates.

  • 7/29/2019 Bates, Nueva Version

    53/102

    DETERMINING THE LEAST SQUARES ESTIMATES 41

    2.2.1 The GaussNewton Method 2 2 1

    An approach suggested by Gauss is to use a linear approximation to theexpectation function to iteratively improve an initial guess 0 for and keepimproving the estimates until there is no change. That is, we expand theexpectation function f(xn, ) in a first order Taylor series about

    0 as

    f(xn, ) f(xn,0) + vn1(1 01) + vn2(2 02) + . . . + vnP(P 0P)where

    vnp =f(xn, )

    p

    0

    p = 1, 2, . . . , P

    Incorporating all N cases, we write

    () (0) + V0( 0) (2.6)where V0 is the NP derivative matrix with elements vnp. This is equivalentto approximating the residuals, z() = y (), by

    z() y [(0) + V0] = z0 V0 (2.7)where z0 = y (0) and = 0.

    We then calculate the Gauss increment 0 to minimize the approximateresidual sum of squares z0 V02, using

    V0 = QR= Q1R1[cf.(1.19)]

    w1 = QT1 z

    0[cf.(1.21)]

    1 = Q1w1[cf.(1.23)]

    and soR

    10 = w

    1[cf.(1.24)]

    The point1 = (1) = (0 + 0)

    should now be closer to y than (0), and so we move to this better parametervalue 1 = 0 +0 and perform another iteration by calculating new residualsz1 = y (1), a new derivative matrix V1, and a new increment. Thisprocess is repeated until convergence is obtained, that is, until the incrementis so small that there is no useful change in the elements of the parametervector.Example:

    To illustrate these calculations, consider the data from Example Puromycin2.1, with the starting estimates 0 = (205, 0.08)T. The data, along with

    the fitted values, residuals, and derivatives evaluated at0

    , are shownin Table 2.1.

    Collecting these derivatives into the derivative matrix V0, we then per-form a QR decomposition, from which we generate w1 = Q

    T1 z

    0 and

  • 7/29/2019 Bates, Nueva Version

    54/102

    42 NONLINEAR REGRESSION

    Table 2.1 Residuals and derivatives for Puromycin data at = (205, 0.08)T.

    n xn yn 0n z0n v

    0n1 v

    0n2

    1 0.02 76 41.00 35.00 0.2000 410.002 0.02 47 41.00 6.00 0.2000 410.003 0.06 97 87.86 9.14 0.4286 627.554 0.06 107 87.86 19.14 0.4286 627.555 0.11 123 118.68 4.32 0.5789 624.656 0.11 139 118.68 20.32 0.5789 624.657 0.22 159 150.33 8.67 0.7333 501.118 0.22 152 150.33 1.67 0.7333 501.119 0.56 191 179.38 11.62 0.8750 280.2710 0.56 201 179.38 21.62 0.8750 280.2711 1.10 207 191.10 15.90 0.9322 161.9512 1.10 200 191.10 8.90 0.9322 161.95

    then solve for 0 using R10 = w1. In this case,

    0 = (8.03,0.017)Tand the sum of squares at 1 = 0 +0 is S(1) = 1206, which is muchsmaller than S(0) = 3155. We therefore move to 1 = (213.03, 0.063)T

    and perform another iteration. Example:

    As a second example, we consider data on biochemical oxygen demand(BOD) from Marske (1967), reproduced in Appendix A.4. The data areplotted in Figure 2.5. For these data, the model

    f(x,) = 1(1 e2x) (2.8)

    is considered appropriate.Using the starting estimates 0 = (20, 0.24)T, for which S(0) = 128.2,produces an increment to 1 = (13.61, 0.52)T with S(1) = 145.2. Inthis case, the sum of squares has increased and so we must modify theincrement as discussed below.

    Step Factor

    As seen in the last example, the GaussNewton increment can produce anincrease in the sum of squares when the requested increment extends beyondthe region where the linear approximation is valid. Even in these circum-stances, however, the linear approximation will be a close approximation tothe actual surface for a sufficiently small region around (0). Thus a small

    step in the direction 0

    should produce a decrease in the sum of squares. Wetherefore introduce a step factor , and calculate

    1 = 0 + 0

  • 7/29/2019 Bates, Nueva Version

    55/102

    DETERMINING THE LEAST SQUARES ESTIMATES 43

    Time (days)

    BOD

    (mg/l)

    0 2 4 6 8

    0

    5

    10

    15

    20

    Fig. 2.5 Plot of BOD versus time

    where is chosen to ensure that

    S(1) < S(0) (2.9)

    A common method of selecting is to start with = 1 and halve it until (2.9)is satisfied. This modification to the GaussNewton algorithm was suggestedin Box (1960)Hartley (1961)Example:

    For the data and starting estimates in Example BOD 2.2.1, the value = 0.5 gave a reduced sum of squares, 94.2, at = (16.80, 0.38)T.

    Pseudocode for the GaussNewton algorithm for nonlinear least squares isgiven in Appendix 3, Section A3.1, together with implementations in GAUSS,S, and SAS/IML.

    2.2.2 The Geometry of Nonlinear Least Squares

    Geometrically a GaussNewton iteration consists of:

    1. approximating () by a Taylor series expansion at 0 = (0),

    2. generating the residual vector z0 = y 0,3. projecting the residual z0 onto the tangent plane to give 1,

  • 7/29/2019 Bates, Nueva Version

    56/102

    44 NONLINEAR REGRESSION

    1

    2

    90 100 110 120 130 140 150

    60

    70

    80

    90

    100

    110

    0.01

    0.05

    0.10.15

    y

    0.05

    0

    0.05

    1^

    Fig. 2.6 A geometric interpretation of calculation of the GaussNewton incrementusing the 2-case Rumford data. A portion of the expectation surface (heavy solid line)is shown in the response space together with the observed response y. Also shown isthe projection 1 ofy (0.05) onto the tangent plane at (0.05) (solid line). Thetick marks indicate true positions on the expectation surface and linear approximationpositions on the tangent plane.

    4. mapping 1 through the linear coordinate system to produce the incre-ment 0, and finally

    5. moving to (0 + 0).

    The first step actually involves two distinct approximations:

    1. the planar assumption, in which we approximate the expectation surface() near (0) by its tangent plane at (0), and

    2. the uniform coordinate assumption, in which we impose a linear coor-dinate system V( 0) on the approximating tangent plane.

    We give geometrical interpretations of these steps and assumptions in thefollowing examples.Example:

    For the 2-case Rumford data set of Example Rumford 2.1.3, we ploty and a portion of the expectation surface in Figure 2.6. The ex-pectation surface is a curved line, and the points corresponding to = 0.01, 0.02, . . . , 0.2 are unevenly spaced.

  • 7/29/2019 Bates, Nueva Version

    57/102

    DETERMINING THE LEAST SQUARES ESTIMATES 45

    For the initial estimate 0 = 0.05, a GaussNewton iteration involvesthe linear approximation

    ()

    0 + v

    where = ( 0.05), 0 is the expectation vector at = 0.05,60 + 70e4

    60 + 70e41

    =

    117.3169.01

    and v is the derivative vector at = 0.05,

    v = 70(4)e470(41)e41

    = 229.25369.47

    The Taylor series approximation, consisting of the tangent plane andthe linear coordinate system, is shown as a solid line in Figure 2.6. Thisreplaces the curved expectation surface with the nonlinear parametercoordinates by a linear surface with a uniform coordinate system on it.

    Next we use linear least squares to obtain the point 1 on the tangentline which is closest to y. We then calculate the apparent parameterincrement 0 corresponding to 1 and from this obtain 1 = 0 + 0.For this example,

    z0 =

    126110

    117.3169.01

    =

    8.6940.99

    so 1 = (138.1, 102.5)T, 0 = 0.091, and 1 = 0.05 0.091 = 0.041.It is clear that the linear approximation increment is too large, since1 = 0.041, whereas we can see from the points on the expectationsurface that is near 0.01. We must therefore use a step factor to reducethe increment before proceeding.

    Example:

    For a two parameter example, we consider the data and the startingvalues from Example Puromycin 2.2.1. Since the response space is 12-dimensional, we cannot picture it directly, but we can represent thesalient features in the 3-dimensional space spanned by the tangent planeand the residual vector. We do this in Figure 2.7, where we show aportion of the curved expectation surface, the residual vector, and theapproximating tangent plane. It can be seen that the expectation sur-face is only slightly curved, and so is well approximated by the tangentplane.

    In Figure 2.8a we show the parameter curves for 1 = 200, 210, 220, 230and 2 = 0.06, 0.07, . . . , 0.1 projected onto the tangent plane, and inFigure 2.8b the corresponding linear approximation lines on the tangentplane. It can be seen that the linear approximation lines match the

    true parameter curves very well. Also shown on the tangent planes arethe points 0 and 1 and in Figure 2.8a the projection of the curve(0 + 0) for 0 1. The points corresponding to = 0.25, 0.5,and 1 (1) are marked.

  • 7/29/2019 Bates, Nueva Version

    58/102

    46 NONLINEAR REGRESSION

    1^

    0

    y

    Fig. 2.7 A geometric interpretation of calculation of the GaussNewton incrementusing the full Puromycin data set. We show the projection of a portion of the expec-tation surface into the subspace spanned by the tangent plane at 0 (shaded) and theresidual vector y0. The region on the expectation surface is bordered by the heavysolid lines. Also shown is the projection 1 of the residual vector onto the tangentplane.

  • 7/29/2019 Bates, Nueva Version

    59/102

    DETERMINING THE LEAST SQUARES ESTIMATES 47

    q1

    q2

    0 10 20 30 40

    20

    10

    0

    10

    20

    (a)

    1^

    0

    1

    q1

    q2

    0 10 20 30 40

    20

    10

    0

    10

    20

    (b)

    1^

    0

    Fig. 2.8 A geometric interpretation of calculation of the GaussNewton incrementusing the full Puromycin data set (continued). The points 0 and 1 are shownin the tangent planes together with the parameter curves in part a and the linear

    approximation parameter lines in part b. In part a we also show the projection 1

    ofthe point (0 +0). The curve (heavy solid line) joining 0 to 1 is the projection of(0 + 0) for 0 1. The points corresponding to = 0.25 and 0.5 are marked.

  • 7/29/2019 Bates, Nueva Version

    60/102

    48 NONLINEAR REGRESSION

    0

    y

    1^

    Fig. 2.9 A geometric interpretation of calculation of the GaussNewton incrementusing the BOD data set. We show the projection of a portion of the expectation surfaceinto the subspace spanned by the tangent plane at 0 (shaded) and the residual vectory 0. The region on the expectation surface is bordered by the heavy solid lines.Also shown is the projection 1 of the residual vector onto the tangent plane.

    Because the planar and uniform coordinate assumptions are both valid,the points 1 and 1 are close together and are much closer to y than0. In this case, a full step ( = 1) can be taken resulting in a decreasein the sum of squares as shown in Example Puromycin 2.2.1.

    Example:

    As a second two-parameter example, we consider the data and startingvalues from Example BOD 2.2.1. In Figure 2.9 we show a portion of thecurved expectation surface, the residual vector, and the approximatingtangent plane in the space spanned by the tangent plane and the residualvector. It can be seen that the expectation surface is moderately curved,but is still apparently well approximated by the tangent plane. In thisexample, the edge of the finite expectation surface is shown as the angledsolid line along the top edge of the surface.

    In Figure 2.10a we show the parameter curves for 1 = 20, 30, . . . and2 = 0.2, 0.4, . . . projected onto the tangent plane. In Figure 2.10b

    we show the corresponding linear approximation lines on the tangentplane. In this case, the linear approximation lines do not match thetrue parameter curves well at all. Also shown on the tangent planes arethe points 0 and 1, and in Figure 2