
Part II: Linear methods for regression and classification

Reading

• HTF: Chapters 3 and 4

• RWC: Chapters 2.1-5, and 10.1-6


Linear methods

• From the previous example, we see that any given problem can be addressed using a whole spectrum of methods

• Define the spectrum of methods in terms of
  - structure
  - complexity
  - parametric assumptions

• How will they differ?
  - numerically the answers will differ; the predictions themselves
  - ability to approximate the true underlying mechanism; bias
  - performance under repeated sampling; variability/efficiency
  - generalizability; predictive performance in an independent test set


• Linear methods exist somewhere close to the end of the spectrum which imposes a fair amount of structure
  - many approaches we'll discuss are direct generalizations of these methods

• Methods for continuous outcome data are generally referred to as regression

• Methods for categorical outcome data are generally referred to as classification


Continuous outcomes: regression

Basic set-up

• Assume we have a vector of inputs, X = (X_1, ..., X_p), and would like to predict Y

• The linear regression model has the form

    Y = X^T β + ε

  - β = (β_1, ..., β_p)
  - E[ε] = 0
  - V[ε] = σ^2

• Typically, assume the model has an intercept


Interpretation

• Consider the interpretation of the jth element of β

• Obtained by isolating the coefficient in the regression model
  - let X_{-j} denote the X-vector with the jth element removed
  - compare two levels of X_j, holding X_{-j} constant at, say, x_{-j}

    β_j = E[Y | X_j = x + 1, X_{-j} = x_{-j}] − E[Y | X_j = x, X_{-j} = x_{-j}]

• Interpretation is, therefore,

    Holding all other components of the model constant, β_j is the difference in the conditional expectation of Y comparing two populations whose X_j values differ by one unit

  - the comparison, or contrast, is independent of the baseline value x of X_j


Estimation

• Observe a training data set consisting of n independent observations (y_i, x_i), i = 1, ..., n

• Without any further assumptions, estimation typically proceeds by least squares
  - estimates of β obtained by minimization of the residual sum of squares

      RSS(β) = (y − Xβ)^T (y − Xβ)

  - y is an n × 1 column vector with entries y_i
  - X is an n × p matrix with rows x_i^T

• As long as X^T X is non-singular, or equivalently X is of full rank, the solution is given as

    β̂ = (X^T X)^{-1} X^T y
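• As a minimal sketch in R (simulated data, not the King county dataset used later), the closed-form solution can be checked against lm():

set.seed(572)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))   # design matrix, including an intercept
beta <- c(2, 1, -1, 0.5)
y <- X %*% beta + rnorm(n)

betaHat <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^{-1} X'y
fitLM <- lm(y ~ X - 1)                      # the same fit via lm()
cbind(betaHat, coef(fitLM))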


The ‘hat’ matrix

• The predicted/fitted values at the training sample points are given by

    ŷ = Xβ̂ = X(X^T X)^{-1} X^T y = Hy

  - H is called the 'hat' matrix since it converts y into ŷ

• Explicitly, the ith fitted value can easily be shown to equal

    ŷ_i = H_{i1} y_1 + ... + H_{ii} y_i + ... + H_{in} y_n

  - H_{ij} represents the 'influence' of the jth observation on ŷ_i

• More generally, H plays a very important role in regression diagnostics
  - studentized residuals, leverage, Cook's distance


Degrees of freedom

• Intuitively, 'degrees of freedom' refers to the number of estimated parameters, p

• Consider the trace of the hat matrix

    tr(H) = tr{X(X^T X)^{-1} X^T} = tr{X^T X (X^T X)^{-1}} = tr(I_p) = p

• Hence, for the linear model the trace of the hat matrix coincides with the number of parameters

• The calculation can be applied to any linear estimator

• Can use this as a general-purpose definition for the degrees of freedom
  - useful in more complex settings where the number of parameters may not be obvious


• Remember the form of the k-nearest-neighbor prediction

    ŷ_i = (1/k) Σ_{x_j ∈ N_k(x_i)} y_j = Σ_{j=1}^n π_j y_j

  - where π_j = 1/k if x_j ∈ N_k(x_i), and 0 otherwise

• So the k-nearest-neighbor estimate is linear and can be written as

    ŷ = Ly

  - L is an n × n matrix
  - each row consists of 0's and 1/k in the positions for which the x's are among the k nearest neighbors
  - tr(L) = n/k, the effective degrees of freedom


Gauss-Markov Theorem

• Suppose the random variables X and Y satisfy the following conditions:

    Y = X^T β + ε,   E[ε] = 0,   V[ε] = σ^2

  Then the least squares estimator of β has the smallest variance among all linear unbiased estimators
  - linear in the sense that they are linear functions of the outcome vector, y

• Consider the prediction of y_0 = x_0^T β


• The least squares estimator of y_0 is of the form

    ŷ_0 = x_0^T β̂ = x_0^T (X^T X)^{-1} X^T y

  - this is a linear function of the outcome vector: c_0^T y

• Assuming the linear model is correct, then ŷ_0 is unbiased since

    E[ŷ_0] = E[x_0^T (X^T X)^{-1} X^T y] = x_0^T (X^T X)^{-1} X^T Xβ = x_0^T β

• The Gauss-Markov theorem states that for any other linear estimator, θ̃ = c^T y, that is unbiased for x_0^T β,

    V[x_0^T β̂] ≤ V[c^T y]


• Another way of looking at the theorem is in terms of MSE

    MSE = Variance + Bias^2

• The Gauss-Markov theorem tells us that the least squares estimator attains minimum MSE among those linear estimators with no bias

• We have already seen that MSE is intimately related to prediction
  - under L2 loss for the linear model, MSE arises as a component of the expected predictive error (EPE)

• If MSE is a criterion that you are interested in, the restriction to the class of unbiased linear estimators may not be wise
  - in optimizing MSE, we will need to seek a balance between bias and variance
  - in particular, we may be willing to trade a little bias for gains in terms of variance


Model selection and coefficient shrinkage

• In many prediction settings there are a large number of inputs, X

• While it may be the case that f(X) = X^T β appropriately describes the underlying mechanisms, it is always the case that we have a finite training sample size, n

• In such settings, least squares estimation may not be satisfying for a couple of reasons

• Prediction accuracy:
  - least squares estimates may have low bias, but in 'small'-sample settings can exhibit large variability
  - we could sacrifice a little bias to reduce variation and achieve better overall predictive accuracy


• Interpretation:
  - with a large number of predictors, it may be hard to conceptualize 'holding everything else constant'
  - it may be desirable to restrict attention to a smaller subset of variables which exhibit the strongest effects


King County Birth data

• As an example, let's consider the King county birth dataset, which contains information on n = 2,500 births from 2001

• The key outcome variable of interest is birth weight
  - birth weight ranges from 255g to 5,175g
  - 5.1% of babies (127) were born at low birth weight (< 2,500g)

• A total of 16 potential predictor variables are available for investigation


[Figure 1: King county 2001 birth weight data; birth weight in grams, 0 to 5,000]


[Figure 2: King county 2001 birth weight data, continued; birth weight (grams) plotted against mother's age (years), cigarettes smoked per day, alcoholic drinks per week, and highest grade completed]


• Complete variable listing

"gender" M = male, F = female baby

"plural" 1 = singleton, 2 = twin, 3 = triplet

"age" mother’s age in years

"race" race categories (for mother)

"parity" number of previous live born infants

"married" Y = yes, N = no

"bwt" birth weight in grams

"smokeN" number of cigarettes smoked per day during pregnancy

"drinkN" number of alcoholic drinks per week during pregnancy

"firstep" 1 = participant in program; 0 = did not participate

"welfare" 1 = participant in public assistance program; 0 = did not

"smoker" Y = yes, N = no, U = unknown

"drinker" Y = yes, N = no, U = unknown

"wpre" mother’s weight in pounds prior to pregnancy

"wgain" mother’s weight gain in pounds during pregnancy

"education" highest grade completed (add 12 + 1 / year of college)

"gestation" weeks from last menses to birth of child


• Rather than attempting to fit and report a model which includes all the potential predictors, we can consider two strategies
  - subset selection
  - shrinkage methods

Subset selection

• Here we retain only a subset of variables
  - the remaining variables essentially have their β coefficients set to zero

• Various strategies exist for 'choosing' the variables to keep (or throw out)
  - best subset selection
  - stepwise strategies


Best subsets regression

• Suppose X consists of p components: X_1, ..., X_p

• For each k ∈ {1, ..., p}, find the subset of k variables which results in the smallest residual sum of squares
  - other criteria include Mallows' Cp, R^2 and adjusted R^2

• Can quickly become computationally intensive when p gets large

• In R, best subsets selection is implemented in the leaps package

##
library(leaps)
weight <- read.table("birthweight.txt")
names(weight) <- c("gender", "plural", "age", "race", "parity", "married",
                   "bwt", "smokeN", "drinkN", "firststep", "welfare",
                   "smoker", "drinker", "wpre", "wgain", "education",
                   "gestation")


## Model with only the intercept
##
fit0 <- lm(bwt ~ 1, data=weight)

## Perform best subsets analysis
##
## maxModel: a model which includes all the variables you wish to entertain
## nvmax:    the maximum number of variables for the subset selection
## nbest:    the number of best models to be returned for any given k
##
maxModel <- as.formula(bwt ~ gender + age + race + parity + married
                           + smokeN + drinkN + firststep + welfare
                           + smoker + drinker + wpre + wgain
                           + education + gestation)
bestSub <- summary(regsubsets(maxModel, data=weight, nvmax=12, nbest=10))
##
names(bestSub)
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"


## 'results' contains the subset size, k, and the residual sum of squares
##
results <- c(0, sum((weight$bwt - fitted(fit0))^2))
results <- rbind(results,
                 cbind(apply(bestSub$which, 1, sum) - 1, bestSub$rss))
##
minRSS <- tapply(results[,2], results[,1], FUN=min)

[Figure: residual sum of squares versus subset size, k = 0, ..., 12, for the best-subsets fits]

• The best-subset curve is necessarily decreasing, so it cannot be used as a criterion for choosing k

• Typically choose a model which minimizes an estimate of the EPE
  - Mallows' Cp, AIC, BIC, cross-validation

Stepwise procedures

• Instead of performing an exhaustive enumeration for each value of k, we can search for a 'good path'

• Forward selection
  - start with an 'intercept-only' model and build up the model

• Backward selection
  - start with a 'full' model and reduce the model


Shrinkage methods

• Rather than retaining some variables and discarding the rest, an alternative is to keep all the variables but impose restrictions on the size of the coefficients
  - the point estimates for β are subject to bias
  - results often don't suffer as much in terms of variability

• Ridge regression imposes an L2-type penalty
  - the solution is given by

      β̂_ridge = argmin_β RSS(β)   subject to the constraint   Σ_{j=1}^p β_j^2 ≤ s

  - the value of s influences how large the components of β can get


• An alternative way of writing the problem is

    β̂_ridge = argmin_β { RSS(β) + λ Σ_{j=1}^p β_j^2 }

• Here, λ ≥ 0 controls the amount of shrinkage
  - when λ = 0, we are performing ordinary least squares estimation
  - for large λ, minimizing the penalized RSS requires the components of β to be small
  - there is a one-to-one relationship between λ and s

• Minimization yields the solution

    β̂_ridge = (X^T X + λI)^{-1} X^T y

• Could allow λ to be a vector, and ensure no shrinkage among certain coefficients


• The solution is linear, and we can therefore obtain the effective degrees of freedom as

    df(λ) = tr{X(X^T X + λI)^{-1} X^T}

  - depends on the complexity/smoothing parameter λ

• Ridge regression for the linear model is implemented in R

##
library(MASS)
##
ridgeFit <- lm.ridge(bwt ~ ., data=weight, lambda=c(0, 100, 1000, 10000))

## Output:
##
                 0      100     1000    10000
genderM     67.247   64.496   47.026   12.611
age         19.756   19.532   17.125    7.662
raceblack  -16.062  -16.175  -14.413   -5.620
...

[Figure: ridge regression coefficient estimates for genderM, age, smokerY and drinkerY plotted against the smoothing parameter λ, 0 to 25,000]


• The relationship between λ and df(λ) is non-linear

• For these data, there is a dramatic reduction in the 'complexity' of the model up to about λ = 1,000

[Figure: smoothing parameter λ, 0 to 25,000, versus effective degrees of freedom, df(λ)]

[Figure: ridge regression coefficient estimates for genderM, age, smokerY and drinkerY plotted against the effective degrees of freedom, df(λ)]


• Ridge regression can also be derived from a Bayesian perspective

• Suppose

    Y_i | X_i = x_i ∼ Normal(x_i^T β, σ^2)

  and that we adopt independent Normal priors on the components of β

    β_j ∼ Normal(0, τ^2),   j = 1, ..., p

• The marginal posterior for β is given by

    π(β | y, X, σ) ∝ Likelihood × Prior
                   ∝ Π_{i=1}^n exp{ −(y_i − x_i^T β)^2 / 2σ^2 } × Π_{j=1}^p exp{ −β_j^2 / 2τ^2 }
                   = exp{ −(1/2σ^2) Σ_{i=1}^n (y_i − x_i^T β)^2 − (1/2τ^2) Σ_{j=1}^p β_j^2 }


• Taking the log and multiplying by −2σ^2, we find that the negative log-posterior density for β is equivalent to the objective function in ridge regression:

    Σ_{i=1}^n (y_i − x_i^T β)^2 + λ Σ_{j=1}^p β_j^2,   with λ = σ^2/τ^2

• Hence β̂_ridge corresponds to the posterior mode in the above set-up
  - it is also the posterior mean, since we are assuming Normal distributions everywhere


• An alternative to ridge regression is the lasso, which imposes an L1-type penalty and can be written as

    β̂_lasso = argmin_β RSS(β)   subject to the constraint   Σ_{j=1}^p |β_j| ≤ t

  - the alternative constraint makes the solution nonlinear in y
  - requires a quadratic programming solution

• Several R libraries implement the lasso
  - lasso2


Model complexity

• In each of the approaches we have seen, there is an additional parameter which controls the 'complexity' of the model
  - best subset selection: k
  - ridge regression: s or λ
  - the lasso: t

• The choice can have a dramatic impact on the results

• Question: How can we allow the specific value to be determined on the basis of the data?

• The results presented here are based on 10-fold cross-validation
  - we'll go over this in more detail down the road
  - briefly explained here in terms of ridge regression


10-fold Cross-validation

• Initially, split the full data into two parts
  - training data: n = 1,900
  - test data: n = 600

• Further split the training data into 10 equal parts
  - train(1)
  - ...
  - train(10)

• For each l = 1, ..., 10

  i. fit the ridge regression model to nine-tenths of the training data

       f̂_ridge^(−l)(train^(−l); λ)


  ii. evaluate the prediction error in the remaining one-tenth
      - here we are using L2 loss

        L(train^(l); λ) = Σ_{i=1}^{190} { y_i^(l) − f̂_ridge^(−l)(x_i^(l); λ) }^2

• Calculate the average prediction error across the 10 sets of results

    CV(λ) = (1/10) Σ_{l=1}^{10} L(train^(l); λ)

  - an estimate of the EPE based on the training data and for a given value of λ

• Repeat this for every value of λ
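• A minimal sketch of the 10-fold calculation for ridge regression, using the simulated (X, y) from the ridge sketch above:

cvRidge <- function(lambda) {
    folds <- sample(rep(1:10, length.out=nrow(X)))
    cvErr <- 0
    for (l in 1:10) {
        Xtr <- X[folds != l, ]; ytr <- y[folds != l]
        betaL <- solve(t(Xtr) %*% Xtr + lambda * diag(ncol(X)), t(Xtr) %*% ytr)
        pred <- X[folds == l, , drop=FALSE] %*% betaL
        cvErr <- cvErr + sum((y[folds == l] - pred)^2)
    }
    cvErr / 10
}
sapply(c(0, 1, 10, 100), cvRidge)   # pick the lambda with the smallest CV error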


• Plots of CV(k) and CV(df(λ)) for best subsets and ridge regression

[Figure: left panel, CV(k) versus subset size k, 1 to 16, for best subsets; right panel, CV(df(λ)) versus effective degrees of freedom, 12 to 19, for ridge regression; the chosen values are marked with an X]


• Best subsets regression:
  - let k* denote the value of k which minimizes CV(k)
  - obtain the best subset regression fixing the number of variables at k*

• Ridge regression:
  - let λ* denote the value of λ which minimizes CV(λ), or CV(df(λ))
  - fit a ridge regression to the full training data, with λ*

• The estimated test error in each case is the error obtained by applying the resulting fits to the test data
  - the n = 600 observations set aside earlier


• Test error results

  Method             Complexity parameter   Training error   Test error
  Least squares                             3610956          1521808
  Best subset        k* = 11                4037468          1038386
  Ridge regression   df(λ*) = 13.44         3835416          1468410

• Note, in absolute terms, the test error values don't mean much
  - the scale is dependent on the choice of the loss function, as well as the data
  - relative differences are meaningful though


Categorical outcomes: classification

• Here we consider problems where the outcome of interest is categorical

• For example, in the King county birth weight data scientific focus may rest on the prediction of whether or not a baby will have low birth weight
  - typically defined as weight < 2,500g
  - 127 babies out of 2,500 had low birth weight

• From a notational perspective, it is useful to define the outcome random variable G

• Denote the set of values G can take on as the discrete set 𝒢
  - a set consisting of K classes: G_1, ..., G_K


• The general approach is to establish decision boundaries which provide the means to classify observations into one of the K classes
  - the boundaries split the X-space

• Boundaries are determined by a set of discriminant functions:

    δ_k(x),   k = 1, ..., K

  - classify the observation, x, as one of the elements of 𝒢 on the basis of which δ_k(x) is largest

• Here we are going to focus on decision boundaries which are linear
  - δ_k(x) are linear in x
  - or, some monotone transformation of δ_k(x) is linear in x


Decision theory

• In Part I we saw the 0-1 loss function for a categorical outcome variable

    L(G, f(X)) = 1 if f(X) ≠ G;   0 if f(X) = G

• For a categorical variable, the loss function can be represented by a K × K matrix L
  - the L(k, l) element is the cost of classifying an observation as l, when the truth is class k
  - L is zero on the diagonal and (typically) nonnegative everywhere else
  - for a binary outcome, 0-1 loss gives

      L = [ 0  1
            1  0 ]


• The expected prediction error is again given by

    EPE = E_{X,G}[L(G, Ĝ(X))] = E_X[ Σ_{k=1}^K L(G_k, Ĝ(X)) Pr(G_k | X) ]

• Typically, we condition on X and minimize pointwise, so that the solution is given as

    Ĝ(x) = argmin_{g ∈ 𝒢} Σ_{k=1}^K L(G_k, g) Pr(G_k | X = x)

• Under 0-1 loss, this simplifies to

    Ĝ(x) = argmin_{g ∈ 𝒢} [1 − Pr(g | X = x)]

  - the solution is therefore to pick the element of 𝒢 which gives the largest Pr(G_k | X = x)
  - i.e. pick the most probable outcome


Linear regression for categorical outcomes

• In the toy example, we explored the use of linear regression as an approach for predicting a 0/1 outcome variable

• For a K-level categorical outcome, we can adapt the approach by defining a set of K indicator variables

    Y_k = 1 if G = G_k;   0 otherwise,   for k = 1, ..., K

• The training data form an n × K outcome matrix, Y
  - consisting of 1's and 0's
  - each row contains a single 1


• Fitting a linear regression model to each column of Y yields

    Ŷ = X(X^T X)^{-1} X^T Y

  - note there is a coefficient vector for each Y_k

• Each row of Ŷ consists of K fitted values

    (f̂_1(x_i), ..., f̂_K(x_i)),   i = 1, ..., n

• Assign a class by choosing the component with the largest value

    Ĝ(x_i) = argmax_{G_k ∈ 𝒢} f̂_k(x_i)

• This may be reasonable in many situations, especially since linear regression attempts to estimate the conditional expectation and

    E[Y_k | X = x] = Pr(G_k | X = x)

  - from the decision theory, linear regression's goal is a sensible one


• As long as the model contains an intercept, results from linear regression satisfy

    Σ_{k=1}^K f̂_k(x_i) = 1

• However, some of the f̂_k(x_i) can be negative or a number greater than 1
  - due to the rigid nature of linear regression

• Within the linear regression framework, we can get around this by introducing transformations of the X-space
  - quadratic terms and two-way interactions
  - more general polynomial terms
  - more general basis functions

• With finite n, the gains in terms of bias generally result in a cost in terms of variance


Linear discriminant analysis

• Decision theory told us that, under 0-1 loss, the optimal classification scheme involves the use of the class probabilities

    Pr(G_k | X = x)

• Given these quantities, the decision boundary between classes G_k and G_l can be expressed in terms of values in the X-space for which the class probabilities are equal

    {x ∈ X : Pr(G = G_k | X = x) = Pr(G = G_l | X = x)}

  or

    {x ∈ X : Pr(G = G_k | X = x) / Pr(G = G_l | X = x) = 1}


• Let f_k(x) denote the class-conditional density of the p-vector X in class G = G_k

• Further, let π_k denote the prior probability of class k, with Σ_k π_k = 1

• Bayes theorem then gives us

    Pr(G = G_k | X = x) = f_k(x) π_k / Σ_{l=1}^K f_l(x) π_l

• In terms of 'choosing' between classes, the denominator is constant and we can see that the f_k(x) provide almost as much information as the Pr(G = G_k | X = x)


Linear Discriminant Analysis

• Suppose we assume the class densities to be multivariate Normal, with a common variance-covariance matrix, Σ

    f_k(x) = (2π)^{-p/2} |Σ|^{-1/2} exp{ −(1/2)(x − µ_k)^T Σ^{-1} (x − µ_k) }

• Given this form, and taking logs, the decision boundary between classes G_k and G_l reduces to

    log{ Pr(G = G_k | X = x) / Pr(G = G_l | X = x) }
        = x^T Σ^{-1}(µ_k − µ_l) − (1/2)(µ_k + µ_l)^T Σ^{-1}(µ_k − µ_l) + log{π_k/π_l}


• Consequently, the log-odds function is linear in x
  - this holds true for any (k, l) pair
  - the decision boundary solutions are therefore linear

• Working somewhat backwards, we see that the linear discriminant functions

    δ_k(x) = x^T Σ^{-1} µ_k − (1/2) µ_k^T Σ^{-1} µ_k + log{π_k}

  are sufficient to determine the decision rule, with

    Ĝ(x) = argmax_{G_k ∈ 𝒢} δ_k(x)


Estimation

• In practice, we need to estimate the parameters of the multivariate Normal distributions using the training data

    π̂_k = n_k / n

    µ̂_k = (1/n_k) Σ_{g_i = G_k} x_i

    Σ̂ = (1/(n − K)) Σ_{k=1}^K Σ_{g_i = G_k} (x_i − µ̂_k)(x_i − µ̂_k)^T

  - here n_k is the number of training observations in class G_k
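• A minimal sketch (simulated two-class data) of the plug-in estimates, compared informally against lda() from MASS:

library(MASS)
set.seed(572)
n <- 200
g <- sample(1:2, n, replace=TRUE, prob=c(0.6, 0.4))
x <- cbind(rnorm(n, mean=ifelse(g == 1, 0, 2)), rnorm(n, mean=ifelse(g == 1, 0, 1)))

piHat <- table(g) / n                                        # class proportions
muHat <- rbind(colMeans(x[g == 1, ]), colMeans(x[g == 2, ]))
SigmaHat <- (cov(x[g == 1, ]) * (sum(g == 1) - 1) +
             cov(x[g == 2, ]) * (sum(g == 2) - 1)) / (n - 2) # pooled covariance

fitLDA <- lda(x, grouping=factor(g))
fitLDA$prior    # matches piHat
fitLDA$means    # matches muHat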


Quadratic Discriminant Analysis

• We can relax the assumption of a common variance-covariance matrix across the K classes

• When forming the log-odds

    log{ Pr(G = G_k | X = x) / Pr(G = G_l | X = x) }

  the convenient cancellations no longer occur

• This results in a series of quadratic discriminant functions

    δ_k(x) = −(1/2) log|Σ_k| − (1/2)(x − µ_k)^T Σ_k^{-1} (x − µ_k) + log{π_k}

  - these are no longer linear in x


• The decision boundary between classes G_k and G_l is now defined by a quadratic equation

    {x ∈ X : δ_k(x) = δ_l(x)}

Implementation

• In R, both LDA and QDA are implemented in the MASS library
  - lda()
  - qda()


Logistic regression

• In LDA, the linearity of the decision boundaries arises from the use of multivariate normal distributions (and the common variance-covariance assumption)
  - Bayes theorem ensures that the estimates of Pr(G = G_k | X = x) add up to 1.0
  - no guarantee they remain in [0,1]

• Logistic regression attempts to model the decision boundaries directly as linear functions of X
  - ensure the Pr(G = G_k | X = x) add up to 1.0
  - ensure they remain in [0,1]


• For K classes, the model has the form

    log{ Pr(G = G_1 | X = x) / Pr(G = G_K | X = x) } = x^T β_1
    log{ Pr(G = G_2 | X = x) / Pr(G = G_K | X = x) } = x^T β_2
    ...
    log{ Pr(G = G_{K−1} | X = x) / Pr(G = G_K | X = x) } = x^T β_{K−1}

  - here, class K is the referent group
  - the K − 1 sets of parameters reflect the constraint that the probabilities add up to 1.0


• A simple calculation reveals that

    Pr(G = G_k | X = x) = exp{x^T β_k} / (1 + Σ_{l=1}^{K−1} exp{x^T β_l}),   k = 1, ..., K − 1

    Pr(G = G_K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp{x^T β_l})

  - Pr(G = G_k | X = x) ∈ [0,1] for each k
  - the sum adds up to 1.0


Estimation

• Given the Pr(G = G_k | X = x), the multinomial distribution arises as a natural distribution on which to base the likelihood

• Maximum likelihood estimation proceeds via iteratively re-weighted least squares
  - Fisher's method of scoring
  - Newton-Raphson with the Hessian replaced by its expected value

• In the binary case, the details of the algorithm simplify

• If we re-code g = 1/2 to y = 0/1, and let

    µ ≡ µ(x; β) = E[Y | X = x, β] = exp{x^T β} / (1 + exp{x^T β})


• For a given starting value of β, at each step of the algorithm, update the current value of β via

    β_new ← β_old + (X^T WX)^{-1} X^T (y − µ)
          = (X^T WX)^{-1} X^T W (Xβ_old + W^{-1}(y − µ))
          = (X^T WX)^{-1} X^T Wz

  - z is referred to as the adjusted response
  - the weight matrix W is a diagonal matrix with entries

      W_ii = µ(x_i; β_old)(1 − µ(x_i; β_old)) = exp{x_i^T β_old} / [1 + exp{x_i^T β_old}]^2
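• A minimal sketch of these updates for simulated binary data, checked against glm():

set.seed(572)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))                    # design matrix with intercept
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1)))

b <- rep(0, ncol(X))                                 # starting value
for (iter in 1:25) {
    mu <- plogis(X %*% b)                            # current fitted probabilities
    W <- as.vector(mu * (1 - mu))                    # diagonal of the weight matrix
    z <- X %*% b + (y - mu) / W                      # adjusted response
    b <- solve(t(X) %*% (W * X), t(X) %*% (W * z))   # weighted least squares step
}
cbind(IRLS=drop(b), glm=coef(glm(y ~ X - 1, family=binomial)))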


The hat matrix and degrees of freedom

• In logistic regression, and GLMs in general, the estimated ŷ are nonlinear in the y

• Hence there is no matrix H such that ŷ = Hy

• We can define a hat matrix by linearization, using the following analogy with linear regression

• For the linear model, note that

    ŷ − E[y] = H{y − E[y]}

  - since HE[y] = X(X^T X)^{-1} X^T Xβ = Xβ


• For a generalized linear model, we can define the hat matrix H(β) to be the matrix such that

    µ(X; β̂) − µ(X; β) ≈ H(β){y − E[y]}

• A definition which satisfies this is to set

    H(β) = WX(X^T WX)^{-1} X^T

  - uses a Taylor approximation of the update step in the IWLS algorithm
  - remember, the W matrix depends on β

• It is relatively straightforward to show that

    tr(H(β)) = p

  so that, again, using the trace of the newly defined hat matrix serves as a reasonable definition of the degrees of freedom
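• As a quick numerical check, continuing the simulated IRLS sketch above:

fitGLM <- glm(y ~ X - 1, family=binomial)
muHat <- fitted(fitGLM)
W <- diag(muHat * (1 - muHat))
H <- W %*% X %*% solve(t(X) %*% W %*% X) %*% t(X)
sum(diag(H))    # equals ncol(X) = 3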


Implementation

• glm() for fitting binary logistic regression models

• multinom() in the nnet library for fitting multinomial logistic regression models
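• As a minimal sketch (assuming the nnet package is installed), a multinomial logistic regression for a multi-level outcome such as the mother's race might look like:

library(nnet)
fitMN <- multinom(factor(race) ~ age + education + smoker, data=weight)
head(predict(fitMN, type="probs"))   # fitted class probabilities, one column per category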

King county birth data

• Define a new outcome of 'low birth weight'
  - a binary 0/1 variable

• Compare the overall results obtained by LDA and logistic regression

• Consider using all 16 predictors
  - could apply model selection and shrinkage methods here


• To evaluate the expected prediction error, we can use cross-validation with K = N
  - leave-one-out cross-validation
  - full code is available online

##
library(MASS)

## Cycle through each observation
##
results <- matrix(NA, nrow=2500, ncol=2)
for(i in 1:2500)
{
    ## Fit a linear discriminant analysis
    ##
    fitLDA <- lda(low ~ ., data=weight[-i, -c(2,7)])
    Yhat <- predict(fitLDA, newdata=weight[i, -c(2,7)])$class
    results[i,1] <- (Yhat != weight[i, -c(2,7)]$low)


    ## Fit a logistic regression model
    ##
    fitGLM <- glm(low ~ ., data=weight[-i, -c(2,7)], family=binomial)
    Yhat <- predict(fitGLM,
                    newdata=weight[i, -c(2,7)], type="response") > 0.5
    results[i,2] <- (Yhat != weight[i, -c(2,7)]$low)
}
##
apply(results, 2, mean)


• Results based on 0-1 loss

  Method                Estimated prediction error
  LDA                   4.28%
  Logistic regression   4.24%

• LDA and logistic regression don't misclassify the same observations, though

                              Logistic regression
                              Correct   Misclassified
  LDA   Correct               2,380     13
        Misclassified         14        93


Comparison with LDA

• In the development of LDA, we saw the decision boundary between classes k and K to be of the form

    log{ Pr(G = G_k | X = x) / Pr(G = G_K | X = x) }
        = x^T Σ^{-1}(µ_k − µ_K) − (1/2)(µ_k + µ_K)^T Σ^{-1}(µ_k − µ_K) + log{π_k/π_K}
        = α_{0k} + x^T α_k

• Hence, the log-odds have the same form as that of logistic regression
  - the two approaches attempt to estimate the same quantities
  - the major difference is in how they estimate these quantities


• In LDA, by virtue of making the assumption of multivariate Normality for X within each class, we are estimating the full joint distribution of (G, X)

• We make assumptions about the conditional distribution X | G and the marginal distribution of G

    X | G = G_k ∼ MVN_p(µ_k, Σ),   k = 1, ..., K
    Pr(G = G_k) = π_k

  - use Bayes theorem to inform us on G | X
  - there is a resulting induced marginal distribution for X which depends on the log-odds parameters and hence plays a role in estimation

• If these assumptions are true, then LDA will be the best we can do
  - the LDA estimates will be 'maximum likelihood'


• In logistic regression, we only focus on the conditional distribution G | X
  - maximize a conditional likelihood based on the multinomial distribution

• No additional assumptions are made on the marginal distribution of X
  - it is actually ignored in the estimation of the model
  - think of the marginal density as being estimated in a fully nonparametric way

• In the event that the underlying densities are multivariate Normal, there will be a loss of efficiency when using logistic regression
  - typically not large, and at most 30%
  - a reasonable trade-off given the possibility of misspecification


Separating hyperplanes

• LDA and logistic regression both estimate linear decision boundaries by adopting structure on the log-odds scale
  - exploit all the data via some global structure

• An alternative class of methods attempts to directly separate observations into the various possible classes
  - focus on the points themselves
  - viewed as being more local in nature


• Consider the data in the figure below

• Included is the least squares solution to the problem
  - Y = -1/1
  - the boundary line is given by {x ∈ R^2 : x^T β = 0}

• We see two points are misclassified

• Surprising, since it's not hard to see that one could draw a line separating the two groups

[Figure: scatterplot of X2 versus X1, both −3 to 3, showing the two classes and the least squares boundary]


• We refer to such data as being separable
  - there exists at least one separating hyperplane
  - in this example there actually are infinitely many such planes
  - two are shown in the figure below with dashed lines

[Figure: the same scatterplot of X2 versus X1, with two separating hyperplanes shown as dashed lines]


Perceptron learning algorithm

• The perceptron algorithm finds a separating hyperplane by minimizing the distance between the decision boundary and misclassified points

• For Y = -1/1, and for a given β, a response is misclassified if either

    y_i = −1 and x_i^T β > 0    or    y_i = 1 and x_i^T β < 0

• Let M denote the set of misclassified observations

• The goal is to minimize

    D(β) = − Σ_{i ∈ M} y_i (x_i^T β)


• Taking derivatives, the update simply cycles through each element of M via

    β_new ← β_old + ρ y_i x_i

  - ρ is a tuning parameter
  - the algorithm uses stochastic gradient descent, visiting each observation in sequence rather than computing the sum of the derivatives (a sketch is given below)

• The form of the algorithm suggests that there is a greater local focus on points 'close' to the decision boundary
  - the algorithm updates on the basis of elements of M
  - may be more robust to model misspecification than, say, LDA or logistic regression, which impose more global structure
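• A minimal sketch of the updates on separable simulated data (y coded -1/1, with an intercept column and learning rate ρ = 1):

set.seed(572)
n <- 100
x <- matrix(rnorm(2 * n), n, 2)
y <- ifelse(x[, 1] + x[, 2] > 0, 1, -1)       # two classes
x <- x + 0.3 * y                              # push the classes apart so they are separable
X <- cbind(1, x)

beta <- rep(0, 3); rho <- 1
for (iter in 1:10000) {
    M <- which(y * (X %*% beta) <= 0)         # currently misclassified points
    if (length(M) == 0) break                 # stop once the training data are separated
    i <- M[1]
    beta <- beta + rho * y[i] * X[i, ]        # perceptron update
}
sum(y * (X %*% beta) <= 0)                    # 0 misclassified observations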


• The algorithm suffers from a series of drawbacks, however
  - when the data are separable, there are many solutions, and which one you get depends on the starting values
  - it can sometimes take a long time to converge
  - when the data are not separable there is no solution, so the algorithm will not converge and can cycle

• One solution to the first problem is to incorporate an additional constraint into the optimization problem
  - choose the boundary which maximizes the margin between the two classes in the training data

• In most situations, however, it is unlikely to be the case that the classes are perfectly separable
  - support vector machines provide a solution to this problem
  - we'll see these later
