Andrew Rosenberg - Lecture 6: Logistic Regression - CSC 84020 - Machine Learning

    Lecture 6: Logistic Regression

    CSC 84020 - Machine Learning

    Andrew Rosenberg

    February 19, 2009

    Last Time

    Regression

    Regularization and Overfitting

    Today

    Logistic Regression

    Classification

    Goal: Identify which of K classes a data point x belongs to.

    Like Regression, Classification is a supervised task.

    For each data point x_i we have a corresponding target (or label, or class) t_i that describes the correct classification of the data point.

    Goal: identify a function y : R^D → C, where t_i ∈ C = {c_0, ..., c_{K-1}}.

    Representations of the target variable

    y : R^D → C, where t_i ∈ C

    For binary (two-way) classification it is convenient to represent t_i as a single scalar variable t_i ∈ {0, 1}.

    This will allow us to interpret t_i as the likelihood that a point x_i is a member of class c_1.

    When hypothesized from a model, this can represent the confidence of the prediction.

    For K > 2 classes, we represent t as a K-element vector where, if a point is a member of class c_j, the j-th element is 1 and all the others are 0. In 5-way classification, a member of class c_2 is

    t = (0, 0, 1, 0, 0)^T

    We may also represent t as a nominal variable when using non-probabilistic models.
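
    A minimal NumPy sketch of this 1-of-K (one-hot) target encoding; the helper name one_hot and the example labels are illustrative choices of mine, not something given in the lecture.

        import numpy as np

        def one_hot(labels, K):
            """Encode integer class labels 0..K-1 as 1-of-K target vectors."""
            t = np.zeros((len(labels), K))
            t[np.arange(len(labels)), labels] = 1.0
            return t

        # A member of class c_2 in 5-way classification:
        print(one_hot([2], K=5))   # [[0. 0. 1. 0. 0.]]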

    Three approaches to Classification

    Generative Approach: highest resource requirements; we need to approximate the joint probability p(x, c_j).

    p(c_j|x) = \frac{p(x|c_j)\, p(c_j)}{p(x)}

    Discriminative approach: moderate resource requirements; typically fewer parameters to approximate than in a generative model.

    p(c_j|x)

    Discriminant function: can be trained probabilistically, but the output does not include confidence information.

    f(x) = c_j

    Discriminant Functions

    Why Discriminant Functions are limiting

    What can Generative and Discriminative approaches do that Discriminant Functions cannot?

    ...Or why we like probabilities

    Minimizing Risk: continuous updating.

    Reject Option: "I don't know."

    Compensating for Priors

    Combining Models

    We'll talk about these more when we discuss Perceptrons and Neural Networks.

    Generative Modeling

    Generative modeling: model the posterior.

    p(c_1|x) = \frac{p(x|c_1)\, p(c_1)}{p(x)}

             = \frac{p(x|c_1)\, p(c_1)}{\sum_j p(x, c_j)}

             = \frac{p(x|c_1)\, p(c_1)}{p(x, c_0) + p(x, c_1)}

             = \frac{p(x|c_1)\, p(c_1)}{p(x|c_0)\, p(c_0) + p(x|c_1)\, p(c_1)}

    Sigmoid function

    The sigmoid¹ function is a squashing function.

    \sigma(x) = \frac{1}{1 + \exp(-x)}

    A squashing function maps the reals to a finite interval.

    σ : R → (0, 1)

    ¹ S-shaped
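
    A quick numerical sketch of this squashing behaviour in NumPy (the function name and test points are mine, not from the slides):

        import numpy as np

        def sigmoid(x):
            """The sigmoid squashes all of R into the open interval (0, 1)."""
            return 1.0 / (1.0 + np.exp(-x))

        xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
        print(sigmoid(xs))    # values strictly between 0 and 1
        print(sigmoid(0.0))   # 0.5, the midpoint of the S-shaped curve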

    Generative Modeling

    p(c_1|x) = \frac{p(x|c_1)\, p(c_1)}{p(x|c_0)\, p(c_0) + p(x|c_1)\, p(c_1)}

             = \frac{1}{\dfrac{p(x|c_0)\, p(c_0)}{p(x|c_1)\, p(c_1)} + 1}

             = \frac{1}{\exp\left( \ln \dfrac{p(x|c_0)\, p(c_0)}{p(x|c_1)\, p(c_1)} \right) + 1}

             = \frac{1}{\exp\left( -\ln \dfrac{p(x|c_1)\, p(c_1)}{p(x|c_0)\, p(c_0)} \right) + 1}

    Let

    a = \ln \frac{p(x|c_1)\, p(c_1)}{p(x|c_0)\, p(c_0)}

    Then

    p(c_1|x) = \frac{1}{1 + \exp(-a)} = \sigma(a)

    Some more vocabulary

    log-odds or log-odds-ratio

    a = \ln \frac{p(x|c_1)\, p(c_1)}{p(x|c_0)\, p(c_0)}

    logit function: the inverse of the sigmoid.

    \sigma = \frac{1}{1 + \exp(-a)} \qquad a = \ln \frac{\sigma}{1 - \sigma}

    Generative Model

    Derive p(c_0|x) with a Gaussian class conditional probability.

    p(x|c_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}

    We'll assume that p(x|c_0) and p(x|c_1) have equal covariance matrices.

    We want to show that p(c_0|x) = σ(w^T x).

    p(c_0|x) = \sigma(a)

    a = \ln \frac{p(x|c_0)\, p(c_0)}{p(x|c_1)\, p(c_1)}

    a = \ln\left\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right\} \right\}
      - \ln\left\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right\} \right\}
      + \ln \frac{p(c_0)}{p(c_1)}

    Generative model

    p(c_0|x) = \sigma(a)

    Expanding the quadratic forms in a (using the facts that if A is symmetric then A = A^T and x^T A y = y^T A x; see HW):

    a = -\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) + \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \ln\frac{p(c_0)}{p(c_1)}

      = -\frac{1}{2}\left( x^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0 - x^T \Sigma^{-1} \mu_0 - \mu_0^T \Sigma^{-1} x \right)
        + \frac{1}{2}\left( x^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 - x^T \Sigma^{-1} \mu_1 - \mu_1^T \Sigma^{-1} x \right)
        + \ln\frac{p(c_0)}{p(c_1)}

    The x^T \Sigma^{-1} x terms cancel, and by symmetry x^T \Sigma^{-1} \mu_k = \mu_k^T \Sigma^{-1} x, so

      = -\frac{1}{2}\left( \mu_0^T \Sigma^{-1} \mu_0 - 2\mu_0^T \Sigma^{-1} x \right) + \frac{1}{2}\left( \mu_1^T \Sigma^{-1} \mu_1 - 2\mu_1^T \Sigma^{-1} x \right) + \ln\frac{p(c_0)}{p(c_1)}

      = (\mu_0 - \mu_1)^T \Sigma^{-1} x - \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \ln\frac{p(c_0)}{p(c_1)}

    So a is a linear function of x:

    a = w^T x + w_0

    p(c_0|x) = \sigma(w^T x + w_0)

    w = \Sigma^{-1}(\mu_0 - \mu_1)

    w_0 = -\frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \ln\frac{p(c_0)}{p(c_1)}
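
    A small numerical check of this result, assuming shared covariance: compute w and w_0 from μ_0, μ_1, Σ and the priors, then compare σ(w^T x + w_0) against the posterior obtained directly from Bayes' rule with Gaussian class conditionals. The use of scipy and all the variable names and values are my own choices for illustration.

        import numpy as np
        from scipy.stats import multivariate_normal

        # Example parameters (shared covariance), chosen arbitrarily.
        mu0, mu1 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
        Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
        p0, p1 = 0.4, 0.6                        # priors p(c_0), p(c_1)

        Sinv = np.linalg.inv(Sigma)
        w = Sinv @ (mu0 - mu1)
        w0 = -0.5 * mu0 @ Sinv @ mu0 + 0.5 * mu1 @ Sinv @ mu1 + np.log(p0 / p1)

        x = np.array([0.2, -0.7])
        posterior_linear = 1.0 / (1.0 + np.exp(-(w @ x + w0)))   # sigma(w^T x + w_0)

        # Direct Bayes' rule with Gaussian class conditionals.
        lik0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma)
        lik1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
        posterior_bayes = lik0 * p0 / (lik0 * p0 + lik1 * p1)

        print(posterior_linear, posterior_bayes)   # the two should agree up to rounding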

    Maximum Likelihood Solution

    Now we have a way to describe the linear transformation of x that generates a prediction under a Gaussian assumption.

    How do we estimate the parameters?

    Maximize the likelihood function with respect to each parameter.

    p(t, X \mid \pi, \mu_0, \mu_1, \Sigma) = \prod_{n=0}^{N-1} \left( \pi\, \mathcal{N}(x_n|\mu_0, \Sigma) \right)^{t_n} \left( (1 - \pi)\, \mathcal{N}(x_n|\mu_1, \Sigma) \right)^{1 - t_n}

    t_n = 1 for class 0, t_n = 0 for class 1.

    Prior class probabilities: p(c_0) = π, p(c_1) = 1 - π.
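
    A sketch of this likelihood in log form (for numerical stability), again assuming shared-covariance Gaussian class conditionals; scipy supplies the Gaussian density, and the function name and toy data are mine.

        import numpy as np
        from scipy.stats import multivariate_normal

        def log_likelihood(t, X, pi, mu0, mu1, Sigma):
            """ln p(t, X | pi, mu0, mu1, Sigma); t_n = 1 marks class 0."""
            logN0 = multivariate_normal.logpdf(X, mean=mu0, cov=Sigma)
            logN1 = multivariate_normal.logpdf(X, mean=mu1, cov=Sigma)
            return np.sum(t * (np.log(pi) + logN0) + (1 - t) * (np.log(1 - pi) + logN1))

        # Toy data: three points, the first two labelled class 0 (t_n = 1).
        X = np.array([[1.0, 0.2], [0.8, -0.1], [-1.2, 0.4]])
        t = np.array([1.0, 1.0, 0.0])
        print(log_likelihood(t, X, pi=2/3, mu0=np.array([0.9, 0.05]),
                             mu1=np.array([-1.2, 0.4]), Sigma=np.eye(2)))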

    Maximum Likelihood Solution

    Optimize π.

    p(t, X \mid \pi, \mu_0, \mu_1, \Sigma) = \prod_{n=0}^{N-1} \left( \pi\, \mathcal{N}(x_n|\mu_0, \Sigma) \right)^{t_n} \left( (1 - \pi)\, \mathcal{N}(x_n|\mu_1, \Sigma) \right)^{1 - t_n}

    \ln p(t, X \mid \pi, \mu_0, \mu_1, \Sigma) = \ln \prod_{n=0}^{N-1} \left( \pi\, \mathcal{N}(x_n|\mu_0, \Sigma) \right)^{t_n} \left( (1 - \pi)\, \mathcal{N}(x_n|\mu_1, \Sigma) \right)^{1 - t_n}

    = \sum_{n=0}^{N-1} t_n \ln\left( \pi\, \mathcal{N}(x_n|\mu_0, \Sigma) \right) + (1 - t_n) \ln\left( (1 - \pi)\, \mathcal{N}(x_n|\mu_1, \Sigma) \right)

    = \sum_{n=0}^{N-1} t_n \ln \pi + (1 - t_n) \ln(1 - \pi) + \text{const}

    \ln p(t, X \mid \pi, \mu_0, \mu_1, \Sigma) = \sum_{n=0}^{N-1} t_n \ln \pi + (1 - t_n) \ln(1 - \pi) + \text{const}

    \frac{\partial}{\partial \pi} \ln p(t, X \mid \pi, \mu_0, \mu_1, \Sigma) = \frac{1}{\pi} \sum_{n=0}^{N-1} t_n - \frac{1}{1 - \pi} \sum_{n=0}^{N-1} (1 - t_n) = 0

    \frac{1}{\pi} \sum_{n=0}^{N-1} t_n = \frac{1}{1 - \pi} \sum_{n=0}^{N-1} (1 - t_n)

    \frac{1 - \pi}{\pi} \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} (1 - t_n)

    \frac{1}{\pi} \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} (1 - t_n) + \sum_{n=0}^{N-1} t_n

    \frac{1}{\pi} \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} 1 = N

    \pi = \frac{1}{N} \sum_{n=0}^{N-1} t_n = \frac{N_0}{N} = \frac{N_0}{N_0 + N_1}

    Be prepared to maximize μ_0 and Σ for HW.
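
    The closed-form estimate π = N_0 / N takes one line to check (a sketch with made-up labels; t_n = 1 marks class 0, as above):

        import numpy as np

        t = np.array([1, 1, 0, 1, 0, 0, 0, 1])    # t_n = 1 for class 0
        pi_hat = t.mean()                          # (1/N) sum_n t_n = N_0 / N
        print(pi_hat)                              # 0.5 here: 4 of the 8 points are class 0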

    Discriminative Linear Classification

    In the generative case, recall:

    p(t|x) = \frac{p(x|t)\, p(t)}{p(x)}

    Can generate synthetic data from p(x).

    Need to model the joint probability.

    In Discriminative Modeling

    Model p(t|x) directly

    Logistic Regression

    From the generative case we can find that, under some assumptions:

    p(t|x) = y(x) = σ(w^T x)

    In M dimensions this has M parameters. In the generative case there are 2M parameters for the means and M(M+1)/2 for the covariance matrix.²

    The number of parameters grows linearly in M for the discriminative model, but quadratically in M for the generative model.

    So we'd rather optimize this function directly.

    ² Covariance matrices are symmetric.

    Maximum likelihood

    Define the Likelihood.

    p(t|w) = \prod_{n=0}^{N-1} p(c_0|x_n)^{t_n}\, p(c_1|x_n)^{1 - t_n}

    E(w) = -\ln p(t|w) = -\sum_{n=0}^{N-1} \left\{ t_n \ln p(c_0|x_n) + (1 - t_n) \ln p(c_1|x_n) \right\}

    Where y_n = p(c_0|x_n) = σ(a_n), so p(c_1|x_n) = 1 - y_n, and

    E(w) = -\ln p(t|w) = -\sum_{n=0}^{N-1} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}

    This is also the cross-entropy error function.³

    ³ Logistic Regression is also called maximum entropy or maxent.
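
    A direct transcription of E(w) as code, with y_n = σ(w^T x_n); the function names and the toy values are placeholders of mine:

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        def cross_entropy_error(w, X, t):
            """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], with y_n = sigmoid(w^T x_n)."""
            y = sigmoid(X @ w)
            return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

        X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])   # rows are the x_n
        t = np.array([1.0, 0.0, 1.0])
        print(cross_entropy_error(np.array([0.1, -0.2]), X, t))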

    Maximum Likelihood

    E(w) = -\ln p(t|w) = -\sum_{n=0}^{N-1} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}

    Apply the chain rule:

    \nabla_w E = \sum_{n=0}^{N-1} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \nabla_w a_n

    Maximum Likelihood

    Derivation of ∂E/∂y_n.

    E(w) = -\sum_{n=0}^{N-1} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}

    \frac{\partial E}{\partial y_n} = \frac{1 - t_n}{1 - y_n} - \frac{t_n}{y_n}

    = \frac{y_n(1 - t_n) - t_n(1 - y_n)}{y_n(1 - y_n)}

    = \frac{y_n - y_n t_n - t_n + y_n t_n}{y_n(1 - y_n)}

    = \frac{y_n - t_n}{y_n(1 - y_n)}
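
    A finite-difference sanity check of this result for a single term of the sum (purely illustrative; the values are arbitrary):

        import numpy as np

        def E_term(y, t):
            # one term of E(w): -(t ln y + (1 - t) ln(1 - y))
            return -(t * np.log(y) + (1 - t) * np.log(1 - y))

        y, t, eps = 0.3, 1.0, 1e-6
        numeric = (E_term(y + eps, t) - E_term(y - eps, t)) / (2 * eps)
        analytic = (y - t) / (y * (1 - y))
        print(numeric, analytic)   # both approximately -3.333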

    Maximum Likelihood

    Derivation of ∂y_n/∂a_n.

    \frac{d\sigma}{da} = \frac{d}{da} \frac{1}{1 + \exp(-a)}

    = \frac{d (1 + \exp(-a))^{-1}}{da}

    = (-1)(1 + \exp(-a))^{-2} (\exp(-a)) (-1)

    = (1 + \exp(-a))^{-2} \exp(-a)

    = \frac{1}{1 + \exp(-a)} \cdot \frac{\exp(-a)}{1 + \exp(-a)}

    = \frac{1}{1 + \exp(-a)} \cdot \frac{1 + \exp(-a) - 1}{1 + \exp(-a)}

    = \frac{1}{1 + \exp(-a)} \left( \frac{1 + \exp(-a)}{1 + \exp(-a)} - \frac{1}{1 + \exp(-a)} \right)

    = \sigma (1 - \sigma)
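
    The same kind of finite-difference check for dσ/da = σ(1 - σ), at an arbitrary point of my choosing:

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        a, eps = 0.7, 1e-6
        numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
        analytic = sigmoid(a) * (1 - sigmoid(a))
        print(numeric, analytic)   # both approximately 0.222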

    Maximum Likelihood

    Derivation of ∇_w a_n.

    a_n = w^T x_n

    \nabla_w a_n = x_n

    Maximum Likelihood

    Putting it all together

    \nabla_w E = \sum_{n=0}^{N-1} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \nabla_w a_n

    = \sum_{n=0}^{N-1} \frac{y_n - t_n}{y_n(1 - y_n)} \, \big( y_n(1 - y_n) \big) \, x_n

    = \sum_{n=0}^{N-1} (y_n - t_n)\, x_n

    This is the same as the gradient of the sum-of-squares error in linear regression.
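
    A quick check that the assembled gradient matches a finite-difference gradient of E(w); the random data and names here are illustrative only:

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        def E(w, X, t):
            y = sigmoid(X @ w)
            return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

        def grad_E(w, X, t):
            """Analytic gradient: sum_n (y_n - t_n) x_n."""
            return X.T @ (sigmoid(X @ w) - t)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(20, 3))
        t = (rng.uniform(size=20) < 0.5).astype(float)
        w = rng.normal(size=3)

        numeric = np.array([(E(w + d, X, t) - E(w - d, X, t)) / 2e-6
                            for d in np.eye(3) * 1e-6])
        print(np.allclose(numeric, grad_E(w, X, t), atol=1e-4))   # True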

    How do we optimize this?

    We know the gradient, but how do we find the optimal value of w?

    \nabla_w E = \sum_{n=0}^{N-1} (y_n - t_n)\, x_n

    Numerical Approximations

    Gradient Descent (equivalently, gradient ascent on the log likelihood)

    w_{n+1} = w_n - \eta \nabla_w E(w_n)

    Guess.

    Jump in the direction of the negative gradient.

    Guess again.
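
    Putting the pieces together, a minimal gradient descent loop for logistic regression (guess, step against the gradient, repeat); the data, learning rate, and iteration count are placeholders, not values from the lecture:

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        def fit_logistic(X, t, eta=0.1, n_iters=2000):
            """Minimize the cross-entropy error E(w) by gradient descent."""
            w = np.zeros(X.shape[1])              # initial guess
            for _ in range(n_iters):
                y = sigmoid(X @ w)                # y_n = sigma(w^T x_n)
                grad = X.T @ (y - t)              # sum_n (y_n - t_n) x_n
                w = w - eta * grad                # step along the negative gradient
            return w

        # Toy 1-D problem with a bias feature in the first column.
        X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
        t = np.array([0.0, 0.0, 1.0, 1.0])
        w = fit_logistic(X, t)
        print(w, sigmoid(X @ w))   # predicted probabilities track the targets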

    Example of Gradient Descent

    [Figure: successive gradient descent steps on a one-dimensional function f(x), with the current point marked at each step.]

    x_0 = 5, \quad f'(x_0) = 10, \quad \eta = 0.2

    x_1 = x_0 - \eta f'(x_0) = 5 - 0.2 \cdot 10 = 3

    x_1 = 3, \quad f'(x_1) = 6, \quad \eta = 0.2

    x_2 = x_1 - \eta f'(x_1) = 3 - 0.2 \cdot 6 = 1.8

    x_2 = 1.8, \quad f'(x_2) = 3.6, \quad \eta = 0.2

    x_3 = x_2 - \eta f'(x_2) = 1.8 - 0.2 \cdot 3.6 = 1.08

    x_3 = 1.08, \quad f'(x_3) = 2.16, \quad \eta = 0.2

    x_4 = x_3 - \eta f'(x_3) = 1.08 - 0.2 \cdot 2.16 = 0.648
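
    The steps above can be reproduced in a few lines. The function being minimized is not stated on the slides; f(x) = x^2 (so f'(x) = 2x) is my assumption, chosen because it matches the derivative values 10, 6, 3.6, 2.16 shown:

        # Reproduce the 1-D gradient descent iterates, assuming f(x) = x^2.
        def f_prime(x):
            return 2 * x

        x, eta = 5.0, 0.2
        for _ in range(4):
            x = x - eta * f_prime(x)
            print(x)   # 3.0, 1.8, 1.08, 0.648, matching the worked example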

    Another approach to N-way classification

    In this derivation we used a 1-of-K representation with a K-class discriminant function.

    Another approach is to construct K - 1 binary classifiers.

    Each classifier C_n compares c_n to "not c_n".

    Binary Classifiers are simpler.

    But there are some problems with this approach.
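
    A rough sketch of the idea: train one binary logistic classifier per class against "everything else" and predict the class whose classifier is most confident. The slides describe K - 1 classifiers; for simplicity this sketch trains one per class and picks the argmax, which is one common variant. All code and data here are my own illustration.

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        def fit_logistic(X, t, eta=0.1, n_iters=2000):
            w = np.zeros(X.shape[1])
            for _ in range(n_iters):
                w -= eta * (X.T @ (sigmoid(X @ w) - t))   # gradient descent on E(w)
            return w

        def fit_one_vs_rest(X, labels, K):
            """One binary classifier per class: c_k versus 'not c_k'."""
            return [fit_logistic(X, (labels == k).astype(float)) for k in range(K)]

        def predict(ws, X):
            # Pick the class whose binary classifier gives the highest probability.
            scores = np.stack([sigmoid(X @ w) for w in ws], axis=1)
            return scores.argmax(axis=1)

        # Three small clusters (bias feature in the first column).
        X = np.array([[1.0, 0.0, 0.0], [1.0, 0.2, -0.1],
                      [1.0, 2.0, 0.0], [1.0, 2.1, 0.2],
                      [1.0, 1.0, 2.0], [1.0, 0.9, 2.1]])
        labels = np.array([0, 0, 1, 1, 2, 2])
        ws = fit_one_vs_rest(X, labels, K=3)
        print(predict(ws, X))   # expected: [0 0 1 1 2 2]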

    One versus the rest

    K-class discriminant

    Context

    Logistic Regression

    Powerful classification technique.

    Must be approximated; there is no closed-form solution.

    Assumption of linearity.

    Can also be extended with basis functions.

    Also called maximum entropy.

    Bye

    Next

    Graphical Models
