
CS 2750 Machine Learning
Lecture 8

Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square

Classification: Logistic regression.
Generative classification model.

Binary classification

• Two classes: Y = \{0,1\}
• Our goal is to learn to classify correctly two types of examples
  – Class 0 – labeled as 0
  – Class 1 – labeled as 1
• We would like to learn f : X \to \{0,1\}
• Zero-one error (loss) function:

  Error_1(f(\mathbf{x}_i,\mathbf{w}), y_i) =
    \begin{cases} 1 & \text{if } f(\mathbf{x}_i,\mathbf{w}) \neq y_i \\
                  0 & \text{if } f(\mathbf{x}_i,\mathbf{w}) = y_i \end{cases}

• Error we would like to minimize: E_{(\mathbf{x},y)}\left[ Error_1(f(\mathbf{x},\mathbf{w}), y) \right] (see the sketch below)
• First step: we need to devise a model of the function f
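For concreteness, the empirical version of this error over a labeled sample is just the fraction of misclassified examples; a minimal Python sketch (the helper name zero_one_error is ours, not from the lecture):

    import numpy as np

    def zero_one_error(y_true, y_pred):
        # Mean zero-one loss: fraction of examples where the
        # prediction f(x_i, w) disagrees with the label y_i.
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        return float(np.mean(y_true != y_pred))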


Discriminant functions

• One way to represent a classifier is by using discriminant functions
• Works for binary and multi-way classification
• Idea:
  – For every class i = 0, 1, \ldots, k define a function g_i(\mathbf{x}) mapping X \to \mathbb{R}
  – When the decision on input \mathbf{x} should be made, choose the class with the highest value of g_i(\mathbf{x}) (see the sketch below)
• So what happens with the input space? Assume a binary case.
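A minimal sketch of this decision rule in Python (the argument discriminants is a hypothetical list holding one function g_i per class):

    import numpy as np

    def classify(x, discriminants):
        # Evaluate every discriminant g_i at x and choose the
        # class with the highest value.
        return int(np.argmax([g(x) for g in discriminants]))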

Discriminant functions

[Figure, shown in three steps: examples from two classes in a 2-D input space; the space is partitioned into the region where g_1(\mathbf{x}) \ge g_0(\mathbf{x}) (decide class 1) and the region where g_1(\mathbf{x}) \le g_0(\mathbf{x}) (decide class 0).]

Discriminant functions

• Define the decision boundary: the set of points where g_1(\mathbf{x}) = g_0(\mathbf{x})

[Figure: the regions g_1(\mathbf{x}) \ge g_0(\mathbf{x}) and g_1(\mathbf{x}) \le g_0(\mathbf{x}), separated by the decision boundary g_1(\mathbf{x}) = g_0(\mathbf{x}).]

Quadratic decision boundary

[Figure: a quadratic decision boundary g_1(\mathbf{x}) = g_0(\mathbf{x}) separating the regions g_1(\mathbf{x}) \ge g_0(\mathbf{x}) and g_1(\mathbf{x}) \le g_0(\mathbf{x}).]

Logistic regression model

• Defines a linear decision boundary
• Discriminant functions:

  g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x}), \qquad g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T\mathbf{x})

• where g(z) = 1/(1 + e^{-z}) is a logistic function
• The model output: f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})

[Figure: network view of the model – the input vector \mathbf{x} enters as (1, x_1, x_2, \ldots, x_d) with weights w_0, w_1, w_2, \ldots, w_d; their weighted sum z = \mathbf{w}^T\mathbf{x} is passed through the logistic function to give f(\mathbf{x}, \mathbf{w}).]

Logistic function

• Is also referred to as a sigmoid function
• Replaces the threshold function with smooth switching
• Takes a real number and outputs a number in the interval [0,1]:

  g(z) = \frac{1}{1 + e^{-z}}

[Figure: plot of g(z) for z \in [-20, 20]; the output rises smoothly from 0 to 1, with g(0) = 0.5.]
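The logistic function translates directly into NumPy; the clipping below is a numerical-stability choice of ours, not part of the lecture:

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + exp(-z)); clip z so exp() cannot
        # overflow for very large negative inputs.
        z = np.clip(z, -500.0, 500.0)
        return 1.0 / (1.0 + np.exp(-z))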


Logistic regression model

• Discriminant functions:

  g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x}), \qquad g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T\mathbf{x})

• Values of discriminant functions vary in [0,1]
  – Probabilistic interpretation:

  f(\mathbf{x}, \mathbf{w}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})

[Figure: the same network view as before, with the output now read as p(y = 1 \mid \mathbf{x}, \mathbf{w}).]

Logistic regression

• We learn a probabilistic function f : X \to [0,1]

  f(\mathbf{x}, \mathbf{w}) = g(\mathbf{w}^T\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})

  – where f describes the probability of class 1 given \mathbf{x}

• Note that: p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w})

• Transformation to binary class values (see the sketch below):

  If p(y = 1 \mid \mathbf{x}) \ge 1/2 then choose 1, else choose 0
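A minimal sketch of the probability model and the 1/2-threshold rule in Python, assuming x already carries a leading 1 so that w_0 acts as the bias:

    import numpy as np

    def predict_proba(x, w):
        # f(x, w) = g(w^T x) = p(y = 1 | x, w)
        return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

    def predict_class(x, w):
        # If p(y = 1 | x) >= 1/2 choose 1, else choose 0.
        return 1 if predict_proba(x, w) >= 0.5 else 0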


Linear decision boundary

• Logistic regression model defines a linear decision boundary
• Why?
• Answer: Compare the two discriminant functions.
• Decision boundary: g_1(\mathbf{x}) = g_0(\mathbf{x})
• For the boundary it must hold:

  \log \frac{g_1(\mathbf{x})}{g_0(\mathbf{x})} = \log \frac{g(\mathbf{w}^T\mathbf{x})}{1 - g(\mathbf{w}^T\mathbf{x})} = 0

• Expanding the logistic function:

  \log \frac{g_1(\mathbf{x})}{g_0(\mathbf{x})}
  = \log \frac{ \dfrac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})} }{ \dfrac{\exp(-\mathbf{w}^T\mathbf{x})}{1 + \exp(-\mathbf{w}^T\mathbf{x})} }
  = \log \exp(\mathbf{w}^T\mathbf{x}) = \mathbf{w}^T\mathbf{x} = 0

• The boundary is therefore the set \mathbf{w}^T\mathbf{x} = 0, a plane (line) in the input space; see the numerical check below.
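A quick numerical check of this identity (the weights and input below are made up for illustration): the log-odds computed from the model output should equal \mathbf{w}^T\mathbf{x} exactly, so the boundary is the linear set \mathbf{w}^T\mathbf{x} = 0:

    import numpy as np

    def log_odds(x, w):
        # log( g_1(x) / g_0(x) ) computed from the model output.
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
        return np.log(p / (1.0 - p))

    w = np.array([0.5, -1.2, 2.0])
    x = np.array([1.0, 0.3, -0.7])
    print(np.isclose(log_odds(x, w), np.dot(w, x)))  # True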

Logistic regression model. Decision boundary

• LR defines a linear decision boundary
• Example: 2 classes (blue and red points)

[Figure: blue and red example points in a 2-D input space separated by a linear decision boundary.]


Logistic regression: parameter learning

Likelihood of outputs
• Let

  D_i = \langle \mathbf{x}_i, y_i \rangle, \qquad \mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T\mathbf{x}_i)

• Then

  L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}

• Find weights w that maximize the likelihood of outputs
  – Apply the log-likelihood trick. The optimal weights are the same for both the likelihood and the log-likelihood:

  l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)
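The log-likelihood is easy to evaluate in vectorized form; a sketch in Python/NumPy, where X holds one example per row (with a leading column of ones for the bias) and the small eps guarding log(0) is our addition:

    import numpy as np

    def log_likelihood(X, y, w):
        # l(D, w) = sum_i [ y_i log mu_i + (1 - y_i) log(1 - mu_i) ],
        # with mu_i = g(w^T x_i).
        mu = 1.0 / (1.0 + np.exp(-X.dot(w)))
        eps = 1e-12  # avoid log(0) when mu saturates
        return float(np.sum(y * np.log(mu + eps)
                            + (1 - y) * np.log(1 - mu + eps)))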


Logistic regression: parameter learning

• Log-likelihood:

  l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)

• Derivatives of the log-likelihood (nonlinear in weights!):

  - \nabla_{\mathbf{w}} l(D, \mathbf{w})
  = - \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - g(\mathbf{w}^T\mathbf{x}_i) \right)
  = - \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - f(\mathbf{x}_i, \mathbf{w}) \right)

  \frac{\partial}{\partial w_j} \left[ -l(D, \mathbf{w}) \right] = - \sum_{i=1}^{n} x_{i,j} \left( y_i - g(z_i) \right)

• Gradient descent:

  \mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k) \, \nabla_{\mathbf{w}} \left[ -l(D, \mathbf{w}) \right] \Big|_{\mathbf{w}^{(k-1)}}

  that is,

  \mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} \left[ y_i - f(\mathbf{x}_i, \mathbf{w}^{(k-1)}) \right] \mathbf{x}_i
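The batch update above translates into a few lines of NumPy; a minimal sketch (the fixed learning rate and iteration count are illustrative choices of ours; the slides allow an annealed \alpha(k)):

    import numpy as np

    def fit_logistic_gd(X, y, steps=1000, alpha=0.1):
        # Gradient descent on -l(D, w):
        #   w <- w + alpha * sum_i [y_i - f(x_i, w)] x_i
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            mu = 1.0 / (1.0 + np.exp(-X.dot(w)))  # f(x_i, w) for all i
            w = w + alpha * X.T.dot(y - mu)
        return w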


Logistic regression. Online gradient descent

• On-line component of the log-likelihood:

  J_{online}(D_i, \mathbf{w}) = - \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right]

• On-line learning update for weights \mathbf{w}, using a single data point D_k = \langle \mathbf{x}_k, y_k \rangle at step k:

  \mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k) \, \nabla_{\mathbf{w}} J_{online}(D_k, \mathbf{w}) \Big|_{\mathbf{w}^{(k-1)}}

• kth update for the logistic regression:

  \mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \left[ y_k - f(\mathbf{x}_k, \mathbf{w}^{(k-1)}) \right] \mathbf{x}_k

Online logistic regression algorithm

Online-logistic-regression(D, number of iterations)
  initialize weights \mathbf{w} = (w_0, w_1, w_2, \ldots, w_d)
  for i = 1 : number of iterations
    select a data point D_i = \langle \mathbf{x}_i, y_i \rangle from D
    set \alpha = 1/i
    update weights (in parallel):
      \mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ y_i - f(\mathbf{x}_i, \mathbf{w}) \right] \mathbf{x}_i
  end for
  return weights \mathbf{w}
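A runnable Python version of this pseudocode, under the assumption that data points are selected uniformly at random (the slides only say "select a data point from D") and that X carries a leading column of ones for the bias:

    import numpy as np

    def online_logistic_regression(X, y, num_iterations, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        w = np.zeros(X.shape[1])               # initialize weights
        for i in range(1, num_iterations + 1):
            k = rng.integers(len(y))           # select a data point
            alpha = 1.0 / i                    # set alpha = 1/i
            f = 1.0 / (1.0 + np.exp(-w.dot(X[k])))
            w = w + alpha * (y[k] - f) * X[k]  # update weights
        return w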

Online algorithm. Example.

[Three figure slides showing an example run of the online algorithm.]

Derivation of the gradient

• Log-likelihood:

  l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)

• Derivatives of the log-likelihood, via the chain rule through z_i = \mathbf{w}^T\mathbf{x}_i:

  \frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right] \frac{\partial z_i}{\partial w_j}

• The inner derivative:

  \frac{\partial}{\partial z_i} \left[ y_i \log g(z_i) + (1 - y_i) \log(1 - g(z_i)) \right]
  = y_i \frac{1}{g(z_i)} \frac{\partial g(z_i)}{\partial z_i} - (1 - y_i) \frac{1}{1 - g(z_i)} \frac{\partial g(z_i)}{\partial z_i}

• Derivative of a logistic function:

  \frac{\partial g(z_i)}{\partial z_i} = g(z_i)(1 - g(z_i))

• Substituting:

  = y_i (1 - g(z_i)) - (1 - y_i) g(z_i) = y_i - g(z_i)

• Finally, \dfrac{\partial z_i}{\partial w_j} = x_{i,j}, so

  \nabla_{\mathbf{w}} l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - g(\mathbf{w}^T\mathbf{x}_i) \right) = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - f(\mathbf{x}_i, \mathbf{w}) \right)
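A standard way to sanity-check this derivation is to compare the analytic derivative with a central finite difference of the log-likelihood; a sketch (assumes mu_i never saturates to exactly 0 or 1 so the logs stay finite):

    import numpy as np

    def grad_check(X, y, w, j, h=1e-6):
        def ll(w):
            mu = 1.0 / (1.0 + np.exp(-X.dot(w)))
            return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
        # Analytic: d l / d w_j = sum_i x_ij (y_i - g(z_i))
        analytic = np.sum(X[:, j] * (y - 1.0 / (1.0 + np.exp(-X.dot(w)))))
        e = np.zeros_like(w)
        e[j] = h
        numeric = (ll(w + e) - ll(w - e)) / (2 * h)
        return analytic, numeric  # the two should agree closely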


Generative approach to classification

Idea:
1. Represent and learn the distribution p(\mathbf{x}, y)
2. Use it to define probabilistic discriminant functions, e.g.

   g_0(\mathbf{x}) = p(y = 0 \mid \mathbf{x}), \qquad g_1(\mathbf{x}) = p(y = 1 \mid \mathbf{x})

Typical model: p(\mathbf{x}, y) = p(\mathbf{x} \mid y) \, p(y)   [graphical model: y \to \mathbf{x}]

• p(\mathbf{x} \mid y) = class-conditional distributions (densities)
  – binary classification: two class-conditional distributions, p(\mathbf{x} \mid y = 0) and p(\mathbf{x} \mid y = 1)
• p(y) = priors on classes – probability of class y
  – binary classification: Bernoulli distribution, p(y = 0) + p(y = 1) = 1

Generative approach to classification

Example:
• Class-conditional distributions
  – multivariate normal distributions, \mathbf{x} \sim N(\boldsymbol{\mu}, \Sigma):

  p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]

  \mathbf{x} \sim N(\boldsymbol{\mu}_0, \Sigma_0) \text{ for } y = 0, \qquad \mathbf{x} \sim N(\boldsymbol{\mu}_1, \Sigma_1) \text{ for } y = 1

• Priors on classes (class 0, 1)
  – Bernoulli distribution, y \in \{0, 1\}:

  p(y, \theta) = \theta^y (1 - \theta)^{1 - y}
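This generative story is two sampling steps; a sketch with SciPy, using made-up parameter values (not from the lecture):

    import numpy as np
    from scipy.stats import bernoulli, multivariate_normal

    # Illustrative parameters.
    theta = 0.5                                # prior p(y = 1)
    mu0, S0 = np.zeros(2), np.eye(2)           # class 0
    mu1, S1 = np.array([2.0, 1.0]), np.eye(2)  # class 1

    y = bernoulli.rvs(theta)                   # sample the class
    x = multivariate_normal.rvs(mean=mu1 if y else mu0,
                                cov=S1 if y else S0)  # sample x | y
    # Evaluate a class-conditional density p(x | mu, Sigma):
    p_x_given_0 = multivariate_normal.pdf(x, mean=mu0, cov=S0)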


Learning of parameters of the model

Density estimation in statistics:
• We see examples – we do not know the parameters of the Gaussians (class-conditional densities):

  p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]

• ML estimate of parameters of a multivariate normal N(\boldsymbol{\mu}, \Sigma) for a set of n examples of \mathbf{x}.
  Optimize the log-likelihood:

  l(D, \boldsymbol{\mu}, \Sigma) = \log \prod_{i=1}^{n} p(\mathbf{x}_i \mid \boldsymbol{\mu}, \Sigma)

  \hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i, \qquad
  \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T

• How about class priors?
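These estimates are one-liners in NumPy (and, answering the closing question, the ML estimate of the Bernoulli class prior is simply the fraction of examples in each class); a sketch:

    import numpy as np

    def ml_gaussian(X):
        # mu_hat    = (1/n) sum_i x_i
        # Sigma_hat = (1/n) sum_i (x_i - mu_hat)(x_i - mu_hat)^T
        # (note the 1/n, not 1/(n-1): this is the ML estimate)
        n = X.shape[0]
        mu_hat = X.mean(axis=0)
        Xc = X - mu_hat
        return mu_hat, (Xc.T @ Xc) / n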

Generative model

[Figure: examples from the two classes in a 2-D input space; the region where g_1(\mathbf{x}) \ge g_0(\mathbf{x}) is marked.]


2 Gaussian class-conditional densities

[Figure: contour plots of the two Gaussian class-conditional densities.]

Making class decision

Basically we need to design discriminant functions. Two possible choices:

• Likelihood of data – choose the class (Gaussian) that explains the input data \mathbf{x} better:

  If p(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1) > p(\mathbf{x} \mid \boldsymbol{\mu}_0, \Sigma_0) then y = 1, else y = 0

• Posterior of a class – choose the class with the better posterior probability (the posteriors play the role of g_1(\mathbf{x}) and g_0(\mathbf{x})):

  If p(y = 1 \mid \mathbf{x}) > p(y = 0 \mid \mathbf{x}) then y = 1, else y = 0

  where

  p(y = 1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1) \, p(y = 1)}{p(\mathbf{x} \mid \boldsymbol{\mu}_0, \Sigma_0) \, p(y = 0) + p(\mathbf{x} \mid \boldsymbol{\mu}_1, \Sigma_1) \, p(y = 1)}
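A sketch of the posterior-based decision in Python/SciPy (the parameter names are ours; prior1 stands for p(y = 1)):

    import numpy as np
    from scipy.stats import multivariate_normal

    def posterior_class1(x, mu0, S0, mu1, S1, prior1=0.5):
        # Bayes rule with Gaussian class-conditionals.
        p0 = multivariate_normal.pdf(x, mean=mu0, cov=S0) * (1 - prior1)
        p1 = multivariate_normal.pdf(x, mean=mu1, cov=S1) * prior1
        return p1 / (p0 + p1)

    def decide(x, mu0, S0, mu1, S1, prior1=0.5):
        # y = 1 if p(y = 1 | x) > p(y = 0 | x), else y = 0.
        return 1 if posterior_class1(x, mu0, S0, mu1, S1, prior1) > 0.5 else 0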


2 Gaussians: Quadratic decision boundary

[Figure: contours of the two class-conditional densities.]

2 Gaussians: Quadratic decision boundary

[Figure: the resulting quadratic decision boundary.]


2 Gaussians: Linear decision boundary

• When the covariances are the same:

  \mathbf{x} \sim N(\boldsymbol{\mu}_0, \Sigma) \text{ for } y = 0, \qquad \mathbf{x} \sim N(\boldsymbol{\mu}_1, \Sigma) \text{ for } y = 1

2 Gaussians: Linear decision boundary

[Figure: contours of the two class-conditional densities with equal covariances.]

2 Gaussians: Linear decision boundary

[Figure: the resulting linear decision boundary.]
