
Page 1: ICS 178: Introduction to Machine Learning & Data Mining

Instructor: Max Welling
Lecture 6: Logistic Regression

Page 2: Logistic Regression

• This is also regression, but with targets Y ∈ {0,1}, i.e. it is classification!

• We will fit a regression function to P(Y=1|X).

[Figure: diagram of the model, computing $f(x_i)$ from the inputs $x_j$ via the weights $A_{ij}$.]

linear regression: $Y_n = A X_n + b$

logistic regression: $P(Y_n = 1 \mid X_n) = f(A X_n + b)$, where $f(X) = \frac{1}{1 + \exp[-(A X + b)]}$
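As a minimal sketch of this model in NumPy (the names `f` and `p_y1_given_x` are chosen here for illustration; `A` is a weight vector and `b` a scalar offset):

```python
import numpy as np

def f(z):
    # sigmoid: f(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(X, A, b):
    # P(Y=1 | X) = f(A X + b); X has shape (N, D), A shape (D,), b is a scalar
    return f(X @ A + b)
```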

Page 3: Sigmoid function f(x)

[Figure: plot of the sigmoid $f(X) = \frac{1}{1 + \exp[-(A X + b)]}$, giving $P(Y_n = 1 \mid X_n) = f(A X_n + b)$; data-points with Y=0 and data-points with Y=1 are marked along the curve.]
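A minimal sketch to reproduce a figure like this with matplotlib (the data-point positions are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 200)
plt.plot(x, 1.0 / (1.0 + np.exp(-x)), label="sigmoid f(x)")
# illustrative data-points: class 0 near f ~ 0, class 1 near f ~ 1
plt.scatter([-4.0, -2.5, -1.5], [0, 0, 0], marker="o", label="data-points with Y=0")
plt.scatter([1.5, 2.5, 4.0], [1, 1, 1], marker="x", label="data-points with Y=1")
plt.xlabel("AX + b")
plt.ylabel("P(Y=1|X)")
plt.legend()
plt.show()
```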

Page 4: In 2 Dimensions

A, b determine 1) the orientation, 2) the thickness (margin), and 3) the offset of the decision surface.

[Figure: the sigmoid f(x) drawn over the 2-D input plane.]
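A sketch of how A, b map to these three properties (example values; reading the "thickness" as the length scale $1/\|A\|$ over which f transitions is an interpretation added here, not stated on the slide):

```python
import numpy as np

A = np.array([2.0, 1.0])  # example weights
b = -1.0                  # example offset

# The decision surface is where f(A.x + b) = 0.5, i.e. the line A.x + b = 0.
norm = np.linalg.norm(A)
orientation = A / norm    # 1) unit normal to the decision line
thickness = 1.0 / norm    # 2) length scale over which f goes from ~0 to ~1
offset = -b / norm        # 3) signed distance of the line from the origin
```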

Page 5: Cost Function

• We want a different error measure that is better suited for 0/1 data.

• This can be derived from maximizing the probability of the data again.

$$P(Y_n \mid X_n) = f(X_n)^{Y_n} \, \big(1 - f(X_n)\big)^{1 - Y_n}$$

with

$$P(Y_n = 1 \mid X_n) = f(A X_n + b), \qquad P(Y_n = 0 \mid X_n) = 1 - f(A X_n + b)$$

Maximizing the log-probability of the data is then the same as minimizing

$$Error = -\sum_{n=1}^{N} \Big[ Y_n \log f(X_n) + (1 - Y_n) \log\big(1 - f(X_n)\big) \Big]$$
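A direct transcription of this error measure into NumPy might look like this (the clipping of F away from 0 and 1 is an added numerical safeguard against log(0)):

```python
import numpy as np

def cross_entropy_error(Y, F):
    # Error = -sum_n [ Y_n log f(X_n) + (1 - Y_n) log(1 - f(X_n)) ]
    # Y: 0/1 labels, F: predicted probabilities f(X_n)
    F = np.clip(F, 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -np.sum(Y * np.log(F) + (1.0 - Y) * np.log(1.0 - F))
```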

Page 6: Learning A,b

• Again, we take the derivatives of the Error with respect to the parameters.

• This time, however, we can't solve them analytically, so we use gradient descent.

$$A \leftarrow A - \eta \, \frac{dError}{dA}, \qquad b \leftarrow b - \eta \, \frac{dError}{db}$$

(with $\eta$ a small step size)

Page 7: Gradients for Logistic Regression

• After the math (on the white-board) we find:

$$\frac{\partial Error}{\partial A} = -\sum_n \Big[ Y_n \big(1 - f(X_n)\big) - (1 - Y_n) f(X_n) \Big] X_n^T$$

$$\frac{\partial Error}{\partial b} = -\sum_n \Big[ Y_n \big(1 - f(X_n)\big) - (1 - Y_n) f(X_n) \Big]$$

Note: the first term in each equation (multiplied by $Y_n$) only sums over data with Y=1, while the second term (multiplied by $(1 - Y_n)$) only sums over data with Y=0.

Follow the gradient until the change in A, b falls below a small threshold (e.g. 1e-6).
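Putting the gradients and the stopping rule together, a bare-bones trainer might look like the sketch below (the step size `eta` and iteration cap are illustrative choices; it uses the algebraic simplification $Y_n(1 - f) - (1 - Y_n) f = Y_n - f$):

```python
import numpy as np

def fit_logistic(X, Y, eta=0.1, tol=1e-6, max_iter=100_000):
    # X: (N, D) attributes, Y: (N,) labels in {0, 1}
    N, D = X.shape
    A, b = np.zeros(D), 0.0
    for _ in range(max_iter):
        F = 1.0 / (1.0 + np.exp(-(X @ A + b)))   # f(A X_n + b) for all n
        # dError/dA = -sum_n (Y_n - f(X_n)) X_n^T ; dError/db = -sum_n (Y_n - f(X_n))
        grad_A = -(Y - F) @ X
        grad_b = -np.sum(Y - F)
        A_new, b_new = A - eta * grad_A, b - eta * grad_b
        # stop when the change in A, b falls below the threshold
        converged = max(np.max(np.abs(A_new - A)), abs(b_new - b)) < tol
        A, b = A_new, b_new
        if converged:
            break
    return A, b
```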

Page 8: Classification

• Once we have found the optimal values for A,b we classify future data with:

$$Y_{new} = \mathrm{round}\big( f(X_{new}) \big)$$
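In code, a sketch using the same conventions as the earlier snippets:

```python
import numpy as np

def classify(X_new, A, b):
    # Y_new = round(f(A X_new + b)): class 1 whenever P(Y=1|X) > 0.5
    return np.round(1.0 / (1.0 + np.exp(-(X_new @ A + b)))).astype(int)
```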

• Least squares and logistic regression are parametric methods, since all the information in the data is stored in the parameters A, b; after learning you can toss out the data.

• Also, the decision surface is always linear; its complexity does not grow with the amount of data.

• We have imposed our prior knowledge that the decision surface should be linear.

Page 9: A Real Example

• Fingerprints are matched against a database.
• Each match is scored.
• Using logistic regression we try to predict whether a future match is real or false.
• Human fingerprint examiners claim 100% accuracy. Is this true?

(in collaboration with S. Cole)

Page 10: Exercise

• You have laid your hands on a dataset where each data-case has a single attribute and a class label (0 or 1). You train a logistic regression classifier. A new data-case is presented. What do you do to decide which class it falls in (use an equation or pseudo-code)?

• How many parameters are there to tune for this problem? Explain what these parameters mean in terms of the function P(Y=1|X).