ICS 178: Introduction to Machine Learning & Data Mining
Instructor: Max Welling
Lecture 4: Least Squares Regression
What have we done so far?
|              | non-parametric                        | parametric                                                                               |
|--------------|---------------------------------------|------------------------------------------------------------------------------------------|
| unsupervised | density estimation: Parzen windowing  | future example: k-means                                                                  |
| supervised   | classification: kNN; regression: ✗    | classification: logistic regression (future example); regression: least squares (today) |
Problem
[Figure: number of manatee kills versus number of boats]
Goal
• Given data, find a linear relationship between them.
• In 1 dimension we have data {Xn, Yn} (blue dots).
• We imagine vertical springs (red lines) between the data and a stiff rod (the line). (Imagine they can slide along the rod so they remain vertical.)
• Springs have rest length 0, so they compete to pull the rod towards them.
• The relaxed solution is what we are after.
Cost Function
We measure the total squared length of all the springs:

$$\mathrm{Error} = \sum_{n=1}^{N} d_n^2 = \sum_{n=1}^{N} \left(Y_n - aX_n - b\right)^2$$
We can now take the derivatives with respect to a and b and set them to 0.
After some algebra (on the whiteboard) we find:
$$a^* = \frac{\dfrac{1}{N}\sum_{n=1}^{N} Y_n X_n \;-\; \dfrac{1}{N}\sum_{n=1}^{N} Y_n \cdot \dfrac{1}{N}\sum_{n=1}^{N} X_n}{\dfrac{1}{N}\sum_{n=1}^{N} X_n^2 \;-\; \left(\dfrac{1}{N}\sum_{n=1}^{N} X_n\right)^2}$$

$$b^* = \frac{1}{N}\sum_{n=1}^{N} Y_n \;-\; a^*\,\frac{1}{N}\sum_{n=1}^{N} X_n$$
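As a sanity check, here is a minimal MATLAB sketch of these 1-D formulas (the function name LSRegression1D is illustrative, not from the slides; the slides give a multivariate version below):

function [a,b] = LSRegression1D(X,Y)
% X, Y: 1 x N vectors of inputs and targets.
N  = length(X);
EX = sum(X)/N;                                      % E[X]
EY = sum(Y)/N;                                      % E[Y]
a  = (sum(X.*Y)/N - EX*EY) / (sum(X.^2)/N - EX^2);  % a* from the formula above
b  = EY - a*EX;                                     % b* = E[Y] - a* E[X]

For example, [a,b] = LSRegression1D(1:10, 2*(1:10)+1) returns a = 2, b = 1, since that data lies exactly on a line.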
More Variables
• More generally, we want Dx input variables and Dy output variables.
• The cost is now:

$$\mathrm{Error} = \sum_{n=1}^{N} \|d_n\|^2 = \sum_{n=1}^{N} \|Y_n - AX_n - b\|^2$$

Taking the derivatives with respect to A and b and setting them to 0 gives:

$$A^* = \left[\frac{1}{N}\sum_{n=1}^{N} Y_n X_n^T - \frac{1}{N}\sum_{n=1}^{N} Y_n \left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^{T}\right]\left[\frac{1}{N}\sum_{n=1}^{N} X_n X_n^T - \frac{1}{N}\sum_{n=1}^{N} X_n \left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^{T}\right]^{-1}$$

$$\phantom{A^*} = \left(E[YX^T] - E[Y]\,E[X]^T\right)\mathrm{Cov}[X]^{-1}$$

$$b^* = \frac{1}{N}\sum_{n=1}^{N} Y_n - A^*\,\frac{1}{N}\sum_{n=1}^{N} X_n = E[Y] - A^*E[X]$$
In Matlab
function [A,b] = LSRegression(X,Y)
% X: Dx x N matrix of inputs, Y: Dy x N matrix of outputs.
[D,N] = size(X);
EX    = sum(X,2)/N;           % E[X]
CovX  = X*X'/N - EX*EX';      % Cov[X]
EY    = sum(Y,2)/N;           % E[Y]
CovXY = Y*X'/N - EY*EX';      % E[YX'] - E[Y]E[X]'
A     = CovXY * inv(CovX);    % A* = Cov[Y,X] Cov[X]^{-1}
b     = EY - A*EX;            % b* = E[Y] - A* E[X]
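A possible usage sketch on synthetic data (the ground-truth values A0, b0 and the noise level are arbitrary choices for illustration):

N  = 1000;
A0 = [2 -1; 0 3];  b0 = [1; -2];               % ground truth (illustrative)
X  = randn(2,N);                               % Dx = 2 inputs
Y  = A0*X + repmat(b0,1,N) + 0.1*randn(2,N);   % Dy = 2 noisy outputs
[A,b] = LSRegression(X,Y);                     % A should approximate A0, b should approximate b0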
Statistical Interpretation
• We can think of the problem as one where we are trying to find the probability distribution P(Y|X).
• We can write

$$Y_n = AX_n + b + d_n$$

where d_n is the residual error pointing vertically from the line to the data point.
• d is a random vector, and we may assume it has a Gaussian distribution:

$$d \sim \mathcal{N}(0, \sigma^2)$$
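A minimal MATLAB sketch of sampling from this generative model (the parameter values are arbitrary, chosen only for illustration):

N = 500;  A = 1.5;  b = 0.5;  sigma = 0.2;   % illustrative 1-D parameters
X = rand(1,N);                               % arbitrary inputs
d = sigma*randn(1,N);                        % residuals d_n ~ N(0, sigma^2)
Y = A*X + b + d;                             % Y_n = A X_n + b + d_n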
Statistical Interpretation

$$d_n = Y_n - AX_n - b \sim \mathcal{N}(0, \sigma^2)$$

$$P(Y_n \mid X_n) = \mathcal{N}(AX_n + b,\; \sigma^2) \propto \exp\!\left[-\frac{1}{2\sigma^2}\,\|Y_n - AX_n - b\|^2\right]$$
• We can now maximize the probability of the data under the model by adapting the parameters A, b.
• If we use the negative log-probability, we get:
$$C = \frac{1}{2\sigma^2}\sum_{n=1}^{N}\|Y_n - AX_n - b\|^2 + \mathrm{const.}$$
• Looks familiar?
• We can also optimize for σ (it won't affect A, b).
• This is called "maximum likelihood learning".
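Setting the derivative of C with respect to σ to zero gives the mean squared residual as the maximum-likelihood estimate; a minimal sketch under the isotropic-Gaussian assumption, reusing X, Y, N and the fitted A, b from the sketches above:

R        = Y - A*X - repmat(b,1,N);   % residuals d_n = Y_n - A X_n - b
sigma2ML = sum(R(:).^2) / numel(R);   % sigma^2_ML: mean squared residual per dimension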