ICS 178: Introduction to Machine Learning & Data Mining
Instructor: Max Welling
Lecture 4: Least Squares Regression
What have we done so far?
|              | non-parametric                        | parametric                                                                               |
|--------------|---------------------------------------|------------------------------------------------------------------------------------------|
| unsupervised | density estimation: Parzen windowing  | future example: k-means                                                                  |
| supervised   | classification: kNN; regression: ✗    | classification: logistic regression (future example); regression: least squares (today) |
Problem
[Figure: number of manatee kills versus number of boats]
Goal
• Given data, find a linear relationship between them.
• In 1 dimension we have data {Xn, Yn} (blue dots).
• We imagine vertical springs (red lines) between the data and a stiff rod (the line). (Imagine they can slide along the rod so they remain vertical.)
• Springs have rest length 0, so they compete to pull the rod towards them.
• The relaxed solution is what we are after.
Cost Function
We measure the total squared length of all the springs:

$$\mathrm{Error} = \sum_{n=1}^{N} d_n^2 = \sum_{n=1}^{N} \left(Y_n - aX_n - b\right)^2$$
We can now take the derivatives with respect to a and b and set them to 0.
After some algebra (on the whiteboard) we find:
$$a^* = \frac{\dfrac{1}{N}\sum_{n=1}^{N} Y_n X_n \;-\; \dfrac{1}{N}\sum_{n=1}^{N} Y_n \cdot \dfrac{1}{N}\sum_{n=1}^{N} X_n}{\dfrac{1}{N}\sum_{n=1}^{N} X_n^2 \;-\; \left(\dfrac{1}{N}\sum_{n=1}^{N} X_n\right)^2}$$

$$b^* = \frac{1}{N}\sum_{n=1}^{N} Y_n \;-\; a^*\,\frac{1}{N}\sum_{n=1}^{N} X_n$$
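As a sanity check, here is a minimal MATLAB sketch of these 1-D formulas (the function name LSRegression1D is illustrative, not from the slides; the slides give a multivariate version below):

function [a,b] = LSRegression1D(X,Y)
% X, Y: 1 x N vectors of inputs and targets.
N  = length(X);
EX = sum(X)/N;                                      % E[X]
EY = sum(Y)/N;                                      % E[Y]
a  = (sum(X.*Y)/N - EX*EY) / (sum(X.^2)/N - EX^2);  % a* from the formula above
b  = EY - a*EX;                                     % b* = E[Y] - a* E[X]

For example, [a,b] = LSRegression1D(1:10, 2*(1:10)+1) returns a = 2, b = 1, since that data lies exactly on a line.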
More Variables
• More generally, we want Dx input variables and Dy output variables.
• The cost is now:

$$\mathrm{Error} = \sum_{n=1}^{N} \|d_n\|^2 = \sum_{n=1}^{N} \|Y_n - AX_n - b\|^2$$

Taking the derivatives with respect to A and b and setting them to 0 gives:

$$A^* = \left[\frac{1}{N}\sum_{n=1}^{N} Y_n X_n^T - \frac{1}{N}\sum_{n=1}^{N} Y_n \left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^{T}\right]\left[\frac{1}{N}\sum_{n=1}^{N} X_n X_n^T - \frac{1}{N}\sum_{n=1}^{N} X_n \left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^{T}\right]^{-1}$$

$$\phantom{A^*} = \left(E[YX^T] - E[Y]\,E[X]^T\right)\mathrm{Cov}[X]^{-1}$$

$$b^* = \frac{1}{N}\sum_{n=1}^{N} Y_n - A^*\,\frac{1}{N}\sum_{n=1}^{N} X_n = E[Y] - A^*E[X]$$
In Matlab
function [A,b] = LSRegression(X,Y)
% X: Dx x N matrix of inputs, Y: Dy x N matrix of outputs.
[D,N] = size(X);
EX    = sum(X,2)/N;           % E[X]
CovX  = X*X'/N - EX*EX';      % Cov[X]
EY    = sum(Y,2)/N;           % E[Y]
CovXY = Y*X'/N - EY*EX';      % E[YX'] - E[Y]E[X]'
A     = CovXY * inv(CovX);    % A* = Cov[Y,X] Cov[X]^{-1}
b     = EY - A*EX;            % b* = E[Y] - A* E[X]
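A possible usage sketch on synthetic data (the ground-truth values A0, b0 and the noise level are arbitrary choices for illustration):

N  = 1000;
A0 = [2 -1; 0 3];  b0 = [1; -2];               % ground truth (illustrative)
X  = randn(2,N);                               % Dx = 2 inputs
Y  = A0*X + repmat(b0,1,N) + 0.1*randn(2,N);   % Dy = 2 noisy outputs
[A,b] = LSRegression(X,Y);                     % A should approximate A0, b should approximate b0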
Statistical Interpretation
• We can think of the problem as one where we are trying to find the probability distribution P(Y|X).
• We can write

$$Y_n = AX_n + b + d_n$$

where d_n is the residual error pointing vertically from the line to the data point.
• d is a random vector, and we may assume it has a Gaussian distribution:

$$d \sim \mathcal{N}(0, \sigma^2)$$
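A minimal MATLAB sketch of sampling from this generative model (the parameter values are arbitrary, chosen only for illustration):

N = 500;  A = 1.5;  b = 0.5;  sigma = 0.2;   % illustrative 1-D parameters
X = rand(1,N);                               % arbitrary inputs
d = sigma*randn(1,N);                        % residuals d_n ~ N(0, sigma^2)
Y = A*X + b + d;                             % Y_n = A X_n + b + d_n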
Statistical Interpretation

$$d_n = Y_n - AX_n - b \sim \mathcal{N}(0, \sigma^2)$$

$$P(Y_n \mid X_n) = \mathcal{N}(AX_n + b,\; \sigma^2) \propto \exp\!\left[-\frac{1}{2\sigma^2}\,\|Y_n - AX_n - b\|^2\right]$$
• We can now maximize the probability of the data under the model by adapting the parameters A, b.
• If we use the negative log-probability, we get:
$$C = \frac{1}{2\sigma^2}\sum_{n=1}^{N}\|Y_n - AX_n - b\|^2 + \mathrm{const.}$$
• Looks familiar?
• We can also optimize for σ (it won't affect A, b).
• This is called "maximum likelihood learning".
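Setting the derivative of C with respect to σ to zero gives the mean squared residual as the maximum-likelihood estimate; a minimal sketch under the isotropic-Gaussian assumption, reusing X, Y, N and the fitted A, b from the sketches above:

R        = Y - A*X - repmat(b,1,N);   % residuals d_n = Y_n - A X_n - b
sigma2ML = sum(R(:).^2) / numel(R);   % sigma^2_ML: mean squared residual per dimension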