
Page 1:

Regression

Practical Machine Learning
Fabian Wauthier

09/10/2009

Adapted from slides by Kurt Miller and Romain Thibaux

1

Page 2:

Outline

• Ordinary Least Squares Regression

- Online version

- Normal equations

- Probabilistic interpretation

• Overfitting and Regularization

• Overview of additional topics

- L1 Regression

- Quantile Regression

- Generalized linear models

- Kernel Regression and Locally Weighted Regression

2

Page 3:

Outline

• Ordinary Least Squares Regression

- Online version

- Normal equations

- Probabilistic interpretation

• Overfitting and Regularization

• Overview of additional topics

- L1 Regression

- Quantile Regression

- Generalized linear models

- Kernel Regression and Locally Weighted Regression

3

Page 4:

Regression vs. Classification:

Classification: X ⇒ Y

X can be anything:
• continuous (ℜ, ℜd, …)
• discrete ({0,1}, {1,…k}, …)
• structured (tree, string, …)
• …

Y is discrete:
– {0,1}: binary
– {1,…k}: multi-class
– tree, etc.: structured

4

Page 5:

Regression vs. Classification:

Classification: X ⇒ Y

X can be anything:
• continuous (ℜ, ℜd, …)
• discrete ({0,1}, {1,…k}, …)
• structured (tree, string, …)
• …

Methods: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Kernel trick

5

Page 6:

Regression vs. Classification:

Regression: X ⇒ Y

X can be anything:
• continuous (ℜ, ℜd, …)
• discrete ({0,1}, {1,…k}, …)
• structured (tree, string, …)
• …

Y is continuous: ℜ, ℜd

6

Page 7:

Examples

• Voltage ⇒ Temperature
• Processes, memory ⇒ Power consumption
• Protein structure ⇒ Energy
• Robot arm controls ⇒ Torque at effector
• Location, industry, past losses ⇒ Premium

7

Page 8:

Linear regression

[Figure: scatter plots of example data, y against x in one and two input dimensions]

Given examples $(x_i, y_i)$, $i = 1, \dots, n$, predict $y_{n+1}$ given a new point $x_{n+1}$.

8

Page 9:

Linear regression

We wish to estimate $y$ by a linear function of our data $x$:
$$y_{n+1} = w_0 + w_1 x_{n+1,1} + w_2 x_{n+1,2} = w^\top x_{n+1},$$
where $w$ is a parameter to be estimated and we have used the standard convention of letting the first component of $x$ be 1.

[Figure: fitted line (one input dimension) and plane (two input dimensions) through the example data]

9

Page 10:

Choosing the regressor

Of the many regression fits that approximate the data, which should we choose?

[Figure: several candidate fits through the observations]

$$X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}$$

10

Page 11:

LMS Algorithm (Least Mean Squares)

In order to clarify what we mean by a good choice of $w$, we will define a cost function for how well we are doing on the training data:
$$\text{Cost} = \frac{1}{2}\sum_{i=1}^{n} \big(w^\top x_i - y_i\big)^2, \qquad X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}$$

[Figure: observations, predictions, and the errors or "residuals" between them]

11

Page 12:

LMS Algorithm (Least Mean Squares)

The best choice of $w$ is the one that minimizes our cost function
$$E = \frac{1}{2}\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2 = \sum_{i=1}^{n} E_i.$$
In order to optimize this equation, we use standard gradient descent,
$$w_{t+1} := w_t - \alpha\,\frac{\partial E}{\partial w},$$
where
$$\frac{\partial E}{\partial w} = \sum_{i=1}^{n}\frac{\partial E_i}{\partial w} \quad\text{and}\quad \frac{\partial E_i}{\partial w} = \frac{1}{2}\,\frac{\partial}{\partial w}\big(w^\top x_i - y_i\big)^2 = \big(w^\top x_i - y_i\big)\,x_i.$$

12

Page 13:

LMS Algorithm (Least Mean Squares)

The LMS algorithm is an online method that performs the following update for each new data point:
$$w_{t+1} := w_t - \alpha\,\frac{\partial E_i}{\partial w} = w_t + \alpha\,\big(y_i - x_i^\top w_t\big)\,x_i$$

13
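A minimal NumPy sketch of this online update (the names, step size, and epoch count are illustrative, not from the slides); X is the n×d data matrix with a leading column of ones and y the vector of targets:

```python
import numpy as np

def lms(X, y, alpha=0.01, n_epochs=50):
    """Online least-mean-squares: one gradient step per data point."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in range(n):
            residual = y[i] - X[i] @ w       # y_i - x_i^T w
            w = w + alpha * residual * X[i]  # w <- w + alpha (y_i - x_i^T w) x_i
    return w
```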

Page 14:

LMS, Logistic regression, and Perceptron updates

• LMS: $w_{t+1} := w_t + \alpha\,(y_i - x_i^\top w_t)\,x_i$

• Logistic Regression: $w_{t+1} := w_t + \alpha\,(y_i - f_w(x_i))\,x_i$

• Perceptron: $w_{t+1} := w_t + \alpha\,(y_i - f_w(x_i))\,x_i$

14

Page 15:

Ordinary Least Squares (OLS)

[Figure: observations, predictions, and the errors or "residuals" between them]

$$\text{Cost} = \frac{1}{2}\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2, \qquad X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}$$

15

Page 16:

Minimize the sum squared error

Let $X$ be the $n \times d$ matrix whose rows are the $x_i^\top$ and $y$ the vector of targets. Then
$$E = \frac{1}{2}\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2 = \frac{1}{2}(Xw - y)^\top(Xw - y) = \frac{1}{2}\big(w^\top X^\top X w - 2\,y^\top X w + y^\top y\big)$$
$$\frac{\partial E}{\partial w} = X^\top X w - X^\top y$$
Setting the derivative equal to zero gives us the Normal Equations:
$$X^\top X w = X^\top y \qquad\Longrightarrow\qquad w = (X^\top X)^{-1} X^\top y$$

16
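As a concrete sketch (the synthetic data and variable names below are purely illustrative), the Normal Equations can be solved directly with NumPy; np.linalg.solve is preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 20, size=50)
X = np.column_stack([np.ones_like(x), x])          # leading column of ones, as on the slides
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=50)

# Solve X^T X w = X^T y rather than computing (X^T X)^{-1} explicitly
w = np.linalg.solve(X.T @ X, X.T @ y)
y_new = np.array([1.0, 12.0]) @ w                  # prediction w^T x_{n+1} at x = 12
```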

Page 17:

A geometric interpretation

We solved
$$\frac{\partial E}{\partial w} = X^\top (Xw - y) = 0$$
⇒ The residuals are orthogonal to the columns of $X$.
⇒ $\hat{y} = Xw$ gives the best reconstruction of $y$ in the range of $X$.

17

Page 18:

[Figure: vector $y$, its projection $y'$ onto the subspace $S$ spanned by the columns $[X]_1$ and $[X]_2$ of $X$]

• $y'$ is an orthogonal projection of $y$ onto $S$.
• The subspace $S$ is spanned by the columns of $X$.
• The residual vector $y - y'$ is orthogonal to the subspace $S$.

18

Page 19:

Computing the solution

We compute $w = (X^\top X)^{-1} X^\top y$.

If $X^\top X$ is invertible, then $(X^\top X)^{-1} X^\top$ coincides with the pseudoinverse $X^{+}$ of $X$, and the solution is unique.

If $X^\top X$ is not invertible, there is no unique solution. In that case $w = X^{+} y$ chooses the solution with the smallest Euclidean norm.

An alternative way to deal with a non-invertible $X^\top X$ is to add a small portion of the identity matrix (= Ridge regression).

19
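A brief NumPy illustration of the two routes, reusing the X and y from the earlier sketch (an assumption of convenience); np.linalg.pinv computes the pseudoinverse via the SVD:

```python
import numpy as np

# When X^T X is invertible these agree; when it is not, the
# pseudoinverse route returns the minimum-Euclidean-norm solution.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)   # requires X^T X invertible
w_pinv = np.linalg.pinv(X) @ y                 # w = X^+ y, always defined
```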

Page 20:

Beyond lines and planes

[Figure: non-linear curve fit to the one-dimensional example data]

Linear models become powerful function approximators when we consider non-linear feature transformations.
⇒ All the math is the same!
Predictions are still linear in X!

20

Page 21:

Geometric interpretation

[Matlab demo]

[Figure: 3D view of the quadratic fit]

$$y = w_0 + w_1 x + w_2 x^2$$

21
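A sketch of such a feature transformation, here a polynomial basis (the degree and helper names are illustrative); the same normal-equation solve applies unchanged to the transformed features:

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x to rows [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

def ols_fit(Phi, y):
    """Ordinary least squares on an arbitrary feature matrix Phi."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Quadratic fit y = w0 + w1 x + w2 x^2, using x and y from the earlier sketch
Phi = poly_features(x, degree=2)
w = ols_fit(Phi, y)
```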

Page 22:

Ordinary Least Squares [summary]

Given examples $(x_i, y_i)$, $i = 1, \dots, n$.
Let $X$ be the $n \times d$ matrix whose rows are the feature vectors $x_i^\top$ (for example, polynomial features), and let $y$ be the vector of targets.
Minimize $E = \frac{1}{2}(Xw - y)^\top(Xw - y)$ by solving $X^\top X w = X^\top y$.
Predict $y_{n+1} = w^\top x_{n+1}$.

22

Page 23:

Probabilistic interpretation

[Figure: regression data with the Gaussian noise model]

Likelihood: $p(y_i \mid x_i, w) \propto \exp\big\{-\tfrac{1}{2\sigma^2}(w^\top x_i - y_i)^2\big\}$

23

Page 24:

[Figure: conditional Gaussians $p(y \mid x)$ whose mean $\mu$ grows linearly with $x$; example densities at $\mu = 3, 5, 8$]

24

Page 25:

BREAK

25

Page 26:

Outline

• Ordinary Least Squares Regression

- Online version

- Normal equations

- Probabilistic interpretation

• Overfitting and Regularization

• Overview of additional topics

- L1 Regression

- Quantile Regression

- Generalized linear models

- Kernel Regression and Locally Weighted Regression

26

Page 27:

Overfitting

• So the more features the better? NO!
• Carefully selected features can improve model accuracy.
• But adding too many can lead to overfitting.
• Feature selection will be discussed in a separate lecture.

27

Page 28:

Overfitting

[Figure: degree 15 polynomial fit to the training data, illustrating overfitting]

[Matlab demo]

28

Page 29:

Ridge Regression (Regularization)

[Figure: effect of regularization on a degree-19 polynomial fit]

Minimize
$$\frac{1}{2}\Big[\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2 + \varepsilon\,\|w\|_2^2\Big]$$
with $\varepsilon$ "small", by solving
$$(X^\top X + \varepsilon I)\,w = X^\top y$$

[Continue Matlab demo]
29
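A minimal sketch of the ridge solve in NumPy, reusing the illustrative poly_features helper from above; eps is the regularization strength:

```python
import numpy as np

def ridge_fit(Phi, y, eps=1e-3):
    """Solve (Phi^T Phi + eps * I) w = Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + eps * np.eye(d), Phi.T @ y)

# High-degree polynomial features, tamed by regularization
Phi = poly_features(x, degree=15)
w_ridge = ridge_fit(Phi, y, eps=0.1)
```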

Page 30:

Probabilistic interpretation

Posterior:
$$P(w \mid X, y) = \frac{P(w, x_1, \dots, x_n, y_1, \dots, y_n)}{P(x_1, \dots, x_n, y_1, \dots, y_n)} \propto P(w, x_1, \dots, x_n, y_1, \dots, y_n)$$
$$\propto \underbrace{\exp\Big\{-\frac{\varepsilon}{2\sigma^2}\,\|w\|_2^2\Big\}}_{\text{Prior}}\;\prod_i \underbrace{\exp\Big\{-\frac{1}{2\sigma^2}\big(X_i^\top w - y_i\big)^2\Big\}}_{\text{Likelihood}} = \exp\Big\{-\frac{1}{2\sigma^2}\Big[\varepsilon\,\|w\|_2^2 + \sum_i \big(X_i^\top w - y_i\big)^2\Big]\Big\}$$
Maximizing this posterior is the same as minimizing the ridge objective on the previous slide.

30

Page 31:

Outline

• Ordinary Least Squares Regression

- Online version

- Normal equations

- Probabilistic interpretation

• Overfitting and Regularization

• Overview of additional topics

- L1 Regression

- Quantile Regression

- Generalized linear models

- Kernel Regression and Locally Weighted Regression

31

Page 32:

Errors in Variables (Total Least Squares)


32

Page 33:

Sensitivity to outliers

High weight given to outliers

[Figure: least-squares fit to "Temperature at noon" data pulled toward an outlier, together with the corresponding influence function]

33

Page 34:

L1 Regression

Minimize the sum of absolute residuals $\sum_{i=1}^{n} |w^\top x_i - y_i|$ instead of squared residuals. This can be solved as a linear program, and the influence function is bounded, so outliers receive less weight.

[Matlab demo]
34
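A sketch of the linear-programming formulation using scipy.optimize.linprog (the solver choice is an assumption of convenience; any LP solver works): introduce slack variables $t_i \ge |w^\top x_i - y_i|$ and minimize $\sum_i t_i$.

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(X, y):
    """L1 regression as an LP: minimize sum(t) subject to -t <= Xw - y <= t."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(n)])   # objective: sum of slacks t
    A_ub = np.block([[X, -np.eye(n)],               #  Xw - t <= y
                     [-X, -np.eye(n)]])             # -Xw - t <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * d + [(0, None)] * n   # w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d]
```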

Page 35:

Quantile Regression

[Figure: CPU utilization [MHz] vs. workload (ViewItem.php) [req/s], with regression lines for the mean CPU and the 95th percentile of CPU]

Slide courtesy of Peter Bodik
35

Page 36:

Generalized Linear Models

Probabilistic interpretation of OLS: the mean is linear in $X_i$.

OLS: linearly predict the mean of a Gaussian conditional.

GLM: predict the mean of some other conditional density,
$$y_i \mid x_i \sim p\big(f(X_i^\top w)\big).$$
We may need to transform the linear prediction by $f(\cdot)$ to produce a valid parameter.

36

Page 37:

Example: “Poisson regression”

Suppose the data are event counts: $y \in \mathbb{N}_0$.

The typical distribution for count data is the Poisson:
$$\mathrm{Poisson}(y \mid \lambda) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad \text{mean parameter } \lambda > 0.$$
Say we predict $\lambda = f(x^\top w) = \exp\{x^\top w\}$.

GLM: $y_i \mid x_i \sim \mathrm{Poisson}\big(f(X_i^\top w)\big)$

37

Page 38:

[Figure: conditional Poissons $p(y \mid x)$ with mean $\lambda$ increasing in $x$; example distributions at $\lambda = 3, 5, 8$]

38

Page 39:

Poisson regression: learning

As for OLS: optimize $w$ by maximizing the likelihood of the data. Equivalently, maximize the log likelihood.

Likelihood:
$$L = \prod_i \mathrm{Poisson}\big(y_i \mid f(X_i^\top w)\big)$$
Log likelihood:
$$\ell = \sum_i \big(X_i^\top w\, y_i - \exp\{X_i^\top w\}\big) + \text{const.}$$
Batch gradient:
$$\frac{\partial \ell}{\partial w} = \sum_i \big(y_i - \exp\{X_i^\top w\}\big)\,X_i = \sum_i \underbrace{\big(y_i - f(X_i^\top w)\big)}_{\text{``residual''}}\,X_i$$

39
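A hedged NumPy sketch of batch gradient ascent on this log likelihood (the step size and iteration count are illustrative, not tuned):

```python
import numpy as np

def poisson_regression(X, y, alpha=1e-3, n_iters=2000):
    """Maximize the Poisson log likelihood by batch gradient ascent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        mu = np.exp(X @ w)      # predicted means f(X_i^T w) = exp(X_i^T w)
        grad = X.T @ (y - mu)   # sum_i (y_i - mu_i) X_i
        w = w + alpha * grad    # ascend the log likelihood
    return w
```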

Page 40:

LMS, Logistic regression, Perceptron and GLM updates

• GLM (online): $w_{t+1} := w_t + \alpha\,(y_i - f_w(x_i))\,x_i$

• LMS: $w_{t+1} := w_t + \alpha\,(y_i - x_i^\top w_t)\,x_i$

• Logistic Regression: $w_{t+1} := w_t + \alpha\,(y_i - f_w(x_i))\,x_i$

• Perceptron: $w_{t+1} := w_t + \alpha\,(y_i - f_w(x_i))\,x_i$

40

Page 41:

Kernel Regression and Locally Weighted Linear Regression

• Kernel Regression: take a very, very conservative function approximator called AVERAGING. Locally weight it.

• Locally Weighted Linear Regression: take a conservative function approximator called LINEAR REGRESSION. Locally weight it.

Slide from Paul Viola, 2003
41

Page 42:

Kernel Regression

[Figure: kernel regression fit (sigma = 1) to the example data]

42
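A sketch of this locally weighted averaging (Nadaraya-Watson kernel regression) with a Gaussian kernel; sigma plays the role of the bandwidth in the demo:

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, sigma=1.0):
    """Each prediction is a kernel-weighted average of the training targets."""
    diffs = x_query[:, None] - x_train[None, :]
    K = np.exp(-0.5 * (diffs / sigma) ** 2)   # Gaussian kernel weights
    return (K @ y_train) / K.sum(axis=1)
```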

Page 43:


Locally Weighted Linear Regression (LWR)

[Figure: Kernel regression (sigma = 1)]

OLS cost function:
$$E = \frac{1}{2}\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2$$
LWR cost function:
$$E' = \sum_{i=1}^{n} k(x_i - x)\,\big(w^\top x_i - y_i\big)^2$$

[Matlab demo]
43
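A sketch of LWR under the same Gaussian-kernel assumption: each query point gets its own kernel-weighted least-squares fit.

```python
import numpy as np

def lwr_predict(x_train, y_train, x_query, sigma=1.0):
    """Fit a separate kernel-weighted linear model at every query point."""
    X = np.column_stack([np.ones_like(x_train), x_train])
    preds = []
    for xq in x_query:
        k = np.exp(-0.5 * ((x_train - xq) / sigma) ** 2)   # weights k(x_i - x)
        W = np.diag(k)
        # Weighted normal equations: X^T W X w = X^T W y
        w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
        preds.append(w[0] + w[1] * xq)
    return np.array(preds)
```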

Page 44:

Heteroscedasticity

[Figure: #requests per minute over time (days); the spread of the noise changes over time]

44

Page 45:

What we covered

• Ordinary Least Squares Regression

- Online version

- Normal equations

- Probabilistic interpretation

• Overfitting and Regularization

• Overview of additional topics

- L1 Regression

- Quantile Regression

- Generalized linear models

- Kernel Regression and Locally Weighted Regression

45