
Page 1:

Regression

Practical Machine Learning
Fabian Wauthier

09/10/2009

Adapted from slides by Kurt Miller and Romain Thibaux

1

Page 2:

Outline

• Ordinary Least Squares Regression

- Online version

- Normal equations

- Probabilistic interpretation

• Overfitting and Regularization

• Overview of additional topics

- L1 Regression

- Quantile Regression

- Generalized linear models

- Kernel Regression and Locally Weighted Regression

2

Page 3:

Outline

• Ordinary Least Squares Regression

- Online version

- Normal equations

- Probabilistic interpretation

• Overfitting and Regularization

• Overview of additional topics

- L1 Regression

- Quantile Regression

- Generalized linear models

- Kernel Regression and Locally Weighted Regression

3

Page 4:

Regression vs. Classification:

Classification: X ⇒ Y

X can be anything:
• continuous (ℜ, ℜd, …)
• discrete ({0,1}, {1,…k}, …)
• structured (tree, string, …)
• …

Y is discrete:
– {0,1}: binary
– {1,…k}: multi-class
– tree, etc.: structured

4

Page 5:

Regression vs. Classification:

Classification: X ⇒ Y

X can be anything:
• continuous (ℜ, ℜd, …)
• discrete ({0,1}, {1,…k}, …)
• structured (tree, string, …)
• …

Methods: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Kernel trick

5

Page 6:

Regression vs. Classification:

Regression: X ⇒ Y

X can be anything:
• continuous (ℜ, ℜd, …)
• discrete ({0,1}, {1,…k}, …)
• structured (tree, string, …)
• …

Y is continuous: ℜ, ℜd

6

Page 7:

Examples

• Voltage ⇒ Temperature
• Processes, memory ⇒ Power consumption
• Protein structure ⇒ Energy
• Robot arm controls ⇒ Torque at effector
• Location, industry, past losses ⇒ Premium

7

Page 8:

Linear regression

[Figure: scatter plots of example data, y against x in one and two input dimensions]

Given examples $(x_i, y_i)$, $i = 1, \dots, n$, predict $y_{n+1}$ given a new point $x_{n+1}$.

8

Page 9:

Linear regression

We wish to estimate $y$ by a linear function of our data $x$:
$$y_{n+1} = w_0 + w_1 x_{n+1,1} + w_2 x_{n+1,2} = w^\top x_{n+1},$$
where $w$ is a parameter to be estimated and we have used the standard convention of letting the first component of $x$ be 1.

[Figure: fitted line (one input dimension) and plane (two input dimensions) through the example data]

9

Page 10:

Choosing the regressor

Of the many regression fits that approximate the data, which should we choose?

[Figure: several candidate fits through the observations]

$$X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}$$

10

Page 11:

LMS Algorithm (Least Mean Squares)

In order to clarify what we mean by a good choice of $w$, we will define a cost function for how well we are doing on the training data:
$$\text{Cost} = \frac{1}{2}\sum_{i=1}^{n} \big(w^\top x_i - y_i\big)^2, \qquad X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}$$

[Figure: observations, predictions, and the errors or "residuals" between them]

11

Page 12:

LMS Algorithm (Least Mean Squares)

The best choice of $w$ is the one that minimizes our cost function
$$E = \frac{1}{2}\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2 = \sum_{i=1}^{n} E_i.$$
In order to optimize this equation, we use standard gradient descent,
$$w_{t+1} := w_t - \alpha\,\frac{\partial E}{\partial w},$$
where
$$\frac{\partial E}{\partial w} = \sum_{i=1}^{n}\frac{\partial E_i}{\partial w} \quad\text{and}\quad \frac{\partial E_i}{\partial w} = \frac{1}{2}\,\frac{\partial}{\partial w}\big(w^\top x_i - y_i\big)^2 = \big(w^\top x_i - y_i\big)\,x_i.$$

12

Page 13:

LMS Algorithm (Least Mean Squares)

The LMS algorithm is an online method that performs the following update for each new data point:
$$w_{t+1} := w_t - \alpha\,\frac{\partial E_i}{\partial w} = w_t + \alpha\,\big(y_i - x_i^\top w_t\big)\,x_i$$

13
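A minimal NumPy sketch of this online update (the names, step size, and epoch count are illustrative, not from the slides); X is the n×d data matrix with a leading column of ones and y the vector of targets:

```python
import numpy as np

def lms(X, y, alpha=0.01, n_epochs=50):
    """Online least-mean-squares: one gradient step per data point."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in range(n):
            residual = y[i] - X[i] @ w       # y_i - x_i^T w
            w = w + alpha * residual * X[i]  # w <- w + alpha (y_i - x_i^T w) x_i
    return w
```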

Page 14:

LMS, Logistic regression, and Perceptron updates

• LMS: $w_{t+1} := w_t + \alpha\,(y_i - x_i^\top w_t)\,x_i$

• Logistic Regression: $w_{t+1} := w_t + \alpha\,(y_i - f_w(x_i))\,x_i$

• Perceptron: $w_{t+1} := w_t + \alpha\,(y_i - f_w(x_i))\,x_i$

14

Page 15:

Ordinary Least Squares (OLS)

[Figure: observations, predictions, and the errors or "residuals" between them]

$$\text{Cost} = \frac{1}{2}\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2, \qquad X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}$$

15

Page 16:

Minimize the sum squared error

Let $X$ be the $n \times d$ matrix whose rows are the $x_i^\top$ and $y$ the vector of targets. Then
$$E = \frac{1}{2}\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2 = \frac{1}{2}(Xw - y)^\top(Xw - y) = \frac{1}{2}\big(w^\top X^\top X w - 2\,y^\top X w + y^\top y\big)$$
$$\frac{\partial E}{\partial w} = X^\top X w - X^\top y$$
Setting the derivative equal to zero gives us the Normal Equations:
$$X^\top X w = X^\top y \qquad\Longrightarrow\qquad w = (X^\top X)^{-1} X^\top y$$

16
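As a concrete sketch (the synthetic data and variable names below are purely illustrative), the Normal Equations can be solved directly with NumPy; np.linalg.solve is preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 20, size=50)
X = np.column_stack([np.ones_like(x), x])          # leading column of ones, as on the slides
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=50)

# Solve X^T X w = X^T y rather than computing (X^T X)^{-1} explicitly
w = np.linalg.solve(X.T @ X, X.T @ y)
y_new = np.array([1.0, 12.0]) @ w                  # prediction w^T x_{n+1} at x = 12
```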

Page 17:

A geometric interpretation

We solved
$$\frac{\partial E}{\partial w} = X^\top (Xw - y) = 0$$
⇒ The residuals are orthogonal to the columns of $X$.
⇒ $\hat{y} = Xw$ gives the best reconstruction of $y$ in the range of $X$.

17

Page 18:

[Figure: vector $y$, its projection $y'$ onto the subspace $S$ spanned by the columns $[X]_1$ and $[X]_2$ of $X$]

• $y'$ is an orthogonal projection of $y$ onto $S$.
• The subspace $S$ is spanned by the columns of $X$.
• The residual vector $y - y'$ is orthogonal to the subspace $S$.

18

Page 19:

Computing the solution

We compute $w = (X^\top X)^{-1} X^\top y$.

If $X^\top X$ is invertible, then $(X^\top X)^{-1} X^\top$ coincides with the pseudoinverse $X^{+}$ of $X$, and the solution is unique.

If $X^\top X$ is not invertible, there is no unique solution. In that case $w = X^{+} y$ chooses the solution with the smallest Euclidean norm.

An alternative way to deal with a non-invertible $X^\top X$ is to add a small portion of the identity matrix (= Ridge regression).

19
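A brief NumPy illustration of the two routes, reusing the X and y from the earlier sketch (an assumption of convenience); np.linalg.pinv computes the pseudoinverse via the SVD:

```python
import numpy as np

# When X^T X is invertible these agree; when it is not, the
# pseudoinverse route returns the minimum-Euclidean-norm solution.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)   # requires X^T X invertible
w_pinv = np.linalg.pinv(X) @ y                 # w = X^+ y, always defined
```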

Page 20:

Beyond lines and planes

[Figure: non-linear curve fit to the one-dimensional example data]

Linear models become powerful function approximators when we consider non-linear feature transformations.
⇒ All the math is the same!
Predictions are still linear in X!

20

Page 21:

Geometric interpretation

[Matlab demo]

[Figure: 3D view of the quadratic fit]

$$y = w_0 + w_1 x + w_2 x^2$$

21
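A sketch of such a feature transformation, here a polynomial basis (the degree and helper names are illustrative); the same normal-equation solve applies unchanged to the transformed features:

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x to rows [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

def ols_fit(Phi, y):
    """Ordinary least squares on an arbitrary feature matrix Phi."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Quadratic fit y = w0 + w1 x + w2 x^2, using x and y from the earlier sketch
Phi = poly_features(x, degree=2)
w = ols_fit(Phi, y)
```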

Page 22:

Ordinary Least Squares [summary]

Given examples $(x_i, y_i)$, $i = 1, \dots, n$.
Let $X$ be the $n \times d$ matrix whose rows are the feature vectors $x_i^\top$ (for example, polynomial features), and let $y$ be the vector of targets.
Minimize $E = \frac{1}{2}(Xw - y)^\top(Xw - y)$ by solving $X^\top X w = X^\top y$.
Predict $y_{n+1} = w^\top x_{n+1}$.

22

Page 23:

Probabilistic interpretation

[Figure: regression data with the Gaussian noise model]

Likelihood: $p(y_i \mid x_i, w) \propto \exp\big\{-\tfrac{1}{2\sigma^2}(w^\top x_i - y_i)^2\big\}$

23

Page 24:

[Figure: conditional Gaussians $p(y \mid x)$ whose mean $\mu$ grows linearly with $x$; example densities at $\mu = 3, 5, 8$]

24

Page 25:

BREAK

25

Page 26:

Outline

• Ordinary Least Squares Regression

- Online version

- Normal equations

- Probabilistic interpretation

• Overfitting and Regularization

• Overview of additional topics

- L1 Regression

- Quantile Regression

- Generalized linear models

- Kernel Regression and Locally Weighted Regression

26

Page 27:

Overfitting

• So the more features the better? NO!
• Carefully selected features can improve model accuracy.
• But adding too many can lead to overfitting.
• Feature selection will be discussed in a separate lecture.

27

Page 28:

Overfitting

[Figure: degree 15 polynomial fit to the training data, illustrating overfitting]

[Matlab demo]

28

Page 29:

Ridge Regression (Regularization)

[Figure: effect of regularization on a degree-19 polynomial fit]

Minimize
$$\frac{1}{2}\Big[\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2 + \varepsilon\,\|w\|_2^2\Big]$$
with $\varepsilon$ "small", by solving
$$(X^\top X + \varepsilon I)\,w = X^\top y$$

[Continue Matlab demo]
29
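A minimal sketch of the ridge solve in NumPy, reusing the illustrative poly_features helper from above; eps is the regularization strength:

```python
import numpy as np

def ridge_fit(Phi, y, eps=1e-3):
    """Solve (Phi^T Phi + eps * I) w = Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + eps * np.eye(d), Phi.T @ y)

# High-degree polynomial features, tamed by regularization
Phi = poly_features(x, degree=15)
w_ridge = ridge_fit(Phi, y, eps=0.1)
```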

Page 30:

Probabilistic interpretation

Posterior:
$$P(w \mid X, y) = \frac{P(w, x_1, \dots, x_n, y_1, \dots, y_n)}{P(x_1, \dots, x_n, y_1, \dots, y_n)} \propto P(w, x_1, \dots, x_n, y_1, \dots, y_n)$$
$$\propto \underbrace{\exp\Big\{-\frac{\varepsilon}{2\sigma^2}\,\|w\|_2^2\Big\}}_{\text{Prior}}\;\prod_i \underbrace{\exp\Big\{-\frac{1}{2\sigma^2}\big(X_i^\top w - y_i\big)^2\Big\}}_{\text{Likelihood}} = \exp\Big\{-\frac{1}{2\sigma^2}\Big[\varepsilon\,\|w\|_2^2 + \sum_i \big(X_i^\top w - y_i\big)^2\Big]\Big\}$$
Maximizing this posterior is the same as minimizing the ridge objective on the previous slide.

30

Page 31:

Outline

• Ordinary Least Squares Regression

- Online version

- Normal equations

- Probabilistic interpretation

• Overfitting and Regularization

• Overview of additional topics

- L1 Regression

- Quantile Regression

- Generalized linear models

- Kernel Regression and Locally Weighted Regression

31

Page 32:

Errors in Variables (Total Least Squares)


32

Page 33:

Sensitivity to outliers

High weight given to outliers

[Figure: least-squares fit to "Temperature at noon" data pulled toward an outlier, together with the corresponding influence function]

33

Page 34:

L1 Regression

Minimize the sum of absolute residuals $\sum_{i=1}^{n} |w^\top x_i - y_i|$ instead of squared residuals. This can be solved as a linear program, and the influence function is bounded, so outliers receive less weight.

[Matlab demo]
34
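A sketch of the linear-programming formulation using scipy.optimize.linprog (the solver choice is an assumption of convenience; any LP solver works): introduce slack variables $t_i \ge |w^\top x_i - y_i|$ and minimize $\sum_i t_i$.

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(X, y):
    """L1 regression as an LP: minimize sum(t) subject to -t <= Xw - y <= t."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(n)])   # objective: sum of slacks t
    A_ub = np.block([[X, -np.eye(n)],               #  Xw - t <= y
                     [-X, -np.eye(n)]])             # -Xw - t <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * d + [(0, None)] * n   # w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d]
```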

Page 35:

Quantile Regression

[Figure: CPU utilization [MHz] vs. workload (ViewItem.php) [req/s], with regression lines for the mean CPU and the 95th percentile of CPU]

Slide courtesy of Peter Bodik
35

Page 36:

Generalized Linear Models

Probabilistic interpretation of OLS: the mean is linear in $X_i$.

OLS: linearly predict the mean of a Gaussian conditional.

GLM: predict the mean of some other conditional density,
$$y_i \mid x_i \sim p\big(f(X_i^\top w)\big).$$
We may need to transform the linear prediction by $f(\cdot)$ to produce a valid parameter.

36

Page 37:

Example: “Poisson regression”

Suppose the data are event counts: $y \in \mathbb{N}_0$.

The typical distribution for count data is the Poisson:
$$\mathrm{Poisson}(y \mid \lambda) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad \text{mean parameter } \lambda > 0.$$
Say we predict $\lambda = f(x^\top w) = \exp\{x^\top w\}$.

GLM: $y_i \mid x_i \sim \mathrm{Poisson}\big(f(X_i^\top w)\big)$

37

Page 38:

[Figure: conditional Poissons $p(y \mid x)$ with mean $\lambda$ increasing in $x$; example distributions at $\lambda = 3, 5, 8$]

38

Page 39:

Poisson regression: learning

As for OLS: optimize $w$ by maximizing the likelihood of the data. Equivalently, maximize the log likelihood.

Likelihood:
$$L = \prod_i \mathrm{Poisson}\big(y_i \mid f(X_i^\top w)\big)$$
Log likelihood:
$$\ell = \sum_i \big(X_i^\top w\, y_i - \exp\{X_i^\top w\}\big) + \text{const.}$$
Batch gradient:
$$\frac{\partial \ell}{\partial w} = \sum_i \big(y_i - \exp\{X_i^\top w\}\big)\,X_i = \sum_i \underbrace{\big(y_i - f(X_i^\top w)\big)}_{\text{``residual''}}\,X_i$$

39
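A hedged NumPy sketch of batch gradient ascent on this log likelihood (the step size and iteration count are illustrative, not tuned):

```python
import numpy as np

def poisson_regression(X, y, alpha=1e-3, n_iters=2000):
    """Maximize the Poisson log likelihood by batch gradient ascent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        mu = np.exp(X @ w)      # predicted means f(X_i^T w) = exp(X_i^T w)
        grad = X.T @ (y - mu)   # sum_i (y_i - mu_i) X_i
        w = w + alpha * grad    # ascend the log likelihood
    return w
```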

Page 40:

LMS, Logistic regression, Perceptron and GLM updates

• GLM (online): $w_{t+1} := w_t + \alpha\,(y_i - f_w(x_i))\,x_i$

• LMS: $w_{t+1} := w_t + \alpha\,(y_i - x_i^\top w_t)\,x_i$

• Logistic Regression: $w_{t+1} := w_t + \alpha\,(y_i - f_w(x_i))\,x_i$

• Perceptron: $w_{t+1} := w_t + \alpha\,(y_i - f_w(x_i))\,x_i$

40

Page 41:

Kernel Regression and Locally Weighted Linear Regression

• Kernel Regression: take a very, very conservative function approximator called AVERAGING. Locally weight it.

• Locally Weighted Linear Regression: take a conservative function approximator called LINEAR REGRESSION. Locally weight it.

Slide from Paul Viola, 2003
41

Page 42:

Kernel Regression

[Figure: kernel regression fit (sigma = 1) to the example data]

42
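A sketch of this locally weighted averaging (Nadaraya-Watson kernel regression) with a Gaussian kernel; sigma plays the role of the bandwidth in the demo:

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, sigma=1.0):
    """Each prediction is a kernel-weighted average of the training targets."""
    diffs = x_query[:, None] - x_train[None, :]
    K = np.exp(-0.5 * (diffs / sigma) ** 2)   # Gaussian kernel weights
    return (K @ y_train) / K.sum(axis=1)
```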

Page 43:


Locally Weighted Linear Regression (LWR)

[Figure: Kernel regression (sigma = 1)]

OLS cost function:
$$E = \frac{1}{2}\sum_{i=1}^{n}\big(w^\top x_i - y_i\big)^2$$
LWR cost function:
$$E' = \sum_{i=1}^{n} k(x_i - x)\,\big(w^\top x_i - y_i\big)^2$$

[Matlab demo]
43
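A sketch of LWR under the same Gaussian-kernel assumption: each query point gets its own kernel-weighted least-squares fit.

```python
import numpy as np

def lwr_predict(x_train, y_train, x_query, sigma=1.0):
    """Fit a separate kernel-weighted linear model at every query point."""
    X = np.column_stack([np.ones_like(x_train), x_train])
    preds = []
    for xq in x_query:
        k = np.exp(-0.5 * ((x_train - xq) / sigma) ** 2)   # weights k(x_i - x)
        W = np.diag(k)
        # Weighted normal equations: X^T W X w = X^T W y
        w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
        preds.append(w[0] + w[1] * xq)
    return np.array(preds)
```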

Page 44:

Heteroscedasticity

[Figure: #requests per minute over time (days); the spread of the noise changes over time]

44

Page 45:

What we covered

• Ordinary Least Squares Regression

- Online version

- Normal equations

- Probabilistic interpretation

• Overfitting and Regularization

• Overview of additional topics

- L1 Regression

- Quantile Regression

- Generalized linear models

- Kernel Regression and Locally Weighted Regression

45