

Machine learning: lecture 12

Tommi S. Jaakkola

MIT CSAIL

[email protected]


Topics

• Complexity and model selection

– structural risk minimization

• Complexity, compression, and model selection

– description length

– minimum description length principle

Tommi Jaakkola, MIT CSAIL 2


VC-dimension: review

Shattering: A set of classifiers F (e.g., linear classifiers) is said to shatter n points x1, . . . , xn if for every possible configuration of labels y1, . . . , yn we can find h ∈ F that reproduces those labels.

VC-dimension: The VC-dimension dVC of a set of classifiers F is the largest number of points that F can shatter (maximized over the choice of the n points). For example, linear classifiers in the plane shatter 3 points in general position but no set of 4 points, so their VC-dimension is 3.

Learning: We don’t expect to learn anything until we have more than dVC training examples and labels (this statement will be refined later on).

Tommi Jaakkola, MIT CSAIL 3
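
To make the shattering definition concrete, here is a brute-force check (my own illustration, not from the lecture): for a small point set in the plane, try every labeling and test linear separability via an LP feasibility problem. The point sets and the use of scipy.optimize.linprog are assumptions of this sketch.

# Brute-force shattering check for linear classifiers in the plane.
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Is the labeling y in {-1,+1}^n realizable by a linear classifier?"""
    n = len(y)
    # Feasibility LP: find (w1, w2, b) with y_i (w . x_i + b) >= 1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * 3)
    return res.success

def shattered(X):
    """Can linear classifiers reproduce all 2^n labelings of the points X?"""
    return all(separable(X, np.array(y))
               for y in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0., 0.], [1., 0.], [0., 1.]])            # general position
four = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])   # square corners
print(shattered(three))   # True: 3 points are shattered
print(shattered(four))    # False: the XOR labeling is not linearly separable

Since no 4 points in the plane can be shattered, this matches the dVC = 3 example above.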


The number of labelings

[Plot: log2(# of labelings) vs. # of training examples: the curve rises linearly (as n) up to n = dVC and only logarithmically thereafter.]

n ≤ dVC : # of labelings = 2^n

n > dVC : # of labelings ≤ (en/dVC)^dVC

Tommi Jaakkola, MIT CSAIL 4
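
A small sketch of the two regimes above (my own illustration): the log of the number of labelings grows linearly in n up to dVC and only logarithmically beyond it.

import math

def log2_max_labelings(n, d_vc):
    if n <= d_vc:
        return float(n)                            # log2(2^n) = n
    return d_vc * math.log2(math.e * n / d_vc)     # log2((en/d_vc)^d_vc)

for n in (5, 10, 20, 50, 100, 200):
    print(n, round(log2_max_labelings(n, d_vc=10), 1))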


Learning and VC-dimension

• By essentially replacing log M in the finite case with the log of the number of possible labelings by the set of classifiers over n (really 2n) points, we get an analogous result:

Theorem: With probability at least 1 − δ over the choice of the training set, for all h ∈ F

E(h) ≤ En(h) + ε(n, dVC, δ)

where

ε(n, dVC, δ) = √( [ dVC (log(2n/dVC) + 1) + log(1/(4δ)) ] / n )

Tommi Jaakkola, MIT CSAIL 5


Model selection

• We try to find the model with the best balance of complexity and fit to the training data

• Ideally, we would select a model from a nested sequence of models of increasing complexity (VC-dimension)

Model 1, F1   VC-dim = d1
Model 2, F2   VC-dim = d2
Model 3, F3   VC-dim = d3

where F1 ⊆ F2 ⊆ F3 ⊆ . . .

• Model selection criterion: find the model (set of classifiers) that achieves the lowest upper bound on the expected loss (generalization error):

Expected error ≤ Training error + Complexity penalty

Tommi Jaakkola, MIT CSAIL 6


Structural risk minimization

• We choose the model class Fi that minimizes the upper bound on the expected error:

E(hi) ≤ En(hi) + √( [ di (log(2n/di) + 1) + log(1/(4δ)) ] / n )

where hi is the classifier from Fi that minimizes the training error.

[Plot: bound, training error, and complexity penalty vs. VC dimension; the training error decreases with VC dimension, the complexity penalty grows, and their sum (the bound) is minimized at an intermediate VC dimension.]

Tommi Jaakkola, MIT CSAIL 7


Example

• Models of increasing complexity

Model 1   K(x1, x2) = (1 + x1ᵀx2)
Model 2   K(x1, x2) = (1 + x1ᵀx2)²
Model 3   K(x1, x2) = (1 + x1ᵀx2)³
. . .

• These are nested, i.e.,

F1 ⊆ F2 ⊆ F3 ⊆ . . .

where Fk refers to the set of possible decision boundaries that model k can represent.

Tommi Jaakkola, MIT CSAIL 8
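
A minimal sketch of the nested kernels above (the function name and test vectors are mine). Since (1 + x1ᵀx2)^k expands into all monomials of degree up to k, each model’s feature space contains the previous one, which is why the classes are nested.

import numpy as np

def poly_kernel(x1, x2, k):
    # Model k: K(x1, x2) = (1 + x1 . x2)^k
    return (1.0 + np.dot(x1, x2)) ** k

x1, x2 = np.array([1.0, -0.5]), np.array([0.2, 0.4])
for k in (1, 2, 3):
    print(k, poly_kernel(x1, x2, k))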


Structural risk minimization: example

[Four plots: decision boundaries on the same data set for the linear, 2nd order polynomial, 4th order polynomial, and 8th order polynomial kernels.]

Tommi Jaakkola, MIT CSAIL 9


Structural risk minimization: example cont’d

• Number of training examples n = 50, confidence parameter δ = 0.05.

Model       dVC   Empirical fit   ε(n, dVC, δ)
1st order     3        0.06          0.5501
2nd order     6        0.06          0.6999
4th order    15        0.04          0.9494
8th order    45        0.02          1.2849

• Structural risk minimization would select the simplest (linear) model in this case.

Tommi Jaakkola, MIT CSAIL 10
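
A quick check of the table (my own code; taking the logs in ε as natural logs reproduces the numbers above):

import math

def eps(n, d, delta):
    # Complexity penalty from the earlier slides.
    return math.sqrt((d * (math.log(2 * n / d) + 1)
                      + math.log(1 / (4 * delta))) / n)

n, delta = 50, 0.05
for name, d, fit in [("1st order", 3, 0.06), ("2nd order", 6, 0.06),
                     ("4th order", 15, 0.04), ("8th order", 45, 0.02)]:
    print(f"{name}: eps = {eps(n, d, delta):.4f}, "
          f"bound = {fit + eps(n, d, delta):.4f}")

The 1st order model attains the smallest bound (0.06 + 0.5501 = 0.6101), matching the selection above.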


Complexity and margin

• The number of possible labelings of points with large margin can be dramatically less than the (basic) VC-dimension would imply

[Plot: o’s and x’s separated by a hyperplane with a large margin.]

• The set of separating hyperplanes which attain margin γ or better for examples within a sphere of radius R has VC-dimension bounded by dVC(γ) ≤ R²/γ²

Tommi Jaakkola, MIT CSAIL 11
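
A one-line numeric illustration (the values R = 1 and γ = 0.25 are my own): the margin-based bound can be far smaller than the input dimension suggests.

R, gamma = 1.0, 0.25
print(R ** 2 / gamma ** 2)   # d_VC(gamma) <= 16, in any input dimension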


Topics

• Complexity and model selection

– structural risk minimization

• Complexity, compression, and model selection

– description length

– minimum description length principle

Tommi Jaakkola, MIT CSAIL 12


Data compression and model selection

• We can alternatively view model selection as a problem of finding the best way of communicating the available data

• Compression and learning:

[Diagram: the sender has the examples x1, x2, . . . , xn together with their labels y1, y2, . . . , yn; the receiver has the same x1, x2, . . . , xn but unknown labels. What should be sent?]

Tommi Jaakkola, MIT CSAIL 13


Data compression and model selection

• We can alternatively view model selection as a problem of finding the best way of communicating the available data

• Compression and learning:

[Diagram: as above; the sender fits a classifier h(x; θ) to its labeled examples.]

Tommi Jaakkola, MIT CSAIL 14


Data compression and model selection

• We can alternatively view model selection as a problem of finding the best way of communicating the available data

• Compression and learning:

[Diagram: as above; the sender transmits the parameter estimates θ, from which the receiver reconstructs h(x; θ).]

Tommi Jaakkola, MIT CSAIL 15


Data compression and model selection

• We can alternatively view model selection as a problem of finding the best way of communicating the available data

• Compression and learning:

[Diagram: as above; with θ received, both sides have h(x; θ) and the receiver can predict every label.]

Tommi Jaakkola, MIT CSAIL 16


Data compression and model selection

• We can alternatively view model selection as a problem of finding the best way of communicating the available data

• Compression and learning:

[Diagram: as above; the sender transmits θ together with the prediction errors of h(x; θ), so the receiver recovers the labels exactly.]

Tommi Jaakkola, MIT CSAIL 17


Data compression and model selection

• We can alternatively view model selection as a problem of finding the best way of communicating the available data

• Compression and learning:

[Diagram: as above; θ and the prediction errors are transmitted.]

The receiver already knows
– input examples, models we consider

Need to communicate
– model class, parameter estimates, prediction errors

Tommi Jaakkola, MIT CSAIL 18
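
One hedged way to make "model class, parameter estimates, prediction errors" quantitative (the coding scheme and all constants below are my own choices, not the lecture’s): count the bits for each part of the message.

import math

def description_length(n, n_errors, n_params,
                       bits_per_param=16.0, n_model_classes=8):
    model_bits = math.log2(n_model_classes)    # which model class was used
    param_bits = n_params * bits_per_param     # quantized parameter estimates
    # Prediction errors: send the error count, then which examples were wrong.
    error_bits = math.log2(n + 1) + math.log2(math.comb(n, n_errors))
    return model_bits + param_bits + error_bits

print(description_length(n=50, n_errors=0, n_params=45))   # ~728.7 bits
print(description_length(n=50, n_errors=3, n_params=3))    # ~70.9 bits

Under this scheme the simple model with a few errors yields a far shorter total description, the same trade-off the bound-based selection captured earlier.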


Compression and sequential estimation

• We don’t have to communicate any real-valued parameters if we set up the learning problem sequentially

[Diagram: the sender has the labeled examples, the receiver only x1, x2, . . . , xn. What should be sent?]

Tommi Jaakkola, MIT CSAIL 19


Compression and sequential estimation

• We don’t have to communicate any real-valued parameters if we set up the learning problem sequentially

θ0 : default parameter values

[Diagram: as above; the first label is predicted with h(x; θ0).]

Tommi Jaakkola, MIT CSAIL 20


Compression and sequential estimation

• We don’t have to communicate any real-valued parameters if we set up the learning problem sequentially

θ0 : default parameter values
θ1 : based on θ0 and (x1, y1)

[Diagram: as above; the next label is predicted with h(x; θ1).]

Tommi Jaakkola, MIT CSAIL 21




Compression and sequential estimation

• We don’t have to communicate any real-valued parameters if we set up the learning problem sequentially

θ0 : default parameter values
θ1 : based on θ0 and (x1, y1)
· · ·
θn−1 : based on θ0 and (x1, y1), . . . , (xn−1, yn−1)

[Diagram: as above; the last label is predicted with h(x; θn−1).]

Tommi Jaakkola, MIT CSAIL 23


Compression and sequential estimation

• We don’t have to communicate any real-valued parameters if we set up the learning problem sequentially

θ0 : default parameter values
θ1 : based on θ0 and (x1, y1)
· · ·
θn−1 : based on θ0 and (x1, y1), . . . , (xn−1, yn−1)

[Diagram: as above; only the prediction errors are transmitted.]

– we only need to communicate the model class (index) and prediction errors

– but the answer depends on the sequential order

Tommi Jaakkola, MIT CSAIL 24
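
A rough sketch of the sequential scheme (the perceptron update is my illustrative choice, not the lecture’s): both sides start from the same default θ0 and apply the same deterministic update rule, so only the positions of the prediction errors need to be transmitted.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)   # toy separable labels

theta = np.zeros(2)    # theta_0: default parameters, known to both sides
errors = []            # the only message content
for t, (x, label) in enumerate(zip(X, y)):
    pred = 1.0 if theta @ x >= 0 else -1.0   # predict with the current theta
    if pred != label:
        errors.append(t)            # flag the mistake; receiver infers label = -pred
        theta = theta + label * x   # identical update on both sides

print(len(errors), "of", len(y), "labels need to be flagged as errors")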


Probabilistic sequential prediction

• To communicate the labels effectively we need to cast the problem in probabilistic terms

Tommi Jaakkola, MIT CSAIL 25


Probabilistic sequential prediction

• To communicate the labels effectively we need to cast the problem in probabilistic terms

• Suppose we define a model P(y|x, θ), θ ∈ Θ, and a prior P(θ), both known to the receiver

Tommi Jaakkola, MIT CSAIL 26


Probabilistic sequential prediction

• To communicate the labels effectively we need to cast the problem in probabilistic terms

• Suppose we define a model P(y|x, θ), θ ∈ Θ, and a prior P(θ), both known to the receiver

We predict the first label according to

y1|x1 :  P(y1|x1) = ∫ P(y1|x1, θ) P(θ) dθ

and update the prior (posterior)

P(θ|D1) = P(θ) P(y1|x1, θ) / P(y1|x1)

where D1 = {(x1, y1)}.

Tommi Jaakkola, MIT CSAIL 27


Probabilistic sequential prediction

• To communicate the labels effectively we need to cast the problem in probabilistic terms

• Suppose we define a model P(y|x, θ), θ ∈ Θ, and a prior P(θ), both known to the receiver

We predict the second label according to

y2|x2 :  P(y2|x2, D1) = ∫ P(y2|x2, θ) P(θ|D1) dθ

and again update the posterior

P(θ|D2) = P(θ|D1) P(y2|x2, θ) / P(y2|x2, D1)

where D2 = {(x1, y1), (x2, y2)}.

Tommi Jaakkola, MIT CSAIL 28


Probabilistic sequential prediction

• To communicate the labels effectively we need to cast the problem in probabilistic terms

• Suppose we define a model P(y|x, θ), θ ∈ Θ, and a prior P(θ), both known to the receiver

Finally, we predict the nth and last label according to

yn|xn :  P(yn|xn, Dn−1) = ∫ P(yn|xn, θ) P(θ|Dn−1) dθ

where Dn−1 = {(x1, y1), . . . , (xn−1, yn−1)}.

Tommi Jaakkola, MIT CSAIL 29


Probabilistic sequential prediction

• To communicate the labels effectively we need to cast the problem in probabilistic terms

• Suppose we define a model P(y|x, θ), θ ∈ Θ, and a prior P(θ), both known to the receiver

Our sequential prediction method defines a probability distribution over all the labels given the examples:

P(y1|x1) P(y2|x2, D1) · · · P(yn|xn, Dn−1)

This does not depend on the order in which we processed the examples.

Tommi Jaakkola, MIT CSAIL 30


Probabilistic sequential prediction

• To communicate the labels effectively we need to cast the problem in probabilistic terms

• Suppose we define a model P(y|x, θ), θ ∈ Θ, and a prior P(θ), both known to the receiver

Our sequential prediction method defines a probability distribution over all the labels given the examples:

P(y1|x1) P(y2|x2, D1) · · · P(yn|xn, Dn−1) = ∫ P(y1|x1, θ) · · · P(yn|xn, θ) P(θ) dθ

(the Bayesian marginal likelihood)

This does not depend on the order in which we processed the examples.

Tommi Jaakkola, MIT CSAIL 31
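
A minimal sketch of the order-invariance claim, using a Beta-Bernoulli model over the labels alone (ignoring the inputs x; the prior and the data are my own choices): the product of sequential predictive probabilities comes out the same for any processing order.

import math
import random

def sequential_log_prob(labels, a=1.0, b=1.0):
    # Beta(a, b) prior over the probability that y = 1.
    ones, total, log_p = 0, 0, 0.0
    for y in labels:
        p1 = (ones + a) / (total + a + b)   # predictive P(y = 1 | past labels)
        log_p += math.log(p1 if y == 1 else 1.0 - p1)
        ones += (y == 1)                    # posterior update
        total += 1
    return log_p

labels = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
shuffled = labels[:]
random.shuffle(shuffled)
print(sequential_log_prob(labels))     # identical values: the order in which
print(sequential_log_prob(shuffled))   # the labels are processed is irrelevant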


Description length and probabilities

• It takes − log2 P(y1, . . . , yn) bits to communicate y1, . . . , yn according to distribution P.

Example: suppose y ∈ {1, . . . , 8} and each value is equally likely according to P

[Diagram: a depth-3 binary code tree; the 1st, 2nd, and 3rd bits select among the leaves 1, . . . , 8.]

We need − log2 P(y) = − log2(1/8) = 3 bits to describe each y.

Tommi Jaakkola, MIT CSAIL 32
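
A small check of the code-length identity (the skewed distribution is my own example): the uniform distribution over 8 values costs exactly 3 bits per label, while a non-uniform P gives shorter descriptions to likelier values.

import math

P_uniform = {y: 1 / 8 for y in range(1, 9)}
print(-math.log2(P_uniform[1]))        # 3.0 bits for every value

P_skewed = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
for y, p in P_skewed.items():
    print(y, -math.log2(p), "bits")    # 1, 2, 3, 3 bits respectively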