
Bayesian regularization of learning

Sergey Shumsky NeurOK Software LLC

Scientific methods

Induction (F. Bacon): Data → Models (machine learning).
Deduction (R. Descartes): Models → Data (mathematical modeling).

Outline
1. Learning as an ill-posed problem. General problem: data generalization. General remedy: model regularization.
2. Bayesian regularization. Theory: hypothesis comparison; model comparison; free energy and the EM algorithm.
3. Bayesian regularization. Practice: hypothesis testing; function approximation; data clustering.


Problem statement

Learning is an inverse, ill-posed problem: Data → Model.
Learning paradoxes: infinitely many predictions from finite data. How to optimize future predictions? How to separate the regular from the accidental in the data?
Remedy: regularization of learning, i.e. a model of optimal complexity.

Well-posed problem

Find $h \in H:\ h \to D$. The problem is well posed if the solution is unique and stable:
Uniqueness: $H(h_1) = D,\ H(h_2) = D \;\Rightarrow\; h_1 = h_2$.
Stability with respect to the data (Hadamard, 1900s): $\lim_i D_i = D \;\Rightarrow\; \lim_i h(D_i) = h(D)$.
Stability with respect to the learning procedure (Tikhonoff, 1960s): $\lim_i H(h_i) = H(h) \;\Rightarrow\; \lim_i h_i = h$.

Learning from examples

Problem: find the hypothesis $h \in H$ that generates the observed data $D$ in model $H$: $h \to D$.
The problem is well defined only if the solution is not sensitive to noise in the data (Hadamard) or to the learning procedure (Tikhonoff).

Learning is an ill-posed problem

Example: function approximation. The fit is sensitive to noise in the data and to the learning procedure, and the solution is not unique: many curves $h(x)$ pass through the same noisy samples of $f(x)$.
[Figure: noisy samples of f(x) with several interpolating curves h(x)]


Problem regularization

Main idea: restrict the solutions, sacrificing precision for stability. Add a penalty $P(h)$ to the data misfit:
$h^* = \arg\min_{h \in H}\ \big[\, \| H(h) - D \| + \lambda\, P(h) \,\big]$
How to choose the penalty and its weight $\lambda$?

Statistical Learning practice

Data = learning set + validation set.
Cross-validation: tune the regularization on the validation set, averaging over several splits, which yields an ensemble $h_1 + h_2 + \ldots$
A systematic approach to such ensembles: Bayes.
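A minimal sketch of this practice (the degree-9 polynomial model, the 30/10 hold-out split, and the grid of penalty weights are assumptions of the sketch, not taken from the slides):

```python
# Choosing the regularization strength lambda on a validation set
# (hold-out version of cross-validation).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of a smooth function
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Split into learning and validation sets
idx = rng.permutation(x.size)
train, valid = idx[:30], idx[30:]

def design(x, degree=9):
    """Polynomial design matrix (the hypothesis space H)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(x, y, lam, degree=9):
    """Penalized least squares: argmin ||Phi w - y||^2 + lam ||w||^2."""
    Phi = design(x, degree)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# Pick the lambda with the smallest validation error
lambdas = 10.0 ** np.arange(-8, 2)
errors = []
for lam in lambdas:
    w = fit_ridge(x[train], y[train], lam)
    resid = design(x[valid]) @ w - y[valid]
    errors.append(np.mean(resid ** 2))
print("validation-selected lambda:", lambdas[int(np.argmin(errors))])
```

Averaging the fits from several different splits gives the ensemble the slide alludes to.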


Statistical Learning theory

Learning as inverse probability.
Probability theory (Bernoulli, 1713), $H: h \to D$. For $N$ coin tosses with $N_h$ heads: $P(D \mid h, H) = h^{N_h}(1 - h)^{N - N_h}$.
Learning theory (Bayes, ~1750), $H: h \leftarrow D$:
$P(h \mid D, H) = \dfrac{P(D \mid h, H)\, P(h \mid H)}{P(D \mid H)}$

Bayesian learning

$\underbrace{P(h \mid D, H)}_{\text{Posterior}} \;=\; \dfrac{P(D \mid h, H)\ \overbrace{P(h \mid H)}^{\text{Prior}}}{\underbrace{P(D \mid H)}_{\text{Evidence}}}$

The model $H$ defines the hypothesis space $h \in H$, the prior $P(h \mid H)$, and therefore the evidence $P(D \mid H)$.

Coin tossing game

Hypothesis $h$: probability of heads. Uniform prior: $P(h) = 1$ on $[0, 1]$.
Likelihood of $N_h$ heads in $N$ tosses: $P(D \mid h) = h^{N_h}(1 - h)^{N - N_h}$.
Posterior: $P(h \mid D, N) = \dfrac{P(D \mid h)\, P(h)}{\int_0^1 dh'\, P(D \mid h')\, P(h')} \;\propto\; h^{N_h}(1 - h)^{N - N_h}$.
Most probable hypothesis: $h_{MP} = N_h / N$.

[Figure: Monte Carlo simulations of the posterior P(h | N) for N = 10 and N = 100 tosses; the posterior sharpens around h_MP as N grows]
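A minimal sketch reproducing this picture numerically (the true heads probability of 0.7 and the grid resolution are assumptions of the sketch):

```python
# Posterior over the heads probability h with a uniform prior:
# P(h | D) ∝ h^{N_h} (1 - h)^{N - N_h}; it sharpens around N_h / N as N grows.
import numpy as np

def posterior(n_heads, n_total, grid=np.linspace(0.0, 1.0, 1001)):
    """Posterior on a grid, normalized to integrate to one."""
    logp = n_heads * np.log(grid + 1e-300) + (n_total - n_heads) * np.log(1.0 - grid + 1e-300)
    p = np.exp(logp - logp.max())
    return grid, p / (p.sum() * (grid[1] - grid[0]))

rng = np.random.default_rng(1)
for N in (10, 100):
    flips = rng.random(N) < 0.7          # simulate N tosses of a biased coin
    h, p = posterior(int(flips.sum()), N)
    print(N, "h_MP =", h[np.argmax(p)])
```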

Bayesian regularization

Most probable hypothesis:
$h_{MP} = \arg\max_h\ \log P(h \mid D, H) = \arg\min_h\ \big[\underbrace{-\log P(D \mid h, H)}_{\text{learning error}}\ \underbrace{-\ \log P(h \mid H)}_{\text{regularization}}\big]$

Example: function approximation with Gaussian noise of precision $\eta$:
$P(D \mid h, H) \propto \prod_n \exp\big(-\tfrac{\eta}{2}(y_n - h(x_n))^2\big)$, so the learning error is $\tfrac{\eta}{2}\sum_n (y_n - h(x_n))^2$.
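A worked step consistent with the Gaussian noise above and with the Gaussian weight prior used in the function-approximation slides later in the deck (the precisions $\eta$ and $\alpha$ are the model's assumptions):

$$
-\log P(h \mid D, H) = \frac{\eta}{2}\sum_n \big(y_n - h(\mathbf{x}_n \mid \mathbf{w})\big)^2 + \frac{\alpha}{2}\,\|\mathbf{w}\|^2 + \mathrm{const}
\;\;\Longrightarrow\;\;
h_{MP} = \arg\min_{\mathbf{w}} \Big[E_D(\mathbf{w}) + \tfrac{\alpha}{\eta}\, E_W(\mathbf{w})\Big],
$$

so the regularization weight is simply the ratio of the prior precision to the noise precision.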

Minimal Description Length

Code length: $L(x) = -\log_2 P(x)$.
The most probable hypothesis gives the shortest description of the data:
$h_{MP} = \arg\max_h\ \log P(h \mid D, H) = \arg\min_h\ \big[\underbrace{-\log P(D \mid h, H)}_{\text{code length for the data}}\ \underbrace{-\ \log P(h \mid H)}_{\text{code length for the hypothesis}}\big]$

Example: optimal prefix code (Rissanen, 1978)

P(x):                1/2   1/4   1/8   1/8
Codeword:            0     10    110   111
L(x) = -log2 P(x):   1     2     3     3
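A small sketch checking the table above: the code lengths equal $-\log_2 P(x)$ and satisfy the Kraft inequality that any prefix code must obey.

```python
# Optimal code lengths L(x) = -log2 P(x) and the Kraft sum for the example above.
from math import log2

P = [0.5, 0.25, 0.125, 0.125]
lengths = [-log2(p) for p in P]                 # 1, 2, 3, 3 bits
kraft = sum(2.0 ** -l for l in lengths)         # must be <= 1; here exactly 1
print(lengths, kraft)
```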

Data complexity (Kolmogoroff, 1965)

$K(D \mid H) = \min_h L(h, D \mid H)$
Code length $L(h, D)$ = coded data $L(D \mid h)$ + decoding program $L(h)$; decoding runs $H: h \to D$.

Complex = unpredictable

Prediction error $\sim L(h, D)/L(D)$. Random data is incompressible; compression means predictability.
A program $h$ of total length $L(h, D)$ decodes the data $D$. Example: block coding.

Universal prior (Solomonoff, 1978)

All $2^L$ programs of length $L$ are equiprobable, so $P(h, D \mid H) = 2^{-L(h, D \mid H)}$.
Data complexity coincides with the Bayesian evidence (Solomonoff, 1960; Bayes, ~1750):
$K(D \mid H) \approx L(D \mid H) = -\log_2 P(D \mid H), \qquad P(D \mid H) = \sum_h 2^{-L(h, D \mid H)}$

Statistical ensemble: shorter description length

Proof:
$L(D \mid H) = -\log_2 \sum_h 2^{-L(h, D \mid H)} \;\le\; -\log_2 2^{-L(h_{MP}, D \mid H)} = L(h_{MP}, D \mid H)$
Corollary: ensemble predictions are superior to the most probable prediction.
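A minimal numeric illustration of the inequality (the code lengths are made-up toy values): the mixture code of the whole ensemble is never longer than the best single code.

```python
# The ensemble code length -log2 sum_h 2^{-L(h,D)} versus the shortest single L(h,D).
import numpy as np

code_lengths = np.array([3.2, 4.0, 4.1, 6.5])        # L(h, D | H) in bits for a few hypotheses
mixture_length = -np.log2(np.sum(2.0 ** (-code_lengths)))
print(mixture_length, "<=", code_lengths.min())       # ensemble description is shorter
```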

Ensemble prediction

[Figure: distribution P(h | H) over hypotheses h1, h2, h_MP; the ensemble averages predictions over all hypotheses instead of keeping only h_MP]


Model comparison

Each model $H_i$ is scored by its evidence, the normalizer of the posterior within that model:
$P(D \mid H_i) = \sum_h P(D \mid h, H_i)\, P(h \mid H_i), \qquad P(h \mid D, H_i) = \dfrac{P(D \mid h, H_i)\, P(h \mid H_i)}{P(D \mid H_i)}$

Statistics: Bayes vs. Fisher

Fisher: maximum likelihood. Bayes: maximum evidence.
Hypothesis level:
$h_{ML} = \arg\max_h\ \log P(D \mid h, H), \qquad h_{MP} = \arg\max_h\ \log P(h \mid D, H) = \arg\max_h\ \big[\log P(D \mid h, H) + \log P(h \mid H)\big]$
Model level:
$H_{ML} = \arg\max_H\ \log P(D \mid H), \qquad H_{MP} = \arg\max_H\ \log P(H \mid D) = \arg\max_H\ \big[\log P(D \mid H) + \log P(H)\big]$
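A minimal sketch of evidence-based model comparison for the coin-tossing example (the observed counts below are made up): a fair-coin model $H_0$ is compared with an unknown-bias model $H_1$ (uniform prior over $h$) through $\log_2 P(D \mid H)$.

```python
# Evidence of two models for N coin tosses with n_heads heads.
from math import comb, log2

def log2_evidence_fair(n_heads, n):
    return -float(n)                                  # P(D | H0) = 2^-N

def log2_evidence_uniform(n_heads, n):
    # P(D | H1) = integral of h^Nh (1-h)^(N-Nh) dh = 1 / ((N + 1) * C(N, Nh))
    return -log2((n + 1) * comb(n, n_heads))

for n_heads, n in [(52, 100), (70, 100)]:
    print(n_heads, log2_evidence_fair(n_heads, n), log2_evidence_uniform(n_heads, n))
```

With 52 heads out of 100 the fair coin has the larger evidence; with 70 heads the extra parameter pays for itself.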

Historical outlook

1920s to 1960s: parametric statistics, asymptotic in $N$ (Fisher, 1912).
1960s to 1980s: non-parametric statistics (Chentsoff, 1962); regularization of ill-posed problems (Tikhonoff, 1963); non-asymptotic learning (Vapnik, 1968); algorithmic complexity (Kolmogoroff, 1965); statistical physics of disordered systems (Gardner, 1988).


Statistical physics

Probability of a hypothesis: a microstate. Optimal model: a macrostate.
$P(h \mid D, H) = \frac{1}{Z}\, e^{-L(h, D \mid H)}, \qquad Z(D \mid H) = \sum_h e^{-L(h, D \mid H)}$
$H_{ML} = \arg\min_H\ L(D \mid H) = \arg\min_H\ \big[-\log Z(D \mid H)\big]$

Free energy

$F = -\log Z$: a log of a sum: $L(D \mid H) = -\log \sum_h e^{-L(h, D \mid H)}$
$F = E - TS$: a sum of logs, evaluated with the posterior $P(h \mid D, H)$:
$L(D \mid H) = \sum_h P(h \mid D, H)\, L(h, D \mid H) + \sum_h P(h \mid D, H)\, \log P(h \mid D, H)$
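A minimal numeric check of the two forms of the free energy (the code lengths $L(h, D \mid H)$ below are toy values):

```python
# F = -log Z ("log of a sum") equals <L> - S ("sum of logs") under the Gibbs posterior.
import numpy as np

L = np.array([1.0, 2.5, 0.7, 3.1])                   # L(h, D | H) for a few hypotheses
Z = np.sum(np.exp(-L))
P = np.exp(-L) / Z                                    # Gibbs posterior P(h | D, H)
F_direct = -np.log(Z)                                 # log of a sum
F_split = np.sum(P * L) + np.sum(P * np.log(P))       # <L> minus the entropy
print(F_direct, F_split)                              # the two values agree
```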

EM algorithm. Main idea

Introduce an independent distribution $\tilde P(h)$ and the free-energy functional
$F(\tilde P, H) = \sum_h \tilde P(h)\, L(h, D \mid H) + \sum_h \tilde P(h)\, \log \tilde P(h) \;\ge\; L(D \mid H),$
with equality at $\tilde P(h) = P(h \mid D, H)$. Minimize it by alternating iterations:
E-step: $\tilde P^{\,t} = \arg\min_{\tilde P}\ F(\tilde P, H^{\,t})$
M-step: $H^{\,t+1} = \arg\min_{H}\ F(\tilde P^{\,t}, H)$

EM algorithm

E-step: estimate the posterior for the given model: $\tilde P^{\,t}(h) = P(h \mid D, H^{\,t}) = P(D \mid h, H^{\,t})\, P(h \mid H^{\,t})\, /\, P(D \mid H^{\,t})$
M-step: update the model for the given posterior: $H^{\,t+1} = \arg\min_H\ \big\langle L(h, D \mid H) \big\rangle_{\tilde P^{\,t}}$


Bayesian regularization: examples

Hypothesis testing, function approximation, data clustering.
[Figure: three panels: observations y versus a hypothesis h; a fitted curve h(x); a density P(x | H)]


Hypothesis testing

Problem: noisy observations $y$. Is the theoretical value $h_0$ true?
Model $H$:
Gaussian noise: $P(y \mid h) = (2\pi\sigma^2)^{-1/2}\, \exp\big(-(y - h)^2 / 2\sigma^2\big)$
Gaussian prior: $P(h) = (2\pi\sigma_0^2)^{-1/2}\, \exp\big(-(h - h_0)^2 / 2\sigma_0^2\big)$


Optimal model: phase transition

Maximum likelihood fits the sample mean, $h_{ML} = \bar y = \frac{1}{N}\sum_n y_n$. The description length $L(D)$ is compared between infinite confidence in $h_0$ (prior width zero) and finite confidence, and the optimum switches abruptly between the two: a threshold effect governed by a Student-type coefficient $t_N$.

If $\bar y^2 N / \sigma^2 < t_N^2$: hypothesis $h_0$ is true and $h_{MP} = h_0$.
If $\bar y^2 N / \sigma^2 > t_N^2$: corrections to $h_0$ are justified, and $h_{MP}$ moves from $h_0$ toward the sample mean $\bar y$.
[Figure: prior and posterior P(h) in the two regimes, below and above the threshold t_N]
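An illustrative sketch of the threshold effect (it uses an assumed MDL/BIC-style parameter penalty of $\tfrac{1}{2}\ln N$ rather than the slides' exact formula): accept the theoretical value $h_0$ unless the data pay for the extra parameter.

```python
# Keep h0 unless fitting the mean shortens the description by more than the
# cost of one extra parameter (assumed penalty 0.5 * ln N).
import numpy as np

def prefer_h0(y, h0, sigma):
    n = len(y)
    nll_h0 = np.sum((y - h0) ** 2) / (2 * sigma ** 2)         # code length with fixed h0
    nll_fit = np.sum((y - y.mean()) ** 2) / (2 * sigma ** 2)  # code length with fitted mean
    return nll_h0 <= nll_fit + 0.5 * np.log(n)                # plus the parameter cost

rng = np.random.default_rng(0)
print(prefer_h0(rng.normal(0.0, 1.0, 50), h0=0.0, sigma=1.0))   # True: keep h0
print(prefer_h0(rng.normal(0.8, 1.0, 50), h0=0.0, sigma=1.0))   # False: correct h0
```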


Function approximation

Problem: noisy data $y(\mathbf{x})$; find an approximation $h(\mathbf{x} \mid \mathbf{w})$.
Model:
Noise: $P(D \mid \mathbf{w}, H) = Z_D^{-1}\, \exp\big(-\eta\, E_D(\mathbf{w})\big)$, with $E_D$ built from the residuals $y_n - h(\mathbf{x}_n \mid \mathbf{w})$.
Prior: $P(\mathbf{w} \mid H) = Z_W^{-1}\, \exp\big(-\alpha\, E_W(\mathbf{w})\big)$.
[Figure: noisy data y and the fitted curve h(x)]

Optimal model

Free energy minimization:
$L(D \mid \alpha, \eta) = -\ln Z + \ln Z_D + \ln Z_W$
$Z = \int d\mathbf{w}\ \exp\big(-\eta E_D(\mathbf{w}) - \alpha E_W(\mathbf{w})\big), \qquad Z_D = \int dD\ e^{-\eta E_D}, \qquad Z_W = \int d\mathbf{w}\ e^{-\alpha E_W}$
$E_D(\mathbf{w}) = \sum_n E\big(y_n - h(\mathbf{x}_n \mid \mathbf{w})\big)$

Saddle point approximation

Expand around the best hypothesis $\mathbf{w}_{MP}$:
$\ln Z = \ln \int d\mathbf{w}\ \exp\big(-\eta E_D(\mathbf{w}) - \alpha E_W(\mathbf{w})\big) \approx -\eta E_D(\mathbf{w}_{MP}) - \alpha E_W(\mathbf{w}_{MP}) - \tfrac{1}{2}\ln\det \nabla\nabla\big(\eta E_D + \alpha E_W\big)\big|_{\mathbf{w}_{MP}} + \mathrm{const}$

EM learning

E-step (optimal hypothesis): $\mathbf{w}_{MP} = \arg\min_{\mathbf{w}}\ \big[\eta_{ML}\, E_D(\mathbf{w}) + \alpha_{ML}\, E_W(\mathbf{w})\big]$
M-step (optimal regularization): $(\alpha, \eta)_{ML} = \arg\min_{\alpha, \eta}\ L(D \mid \alpha, \eta, \mathbf{w}_{MP})$

Laplace prior

$E_W(\mathbf{w}) = \sum_i |w_i|$, so the stationarity condition is $\partial E_D / \partial \mathbf{w} + \alpha\, \mathrm{sgn}(\mathbf{w}) = 0$:
Pruned weights: $\big|\partial E_D / \partial w_i\big| < \alpha \;\Rightarrow\; w_i = 0$
Equisensitive weights: $w_i \ne 0 \;\Rightarrow\; \big|\partial E_D / \partial w_i\big| = \alpha$
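A minimal sketch of minimizing $E_D(\mathbf{w}) + \alpha \sum_i |w_i|$ by iterative soft thresholding (ISTA is an assumed optimizer choice, and the data are made up); it reproduces the behaviour above: weights whose data-error gradient stays below $\alpha$ are driven exactly to zero.

```python
# ISTA for the Laplace-prior objective 0.5 * ||Phi w - y||^2 + alpha * sum_i |w_i|.
import numpy as np

def ista(Phi, y, alpha, iters=2000):
    lr = 1.0 / np.linalg.norm(Phi, 2) ** 2             # step size from the Lipschitz constant
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = Phi.T @ (Phi @ w - y)                    # gradient of E_D
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0.0)   # soft threshold: pruning
    return w

rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = Phi @ w_true + 0.1 * rng.standard_normal(100)
w = ista(Phi, y, alpha=5.0)
print(np.nonzero(np.abs(w) > 1e-6)[0])                 # only a few weights survive
```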


Clustering

Problem: noisy data $\mathbf{x}$; find prototypes $\mathbf{h}_m$ (a mixture density approximation). How many clusters?
Model:
$P(\mathbf{x} \mid H) = \sum_{m=1}^{M} P(m)\, P(\mathbf{x} \mid m), \qquad P(\mathbf{x} \mid m) \propto \exp\big(-E(\mathbf{x}, \mathbf{h}_m)\big)$
[Figure: multimodal density P(x | H) over x]

Optimal model

Free energy minimization:
$L(D \mid H) = -\sum_n \ln \sum_m P(m)\, P(\mathbf{x}_n \mid m) = \min_{\tilde P}\ F(\tilde P, H)$
$F(\tilde P, H) = \sum_{n, m} \tilde P(m \mid n)\, \big[\ln \tilde P(m \mid n) - \ln P(m)\, P(\mathbf{x}_n \mid m)\big]$
Iterations:
E-step: $\tilde P^{\,t} = \arg\min_{\tilde P}\ F(\tilde P, H^{\,t})$
M-step: $H^{\,t+1} = \arg\min_H\ F(\tilde P^{\,t}, H)$

EM algorithm

E-step: $P(m \mid \mathbf{x}_n) = \dfrac{P(m)\, \exp\big(-E(\mathbf{x}_n, \mathbf{h}_m)\big)}{\sum_{m'} P(m')\, \exp\big(-E(\mathbf{x}_n, \mathbf{h}_{m'})\big)}$
M-step: $\mathbf{h}_m = \dfrac{\sum_n P(m \mid \mathbf{x}_n)\, \mathbf{x}_n}{\sum_n P(m \mid \mathbf{x}_n)}, \qquad P(m) = \dfrac{1}{N}\sum_n P(m \mid \mathbf{x}_n)$
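A minimal sketch of these two steps for soft clustering with spherical components (the quadratic $E(\mathbf{x}, \mathbf{h}_m) = \tfrac{1}{2}\|\mathbf{x} - \mathbf{h}_m\|^2$, the toy data, and the fixed iteration count are assumptions of the sketch):

```python
# EM for prototype clustering: E-step computes P(m | x_n), M-step updates h_m and P(m).
import numpy as np

def em_clustering(X, M, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    h = X[rng.choice(len(X), M, replace=False)]             # initial prototypes h_m
    Pm = np.full(M, 1.0 / M)                                 # mixing weights P(m)
    for _ in range(iters):
        # E-step: responsibilities P(m | x_n) ∝ P(m) exp(-||x_n - h_m||^2 / 2)
        E = 0.5 * ((X[:, None, :] - h[None, :, :]) ** 2).sum(-1)
        logw = np.log(Pm)[None, :] - E
        logw -= logw.max(axis=1, keepdims=True)
        R = np.exp(logw)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: prototypes are responsibility-weighted means; P(m) is the mean responsibility
        h = (R.T @ X) / R.sum(axis=0)[:, None]
        Pm = R.mean(axis=0)
    return h, Pm

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in ([0, 0], [2, 2], [0, 3])])
prototypes, weights = em_clustering(X, M=3)
print(prototypes)
```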

How many clusters?

The description length selects the optimal number of clusters $M$:
$L(D) = \min_M \Big[-\ln \int \prod_m d\mathbf{h}_m\ P(D \mid \{\mathbf{h}_m\})\, P(\{\mathbf{h}_m\})\Big]$
[Figure: number of clusters M and prototypes h(m) as a function of the resolution]

Simulations: Uniform data
[Figure: description length L(D | M) versus the number of clusters M; the optimal model is marked]

Simulations: Gaussian data
[Figure: description length L(D | M) versus the number of clusters M; the optimal model is marked]

Simulations: Gaussian mixture
[Figure: description length L(D | M) versus the number of clusters M; the optimal model is marked]


Summary

Learning is an ill-posed problem; the remedy is regularization.
Bayesian learning has regularization built in through the model assumptions: the optimal model has minimal description length, i.e. minimal free energy.
Practical payoff: learning algorithms with built-in optimal regularization, derived from first principles (as opposed to cross-validation).