Online Learning of Probabilistic Graphical Models
Frédéric Koriche, CRIL - CNRS UMR 8188, Univ. Artois
koriche@cril.fr
CRIL-U Nankin 2016




Outline

1 Probabilistic Graphical Models

2 Online Learning

3 Online Learning of Markov Forests

4 Beyond Markov Forests

Probabilistic Graphical Models

Graphical Models
At the intersection of probability theory and graph theory, graphical models are a well-studied representation framework, with applications ranging from bioinformatics and financial modeling to image processing and social network analysis.


[Figure: a 12-node graphical model over X1, ..., X12, annotated with an edge probability table (0.01, 0.49, 0.48, 0.02) and a node probability table (0.75, 0.25).]

Graphical Models
Encoding high-dimensional distributions in a compact and intuitive way:

Qualitative uncertainty (interdependencies) is captured by the structure

Quantitative uncertainty (probabilities) is captured by the parameters


[Diagram: taxonomy of graphical models — Bayesian networks (sparse Bayesian networks, Bayesian polytrees, Bayesian forests) and Markov networks (sparse Markov networks, bounded tree-width MNs, Markov forests).]

Classes of Graphical Models
For an outcome space X ⊆ R^n, a class of graphical models is a pair M = G × Θ, where G is a space of n-dimensional graphs, and Θ is a space of d-dimensional parameter vectors.

G captures structural constraints (directed vs. undirected, sparse vs. dense, etc.)

Θ captures parametric constraints (binomial, multinomial, Gaussian, etc.)


[Figure: a Markov forest over X1, ..., X15.]

P(X1): X1=0: 0.75, X1=1: 0.25

P(X2, X3): (0,0): 0.01, (1,0): 0.49, (0,1): 0.48, (1,1): 0.02

(Multinomial) Markov Forests
Using [m] = {0, ..., m−1}, the class of Markov forests over [m]^n is given by F_n × Θ_{m,n}, where

F_n is the space of all acyclic graphs of order n;

Θ_{m,n} is the space of all parameter vectors mapping
- a probability table θ_i ∈ [0,1]^m to each candidate node i, and
- a probability table θ_ij ∈ [0,1]^{m×m} to each candidate edge (i, j).
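As a minimal illustration of this definition, here is a toy container for a Markov forest over [m]^n (the class, field names, and the consistency check are mine, not from the slides). A valid parameter vector should have edge tables that marginalize to the incident node tables:

```python
# A minimal container for a multinomial Markov forest over [m]^n
# (class and field names are illustrative, not from the slides).
class MarkovForest:
    def __init__(self, n, m, edges, node_tables, edge_tables):
        self.n, self.m = n, m
        self.edges = edges                  # acyclic set of pairs (i, j)
        self.node_tables = node_tables      # node_tables[i]: length-m table theta_i
        self.edge_tables = edge_tables      # edge_tables[i, j]: m-by-m table theta_ij

    def is_consistent(self, tol=1e-9):
        """Each edge table should marginalize to the two incident node tables."""
        for (i, j), t in self.edge_tables.items():
            rows = [sum(r) for r in t]                            # marginal of X_i
            cols = [sum(r[v] for r in t) for v in range(self.m)]  # marginal of X_j
            if any(abs(a - b) > tol for a, b in zip(rows, self.node_tables[i])):
                return False
            if any(abs(a - b) > tol for a, b in zip(cols, self.node_tables[j])):
                return False
        return True

# Binary (m = 2) forest on three variables with the single edge (0, 1).
mf = MarkovForest(
    n=3, m=2, edges=[(0, 1)],
    node_tables={0: [0.75, 0.25], 1: [0.70, 0.30], 2: [0.90, 0.10]},
    edge_tables={(0, 1): [[0.60, 0.15],
                          [0.10, 0.15]]},
)
assert mf.is_consistent()
```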


[Figure: a Bayesian forest over X1, ..., X15.]

P(X1): X1=0: 0.75, X1=1: 0.25

P(X3 | X2): (0,0): 0.01, (1,0): 0.99, (0,1): 0.98, (1,1): 0.02

(Multinomial) Bayesian Forests
The class of Bayesian forests over [m]^n is given by F_n × Θ_{m,n}, where

F_n is the space of all directed forests of order n;

Θ_{m,n} is the space of all parameter vectors mapping
- a probability table θ_i ∈ [0,1]^m to each candidate node i, and
- a conditional probability table θ_{j|i} (one distribution in [0,1]^m per value in [m] of the parent) to each candidate arc (i, j).


Example: The Alarm Model (Markov tree representation)
[Figure: variables Burglary, Earthquake, Alarm, Radio, Call.]

P(A, E): (0,0): 0.49, (1,0): 0.01, (0,1): 0.48, (1,1): 0.02

Example: The Alarm Model (Bayesian tree representation)

P(A | E): (0,0): 0.65, (1,0): 0.35, (0,1): 0.01, (1,1): 0.99


Online Learning

E B R A C
0 1 0 1 1
0 1 0 0 1
0 1 0 0 0
...
0 1 0 1 1
0 1 0 1 1
1 0 1 1 0

[Figure: a graphical model over A, B, E, C, R with a node table (0.65, 0.35).]

Learning
Given a class of graphical models M = G × Θ, the learning problem is to extract from a sequence of outcomes x_{1:T} = (x_1, ..., x_T):

the structure G ∈ G, and

the parameters θ ∈ Θ

of a generative model M = (G, θ) capable of predicting future, unseen outcomes. A natural loss function for measuring the performance of M is the log-loss:

ℓ(M, x) = −ln P_M(x)


[Figure: training set → graphical model M = (G, θ).]

Batch Learning
A two-stage process:

A model M ∈ M is first extracted from a training set.

The average loss of the model M is evaluated using a test set.

On a test set of size T, the average loss of M is (1/T) ∑_{t=1}^T ℓ(M, x_t).

Online Learning
A sequential process, or repeated game, between the learner and its environment. During each trial t = 1, ..., T:

the learner chooses a model M_t ∈ M;

the environment responds with an outcome x_t ∈ X, and the learner incurs the loss ℓ(M_t, x_t).


Online vs. Batch

In batch (PAC) learning, it is assumed that the data (training set + test set) is generated by a fixed but unknown probability distribution.

In online learning, there is no statistical assumption about the data generated by the environment.

Online learning is particularly suited to:

* Adaptive environments, where the target distribution can change over time;

* Streaming applications, where all the data is not available in advance;

* Large-scale datasets, by processing only one outcome at a time.


Regret
Let M be a class of graphical models over an outcome space X, and let A be a learning algorithm for M.

The minimax regret of A at horizon T is the maximum, over every sequence of outcomes x_{1:T} = (x_1, ..., x_T), of the cumulative relative loss between A and the best model in M, i.e.

R(A, T) = max_{x_{1:T} ∈ X^T} [ ∑_{t=1}^T ℓ(M_t, x_t) − min_{M ∈ M} ∑_{t=1}^T ℓ(M, x_t) ]

Learnability
A class M is (online) learnable if it admits an online learning algorithm A such that:

1 the minimax regret of A is sublinear in T, i.e. lim_{T→∞} R(A, T)/T = 0;

2 the per-round computational complexity of A is polynomial in the dimension of M.
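To make the regret notion concrete, here is a toy sketch in a much simpler setting than the slides' graphical models: online density estimation over X = {0, 1}, with the add-1/2 ("Krichevsky-Trofimov") estimator standing in for the learner (this previews the Jeffreys strategy used later; all names and the example sequence are mine):

```python
import math

def cumulative_log_loss_online(xs):
    """Sum of -ln P_t(x_t), where P_t is the add-1/2 estimate built from x_1..x_{t-1}."""
    loss, ones = 0.0, 0
    for t, x in enumerate(xs):
        p1 = (ones + 0.5) / (t + 1.0)          # predictive probability of outcome 1
        loss += -math.log(p1 if x == 1 else 1.0 - p1)
        ones += x
    return loss

def best_fixed_log_loss(xs):
    """Loss of the best fixed Bernoulli model in hindsight (the MLE)."""
    T, k = len(xs), sum(xs)
    p = k / T
    out = 0.0
    if 0 < k:
        out -= k * math.log(p)
    if k < T:
        out -= (T - k) * math.log(1.0 - p)
    return out

xs = ([1, 1, 0] * 300)[:900]                    # an arbitrary 900-outcome sequence
regret = cumulative_log_loss_online(xs) - best_fixed_log_loss(xs)
# The regret grows like (1/2) ln T, so regret / T vanishes: sublinear regret.
```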



Online Learning of Markov Forests

[Figure: a Markov forest over X1, ..., X15, with edge table (0.01, 0.49, 0.48, 0.02) on (X2, X3) and node table (0.75, 0.25).]

Does there exist an efficient online learning algorithm for Markov forests? How can we efficiently update at each iteration both the structure F^t and the parameters θ^t in order to minimize the cumulative loss ∑_t ℓ(F^t, θ^t, x^t)?


Two Key Properties
For the class F_{m,n} = F_n × Θ_{m,n} of Markov forests:

The probability distribution associated with a Markov forest M = (F, θ) factorizes in closed form:

P_M(x) = ∏_{i=1}^n θ_i(x_i) · ∏_{(i,j)∈F} θ_ij(x_i, x_j) / (θ_i(x_i) θ_j(x_j))

The space F_n of forest structures is a matroid; minimizing a linear function over F_n can be done in quadratic time using the matroid greedy algorithm.


Log-Loss
Let M = (f, θ) be a Markov forest, where f is the characteristic vector of the structure. Then

ℓ(M, x) = −ln P_M(x) = −ln [ ∏_{i=1}^n θ_i(x_i) · ∏_{(i,j)} ( θ_ij(x_i, x_j) / (θ_i(x_i) θ_j(x_j)) )^{f_ij} ]

So the log-loss is an affine function of the forest structure:

ℓ(M, x) = ψ(x) + ⟨f, φ(x)⟩

where

ψ(x) = ∑_{i∈[n]} ln(1 / θ_i(x_i))   and   φ_ij(x_i, x_j) = ln( θ_i(x_i) θ_j(x_j) / θ_ij(x_i, x_j) )
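The affine identity above can be checked numerically on a tiny binary forest. This is a sketch with illustrative tables (chosen to be consistent with the node marginals), not the slides' example:

```python
import math

# Tiny binary Markov forest: three variables, one edge (0, 1).
theta_i = {0: [0.75, 0.25], 1: [0.70, 0.30], 2: [0.90, 0.10]}
theta_ij = {(0, 1): [[0.60, 0.15],
                     [0.10, 0.15]]}           # rows indexed by x0, columns by x1
edges = list(theta_ij)                        # the structure f, as its edge set

def log_loss(x):
    """Direct -ln P_M(x) from the forest factorization."""
    p = 1.0
    for i, t in theta_i.items():
        p *= t[x[i]]
    for (i, j), t in theta_ij.items():
        p *= t[x[i]][x[j]] / (theta_i[i][x[i]] * theta_i[j][x[j]])
    return -math.log(p)

def affine_log_loss(x):
    """psi(x) + <f, phi(x)> with phi_ij = ln(theta_i * theta_j / theta_ij)."""
    psi = sum(-math.log(theta_i[i][x[i]]) for i in theta_i)
    phi = sum(math.log(theta_i[i][x[i]] * theta_i[j][x[j]]
                       / theta_ij[i, j][x[i]][x[j]]) for (i, j) in edges)
    return psi + phi

x = (0, 1, 0)
assert abs(log_loss(x) - affine_log_loss(x)) < 1e-12
```

Both routes agree for every outcome, which is exactly why the structural part of the learning problem becomes a linear optimization over f.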


Regret Decomposition
Based on the linearity of the log-loss, the regret decomposes into two parts:

R(M^{1:T}, x^{1:T}) = R(f^{1:T}, x^{1:T}) + R(θ^{1:T}, x^{1:T})

where

R(f^{1:T}, x^{1:T}) = ∑_{t=1}^T [ ℓ(f^t, θ^t, x^t) − ℓ(f*, θ^t, x^t) ]   (Structural Regret)

R(θ^{1:T}, x^{1:T}) = ∑_{t=1}^T [ ℓ(f*, θ^t, x^t) − ℓ(f*, θ*, x^t) ]   (Parametric Regret)


Parametric Regret
Based on the closed-form expression of Markov forests, the parametric regret decomposes into local regrets:

R(θ^{1:T}, x^{1:T}) = ∑_{i=1}^n ln[ θ*_i(x_i^{1:T}) / θ_i^{1:T}(x_i^{1:T}) ]   (Univariate estimators)

+ ∑_{(i,j)∈F} ln[ θ*_ij(x_ij^{1:T}) / θ_ij^{1:T}(x_ij^{1:T}) ]   (Bivariate estimators)

+ ∑_{(i,j)∈F} ln[ θ_i^{1:T}(x_i^{1:T}) θ_j^{1:T}(x_j^{1:T}) / ( θ*_i(x_i^{1:T}) θ*_j(x_j^{1:T}) ) ]   (Bivariate compensation)

where

θ*_i(x_i^{1:T}) = ∏_{t=1}^T θ*_i(x_i^t),   θ_i^{1:T}(x_i^{1:T}) = ∏_{t=1}^T θ_i^t(x_i^t)

θ*_ij(x_ij^{1:T}) = ∏_{t=1}^T θ*_ij(x_i^t, x_j^t),   θ_ij^{1:T}(x_ij^{1:T}) = ∏_{t=1}^T θ_ij^t(x_i^t, x_j^t)


Local Regrets
Expressions of the form

ln[ θ*(x^{1:T}) / θ^{1:T}(x^{1:T}) ]

have been extensively studied in the literature on universal coding (Grünwald, 2007). → Use Dirichlet mixtures.

Dirichlet Mixtures
Using symmetric Dirichlet mixtures for the parametric estimators,

θ^{1:T}(x^{1:T}) = ∫ ∏_{t=1}^T P_λ(x^t) p_µ(λ) dλ = ( Γ(mµ) / Γ(µ)^m ) · ∏_{v=1}^m Γ(t_v + µ) / Γ(t + mµ)

where t_v is the number of occurrences of value v in x^{1:T} and t = ∑_v t_v. If µ = 1/2, we get the Jeffreys mixture.
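The closed form above can be verified against the sequential "add-µ" predictions that the Dirichlet mixture induces (their product telescopes into the Gamma-function ratio). A small sketch with an arbitrary sequence, using only the standard library:

```python
import math

def mixture_closed_form(counts, mu):
    """Gamma(m*mu)/Gamma(mu)^m * prod_v Gamma(t_v + mu) / Gamma(t + m*mu)."""
    m, t = len(counts), sum(counts)
    log_p = (math.lgamma(m * mu) - m * math.lgamma(mu)
             + sum(math.lgamma(tv + mu) for tv in counts)
             - math.lgamma(t + m * mu))
    return math.exp(log_p)

def mixture_sequential(xs, m, mu):
    """Product of the add-mu predictive probabilities along the sequence."""
    counts, p = [0] * m, 1.0
    for t, x in enumerate(xs):
        p *= (counts[x] + mu) / (t + m * mu)   # predictive probability of x
        counts[x] += 1
    return p

xs, m = [0, 2, 2, 1, 0, 2], 3
counts = [xs.count(v) for v in range(m)]
assert abs(mixture_closed_form(counts, 0.5) - mixture_sequential(xs, m, 0.5)) < 1e-12
```

With µ = 1/2, `mixture_sequential` is exactly the Jeffreys (Krichevsky-Trofimov) update used by the strategy on the next slide.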


Jeffreys Strategy (an extension of Xie and Barron (2000) to forest parameters)
For each trial t:

1 Set θ_i^{t+1}(u) = (t_u + 1/2) / (t + m/2) for all i ∈ [n], u ∈ [m]

2 Set θ_ij^{t+1}(u, v) = (t_uv + 1/2) / (t + m²/2) for all (i, j) ∈ ([n] choose 2), u, v ∈ [m]

where t_u (resp. t_uv) counts the occurrences of value u in x_i^{1:t} (resp. of the pair (u, v) in x_ij^{1:t}).

Performance

Minimax regret: [ n(m−1) + (n−1)(m−1)² ] / 2 · ln(T/2π) + C_{m,n} + o(m²n)

Per-round time complexity: O(m²n²)


Structural Regret
Based on the linear form of the log-loss, the structural learning problem can be cast as a sequential combinatorial optimization problem (Audibert et al., 2011):

Minimize ∑_{t=1}^T ⟨f^t, φ(x^t)⟩ subject to f^1, ..., f^T ∈ F_n

→ Use the greedy algorithm.


Online Learning of Markov Forests Structural Regret 24/34

Follow the Perturbed Leader (Kalai and Vempala, 2005)For each trial t,

1 Draw rt in[

0, 1αt

]uniformly at random

2 Set f t+1 = argminf ∈F n⟨f ,Lt + rt⟩

where Lt =∑ts=1φ(xs)

Performance

Minimax regret: n2 ln(T/2+m2/4)p

2T

Per-round time complexity: O(n2 logn)
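One FPL round can be sketched as follows: perturb the cumulative edge losses, then run the matroid (Kruskal-style) greedy step, keeping an edge iff its perturbed weight is negative and it closes no cycle. The edge weights, perturbation range, and function names are illustrative, not from the slides:

```python
import random

def fpl_forest(n, L, alpha_t, rng):
    """One FPL round: argmin over forests of <f, L + r>, r ~ U[0, 1/alpha_t]^E."""
    parent = list(range(n))                  # union-find over the n nodes
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]    # path halving
            a = parent[a]
        return a
    # Perturb each cumulative edge loss, then scan edges by increasing weight.
    perturbed = sorted((w + rng.uniform(0.0, 1.0 / alpha_t), i, j)
                       for (i, j), w in L.items())
    forest = []
    for w, i, j in perturbed:
        if w >= 0:                           # non-negative edges cannot lower <f, w>
            break
        ri, rj = find(i), find(j)
        if ri != rj:                         # acyclicity = graphic-matroid independence
            parent[ri] = rj
            forest.append((i, j))
    return forest

# Cumulative losses L^t on the candidate edges of a 4-node problem.
L = {(0, 1): -2.0, (1, 2): -1.2, (0, 2): -0.1, (2, 3): 0.4}
forest = fpl_forest(4, L, alpha_t=10.0, rng=random.Random(0))
```

Sorting the O(n²) candidate edges dominates the cost, matching the O(n² log n) per-round complexity quoted above.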


[Plot: average log-loss vs. iterations on KDD Cup 2000; curves CL, CLT, OFDEF, OFDET.]

Experiments
The average log-loss is measured on the test set at each iteration. The online learner (FPL + Jeffreys) rapidly converges to the batch learner (Chow-Liu for Markov trees, and thresholded Chow-Liu for Markov forests).

[Plot: average log-loss vs. iterations on MSNBC; curves CL, CLT, OFDEF, OFDET.]