Online Learning of Probabilistic Graphical Models
Frédéric Koriche, CRIL - CNRS UMR 8188, Univ. Artois
koriche@cril.fr
CRIL-U Nankin 2016




Outline

1 Probabilistic Graphical Models

2 Online Learning

3 Online Learning of Markov Forests

4 Beyond Markov Forests

Probabilistic Graphical Models

Graphical Models
At the intersection of probability theory and graph theory, graphical models are a well-studied representation framework, with applications ranging from bioinformatics and financial modeling to image processing and social network analysis.


[Figure: a 12-node graphical model over X1, ..., X12, annotated with an edge probability table (0.01, 0.49, 0.48, 0.02) and a node probability table (0.75, 0.25).]

Graphical Models
Encoding high-dimensional distributions in a compact and intuitive way:

Qualitative uncertainty (interdependencies) is captured by the structure

Quantitative uncertainty (probabilities) is captured by the parameters


[Diagram: taxonomy of graphical models — Bayesian networks (sparse Bayesian networks, Bayesian polytrees, Bayesian forests) and Markov networks (sparse Markov networks, bounded tree-width MNs, Markov forests).]

Classes of Graphical Models
For an outcome space X ⊆ R^n, a class of graphical models is a pair M = G × Θ, where G is a space of n-dimensional graphs, and Θ is a space of d-dimensional parameter vectors.

G captures structural constraints (directed vs. undirected, sparse vs. dense, etc.)

Θ captures parametric constraints (binomial, multinomial, Gaussian, etc.)


[Figure: a Markov forest over X1, ..., X15.]

P(X1): X1=0: 0.75, X1=1: 0.25

P(X2, X3): (0,0): 0.01, (1,0): 0.49, (0,1): 0.48, (1,1): 0.02

(Multinomial) Markov Forests
Using [m] = {0, ..., m−1}, the class of Markov forests over [m]^n is given by F_n × Θ_{m,n}, where

F_n is the space of all acyclic graphs of order n;

Θ_{m,n} is the space of all parameter vectors mapping
- a probability table θ_i ∈ [0,1]^m to each candidate node i, and
- a probability table θ_ij ∈ [0,1]^{m×m} to each candidate edge (i, j).
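As a minimal illustration of this definition, here is a toy container for a Markov forest over [m]^n (the class, field names, and the consistency check are mine, not from the slides). A valid parameter vector should have edge tables that marginalize to the incident node tables:

```python
# A minimal container for a multinomial Markov forest over [m]^n
# (class and field names are illustrative, not from the slides).
class MarkovForest:
    def __init__(self, n, m, edges, node_tables, edge_tables):
        self.n, self.m = n, m
        self.edges = edges                  # acyclic set of pairs (i, j)
        self.node_tables = node_tables      # node_tables[i]: length-m table theta_i
        self.edge_tables = edge_tables      # edge_tables[i, j]: m-by-m table theta_ij

    def is_consistent(self, tol=1e-9):
        """Each edge table should marginalize to the two incident node tables."""
        for (i, j), t in self.edge_tables.items():
            rows = [sum(r) for r in t]                            # marginal of X_i
            cols = [sum(r[v] for r in t) for v in range(self.m)]  # marginal of X_j
            if any(abs(a - b) > tol for a, b in zip(rows, self.node_tables[i])):
                return False
            if any(abs(a - b) > tol for a, b in zip(cols, self.node_tables[j])):
                return False
        return True

# Binary (m = 2) forest on three variables with the single edge (0, 1).
mf = MarkovForest(
    n=3, m=2, edges=[(0, 1)],
    node_tables={0: [0.75, 0.25], 1: [0.70, 0.30], 2: [0.90, 0.10]},
    edge_tables={(0, 1): [[0.60, 0.15],
                          [0.10, 0.15]]},
)
assert mf.is_consistent()
```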


[Figure: a Bayesian forest over X1, ..., X15.]

P(X1): X1=0: 0.75, X1=1: 0.25

P(X3 | X2): (0,0): 0.01, (1,0): 0.99, (0,1): 0.98, (1,1): 0.02

(Multinomial) Bayesian Forests
The class of Bayesian forests over [m]^n is given by F_n × Θ_{m,n}, where

F_n is the space of all directed forests of order n;

Θ_{m,n} is the space of all parameter vectors mapping
- a probability table θ_i ∈ [0,1]^m to each candidate node i, and
- a conditional probability table θ_{j|i} (one distribution in [0,1]^m per value in [m] of the parent) to each candidate arc (i, j).


Example: The Alarm Model (Markov tree representation)
[Figure: variables Burglary, Earthquake, Alarm, Radio, Call.]

P(A, E): (0,0): 0.49, (1,0): 0.01, (0,1): 0.48, (1,1): 0.02

Example: The Alarm Model (Bayesian tree representation)

P(A | E): (0,0): 0.65, (1,0): 0.35, (0,1): 0.01, (1,1): 0.99


Online Learning

E B R A C
0 1 0 1 1
0 1 0 0 1
0 1 0 0 0
...
0 1 0 1 1
0 1 0 1 1
1 0 1 1 0

[Figure: a graphical model over A, B, E, C, R with a node table (0.65, 0.35).]

Learning
Given a class of graphical models M = G × Θ, the learning problem is to extract from a sequence of outcomes x_{1:T} = (x_1, ..., x_T):

the structure G ∈ G, and

the parameters θ ∈ Θ

of a generative model M = (G, θ) capable of predicting future, unseen outcomes. A natural loss function for measuring the performance of M is the log-loss:

ℓ(M, x) = −ln P_M(x)


[Figure: training set → graphical model M = (G, θ).]

Batch Learning
A two-stage process:

A model M ∈ M is first extracted from a training set.

The average loss of the model M is evaluated using a test set.

On a test set of size T, the average loss of M is (1/T) ∑_{t=1}^T ℓ(M, x_t).

Online Learning
A sequential process, or repeated game, between the learner and its environment. During each trial t = 1, ..., T:

the learner chooses a model M_t ∈ M;

the environment responds with an outcome x_t ∈ X, and the learner incurs the loss ℓ(M_t, x_t).


Online vs. Batch

In batch (PAC) learning, it is assumed that the data (training set + test set) is generated by a fixed but unknown probability distribution.

In online learning, there is no statistical assumption about the data generated by the environment.

Online learning is particularly suited to:

* Adaptive environments, where the target distribution can change over time;

* Streaming applications, where all the data is not available in advance;

* Large-scale datasets, by processing only one outcome at a time.


Regret
Let M be a class of graphical models over an outcome space X, and let A be a learning algorithm for M.

The minimax regret of A at horizon T is the maximum, over every sequence of outcomes x_{1:T} = (x_1, ..., x_T), of the cumulative relative loss between A and the best model in M, i.e.

R(A, T) = max_{x_{1:T} ∈ X^T} [ ∑_{t=1}^T ℓ(M_t, x_t) − min_{M ∈ M} ∑_{t=1}^T ℓ(M, x_t) ]

Learnability
A class M is (online) learnable if it admits an online learning algorithm A such that:

1 the minimax regret of A is sublinear in T, i.e. lim_{T→∞} R(A, T)/T = 0;

2 the per-round computational complexity of A is polynomial in the dimension of M.
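To make the regret notion concrete, here is a toy sketch in a much simpler setting than the slides' graphical models: online density estimation over X = {0, 1}, with the add-1/2 ("Krichevsky-Trofimov") estimator standing in for the learner (this previews the Jeffreys strategy used later; all names and the example sequence are mine):

```python
import math

def cumulative_log_loss_online(xs):
    """Sum of -ln P_t(x_t), where P_t is the add-1/2 estimate built from x_1..x_{t-1}."""
    loss, ones = 0.0, 0
    for t, x in enumerate(xs):
        p1 = (ones + 0.5) / (t + 1.0)          # predictive probability of outcome 1
        loss += -math.log(p1 if x == 1 else 1.0 - p1)
        ones += x
    return loss

def best_fixed_log_loss(xs):
    """Loss of the best fixed Bernoulli model in hindsight (the MLE)."""
    T, k = len(xs), sum(xs)
    p = k / T
    out = 0.0
    if 0 < k:
        out -= k * math.log(p)
    if k < T:
        out -= (T - k) * math.log(1.0 - p)
    return out

xs = ([1, 1, 0] * 300)[:900]                    # an arbitrary 900-outcome sequence
regret = cumulative_log_loss_online(xs) - best_fixed_log_loss(xs)
# The regret grows like (1/2) ln T, so regret / T vanishes: sublinear regret.
```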



Online Learning of Markov Forests

[Figure: a Markov forest over X1, ..., X15, with edge table (0.01, 0.49, 0.48, 0.02) on (X2, X3) and node table (0.75, 0.25).]

Does there exist an efficient online learning algorithm for Markov forests? How can we efficiently update at each iteration both the structure F^t and the parameters θ^t in order to minimize the cumulative loss ∑_t ℓ(F^t, θ^t, x^t)?


Two Key Properties
For the class F_{m,n} = F_n × Θ_{m,n} of Markov forests:

The probability distribution associated with a Markov forest M = (F, θ) factorizes in closed form:

P_M(x) = ∏_{i=1}^n θ_i(x_i) · ∏_{(i,j)∈F} θ_ij(x_i, x_j) / (θ_i(x_i) θ_j(x_j))

The space F_n of forest structures is a matroid; minimizing a linear function over F_n can be done in quadratic time using the matroid greedy algorithm.


Log-Loss
Let M = (f, θ) be a Markov forest, where f is the characteristic vector of the structure. Then

ℓ(M, x) = −ln P_M(x) = −ln [ ∏_{i=1}^n θ_i(x_i) · ∏_{(i,j)} ( θ_ij(x_i, x_j) / (θ_i(x_i) θ_j(x_j)) )^{f_ij} ]

So the log-loss is an affine function of the forest structure:

ℓ(M, x) = ψ(x) + ⟨f, φ(x)⟩

where

ψ(x) = ∑_{i∈[n]} ln(1 / θ_i(x_i))   and   φ_ij(x_i, x_j) = ln( θ_i(x_i) θ_j(x_j) / θ_ij(x_i, x_j) )
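The affine identity above can be checked numerically on a tiny binary forest. This is a sketch with illustrative tables (chosen to be consistent with the node marginals), not the slides' example:

```python
import math

# Tiny binary Markov forest: three variables, one edge (0, 1).
theta_i = {0: [0.75, 0.25], 1: [0.70, 0.30], 2: [0.90, 0.10]}
theta_ij = {(0, 1): [[0.60, 0.15],
                     [0.10, 0.15]]}           # rows indexed by x0, columns by x1
edges = list(theta_ij)                        # the structure f, as its edge set

def log_loss(x):
    """Direct -ln P_M(x) from the forest factorization."""
    p = 1.0
    for i, t in theta_i.items():
        p *= t[x[i]]
    for (i, j), t in theta_ij.items():
        p *= t[x[i]][x[j]] / (theta_i[i][x[i]] * theta_i[j][x[j]])
    return -math.log(p)

def affine_log_loss(x):
    """psi(x) + <f, phi(x)> with phi_ij = ln(theta_i * theta_j / theta_ij)."""
    psi = sum(-math.log(theta_i[i][x[i]]) for i in theta_i)
    phi = sum(math.log(theta_i[i][x[i]] * theta_i[j][x[j]]
                       / theta_ij[i, j][x[i]][x[j]]) for (i, j) in edges)
    return psi + phi

x = (0, 1, 0)
assert abs(log_loss(x) - affine_log_loss(x)) < 1e-12
```

Both routes agree for every outcome, which is exactly why the structural part of the learning problem becomes a linear optimization over f.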


Regret Decomposition
Based on the linearity of the log-loss, the regret decomposes into two parts:

R(M^{1:T}, x^{1:T}) = R(f^{1:T}, x^{1:T}) + R(θ^{1:T}, x^{1:T})

where

R(f^{1:T}, x^{1:T}) = ∑_{t=1}^T [ ℓ(f^t, θ^t, x^t) − ℓ(f*, θ^t, x^t) ]   (Structural Regret)

R(θ^{1:T}, x^{1:T}) = ∑_{t=1}^T [ ℓ(f*, θ^t, x^t) − ℓ(f*, θ*, x^t) ]   (Parametric Regret)


Parametric Regret
Based on the closed-form expression of Markov forests, the parametric regret decomposes into local regrets:

R(θ^{1:T}, x^{1:T}) = ∑_{i=1}^n ln[ θ*_i(x_i^{1:T}) / θ_i^{1:T}(x_i^{1:T}) ]   (Univariate estimators)

+ ∑_{(i,j)∈F} ln[ θ*_ij(x_ij^{1:T}) / θ_ij^{1:T}(x_ij^{1:T}) ]   (Bivariate estimators)

+ ∑_{(i,j)∈F} ln[ θ_i^{1:T}(x_i^{1:T}) θ_j^{1:T}(x_j^{1:T}) / ( θ*_i(x_i^{1:T}) θ*_j(x_j^{1:T}) ) ]   (Bivariate compensation)

where

θ*_i(x_i^{1:T}) = ∏_{t=1}^T θ*_i(x_i^t),   θ_i^{1:T}(x_i^{1:T}) = ∏_{t=1}^T θ_i^t(x_i^t)

θ*_ij(x_ij^{1:T}) = ∏_{t=1}^T θ*_ij(x_i^t, x_j^t),   θ_ij^{1:T}(x_ij^{1:T}) = ∏_{t=1}^T θ_ij^t(x_i^t, x_j^t)


Local Regrets
Expressions of the form

ln[ θ*(x^{1:T}) / θ^{1:T}(x^{1:T}) ]

have been extensively studied in the literature on universal coding (Grünwald, 2007). → Use Dirichlet mixtures.

Dirichlet Mixtures
Using symmetric Dirichlet mixtures for the parametric estimators,

θ^{1:T}(x^{1:T}) = ∫ ∏_{t=1}^T P_λ(x^t) p_µ(λ) dλ = ( Γ(mµ) / Γ(µ)^m ) · ∏_{v=1}^m Γ(t_v + µ) / Γ(t + mµ)

where t_v is the number of occurrences of value v in x^{1:T} and t = ∑_v t_v. If µ = 1/2, we get the Jeffreys mixture.
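The closed form above can be verified against the sequential "add-µ" predictions that the Dirichlet mixture induces (their product telescopes into the Gamma-function ratio). A small sketch with an arbitrary sequence, using only the standard library:

```python
import math

def mixture_closed_form(counts, mu):
    """Gamma(m*mu)/Gamma(mu)^m * prod_v Gamma(t_v + mu) / Gamma(t + m*mu)."""
    m, t = len(counts), sum(counts)
    log_p = (math.lgamma(m * mu) - m * math.lgamma(mu)
             + sum(math.lgamma(tv + mu) for tv in counts)
             - math.lgamma(t + m * mu))
    return math.exp(log_p)

def mixture_sequential(xs, m, mu):
    """Product of the add-mu predictive probabilities along the sequence."""
    counts, p = [0] * m, 1.0
    for t, x in enumerate(xs):
        p *= (counts[x] + mu) / (t + m * mu)   # predictive probability of x
        counts[x] += 1
    return p

xs, m = [0, 2, 2, 1, 0, 2], 3
counts = [xs.count(v) for v in range(m)]
assert abs(mixture_closed_form(counts, 0.5) - mixture_sequential(xs, m, 0.5)) < 1e-12
```

With µ = 1/2, `mixture_sequential` is exactly the Jeffreys (Krichevsky-Trofimov) update used by the strategy on the next slide.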


Jeffreys Strategy (an extension of Xie and Barron (2000) to forest parameters)
For each trial t:

1 Set θ_i^{t+1}(u) = (t_u + 1/2) / (t + m/2) for all i ∈ [n], u ∈ [m]

2 Set θ_ij^{t+1}(u, v) = (t_uv + 1/2) / (t + m²/2) for all (i, j) ∈ ([n] choose 2), u, v ∈ [m]

where t_u (resp. t_uv) counts the occurrences of value u in x_i^{1:t} (resp. of the pair (u, v) in x_ij^{1:t}).

Performance

Minimax regret: [ n(m−1) + (n−1)(m−1)² ] / 2 · ln(T/2π) + C_{m,n} + o(m²n)

Per-round time complexity: O(m²n²)


Structural Regret
Based on the linear form of the log-loss, the structural learning problem can be cast as a sequential combinatorial optimization problem (Audibert et al., 2011):

Minimize ∑_{t=1}^T ⟨f^t, φ(x^t)⟩ subject to f^1, ..., f^T ∈ F_n

→ Use the greedy algorithm.


Online Learning of Markov Forests Structural Regret 24/34

Follow the Perturbed Leader (Kalai and Vempala, 2005)For each trial t,

1 Draw rt in[

0, 1αt

]uniformly at random

2 Set f t+1 = argminf ∈F n⟨f ,Lt + rt⟩

where Lt =∑ts=1φ(xs)

Performance

Minimax regret: n2 ln(T/2+m2/4)p

2T

Per-round time complexity: O(n2 logn)
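One FPL round can be sketched as follows: perturb the cumulative edge losses, then run the matroid (Kruskal-style) greedy step, keeping an edge iff its perturbed weight is negative and it closes no cycle. The edge weights, perturbation range, and function names are illustrative, not from the slides:

```python
import random

def fpl_forest(n, L, alpha_t, rng):
    """One FPL round: argmin over forests of <f, L + r>, r ~ U[0, 1/alpha_t]^E."""
    parent = list(range(n))                  # union-find over the n nodes
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]    # path halving
            a = parent[a]
        return a
    # Perturb each cumulative edge loss, then scan edges by increasing weight.
    perturbed = sorted((w + rng.uniform(0.0, 1.0 / alpha_t), i, j)
                       for (i, j), w in L.items())
    forest = []
    for w, i, j in perturbed:
        if w >= 0:                           # non-negative edges cannot lower <f, w>
            break
        ri, rj = find(i), find(j)
        if ri != rj:                         # acyclicity = graphic-matroid independence
            parent[ri] = rj
            forest.append((i, j))
    return forest

# Cumulative losses L^t on the candidate edges of a 4-node problem.
L = {(0, 1): -2.0, (1, 2): -1.2, (0, 2): -0.1, (2, 3): 0.4}
forest = fpl_forest(4, L, alpha_t=10.0, rng=random.Random(0))
```

Sorting the O(n²) candidate edges dominates the cost, matching the O(n² log n) per-round complexity quoted above.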


[Plot: average log-loss vs. iterations on KDD Cup 2000; curves CL, CLT, OFDEF, OFDET.]

Experiments
The average log-loss is measured on the test set at each iteration. The online learner (FPL + Jeffreys) rapidly converges to the batch learner (Chow-Liu for Markov trees, and thresholded Chow-Liu for Markov forests).

[Plot: average log-loss vs. iterations on MSNBC; curves CL, CLT, OFDEF, OFDET.]