
Online Learning of Probabilistic Graphical Models

Frédéric Koriche, CRIL - CNRS UMR 8188, Univ. Artois

koriche@cril.fr

CRIL-U Nankin 2016

Probabilistic Graphical Models 2/34

Outline

1 Probabilistic Graphical Models

2 Online Learning

3 Online Learning of Markov Forests

4 Beyond Markov Forests

Probabilistic Graphical Models 3/34

Graphical Models

Applications: bioinformatics, financial modeling, image processing, social network analysis.

At the intersection of Probability Theory and Graph Theory, graphical models are a well-studied representation framework with various applications.

Probabilistic Graphical Models 4/34

[Figure: a Markov forest over X1, …, X12, annotated with the tables P(X1): 0 → 0.75, 1 → 0.25 and P(X2, X3): 00 → 0.01, 10 → 0.49, 01 → 0.48, 11 → 0.02]

Graphical Models: Encoding high-dimensional distributions in a compact and intuitive way:

Qualitative uncertainty (interdependencies) is captured by the structure

Quantitative uncertainty (probabilities) is captured by the parameters

Probabilistic Graphical Models 5/34

[Diagram: taxonomy of graphical models. Bayesian Networks specialize into Sparse Bayesian Networks, Bayesian Polytrees, and Bayesian Forests; Markov Networks specialize into Sparse Markov Networks, Bounded Tree-width MNs, and Markov Forests]

Classes of Graphical Models: For an outcome space X ⊆ R^n, a class of graphical models is a pair M = G × Θ, where G is a space of n-dimensional graphs, and Θ is a space of d-dimensional vectors.

G captures structural constraints (directed vs. undirected, sparse vs. dense, etc.)

Θ captures parametric constraints (binomial, multinomial, Gaussian, etc.)

Probabilistic Graphical Models 6/34

[Figure: a Markov forest over X1, …, X15, annotated with the tables below]

P(X1): 0 → 0.75, 1 → 0.25
P(X2, X3): 00 → 0.01, 10 → 0.49, 01 → 0.48, 11 → 0.02

(Multinomial) Markov Forests: Using [m] = {0, · · · , m−1}, the class of Markov forests over [m]^n is given by Fn × Θm,n, where

Fn is the space of all acyclic graphs of order n;

Θm,n is the space of all parameter vectors mapping
- a probability table θi ∈ [0,1]^m to each candidate node i, and
- a probability table θij ∈ [0,1]^{m×m} to each candidate edge (i, j).

Probabilistic Graphical Models 7/34

[Figure: a directed forest over X1, …, X15, annotated with the tables below]

P(X1): 0 → 0.75, 1 → 0.25
P(X3 | X2): P(X3=0 | X2=0) = 0.01, P(X3=1 | X2=0) = 0.99, P(X3=0 | X2=1) = 0.98, P(X3=1 | X2=1) = 0.02

(Multinomial) Bayesian Forests: The class of Bayesian forests over [m]^n is given by Fn × Θm,n, where

Fn is the space of all directed forests of order n;

Θm,n is the space of all parameter vectors mapping
- a probability table θi ∈ [0,1]^m to each candidate node i, and
- a conditional probability table θj|i ∈ [0,1]^{m×m} to each candidate arc (i, j).

Probabilistic Graphical Models 8/34

Example: The Alarm Model

[Figure: tree over Burglary, Earthquake, Alarm, Radio, Call]

Markov tree representation: P(A, E): 00 → 0.49, 10 → 0.01, 01 → 0.48, 11 → 0.02

Bayesian tree representation: P(A | E): P(A=0 | E=0) = 0.65, P(A=1 | E=0) = 0.35, P(A=0 | E=1) = 0.01, P(A=1 | E=1) = 0.99

Online Learning 9/34

Outline

1 Probabilistic Graphical Models

2 Online Learning

3 Online Learning of Markov Forests

4 Beyond Markov Forests

Online Learning 10/34

[Figure: a stream of outcomes over the variables E, B, R, A, C, and a learned tree model with P(A): 0 → 0.65, 1 → 0.35]

Learning: Given a class of graphical models M = G × Θ, the learning problem is to extract from a sequence of outcomes x1:T = (x1, · · · , xT),

the structure G ∈ G, and

the parameters θ ∈ Θ

of a generative model M = (G, θ) capable of predicting future, unseen outcomes.

A natural loss function for measuring the performance of M is the log-loss:

ℓ(M, x) = − ln P_M(x)

Online Learning 10/34

E B R A C0 1 0 1 10 1 0 0 10 1 0 0 0

.

.

.0 1 0 1 10 1 0 1 11 0 1 1 0

A

B E

C R

0 0.651 0.35

LearningGiven a class of graphical models M = G×Θ, the learning problem is to extract from asequence of outcomes x1:T = (x1, · · · ,xT ),

the structure G ∈ G, and

the parameters θ ∈Θof a generative model M = (G,θ) capable of predicting future, unseen, outcomes.A natural loss function for measuring the performance of M is the log-loss:

`(M ,x) =− lnPM (x)

Online Learning 10/34

E B R A C0 1 0 1 10 1 0 0 10 1 0 0 0

.

.

.0 1 0 1 10 1 0 1 11 0 1 1 0

A

B E

C R

0 0.651 0.35

LearningGiven a class of graphical models M = G×Θ, the learning problem is to extract from asequence of outcomes x1:T = (x1, · · · ,xT ),

the structure G ∈ G, and

the parameters θ ∈Θof a generative model M = (G,θ) capable of predicting future, unseen, outcomes.A natural loss function for measuring the performance of M is the log-loss:

`(M ,x) =− lnPM (x)

Online Learning 11/34

[Figure: training set of outcomes → graphical model M = (G, θ); test set of size T → average loss]

Batch Learning: A two-stage process:

A model M ∈ M is first extracted from a training set.

The average loss of the model M, (1/T) ∑_{t=1}^{T} ℓ(M, xt), is evaluated using a test set of size T.

Online Learning 12/34

[Figure: repeated exchange between Environment and Learner]

Online Learning: A sequential process, or repeated game between the learner and its environment. During each trial t = 1, · · · , T,

the learner chooses a model Mt = (Gt, θt) ∈ M;

the environment responds with an outcome xt ∈ X, and the learner incurs the loss ℓ(Mt, xt).
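The repeated game above can be sketched as a generic loop. This is an illustrative sketch, not code from the talk: the toy learner (a Laplace-smoothed counter for a single binary variable) and the interface names `predict_proba`/`update` are assumptions.

```python
import math
import random

class CountLearner:
    """Toy learner for one binary variable: Laplace-smoothed counts."""
    def __init__(self):
        self.counts = [1.0, 1.0]  # add-one smoothing

    def predict_proba(self, x):
        # Probability the committed model assigns to outcome x.
        return self.counts[x] / sum(self.counts)

    def update(self, x):
        self.counts[x] += 1.0

def play(learner, outcomes):
    """Run the repeated game; return the cumulative log-loss."""
    total = 0.0
    for x in outcomes:                  # trial t = 1, ..., T
        p = learner.predict_proba(x)    # learner has committed to a model M_t
        total += -math.log(p)           # environment reveals x_t; loss incurred
        learner.update(x)               # learner may revise its model
    return total

random.seed(0)
stream = [1 if random.random() < 0.3 else 0 for _ in range(1000)]
loss = play(CountLearner(), stream)
```

With i.i.d. Bernoulli data the average log-loss approaches the source entropy, but nothing in the loop itself assumes a fixed distribution.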

Online Learning 13/34

Online vs. Batch

In batch (PAC) learning, it is assumed that the data (training set + test set) is generated by a fixed (but unknown) probability distribution.

In online learning, there is no statistical assumption about the data generated by the environment.

Online learning is particularly suited to:

* Adaptive environments, where the target distribution can change over time;

* Streaming applications, where all the data is not available in advance;

* Large-scale datasets, by processing only one outcome at a time.


Online Learning 14/34

Regret: Let M be a class of graphical models over an outcome space X, and A be a learning algorithm for M.

The minimax regret of A at horizon T is the maximum, over every sequence of outcomes x1:T = (x1, · · · , xT), of the cumulative relative loss between A and the best model in M, i.e.

R(A, T) = max_{x1:T ∈ X^T} [ ∑_{t=1}^{T} ℓ(Mt, xt) − min_{M ∈ M} ∑_{t=1}^{T} ℓ(M, xt) ]

Learnability: A class M is (online) learnable if it admits an online learning algorithm A such that:

1 the minimax regret of A is sublinear in T, i.e.

lim_{T→∞} R(A, T)/T = 0

2 the per-round computational complexity of A is polynomial in the dimension of M.
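For a finite model class, the quantity inside the max can be computed directly. A hypothetical sketch, with illustrative names (`losses[k][t]` is the loss of model k on outcome t, `picks[t]` is the learner's choice at trial t):

```python
def regret(losses, picks):
    """Cumulative loss of the learner minus that of the best fixed model."""
    T = len(picks)
    learner_loss = sum(losses[picks[t]][t] for t in range(T))
    best_fixed = min(sum(row) for row in losses)  # best model in hindsight
    return learner_loss - best_fixed

# Two models, three rounds; the best fixed model in hindsight is model 0.
losses = [[0.1, 0.2, 0.1],
          [0.3, 0.1, 0.4]]
r = regret(losses, picks=[1, 0, 1])  # (0.3 + 0.2 + 0.4) - 0.4 = 0.5
```

The minimax regret then takes the maximum of this quantity over all loss sequences the environment can produce.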


Online Learning of Markov Forests 15/34

Outline

1 Probabilistic Graphical Models

2 Online Learning

3 Online Learning of Markov ForestsRegret DecompositionParametric RegretStructural Regret

4 Beyond Markov Forests

Online Learning of Markov Forests 16/34

[Figure: a Markov forest over X1, …, X15, with the tables P(X1) and P(X2, X3) being updated over time]

Does there exist an efficient online learning algorithm for Markov forests?

How can we efficiently update at each iteration both the structure Ft and the parameters θt in order to minimize the cumulative loss ∑_t ℓ(Ft, θt, xt)?


Online Learning of Markov Forests 17/34

Two key properties: For the class Fm,n = Fn × Θm,n of Markov forests,

The probability distribution associated with a Markov forest M = (F, θ) can be factorized into a closed form:

P_M(x) = ∏_{i=1}^{n} θi(xi) · ∏_{(i,j)∈F} θij(xi, xj) / (θi(xi) θj(xj))

The space Fn of forest structures is a matroid; minimizing a linear function over Fn can be done in quadratic time using the matroid greedy algorithm.
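The matroid greedy step can be sketched as a Kruskal-style scan with union-find. A minimal illustration, assuming edge weights are given as a dict; when minimizing a linear function over the forest matroid, only negative-weight edges can improve the objective, so the scan stops once weights turn non-negative.

```python
def min_weight_forest(n, weights):
    """Greedy minimization over the forest matroid.
    weights: dict {(i, j): w}. Returns the chosen edge set."""
    parent = list(range(n))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    chosen = set()
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1]):
        if w >= 0:          # non-negative edges never help a minimizer
            break
        ri, rj = find(i), find(j)
        if ri != rj:        # independence test: adding (i, j) keeps acyclicity
            parent[ri] = rj
            chosen.add((i, j))
    return chosen

# Edges (0, 1) and (1, 2) are beneficial; (0, 2) would close a cycle.
f = min_weight_forest(3, {(0, 1): -2.0, (1, 2): -1.5, (0, 2): -1.0})
```

Sorting dominates the cost, giving the near-quadratic per-round time used later by the structural learner.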


Online Learning of Markov Forests 18/34

Log-Loss: Let M = (f, θ) be a Markov forest, where f is the characteristic vector of the structure.

ℓ(M, x) = − ln P_M(x) = − ln [ ∏_{i=1}^{n} θi(xi) · ∏_{(i,j)} ( θij(xi, xj) / (θi(xi) θj(xj)) )^{fij} ]

So, the log-loss is an affine function of the forest structure:

ℓ(M, x) = ψ(x) + ⟨f, φ(x)⟩

where

ψ(x) = ∑_{i∈[n]} ln(1 / θi(xi)) and φij(xi, xj) = ln( θi(xi) θj(xj) / θij(xi, xj) )
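The affine identity can be checked numerically against the direct factorized evaluation. The tables below are illustrative, not the talk's:

```python
import math

# Illustrative tables for 3 binary variables and one edge (1, 2).
theta_i = {0: [0.75, 0.25], 1: [0.5, 0.5], 2: [0.5, 0.5]}
theta_ij = {(1, 2): [[0.01, 0.48], [0.49, 0.02]]}
forest = {(1, 2)}                       # f_{12} = 1, all other f_ij = 0

def log_loss(forest, x):
    """Direct evaluation of -ln P_M(x) from the closed-form factorization."""
    logp = sum(math.log(theta_i[i][x[i]]) for i in theta_i)
    for (i, j) in forest:
        logp += math.log(theta_ij[(i, j)][x[i]][x[j]]
                         / (theta_i[i][x[i]] * theta_i[j][x[j]]))
    return -logp

def psi(x):
    return sum(math.log(1.0 / theta_i[i][x[i]]) for i in theta_i)

def phi(i, j, x):
    return math.log(theta_i[i][x[i]] * theta_i[j][x[j]]
                    / theta_ij[(i, j)][x[i]][x[j]])

x = (0, 1, 0)
affine = psi(x) + sum(phi(i, j, x) for (i, j) in forest)
```

With an empty forest the affine form reduces to ψ(x), the log-loss of the fully factorized model.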


Online Learning of Markov Forests Regret Decomposition 19/34

Regret Decomposition: Based on the linearity of the log-loss, the regret is decomposable into two parts:

R(M1:T, x1:T) = R(f1:T, x1:T) + R(θ1:T, x1:T)

where

R(f1:T, x1:T) = ∑_{t=1}^{T} ℓ(ft, θt, xt) − ℓ(f*, θt, xt)  (Structural Regret)

R(θ1:T, x1:T) = ∑_{t=1}^{T} ℓ(f*, θt, xt) − ℓ(f*, θ*, xt)  (Parametric Regret)


Online Learning of Markov Forests Parametric Regret 20/34

Parametric Regret: Based on the closed-form expression of Markov forests, the parametric regret is decomposable into local regrets:

R(θ1:T, x1:T) = ∑_{i=1}^{n} ln( θ*i(x1:T_i) / θ1:T_i(x1:T_i) )  (Univariate estimators)

 + ∑_{(i,j)∈F} ln( θ*ij(x1:T_ij) / θ1:T_ij(x1:T_ij) )  (Bivariate estimators)

 + ∑_{(i,j)∈F} ln( (θ1:T_i(x1:T_i) / θ*i(x1:T_i)) · (θ1:T_j(x1:T_j) / θ*j(x1:T_j)) )  (Bivariate compensation)

where

θ*i(x1:T_i) = ∏_{t=1}^{T} θ*i(xt_i),  θ1:T_i(x1:T_i) = ∏_{t=1}^{T} θt_i(xt_i)

θ*ij(x1:T_ij) = ∏_{t=1}^{T} θ*ij(xt_i, xt_j),  θ1:T_ij(x1:T_ij) = ∏_{t=1}^{T} θt_ij(xt_i, xt_j)


Online Learning of Markov Forests Parametric Regret 21/34

Local regrets: Expressions of the form

θ*(x1:T) / θ1:T(x1:T)

have been extensively studied in the literature on universal coding (Grünwald, 2007). → Use Dirichlet mixtures.

Dirichlet Mixtures: Using symmetric Dirichlet mixtures for the parametric estimators,

θ1:T(x1:T) = ∫ ∏_{t=1}^{T} Pλ(xt) pµ(λ) dλ = ( Γ(mµ) / Γ(µ)^m ) · ( ∏_{v=1}^{m} Γ(tv + µ) ) / Γ(t + mµ)

where tv is the number of occurrences of value v in x1:T and t = ∑_v tv.

If µ = 1/2, we get the Jeffreys mixture.
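The closed-form mixture is conveniently evaluated with log-gamma functions for numerical stability. A sketch assuming the counts (t_1, …, t_m) are given; µ = 1/2 recovers the Jeffreys mixture:

```python
import math

def dirichlet_mixture_logprob(counts, mu):
    """ln of  Gamma(m*mu)/Gamma(mu)^m * prod_v Gamma(t_v + mu) / Gamma(t + m*mu)."""
    m = len(counts)
    t = sum(counts)
    return (math.lgamma(m * mu) - m * math.lgamma(mu)
            + sum(math.lgamma(tv + mu) for tv in counts)
            - math.lgamma(t + m * mu))

# Sanity check for a binary alphabet and T = 2: summing the mixture
# probability over all four length-2 sequences must give 1.
p20 = math.exp(dirichlet_mixture_logprob([2, 0], 0.5))  # sequences 00
p11 = math.exp(dirichlet_mixture_logprob([1, 1], 0.5))  # sequences 01 or 10
p02 = math.exp(dirichlet_mixture_logprob([0, 2], 0.5))  # sequences 11
total = p20 + 2 * p11 + p02
```

For µ = 1/2 and m = 2 this coincides with the Krichevsky-Trofimov estimator; for instance the sequence 00 gets probability (1/2)·(3/4) = 3/8.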


Online Learning of Markov Forests Parametric Regret 22/34

Jeffreys Strategy (an extension of Xie and Barron (2000) to forest parameters): For each trial t,

1 Set θ^{t+1}_i(u) = (tu + 1/2) / (t + m/2) for all i ∈ [n], u ∈ [m]

2 Set θ^{t+1}_ij(u, v) = (tuv + 1/2) / (t + m²/2) for all (i, j) ∈ ([n] choose 2), u, v ∈ [m]

Performance

Minimax regret: ( n(m−1) + (n−1)(m−1)² ) / 2 · ln(T / 2π) + Cm,n + o(m²n)

Per-round time complexity: O(m²n²)
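The two update rules amount to add-1/2 smoothing of the node and pair counts. A batch-style sketch with illustrative names (an actual online implementation would maintain the counts incrementally, one outcome per trial):

```python
def jeffreys_tables(outcomes, n, m):
    """Node tables theta_i and edge tables theta_ij after t outcomes,
    each outcome being an n-tuple over [m] = {0, ..., m-1}."""
    t = len(outcomes)
    node = [[0] * m for _ in range(n)]      # t_u per node
    edge = {}                               # t_uv per candidate pair
    for x in outcomes:
        for i in range(n):
            node[i][x[i]] += 1
        for i in range(n):
            for j in range(i + 1, n):
                key = (i, j, x[i], x[j])
                edge[key] = edge.get(key, 0) + 1
    # Add-1/2 smoothed estimates, as in the two update rules above.
    theta_i = [[(node[i][u] + 0.5) / (t + m / 2) for u in range(m)]
               for i in range(n)]
    theta_ij = {(i, j): [[(edge.get((i, j, u, v), 0) + 0.5) / (t + m * m / 2)
                          for v in range(m)] for u in range(m)]
                for i in range(n) for j in range(i + 1, n)}
    return theta_i, theta_ij

theta_i, theta_ij = jeffreys_tables([(0, 1), (0, 0), (1, 1)], n=2, m=2)
```

Maintaining all O(n²) pair tables at once is what yields the O(m²n²) per-round cost.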


Online Learning of Markov Forests Structural Regret 23/34

Structural Regret: Based on the linear form of the log-loss, the structural learning problem can be cast as a sequential combinatorial optimization problem (Audibert et al., 2011).

Minimize ∑_{t=1}^{T} ⟨ft, φ(xt)⟩

subject to f1, · · · , fT ∈ Fn

→ Use the greedy algorithm.

Online Learning of Markov Forests Structural Regret 24/34

Follow the Perturbed Leader (Kalai and Vempala, 2005): For each trial t,

1 Draw rt in [0, 1/αt] uniformly at random

2 Set f^{t+1} = argmin_{f ∈ Fn} ⟨f, Lt + rt⟩, where Lt = ∑_{s=1}^{t} φ(xs)

Performance

Minimax regret: n² ln(T/2 + m²/4) √(2T)

Per-round time complexity: O(n² log n)
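One FPL trial can be sketched by perturbing the cumulative edge losses and re-running the greedy forest step. The greedy subroutine, the dict representation of L^t, and the handling of the noise scale α_t are all illustrative:

```python
import random

def greedy_forest(n, weights):
    """Kruskal-style minimization over the forest matroid (union-find)."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    chosen = set()
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1]):
        if w >= 0:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            chosen.add((i, j))
    return chosen

def fpl_step(n, cumulative_loss, alpha_t, rng):
    """One trial: f^{t+1} = argmin_f <f, L^t + r^t>, with r^t ~ U[0, 1/alpha_t]."""
    perturbed = {e: L + rng.uniform(0.0, 1.0 / alpha_t)
                 for e, L in cumulative_loss.items()}
    return greedy_forest(n, perturbed)

rng = random.Random(0)
L = {(0, 1): -5.0, (1, 2): -4.0, (0, 2): -3.0}   # cumulative phi weights
f_next = fpl_step(3, L, alpha_t=1.0, rng=rng)
```

The perturbation randomizes the choice near ties, which is what drives the sublinear regret guarantee; the per-round cost stays that of the greedy step.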


Online Learning of Markov Forests Structural Regret 25/34

[Plot: average log-loss vs. number of iterations (10 to 100,000) on KDD Cup 2000; curves CL, CLT, OFDEF, OFDET]

Experiments: The average log-loss is measured on the test set at each iteration. The online learner (FPL + Jeffreys) rapidly converges to the batch learner (Chow-Liu for Markov trees, and Thresholded Chow-Liu for Markov forests).

Online Learning of Markov Forests Structural Regret 25/34

[Plot: average log-loss vs. number of iterations (10 to 100,000) on MSNBC; curves CL, CLT, OFDEF, OFDET]

Experiments: The average log-loss is measured on the test set at each iteration. The online learner (FPL + Jeffreys) rapidly converges to the batch learner (Chow-Liu for Markov trees, and Thresholded Chow-Liu for Markov forests).
