
Online Learning of Probabilistic Graphical Models

Frédéric Koriche, CRIL - CNRS UMR 8188, Univ. Artois

koriche@cril.fr

CRIL-U Nankin 2016

Probabilistic Graphical Models 2/34

Outline

1 Probabilistic Graphical Models

2 Online Learning

3 Online Learning of Markov Forests

4 Beyond Markov Forests

Probabilistic Graphical Models 3/34

Graphical Models

Applications: bioinformatics, financial modeling, image processing, social network analysis.

At the intersection of Probability Theory and Graph Theory, graphical models are a well-studied representation framework with various applications.

Probabilistic Graphical Models 4/34

[Figure: a Markov forest over X1, …, X12, annotated with the tables P(X1): 0 → 0.75, 1 → 0.25 and P(X2, X3): 00 → 0.01, 10 → 0.49, 01 → 0.48, 11 → 0.02]

Graphical Models: Encoding high-dimensional distributions in a compact and intuitive way:

Qualitative uncertainty (interdependencies) is captured by the structure

Quantitative uncertainty (probabilities) is captured by the parameters

Probabilistic Graphical Models 5/34

[Diagram: taxonomy of graphical models. Bayesian Networks specialize into Sparse Bayesian Networks, Bayesian Polytrees, and Bayesian Forests; Markov Networks specialize into Sparse Markov Networks, Bounded Tree-width MNs, and Markov Forests]

Classes of Graphical Models: For an outcome space X ⊆ R^n, a class of graphical models is a pair M = G × Θ, where G is a space of n-dimensional graphs, and Θ is a space of d-dimensional vectors.

G captures structural constraints (directed vs. undirected, sparse vs. dense, etc.)

Θ captures parametric constraints (binomial, multinomial, Gaussian, etc.)

Probabilistic Graphical Models 6/34

[Figure: a Markov forest over X1, …, X15, annotated with the tables below]

P(X1): 0 → 0.75, 1 → 0.25
P(X2, X3): 00 → 0.01, 10 → 0.49, 01 → 0.48, 11 → 0.02

(Multinomial) Markov Forests: Using [m] = {0, · · · , m−1}, the class of Markov forests over [m]^n is given by Fn × Θm,n, where

Fn is the space of all acyclic graphs of order n;

Θm,n is the space of all parameter vectors mapping
- a probability table θi ∈ [0,1]^m to each candidate node i, and
- a probability table θij ∈ [0,1]^{m×m} to each candidate edge (i, j).

Probabilistic Graphical Models 7/34

[Figure: a directed forest over X1, …, X15, annotated with the tables below]

P(X1): 0 → 0.75, 1 → 0.25
P(X3 | X2): P(X3=0 | X2=0) = 0.01, P(X3=1 | X2=0) = 0.99, P(X3=0 | X2=1) = 0.98, P(X3=1 | X2=1) = 0.02

(Multinomial) Bayesian Forests: The class of Bayesian forests over [m]^n is given by Fn × Θm,n, where

Fn is the space of all directed forests of order n;

Θm,n is the space of all parameter vectors mapping
- a probability table θi ∈ [0,1]^m to each candidate node i, and
- a conditional probability table θj|i ∈ [0,1]^{m×m} to each candidate arc (i, j).

Probabilistic Graphical Models 8/34

Example: The Alarm Model

[Figure: tree over Burglary, Earthquake, Alarm, Radio, Call]

Markov tree representation: P(A, E): 00 → 0.49, 10 → 0.01, 01 → 0.48, 11 → 0.02

Bayesian tree representation: P(A | E): P(A=0 | E=0) = 0.65, P(A=1 | E=0) = 0.35, P(A=0 | E=1) = 0.01, P(A=1 | E=1) = 0.99

Online Learning 9/34

Outline

1 Probabilistic Graphical Models

2 Online Learning

3 Online Learning of Markov Forests

4 Beyond Markov Forests

Online Learning 10/34

[Figure: a stream of outcomes over the variables E, B, R, A, C, and a learned tree model with P(A): 0 → 0.65, 1 → 0.35]

Learning: Given a class of graphical models M = G × Θ, the learning problem is to extract from a sequence of outcomes x1:T = (x1, · · · , xT),

the structure G ∈ G, and

the parameters θ ∈ Θ

of a generative model M = (G, θ) capable of predicting future, unseen outcomes.

A natural loss function for measuring the performance of M is the log-loss:

ℓ(M, x) = − ln P_M(x)

Online Learning 10/34

E B R A C0 1 0 1 10 1 0 0 10 1 0 0 0

.

.

.0 1 0 1 10 1 0 1 11 0 1 1 0

A

B E

C R

0 0.651 0.35

LearningGiven a class of graphical models M = G×Θ, the learning problem is to extract from asequence of outcomes x1:T = (x1, · · · ,xT ),

the structure G ∈ G, and

the parameters θ ∈Θof a generative model M = (G,θ) capable of predicting future, unseen, outcomes.A natural loss function for measuring the performance of M is the log-loss:

`(M ,x) =− lnPM (x)

Online Learning 10/34

E B R A C0 1 0 1 10 1 0 0 10 1 0 0 0

.

.

.0 1 0 1 10 1 0 1 11 0 1 1 0

A

B E

C R

0 0.651 0.35

LearningGiven a class of graphical models M = G×Θ, the learning problem is to extract from asequence of outcomes x1:T = (x1, · · · ,xT ),

the structure G ∈ G, and

the parameters θ ∈Θof a generative model M = (G,θ) capable of predicting future, unseen, outcomes.A natural loss function for measuring the performance of M is the log-loss:

`(M ,x) =− lnPM (x)

Online Learning 11/34

[Figure: training set of outcomes → graphical model M = (G, θ); test set of size T → average loss]

Batch Learning: A two-stage process:

A model M ∈ M is first extracted from a training set.

The average loss of the model M, (1/T) ∑_{t=1}^{T} ℓ(M, xt), is evaluated using a test set of size T.

Online Learning 12/34

[Figure: repeated exchange between Environment and Learner]

Online Learning: A sequential process, or repeated game between the learner and its environment. During each trial t = 1, · · · , T,

the learner chooses a model Mt = (Gt, θt) ∈ M;

the environment responds with an outcome xt ∈ X, and the learner incurs the loss ℓ(Mt, xt).
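The repeated game above can be sketched as a generic loop. This is an illustrative sketch, not code from the talk: the toy learner (a Laplace-smoothed counter for a single binary variable) and the interface names `predict_proba`/`update` are assumptions.

```python
import math
import random

class CountLearner:
    """Toy learner for one binary variable: Laplace-smoothed counts."""
    def __init__(self):
        self.counts = [1.0, 1.0]  # add-one smoothing

    def predict_proba(self, x):
        # Probability the committed model assigns to outcome x.
        return self.counts[x] / sum(self.counts)

    def update(self, x):
        self.counts[x] += 1.0

def play(learner, outcomes):
    """Run the repeated game; return the cumulative log-loss."""
    total = 0.0
    for x in outcomes:                  # trial t = 1, ..., T
        p = learner.predict_proba(x)    # learner has committed to a model M_t
        total += -math.log(p)           # environment reveals x_t; loss incurred
        learner.update(x)               # learner may revise its model
    return total

random.seed(0)
stream = [1 if random.random() < 0.3 else 0 for _ in range(1000)]
loss = play(CountLearner(), stream)
```

With i.i.d. Bernoulli data the average log-loss approaches the source entropy, but nothing in the loop itself assumes a fixed distribution.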

Online Learning 13/34

Online vs. Batch

In batch (PAC) learning, it is assumed that the data (training set + test set) is generated by a fixed (but unknown) probability distribution.

In online learning, there is no statistical assumption about the data generated by the environment.

Online learning is particularly suited to:

* Adaptive environments, where the target distribution can change over time;

* Streaming applications, where all the data is not available in advance;

* Large-scale datasets, by processing only one outcome at a time.


Online Learning 14/34

Regret: Let M be a class of graphical models over an outcome space X, and A be a learning algorithm for M.

The minimax regret of A at horizon T is the maximum, over every sequence of outcomes x1:T = (x1, · · · , xT), of the cumulative relative loss between A and the best model in M, i.e.

R(A, T) = max_{x1:T ∈ X^T} [ ∑_{t=1}^{T} ℓ(Mt, xt) − min_{M ∈ M} ∑_{t=1}^{T} ℓ(M, xt) ]

Learnability: A class M is (online) learnable if it admits an online learning algorithm A such that:

1 the minimax regret of A is sublinear in T, i.e.

lim_{T→∞} R(A, T)/T = 0

2 the per-round computational complexity of A is polynomial in the dimension of M.
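For a finite model class, the quantity inside the max can be computed directly. A hypothetical sketch, with illustrative names (`losses[k][t]` is the loss of model k on outcome t, `picks[t]` is the learner's choice at trial t):

```python
def regret(losses, picks):
    """Cumulative loss of the learner minus that of the best fixed model."""
    T = len(picks)
    learner_loss = sum(losses[picks[t]][t] for t in range(T))
    best_fixed = min(sum(row) for row in losses)  # best model in hindsight
    return learner_loss - best_fixed

# Two models, three rounds; the best fixed model in hindsight is model 0.
losses = [[0.1, 0.2, 0.1],
          [0.3, 0.1, 0.4]]
r = regret(losses, picks=[1, 0, 1])  # (0.3 + 0.2 + 0.4) - 0.4 = 0.5
```

The minimax regret then takes the maximum of this quantity over all loss sequences the environment can produce.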


Online Learning of Markov Forests 15/34

Outline

1 Probabilistic Graphical Models

2 Online Learning

3 Online Learning of Markov ForestsRegret DecompositionParametric RegretStructural Regret

4 Beyond Markov Forests

Online Learning of Markov Forests 16/34

[Figure: a Markov forest over X1, …, X15, with the tables P(X1) and P(X2, X3) being updated over time]

Does there exist an efficient online learning algorithm for Markov forests?

How can we efficiently update at each iteration both the structure Ft and the parameters θt in order to minimize the cumulative loss ∑_t ℓ(Ft, θt, xt)?


Online Learning of Markov Forests 17/34

Two key properties: For the class Fm,n = Fn × Θm,n of Markov forests,

The probability distribution associated with a Markov forest M = (F, θ) can be factorized into a closed form:

P_M(x) = ∏_{i=1}^{n} θi(xi) · ∏_{(i,j)∈F} θij(xi, xj) / (θi(xi) θj(xj))

The space Fn of forest structures is a matroid; minimizing a linear function over Fn can be done in quadratic time using the matroid greedy algorithm.
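The matroid greedy step can be sketched as a Kruskal-style scan with union-find. A minimal illustration, assuming edge weights are given as a dict; when minimizing a linear function over the forest matroid, only negative-weight edges can improve the objective, so the scan stops once weights turn non-negative.

```python
def min_weight_forest(n, weights):
    """Greedy minimization over the forest matroid.
    weights: dict {(i, j): w}. Returns the chosen edge set."""
    parent = list(range(n))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    chosen = set()
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1]):
        if w >= 0:          # non-negative edges never help a minimizer
            break
        ri, rj = find(i), find(j)
        if ri != rj:        # independence test: adding (i, j) keeps acyclicity
            parent[ri] = rj
            chosen.add((i, j))
    return chosen

# Edges (0, 1) and (1, 2) are beneficial; (0, 2) would close a cycle.
f = min_weight_forest(3, {(0, 1): -2.0, (1, 2): -1.5, (0, 2): -1.0})
```

Sorting dominates the cost, giving the near-quadratic per-round time used later by the structural learner.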


Online Learning of Markov Forests 18/34

Log-Loss: Let M = (f, θ) be a Markov forest, where f is the characteristic vector of the structure.

ℓ(M, x) = − ln P_M(x) = − ln [ ∏_{i=1}^{n} θi(xi) · ∏_{(i,j)} ( θij(xi, xj) / (θi(xi) θj(xj)) )^{fij} ]

So, the log-loss is an affine function of the forest structure:

ℓ(M, x) = ψ(x) + ⟨f, φ(x)⟩

where

ψ(x) = ∑_{i∈[n]} ln(1 / θi(xi)) and φij(xi, xj) = ln( θi(xi) θj(xj) / θij(xi, xj) )
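The affine identity can be checked numerically against the direct factorized evaluation. The tables below are illustrative, not the talk's:

```python
import math

# Illustrative tables for 3 binary variables and one edge (1, 2).
theta_i = {0: [0.75, 0.25], 1: [0.5, 0.5], 2: [0.5, 0.5]}
theta_ij = {(1, 2): [[0.01, 0.48], [0.49, 0.02]]}
forest = {(1, 2)}                       # f_{12} = 1, all other f_ij = 0

def log_loss(forest, x):
    """Direct evaluation of -ln P_M(x) from the closed-form factorization."""
    logp = sum(math.log(theta_i[i][x[i]]) for i in theta_i)
    for (i, j) in forest:
        logp += math.log(theta_ij[(i, j)][x[i]][x[j]]
                         / (theta_i[i][x[i]] * theta_i[j][x[j]]))
    return -logp

def psi(x):
    return sum(math.log(1.0 / theta_i[i][x[i]]) for i in theta_i)

def phi(i, j, x):
    return math.log(theta_i[i][x[i]] * theta_i[j][x[j]]
                    / theta_ij[(i, j)][x[i]][x[j]])

x = (0, 1, 0)
affine = psi(x) + sum(phi(i, j, x) for (i, j) in forest)
```

With an empty forest the affine form reduces to ψ(x), the log-loss of the fully factorized model.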


Online Learning of Markov Forests Regret Decomposition 19/34

Regret Decomposition: Based on the linearity of the log-loss, the regret is decomposable into two parts:

R(M1:T, x1:T) = R(f1:T, x1:T) + R(θ1:T, x1:T)

where

R(f1:T, x1:T) = ∑_{t=1}^{T} ℓ(ft, θt, xt) − ℓ(f*, θt, xt)  (Structural Regret)

R(θ1:T, x1:T) = ∑_{t=1}^{T} ℓ(f*, θt, xt) − ℓ(f*, θ*, xt)  (Parametric Regret)


Online Learning of Markov Forests Parametric Regret 20/34

Parametric Regret: Based on the closed-form expression of Markov forests, the parametric regret is decomposable into local regrets:

R(θ1:T, x1:T) = ∑_{i=1}^{n} ln( θ*i(x1:T_i) / θ1:T_i(x1:T_i) )  (Univariate estimators)

 + ∑_{(i,j)∈F} ln( θ*ij(x1:T_ij) / θ1:T_ij(x1:T_ij) )  (Bivariate estimators)

 + ∑_{(i,j)∈F} ln( (θ1:T_i(x1:T_i) / θ*i(x1:T_i)) · (θ1:T_j(x1:T_j) / θ*j(x1:T_j)) )  (Bivariate compensation)

where

θ*i(x1:T_i) = ∏_{t=1}^{T} θ*i(xt_i),  θ1:T_i(x1:T_i) = ∏_{t=1}^{T} θt_i(xt_i)

θ*ij(x1:T_ij) = ∏_{t=1}^{T} θ*ij(xt_i, xt_j),  θ1:T_ij(x1:T_ij) = ∏_{t=1}^{T} θt_ij(xt_i, xt_j)


Online Learning of Markov Forests Parametric Regret 21/34

Local regrets: Expressions of the form

θ*(x1:T) / θ1:T(x1:T)

have been extensively studied in the literature on universal coding (Grünwald, 2007). → Use Dirichlet mixtures.

Dirichlet Mixtures: Using symmetric Dirichlet mixtures for the parametric estimators,

θ1:T(x1:T) = ∫ ∏_{t=1}^{T} Pλ(xt) pµ(λ) dλ = ( Γ(mµ) / Γ(µ)^m ) · ( ∏_{v=1}^{m} Γ(tv + µ) ) / Γ(t + mµ)

where tv is the number of occurrences of value v in x1:T and t = ∑_v tv.

If µ = 1/2, we get the Jeffreys mixture.
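The closed-form mixture is conveniently evaluated with log-gamma functions for numerical stability. A sketch assuming the counts (t_1, …, t_m) are given; µ = 1/2 recovers the Jeffreys mixture:

```python
import math

def dirichlet_mixture_logprob(counts, mu):
    """ln of  Gamma(m*mu)/Gamma(mu)^m * prod_v Gamma(t_v + mu) / Gamma(t + m*mu)."""
    m = len(counts)
    t = sum(counts)
    return (math.lgamma(m * mu) - m * math.lgamma(mu)
            + sum(math.lgamma(tv + mu) for tv in counts)
            - math.lgamma(t + m * mu))

# Sanity check for a binary alphabet and T = 2: summing the mixture
# probability over all four length-2 sequences must give 1.
p20 = math.exp(dirichlet_mixture_logprob([2, 0], 0.5))  # sequences 00
p11 = math.exp(dirichlet_mixture_logprob([1, 1], 0.5))  # sequences 01 or 10
p02 = math.exp(dirichlet_mixture_logprob([0, 2], 0.5))  # sequences 11
total = p20 + 2 * p11 + p02
```

For µ = 1/2 and m = 2 this coincides with the Krichevsky-Trofimov estimator; for instance the sequence 00 gets probability (1/2)·(3/4) = 3/8.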


Online Learning of Markov Forests Parametric Regret 22/34

Jeffreys Strategy (an extension of Xie and Barron (2000) to forest parameters): For each trial t,

1 Set θ^{t+1}_i(u) = (tu + 1/2) / (t + m/2) for all i ∈ [n], u ∈ [m]

2 Set θ^{t+1}_ij(u, v) = (tuv + 1/2) / (t + m²/2) for all (i, j) ∈ ([n] choose 2), u, v ∈ [m]

Performance

Minimax regret: ( n(m−1) + (n−1)(m−1)² ) / 2 · ln(T / 2π) + Cm,n + o(m²n)

Per-round time complexity: O(m²n²)
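The two update rules amount to add-1/2 smoothing of the node and pair counts. A batch-style sketch with illustrative names (an actual online implementation would maintain the counts incrementally, one outcome per trial):

```python
def jeffreys_tables(outcomes, n, m):
    """Node tables theta_i and edge tables theta_ij after t outcomes,
    each outcome being an n-tuple over [m] = {0, ..., m-1}."""
    t = len(outcomes)
    node = [[0] * m for _ in range(n)]      # t_u per node
    edge = {}                               # t_uv per candidate pair
    for x in outcomes:
        for i in range(n):
            node[i][x[i]] += 1
        for i in range(n):
            for j in range(i + 1, n):
                key = (i, j, x[i], x[j])
                edge[key] = edge.get(key, 0) + 1
    # Add-1/2 smoothed estimates, as in the two update rules above.
    theta_i = [[(node[i][u] + 0.5) / (t + m / 2) for u in range(m)]
               for i in range(n)]
    theta_ij = {(i, j): [[(edge.get((i, j, u, v), 0) + 0.5) / (t + m * m / 2)
                          for v in range(m)] for u in range(m)]
                for i in range(n) for j in range(i + 1, n)}
    return theta_i, theta_ij

theta_i, theta_ij = jeffreys_tables([(0, 1), (0, 0), (1, 1)], n=2, m=2)
```

Maintaining all O(n²) pair tables at once is what yields the O(m²n²) per-round cost.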


Online Learning of Markov Forests Structural Regret 23/34

Structural Regret: Based on the linear form of the log-loss, the structural learning problem can be cast as a sequential combinatorial optimization problem (Audibert et al., 2011).

Minimize ∑_{t=1}^{T} ⟨ft, φ(xt)⟩

subject to f1, · · · , fT ∈ Fn

→ Use the greedy algorithm.

Online Learning of Markov Forests Structural Regret 24/34

Follow the Perturbed Leader (Kalai and Vempala, 2005): For each trial t,

1 Draw rt in [0, 1/αt] uniformly at random

2 Set f^{t+1} = argmin_{f ∈ Fn} ⟨f, Lt + rt⟩, where Lt = ∑_{s=1}^{t} φ(xs)

Performance

Minimax regret: n² ln(T/2 + m²/4) √(2T)

Per-round time complexity: O(n² log n)
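One FPL trial can be sketched by perturbing the cumulative edge losses and re-running the greedy forest step. The greedy subroutine, the dict representation of L^t, and the handling of the noise scale α_t are all illustrative:

```python
import random

def greedy_forest(n, weights):
    """Kruskal-style minimization over the forest matroid (union-find)."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    chosen = set()
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1]):
        if w >= 0:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            chosen.add((i, j))
    return chosen

def fpl_step(n, cumulative_loss, alpha_t, rng):
    """One trial: f^{t+1} = argmin_f <f, L^t + r^t>, with r^t ~ U[0, 1/alpha_t]."""
    perturbed = {e: L + rng.uniform(0.0, 1.0 / alpha_t)
                 for e, L in cumulative_loss.items()}
    return greedy_forest(n, perturbed)

rng = random.Random(0)
L = {(0, 1): -5.0, (1, 2): -4.0, (0, 2): -3.0}   # cumulative phi weights
f_next = fpl_step(3, L, alpha_t=1.0, rng=rng)
```

The perturbation randomizes the choice near ties, which is what drives the sublinear regret guarantee; the per-round cost stays that of the greedy step.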


Online Learning of Markov Forests Structural Regret 25/34

[Plot: average log-loss vs. number of iterations (10 to 100,000) on KDD Cup 2000; curves CL, CLT, OFDEF, OFDET]

Experiments: The average log-loss is measured on the test set at each iteration. The online learner (FPL + Jeffreys) rapidly converges to the batch learner (Chow-Liu for Markov trees, and Thresholded Chow-Liu for Markov forests).

Online Learning of Markov Forests Structural Regret 25/34

[Plot: average log-loss vs. number of iterations (10 to 100,000) on MSNBC; curves CL, CLT, OFDEF, OFDET]

Experiments: The average log-loss is measured on the test set at each iteration. The online learner (FPL + Jeffreys) rapidly converges to the batch learner (Chow-Liu for Markov trees, and Thresholded Chow-Liu for Markov forests).
