Finding Optimal Bayesian Networks with Greedy Search
Max Chickering
Outline
• Bayesian-Network Definitions
• Learning
• Greedy Equivalence Search (GES)
• Optimality of GES
Bayesian Networks
p(X_1, …, X_n | S, θ) = ∏_{i=1}^{n} p(X_i | Par(X_i), θ_i)
Use B = (S, θ) to represent p(X_1, …, X_n)
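As a concrete illustration of the factorization, here is a minimal Python sketch that evaluates the factored joint from per-variable conditional tables. The names and table encoding are illustrative, not from the talk.

```python
# Minimal sketch of the Bayesian-network factorization:
# p(x1, ..., xn) = prod_i p(xi | Par(xi)).
# `cpts` maps each variable to (parents, table), where table maps
# (tuple of parent values, child value) -> probability.

def joint_probability(assignment, cpts):
    """assignment: dict variable -> value; returns p(assignment) under B = (S, theta)."""
    prob = 1.0
    for var, (parents, table) in cpts.items():
        parent_vals = tuple(assignment[p] for p in parents)
        prob *= table[(parent_vals, assignment[var])]
    return prob
```

For example, for X → Y with p(X=1) = 0.5 and p(Y=1 | X=1) = 0.8, the joint p(X=1, Y=1) is 0.5 · 0.8 = 0.4.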
Markov Conditions
[Figure: node X with its parents Par, descendants Desc, and non-descendants ND]
From factorization: I(X, ND | Par(X))
Markov Conditions + Graphoid Axioms characterize all independencies
Structure/Distribution Inclusion
p is included in S if there exists θ such that B = (S, θ) defines p
[Figure: the distributions included in structure S over X, Y, Z form a subset of all distributions; p lies inside it]
Structure/Structure Inclusion (T ≤ S)
T is included in S if every p included in T is included in S
[Figure: structures T and S over X, Y, Z; the distributions included in T form a subset of those included in S]
(S is an I-map of T)
Structure/Structure Equivalence (T ≡ S)
[Figure: structures S and T over X, Y, Z include exactly the same set of distributions]
Reflexive, Symmetric, Transitive
Equivalence
V-structure
[Figure: DAG over A, B, C, D highlighting a v-structure]
Theorem (Verma and Pearl, 1990): S ≡ T ⟺ S and T have the same v-structures and the same skeletons
[Figure: the skeleton of the same DAG, with edge directions dropped]
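The Verma-Pearl characterization is easy to check mechanically. A hedged sketch, with DAGs encoded as parent maps (the encoding is mine, not the talk's):

```python
# Check Markov equivalence via the Verma-Pearl theorem:
# two DAGs are equivalent iff they share skeleton and v-structures.
# A DAG is encoded as a dict: node -> set of parents.

def skeleton(dag):
    """Undirected adjacencies of the DAG."""
    return {frozenset((p, child)) for child, parents in dag.items() for p in parents}

def v_structures(dag):
    """Triples (a, c, b) with a -> c <- b and a, b non-adjacent."""
    adj = skeleton(dag)
    vs = set()
    for child, parents in dag.items():
        ps = sorted(parents)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset((ps[i], ps[j])) not in adj:
                    vs.add((ps[i], child, ps[j]))
    return vs

def markov_equivalent(g, h):
    return skeleton(g) == skeleton(h) and v_structures(g) == v_structures(h)
```

For example, the chain X → Y → Z and its reversal are equivalent, while the collider X → Z ← Y is not.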
Learning Bayesian Networks
1. Learn the structure
2. Estimate the conditional distributions
[Figure: generative distribution p* → iid samples → observed data table over X, Y, Z → learned model: structure over X, Y, Z with estimated conditional distributions]
Learning Structure
• Scoring criterion
F(D, S)
• Search procedure
Identify one or more structures with high values
for the scoring function
Properties of Scoring Criteria
• Consistent
• Locally Consistent
• Score Equivalent
Consistent Criterion
S includes p*, T does not include p* ⇒ F(S,D) > F(T,D)
Both include p*, S has fewer parameters ⇒ F(S,D) > F(T,D)
Criterion favors (in the limit) simplest model that includes the generative distribution p*
[Figure: candidate structures over X, Y, Z; the criterion favors the simplest structure that includes p*]
Locally Consistent Criterion
S and T differ by one edge (T contains Y → X; S does not):
If I(X,Y | Par(X)) in p* then F(S,D) > F(T,D); otherwise F(S,D) < F(T,D)
Score-Equivalent Criterion
S ≡ T ⇒ F(S,D) = F(T,D)
[Figure: S is X → Y, T is X ← Y]
Bayesian Criterion (consistent, locally consistent, and score equivalent)
S^h : hypothesis that the generative distribution p* has the same independence constraints as S
F_Bayes(S,D) = log p(S^h | D) = k + log p(D | S^h) + log p(S^h)
log p(D | S^h): marginal likelihood (closed form with standard assumptions)
log p(S^h): structure prior (e.g. prefer simple structures)
Search Procedure
• Set of states
• Representation for the states
• Operators to move between states
• Systematic Search Algorithm
Greedy Equivalence Search
• Set of states: equivalence classes of DAGs
• Representation for the states: essential graphs
• Operators to move between states: forward and backward operators
• Systematic search algorithm: two-phase greedy
Representation: Essential Graphs
[Figure: a DAG over A, B, C, D, E, F and its essential graph; compelled edges remain directed, reversible edges become undirected]
GES Operators
Forward Direction – single edge additions
Backward Direction – single edge deletions
Two-Phase Greedy Algorithm
Phase 1: Forward Equivalence Search (FES)
• Start with the all-independence model
• Run greedy search using forward operators
Phase 2: Backward Equivalence Search (BES)
• Start with the local maximum from FES
• Run greedy search using backward operators
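The two-phase structure can be sketched generically. Here the state encoding, the neighbor generators, and the scoring function are abstract placeholders standing in for the FES/BES operators and F(S,D); this is a sketch of the control flow, not the talk's actual operators.

```python
# Generic two-phase greedy skeleton for GES. `forward_neighbors` and
# `backward_neighbors` stand in for the FES/BES operators; `score` for F(S, D).

def greedy_phase(state, neighbors, score):
    """Hill-climb: move to the best strictly-improving neighbor until none exists."""
    current = score(state)
    while True:
        best_state, best_score = None, current
        for cand in neighbors(state):
            s = score(cand)
            if s > best_score:
                best_state, best_score = cand, s
        if best_state is None:
            return state  # local maximum
        state, current = best_state, best_score

def ges(empty_state, forward_neighbors, backward_neighbors, score):
    # Phase 1 (FES): start from the all-independence model, greedily add.
    state = greedy_phase(empty_state, forward_neighbors, score)
    # Phase 2 (BES): start from the FES local maximum, greedily delete.
    return greedy_phase(state, backward_neighbors, score)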
Forward Operators
• Consider all DAGs in the current state
• For each DAG, consider all single-edge additions (acyclic)
• Take the union of the resulting equivalence classes
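The DAG-level additions can be sketched as follows; the parent-map encoding is illustrative, and acyclicity is enforced by a reachability test.

```python
# Enumerate all DAGs reachable from `parents` (dict: node -> set of parents)
# by one acyclic single-edge addition.

def creates_cycle(parents, frm, to):
    """Adding frm -> to creates a cycle iff frm is already reachable from to."""
    children = {n: set() for n in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].add(child)
    stack, seen = [to], set()
    while stack:
        node = stack.pop()
        if node == frm:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return False

def single_edge_additions(parents):
    for frm in parents:
        for to in parents:
            if frm != to and frm not in parents[to] and not creates_cycle(parents, frm, to):
                new = {n: set(ps) for n, ps in parents.items()}
                new[to].add(frm)
                yield new
```

For the chain A → B → C, the only acyclic addition is A → C.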
Forward-Operators Example
[Figure: the current state over A, B, C; all DAGs in that state; all DAGs resulting from a single-edge addition; and the union of the corresponding essential graphs]
Forward-Operators Example
[Figure: the resulting neighbor states, shown as essential graphs over A, B, C]
Backward Operators
• Consider all DAGs in the current state
• For each DAG, consider all single-edge deletions
• Take the union of the resulting equivalence classes
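The DAG-level deletions are simpler to enumerate, since removing an edge can never create a cycle. A small sketch (the parent-map encoding is illustrative):

```python
# Enumerate all DAGs reachable from `parents` (dict: node -> set of parents)
# by one single-edge deletion; deletions never create cycles.

def single_edge_deletions(parents):
    for child, ps in parents.items():
        for p in sorted(ps):
            new = {n: set(q) for n, q in parents.items()}
            new[child].discard(p)
            yield new
```

For the chain A → B → C there are exactly two single-edge deletions.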
Backward-Operators Example
[Figure: the current state over A, B, C; all DAGs in that state; all DAGs resulting from a single-edge deletion; and the union of the corresponding essential graphs]
Backward-Operators Example
[Figure: the resulting neighbor states, shown as essential graphs over A, B, C]
DAG-Perfect
DAG-perfect distribution p: there exists a DAG G such that
I(X,Y | Z) in p ⟺ I(X,Y | Z) in G
Non-DAG-perfect distribution q over A, B, C, D: exactly the independencies I(A,D | B,C) and I(B,C | A,D) hold
[Figure: two candidate DAGs over A, B, C, D, each encoding only one of the two independencies]
DAG-Perfect Consequence: Composition Axiom Holds in p*
If ¬I(X, Y | Z) for a set Y, then ¬I(X, y | Z) for some singleton y ∈ Y
[Figure: example illustrating the composition axiom]
Optimality of GES
If p* is DAG-perfect with respect to some G*:
[Figure: p* → n iid samples over X, Y, Z → GES → state S]
For large n, S = S* (the equivalence class of G*)
Optimality of GES
Proof Outline
• After the first phase (FES), the current state includes S*
• After the second phase (BES), the current state equals S*
All-independence → (FES) → state includes S* → (BES) → state equals S*
FES Maximum Includes S*
Assume the local maximum S does NOT include S*, and take any DAG G from S.
Markov conditions characterize the independencies: in p*, there exists some X that is not independent of its non-descendants given its parents.
[Figure: DAG in which X has parent E and non-descendants A, B, C, D]
¬I(X, {A,B,C,D} | E) in p*
p* is DAG-perfect ⇒ the composition axiom holds:
¬I(X, C | E) in p* for some singleton, here C
Local consistency: adding the C → X edge improves the score, and the resulting equivalence class is a neighbor, contradicting that S is a local maximum.
BES Identifies S*
• Current state always includes S*:
Local consistency of the criterion
• The local maximum is S*:
Meek’s conjecture
Meek’s Conjecture
For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of
(1) covered edge reversals in G
(2) single-edge additions to G
such that after each change G ≤ H, and after all changes G = H
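The transformation relies on covered edges: X → Y is covered when Par(Y) = Par(X) ∪ {X}, and reversing a covered edge preserves Markov equivalence. A small sketch (the parent-map encoding is illustrative):

```python
# Covered-edge test and reversal, as used in the Meek-conjecture transformation.
# A DAG is encoded as a dict: node -> set of parents.

def is_covered(parents, x, y):
    """x -> y is covered iff Par(y) = Par(x) plus {x}."""
    return x in parents[y] and parents[y] - {x} == parents[x]

def reverse_covered_edge(parents, x, y):
    """Reverse a covered edge x -> y; the result is Markov equivalent."""
    assert is_covered(parents, x, y)
    new = {n: set(ps) for n, ps in parents.items()}
    new[y].discard(x)
    new[x].add(y)
    return new
```

For the lone edge X → Y the edge is covered; once Y gains another parent Z that X lacks, it no longer is.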
Meek’s Conjecture
[Figure: example DAGs G and H over A, B, C, D, where G satisfies I(A,B) and I(C,B | A,D); a sequence of covered edge reversals and single-edge additions transforms G into H]
Meek’s Conjecture and BES (S* ≤ S)
Assume the local maximum S is not S*. Take any DAG H from S and any DAG G from S*.
By Meek’s conjecture, G reaches H through covered edge reversals and single-edge additions:
G → (sequence of Rev and Add steps) → H
Read in reverse, H reaches G through covered edge reversals and single-edge deletions:
H → (sequence of Del and Rev steps) → G
The first deletion yields a neighbor of S in BES that still includes S*, so by local consistency S is not a local maximum.
Discussion Points
• In practice, GES is as fast as DAG-based search:
the neighborhood of an essential graph can be generated and scored very efficiently
• When the DAG-perfect assumption fails, we still get optimality guarantees:
as long as composition holds in the generative distribution, any local maximum is inclusion-minimal
Thanks!
My Home Page: http://research.microsoft.com/~dmax
Relevant Papers:
“Optimal Structure Identification with Greedy Search” (JMLR submission): contains detailed proofs of Meek’s conjecture and of the optimality of GES
“Finding Optimal Bayesian Networks” (UAI 2002, with Chris Meek): contains extensions of the optimality results for GES when p* is not DAG-perfect
Bayesian Criterion is Locally Consistent
• For large samples, the Bayesian score approaches BIC plus a constant
• BIC is decomposable:
BIC(S, D) = ∑_{i=1}^{n} F(D, X_i, Par(X_i))
• The difference in score is therefore the same for any DAGs that differ by a Y → X edge, as long as X has the same remaining parents
[Figure: two networks differing only in the Y → X edge]
(The complete network always includes p*)
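Decomposability means each family contributes an independent term, so an edge change only rescores the affected family. A hedged sketch of one local BIC family term for discrete data (the penalty is a standard BIC variant and the data encoding is illustrative, not necessarily the talk's exact form):

```python
import math
from collections import Counter

# One decomposable BIC family term F(D, X, Par(X)) for discrete data:
# maximized log-likelihood of X given its parents, minus (#params / 2) * log N.
# Data is a list of dicts mapping variable -> value.

def family_bic(data, child, parents):
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / marg[pv]) for (pv, _), c in joint.items())
    n_child_values = len({row[child] for row in data})
    n_params = len(marg) * (n_child_values - 1)
    return loglik - 0.5 * n_params * math.log(n)
```

With data in which Y copies X, the family term for Y with parent X exceeds the term with no parents, so adding the edge improves the decomposable total.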
Bayesian Criterion is Consistent
Assume the conditional distributions are (1) unconstrained multinomials or (2) linear regressions.
Network structures then correspond to curved exponential models (Geiger, Heckerman, King and Meek, 2001), and the Bayesian criterion is consistent (Haughton, 1988).
Bayesian Criterion is Score Equivalent
S ≡ T ⇒ F(S,D) = F(T,D)
[Figure: S is X → Y, T is X ← Y]
S^h = T^h: both hypotheses assert no independence constraints
Active Paths
Z-active path between X and Y (non-standard):
1. Neither X nor Y is in Z
2. Every pair of colliding edges meets at a member of Z
3. No other pair of edges meets at a member of Z
[Figure: a path from X to Y passing through Z]
G ≤ H ⇒ if there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H
Active Paths
[Figure: path X - A - Z - W - B - Y; path A - B - C]
• X-Y: out of X and into Y
• X-W: out of both X and W
• Any sub-path between A, B ∉ Z is also active
• If A - B and B - C are active paths and at least one is out of B, there is an active path between A and C
Simple Active Paths
[Figure: an active path between A and B that contains the edge Y → X]
If an active path between A and B contains Y → X, then the edge appears either
(1) exactly once, or
(2) exactly twice
To simplify the discussion, assume (1) only; the proofs for (2) are almost identical
Typical Argument: Combining Active Paths
[Figure: graphs G, H, and G′ over A, B, X, Y, Z]
Setup: Z is a sink node adjacent to X and Y; G ≤ H
G′: suppose there is an active path in G′ (with X not in the conditioning set) that has no corresponding active path in H. Then Z is not in the conditioning set.
Proof Sketch
Given two DAGs G, H with G ≤ H and G ≠ H, identify either:
(1) a covered edge X → Y in G that has the opposite orientation in H, or
(2) a new edge X → Y to be added to G such that G remains included in H
The Transformation
Choose any node Y that is a sink in H
Case 1a: Y is a sink in G, and ∃X ∈ Par_H(Y) with X ∉ Par_G(Y)
Case 1b: Y is a sink in G with the same parents in both graphs
Case 2a: ∃X s.t. Y → X is covered
Case 2b: ∃X s.t. Y → X, and W is a parent of Y but not of X
Case 2c: every edge Y → X has Par(Y) ⊆ Par(X)
[Figure: the local configurations around X, Y, and W in each case]
Preliminaries
• The adjacencies in G are a subset of the adjacencies in H (since G ≤ H)
• If X → Y ← Z is a v-structure in G but not in H, then X and Z are adjacent in H
• Any new active path that results from adding X → Y to G includes the edge X → Y
Proof Sketch: Case 1
Y is a sink in G
Case 1a: X ∈ Par_H(Y), X ∉ Par_G(Y): add X → Y to G
Case 1b: parents identical: remove Y from both graphs; the proof is similar on the smaller graphs
[Figure: G with X → Y added; the corresponding structure in H]
Suppose there is some new active path between A and B that is not in H:
1. Y is a sink in G, so it must be in the conditioning set CS
2. Neither X nor the next node Z on the path is in CS
3. In H there are active paths AP(A,Z) and AP(X,B), and Z → Y ← X; combining them gives the required active path in H
Proof Sketch: Case 2
Y is not a sink in G
Case 2a: there is a covered edge Y → X: reverse the edge
Case 2b: there is a non-covered edge Y → X such that W is a parent of Y but not a parent of X: add W → X
[Figure: G with W → Y → X; G′ with W → X added; the corresponding structure in H]
Suppose there is some new active path between A and B that is not in H:
1. Y must be in CS, else replace W → X by W → Y → X (so the path is not new)
2. If X is not in CS, then in H the active segments A-W and X-B combine with W → Y ← X into an active path
Case 2c: The Difficult Case
All non-covered edges Y → Z have Par(Y) ⊆ Par(Z)
[Figure: G and H over W1, W2, Y, Z1, Z2]
Adding W1 → Y: G no longer ≤ H (there is a Z2-active path between W1 and W2)
Adding W2 → Y: G ≤ H still holds
Choosing Z
[Figure: the descendants of Y in G, and the same nodes in H]
D is the maximal G-descendant of Y in H
Z is any maximal child of Y in G such that D is a descendant of Z in G
Choosing Z: Example
[Figure: G and H over W1, W2, Y, Z1, Z2]
Descendants of Y in G: Y, Z1, Z2
Maximal descendant in H: D = Z2
Maximal child of Y in G that has D = Z2 as a descendant: Z2
Add W2 → Y
Difficult Case: Proof Intuition
[Figure: G and H showing W → Y, child Z, maximal descendant D, endpoints A and B, and the conditioning set CS]
1. W is not in CS
2. Y is not in CS, else the path is already active in H
3. In G, the subsequent edges must point away from Y until B or CS is reached
4. In G, neither Z nor its descendants are in CS, else the path was active before the addition
5. From (1), (2), (4): there are active paths (A,D) and (B,D) in H
6. By the choice of D: there is a directed path from D to B or to CS in H
Optimality of GES
Definition: p is DAG-perfect with respect to G if the independence constraints in p are precisely those in G
Assumption: the generative distribution p* is perfect with respect to some G* defined over the observable variables
S* = the equivalence class containing G*
Under the DAG-perfect assumption, GES results in S*
Important Definitions
• Bayesian Networks
• Markov Conditions
• Distribution/Structure Inclusion
• Structure/Structure Inclusion