
Information Processing Letters 108 (2008) 60–63


Approximation algorithms for restricted Bayesian network structures

Valentin Ziegler 1

Humboldt Universität zu Berlin, Institut für Informatik, 10099 Berlin, Germany
E-mail address: [email protected]

1 Supported by the DFG Research Center Matheon "Mathematics for key technologies".


Article history: Received 21 August 2007; received in revised form 10 February 2008; available online 29 March 2008.
Communicated by K. Iwama

Keywords: Approximation algorithms; Graph algorithms; Bayesian network; Structure learning

Bayesian network structures with a maximum in-degree of k can be approximated with respect to a positive scoring metric up to a factor of 1/k.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Bayesian networks [8,10,7] are widely used models of expert knowledge with applications including computational biology, artificial intelligence, medicine and text analysis.

A Bayesian network encodes probabilistic dependency relations among a set of random variables by a directed acyclic graph, where nodes correspond to variables and arcs represent dependence. The task of deducing this graphical structure from observed sets of variable assignments is commonly referred to as structure learning. To avoid overfitting the model to limited data and to reduce computing costs, the maximum number of parents of a node in the graph is often assumed to be limited by a constant k [7].

Structure learning has been shown to be NP-hard even in the restricted case where the in-degree of each node is bounded by two [3]. Therefore, only approximate solutions can be found in polynomial time, unless P = NP.

Many heuristics have been described to attack the structure learning problem, including simulated annealing [4], heuristic local search [7], ant colony optimization [2] and genetic algorithms [9]. However, there is no guaranteed approximation ratio with any of these approaches.

This paper introduces two simple approximation algorithms for learning graph structures with a maximum in-degree of k and proves approximation ratios of 1/(k + 1) and 1/k.

2. Preliminaries

A directed acyclic graph (DAG) on a set of random variables Var is a directed graph G = (V(G), E(G)) with V(G) = Var and E(G) ⊆ Var × Var, such that G contains no loops or directed cycles. Let (w, v) ∈ E(G); then w is said to be a parent of v ∈ V(G), and v is called a child of w. The set of all parents of v in a DAG G is denoted by pa_G(v). If |pa_G(v)| ≤ k for all v ∈ V(G), then G is a k-restricted DAG.

Let G = (V(G), E(G)) be a DAG, P ⊆ V(G) and w ∈ V(G) with pa_G(w) = ∅. We write G + (w, P) as a shorthand for the digraph G′ in which w has been given parents P, i.e.

E(G′) = E(G) ∪ ⋃_{v ∈ P} {(v, w)}.

Given a set V, an arborescence with root r ∈ V is an edge-minimal directed graph on V such that there is a path from r to every v ∈ V.

The problem of finding the optimal (k-restricted) Bayesian network structure for an observed data set on n variables Var = {v_1, ..., v_n} is to find a directed acyclic (k-restricted) graph G on Var which maximizes a given scoring metric score(G). Commonly used metrics are the Bayesian score [5], BDeu [1], the MDL score [11] and BDe [7]. These metrics are decomposable, that is, they can be written as a sum over independent parent scores for each variable:

score(G) = ∑_{v ∈ V(G)} nodescore(v, pa_G(v)),   (1)

where nodescore : Var × 2^Var → ℝ is a function that rates a choice for the parents of a variable.
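To make Eq. (1) concrete, here is a minimal Python sketch of a decomposable score, assuming graphs are stored as maps from each variable to its parent set; the score table is invented dummy data, not taken from the paper:

# Illustrative nodescore table; any (variable, parent set) pair not listed scores 0.
nodescore_table = {
    ("rain", frozenset()): 1.0,
    ("sprinkler", frozenset({"rain"})): 0.8,
    ("wet_grass", frozenset({"rain", "sprinkler"})): 2.5,
}

def nodescore(v, parents):
    """Nonnegative score for choosing `parents` as the parent set of v."""
    return nodescore_table.get((v, frozenset(parents)), 0.0)

def score(graph):
    """Decomposable score of Eq. (1) for a DAG given as {variable: parent set}."""
    return sum(nodescore(v, pa) for v, pa in graph.items())

G = {"rain": set(), "sprinkler": {"rain"}, "wet_grass": {"rain", "sprinkler"}}
print(score(G))  # 1.0 + 0.8 + 2.5 = 4.3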

The algorithms described in this paper deal with the k-restricted case; the score value of the optimal k-restricted solution will be denoted by OPT. The nodescore is assumed to be a nonnegative function (this can always be achieved by adding a sufficiently large constant to each value).

1  G ← (Var, ∅)
2  Connected ← ∅
3  while Connected ≠ Var do
4      Ext := {(v, P) | v ∈ Var \ Connected, P ⊆ Var, |P| ≤ k, G + (v, P) contains no cycle}
5      (w, Q) := argmax_{(v,P) ∈ Ext} nodescore(v, P)
6      Connected ← Connected ∪ {w}
7      G ← G + (w, Q)
   end
   return G

Algorithm 1. Greedy algorithm.

3. The Greedy algorithm

A simple, iterative approach to calculating an approximate k-restricted network structure is to start with a graph containing no edges and then, in each step, greedily add a parent set which gives the highest gain to the current digraph's score (Algorithm 1).
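Below is a rough Python sketch of Algorithm 1 (illustrative only, not the paper's code; the nodescore interface, the acyclicity test and the toy score table at the end are assumptions). In each round it enumerates, for every not yet connected node, all parent sets of size at most k that keep the graph acyclic, and adds the highest-scoring one:

from itertools import combinations

def would_create_cycle(parents, child, new_parents):
    """True if adding the edges p -> child (p in new_parents) to the DAG
    encoded by `parents` (node -> set of its parents) closes a directed cycle,
    i.e. if `child` is reachable by walking upwards from some new parent."""
    stack, seen = list(new_parents), set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return False

def greedy_structure(variables, nodescore, k):
    """Greedy scheme of Algorithm 1: repeatedly give one still unconnected node
    the best parent set of size <= k that keeps the graph acyclic."""
    parents = {v: set() for v in variables}            # G <- (Var, empty edge set)
    connected = set()                                  # Connected <- empty set
    while connected != set(variables):
        best = None                                    # (score, node, parent set)
        for v in set(variables) - connected:
            for size in range(k + 1):
                for P in combinations(sorted(set(variables) - {v}), size):
                    if would_create_cycle(parents, v, P):
                        continue
                    s = nodescore(v, frozenset(P))
                    if best is None or s > best[0]:
                        best = (s, v, frozenset(P))
        _, w, Q = best                                 # line 5: argmax over Ext
        connected.add(w)                               # line 6
        parents[w] = set(Q)                            # line 7: G <- G + (w, Q)
    return parents

# Tiny usage example with an invented score table (all unlisted scores are 0).
table = {("a", frozenset({"b"})): 2.0, ("b", frozenset()): 1.0,
         ("c", frozenset({"a", "b"})): 3.0}
print(greedy_structure(["a", "b", "c"], lambda v, P: table.get((v, P), 0.0), k=2))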

Theorem 1. Algorithm 1 has an approximation ratio of 1/(k + 1).

Proof. Obviously the returned graph G is a k-restricted DAG, thus score(G) ≤ OPT. It remains to show that score(G) ≥ (1/(k + 1)) OPT.

Let P_i denote the best parent set for the node v_i, i.e.

P_i := argmax_{P ⊆ Var, |P| ≤ k} nodescore(v_i, P),

and let

s_i := nodescore(v_i, P_i) for i ≤ n,   s_i := 0 for i > n.

Without loss of generality, we assume that

s_1 ≥ s_2 ≥ ... ≥ s_n.

The main loop of the algorithm is traversed exactly n times. We write Ext_t for the set Ext in step t and define L_t := Ext_t ∩ {(v_1, P_1), ..., (v_n, P_n)}. Having selected (w, Q) in line 5 and extended the graph in line 7, the at most k + 1 nodes {w} ∪ Q are connected by new edges in G. A node not incident to any edges may still be assigned any parent set, therefore

|L_1| = n,   |L_t| ≥ n − (k + 1)(t − 1).

The score gain s′_t within step t is

s′_t = max_{(v,P) ∈ Ext_t} nodescore(v, P) ≥ max_{(v,P) ∈ L_t} nodescore(v, P) ≥ s_{1+(k+1)(t−1)},

which implies a lower bound on the score of the returned DAG G:

score(G) = ∑_{t=1}^{n} s′_t ≥ ∑_{t=1}^{n} s_{1+(k+1)(t−1)} ≥ ∑_{t=1}^{n} (1/(k+1)) ∑_{j=1}^{k+1} s_{j+(k+1)(t−1)}
         = (1/(k+1)) ∑_{i=1}^{(k+1)n} s_i = (1/(k+1)) ∑_{i=1}^{n} s_i ≥ (1/(k+1)) OPT,

where the last inequality holds because s_i ≥ nodescore(v_i, pa_{G_OPT}(v_i)) for an optimal k-restricted DAG G_OPT. □

Remark. The analysis of Algorithm 1 is tight, as the instance below shows. Let Var := {v, w_1, ..., w_k, r_1, ..., r_k} and

nodescore(v, {w_1, ..., w_k}) = 1,
nodescore(v, {r_1, ..., r_k}) = 1 − ε,
nodescore(w_i, {v, r_{[i]_k}, ..., r_{[k+i−2]_k}}) = 1 − ε for all i,

with all remaining nodescores equal to zero.

Algorithm 1 will choose (v, {w_1, ..., w_k}) in line 5 of the first iteration, thus obstructing any other parent set with positive nodescore in the remaining iterations. The resulting DAG will have a score of 1, whereas the optimum

G* := G_0 + (v, {r_1, ..., r_k}) + ∑_{i=1}^{k} (w_i, {v, r_{[i]_k}, ..., r_{[k+i−2]_k}}),

with G_0 = (Var, ∅), has score(G*) = (k + 1)(1 − ε).
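For concreteness (this instantiation is not in the paper), take k = 2, Var = {v, w_1, w_2, r_1, r_2} and nodescore(v, {w_1, w_2}) = 1, nodescore(v, {r_1, r_2}) = nodescore(w_1, {v, r_1}) = nodescore(w_2, {v, r_2}) = 1 − ε. The greedy algorithm first picks (v, {w_1, w_2}); afterwards v is already connected and each w_i's only positive-score parent set contains v, which would close a cycle, so the returned DAG has score 1, while the optimum scores 3(1 − ε). For ε = 0.1 the ratio is 1/2.7 ≈ 0.37, and it tends to 1/(k + 1) = 1/3 as ε → 0.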

4. Approximation using arborescences

A better approximation ratio is achieved by Algorithm 2, which uses a different approach. The key idea is to "project" the structure learning instance onto an instance of a maximum weight arborescence problem on the set of nodes Var with an additional root node R. In the corresponding arborescence problem (which can be solved in polynomial time [6]), parent sets are represented by single parent nodes, whereas the root R represents the empty parent set. Due to the loss of information in this projection, expanding the resulting arborescence into a digraph with more than one parent per node will yield cycles, which have to be removed; a Python sketch of the whole pipeline follows Algorithm 2 below.


1   w(a, b) := max_{|X| ≤ k, a ∈ X} nodescore(b, X)   for all a and all b ≠ a
2   w(R, b) := nodescore(b, ∅)   for all b
3   T ← maximum weight arborescence on (Var ∪ {R}, w) with root R
4   forall (x, y) ∈ T do
5       S(y) := w(x, y)
6       P(y) := argmax_{|X| ≤ k, x ∈ X} nodescore(y, X), resp. ∅ if x = R
    end
7   Let v_1, ..., v_n be an ascending ordering of Var according to dist(R, v_i), where dist(R, v_i) is the number of edges on the path from R to v_i in T.
8   v_n.c ← 1
9   for i = n − 1, ..., 1 do
10      v_i.c ← min{c ∈ ℕ | c ≠ v_j.c for all j > i with v_j ∈ P(v_i)}
    end
11  G_0 ← (Var, ∅)
12  c* := argmax_i ∑_{v_j.c = i} S(v_j)
13  return G_0 + ∑_{v_j.c = c*} (v_j, P(v_j))

Algorithm 2. Arborescence approximation.
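The sketch below is an illustrative Python rendering of Algorithm 2, not the paper's implementation. It assumes the same hypothetical nodescore callable as in the earlier sketches and uses networkx, whose maximum_spanning_arborescence routine implements Edmonds' optimum branching algorithm [6]; since no edges point into the root, every spanning arborescence it can return is rooted at R.

from itertools import combinations
import networkx as nx

def arborescence_approx(variables, nodescore, k, root="R"):
    """Sketch of Algorithm 2: project to a maximum weight arborescence,
    colour the nodes, and keep the best colour class of parent sets."""
    def best_set_through(y, x):
        # Line 6: argmax over parent sets X of y with |X| <= k and x in X.
        others = [u for u in variables if u not in (x, y)]
        candidates = [frozenset({x}) | frozenset(c)
                      for size in range(k) for c in combinations(others, size)]
        return max(candidates, key=lambda X: nodescore(y, X))

    # Lines 1-2: projected edge weights w(a, b); no edges point into the root.
    D = nx.DiGraph()
    for b in variables:
        D.add_edge(root, b, weight=nodescore(b, frozenset()))
        for a in variables:
            if a != b:
                D.add_edge(a, b, weight=nodescore(b, best_set_through(b, a)))

    # Line 3: maximum weight spanning arborescence (Edmonds).
    T = nx.maximum_spanning_arborescence(D, attr="weight")

    # Lines 4-6: the unique in-edge (x, y) of each node determines S(y) and P(y).
    S, P = {}, {}
    for x, y in T.edges():
        S[y] = D[x][y]["weight"]
        P[y] = frozenset() if x == root else best_set_through(y, x)

    # Lines 7-10: colour nodes, farthest from the root first; a node's colour
    # must differ from the colours of its already coloured parents.
    dist = nx.shortest_path_length(T, source=root)
    order = sorted(variables, key=lambda v: dist[v])
    colour = {}
    for v in reversed(order):
        used = {colour[u] for u in P[v] if u in colour}
        c = 1
        while c in used:
            c += 1
        colour[v] = c

    # Lines 11-13: return the colour class with the largest total S; all
    # other nodes keep an empty parent set, as in G_0.
    best_c = max(set(colour.values()),
                 key=lambda c: sum(S[v] for v in variables if colour[v] == c))
    return {v: set(P[v]) if colour[v] == best_c else set() for v in variables}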

Theorem 2. Let score_A be the score of the graph returned by Algorithm 2. Then score_A satisfies

OPT ≥ score_A ≥ (1/k) OPT,

i.e., the algorithm has an approximation ratio of 1/k.

Proof. Directly from the definition of w, S and P in lines 1–6 of the algorithm, we get

score_A = max_i ∑_{v_j.c = i} nodescore(v_j, P(v_j)) = max_i ∑_{v_j.c = i} S(v_j).

Let G_OPT be the optimal k-restricted DAG for the problem instance and consider an arborescence T′ on Var ∪ {R} that contains, for every node with a nonempty parent set in G_OPT, exactly one of its incoming edges in G_OPT, together with an edge from R to every node with an empty parent set in G_OPT. From the definition of w(x, y) it follows that

OPT ≤ ∑_{(a,b) ∈ T′} w(a, b) ≤ ∑_{(a,b) ∈ T} w(a, b) = ∑_j S(v_j).   (2)

In addition, every v_i ∈ Var has at most k − 1 parents v_j ∈ P(v_i) with j > i, because if P(v_i) ≠ ∅ there is one v_l ∈ P(v_i) such that (v_l, v_i) ∈ T; thus dist(R, v_l) < dist(R, v_i) and l < i. Therefore v_i.c ≤ k for all i, and

score_A = max_i ∑_{v_j.c = i} S(v_j) ≥ (1/k) ∑_j S(v_j).   (3)

Combining the inequalities (2) and (3) one gets

score_A ≥ (1/k) OPT.

It remains to show that the returned graph is acyclic, i.e., that score_A is indeed a lower bound on OPT. Consider the family of graphs

G_i := G_0 + ∑_{v_j.c = i} (v_j, P(v_j)),   where G_0 := (Var, ∅).

To see that all these graphs are DAGs, assume that G_i contains a cycle C. Then there must be an edge (v_l, v_m) ∈ C with l > m. Therefore v_l.c ≠ v_m.c, which implies that one of the two nodes has an empty parent set in G_i, contradicting the existence of C. □

Remark. The analysis of Algorithm 2 is tight:

Let Var = {x_1, ..., x_k, r} and

nodescore(x_i, {r} ∪ ⋃_{j ≠ i} {x_j}) = 1 + iε,
nodescore(r, {x_k}) = 1,
nodescore(x_i, {x_k}) = 1 for i ≠ k,

with all remaining nodescores equal to zero.

With edge weights w as in Algorithm 2, the optimal arborescence T on Var ∪ {R} has edges (R, r), (r, x_1), ..., (r, x_k) and total weight k + ε′, where ε′ = ε ∑_{i=1}^{k} i. Only one of the expanded parent sets P(x_i) = {r} ∪ ⋃_{j ≠ i} {x_j} can be chosen without creating a cycle, i.e., x_i.c ≠ x_j.c for all i ≠ j. The value of score_A is 1 + kε, whereas in the optimal DAG x_k has no parents and every other node as a child, so OPT = k.
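As a concrete check (not in the paper), take k = 2 and ε = 0.01, so Var = {x_1, x_2, r} with nodescore(x_1, {r, x_2}) = 1.01, nodescore(x_2, {r, x_1}) = 1.02 and nodescore(r, {x_2}) = nodescore(x_1, {x_2}) = 1. Algorithm 2 can keep only one of the two large parent sets and returns score_A = 1.02, whereas the optimal DAG (x_2 without parents, r and x_1 each with parent x_2) has score 2; as ε → 0 the ratio tends to 1/k = 1/2.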

References

[1] W. Buntine, Theory refinement on Bayesian networks, Uncertainty in Artificial Intelligence 7 (1991) 52–60.

[2] L.M. de Campos, J.M. Fernández-Luna, J.A. Gámez, J.M. Puerta, Ant colony optimization for learning Bayesian networks, International Journal of Approximate Reasoning 31 (2002) 291–311.

[3] D.M. Chickering, Learning Bayesian networks is NP-complete, in: D. Fisher, H.J. Lenz (Eds.), Artificial Intelligence and Statistics V, Springer-Verlag, 1996, pp. 121–130.

[4] D.M. Chickering, D. Geiger, D. Heckerman, Learning Bayesian networks: Search methods and experimental results, in: Proceedings of the Fifth Conference on Artificial Intelligence and Statistics, 1995, pp. 112–128.

[5] G. Cooper, E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Machine Learning 9 (1992) 309–347.

[6] J. Edmonds, Optimum branchings, Journal of Research of the National Bureau of Standards 71B (1967) 233–240.

[7] D. Heckerman, D. Geiger, D.M. Chickering, Learning Bayesian networks: The combination of knowledge and statistical data, Machine Learning 20 (3) (1995) 197–243.


[8] R. Howard, J. Matheson, Influence diagrams, in: Readings on the Principles and Applications of Decision Analysis, vol. II, 1981, pp. 721–762.

[9] P. Larrañaga, M. Poza, Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (9) (1996) 912–926.

[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA, 1988.

[11] J. Suzuki, A construction of Bayesian networks from databases based on an MDL scheme, in: D. Heckerman, A. Mamdani (Eds.), Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Mateo, CA, 1993, pp. 266–273.