Information Processing Letters 108 (2008) 60–63
Approximation algorithms for restricted Bayesian network structures
Valentin Ziegler 1
Humboldt Universität zu Berlin, Institut für Informatik, 10099 Berlin, Germany
Article history: Received 21 August 2007; received in revised form 10 February 2008; available online 29 March 2008. Communicated by K. Iwama.

Keywords: Approximation algorithms; Graph algorithms; Bayesian network; Structure learning

Abstract. Bayesian network structures with a maximum in-degree of k can be approximated with respect to a positive scoring metric up to a factor of 1/k.
© 2008 Elsevier B.V. All rights reserved.
1. Introduction
Bayesian networks [8,10,7] are widely used models of expert knowledge with applications including computational biology, artificial intelligence, medicine and text analysis.
A Bayesian network encodes probabilistic dependency relations among a set of random variables by a directed acyclic graph, where nodes correspond to variables and arcs represent dependence. The task of deducing this graphical structure from observed sets of variable assignments is commonly referred to as structure learning. To avoid overfitting the model to limited data and to reduce computing costs, the maximum number of parents of a node in the graph is often assumed to be limited by a constant k [7].
Structure learning has been shown to be NP-hard even in the restricted case where the in-degree of each node is bounded by two [3]. Therefore, only approximate solutions can be found in polynomial time, unless P = NP.
Many heuristics have been described to attack the structure learning problem, including simulated annealing [4], heuristic local search [7], ant colony optimization [2] and genetic algorithms [9]. However, none of these approaches comes with a guaranteed approximation ratio.

E-mail address: [email protected]
1 Supported by the DFG Research Center Matheon “Mathematics for key technologies”.
doi:10.1016/j.ipl.2008.03.015
This paper introduces two simple approximation algorithms for learning graph structures with a maximum in-degree k and proves approximation ratios of 1/(k + 1) and 1/k.
2. Preliminaries
A directed acyclic graph (DAG) on a set of random variables Var is a directed graph G = (V(G), E(G)) with V(G) = Var and E(G) ⊆ Var × Var, such that G contains no loops or directed cycles. Let (w, v) ∈ E(G); then w is said to be a parent of v ∈ V(G), and v is called a child of w. The set of all parents of v in a DAG G is denoted by paG(v). If |paG(v)| ≤ k for all v ∈ V(G), then G is a k-restricted DAG.
Let G = (V(G), E(G)) be a DAG, P ⊆ V(G) and w ∈ V(G) with paG(w) = ∅. We write G + (w, P) as a shorthand for the digraph G′ in which w has been given parents P, i.e.

E(G′) = E(G) ∪ {(v, w) | v ∈ P}.
Given a set V, an arborescence with root r ∈ V is an edge-minimal directed graph on V such that there is a path from r to every v ∈ V.
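The definitions above can be sketched in code. In this minimal sketch (names `is_acyclic`, `is_k_restricted` and `add_parents` are illustrative, not from the paper), a digraph is stored as a dict mapping each variable to its parent set paG(v):

```python
# Sketch of the Section 2 definitions, assuming a digraph is stored as a
# dict mapping each variable to its parent set pa_G(v).

def is_acyclic(parents):
    """Check the digraph {(w, v) : w in parents[v]} for directed cycles
    by repeatedly removing nodes all of whose parents are removed."""
    remaining = set(parents)
    changed = True
    while changed:
        changed = False
        for v in list(remaining):
            if not (parents[v] & remaining):  # all parents already removed
                remaining.remove(v)
                changed = True
    return not remaining  # leftover nodes mean a directed cycle exists

def is_k_restricted(parents, k):
    """|pa_G(v)| <= k for all v."""
    return all(len(p) <= k for p in parents.values())

def add_parents(parents, w, new_parents):
    """G + (w, P): give the (previously parentless) node w parents P."""
    assert not parents[w]
    updated = dict(parents)
    updated[w] = set(new_parents)
    return updated

# Example: give v the parents {a, b} on variables {a, b, v}
G = {"a": set(), "b": set(), "v": set()}
G = add_parents(G, "v", {"a", "b"})
assert is_acyclic(G) and is_k_restricted(G, k=2)
```
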
1 G ← (Var, ∅)
2 Connected ← ∅
3 while Connected ≠ Var do
4     Ext := {(v, P) | v ∈ Var \ Connected, P ⊆ Var, |P| ≤ k, G + (v, P) contains no cycle}
5     (w, Q) := argmax_{(v,P)∈Ext} nodescore(v, P)
6     Connected ← Connected ∪ {w}
7     G ← G + (w, Q)
  end
  return G

Algorithm 1. Greedy algorithm.

The problem of finding the optimal (k-restricted) Bayesian network structure for an observed data set on n variables Var = {v1, ..., vn} is to find a directed acyclic (k-restricted) graph G on Var which maximizes a given scoring metric score(G). Commonly used metrics are the Bayesian score [5], BDeu [1], the MDL score [11] and BDe [7]. These metrics are decomposable, that is, they can be written as a sum over independent parent scores for each variable:
score(G) = ∑_{v ∈ V(G)} nodescore(v, paG(v)),  (1)
where nodescore : Var × 2^Var → ℝ is a function that rates a choice for the parents of a variable.
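Decomposability as in Eq. (1) can be illustrated directly. The table below is a toy stand-in for any of the metrics named in the text (BDeu, MDL, ...), not an actual scoring metric:

```python
# Eq. (1): a decomposable score is a sum of independent per-node terms.
# `nodescore` here is an illustrative lookup table, keyed by
# (variable, frozenset of parents).

nodescore = {
    ("v", frozenset()): 0.5,
    ("v", frozenset({"a"})): 2.0,
    ("a", frozenset()): 1.0,
}

def score(parents):
    """score(G) = sum over v in V(G) of nodescore(v, pa_G(v))."""
    return sum(nodescore[(v, frozenset(p))] for v, p in parents.items())

G_empty = {"v": set(), "a": set()}   # no edges
G_edge = {"v": {"a"}, "a": set()}    # edge (a, v)
assert score(G_empty) == 1.5         # 0.5 + 1.0
assert score(G_edge) == 3.0          # 2.0 + 1.0
```
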
The algorithms described in this paper deal with the k-restricted case; the score value of the optimal k-restricted solution will be denoted by OPT. The nodescore is assumed to be a nonnegative function (this can always be achieved by adding a sufficiently large constant to each value).
3. The Greedy algorithm
A simple, iterative approach to calculate an approximate k-restricted network structure is to start with a graph containing no edges, and then in each step greedily add a parent set which gives the highest gain to the current digraph's score (Algorithm 1).
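A direct, unoptimized rendering of Algorithm 1 might look as follows. This is only a sketch for small Var: the set Ext is enumerated brute-force over all parent sets of size at most k, and `nodescore` is assumed given as a function on (node, frozenset of parents):

```python
from itertools import combinations

def has_cycle(parents):
    """Detect a directed cycle by peeling off parentless nodes."""
    remaining = set(parents)
    changed = True
    while changed:
        changed = False
        for v in list(remaining):
            if not (parents[v] & remaining):
                remaining.remove(v)
                changed = True
    return bool(remaining)

def greedy(Var, k, nodescore):
    """Algorithm 1 (greedy), returning a parent-set dict for Var."""
    G = {v: set() for v in Var}            # line 1: start with no edges
    connected = set()                      # line 2
    while connected != set(Var):           # line 3
        best, best_val = None, float("-inf")
        for v in set(Var) - connected:     # line 4: enumerate Ext
            for size in range(k + 1):
                for P in combinations(set(Var) - {v}, size):
                    cand = dict(G)
                    cand[v] = set(P)
                    if has_cycle(cand):
                        continue
                    val = nodescore(v, frozenset(P))
                    if val > best_val:     # line 5: argmax over Ext
                        best, best_val = (v, set(P)), val
        w, Q = best
        connected.add(w)                   # line 6
        G[w] = Q                           # line 7
    return G
```

Since the empty parent set is always in Ext and nodescore is nonnegative, a maximizer always exists in line 5.
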
Theorem 1. Algorithm 1 has an approximation ratio of 1/(k + 1).
Proof. Obviously the returned graph G is a k-restricted DAG, thus score(G) ≤ OPT. It remains to show that score(G) ≥ (1/(k + 1)) OPT.

Let Pi denote the best parent set for the node vi, i.e.

Pi := argmax_{P ⊆ Var, |P| ≤ k} nodescore(vi, P),

and let

si := nodescore(vi, Pi) for i ≤ n, and si := 0 for i > n.

Without loss of generality, we assume that

s1 ≥ s2 ≥ ... ≥ sn.
The main loop of the algorithm is traversed exactly n times. We write Extt for the set Ext in step t and define Lt := Extt ∩ {(v1, P1), ..., (vn, Pn)}. Having selected (w, Q) in line 5 and extended the graph in line 7, the at most k + 1 nodes {w} ∪ Q are connected by new edges in G. A node not incident to any edges may still be assigned any parent set, therefore

|L1| = n,  |Lt| ≥ n − (k + 1)(t − 1).
The score gain s′t within step t is

s′t = max_{(v,P) ∈ Extt} nodescore(v, P) ≥ max_{(v,P) ∈ Lt} nodescore(v, P) ≥ s_{1+(k+1)(t−1)},

which implies a lower bound on the score of the returned DAG G:

score(G) = ∑_{t=1}^{n} s′t ≥ ∑_{t=1}^{n} s_{1+(k+1)(t−1)} ≥ ∑_{t=1}^{n} (1/(k + 1)) ∑_{j=1}^{k+1} s_{j+(k+1)(t−1)}
= (1/(k + 1)) ∑_{i=1}^{(k+1)n} si = (1/(k + 1)) ∑_{i=1}^{n} si ≥ (1/(k + 1)) OPT. □
Remark. The analysis of Algorithm 1 is tight, as the instance below shows: Let Var := {v, w1, ..., wk, r1, ..., rk} and

nodescore(v, {w1, ..., wk}) = 1,
nodescore(v, {r1, ..., rk}) = 1 − ε,
nodescore(wi, {v, r_[i]k, ..., r_[k+i−2]k}) = 1 − ε for all i,

with all remaining nodescores equal to zero.

Algorithm 1 will choose (v, {w1, ..., wk}) in line 5 of the first iteration, thus obstructing any other parent set with positive nodescore in the remaining iterations. The resulting DAG will have a score of 1, whereas the optimum

G* := G0 + (v, {r1, ..., rk}) + ∑_{i=1}^{k} (wi, {v, r_[i]k, ..., r_[k+i−2]k}),

with G0 = (Var, ∅), has score(G*) = (k + 1)(1 − ε).
4. Approximation using arborescences
A better approximation ratio is achieved by Algorithm 2, which uses a different approach. The key idea is to “project” the structure learning instance onto an instance of a maximum weight arborescence problem on the set of nodes Var with an additional root node R. In the corresponding arborescence problem (which can be solved in polynomial time [6]), parent sets are represented by single parent nodes, whereas the root R represents the empty parent set. Due to the loss of information in this projection, expanding the resulting arborescence into a digraph with more than one parent per node may yield cycles, which have to be removed.
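The projection step (lines 1–2 of Algorithm 2) can be sketched as follows. The helper name `projected_weights` and the root label are illustrative; `nodescore` is any decomposable metric as in Section 2:

```python
from itertools import combinations

ROOT = "R"  # artificial root node representing the empty parent set

def projected_weights(Var, k, nodescore):
    """Edge weights for the maximum weight arborescence instance:
    w(a, b) is the best score of a parent set for b that contains a,
    and w(R, b) is the score of giving b no parents at all."""
    w = {}
    for b in Var:
        w[(ROOT, b)] = nodescore(b, frozenset())
        for a in Var:
            if a == b:
                continue
            # best parent set X for b with a in X and |X| <= k
            w[(a, b)] = max(
                nodescore(b, frozenset(P) | {a})
                for size in range(k)
                for P in combinations(set(Var) - {a, b}, size)
            )
    return w
```

A maximum weight arborescence on (Var ∪ {R}, w) rooted at R can then be computed with Edmonds' branching algorithm [6].
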
1  w(a, b) := max_{|X| ≤ k, a ∈ X} nodescore(b, X)  ∀ a, b ≠ a
2  w(R, b) := nodescore(b, ∅)  ∀ b
3  T ← maximum weight arborescence on (Var ∪ {R}, w) with root R
4  forall (x, y) ∈ T do
5      S(y) := w(x, y)
6      P(y) := argmax_{|X| ≤ k, x ∈ X} nodescore(y, X), resp. ∅ if x = R
   end
7  Let v1, ..., vn be an ascending ordering of Var according to dist(R, vi), where dist(R, vi) is the number of edges on the path from R to vi.
8  vn.c ← 1
9  for i = n − 1, ..., 1 do
10     vi.c ← min{c ∈ ℕ | c ≠ vj.c for all j > i with vj ∈ P(vi)}
   end
11 G0 ← (Var, ∅)
12 c := argmax_i ∑_{vj.c=i} S(vj)
13 return G0 + ∑_{vj.c=c} (vj, P(vj))

Algorithm 2. Arborescence approximation.
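The cycle-removal phase (lines 7–13) might be sketched as follows, assuming the arborescence has already been computed and expanded into parent sets P with scores S (lines 3–6); the function name `color_and_select` is illustrative:

```python
def color_and_select(order, P, S):
    """Lines 7-13 of the arborescence approximation.

    order -- the nodes of Var ascending by dist(R, v) (line 7)
    P     -- expanded parent set for each node (lines 4-6)
    S     -- score of that parent set (lines 4-6)
    """
    color = {}
    # lines 8-10: color nodes from v_n down to v_1, avoiding the colors
    # of already-colored (i.e. later-indexed) parents
    for i in range(len(order) - 1, -1, -1):
        v = order[i]
        used = {color[u] for u in P[v] if u in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    # lines 11-13: keep the parent sets of the best color class only;
    # all other nodes keep the empty parent set
    best = max(set(color.values()),
               key=lambda c: sum(S[v] for v in order if color[v] == c))
    return {v: (P[v] if color[v] == best else set()) for v in order}
```

By the argument in the proof of Theorem 2, at most k colors are used, and each color class induces an acyclic graph.
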
Theorem 2. Let scoreA be the score of the graph returned by Algorithm 2. Then scoreA satisfies

OPT ≥ scoreA ≥ (1/k) OPT,

i.e., the algorithm has an approximation ratio of 1/k.
Proof. Directly by the definition of w, S and P in lines 1–6 of the algorithm, we get

scoreA = max_i ∑_{vj.c=i} nodescore(vj, P(vj)) = max_i ∑_{vj.c=i} S(vj).
Let GOPT be the optimal k-restricted DAG for the problem instance and consider an arborescence T′ on Var ∪ {R} that only contains reversed edges from GOPT together with an edge from R to every node with an empty parent set in GOPT. From the definition of w(x, y) it follows that

OPT ≤ ∑_{(a,b) ∈ T′} w(a, b) ≤ ∑_{(a,b) ∈ T} w(a, b) = ∑_j S(vj).  (2)
In addition, every vi ∈ Var has at most k − 1 parents vj ∈ P(vi) with j > i, because if P(vi) ≠ ∅ there is one vℓ ∈ P(vi) such that (vℓ, vi) ∈ T; thus dist(R, vℓ) < dist(R, vi) and ℓ < i. Therefore vi.c ≤ k for all i, and

scoreA = max_i ∑_{vj.c=i} S(vj) ≥ (1/k) ∑_j S(vj).  (3)

Combining the inequalities (2) and (3) one gets

scoreA ≥ (1/k) OPT.
It remains to show that the returned graph is acyclic, i.e., that scoreA is indeed a lower bound on OPT. Consider the family of graphs

Gi := G0 + ∑_{vj.c=i} (vj, P(vj)), where G0 := (Var, ∅).

To see that all these graphs are DAGs, assume that Gi contains a cycle C. Every node on C must have a nonempty parent set in Gi, and since the node indices along C cannot increase forever, there must be an edge (vl, vm) ∈ C with l > m. By the coloring rule in line 10, vm.c ≠ vl.c, which implies that one of the two nodes has an empty parent set in Gi, contradicting the existence of C. □

Remark. The analysis of Algorithm 2 is tight:
Let Var = {x1, ..., xk, r} and

nodescore(xi, {r} ∪ ⋃_{j≠i} {xj}) = 1 + iε,
nodescore(r, {xk}) = 1,
nodescore(xi, {xk}) = 1 for i ≠ k,

with all remaining nodescores equal to zero.

With edge weights w as in Algorithm 2, the optimal arborescence T on Var ∪ {R} has edges (R, r), (r, x1), ..., (r, xk) and total weight k + ε′. Only one of the expanded parent sets P(xi) = {r} ∪ ⋃_{j≠i} {xj} can be chosen without creating a cycle, i.e., xi.c ≠ xj.c for all i ≠ j. The value of scoreA is 1 + kε, whereas in the optimal DAG xk has no parents and every other node as child, so OPT = k.
References
[1] W. Buntine, Theory refinement on Bayesian networks, Uncertainty in Artificial Intelligence 7 (1991) 52–60.
[2] L.M. de Campos, J.M. Fernández-Luna, J.A. Gámez, J.M. Puerta, Ant colony optimization for learning Bayesian networks, International Journal of Approximate Reasoning 31 (2002) 291–311.
[3] D.M. Chickering, Learning Bayesian networks is NP-complete, in: D. Fisher, H.J. Lenz (Eds.), Learning from Data: Artificial Intelligence and Statistics V, Springer-Verlag, 1996, pp. 121–130.
[4] D.M. Chickering, D. Geiger, D. Heckerman, Learning Bayesian networks: Search methods and experimental results, in: Proceedings of the Fifth Conference on Artificial Intelligence and Statistics, 1995, pp. 112–128.
[5] G. Cooper, E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Machine Learning 9 (1992) 309–347.
[6] J. Edmonds, Optimum branchings, Journal of Research of the National Bureau of Standards 71B (1967) 233–240.
[7] D. Heckerman, D. Geiger, D.M. Chickering, Learning Bayesian networks: The combination of knowledge and statistical data, Machine Learning 20 (3) (1995) 197–243.
[8] R. Howard, J. Matheson, Influence diagrams, in: Readings on the Principles and Applications of Decision Analysis, vol. II, 1981, pp. 721–762.
[9] P. Larrañaga, M. Poza, Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (9) (1996) 912–926.
[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA, 1988.
[11] J. Suzuki, A construction of Bayesian networks from databases based on an MDL scheme, in: D. Heckerman, A. Mamdani (Eds.), Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Mateo, CA, 1993, pp. 266–273.