
T.B. Ho, D. Cheung, and H. Liu (Eds.): PAKDD 2005, LNAI 3518, pp. 474 – 479, 2005. © Springer-Verlag Berlin Heidelberg 2005

Learning Bayesian Networks Structures from Incomplete Data: An Efficient Approach Based on Extended Evolutionary Programming

Xiaolin Li1, Xiangdong He2, and Senmiao Yuan1

1 College of Computer Science and Technology, Jilin University, Changchun 130012, China
[email protected], [email protected]
2 VAS of China Operations, Vanda Group, Changchun 130012, China
[email protected]

Abstract. This paper describes a new data mining algorithm for learning Bayesian network structures from incomplete data, based on an extended Evolutionary Programming (EP) method and the Minimum Description Length (MDL) metric. This problem is characterized by a huge solution space with a highly multimodal landscape. The algorithm defines a fitness function based on expectation, which converts incomplete data to complete data utilizing the current best structure of the evolutionary process. Aiming at preventing and overcoming premature convergence, the algorithm combines the restart strategy into EP. The experimental results illustrate that our algorithm can learn a good structure from incomplete data.

1 Introduction

The Bayesian belief network is a powerful knowledge representation and reasoning tool under conditions of uncertainty. Recently, learning the Bayesian network from a database has drawn noticeable attention of researchers in the field of artificial intelligence. To this end, researchers have developed many algorithms to induce a Bayesian network from a given database [1], [2], [3], [4], [5], [6].

Very recently, researchers have begun to tackle the problem of learning the network from incomplete data. A major stumbling block in this research is that, when data are incomplete, closed-form expressions do not exist for the scoring metric used to evaluate network structures. This has led many researchers down the path of estimating the score using parametric approaches such as the expectation-maximization (EM) algorithm [7]. However, it has been noted [7] that the search landscape is large and multimodal, and deterministic search algorithms find only local optima. An obvious choice to combat the problem is to use a stochastic search method.

This paper develops a new data mining algorithm for learning Bayesian network structures from incomplete data, based on an extended Evolutionary Programming (EP) method and the Minimum Description Length (MDL) metric. The algorithm defines a fitness function by using expectation, which converts incomplete data to complete data utilizing the current best structure of the evolutionary process. Another important characteristic of our algorithm is that, in order to prevent and overcome premature convergence, we combine the restart strategy [8] into EP. Furthermore, our algorithm, like some previous work, does not need to impose the restriction of having a complete variable ordering as input.

We begin by briefly introducing the Bayesian network and the MDL metric. Next, we introduce the restart-EP method. In Section 4, we describe the algorithm based on the restart-EP method and the MDL metric. Finally, we conduct a series of experiments to demonstrate the performance of our algorithm and sum up the whole paper in Sections 5 and 6, respectively.

2 Bayesian Network and MDL Metric

2.1 Bayesian Network

A Bayesian network is a directed acyclic graph (DAG) whose nodes are labeled with random variables and annotated with the conditional probability tables of the node variable given its parents in the graph. The joint probability distribution (JPD) is then expressed by the formula:

p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \pi(x_i))    (1)

where π(x_i) is the configuration of X_i's parent node set Π(X_i).
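To make equation 1 concrete, the following minimal Python sketch evaluates the factored joint probability of a complete assignment. The data structures (dictionaries for parent sets and conditional probability tables) and all names are illustrative assumptions, not the paper's implementation.

def joint_probability(assignment, parents, cpt):
    # assignment: {variable: value} for every node in the network
    # parents:    {variable: tuple of parent variables}
    # cpt:        {variable: {(value, parent_value_tuple): probability}}
    p = 1.0
    for node, value in assignment.items():
        parent_config = tuple(assignment[pa] for pa in parents[node])
        p *= cpt[node][(value, parent_config)]
    return p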

2.2 The MDL Metric

The MDL metric [9] is derived from information theory. Composed of a description length for the network structure and a description length for the data, the MDL metric tries to balance model accuracy against complexity. Under this metric, a better network has a smaller score. Similar to other metrics, the MDL score for a Bayesian network S is decomposable and can be written as in equation 2: the MDL score of the network is simply the summation of the MDL scores of Π(X_i) of every node X_i in the network.

MDL(S) = \sum_{i} MDL(X_i, \Pi(X_i))    (2)

According to the decomposability of the MDL metric, when we learn Bayesian networks from complete data, equation 2 can be written as follows:

MDL(S) = -N \sum_{i=1}^{n} \sum_{X_i, \Pi(X_i)} P(X_i, \Pi(X_i)) \log P(X_i \mid \Pi(X_i)) + \frac{\log N}{2} \sum_{i=1}^{n} \|\Pi(X_i)\| \, (\|X_i\| - 1)    (3)

where N is the number of cases in the data set, ||X_i|| denotes the number of states of X_i, and ||Π(X_i)|| denotes the number of distinct configurations of the parent set Π(X_i).
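As a rough illustration of equation 3, the sketch below scores a set of parent assignments against a complete data set: the first term penalizes poor fit via the empirical conditional log-likelihood, the second penalizes model complexity. The dictionary-based data layout, the variable names, and the exact form of the reconstructed equation are assumptions rather than the paper's code.

import math
from collections import Counter

def mdl_score(data, parents, arity):
    # data:    list of complete cases, each a dict {variable: value}
    # parents: {variable: tuple of parent variables}
    # arity:   {variable: number of possible states}
    # Returns the MDL score; a smaller value indicates a better network.
    N = len(data)
    score = 0.0
    for x, pa in parents.items():
        joint = Counter((case[x], tuple(case[p] for p in pa)) for case in data)
        marg = Counter(tuple(case[p] for p in pa) for case in data)
        # Data description length: -N * sum_{x,pa} P(x, pa) * log P(x | pa).
        for (xv, pv), n_xp in joint.items():
            score -= N * (n_xp / N) * math.log(n_xp / marg[pv])
        # Network description length: (log N / 2) * ||Pi(X)|| * (||X|| - 1).
        n_parent_configs = 1
        for p in pa:
            n_parent_configs *= arity[p]
        score += (math.log(N) / 2.0) * n_parent_configs * (arity[x] - 1)
    return score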


3 Restart-EP

Although EP was first proposed as an evolutionary approach to artificial intelligence, it has recently been applied successfully to many numerical and combinatorial optimization problems.

One of EP's key features is its self-adaptation scheme. In EP, mutation is typically the only operator used to generate new offspring. The mutation is often implemented by adding to the parent a random number drawn from a certain distribution, typically a Gaussian. An important parameter of the Gaussian distribution is its standard deviation (or, equivalently, its variance). In the widely used self-adaptation scheme of EP, this parameter is evolved, rather than manually fixed, along with the objective variables.
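The lognormal self-adaptation rule sketched below is the commonly used scheme this paragraph refers to; the paper adapts the idea to structural mutation of DAGs, so this continuous-variable version is only an illustration, and the parameter names (tau, tau_prime) are the usual textbook choices rather than values taken from the paper.

import math
import random

def self_adaptive_mutation(x, eta, tau, tau_prime):
    # x:   list of objective variables
    # eta: list of self-adaptive standard deviations, one per variable
    # The standard deviations are evolved (lognormal update) together with x.
    common = random.gauss(0.0, 1.0)
    child_eta = [e * math.exp(tau_prime * common + tau * random.gauss(0.0, 1.0))
                 for e in eta]
    child_x = [xi + e * random.gauss(0.0, 1.0) for xi, e in zip(x, child_eta)]
    return child_x, child_eta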

Premature convergence is a serious issue in evolutionary algorithms since it may significantly degrade the overall performance. EP falls into local optima easily. When a point enters the absorption domain of a certain local optimum, the self-adaptive factors of many individuals diminish rapidly because of the self-adaptation scheme.

We define a quantity that characterizes premature convergence. Suppose the population P = {p_i = (x_i, η_i)}, i = 1, …, m, has been sorted by fitness, so that p_1 denotes the best individual.

mean = \frac{1}{k} \sum_{i=1}^{k} \max_{1 \le j \le n} \eta_{ij}    (4)

where k = [0.3 × m], m is the population size, and [·] denotes the integer (floor) function. The main process of the restart strategy is as follows. The population diversity is monitored dynamically in the evolutionary process. When the population diversity decreases below a certain limit, we consider that a trend toward premature convergence has appeared. We then re-initialize the population and restore the population diversity, so the evolution can progress effectively.

We combine the restart strategy into EP. When mean is less than a positive threshold that is fixed beforehand, we consider that the evolution is in danger of premature convergence and re-initialize the population. Based on the previous analysis, we re-initialize only the factors τ and τ′. In this way the individuals can escape the absorption domain of a local optimum and premature convergence is prevented. We do not re-initialize the objective vectors, which preserves the evolutionary information better.
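A minimal sketch of the restart trigger follows. Individuals are assumed to be dictionaries holding the self-adaptive factors eta and the factors tau and tau', and the population is assumed to be sorted by fitness before the check, as in equation 4; these representational choices are illustrative, not the paper's.

def diversity_mean(population):
    # Equation 4: average, over the best k = [0.3 * m] individuals,
    # of each individual's largest self-adaptive factor eta_ij.
    m = len(population)
    k = int(0.3 * m)
    return sum(max(ind["eta"]) for ind in population[:k]) / k

def restart_if_premature(population, threshold, tau_init, tau_prime_init):
    # Re-initialize only tau and tau' of every individual; the objective
    # vectors are kept so the evolutionary information is preserved.
    if diversity_mean(population) < threshold:
        for ind in population:
            ind["tau"] = tau_init
            ind["tau_prime"] = tau_prime_init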

4 Learning Bayesian Network from Incomplete Data

The algorithm we propose is shown below.

1. Set t to 0.
2. Create an initial population, Pop(t), of PS random DAGs. The initial population size is PS.
3. Convert incomplete data to complete data utilizing a randomly chosen DAG from the initial population.


4. Each DAG in the population Pop(t) is evaluated using the MDL metric.
5. While t is smaller than the maximum number of generations G:

a) Each DAG in Pop(t) produces one offspring by performing mutation operations. If the offspring has cycles, delete the set of edges that violates the DAG condition. If several such edge sets exist, we randomly pick one.

b) The DAGs in Pop(t) and all new offspring are stored in the intermediate population Pop’(t). The size of Pop’(t) is 2*PS.

c) Conduct a number of pairwise competitions over all DAGs in Pop’(t). Let S_i be the DAG being conditioned upon; q opponents are selected randomly from Pop’(t) with equal probability. Let S_ij, 1 ≤ j ≤ q, be the randomly selected opponent DAGs. S_i gets one more score for each opponent S_ij such that D(S_i) ≤ D(S_ij). Thus, the maximum score of a DAG is q.

d) Select PS DAGs with the highest scores from Pop’(t) and store them in the new population Pop(t+1).

e) Compute mean of Pop(t+1). Re-initialize the factors τ and τ′ of every individual if mean < threshold.

f) Increase t by 1.

6. Return the DAG with the lowest MDL metric found in any generation of a run as the result of the algorithm.
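For orientation, the loop below restates steps 1-6 as a Python skeleton. The bundle of callables in ops (random_dag, complete_with, mdl, mutate, break_cycles, restart) stands for operations the paper describes only in words, and the point at which the data completion is refreshed with the current best structure is an assumption based on the abstract.

import random

def learn_structure(incomplete_data, variables, ops, PS=30, G=5000, q=5, threshold=0.01):
    # ops is assumed to expose the problem-specific operations as callables:
    # ops.random_dag, ops.complete_with, ops.mdl, ops.mutate, ops.break_cycles,
    # and ops.restart (the re-initialization of tau and tau' from Section 3).
    t = 0                                                          # step 1
    pop = [ops.random_dag(variables) for _ in range(PS)]           # step 2
    data = ops.complete_with(random.choice(pop), incomplete_data)  # step 3
    best_dag, best_score = None, float("inf")
    while t < G:                                                   # step 5
        offspring = [ops.break_cycles(ops.mutate(dag)) for dag in pop]   # 5a
        inter = pop + offspring                                          # 5b
        scores = [ops.mdl(dag, data) for dag in inter]                   # step 4
        for dag, s in zip(inter, scores):
            if s < best_score:
                best_dag, best_score = dag, s
        wins = []
        for i in range(len(inter)):                                      # 5c
            opponents = random.sample(range(len(inter)), q)
            wins.append(sum(scores[i] <= scores[j] for j in opponents))
        ranked = sorted(range(len(inter)), key=lambda i: wins[i], reverse=True)
        pop = [inter[i] for i in ranked[:PS]]                            # 5d
        ops.restart(pop, threshold)                                      # 5e
        # Assumed placement: refresh the completion with the current best
        # structure, as described in the abstract.
        data = ops.complete_with(best_dag, incomplete_data)
        t += 1                                                           # 5f
    return best_dag                                                      # step 6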

5 Experimental Results and Analyses

We have conducted a number of experiments to evaluate the performance of our algorithm. The learning algorithms take only the data set as input. The data set is derived from the ALARM network (http://www.norsys.com/netlib/alarm.htm).

First, we generate 5,000 cases from this structure and learn a Bayesian network from the data set ten times. Then we select the best network structure as the final structure. We also compare our algorithm with a classical GA. In this experiment the algorithms run without missing data. The MDL metric of the original network structure for the ALARM data set of 5,000 cases is 81,219.74.

The population size PS is 30 and the maximum number of generations is 5,000. We employ our learning algorithm to solve the ALARM problem. The value of q is set to 5. We also implemented a classical GA to learn the ALARM network. The one-point crossover and mutation operations of the classical GA are used. The crossover probability p_c is 0.9 and the mutation probability p_m is 0.01. The MDL metrics for our learning algorithm and the classical GA are delineated in Figure 1.

From Figure 1, we see that the average MDL metric for restart-EP is 81,362.1 and the average MDL metric for the GA is 81,789.4. We find that our learning algorithm evolves good Bayesian network structures at an average generation of 4,210.2. The GA obtains its solutions at an average generation of 4,495.4. Thus, we can conclude that our learning algorithm finds better network structures at earlier generations than the GA does. Our algorithm can also prevent and overcome premature convergence.

Fig. 1. The MDL metric for the ALARM network

We generate 1,000 and 10,000 cases from the original network for training and testing. The algorithm runs with 10%, 20%, 30%, and 40% missing data. The experiment is run ten times for each level of missing data. Using the best network from each run, we calculate the log loss. The log loss is a commonly used metric appropriate for probabilistic learning algorithms. Figure 2 shows the comparison of log loss between our algorithm and that of reference [10].

Fig. 2. The Comparison of log loss

As can be seen from Figure 2, the algorithm finds better predictive networks at 10%, 20%, 30%, and 40% missing data than reference [10] does.
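The log loss reported here is, in its usual form, the average negative log-probability assigned to the held-out cases. The sketch below assumes a network object exposing a probability(case) method; both that interface and the exact variant of the metric used in reference [10] are assumptions.

import math

def log_loss(network, test_cases):
    # Average negative log-probability of the test cases under the learned network.
    return -sum(math.log(network.probability(case)) for case in test_cases) / len(test_cases)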


6 Conclusions

In this paper we describe a novel evolutionary algorithm for learning Bayesian networks from incomplete data. This problem is extremely difficult for deterministic algorithms and is characterized by a large, multi-dimensional, multimodal search space. The experimental results show that our learning algorithm can learn a good structure from incomplete data.

References

1. Suzuki, J.: A Construction of Bayesian Networks from Databases Based on an MDL Scheme. In: Proc. of the 9th Conf. on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Mateo, CA (1993) 266-273
2. Xiang, Y., Wong, S.K.M.: Learning Conditional Independence Relations from a Probabilistic Model. Tech. Rep. CS-94-03, Department of Computer Science, University of Regina, Canada (1994)
3. Heckerman, D.: Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, Vol. 20, No. 2 (1995) 197-243
4. Cheng, J., Greiner, R., Kelly, J.: Learning Bayesian Networks from Data: An Efficient Algorithm Based on Information Theory. Artificial Intelligence, Vol. 137, No. 1-2 (2002) 43-90
5. Lam, W., Bacchus, F.: Learning Bayesian Belief Networks: An Algorithm Based on the MDL Principle. Computational Intelligence, Vol. 10, No. 4 (1994)
6. Larranaga, P., Poza, M., Yurramendi, Y., Murga, R., Kuijpers, C.: Structure Learning of Bayesian Networks by Genetic Algorithms: A Performance Analysis of Control Parameters. IEEE Trans. Pattern Analysis and Machine Intelligence
7. Friedman, N.: The Bayesian Structural EM Algorithm. In: Proc. of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, Morgan Kaufmann (1998)
8. Eshelman, L.J.: The CHC Adaptive Search Algorithm. In: Foundations of Genetic Algorithms, Morgan Kaufmann, San Mateo (1991) 265-283
9. Lam, W., Bacchus, F.: Learning Bayesian Belief Networks: An Algorithm Based on the MDL Principle. Computational Intelligence, Vol. 10, No. 4 (1994) 269-293
10. Friedman, N.: Learning Belief Networks in the Presence of Missing Values and Hidden Variables. In: Proc. of the Fourteenth International Conference on Machine Learning, Vanderbilt University, Morgan Kaufmann (1997)