
arXiv:1506.05692v1 [stat.ML] 18 Jun 2015




A hybrid algorithm for Bayesian network structure learning with application to multi-label learning

Maxime Gasse, Alex Aussem∗, Haytham Elghazel

Université de Lyon, CNRS, Université Lyon 1, LIRIS, UMR5205, F-69622, France

Abstract

We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusion that local structural learning with H2PC in the form of local neighborhood induction is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available.

Keywords: Bayesian networks, Multi-label learning, Markov boundary, Feature subset selection.

1. Introduction

A Bayesian network (BN) is a probabilistic model formed by a structure and parameters. The structure of a BN is a directed acyclic graph (DAG), whilst its parameters are conditional probability distributions associated with the variables in the model. The problem of finding the DAG that encodes the conditional independencies present in the data attracted a great deal of interest over the last years (Rodrigues de Morais & Aussem, 2010a; Scutari, 2010; Scutari & Brogini, 2012; Kojima et al., 2010; Perrier et al., 2008; Villanueva & Maciel, 2012; Peña, 2012; Gasse et al., 2012). The inferred DAG is very useful for many applications, including feature selection (Aliferis et al., 2010; Peña et al., 2007; Rodrigues de Morais & Aussem, 2010b), causal relationships inference from observational data (Ellis & Wong, 2008; Aliferis et al., 2010; Aussem et al., 2012, 2010; Prestat et al., 2013; Cawley, 2008; Brown & Tsamardinos, 2008) and more recently multi-label learning (Dembczyński et al., 2012; Zhang & Zhang, 2010; Guo & Gu, 2011).

∗Corresponding author. Email address: [email protected] (Alex Aussem)

Ideally the DAG should coincide with the dependence structure of the global distribution, or it should at least identify a distribution as close as possible to the correct one in the probability space. This step, called structure learning, is similar in approaches and terminology to model selection procedures for classical statistical models. Basically, constraint-based (CB) learning methods systematically check the data for conditional independence relationships and use them as constraints to construct a partially oriented graph representative of a BN equivalence class, whilst search-and-score (SS) methods make use of a goodness-of-fit score function for evaluating graphical structures with regard to the data set. Hybrid methods attempt to get the best of both worlds: they learn a skeleton with a CB approach and use it to constrain the DAGs considered during the SS phase.

In this study, we present a novel hybrid algorithm for Bayesian network structure learning, called H2PC1. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on a divide-and-conquer constraint-based subroutine, called HPC, to learn the local structure around a target variable. HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. As this may arise at the expense of the number of false positives, we control the expected proportion of false discoveries (i.e. false positive nodes) among all the discoveries made in PC_T. We use a modification of the Incremental Association Markov Boundary algorithm (IAMB), initially developed by Tsamardinos et al. in (Tsamardinos et al., 2003) and later modified by José Peña in (Peña, 2008) to control the FDR of edges when learning Bayesian network models. HPC scales to thousands of variables and can deal with many fewer samples than variables (n < q). To illustrate its performance by means of empirical evidence, we conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for BN structure learning (Tsamardinos et al., 2006), using well-known BN benchmarks with various data sizes, to assess the goodness of fit to new data as well as the quality of the network structure with respect to the true dependence structure of the data.

We then address a real application of H2PC where the true dependence structure is unknown. More specifically, we investigate H2PC's ability to encode the joint distribution of the label set conditioned on the input features in the multi-label classification (MLC) problem. Many challenging applications, such as photo and video annotation and web page categorization, can benefit from being formulated as MLC tasks with a large number of categories (Dembczyński et al., 2012; Read et al., 2009; Madjarov et al., 2012; Kocev et al., 2007; Tsoumakas et al., 2010b). Recent research in MLC focuses on the exploitation of the label conditional dependency in order to better predict the label combination for each example. We show that local BN structure discovery methods offer an elegant and powerful approach to solve this problem. We establish two theorems (Theorems 6 and 7) linking the concepts of marginal Markov boundaries, joint Markov boundaries and so-called label powersets under the faithfulness assumption. These theorems offer a simple guideline to characterize graphically: i) the minimal label powerset decomposition (i.e. into minimal subsets Y_LP ⊆ Y such that Y_LP ⊥⊥ Y \ Y_LP | X), and ii) the minimal subset of features, w.r.t. an Information Theory criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of the multi-label classification. To solve the MLC problem with BNs, the DAG obtained from the data plays a pivotal role. So in this second series of experiments, we assess the comparative ability of H2PC and MMHC to encode the label dependency structure by means of an indirect goodness-of-fit indicator, namely the 0/1 loss function, which makes sense in the MLC context.

1 A first version of H2PC without FDR control was discussed in a paper that appeared in the Proceedings of ECML-PKDD, pages 58-73, 2012.

The rest of the paper is organized as follows: In Section 2, we review the theory of BNs and discuss the main BN structure learning strategies. We then present the H2PC algorithm in detail in Section 3. Section 4 evaluates our proposed method and shows results for several tasks involving artificial data sampled from known BNs. Then we report, in Section 5, on our experiments on real-world data sets in a multi-label learning context so as to provide empirical support for the proposed methodology. The main theoretical results appear formally as two theorems (Theorems 6 and 7) in Section 5. Their proofs are established in the Appendix. Finally, Section 6 raises several issues for future work and we conclude in Section 7 with a summary of our contribution.

2. Preliminaries

We define next some key concepts used throughout the paper and state some results that will support our analysis. In this paper, upper-case letters in italics denote random variables (e.g., X, Y) and lower-case letters in italics denote their values (e.g., x, y). Upper-case bold letters denote random variable sets (e.g., X, Y, Z) and lower-case bold letters denote their values (e.g., x, y). We denote by X ⊥⊥ Y | Z the conditional independence between X and Y given the set of variables Z. To keep the notation uncluttered, we use p(y | x) to denote p(Y = y | X = x).


2.1. Bayesian networks

Formally, a BN is a tuple ⟨G, P⟩, where G = ⟨U, E⟩ is a directed acyclic graph (DAG) with nodes representing the random variables U and P a joint probability distribution in U. In addition, G and P must satisfy the Markov condition: every variable, X ∈ U, is independent of any subset of its non-descendant variables conditioned on the set of its parents, denoted by Pa_i^G. From the Markov condition, it is easy to prove (Neapolitan, 2004) that the joint probability distribution P on the variables in U can be factored as follows:

P(U) = P(X_1, . . . , X_n) = ∏_{i=1}^{n} P(X_i | Pa_i^G)     (1)

Equation 1 allows a parsimonious decomposition of the joint distribution P. It enables us to reduce the problem of determining a huge number of probability values to that of determining relatively few.
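For instance, for a three-variable DAG X_1 → X_2 → X_3, Equation 1 reduces to P(X_1, X_2, X_3) = P(X_1) P(X_2 | X_1) P(X_3 | X_2), so three small conditional tables suffice instead of one table over all joint configurations.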

A BN structure G entails a set of conditional independence assumptions. They can all be identified by the d-separation criterion (Pearl, 1988). We use X ⊥_G Y | Z to denote the assertion that X is d-separated from Y given Z in G. Formally, X ⊥_G Y | Z is true when for every undirected path in G between X and Y, there exists a node W in the path such that either (1) W does not have two parents in the path and W ∈ Z, or (2) W has two parents in the path and neither W nor its descendants is in Z. If ⟨G, P⟩ is a BN, X ⊥_P Y | Z if X ⊥_G Y | Z. The converse does not necessarily hold. We say that ⟨G, P⟩ satisfies the faithfulness condition if the d-separations in G identify all and only the conditional independencies in P, i.e., X ⊥_P Y | Z iff X ⊥_G Y | Z.

A Markov blanket M_T of T is any set of variables such that T is conditionally independent of all the remaining variables given M_T. By extension, a Markov blanket of T in V guarantees that M_T ⊆ V, and that T is conditionally independent of the remaining variables in V, given M_T. A Markov boundary, MB_T, of T is any Markov blanket such that none of its proper subsets is a Markov blanket of T.

We denote by PC_T^G the set of parents and children of T in G, and by SP_T^G the set of spouses of T in G. The spouses of T are the variables that have common children with T. These sets are unique for all G such that ⟨G, P⟩ satisfies the faithfulness condition and so we will drop the superscript G. We denote by dSep(X) the set that d-separates X from the (implicit) target T.

Theorem 1. Suppose ⟨G, P⟩ satisfies the faithfulness condition. Then X and Y are not adjacent in G iff ∃Z ⊆ U \ {X, Y} such that X ⊥⊥ Y | Z. Moreover, MB_X = PC_X ∪ SP_X.

A proof can be found for instance in (Neapolitan, 2004).
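As an illustration, Theorem 1 can be checked numerically on a toy DAG with the bnlearn package, on which our implementation builds; the network string below and the use of the mb() and nbr() accessors are our own minimal sketch, not code from the paper.

library(bnlearn)

# Toy DAG (hypothetical example): A -> C <- B, C -> D
g <- model2network("[A][B][C|A:B][D|C]")

pc <- nbr(g, "A")                       # parents and children of A: {C}
mb <- mb(g, "A")                        # Markov boundary of A: {B, C}
sp <- setdiff(mb, pc)                   # spouses of A: {B}

stopifnot(setequal(mb, union(pc, sp)))  # MB_A = PC_A U SP_A (Theorem 1)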

Two graphs are said to be equivalent iff they encode the same set of conditional independencies via the d-separation criterion. The equivalence class of a DAG G is the set of DAGs that are equivalent to G. The next result, established by (Pearl, 1988), shows that equivalent graphs have the same undirected graph but might disagree on the direction of some of the arcs.

Theorem 2. Two DAGs are equivalent iff they have the same underlying undirected graph and the same set of v-structures (i.e. converging edges into the same node, such as X → Y ← Z).

Moreover, an equivalence class of network structures can be uniquely represented by a partially directed DAG (PDAG), also called a DAG pattern. The DAG pattern is defined as the graph that has the same links as the DAGs in the equivalence class and has oriented all and only the edges common to the DAGs in the equivalence class. A structure learning algorithm from data is said to be correct (or sound) if it returns the correct DAG pattern (or a DAG in the correct equivalence class) under the assumptions that the independence tests are reliable and that the learning database is a sample from a distribution P faithful to a DAG G. The (ideal) assumption that the independence tests are reliable means that they decide (in)dependence iff the (in)dependence holds in P.

2.2. Conditional independence properties

The following three theorems, borrowed from Peña et al. (2007), are proven in Pearl (1988):

Theorem 3. Let X, Y, Z and W denote four mutually disjoint subsets of U. Any probability distribution p satisfies the following four properties: Symmetry X ⊥⊥ Y | Z ⇒ Y ⊥⊥ X | Z, Decomposition X ⊥⊥ (Y ∪ W) | Z ⇒ X ⊥⊥ Y | Z, Weak Union X ⊥⊥ (Y ∪ W) | Z ⇒ X ⊥⊥ Y | (Z ∪ W), and Contraction X ⊥⊥ Y | (Z ∪ W) ∧ X ⊥⊥ W | Z ⇒ X ⊥⊥ (Y ∪ W) | Z.

Theorem 4. If p is strictly positive, then p satisfies the previous four properties plus the Intersection property X ⊥⊥ Y | (Z ∪ W) ∧ X ⊥⊥ W | (Z ∪ Y) ⇒ X ⊥⊥ (Y ∪ W) | Z. Additionally, each X ∈ U has a unique Markov boundary, MB_X.

Theorem 5. If p is faithful to a DAG G, then p satisfies the previous five properties plus the Composition property X ⊥⊥ Y | Z ∧ X ⊥⊥ W | Z ⇒ X ⊥⊥ (Y ∪ W) | Z and the local Markov property X ⊥⊥ (ND_X \ Pa_X) | Pa_X for each X ∈ U, where ND_X denotes the non-descendants of X in G.


2.3. Structure learning strategies

The number of DAGs, G, is super-exponential in the number of random variables in the domain and the problem of learning the most probable a posteriori BN from data is worst-case NP-hard (Chickering et al., 2004). One needs to resort to heuristic methods in order to be able to solve very large problems effectively.

Both CB and SS heuristic approaches have advantages and disadvantages. CB approaches are relatively quick, deterministic, and have a well-defined stopping criterion; however, they rely on an arbitrary significance level to test for independence, and they can be unstable in the sense that an error early on in the search can have a cascading effect that causes many errors to be present in the final graph. SS approaches have the advantage of being able to flexibly incorporate users' background knowledge in the form of prior probabilities over the structures and are also capable of dealing with incomplete records in the database (e.g. the EM technique). Although SS methods are favored in practice when dealing with small dimensional data sets, they are slow to converge and the computational complexity often prevents us from finding optimal BN structures (Perrier et al., 2008; Kojima et al., 2010). With currently available exact algorithms (Koivisto & Sood, 2004; Silander & Myllymaki, 2006; Cussens & Bartlett, 2013; Studeny & Haws, 2014) and a decomposable score like BDeu, the computational complexity remains exponential, and therefore, such algorithms are intractable for BNs with more than around 30 vertices on current workstations. For larger sets of variables the computational burden becomes prohibitive and restrictions about the structure have to be imposed, such as a limit on the size of the parent sets. With this in mind, the ability to restrict the search locally around the target variable is a key advantage of CB methods over SS methods. They are able to construct a local graph around the target node without having to construct the whole BN first, hence their scalability (Peña et al., 2007; Rodrigues de Morais & Aussem, 2010b,a; Tsamardinos et al., 2006; Peña, 2008).

With a view to balancing the computation cost with the desired accuracy of the estimates, several hybrid methods have been proposed recently. Tsamardinos et al. proposed in (Tsamardinos et al., 2006) the Max-Min Hill-Climbing (MMHC) algorithm and conducted one of the most extensive empirical comparisons performed in recent years, showing that MMHC was the fastest and the most accurate method in terms of structural error based on the structural Hamming distance. More specifically, MMHC outperformed, both in terms of time efficiency and quality of reconstruction, the PC (Spirtes et al., 2000), the Sparse Candidate (Friedman et al., 1999b), the Three Phase Dependency Analysis (Cheng et al., 2002), the Optimal Reinsertion (Moore & Wong, 2003), the Greedy Equivalence Search (Chickering, 2002), and the Greedy Hill-Climbing Search on a variety of networks, sample sizes, and parameter values. Although MMHC is rather heuristic by nature (it returns a local optimum of the score function), it is currently considered as the most powerful state-of-the-art algorithm for BN structure learning capable of dealing with thousands of nodes in reasonable time.

In order to enhance its performance on small dimensional data sets, Perrier et al. proposed in (Perrier et al., 2008) a hybrid algorithm that can learn an optimal BN (i.e., it converges to the true model in the sample limit) when an undirected graph is given as a structural constraint. They defined this undirected graph as a super-structure (i.e., every DAG considered in the SS phase is compelled to be a subgraph of the super-structure). This algorithm can learn optimal BNs containing up to 50 vertices when the average degree of the super-structure is around two, that is, a sparse structural constraint is assumed. To extend its feasibility to BNs with a few hundred vertices and an average degree up to four, Kojima et al. proposed in (Kojima et al., 2010) to divide the super-structure into several clusters and perform an optimal search on each of them in order to scale up to larger networks. Despite interesting improvements in terms of score and structural Hamming distance on several benchmark BNs, they report running times about 1,000 times longer than MMHC on average, which is still prohibitive.

Therefore, there is a great deal of interest in hybrid methods capable of improving the structural accuracy of both CB and SS methods on graphs containing up to thousands of vertices. However, they make the strong assumption that the skeleton (also called super-structure) contains at least the edges of the true network and as few extra edges as possible. While controlling the false discovery rate (i.e., false extra edges) in BN learning has attracted some attention recently (Armen & Tsamardinos, 2011; Peña, 2008; Tsamardinos & Brown, 2008), to our knowledge, there is no work on actively controlling the rate of false-negative errors (i.e., false missing edges).

2.4. Constraint-based structure learning

Before introducing the H2PC algorithm, we discuss the general idea behind CB methods. The induction of local or global BN structures is handled by CB methods


through the identification of local neighborhoods (i.e., PC_X), hence their scalability to very high dimensional data sets. CB methods systematically check the data for conditional independence relationships in order to infer a target's neighborhood. Typically, the algorithms run either a G² or a χ² independence test when the data set is discrete and a Fisher's Z test when it is continuous in order to decide on dependence or independence, that is, upon the rejection or acceptance of the null hypothesis of conditional independence. Since we are limiting ourselves to discrete data, both the global and the local distributions are assumed to be multinomial, and the latter are represented as conditional probability tables. Conditional independence tests and network scores for discrete data are functions of these conditional probability tables through the observed frequencies {n_ijk ; i = 1, . . . , R; j = 1, . . . , C; k = 1, . . . , L} for the random variables X and Y and all the configurations of the levels of the conditioning variables Z. We use n_i+k as shorthand for the marginal ∑_j n_ijk and similarly for n_+jk, n_++k and n_+++ = n. We use a classic conditional independence test based on the mutual information. The mutual information is an information-theoretic distance measure defined as

MI(X, Y | Z) = ∑_{i=1}^{R} ∑_{j=1}^{C} ∑_{k=1}^{L} (n_ijk / n) log( (n_ijk n_++k) / (n_i+k n_+jk) )

It is proportional to the log-likelihood ratio test statistic G² (they differ by a factor 2n, where n is the sample size). The asymptotic null distribution is χ² with (R − 1)(C − 1)L degrees of freedom. For a detailed analysis of their properties we refer the reader to (Agresti, 2002). The main limitation of this test is the rate of convergence to its limiting distribution, which is particularly problematic when dealing with small samples and sparse contingency tables. The decision of accepting or rejecting the null hypothesis depends implicitly upon the degrees of freedom, which increase exponentially with the number of variables in the conditioning set. Several heuristic solutions have emerged in the literature (Spirtes et al., 2000; Rodrigues de Morais & Aussem, 2010a; Tsamardinos et al., 2006; Tsamardinos & Borboudakis, 2010) to overcome some shortcomings of the asymptotic tests. In this study we use the two following heuristics that are used in MMHC. First, we do not perform the MI(X, Y | Z) test and assume independence if there are not enough samples to achieve large enough power. We require that the average sample per count is above a user-defined parameter, equal to 5, as in (Tsamardinos et al., 2006). This heuristic is called the power rule. Second, we consider as a structural zero either case n_+jk = 0 or n_i+k = 0. For example, if n_+jk = 0, we consider y_j as a structurally forbidden value for Y when Z = z_k and we reduce C by 1 (as if we had one column less in the contingency table where Z = z_k). This is known as the degrees of freedom adjustment heuristic.
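For concreteness, the G²/mutual-information test described above can be reproduced with bnlearn's ci.test() function; the snippet below is a minimal sketch on simulated data (the variable names and the generating mechanism are ours), assuming that test = "mi" corresponds to the G² statistic with (R − 1)(C − 1)L degrees of freedom as documented in bnlearn.

library(bnlearn)

# Simulate a small discrete data set where X and Y are dependent
# (hypothetical variables, for illustration only).
set.seed(1)
n <- 5000
Z <- factor(sample(c("z1", "z2"), n, replace = TRUE))
X <- factor(ifelse(runif(n) < ifelse(Z == "z1", 0.3, 0.7), "x1", "x2"))
Y <- factor(ifelse(runif(n) < ifelse(X == "x1", 0.2, 0.8), "y1", "y2"))
d <- data.frame(X = X, Y = Y, Z = Z)

# Test X _||_ Y | Z: the "mi" statistic is 2n * MI(X, Y | Z), i.e. G^2,
# with (R - 1)(C - 1)L degrees of freedom.
ci.test(x = "X", y = "Y", z = "Z", data = d, test = "mi")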

3. The H2PC algorithm

In this section, we present our hybrid algorithm for Bayesian network structure learning, called Hybrid HPC (H2PC). It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to filter and orient the edges. It is based on a CB subroutine called HPC to learn the parents and children of a single variable. So, we shall discuss HPC first and then move to H2PC.

3.1. Parents and Children Discovery

HPC (Algorithm 1) can be viewed as an ensemble method for combining many weak PC learners in an attempt to produce a stronger PC learner. HPC is based on three subroutines: Data-Efficient Parents and Children Superset (DE-PCS), Data-Efficient Spouses Superset (DE-SPS), and Incremental Association Parents and Children with false discovery rate control (FDR-IAPC), a weak PC learner based on FDR-IAMB (Peña, 2008) that requires little computation. HPC receives a target node T, a data set D and a set of variables U as input and returns an estimation of PC_T. It is hybrid in that it combines the benefits of incremental and divide-and-conquer methods. The procedure starts by extracting a superset PCS_T of PC_T (line 1) and a superset SPS_T of SP_T (line 2) with a severe restriction on the maximum conditioning size (|Z| ≤ 2) in order to significantly increase the reliability of the tests. A first candidate PC set is then obtained by running the weak PC learner called FDR-IAPC on PCS_T ∪ SPS_T (line 3). The key idea is the decentralized search at lines 4-8 that includes, in the candidate PC set, all variables in the superset PCS_T \ PC_T that have T in their vicinity. So, HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. Note that, in theory, X is in the output of FDR-IAPC(Y) if and only if Y is in the output of FDR-IAPC(X). However, in practice, this may not always be true, particularly when working in high-dimensional domains (Peña, 2008). By loosening the criteria by which two nodes are


said to be adjacent, the effective restrictions on the size of the neighborhood are now far less severe. The decentralized search has a significant impact on the accuracy of HPC as we shall see in the experiments. We proved in (Rodrigues de Morais & Aussem, 2010a) that the original HPC(T) is consistent, i.e. its output converges in probability to PC_T, if the hypothesis tests are consistent. The proof also applies to the modified version presented here.

We now discuss the subroutines in more detail. FDR-IAPC (Algorithm 2) is a fast incremental method that receives a data set D and a target node T as its input and promptly returns a rough estimation of PC_T, hence the term "weak" PC learner. In this study, we use FDR-IAPC because it aims at controlling the expected proportion of false discoveries (i.e., false positive nodes in PC_T) among all the discoveries made. FDR-IAPC is a straightforward extension of the algorithm IAMBFDR developed by José Peña in (Peña, 2008), which is itself a modification of the Incremental Association Markov Boundary algorithm (IAMB) (Tsamardinos et al., 2003), to control the expected proportion of false discoveries (i.e., false positive nodes) in the estimated Markov boundary. FDR-IAPC simply removes, at lines 3-6, the spouses SP_T from the estimated Markov boundary MB_T output by IAMBFDR at line 1, and returns PC_T assuming the faithfulness condition.

The subroutines DE-PCS (Algorithm 3) and DE-SPS (Algorithm 4) search a superset of PC_T and SP_T respectively, with a severe restriction on the maximum conditioning size (|Z| ≤ 1 in DE-PCS and |Z| ≤ 2 in DE-SPS) in order to significantly increase the reliability of the tests. The variable filtering has two advantages: i) it allows HPC to scale to hundreds of thousands of variables by restricting the search to a subset of relevant variables, and ii) it eliminates many true or approximate deterministic relationships that produce many false negative errors in the output of the algorithm, as explained in (Rodrigues de Morais & Aussem, 2010b,a). DE-SPS works in two steps. First, a growing phase (lines 4-8) adds the variables that are d-separated from the target but still remain associated with the target when conditioned on another variable from PCS_T. The shrinking phase (lines 9-16) discards irrelevant variables that are ancestors or descendants of a target's spouse. Pruning such irrelevant variables speeds up HPC.

3.2. Hybrid HPC (H2PC)

In this section, we discuss the SS phase. The following discussion draws strongly on (Tsamardinos et al., 2006) as the SS phase in Hybrid HPC and MMHC are exactly the same.

Algorithm 1 HPC
Require: T: target; D: data set; U: set of variables
Ensure: PC_T: Parents and Children of T

1: [PCS_T, dSep] ← DE-PCS(T, D, U)
2: SPS_T ← DE-SPS(T, D, U, PCS_T, dSep)
3: PC_T ← FDR-IAPC(T, D, (T ∪ PCS_T ∪ SPS_T))
4: for all X ∈ PCS_T \ PC_T do
5:     if T ∈ FDR-IAPC(X, D, (T ∪ PCS_T ∪ SPS_T)) then
6:         PC_T ← PC_T ∪ X
7:     end if
8: end for

Algorithm 2 FDR-IAPC
Require: T: target; D: data set; U: set of variables
Ensure: PC_T: Parents and children of T

* Learn the Markov boundary of T
1: MB_T ← IAMBFDR(T, D, U)

* Remove the spouses of T from MB_T
2: PC_T ← MB_T
3: for all X ∈ MB_T do
4:     if ∃Z ⊆ (MB_T \ X) such that T ⊥ X | Z then
5:         PC_T ← PC_T \ X
6:     end if
7: end for

Algorithm 3 DE-PCS
Require: T: target; D: data set; U: set of variables
Ensure: PCS_T: parents and children superset of T; dSep: d-separating sets

Phase I: remove X if T ⊥ X
1: PCS_T ← U \ T
2: for all X ∈ PCS_T do
3:     if (T ⊥ X) then
4:         PCS_T ← PCS_T \ X
5:         dSep(X) ← ∅
6:     end if
7: end for

Phase II: remove X if T ⊥ X | Y
8: for all X ∈ PCS_T do
9:     for all Y ∈ PCS_T \ X do
10:        if (T ⊥ X | Y) then
11:            PCS_T ← PCS_T \ X
12:            dSep(X) ← Y
13:            break loop FOR
14:        end if
15:    end for
16: end for


Algorithm 4 DE-SPS
Require: T: target; D: data set; U: the set of variables; PCS_T: parents and children superset of T; dSep: d-separating sets
Ensure: SPS_T: Superset of the spouses of T

1: SPS_T ← ∅
2: for all X ∈ PCS_T do
3:     SPS_T^X ← ∅
4:     for all Y ∈ U \ {T ∪ PCS_T} do
5:         if ¬(T ⊥ Y | dSep(Y) ∪ X) then
6:             SPS_T^X ← SPS_T^X ∪ Y
7:         end if
8:     end for
9:     for all Y ∈ SPS_T^X do
10:        for all Z ∈ SPS_T^X \ Y do
11:            if (T ⊥ Y | X ∪ Z) then
12:                SPS_T^X ← SPS_T^X \ Y
13:                break loop FOR
14:            end if
15:        end for
16:    end for
17:    SPS_T ← SPS_T ∪ SPS_T^X
18: end for

Algorithm 5 Hybrid HPC
Require: D: data set; U: set of variables
Ensure: a DAG G on the variables U

1: for all pairs of nodes X, Y ∈ U do
2:     add X to PC_Y and add Y to PC_X if X ∈ HPC(Y) and Y ∈ HPC(X)
3: end for
4: starting from an empty graph, perform greedy hill-climbing with operators add-edge, delete-edge, reverse-edge; only try operator add-edge X → Y if Y ∈ PC_X

The idea of constraining the search to improve time-efficiency first appeared in the Sparse Candidate algorithm (Friedman et al., 1999b). It results in efficiency improvements over the (unconstrained) greedy search. All recent hybrid algorithms build on this idea, but employ a sound algorithm for identifying the candidate parent sets. The Hybrid HPC first identifies the parents and children set of each variable, then performs a greedy hill-climbing search in the space of BNs. The search begins with an empty graph. The edge addition, deletion, or direction reversal that leads to the largest increase in score (the BDeu score with uniform prior was used) is taken and the search continues in a similar fashion recursively. The important difference from standard greedy search is that the search is constrained to only consider adding an edge if it was discovered by HPC in the first phase. We extend the greedy search with a TABU list (Friedman et al., 1999b). The list keeps the last 100 structures explored. Instead of applying the best local change, the best local change that results in a structure not on the list is performed in an attempt to escape local maxima. When 15 changes occur without an increase in the maximum score ever encountered during search, the algorithm terminates. The overall best scoring structure is then returned. Clearly, the more false positives the heuristic allows to enter the candidate PC set, the more computational burden is imposed in the SS phase.
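The constrained search-and-score phase can be emulated with off-the-shelf bnlearn functions. The sketch below is ours and uses MMPC as a stand-in for HPC (stock bnlearn does not ship HPC); it assumes the mmpc(), nbr(), nodes() and tabu() interfaces and the iss/tabu arguments behave as documented in bnlearn.

library(bnlearn)

data(learning.test)   # small discrete data set shipped with bnlearn

# CB phase: learn local parents-and-children sets (MMPC here, as a stand-in
# for HPC, which is only available in the authors' bnlearn clone).
skel <- mmpc(learning.test)

# Forbid every arc X -> Y with Y outside PC(X), so the score-based search may
# only add edges discovered during the CB phase.
vars <- nodes(skel)
bl <- do.call(rbind, lapply(vars, function(x) {
  out <- setdiff(vars, c(x, nbr(skel, x)))
  if (length(out) > 0) data.frame(from = x, to = out) else NULL
}))

# SS phase: greedy search with a tabu list of the last 100 structures and the
# BDeu score (equivalent sample size 10), as described in the text.
dag <- tabu(learning.test, blacklist = bl, score = "bde", iss = 10, tabu = 100)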

4. Experimental validation on synthetic data

Before we proceed to the experiments on real-world multi-label data with H2PC, we first conduct an experimental comparison of H2PC against MMHC on synthetic data sets sampled from eight well-known benchmarks with various data sizes in order to gauge the practical relevance of H2PC. These BNs have been previously used as benchmarks for BN learning algorithms (see Table 1 for details). Results are reported in terms of various performance indicators to investigate how well the network generalizes to new data and how well the learned dependence structure matches the true structure of the benchmark network. We implemented H2PC in R (R Core Team, 2013) and integrated the code into the bnlearn package from (Scutari, 2010). MMHC was implemented by Marco Scutari in bnlearn. The source code of H2PC as well as all data sets used for the empirical tests are publicly available2. The threshold considered for the type I error of the test is 0.05. Our experiments were carried out on a PC with an Intel(R) Core(TM) i5-3470M CPU @ 3.20 GHz and 4 GB RAM running under 64-bit Linux.

2 https://github.com/madbix/bnlearn-clone-3.4


Table 1: Description of the BN benchmarks used in the experiments.

network      # var.   # edges   max. degree   domain    min/med/max
                                in/out        range     PC set size
child          20        25      2/7          2-6       1/2/8
insurance      27        52      3/7          2-5       1/3/9
mildew         35        46      3/3          3-100     1/2/5
alarm          37        46      4/5          2-4       1/2/6
hailfinder     56        66      4/16         2-11      1/1.5/17
munin1        186       273      3/15         2-21      1/3/15
pigs          441       592      2/39         3-3       1/2/41
link          724      1125      3/14         2-4       0/2/17


We do not claim that those data sets resemble real-world problems, however, they make it possible to compare the outputs of the algorithms with the known structure. All BN benchmarks (structure and probability tables) were downloaded from the bnlearn repository3. Ten sample sizes have been considered: 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000 and 50000. All experiments are repeated 10 times for each sample size and each BN. We investigate the behavior of both algorithms using the same parametric tests as a reference.

4.1. Performance indicators

We first investigate the quality of the skeleton returned by H2PC during the CB phase. To this end, we measure the false positive edge ratio, the precision (i.e., the number of true positive edges in the output divided by the number of edges in the output), the recall (i.e., the number of true positive edges divided by the true number of edges) and a combination of precision and recall defined as √((1 − precision)² + (1 − recall)²), to measure the Euclidean distance from perfect precision and recall, as proposed in (Peña et al., 2007). Second, to assess the quality of the final DAG output at the end of the SS phase, we report the five performance indicators (Scutari & Brogini, 2012) described below:

• the posterior density of the network for the data it was learned from, as a measure of goodness of fit. It is known as the Bayesian Dirichlet equivalent score (BDeu) from (Heckerman et al., 1995; Buntine, 1991) and has a single parameter, the equivalent sample size, which can be thought of as the size of an imaginary sample supporting the prior distribution. The equivalent sample size was set to 10 as suggested in (Koller & Friedman, 2009);

3 http://www.bnlearn.com/bnrepository

• the BIC score (Schwarz, 1978) of the network for the data it was learned from, again as a measure of goodness of fit;

• the posterior density of the network for a new data set, as a measure of how well the network generalizes to new data;

• the BIC score of the network for a new data set, again as a measure of how well the network generalizes to new data;

• the Structural Hamming Distance (SHD) between the learned and the true structure of the network, as a measure of the quality of the learned dependence structure. The SHD between two PDAGs is defined as the number of the following operators required to make the PDAGs match: add or delete an undirected edge, and add, remove, or reverse the orientation of an edge.

For each data set sampled from the true probability distribution of the benchmark, we first learn a network structure with H2PC and MMHC and then we compute the relevant performance indicators for each pair of network structures. The data set used to assess how well the network generalizes to new data is generated again from the true probability structure of the benchmark networks and contains 50000 observations.

Notice that using the BDeu score as a metric of reconstruction quality has the following two problems. First, the score corresponds to the a posteriori probability of a network only under certain conditions (e.g., a Dirichlet distribution of the hyper-parameters); it is unknown to what degree these assumptions hold in distributions encountered in practice. Second, the score is highly sensitive to the equivalent sample size (set to 10 in our experiments) and depends on the network priors used. Since, typically, the same arbitrary value of this parameter is used both during learning and for scoring the learned network, the metric favors algorithms that use the BDeu score for learning. In fact, the BDeu score does not rely on the structure of the original, gold standard network at all; instead it employs several assumptions to score the networks. For those reasons, in addition to the BDeu score we also report the BIC score and the SHD metric.
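The indicators above are all available through bnlearn; the following sketch is ours and uses the built-in alarm data with hc()/mmhc() purely for illustration, with a learned network standing in for the gold-standard structure that would normally be downloaded from the bnlearn repository.

library(bnlearn)

data(alarm)
true.dag <- hc(alarm)                       # stand-in for the true structure
train <- alarm[1:5000, ]                    # learning sample
test  <- alarm[5001:20000, ]                # fresh sample for generalization
dag   <- mmhc(train)                        # structure under evaluation

score(dag, train, type = "bde", iss = 10)   # log BDeu posterior, training data
score(dag, train, type = "bic")             # BIC, training data
score(dag, test,  type = "bde", iss = 10)   # log BDeu posterior, new data
score(dag, test,  type = "bic")             # BIC, new data
shd(cpdag(dag), cpdag(true.dag))            # structural Hamming distance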

4.2. Results

In Figure 1, we report the quality of the skeleton obtained with HPC over that obtained with MMPC (before the SS phase) as a function of the sample size. Results for each benchmark are not shown here in detail due to space restrictions.


Figure 1: Quality of the skeleton obtained with HPC over that obtained with MMPC (after the CB phase). The first two panels show mean false negative and false positive rates (lower is better) aggregated over the 8 benchmarks, as a function of the sample size. The last four panels show the HPC/MMPC increase factors for recall, precision, Euclidean distance and number of statistical tests, with the median, quartiles, and most extreme values (green boxplots), along with the mean value (black line).


For the sake of conciseness, the performance values are averaged over the 8 benchmarks depicted in Table 1. The increase factor for a given performance indicator is expressed as the ratio of the performance value obtained with HPC over that obtained with MMPC (the gold standard). Note that for some indicators, an increase is actually not an improvement but is worse (e.g., false positive rate, Euclidean distance). For clarity, we mention explicitly on the subplots whether an increase factor > 1 should be interpreted as an improvement or not. Regarding the quality of the superstructure, the advantages of HPC against MMPC are noticeable. As observed, HPC consistently increases the recall and reduces the rate of false negative edges. As expected this benefit comes at a little expense in terms of false positive edges. HPC also improves the Euclidean distance from perfect precision and recall on all benchmarks, while increasing the number of independence tests and thus the running time in the CB phase (see the number of statistical tests). It is worth noting that HPC is capable of reducing the Euclidean distance by 50% with 50000 samples (lower left plot). These results are very much in line with other experiments presented in (Rodrigues de Morais & Aussem, 2010a; Villanueva & Maciel, 2012).

In Figure 2, we report the quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase) as a function of the sample size. Regarding BDeu and BIC on both training and test data, the improvements are noteworthy. The results in terms of goodness of fit to training data and new data using H2PC clearly dominate those obtained using MMHC, whatever the sample size considered, hence its ability to generalize better. Regarding the quality of the network structure itself (i.e., how close the DAG is to the true dependence structure of the data), it is pretty much a dead heat between the two algorithms on small sample sizes (i.e., 50 and 100); however, we found H2PC to perform significantly better on larger sample sizes. The SHD increase factor decays rapidly (lower is better) as the sample size increases (lower left plot). For 50000 samples, the SHD is on average only 50% that of MMHC. Regarding the computational burden involved, we may observe from Table 2 that H2PC incurs a computational overhead compared to MMHC. The running time increase factor grows somewhat linearly with the sample size. With 50000 samples, H2PC is approximately 10 times slower on average than MMHC. This is mainly due to the computational expense incurred in obtaining larger PC sets with HPC, compared to MMPC.

5. Application to multi-label learning

In this section, we address the problem of multi-label learning with H2PC. MLC is a challenging problem in many real-world application domains, where each instance can be assigned simultaneously to multiple binary labels (Dembczyński et al., 2012; Read et al., 2009; Madjarov et al., 2012; Kocev et al., 2007; Tsoumakas et al., 2010b). Formally, learning from multi-label examples amounts to finding a mapping from a space of features to a space of labels. We shall assume throughout that X (a random vector in R^d) is the feature set, Y (a random vector in {0, 1}^n) is the label set, U = X ∪ Y and p a probability distribution defined over U. Given a multi-label training set D, the goal of multi-label learning is to find a function which is able to map any unseen example to its proper set of labels. From the Bayesian point of view, this problem amounts to modeling the conditional joint distribution p(Y | X).

5.1. Related work

This MLC problem may be tackled in various ways (Luaces et al., 2012; Alvares-Cherman et al., 2011; Read et al., 2009; Blockeel et al., 1998; Kocev et al., 2007). Each of these approaches is supposed to capture, to some extent, the relationships between labels. The two most straightforward meta-learning methods (Madjarov et al., 2012) are Binary Relevance (BR) (Luaces et al., 2012) and Label Powerset (LP) (Tsoumakas & Vlahavas, 2007; Tsoumakas et al., 2010b). Both methods can be regarded as opposites in the sense that BR considers each label independently, while LP considers the whole label set at once (one multi-class problem). An important question remains: what exactly shall we capture from the statistical relationships between labels to solve the multi-label classification problem? The problem attracted a great deal of interest (Dembczyński et al., 2012; Zhang & Zhang, 2010). It is well beyond the scope and purpose of this paper to delve deeper into these approaches; we point the reader to (Dembczyński et al., 2012; Tsoumakas et al., 2010b) for a review. The second fundamental problem that we wish to address involves finding an optimal feature subset selection of a label set, w.r.t. an Information Theory criterion (Koller & Sahami, 1996). As in the single-label case, multi-label feature selection has been studied recently and has encountered some success (Gu et al., 2011; Spolaor et al., 2013).


Figure 2: Quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase). The six panels show the H2PC/MMHC increase factors for log(BDeu posterior) and BIC on training data, log(BDeu posterior) and BIC on test data (lower is better in all four cases), the Structural Hamming Distance (lower is better) and the number of scores, as a function of the sample size, with the median, quartiles, and most extreme values (green boxplots), along with the mean value (black line).


Table 2: Total running time ratio R (H2PC/MMHC), mean ± standard deviation. A ratio R < 1 is in favor of H2PC, while a ratio R > 1 is in favor of MMHC.

Network      Sample size
             50        100       200       500       1000      2000      5000       10000      20000      50000
child        0.94±0.1  0.87±0.1  1.14±0.1  1.99±0.2  2.26±0.1  2.12±0.2  2.36±0.4   2.58±0.3   1.75±0.6   1.78±0.5
insurance    0.96±0.1  1.09±0.1  1.56±0.1  2.93±0.2  3.06±0.3  3.48±0.4  3.69±0.3   4.10±0.4   3.76±0.6   3.75±0.5
mildew       0.77±0.1  0.80±0.1  0.79±0.1  0.94±0.1  1.01±0.1  1.23±0.1  1.74±0.2   2.14±0.2   3.26±0.6   6.20±1.0
alarm        0.88±0.1  1.11±0.1  1.75±0.1  2.43±0.1  2.55±0.1  2.71±0.1  2.65±0.2   2.80±0.2   2.49±0.3   2.18±0.6
hailfinder   0.85±0.1  0.85±0.1  1.40±0.1  1.69±0.1  1.83±0.1  2.06±0.1  2.13±0.1   2.12±0.2   1.95±0.2   1.96±0.6
munin1       0.77±0.0  0.85±0.0  0.93±0.0  1.35±0.0  2.11±0.0  4.30±0.2  12.92±0.7  23.32±2.6  24.95±5.1  24.76±6.7
pigs         0.80±0.0  0.80±0.0  4.55±0.1  4.71±0.1  5.00±0.2  5.62±0.2  7.63±0.3   11.10±0.6  14.02±1.7  11.74±3.2
link         1.16±0.0  1.93±0.0  2.76±0.0  5.55±0.1  7.04±0.2  8.19±0.2  10.00±0.3  13.87±0.4  15.32±2.5  24.74±4.2
all          0.89±0.1  1.04±0.1  1.86±1.2  2.70±1.5  3.11±1.9  3.71±2.2  5.39±4.0   7.75±7.3   8.44±8.4   9.64±9.7

5.2. Label powerset decomposition

We shall first introduce the concept of label powerset that will play a pivotal role in the factorization of the conditional distribution p(Y | X).

Definition 1. Y_LP ⊆ Y is called a label powerset iff Y_LP ⊥⊥ Y \ Y_LP | X. Additionally, Y_LP is said to be minimal if it is nonempty and has no label powerset as a proper subset.

Let LP denote the set of all powersets defined over U, and minLP the set of all minimal label powersets. It is easily shown that {Y, ∅} ⊆ LP. The key idea behind label powersets is the decomposition of the conditional distribution of the labels into a product of factors

p(Y | X) = ∏_{j=1}^{L} p(Y_LP_j | X) = ∏_{j=1}^{L} p(Y_LP_j | M_LP_j)

where {Y_LP_1, . . . , Y_LP_L} is a partition of label powersets and M_LP_j is a Markov blanket of Y_LP_j in X. From the above definition, we have Y_LP_i ⊥⊥ Y_LP_j | X, ∀i ≠ j.
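For instance, if Y = {Y_1, Y_2, Y_3} decomposes into the minimal label powersets {Y_1, Y_2} and {Y_3}, the factorization reads p(Y | X) = p(Y_1, Y_2 | M_LP_1) p(Y_3 | M_LP_2), where each factor only involves the Markov blanket of its own powerset.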

In the framework of MLC, one can consider a multitude of loss functions. In this study, we focus on a non label-wise decomposable loss function called the subset 0/1 loss, which generalizes the well-known 0/1 loss from the conventional to the multi-label setting. The risk-minimizing prediction for the subset 0/1 loss is simply given by the mode of the distribution (Dembczyński et al., 2012). Therefore, we seek a factorization into a product of minimal factors in order to facilitate the estimation of the mode of the conditional distribution (also called the most probable explanation (MPE)):

max_y p(y | x) = ∏_{j=1}^{L} max_{y_LP_j} p(y_LP_j | x) = ∏_{j=1}^{L} max_{y_LP_j} p(y_LP_j | m_LP_j)

The next section aims to obtain theoretical results for the characterization of the minimal label powersets Y_LP_j and their Markov boundaries M_LP_j from a DAG, in order to be able to estimate the MPE more effectively.

5.3. Label powerset characterization

In this section, we show that minimal label powersets can be depicted graphically when p is faithful to a DAG:

Theorem 6. Suppose p is faithful to a DAG G. Then, Y_i and Y_j belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Y_i and Y_j in Y such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e. a collider of the form Y_p → X ← Y_q).

The proof is given in the Appendix. We shall now address the following question: what is the Markov boundary in X of a minimal label powerset? Answering


this question for general distributions is not trivial. We establish a useful relation between a label powerset and its Markov boundary in X when p is faithful to a DAG:

Theorem 7. Suppose p is faithful to a DAG G. Let Y_LP = {Y_1, Y_2, . . . , Y_n} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = ⋃_{j=1}^{n} {PC_Y_j ∪ SP_Y_j} \ Y.

The proof is given in the Appendix.

5.4. Experimental setup

The MLC problem is decomposed into a series of multi-class classification problems, where each multi-class variable encodes one label powerset, with as many classes as the number of possible combinations of labels, or those present in the training data. At this point, it should be noted that the LP method is a special case of this framework since the whole label set is a particular label powerset (not necessarily minimal though). The overall procedure can be summarized as follows:

1. Learn the local BN graph G around the label set;
2. Read off G the minimal label powersets and their respective Markov boundaries;
3. Train an independent multi-class classifier on each minimal LP, with the input space restricted to its Markov boundary in G;
4. Aggregate the predictions of each classifier to output the most probable explanation, i.e. arg max_y p(y | x) (a sketch of this pipeline is given below).
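A compact R sketch of this pipeline is given next. It is our own illustration, not the authors' code: mmhc() stands in for H2PC, the helper name mlp.mb() and its arguments are hypothetical, labels are assumed to be binary factors and features discrete (continuous features would be binarized first, as in Section 5.4.1), and the Random Forest base learner of Section 5.4 is used with 100 trees.

library(bnlearn)
library(randomForest)

# d: data frame containing the (discrete) features `feats` and binary labels `labs`;
# newdata: data frame of test features. All names are hypothetical.
mlp.mb <- function(d, feats, labs, newdata) {
  g <- mmhc(d)                                            # 1. learn the DAG (stand-in for H2PC)

  # 2. minimal label powersets (Theorem 6): link two labels if they are
  #    adjacent in g or share a common child among the features, then take
  #    the connected components of that relation.
  linked <- function(a, b)
    b %in% nbr(g, a) ||
      length(intersect(feats, intersect(children(g, a), children(g, b)))) > 0
  comp <- seq_along(labs)
  for (i in seq_along(labs)) for (j in seq_along(labs))
    if (i < j && linked(labs[i], labs[j])) comp[comp == comp[j]] <- comp[i]
  powersets <- split(labs, comp)

  # 3. one multi-class Random Forest per powerset, restricted to the Markov
  #    boundary of its labels in X (Theorem 7), 100 trees as in the experiments.
  preds <- lapply(powersets, function(ps) {
    mbound <- intersect(unique(unlist(lapply(ps, function(y) mb(g, y)))), feats)
    if (length(mbound) == 0) mbound <- feats              # simplistic fallback
    target <- factor(do.call(paste, c(d[ps], sep = "|"))) # label powerset coding
    rf <- randomForest(x = d[mbound], y = target, ntree = 100)
    as.character(predict(rf, newdata[mbound]))
  })

  # 4. split each per-powerset mode back into individual label predictions.
  out <- data.frame(row.names = seq_len(nrow(newdata)))
  for (k in seq_along(powersets)) {
    cols <- do.call(rbind, strsplit(preds[[k]], "|", fixed = TRUE))
    colnames(cols) <- powersets[[k]]
    out <- cbind(out, as.data.frame(cols, stringsAsFactors = FALSE))
  }
  out[labs]
}

The connected-component step is a direct transcription of the path condition of Theorem 6: two labels end up in the same component exactly when they are linked through label nodes or feature colliders with two label parents.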

To assess separately the quality of the minimal label powerset decomposition and the feature subset selection with Markov boundaries, we investigate 4 scenarios:

• BR without feature selection (denoted BR): a classifier is trained on each single label with all features as input. This is the simplest approach as it does not exploit any label dependency. It serves as a baseline learner for comparison purposes.

• BR with feature selection (denoted BR+MB): a classifier is trained on each single label with the input space restricted to its Markov boundary in G. Compared to the previous strategy, we evaluate here the effectiveness of the feature selection task.

• Minimal label powerset method without feature selection (denoted MLP): the minimal label powersets are obtained from the DAG. All features are used as inputs.

• Minimal label powerset method with feature selection (denoted MLP+MB): the minimal label powersets are obtained from the DAG. The input space is restricted to the Markov boundary of the labels in that powerset.

We use the same base learner in each meta-learning method: the well-known Random Forest classifier (Breiman, 2001). RF achieves good performance in standard classification as well as in multi-label problems (Kocev et al., 2007; Madjarov et al., 2012), and is able to handle both continuous and discrete data easily, which is much appreciated. The standard RF implementation in R (Liaw & Wiener, 2002)4 was used. For practical purposes, we restricted the forest size of RF to 100 trees, and left the other parameters to their default values.

5.4.1. Data and performance indicators

A total of 10 multi-label data sets are collected for experiments in this section, whose characteristics are summarized in Table 3. These data sets come from different problem domains including text, biology, and music. They can be found on the Mulan5 repository, except for image which comes from Zhou6 (Maron & Ratan, 1998). Let D be the multi-label data set; we use |D|, dim(D), L(D), F(D) to represent the number of examples, number of features, number of possible labels, and feature type respectively. DL(D) = |{Y | ∃x : (x, Y) ∈ D}| counts the number of distinct label combinations appearing in the data set. Continuous values are binarized during the BN structure learning phase.

Table 3: Data sets characteristics

data set    domain     |D|    dim(D)   F(D)    L(D)   DL(D)
emotions    music       593      72    cont.      6      27
yeast       biology    2417     103    cont.     14     198
image       images     2000     135    cont.      5      20
scene       images     2407     294    cont.      6      15
slashdot    text       3782    1079    disc.     22     156
genbase     biology     662    1186    disc.     27      32
medical     text        978    1449    disc.     45      94
enron       text       1702    1001    disc.     53     753
bibtex      text       7395    1836    disc.    159    2856
corel5k     images     5000     499    disc.    374    3175

The performance of a multi-label classifier can be assessed by several evaluation measures

4 http://cran.r-project.org/web/packages/randomForest
5 http://mulan.sourceforge.net/datasets.html
6 http://lamda.nju.edu.cn/data_MIMLimage.ashx


(Tsoumakas et al., 2010a). We focus on maximizing a non-decomposable score function: the global accuracy (also termed subset accuracy, the complement of the 0/1 loss), which measures the correct classification rate of the whole label set (exact match of all labels required). Note that the global accuracy implicitly takes into account the label correlations. It is therefore a very strict evaluation measure as it requires an exact match of the predicted and the true set of labels. It was recently proved in (Dembczyński et al., 2012) that BR is optimal for decomposable loss functions (e.g., the Hamming loss), while non-decomposable loss functions (e.g. the subset 0/1 loss) inherently require knowledge of the label conditional distribution. 10-fold cross-validation was performed for the evaluation of the MLC methods.
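Since the global (subset) accuracy is the only score reported below, it is worth being explicit about its computation; the helper below is a hypothetical R one-liner, where pred and truth are data frames or matrices of identical dimension (examples × labels).

# Fraction of examples whose whole predicted label set matches the true one.
subset.accuracy <- function(pred, truth) mean(rowSums(pred != truth) == 0)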

5.4.2. Results

Table 4 reports the outputs of H2PC and MMHC, and Table 5 shows the global accuracy of each method on the 10 data sets. Table 6 reports the running time and the average node degree of the labels in the DAGs obtained with both methods. Figures 3 to 7 display graphically the local DAG structures around the labels obtained with H2PC and MMHC, for illustration purposes.

Several conclusions may be drawn from these experiments. First, we may observe by inspection of the average degree of the label nodes in Table 6 that several DAGs are densely connected, like scene or bibtex, while others are rather sparse, like genbase, medical, corel5k. The DAGs displayed in Figures 3 up to 7 lend themselves to interpretation. They can be used for encoding as well as portraying the conditional independencies, and the d-sep criterion can be used to read them off the graph. Many label powersets that are reported in Table 4 can be identified by graphical inspection of the DAGs. For instance, the two label powersets in yeast (Figure 3, bottom plot) are clearly noticeable in both DAGs. Clearly, BNs have a number of advantages over alternative methods. They lay bare useful information about the label dependencies, which is crucial if one is interested in gaining an understanding of the underlying domain. It is however well beyond the scope of this paper to delve deeper into the DAG interpretation. Overall, it appears that the structures recovered by H2PC are significantly denser, compared to MMHC. On bibtex and enron, the increase of the average label degree is the most spectacular. On bibtex (resp. enron), the average label degree has raised from 2.6 to 6.4 (resp. from 1.2 to 3.3). This result is in nice agreement with the experiments in the previous section, as H2PC was shown to consistently reduce the rate of false negative edges with respect to MMHC (at the cost of a slightly higher false discovery rate).
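As an illustration of reading independencies off a learned DAG, the sketch below uses bnlearn (the package the public H2PC code is derived from, see the Conclusion) on its bundled toy data set; variable names and the query are illustrative only, not the benchmark data:

```r
# Illustrative only: query d-separation statements on a DAG learned with bnlearn.
library(bnlearn)

data(learning.test)          # small discrete data set shipped with bnlearn
dag <- hc(learning.test)     # greedy hill-climbing structure search

# Is A d-separated from (i.e. conditionally independent of) F given {B, E}?
dsep(dag, x = "A", y = "F", z = c("B", "E"))
```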

Second, Table 4 is very instructive as it reveals that on emotions, image, and scene, the MLP approach boils down to the LP scheme. This is easily seen as there is only one minimal label powerset extracted on average. In contrast, on genbase the MLP approach boils down to the simple BR scheme. This is confirmed by inspecting Table 5: the performances of MMHC and H2PC (without feature selection) are equal. An interesting observation upon looking at the distribution of the label powerset size shows that for the remaining data sets, the MLP mostly decomposes the label set in two parts: one on which it performs BR and the other one on which it performs LP. Take for instance enron: it can be seen from Table 4 that there are approximately 23 label singletons and a powerset of 30 labels with H2PC for a total of 53 labels. The gain in performance with MLP over BR, our baseline learner, can be ascribed to the quality of the label powerset decomposition as BR is ignoring label dependencies. As expected, the results using MLP clearly dominate those obtained using BR on all data sets except genbase. The DAGs display the minimum label powersets and their relevant features.
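The decomposition described here can be sketched in a few lines of R (our own illustration; label names and powersets are hypothetical): each minimal label powerset becomes one multi-class target, so singleton powersets reduce to binary relevance while larger ones follow the label powerset encoding.

```r
# Sketch: turn a list of minimal label powersets into multi-class targets.
# Y is a 0/1 label matrix; powersets is a list of label-name vectors,
# e.g. list("Y1", c("Y2", "Y3")) -- hypothetical names.
powerset_targets <- function(Y, powersets) {
  lapply(powersets, function(ls) {
    block <- Y[, ls, drop = FALSE]
    # Each distinct label combination within the powerset becomes one class
    # (for a singleton powerset this is exactly binary relevance).
    factor(apply(block, 1, paste, collapse = ""))
  })
}

Y <- cbind(Y1 = c(1, 0, 1, 0), Y2 = c(0, 1, 1, 0), Y3 = c(0, 1, 0, 0))
powerset_targets(Y, list("Y1", c("Y2", "Y3")))
```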

Third, H2PC compares favorably to MMHC. On scene for instance, the accuracy of MLP+MB has raised from 20% (with MMHC) to 56% (with H2PC). On yeast, it has raised from 7% to 23%. Without feature selection, the difference in global accuracy is less pronounced but still in favor of H2PC. A Wilcoxon signed rank paired test reveals statistically significant improvements of H2PC over MMHC in the MLP approach without feature selection (p < 0.02). This trend is more pronounced when the feature selection is used (p < 0.001), using MLP+MB. Note however that the LP decompositions for H2PC and MMHC on yeast and image are strictly identical, hence the same accuracy values in Table 5 in the MLP column.
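The significance test used here is the paired Wilcoxon signed rank test; in R, with the MLP+MB columns of Table 5, it reads as follows (the alternative hypothesis being that H2PC improves over MMHC):

```r
# Paired Wilcoxon signed rank test over the 10 data sets,
# using the MLP+MB columns of Table 5 (MMHC vs. H2PC).
acc_mmhc <- c(0.189, 0.073, 0.308, 0.199, 0.278, 0.974, 0.636, 0.079, 0.121, 0.010)
acc_h2pc <- c(0.285, 0.233, 0.360, 0.565, 0.439, 0.976, 0.669, 0.157, 0.177, 0.034)
wilcox.test(acc_h2pc, acc_mmhc, paired = TRUE, alternative = "greater")
```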

Fourth, regarding the utility of the feature selection, it is difficult to reach any conclusion. Whether BR or MLP is used, the use of the selected features as inputs to the classification model is not shown to be greatly beneficial in terms of global accuracy on average. The performance of BR and our MLP with all the features outperforms that with the selected features in 6 data sets, but the feature selection leads to actual improvements in 3 data sets. The difference in accuracy with and without the feature selection was not shown to be statistically significant (p > 0.20 with a Wilcoxon signed rank paired test). Surprisingly, the feature selection did exceptionally well on genbase. On this data set, the increase in accuracy is the most impressive: it raised from 7% to 98%, which is atypical. The dramatic increase in accuracy on genbase is due solely to the restricted feature set as input to the classification model. This is also observed on medical, to a lesser extent though. Interestingly, on large and densely connected networks (e.g. bibtex, slashdot and corel5k), the feature selection performed very well in terms of global accuracy and significantly reduced the input space, which is noticeable. On emotions, yeast, image, scene and genbase, the method reduced the feature set down to nearly 1/100 of its original size. The feature selection should be evaluated in view of its effectiveness at balancing the increasing error and the decreasing computational burden by drastically reducing the feature space. We should also keep in mind that the feature relevance cannot be defined independently of the learner and the model-performance metric (e.g., the loss function used). Admittedly, our feature selection based on the Markov boundaries is not necessarily optimal for the base MLC learner used here, namely the Random Forest model.
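A sketch of the feature-selection step as used here, assuming the features retained for a given label powerset are available as a character vector (names below are hypothetical) and using the randomForest package as the base learner:

```r
# Sketch: train the base multi-class learner on the features selected
# by the Markov boundary of one label powerset (names are hypothetical).
library(randomForest)

train_powerset_classifier <- function(X, target, selected_features) {
  # X: data.frame of features; target: factor encoding the label powerset;
  # selected_features: character vector of feature names kept by the selection step.
  X_sel <- X[, selected_features, drop = FALSE]
  randomForest(x = X_sel, y = target)
}

# Example call (placeholders):
# rf <- train_powerset_classifier(X, target = powerset_class,
#                                 selected_features = c("f12", "f87"))
```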

As far as the overall running time performance is concerned, we see from Table 6 that for both methods, the running time grows somewhat exponentially with the size of the Markov boundary and the number of features, hence the considerable rise in total running time on bibtex. H2PC takes almost 200 times longer on bibtex (1826 variables) and enron (1001 variables), which is quite considerable but still affordable (44 hours of single-CPU time on bibtex and 13 hours on enron). We also observe that the size of the parent set with H2PC is 2.5 (on bibtex) and 3.6 (on enron) times larger than that of MMHC (which may hurt interpretability). In fact, the running time is known to increase exponentially with the parent set size of the true underlying network. This is mainly due to the computational overhead of the greedy search-and-score procedure with larger parent sets, which is the most promising part to optimize in terms of computational gains, as we discuss in the Conclusion.

6. Discussion & practical applications

Our prime conclusion is that H2PC is a promising approach to constructing BN global or local structures around specific nodes of interest, with potentially thousands of variables. Concentrating on higher recall values while keeping the false positive rate as low as possible pays off in terms of goodness of fit and structure accuracy. Historically, the main practical difficulty in the application of BN structure discovery approaches has been a lack of suitable computing resources and relevant accessible software. The number

Table 4: Distribution of the number and the size of the minimal label powersets output by H2PC (top) and MMHC (bottom). On each data set: mean number of powersets, minimum/median/maximum number of labels per powerset, and minimum/median/maximum number of distinct classes per powerset. The total number of labels and distinct label combinations is recalled for convenience.

H2PC

dataset    # LPs         # labels / LP           # classes / LP
                         min/med/max  (L(D))     min/med/max    (DL(D))
emotions   1.0 ± 0.0     6 / 6 / 6    (6)        26 / 27 / 27   (27)
yeast      2.0 ± 0.0     2 / 7 / 12   (14)       3 / 64 / 135   (198)
image      1.0 ± 0.0     5 / 5 / 5    (5)        19 / 20 / 20   (20)
scene      1.0 ± 0.0     6 / 6 / 6    (6)        14 / 15 / 15   (15)
slashdot   9.5 ± 0.8     1 / 1 / 14   (22)       1 / 2 / 111    (156)
genbase    26.9 ± 0.3    1 / 1 / 2    (27)       1 / 2 / 2      (32)
medical    38.6 ± 1.8    1 / 1 / 6    (45)       1 / 2 / 10     (94)
enron      24.5 ± 1.9    1 / 1 / 30   (53)       1 / 2 / 334    (753)
bibtex     39.6 ± 4.3    1 / 1 / 112  (159)      2 / 2 / 1588   (2856)
corel5k    97.7 ± 3.9    1 / 1 / 263  (374)      1 / 2 / 2749   (3175)

MMHC

dataset    # LPs         # labels / LP           # classes / LP
                         min/med/max  (L(D))     min/med/max    (DL(D))
emotions   1.2 ± 0.4     1 / 6 / 6    (6)        2 / 26 / 27    (27)
yeast      2.0 ± 0.0     1 / 7 / 13   (14)       2 / 89 / 186   (198)
image      1.0 ± 0.0     5 / 5 / 5    (5)        19 / 20 / 20   (20)
scene      1.0 ± 0.0     6 / 6 / 6    (6)        14 / 15 / 15   (15)
slashdot   15.1 ± 0.6    1 / 1 / 9    (22)       1 / 2 / 46     (156)
genbase    27.0 ± 0.0    1 / 1 / 1    (27)       1 / 2 / 2      (32)
medical    41.7 ± 1.4    1 / 1 / 4    (45)       1 / 2 / 7      (94)
enron      36.8 ± 2.7    1 / 1 / 12   (53)       1 / 2 / 119    (753)
bibtex     77.2 ± 3.1    1 / 1 / 28   (159)      2 / 2 / 233    (2856)
corel5k    165.3 ± 5.5   1 / 1 / 177  (374)      1 / 2 / 2294   (3175)

Table 5: Global classification accuracies using 4 learning methods (BR, MLP, BR+MB, MLP+MB) based on the DAG obtained with H2PC and MMHC. Best values between H2PC and MMHC are boldfaced.

dataset    BR       MLP                BR+MB              MLP+MB
                    MMHC     H2PC      MMHC     H2PC      MMHC     H2PC
emotions   0.327    0.371    0.391     0.147    0.266     0.189    0.285
yeast      0.169    0.273    0.271     0.062    0.155     0.073    0.233
image      0.342    0.504    0.504     0.227    0.252     0.308    0.360
scene      0.580    0.733    0.733     0.123    0.361     0.199    0.565
slashdot   0.355    0.419    0.474     0.274    0.373     0.278    0.439
genbase    0.069    0.069    0.069     0.974    0.976     0.974    0.976
medical    0.454    0.499    0.502     0.630    0.651     0.636    0.669
enron      0.134    0.165    0.180     0.024    0.079     0.079    0.157
bibtex     0.108    0.115    0.167     0.120    0.133     0.121    0.177
corel5k    0.002    0.030    0.043     0.000    0.002     0.010    0.034


Table 6: DAG learning time (in seconds) and average degree of the label nodes.

dataset    MMHC                        H2PC
           time      label degree      time        label degree
emotions   1.5       2.6 ± 1.1         16.2        4.7 ± 1.8
yeast      9.0       3.0 ± 1.6         90.9        5.0 ± 2.9
image      3.1       4.0 ± 0.8         28.4        4.6 ± 1.0
scene      9.9       3.1 ± 1.0         662.9       6.6 ± 1.8
slashdot   44.8      2.7 ± 1.4         822.3       4.0 ± 5.2
genbase    12.5      1.0 ± 0.3         12.4        1.1 ± 0.3
medical    43.8      1.7 ± 1.1         334.8       2.9 ± 2.4
enron      199.8     1.2 ± 1.5         47863.0     4.3 ± 5.2
bibtex     960.8     2.6 ± 1.3         159469.6    6.4 ± 10.3
corel5k    169.6     1.5 ± 1.4         2513.0      2.9 ± 2.8

of variables which can be included in exact BN analysis is still limited. As a guide, this might be less than about 40 variables for exact structural search techniques (Perrier et al., 2008; Kojima et al., 2010). In contrast, constraint-based heuristics like the one presented in this study are capable of processing many thousands of features within hours on a personal computer, while maintaining a very high structural accuracy. H2PC and MMHC could potentially handle up to 10,000 labels in a few days of single-CPU time, and far less by parallelizing the skeleton identification algorithm as discussed in (Tsamardinos et al., 2006; Villanueva & Maciel, 2014).

The advantages in terms of structure accuracy and its ability to scale to thousands of variables open up many avenues of future possible applications of H2PC in various domains, as we shall discuss next. For example, BNs have especially proven to be useful abstractions in computational biology (Nagarajan et al., 2013; Scutari & Nagarajan, 2013; Prestat et al., 2013; Aussem et al., 2012, 2010; Pena, 2008; Pena et al., 2005). Identifying the gene network is crucial for understanding the behavior of the cell which, in turn, can lead to better diagnosis and treatment of diseases. This is also of great importance for characterizing the function of genes and the proteins they encode in determining traits, psychology, or development of an organism. Genome sequencing uses high-throughput techniques like DNA microarrays, proteomics, metabolomics and mutation analysis to describe the function and interactions of thousands of genes (Zhang & Zhou., 2006). Learning BN models of gene networks from these huge data is still a difficult task (Badea, 2004; Bernard & Hartemink, 2005; Friedman et al., 1999a; Ott et al., 2004; Peer et al., 2001). In these studies, the authors had to decide in advance which genes were included in the learning process, in all the cases less than 1000, and which genes were excluded from it (Pena, 2008). H2PC overcomes the problem by focusing the search around a targeted gene: the key step is the identification of the vicinity of a node X (Pena et al., 2005).
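In bnlearn terms (the library the public H2PC code is built on), such a targeted search corresponds to local-discovery calls that learn only the neighborhood or Markov blanket of one node; a sketch on bnlearn's bundled toy data, assuming the learn.nbr()/learn.mb() interface:

```r
# Sketch: local discovery around one target variable with bnlearn.
library(bnlearn)

data(learning.test)                         # toy data bundled with bnlearn
# Parents/children (local neighborhood) of node "B":
nbr_B <- learn.nbr(learning.test, node = "B", method = "mmpc")
# Markov blanket of node "B":
mb_B  <- learn.mb(learning.test, node = "B", method = "iamb")
nbr_B; mb_B
```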

Our second objective in this study was to demonstrate the potential utility of hybrid BN structure discovery to multi-label learning. In multi-label data where many inter-dependencies between the labels may be present, explicitly modeling all relationships between the labels is intuitively far more reasonable (as demonstrated in our experiments). BNs explicitly account for such inter-dependencies and the DAG allows us to identify an optimal set of predictors for every label powerset. The experiments presented here support the conclusion that local structural learning in the form of local neighborhood induction and Markov blanket is a theoretically well-motivated approach that can serve as a powerful learning framework for label dependency analysis geared toward multi-label learning. Multi-label scenarios are found in many application domains, such as multimedia annotation (Snoek et al., 2006; Trohidis et al., 2008), tag recommendation, text categorization (McCallum, 1999; Zhang & Zhou., 2006), protein function classification (Roth & Fischer, 2007), and antiretroviral drug categorization (Borchani et al., 2013).

7. Conclusion & avenues for future research

We first discussed a hybrid algorithm for global or local (around target nodes) BN structure learning called Hybrid HPC (H2PC). Our extensive experiments showed that H2PC outperforms the state-of-the-art MMHC by a significant margin in terms of edge recall without sacrificing the number of extra edges, which is crucial for the soundness of the super-structure used during the second stage of hybrid methods, like the ones proposed in (Perrier et al., 2008; Kojima et al., 2010). The code of H2PC is open-source and publicly available online at https://github.com/madbix/bnlearn-clone-3.4. Second, we discussed an application of H2PC to the multi-label learning problem, which is a challenging problem in many real-world application domains. We established theoretical results, under the faithfulness condition, in order to characterize graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution and their respective Markov boundaries. As far as we know, this is the first investigation of Markov boundary principles for the optimal variable/feature selection problem in multi-label learning. These formal results offer a simple guideline to characterize graphically: i) the minimal label powerset decomposition (i.e. into minimal subsets YLP ⊆ Y such that YLP ⊥⊥ Y \ YLP | X), and ii) the minimal subset of features, w.r.t. an Information Theory criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of the multi-label classification. The theoretical analysis laid the foundation for a practical multi-label classification procedure. Another set of experiments was carried out on a broad range of multi-label data sets from different problem domains to demonstrate its effectiveness. H2PC was shown to outperform MMHC by a significant margin in terms of global accuracy.

We suggest several avenues for future research. As far as BN structure learning is concerned, future work will aim at: 1) ascertaining which independence test (e.g. tests targeting specific distributions, employing parametric assumptions etc.) is most suited to the data at hand (Tsamardinos & Borboudakis, 2010; Scutari, 2011); 2) controlling the false discovery rate of the edges in the graph output by H2PC (Pena, 2008), especially when dealing with more nodes than samples, e.g. learning gene networks from gene expression data. In this study, H2PC was run independently on each node without keeping track of the dependencies found previously. This led to some loss of efficiency due to redundant calculations. The optimization of the H2PC code is currently being undertaken to lower the computational cost while maintaining its performance. These optimizations will include the use of a cache to store the (in)dependencies and the use of a global structure. Other interesting research avenues to explore are extensions and modifications to the greedy search-and-score procedure, which is the most promising part to optimize in terms of computational gains. Regarding the multi-label learning problem, experiments with several thousands of labels are currently being conducted and will be reported in due course. We also intend to work on relaxing the faithfulness assumption and derive a practical multi-label classification algorithm based on H2PC that is correct under milder assumptions underlying the joint distribution (e.g. Composition, Intersection). This needs further substantiation through more analysis.

Acknowledgments

The authors thank Marco Scutari for sharing his bnlearn package in R. The research was also supported by grants from the European ENIAC Joint Undertaking (INTEGRATE project) and from the French Rhone-Alpes Complex Systems Institute (IXXI).

Appendix

Lemma 8. ∀ YLPi, YLPj ∈ LP, YLPi ∩ YLPj ∈ LP.

Proof. To keep the notation uncluttered and for the sake of simplicity, consider a partition {Y1, Y2, Y3, Y4} of Y such that YLPi = Y1 ∪ Y2, YLPj = Y2 ∪ Y3. From the label powerset assumption for YLPi and YLPj, we have (Y1 ∪ Y2) ⊥⊥ (Y3 ∪ Y4) | X and (Y2 ∪ Y3) ⊥⊥ (Y1 ∪ Y4) | X. From the Decomposition, we have that Y2 ⊥⊥ (Y3 ∪ Y4) | X. Using the Weak union, we obtain that Y2 ⊥⊥ Y1 | (Y3 ∪ Y4 ∪ X). From these two facts, we can use the Contraction property to show that Y2 ⊥⊥ (Y1 ∪ Y3 ∪ Y4) | X. Therefore, Y2 = YLPi ∩ YLPj is a label powerset by definition.

Lemma 9. Let Yi and Yj denote two distinct labels in Y and define by YLPi and YLPj their respective minimal label powersets. Then we have,

∃Z ⊆ Y \ {Yi, Yj}, {Yi} ⊥̸⊥ {Yj} | (X ∪ Z) ⇒ YLPi = YLPj

Proof. Let us suppose YLPi ≠ YLPj. By the label powerset definition for YLPi, we have YLPi ⊥⊥ Y \ YLPi | X. As YLPi ∩ YLPj = ∅ owing to Lemma 8, we have that Yj ∉ YLPi. ∀Z ⊆ Y \ {Yi, Yj}, Z can be decomposed as Zi ∪ Zj such that Zi = Z ∩ (YLPi \ {Yi}) and Zj = Z \ Zi. Using the Decomposition property, we have that ({Yi} ∪ Zi) ⊥⊥ ({Yj} ∪ Zj) | X. Using the Weak union property, we have that {Yi} ⊥⊥ {Yj} | (X ∪ Z). As this is true ∀Z ⊆ Y \ {Yi, Yj}, then ∄Z ⊆ Y \ {Yi, Yj}, {Yi} ⊥̸⊥ {Yj} | (X ∪ Z), which completes the proof.

Lemma 10. Consider YLP a minimal label powerset. Then, if p satisfies the Composition property, Z ⊥̸⊥ YLP \ Z | X for any nonempty Z ⊂ YLP.

Proof. By contradiction, suppose a nonempty Z exists such that Z ⊥⊥ YLP \ Z | X. From the label powerset assumption of YLP, we have that YLP ⊥⊥ Y \ YLP | X. From these two facts, we can use the Composition property to show that Z ⊥⊥ Y \ Z | X, which contradicts the minimal label powerset assumption of YLP. This concludes the proof.

Theorem 7. Suppose p is faithful to a DAG G. Then, Yi and Yj belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Yi and Yj in Y such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e. a collider of the form Yp → X ← Yq).


Proof. Suppose such a path exists. By conditioning on all the intermediate colliders Yk in Y of the form Yp → Yk ← Yq along an undirected path in G between nodes Yi and Yj, we ensure that ∃Z ⊆ Y \ {Yi, Yj} such that Yi and Yj are not d-separated given X ∪ Z. Due to the faithfulness, this is equivalent to {Yi} ⊥̸⊥ {Yj} | (X ∪ Z). From Lemma 9, we conclude that Yi and Yj belong to the same minimal label powerset. To show the inverse, note that owing to Lemma 10, we know that there exists no partition {Zi, Zj} of Z such that ({Yi} ∪ Zi) ⊥⊥ ({Yj} ∪ Zj) | X. Due to the faithfulness, there exists at least a link between ({Yi} ∪ Zi) and ({Yj} ∪ Zj) in the DAG, hence by recursion, there exists a path between Yi and Yj such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e. a collider of the form Yp → X ← Yq).
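Theorem 7 suggests a purely graphical procedure for grouping labels into minimal label powersets: walk over the DAG, ignoring edge directions, through intermediate nodes that are either labels or feature colliders with at least two label parents. The R sketch below is our own illustration over a hypothetical edge-list encoding of the DAG, not the authors' implementation.

```r
# Sketch of Theorem 7's criterion on a DAG given as a directed edge list
# (data frame with columns `from` and `to`; names are hypothetical).
same_minimal_powerset <- function(edges, labels, yi, yj) {
  # Undirected adjacency.
  und <- rbind(edges, data.frame(from = edges$to, to = edges$from))
  parents <- function(v) edges$from[edges$to == v]
  # A node may serve as an intermediate step iff it is a label, or a feature
  # with at least two label parents (collider Yp -> X <- Yq).
  admissible <- function(v) v %in% labels || sum(parents(v) %in% labels) >= 2
  visited <- yi; frontier <- yi
  while (length(frontier) > 0) {
    nxt <- setdiff(und$to[und$from %in% frontier], visited)
    if (yj %in% nxt) return(TRUE)
    nxt <- Filter(admissible, nxt)        # keep only admissible intermediate nodes
    visited <- c(visited, nxt); frontier <- nxt
  }
  FALSE
}

# Toy DAG: Y1 -> X1 <- Y2 (collider on a feature), Y3 only reachable through X2.
edges <- data.frame(from = c("Y1", "Y2", "X2"), to = c("X1", "X1", "Y3"))
same_minimal_powerset(edges, labels = c("Y1", "Y2", "Y3"), "Y1", "Y2")  # TRUE
same_minimal_powerset(edges, labels = c("Y1", "Y2", "Y3"), "Y1", "Y3")  # FALSE
```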

Theorem 8. Suppose p is faithful to a DAG G. Let YLP = {Y1, Y2, . . . , Yn} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = ⋃_{j=1}^{n} (PC_{Yj} ∪ SP_{Yj}) \ Y.

Proof. First, we prove that M is a Markov boundary of YLP in U. Define M_i the Markov boundary of Yi in U, and M'_i = M_i \ Y. From Theorem 5, M_i is given in G by M_i = PC_{Yi} ∪ SP_{Yi}. We may now prove that M'_1 ∪ · · · ∪ M'_n is a Markov boundary of {Y1, Y2, . . . , Yn} in U. We show first that the statement holds for n = 2 and then conclude that it holds for all n by induction. Let W denote U \ (Y1 ∪ Y2 ∪ M'_1 ∪ M'_2), and define Y1 = {Y1} and Y2 = {Y2}. From the Markov blanket assumption for Y1 we have Y1 ⊥⊥ U \ (Y1 ∪ M_1) | M_1. Using the Weak Union property, we obtain that Y1 ⊥⊥ W | M'_1 ∪ M'_2 ∪ Y2. Similarly we can derive Y2 ⊥⊥ W | M'_2 ∪ M'_1 ∪ Y1. Combining these two statements yields Y1 ∪ Y2 ⊥⊥ W | M'_1 ∪ M'_2 due to the Intersection property. Let M = M'_1 ∪ M'_2; the last expression can be formulated as Y1 ∪ Y2 ⊥⊥ U \ (Y1 ∪ Y2 ∪ M) | M, which is the definition of a Markov blanket of Y1 ∪ Y2 in U. We shall now prove that M is minimal. Let us suppose that it is not the case, i.e., ∃Z ⊂ M such that Y1 ∪ Y2 ⊥⊥ Z ∪ (U \ (Y1 ∪ Y2 ∪ M)) | M \ Z. Define Z_1 = Z ∩ M_1 and Z_2 = Z ∩ M_2; we may apply the Weak Union property to get Y1 ⊥⊥ Z_1 | (M \ Z) ∪ Y2 ∪ Z_2 ∪ (U \ (Y ∪ M)), which can be rewritten more compactly as Y1 ⊥⊥ Z_1 | (M_1 \ Z_1) ∪ (U \ (Y1 ∪ M_1)). From the Markov blanket assumption, we have Y1 ⊥⊥ U \ (Y1 ∪ M_1) | M_1. We may now apply the Intersection property on these two statements to obtain Y1 ⊥⊥ Z_1 ∪ (U \ (Y1 ∪ M_1)) | M_1 \ Z_1. Similarly, we can derive Y2 ⊥⊥ Z_2 ∪ (U \ (Y2 ∪ M_2)) | M_2 \ Z_2. From the Markov boundary assumption of M_1 and M_2, we have necessarily Z_1 = ∅ and Z_2 = ∅, which in turn yields Z = ∅. To conclude for any n > 2, it suffices to set Y1 = ⋃_{j=1}^{n−1} {Yj} and Y2 = {Yn} to conclude by induction.

Second, we prove that M is a Markov blanket of YLP in X. Define Z = M ∩ Y. From the label powerset definition, we have YLP ⊥⊥ Y \ YLP | X. Using the Weak Union property, we obtain YLP ⊥⊥ Z | U \ (YLP ∪ Z), which can be reformulated as YLP ⊥⊥ Z | (U \ (YLP ∪ M)) ∪ (M \ Z). Now, the Markov blanket assumption for YLP in U yields YLP ⊥⊥ U \ (YLP ∪ M) | M, which can be rewritten as YLP ⊥⊥ U \ (YLP ∪ M) | (M \ Z) ∪ Z. From the Intersection property, we get YLP ⊥⊥ Z ∪ (U \ (YLP ∪ M)) | M \ Z. From the Markov boundary assumption of M in U, we know that there exists no proper subset of M which satisfies this statement, and therefore Z = M ∩ Y = ∅. From the Markov blanket assumption of M in U, we have YLP ⊥⊥ U \ (YLP ∪ M) | M. Using the Decomposition property, we obtain YLP ⊥⊥ X \ M | M, which, together with the assumption M ∩ Y = ∅, is the definition of a Markov blanket of YLP in X.

Finally, we prove that M is a Markov boundary of YLP in X. Let us suppose that it is not the case, i.e., ∃Z ⊂ M such that YLP ⊥⊥ Z ∪ (X \ M) | M \ Z. From the label powerset assumption of YLP, we have YLP ⊥⊥ Y \ YLP | X, which can be rewritten as YLP ⊥⊥ Y \ YLP | (X \ M) ∪ (M \ Z) ∪ Z. Due to the Contraction property, combining these two statements yields YLP ⊥⊥ Z ∪ (U \ (M ∪ YLP)) | M \ Z. From the Markov boundary assumption of M in U, we have necessarily Z = ∅, which suffices to prove that M is a Markov boundary of YLP in X.
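Theorem 8 likewise reduces to a set computation on the DAG: union, over the labels of the powerset, of their parents, children and spouses (parents of common children), minus the label set Y. A minimal sketch, again over a hypothetical edge-list representation rather than the authors' code:

```r
# Sketch of Theorem 8: Markov boundary of a label powerset from a DAG edge list.
powerset_markov_boundary <- function(edges, powerset, labels) {
  parents  <- function(v) edges$from[edges$to == v]
  children <- function(v) edges$to[edges$from == v]
  spouses  <- function(v) unlist(lapply(children(v), parents))  # parents of children
  mb <- unique(unlist(lapply(powerset,
                             function(y) c(parents(y), children(y), spouses(y)))))
  setdiff(mb, c(powerset, labels))   # keep features only: M = U_j (PC_Yj u SP_Yj) \ Y
}

# Toy DAG: X1 -> Y1 -> X2 <- X3, Y2 -> X2.
edges <- data.frame(from = c("X1", "Y1", "X3", "Y2"), to = c("Y1", "X2", "X2", "X2"))
powerset_markov_boundary(edges, powerset = c("Y1", "Y2"), labels = c("Y1", "Y2"))
# -> "X1" "X2" "X3"
```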


[Figure 3: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Emotions (panel a) and Yeast (panel b).]


[Figure 4: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Image (panel a) and Scene (panel b).]


[Figure 5: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Slashdot (panel a) and Genbase (panel b).]


[Figure 6: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Medical (panel a) and Enron (panel b).]


[Figure 7: The local BN structures learned by MMHC and H2PC on a single cross-validation split, on Bibtex (panel a) and Corel5k (panel b).]

lizard

marine

penguin

deer

white−tailed

horns

slope

mule

fawn

antlers

elk

caribou

herd

moose

clearing

mare

foals

orchid

lily

stems

row

chrysanthemums

blooms

cactus

saguarogiraffe

zebra

tusks

hands

train

desertdunes

canyon

lighthousemast

seals

texture

dust

pepper

swimmers

pyramid

mosque

sphinx

truck

fly

trunk

baby

eagle

lynx

rodent

squirrel

goat

marsh

wolf

pack

dall

porcupine

whales

rabbit

tracks

crops

animals

moss

trail

locomotive

railroad

vehicle aerial range

insect

manwoman

rice

prayer

glacier

harvest

girl

indian

pole

dance

africanshirt

buddhist

tomboutside

shade

formula

turn

straightaway

prototype

steel

scotland

ceiling

furniture

lichen

pups

antelope

pebbles

remains leopard

jeep

calf

reptile

snake

cougar

oahu

kauai

maui

school

canoe

race

hawaii

Cluster442

Cluster380

Cluster20

Cluster317

Cluster243

Cluster157

Cluster364

Cluster423

Cluster21

Cluster133

Cluster113

Cluster93

Cluster412

Cluster383

Cluster399

Cluster408

Cluster252

Cluster268

Cluster426

Cluster170

Cluster446

Cluster241

Cluster166

Cluster478

Cluster82

Cluster137

Cluster341

Cluster145

Cluster471

Cluster96

Cluster79

Cluster230

Cluster260

Cluster461

Cluster362

Cluster409

Cluster492

Cluster42

Cluster59

Cluster205

Cluster458

Cluster81

Cluster347

Cluster129

Cluster427

Cluster348

Cluster128

Cluster101

Cluster235

Cluster351

Cluster165

Cluster390

Cluster150

Cluster39

Cluster144

Cluster392

Cluster70

Cluster493

Cluster9

Cluster305

Cluster281

Cluster193

Cluster267

Cluster126

Cluster232

Cluster159

Cluster138

Cluster173Cluster482

Cluster22

Cluster29

Cluster49

Cluster470

Cluster353

Cluster425Cluster217

Cluster326

Cluster209

Cluster110

Cluster278

Cluster322

Cluster256

Cluster185

Cluster282

Cluster431

Cluster17

Cluster117

Cluster318

Cluster114

Cluster149

Cluster244

Cluster95

Cluster56

Cluster331Cluster142

Cluster424

Cluster479

Cluster356

Cluster286

Cluster313

Cluster344

Cluster201

Cluster291

Cluster457

Cluster420

Cluster330

Cluster23

Cluster62

Cluster498

Cluster405

Cluster443

Cluster214

Cluster152

Cluster1

Cluster394

Cluster284

Cluster190

Cluster439

Cluster345

Cluster227

Cluster436

Cluster308

Cluster195

Cluster403

city

mountain

sky

sun

water

clouds

tree

bay

lake

sea

beach

boats

people

branch

leaf

grass

plain

palm

horizon

shell

hills

waves

birds

land

dog

bridge

ships

buildings

fence

island

storm

peaks

jet

plane

runway

basket

flight

flag

helicopter

boeing

prop

f−16

tails

smokeformation

bear

polar

snow

tundra

ice

head

black

reflection

ground

forest

fall

river

field

flowers

stream

meadow

rocks

hillside

shrubs

close−up

grizzly

cubs

drum

log

hut

sunset

display

plants

pool

coral

fan

anemone

fish

ocean

diver

sunrise

face

sand

rainbow

farms

reefs

vegetation

house

village

carvings

path

wood

dress

coast

sailboats

cat

tiger

bengal

fox

kit run

shadows

winter

autumn

cliff

bush

rockface

pair

den

coyote

light

arctic

shore

town

road

chapel

moon

harbor

windmills

restaurant

wall

skyline

window

clothes

shops

street

cafe

tables

nets

crafts

roofs

ruins

stone

cars

castle

courtyard

statuestairs

costume

sponges

sign

palace

paintings

sheep

valley

balcony

post

gate

plaza

festival

temple

sculpture

museum

hotel

art

fountain

market

door

mural

garden

star

butterfly

angelfish

lion

cave

crab

grouper

pagoda

buddha

decoration

monastery

landscape

detail

writing

sails

food

room

entrance

fruit

night

perch

cow

figures

facade

chairs

guard

pond

church

park

barnarchhats

cathedral

ceremony

crowd

glass

shrine

model

pillar

carpet

monument

floor

vines

cottage

poppies

lawn

tower

vegetables

bench

rose

tulip

canal

cheese

railing

dock

horses

petals

umbrella

column

waterfalls

elephant

monks

pattern

interior

vendor

silhouette

architecture

blossoms

athlete

parade

ladder

sidewalk

store

steps

relief

fog

frostfrozen

rapids

crystals

spider

needles

stick

mist

doorway

vineyard

pottery

pots

military

designs

mushrooms

terrace

tent

bulls

gianttortoise

wings

albatross

booby

nest

hawk

iguana

lizard

marine

penguin

deer

white−tailed

horns

slope

mule

fawn

antlers

elk

caribouherd

moose

clearing

marefoals

orchidlily

stems

row

chrysanthemums

blooms

cactus

saguaro

giraffe

zebra

tusks

hands

train

desert

dunes

canyon

lighthouse

mast

seals

texture

dustpepper

swimmers

pyramid

mosque

sphinx

truck

fly

trunk

baby

eagle

lynx

rodent

squirrel

goat

marsh

wolf

pack

dall

porcupine

whales

rabbit

tracks

crops

animals

moss

trail

locomotiverailroad

vehicleaerial

range

insect

man

woman

rice

prayer

glacier

harvest

girl

indian

pole

dance

african

shirt

buddhist

tomb

outside

shade

formula

turnstraightaway

prototype

steel

scotland

ceilingfurniture

lichen

pups

antelope

pebbles

remains

leopard

jeep

calf

reptile

snake

cougar

oahu

kauai

maui

school

canoe race

hawaii

Cluster429Cluster287

Cluster94

Cluster352

Cluster96

Cluster395

Cluster442

Cluster433

Cluster118

Cluster365

Cluster83

Cluster106

Cluster489

Cluster71

Cluster89

Cluster390 Cluster165

Cluster495

Cluster380

Cluster20

Cluster120

Cluster471

Cluster57

Cluster144

Cluster150

Cluster60

Cluster449

Cluster411

Cluster224

Cluster275

Cluster84

Cluster53

Cluster88

Cluster325

Cluster50

Cluster112

Cluster288

Cluster317Cluster9

Cluster490

Cluster66

Cluster214

Cluster397

Cluster100

Cluster336

Cluster85

Cluster18

Cluster74

Cluster413

Cluster297

Cluster364

Cluster11

Cluster369

Cluster243

Cluster185

Cluster273

Cluster427

Cluster279

Cluster194

Cluster385

Cluster231

Cluster396

Cluster5

Cluster157

Cluster22

Cluster423

Cluster159

Cluster121

Cluster180

Cluster498

Cluster496 Cluster466

Cluster358

Cluster432

Cluster254

Cluster366

Cluster456

Cluster62

Cluster493

Cluster21

Cluster494

Cluster282

Cluster492

Cluster228

Cluster133

Cluster258

Cluster204

Cluster274

Cluster440

Cluster444

Cluster113

Cluster460

Cluster412Cluster383

Cluster399

Cluster326

Cluster148

Cluster192

Cluster19

Cluster459

Cluster278

Cluster408

Cluster441

Cluster463

Cluster482

Cluster376

Cluster304

Cluster252

Cluster268

Cluster27

Cluster149

Cluster426

Cluster410

Cluster79

Cluster318

Cluster170

Cluster55

Cluster166

Cluster321

Cluster389

Cluster333

Cluster377

Cluster52

Cluster241

Cluster138

Cluster345

Cluster163

Cluster398

Cluster294

Cluster154

Cluster437

Cluster414

Cluster48

Cluster417

Cluster331

Cluster2

Cluster415

Cluster467

Cluster12

Cluster478

Cluster209

Cluster174

Cluster82

Cluster137

Cluster354

Cluster341

Cluster123

Cluster308

Cluster238

Cluster465

Cluster117

Cluster43

Cluster139 Cluster439

Cluster379

Cluster462

Cluster178

Cluster72

Cluster115

Cluster458

Cluster260

Cluster461

Cluster234

Cluster436

Cluster182

Cluster169

Cluster359

Cluster285Cluster362

Cluster51

Cluster38

Cluster244

Cluster263

Cluster295

Cluster59

Cluster205

Cluster301

Cluster293

Cluster420

Cluster230

Cluster281

Cluster160

Cluster242

Cluster37

Cluster108Cluster200

Cluster81

Cluster438

Cluster227

Cluster292

Cluster191

Cluster322

Cluster406

Cluster448

Cluster349

Cluster450

Cluster15

Cluster291

Cluster129

Cluster394

Cluster491

Cluster348

Cluster128

Cluster101

Cluster305

Cluster130

Cluster45

Cluster220

Cluster4

Cluster486

Cluster156

Cluster235

Cluster221

Cluster386

Cluster316

Cluster151

Cluster351

Cluster276

Cluster381

Cluster289

Cluster422

Cluster404

Cluster481

Cluster173

Cluster267

Cluster102Cluster320

Cluster393

Cluster454

Cluster127

Cluster477

Cluster499

Cluster378

Cluster430

Cluster35

Cluster136

Cluster416

Cluster98

Cluster119

Cluster7Cluster264

Cluster271

Cluster39

Cluster392

Cluster330

Cluster391

Cluster114

Cluster186

Cluster164

Cluster58

Cluster162

Cluster246

Cluster483

Cluster14

Cluster435

Cluster199

Cluster215

Cluster6

Cluster111

Cluster70

Cluster283Cluster286

Cluster167

Cluster451

Cluster239

Cluster202

Cluster25

Cluster306

Cluster211

Cluster168

Cluster217

Cluster457

Cluster146

Cluster485

Cluster219

Cluster47

Cluster155

Cluster67

Cluster193

Cluster363

Cluster40

Cluster75

Cluster455

Cluster299

Cluster405

Cluster323

Cluster372

Cluster468

Cluster475

Cluster313

Cluster104

Cluster223

Cluster311

Cluster124

Cluster213

Cluster8

Cluster122

Cluster284

Cluster190

Cluster350

Cluster181

Cluster210

Cluster41

Cluster434

Cluster183

Cluster474

Cluster126

Cluster44

Cluster232

Cluster147

Cluster63

Cluster342

Cluster46

Cluster140

Cluster309

Cluster403

Cluster470

Cluster357

Cluster152

Cluster207

Cluster340

Cluster203 Cluster65

Cluster262

Cluster233

Cluster269

Cluster387

Cluster68

Cluster34

Cluster208

Cluster23

Cluster375

Cluster29

Cluster310

Cluster49Cluster30

Cluster1

Cluster353

Cluster425

Cluster327

Cluster110

Cluster497

Cluster402

Cluster257

Cluster476

Cluster329Cluster266

Cluster197

Cluster175

Cluster76

Cluster256

Cluster179

Cluster338

Cluster222

Cluster187

Cluster370

Cluster384

Cluster277

Cluster431

Cluster17

Cluster109

Cluster265

Cluster447

Cluster407

Cluster226

Cluster56

Cluster360

Cluster153

Cluster184

Cluster445

Cluster400

Cluster382

Cluster91

Cluster247

Cluster255

Cluster142

Cluster424

Cluster479Cluster421

Cluster356

Cluster196

Cluster97

Cluster195

Cluster300Cluster335

Cluster361

Cluster206

Cluster259

Cluster480

Cluster3

Cluster347

Cluster419

Cluster107

Cluster443

Cluster87

Cluster240

Cluster24

Cluster280

Cluster61

Cluster225

Cluster343 Cluster332

Cluster270

Cluster296

Cluster488

Cluster16

Cluster116

Cluster302

Cluster26

Cluster487

Cluster171

Cluster143

Cluster249

Cluster77

Cluster464

(b) Corel5k

Figure 7: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Bibtex and Corel5k.
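For readers who wish to reproduce this kind of side-by-side inspection, the minimal sketch below shows how a structure can be learned and drawn in R with the bnlearn package, which provides an MMHC implementation. The discrete data frame labels_df and the second network dag_h2pc (obtained from the authors' separately distributed H2PC code) are illustrative assumptions, not part of the original material.

  library(bnlearn)               # structure learning package (Scutari, 2010)
  dag_mmhc <- mmhc(labels_df)    # Max-Min Hill-Climbing on a discrete data frame
  arcs(dag_mmhc)                 # list the learned directed edges
  graphviz.plot(dag_mmhc)        # draw the structure (requires the Rgraphviz package)
  # Given a second learned structure dag_h2pc, the two graphs can be contrasted
  # edge by edge in the spirit of Figure 7:
  # compare(dag_mmhc, dag_h2pc)  # counts of shared and differing arcs
  # shd(dag_mmhc, dag_h2pc)      # structural Hamming distance between the DAGs

Such a comparison only mirrors the spirit of Figure 7; the plots themselves were produced with the authors' own code on the Bibtex and Corel5k cross-validation splits.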


References

Agresti, A. (2002). Categorical Data Analysis. Wiley, 2nd edition.
Aliferis, C. F., Statnikov, A. R., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010). Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, JMLR, 11, 171–234.
Alvares-Cherman, E., Metz, J., & Monard, M. (2011). Incorporating label dependency into the binary relevance framework for multi-label classification. Expert Systems With Applications, ESWA, 39, 1647–1655.
Armen, A. P., & Tsamardinos, I. (2011). A unified approach to estimation and control of the false discovery rate in Bayesian network skeleton identification. In European Symposium on Artificial Neural Networks, ESANN.
Aussem, A., Rodrigues de Morais, S., & Corbex, M. (2012). Analysis of nasopharyngeal carcinoma risk factors with Bayesian networks. Artificial Intelligence in Medicine, 54.
Aussem, A., Tchernof, A., Rodrigues de Morais, S., & Rome, S. (2010). Analysis of lifestyle and metabolic predictors of visceral obesity with Bayesian networks. BMC Bioinformatics, 11, 487.
Badea, A. (2004). Determining the direction of causal influence in large probabilistic networks: A constraint-based approach. In Proceedings of the Sixteenth European Conference on Artificial Intelligence (pp. 263–267).
Bernard, A., & Hartemink, A. (2005). Informative structure priors: Joint learning of dynamic regulatory networks from multiple types of data. In Proceedings of the Pacific Symposium on Biocomputing (pp. 459–470).
Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Top-down induction of clustering trees. In J. W. Shavlik (Ed.), International Conference on Machine Learning, ICML (pp. 55–63). Morgan Kaufmann.
Borchani, H., Bielza, C., Toro, C., & Larranaga, P. (2013). Predicting human immunodeficiency virus inhibitors using multi-dimensional Bayesian network classifiers. Artificial Intelligence in Medicine, 57, 219–229.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brown, L. E., & Tsamardinos, I. (2008). A strategy for making predictions under manipulation. Journal of Machine Learning Research, JMLR, 3, 35–52.
Buntine, W. (1991). Theory refinement on Bayesian networks. In B. D'Ambrosio, P. Smets, & P. Bonissone (Eds.), Uncertainty in Artificial Intelligence, UAI (pp. 52–60). San Mateo, CA, USA: Morgan Kaufmann Publishers.
Cawley, G. (2008). Causal and non-causal feature selection for ridge regression. JMLR: Workshop and Conference Proceedings, 3, 107–128.
Cheng, J., Greiner, R., Kelly, J., Bell, D. A., & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137, 43–90.
Chickering, D., Heckerman, D., & Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, JMLR, 5, 1287–1330.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, JMLR, 3, 507–554.
Cussens, J., & Bartlett, M. (2013). Advances in Bayesian Network Learning using Integer Programming. In Uncertainty in Artificial Intelligence (pp. 182–191).
Dembczyński, K., Waegeman, W., Cheng, W., & Hullermeier, E. (2012). On label dependence and loss minimization in multi-label classification. Machine Learning, 88, 5–45.
Ellis, B., & Wong, W. H. (2008). Learning causal Bayesian network structures from experimental data. Journal of the American Statistical Association, 103, 778–789.
Friedman, N., Nachman, I., & Peer, D. (1999a). Learning Bayesian network structure from massive datasets: The sparse candidate algorithm. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 206–215).
Friedman, N., Nachman, I., & Pe'er, D. (1999b). Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm. In K. B. Laskey, & H. Prade (Eds.), Uncertainty in Artificial Intelligence, UAI (pp. 21–30). Morgan Kaufmann Publishers.
Gasse, M., Aussem, A., & Elghazel, H. (2012). Comparison of hybrid algorithms for Bayesian network structure learning. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD (pp. 58–73). Springer, volume 7523 of Lecture Notes in Computer Science.
Gu, Q., Li, Z., & Han, J. (2011). Correlated multi-label feature selection. In C. Macdonald, I. Ounis, & I. Ruthven (Eds.), CIKM (pp. 1087–1096). ACM.
Guo, Y., & Gu, S. (2011). Multi-label classification using conditional dependency networks. In International Joint Conference on Artificial Intelligence, IJCAI (pp. 1300–1305). AAAI Press.
Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.
Kocev, D., Vens, C., Struyf, J., & Dzeroski, S. (2007). Ensembles of multi-objective decision trees. In J. N. Kok (Ed.), European Conference on Machine Learning, ECML (pp. 624–631). Springer, volume 4701 of Lecture Notes in Artificial Intelligence.
Koivisto, M., & Sood, K. (2004). Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, JMLR, 5, 549–573.
Kojima, K., Perrier, E., Imoto, S., & Miyano, S. (2010). Optimal search on clustered structural constraint for learning Bayesian network structure. Journal of Machine Learning Research, JMLR, 11, 285–310.
Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In International Conference on Machine Learning, ICML (pp. 284–292).
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18–22.
Luaces, O., Díez, J., Barranquero, J., del Coz, J. J., & Bahamonde, A. (2012). Binary relevance efficacy for multilabel classification. Progress in AI, 1, 303–313.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Dzeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45, 3084–3104.
Maron, O., & Ratan, A. L. (1998). Multiple-Instance Learning for Natural Scene Classification. In International Conference on Machine Learning, ICML (pp. 341–349). Citeseer, volume 7.
McCallum, A. (1999). Multi-label text classification with a mixture model trained by EM. In AAAI Workshop on Text Learning.
Moore, A., & Wong, W. (2003). Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In T. Fawcett, & N. Mishra (Eds.), International Conference on Machine Learning, ICML.
Nagarajan, R., Scutari, M., & Lèbre, S. (2013). Bayesian Networks in R: with Applications in Systems Biology (Use R!). Springer.
Neapolitan, R. E. (2004). Learning Bayesian Networks. Upper Saddle River, NJ: Pearson Prentice Hall.
Ott, S., Imoto, S., & Miyano, S. (2004). Finding optimal models for small gene networks. In Proceedings of the Pacific Symposium on Biocomputing (pp. 557–567).


Pena, J., Nilsson, R., Bjorkegren, J., & Tegner, J. (2007). Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 45, 211–232.
Pena, J. M., Bjorkegren, J., & Tegner, J. (2005). Growing Bayesian network models of gene networks from seed genes. Bioinformatics, 40, 224–229.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann.
Pena, J. M. (2008). Learning Gaussian graphical models of gene networks with false discovery rate control. In European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics 6 (pp. 165–176).
Pena, J. M. (2012). Finding consensus Bayesian network structures. Journal of Artificial Intelligence Research, 42, 661–687.
Perrier, E., Imoto, S., & Miyano, S. (2008). Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research, JMLR, 9, 2251–2286.
Peer, D., Regev, A., Elidan, G., & Friedman, N. (2001). Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17, 215–224.
Prestat, E., Rodrigues de Morais, S., Vendrell, J., Thollet, A., Gautier, C., Cohen, P., & Aussem, A. (2013). Learning the local Bayesian network structure around the ZNF217 oncogene in breast tumours. Computers in Biology and Medicine, 4, 334–341.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org.
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD (pp. 254–269). Springer, volume 5782 of Lecture Notes in Computer Science.
Rodrigues de Morais, S., & Aussem, A. (2010a). An efficient learning algorithm for local Bayesian network structure discovery. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD (pp. 164–169).
Rodrigues de Morais, S., & Aussem, A. (2010b). A novel Markov boundary based feature subset selection algorithm. Neurocomputing, 73, 578–584.
Roth, V., & Fischer, B. (2007). Improved functional prediction of proteins by learning kernel combinations in multilabel settings. BMC Bioinformatics, 8, S12+.

Schwarz, G. E. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

Scutari, M. (2010). Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35, 1–22.
Scutari, M. (2011). Measures of Variability for Graphical Models. Ph.D. thesis, School in Statistical Sciences, University of Padova.
Scutari, M., & Brogini, A. (2012). Bayesian network structure learning with permutation tests. Communications in Statistics Theory and Methods, 41, 3233–3243.
Scutari, M., & Nagarajan, R. (2013). Identifying significant edges in graphical models of molecular networks. Artificial Intelligence in Medicine, 57, 207–217.
Silander, T., & Myllymaki, P. (2006). A Simple Approach for Finding the Globally Optimal Bayesian Network Structure. In Uncertainty in Artificial Intelligence, UAI (pp. 445–452).
Snoek, C., Worring, M., Gemert, J. V., Geusebroek, J., & Smeulders, A. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In ACM International Conference on Multimedia (pp. 421–430). ACM Press.
Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search (2nd ed.). The MIT Press.
Spolaor, N., Cherman, E. A., Monard, M. C., & Lee, H. D. (2013). A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science, 292, 135–151.
Studeny, M., & Haws, D. (2014). Learning Bayesian network structure: Towards the essential graph by integer linear programming tools. International Journal of Approximate Reasoning, 55, 1043–1071.
Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multi-label classification of music into emotions. In ISMIR (pp. 325–330).
Tsamardinos, I., Aliferis, C., & Statnikov, A. (2003). Algorithms for large scale Markov blanket discovery. In Florida Artificial Intelligence Research Society Conference FLAIRS'03 (pp. 376–381).
Tsamardinos, I., & Borboudakis, G. (2010). Permutation testing improves Bayesian network learning. In European Conference on Machine Learning and Knowledge Discovery in Databases, ECML-PKDD (pp. 322–337).
Tsamardinos, I., Brown, L., & Aliferis, C. (2006). The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65, 31–78.
Tsamardinos, I., & Brown, L. E. (2008). Bounding the false discovery rate in local Bayesian network learning. In AAAI Conference on Artificial Intelligence (pp. 1100–1105).
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010a). Mining Multi-label Data. Transformation, 135, 1–20.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010b). Random k-labelsets for Multi-Label Classification. IEEE Transactions on Knowledge and Data Engineering, TKDE, 23, 1–12.
Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. Proceedings of the 18th European Conference on Machine Learning, 4701, 406–417.
Villanueva, E., & Maciel, C. (2012). Optimized algorithm for learning Bayesian network superstructures. In International Conference on Pattern Recognition Applications and Methods, ICPRAM.
Villanueva, E., & Maciel, C. D. (2014). Efficient methods for learning Bayesian network super-structures. Neurocomputing, (pp. 3–12).
Zhang, M. L., & Zhang, K. (2010). Multi-label learning by exploiting label dependency. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD (p. 999). ACM Press, volume 16 of Knowledge Discovery in Databases, KDD.
Zhang, M.-L., & Zhou, Z.-H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18, 1338–1351.
