arXiv:1506.05692v1 [stat.ML] 18 Jun 2015
A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
Maxime Gasse, Alex Aussem∗, Haytham Elghazel
Université de Lyon, CNRS, Université Lyon 1, LIRIS, UMR 5205, F-69622, France
Abstract
We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusion that local structure learning with H2PC in the form of local neighborhood induction is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available.
Keywords: Bayesian networks, Multi-label learning, Markov boundary, Feature subset selection.
1. Introduction
A Bayesian network (BN) is a probabilistic model formed by a structure and parameters. The structure of a BN is a directed acyclic graph (DAG), whilst its parameters are conditional probability distributions associated with the variables in the model. The problem of finding the DAG that encodes the conditional independencies present in the data has attracted a great deal of interest in recent years (Rodrigues de Morais & Aussem, 2010a; Scutari, 2010; Scutari & Brogini, 2012; Kojima et al., 2010; Perrier et al., 2008; Villanueva & Maciel, 2012; Pena, 2012; Gasse et al., 2012). The inferred DAG is very useful for many applications,
∗Corresponding author. Email address: [email protected] (Alex Aussem)
including feature selection (Aliferis et al., 2010; Pena et al., 2007; Rodrigues de Morais & Aussem, 2010b), inference of causal relationships from observational data (Ellis & Wong, 2008; Aliferis et al., 2010; Aussem et al., 2012, 2010; Prestat et al., 2013; Cawley, 2008; Brown & Tsamardinos, 2008) and, more recently, multi-label learning (Dembczyński et al., 2012; Zhang & Zhang, 2010; Guo & Gu, 2011).
Ideally the DAG should coincide with the dependence structure of the global distribution, or it should at least identify a distribution as close as possible to the correct one in the probability space. This step, called structure learning, is similar in approaches and terminology to model selection procedures for classical statistical models. Basically, constraint-based (CB) learning methods systematically check the data for conditional independence relationships and use them as constraints to construct a partially oriented graph representative of a BN equivalence class, whilst search-and-score (SS) methods make use of a goodness-of-fit score function for evaluating graphical structures with regard to the data set. Hybrid methods attempt to get the best of both worlds: they learn a skeleton with a CB approach and constrain the DAGs considered during the SS phase.
In this study, we present a novel hybrid algorithm for Bayesian network structure learning, called H2PC1. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on a divide-and-conquer constraint-based subroutine, called HPC, to learn the local structure around a target variable. HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. As this may come at the expense of an increased number of false positives, we control the expected proportion of false discoveries (i.e., false positive nodes) among all the discoveries made in PC_T. We use a modification of the Incremental Association Markov Boundary algorithm (IAMB), initially developed by Tsamardinos et al. (Tsamardinos et al., 2003) and later modified by Jose Pena (Pena, 2008) to control the FDR of edges when learning Bayesian network models. HPC scales to thousands of variables and can deal with many fewer samples (n < q). To illustrate its performance by means of empirical evidence, we conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for BN structure learning (Tsamardinos et al., 2006), using well-known BN benchmarks with various data sizes, to assess the goodness of fit to new data as well as the quality of the network structure with respect to the true dependence structure of the data.
We then address a real application of H2PC where the true dependence structure is unknown. More specifically, we investigate H2PC's ability to encode the joint distribution of the label set conditioned on the input features in the multi-label classification (MLC) problem. Many challenging applications, such as photo and video annotation and web page categorization, can benefit from being formulated as MLC tasks with a large number of categories (Dembczyński et al., 2012; Read et al., 2009; Madjarov et al., 2012; Kocev et al., 2007; Tsoumakas et al., 2010b). Recent research in
1 A first version of H2PC without FDR control was discussed in a paper that appeared in the Proceedings of ECML-PKDD, pages 58-73, 2012.
MLC focuses on the exploitation of the label conditional dependency in order to better predict the label combination for each example. We show that local BN structure discovery methods offer an elegant and powerful approach to solve this problem. We establish two theorems (Theorems 6 and 7) linking the concepts of marginal Markov boundaries, joint Markov boundaries and so-called label powersets under the faithfulness assumption. These theorems offer a simple guideline to characterize graphically: (i) the minimal label powerset decomposition, i.e., into minimal subsets Y_LP ⊆ Y such that Y_LP ⊥⊥ Y \ Y_LP | X, and (ii) the minimal subset of features, w.r.t. an information-theoretic criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of the multi-label classification. To solve the MLC problem with BNs, the DAG obtained from the data plays a pivotal role. So in this second series of experiments, we assess the comparative ability of H2PC and MMHC to encode the label dependency structure by means of an indirect goodness-of-fit indicator, namely the 0/1 loss function, which makes sense in the MLC context.
The rest of the paper is organized as follows: In Section 2, we review the theory of BNs and discuss the main BN structure learning strategies. We then present the H2PC algorithm in detail in Section 3. Section 4 evaluates our proposed method and shows results for several tasks involving artificial data sampled from known BNs. Then we report, in Section 5, on our experiments on real-world data sets in a multi-label learning context so as to provide empirical support for the proposed methodology. The main theoretical results appear formally as two theorems (Theorems 8 and 9) in Section 5. Their proofs are established in the Appendix. Finally, Section 6 raises several issues for future work and we conclude in Section 7 with a summary of our contribution.
2. Preliminaries
We define next some key concepts used throughout the paper and state some results that will support our analysis. In this paper, upper-case letters in italics denote random variables (e.g., X, Y) and lower-case letters in italics denote their values (e.g., x, y). Upper-case bold letters denote random variable sets (e.g., X, Y, Z) and lower-case bold letters denote their values (e.g., x, y). We denote by X ⊥⊥ Y | Z the conditional independence between X and Y given the set of variables Z. To keep the notation uncluttered, we use p(y | x) to denote p(Y = y | X = x).
2.1. Bayesian networks

Formally, a BN is a tuple < G, P >, where G = < U, E > is a directed acyclic graph (DAG) with nodes representing the random variables U and P a joint probability distribution on U. In addition, G and P must satisfy the Markov condition: every variable, X ∈ U, is independent of any subset of its non-descendant variables conditioned on the set of its parents, denoted by Pa_i^G. From the Markov condition, it is easy to prove (Neapolitan, 2004) that the joint probability distribution P on the variables in U can be factored as follows:

P(U) = P(X_1, . . . , X_n) = ∏_{i=1}^{n} P(X_i | Pa_i^G)    (1)
Equation 1 allows a parsimonious decomposition of the joint distribution P. It enables us to reduce the problem of determining a huge number of probability values to that of determining relatively few.
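As a concrete illustration of Equation 1, the following sketch (in Python; the three-node network and its probability tables are hypothetical examples of ours, not taken from the paper) evaluates a joint probability through the factored form:

```python
# Toy BN A -> C <- B: the joint factors as P(A) P(B) P(C | A, B).
p_a = {0: 0.6, 1: 0.4}                      # P(A)
p_b = {0: 0.7, 1: 0.3}                      # P(B)
p_c = {(0, 0): {0: 0.9, 1: 0.1},            # P(C | A, B)
       (0, 1): {0: 0.5, 1: 0.5},
       (1, 0): {0: 0.4, 1: 0.6},
       (1, 1): {0: 0.2, 1: 0.8}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the factorization of Equation 1."""
    return p_a[a] * p_b[b] * p_c[(a, b)][c]

# The factored form stores 2 + 2 + 8 = 12 numbers; the saving over the
# full joint table grows exponentially with the number of variables.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Since each local distribution sums to one, the factored joint sums to one as well.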
A BN structure G entails a set of conditional independence assumptions. They can all be identified by the d-separation criterion (Pearl, 1988). We use X ⊥_G Y | Z to denote the assertion that X is d-separated from Y given Z in G. Formally, X ⊥_G Y | Z is true when for every undirected path in G between X and Y, there exists a node W in the path such that either (1) W does not have two parents in the path and W ∈ Z, or (2) W has two parents in the path and neither W nor its descendants is in Z. If < G, P > is a BN, X ⊥_P Y | Z if X ⊥_G Y | Z. The converse does not necessarily hold. We say that < G, P > satisfies the faithfulness condition if the d-separations in G identify all and only the conditional independencies in P, i.e., X ⊥_P Y | Z iff X ⊥_G Y | Z.
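The d-separation criterion can be checked mechanically. The sketch below (Python; the toy DAG is our own example) uses the standard equivalent formulation: X ⊥_G Y | Z iff X and Y are disconnected in the moral graph of the subgraph induced by the ancestors of {X, Y} ∪ Z:

```python
# d-separation via moralized ancestral graph (a sketch; every node must
# appear as a key of `dag`, which maps a node to its set of children).
from collections import deque

def ancestors(dag, nodes):
    """All nodes with a directed path into `nodes`, plus `nodes` itself."""
    parents = {v: set() for v in dag}
    for u, vs in dag.items():
        for v in vs:
            parents[v].add(u)
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, x, y, z):
    """True iff x is d-separated from y given the set z in the DAG."""
    z = set(z)
    keep = ancestors(dag, {x, y} | z)
    # Moralize: keep retained arcs as undirected edges, marry co-parents.
    adj = {v: set() for v in keep}
    for u in keep:
        for v in dag[u]:
            if v in keep:
                adj[u].add(v); adj[v].add(u)
    for v in keep:
        pars = [u for u in keep if v in dag[u]]
        for i, a in enumerate(pars):
            for b in pars[i + 1:]:
                adj[a].add(b); adj[b].add(a)
    # BFS from x that never enters the conditioning set z.
    seen, queue = {x}, deque([x])
    while queue:
        for w in adj[queue.popleft()]:
            if w not in seen and w not in z:
                if w == y:
                    return False
                seen.add(w)
                queue.append(w)
    return True

# v-structure A -> C <- B: A ⊥ B given nothing, but not given {C}.
dag = {"A": {"C"}, "B": {"C"}, "C": set()}
```

The v-structure example reproduces the textbook behavior: conditioning on the common child C opens the path between its parents.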
A Markov blanket M_T of T is any set of variables such that T is conditionally independent of all the remaining variables given M_T. By extension, a Markov blanket of T in V guarantees that M_T ⊆ V, and that T is conditionally independent of the remaining variables in V, given M_T. A Markov boundary, MB_T, of T is any Markov blanket such that none of its proper subsets is a Markov blanket of T.
We denote by PC_T^G the set of parents and children of T in G, and by SP_T^G the set of spouses of T in G. The spouses of T are the variables that have common children with T. These sets are unique for all G such that < G, P > satisfies the faithfulness condition, and so we will drop the superscript G. We denote by dSep(X) the set that d-separates X from the (implicit) target T.
Theorem 1. Suppose < G, P > satisfies the faithfulness condition. Then X and Y are not adjacent in G iff ∃Z ⊆ U \ {X, Y} such that X ⊥⊥ Y | Z. Moreover, MB_X = PC_X ∪ SP_X.
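Under faithfulness, Theorem 1 lets us read the Markov boundary directly off the graph. A minimal sketch (Python; the DAG A → T → C ← B is a made-up example) computing PC_T, SP_T and MB_T = PC_T ∪ SP_T:

```python
# Graphical Markov boundary: parents, children, and spouses of a target.
def pc_sp_mb(dag, t):
    """dag maps each node to its set of children; returns (PC, SP, MB)."""
    parents = {u for u in dag if t in dag[u]}
    children = set(dag[t])
    pc = parents | children
    # Spouses: other parents of t's children.
    sp = {u for c in children for u in dag if c in dag[u] and u != t}
    return pc, sp, pc | sp

dag = {"A": {"T"}, "T": {"C"}, "B": {"C"}, "C": set()}
pc, sp, mb = pc_sp_mb(dag, "T")
```

Here B is neither a parent nor a child of T, yet it belongs to MB_T because it shares the child C with T.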
A proof can be found, for instance, in (Neapolitan, 2004).
Two graphs are said equivalent iff they encode the same set of conditional independencies via the d-separation criterion. The equivalence class of a DAG G is the set of DAGs that are equivalent to G. The next result, due to Pearl (1988), establishes that equivalent graphs have the same undirected graph but might disagree on the direction of some of the arcs.
Theorem 2. Two DAGs are equivalent iff they have the same underlying undirected graph and the same set of v-structures (i.e., converging edges into the same node, such as X → Y ← Z).
Moreover, an equivalence class of network structures can be uniquely represented by a partially directed DAG (PDAG), also called a DAG pattern. The DAG pattern is defined as the graph that has the same links as the DAGs in the equivalence class and has oriented all and only the edges common to the DAGs in the equivalence class. A structure learning algorithm from data is said to be correct (or sound) if it returns the correct DAG pattern (or a DAG in the correct equivalence class) under the assumptions that the independence tests are reliable and that the learning database is a sample from a distribution P faithful to a DAG G. The (ideal) assumption that the independence tests are reliable means that they decide (in)dependence iff the (in)dependence holds in P.
2.2. Conditional independence properties
The following three theorems, borrowed from Pena et al. (2007), are proven in Pearl (1988):
Theorem 3. Let X, Y, Z and W denote four mutually disjoint subsets of U. Any probability distribution p satisfies the following four properties: Symmetry, X ⊥⊥ Y | Z ⇒ Y ⊥⊥ X | Z; Decomposition, X ⊥⊥ (Y ∪ W) | Z ⇒ X ⊥⊥ Y | Z; Weak Union, X ⊥⊥ (Y ∪ W) | Z ⇒ X ⊥⊥ Y | (Z ∪ W); and Contraction, X ⊥⊥ Y | (Z ∪ W) ∧ X ⊥⊥ W | Z ⇒ X ⊥⊥ (Y ∪ W) | Z.
Theorem 4. If p is strictly positive, then p satisfies the previous four properties plus the Intersection property: X ⊥⊥ Y | (Z ∪ W) ∧ X ⊥⊥ W | (Z ∪ Y) ⇒ X ⊥⊥ (Y ∪ W) | Z. Additionally, each X ∈ U has a unique Markov boundary, MB_X.
Theorem 5. If p is faithful to a DAG G, then p satisfies the previous five properties plus the Composition property, X ⊥⊥ Y | Z ∧ X ⊥⊥ W | Z ⇒ X ⊥⊥ (Y ∪ W) | Z, and the local Markov property, X ⊥⊥ (ND_X \ Pa_X) | Pa_X for each X ∈ U, where ND_X denotes the non-descendants of X in G.
2.3. Structure learning strategies
The number of DAGs, G, is super-exponential in the number of random variables in the domain, and the problem of learning the most probable a posteriori BN from data is worst-case NP-hard (Chickering et al., 2004). One needs to resort to heuristic methods in order to be able to solve very large problems effectively.
Both CB and SS heuristic approaches have advantages and disadvantages. CB approaches are relatively quick, deterministic, and have a well-defined stopping criterion; however, they rely on an arbitrary significance level to test for independence, and they can be unstable in the sense that an error early on in the search can have a cascading effect that causes many errors to be present in the final graph. SS approaches have the advantage of being able to flexibly incorporate users' background knowledge in the form of prior probabilities over the structures and are also capable of dealing with incomplete records in the database (e.g., via the EM technique). Although SS methods are favored in practice when dealing with small dimensional data sets, they are slow to converge and the computational complexity often prevents us from finding optimal BN structures (Perrier et al., 2008; Kojima et al., 2010). With currently available exact algorithms (Koivisto & Sood, 2004; Silander & Myllymaki, 2006; Cussens & Bartlett, 2013; Studeny & Haws, 2014) and a decomposable score like BDeu, the computational complexity remains exponential, and therefore such algorithms are intractable for BNs with more than around 30 vertices on current workstations. For larger sets of variables the computational burden becomes prohibitive and restrictions about the structure have to be imposed, such as a limit on the size of the parent sets. With this in mind, the ability to restrict the search locally around the target variable is a key advantage of CB methods over SS methods. They are able to construct a local graph around the target node without having to construct the whole BN first, hence their scalability (Pena et al., 2007; Rodrigues de Morais & Aussem, 2010b,a; Tsamardinos et al., 2006; Pena, 2008).
With a view to balancing the computation cost with the desired accuracy of the estimates, several hybrid methods have been proposed recently. Tsamardinos et al. proposed in (Tsamardinos et al., 2006) the Max-Min Hill-Climbing (MMHC) algorithm and conducted one of the most extensive empirical comparisons performed in recent years, showing that MMHC was the fastest and the most accurate method in terms of structural error based on the structural Hamming distance. More specifically, MMHC outperformed, both in terms of time efficiency and quality of reconstruction, the PC (Spirtes et al., 2000), the Sparse Candidate (Friedman et al., 1999b), the Three Phase Dependency Analysis (Cheng et al., 2002), the Optimal Reinsertion (Moore & Wong, 2003), the Greedy Equivalence Search (Chickering, 2002), and the Greedy Hill-Climbing Search algorithms on a variety of networks, sample sizes, and parameter values. Although MMHC is rather heuristic by nature (it returns a local optimum of the score function), it is currently considered the most powerful state-of-the-art algorithm for BN structure learning capable of dealing with thousands of nodes in reasonable time.
In order to enhance its performance on small dimensional data sets, Perrier et al. proposed in (Perrier et al., 2008) a hybrid algorithm that can learn an optimal BN (i.e., it converges to the true model in the sample limit) when an undirected graph is given as a structural constraint. They defined this undirected graph as a super-structure (i.e., every DAG considered in the SS phase is compelled to be a subgraph of the super-structure). This algorithm can learn optimal BNs containing up to 50 vertices when the average degree of the super-structure is around two, that is, when a sparse structural constraint is assumed. To extend its feasibility to BNs with a few hundred vertices and an average degree up to four, Kojima et al. proposed in (Kojima et al., 2010) to divide the super-structure into several clusters and perform an optimal search on each of them in order to scale up to larger networks. Despite interesting improvements in terms of score and structural Hamming distance on several benchmark BNs, they report running times about 10^3 times longer than MMHC on average, which is still prohibitive.
Therefore, there is a great deal of interest in hybrid methods capable of improving the structural accuracy of both CB and SS methods on graphs containing up to thousands of vertices. However, they make the strong assumption that the skeleton (also called the super-structure) contains at least the edges of the true network and as few extra edges as possible. While controlling the false discovery rate (i.e., false extra edges) in BN learning has attracted some attention recently (Armen & Tsamardinos, 2011; Pena, 2008; Tsamardinos & Brown, 2008), to our knowledge there is no work on actively controlling the rate of false-negative errors (i.e., false missing edges).
2.4. Constraint-based structure learning

Before introducing the H2PC algorithm, we discuss the general idea behind CB methods. The induction of local or global BN structures is handled by CB methods
through the identification of local neighborhoods (i.e., PC_X), hence their scalability to very high dimensional data sets. CB methods systematically check the data for conditional independence relationships in order to infer a target's neighborhood. Typically, the algorithms run either a G² or a χ² independence test when the data set is discrete, and a Fisher's Z test when it is continuous, in order to decide on dependence or independence, that is, upon the rejection or acceptance of the null hypothesis of conditional independence. Since we are limiting ourselves to discrete data, both the global and the local distributions are assumed to be multinomial, and the latter are represented as conditional probability tables. Conditional independence tests and network scores for discrete data are functions of these conditional probability tables through the observed frequencies {n_ijk ; i = 1, . . . , R; j = 1, . . . , C; k = 1, . . . , L} for the random variables X and Y and all the configurations of the levels of the conditioning variables Z. We use n_i+k as shorthand for the marginal ∑_j n_ijk, and similarly for n_+jk, n_++k and n_+++ = n. We use a classic conditional independence test based on the mutual information. The mutual information is an information-theoretic distance measure defined as

MI(X, Y | Z) = ∑_{i=1}^{R} ∑_{j=1}^{C} ∑_{k=1}^{L} (n_ijk / n) log( (n_ijk n_++k) / (n_i+k n_+jk) )
It is proportional to the log-likelihood ratio test G² (they differ by a 2n factor, where n is the sample size). The asymptotic null distribution is χ² with (R − 1)(C − 1)L degrees of freedom. For a detailed analysis of their properties we refer the reader to (Agresti, 2002). The main limitation of this test is the rate of convergence to its limiting distribution, which is particularly problematic when dealing with small samples and sparse contingency tables. The decision of accepting or rejecting the null hypothesis depends implicitly upon the degrees of freedom, which increase exponentially with the number of variables in the conditioning set. Several heuristic solutions have emerged in the literature (Spirtes et al., 2000; Rodrigues de Morais & Aussem, 2010a; Tsamardinos et al., 2006; Tsamardinos & Borboudakis, 2010) to overcome some shortcomings of the asymptotic tests. In this study we use the two following heuristics that are used in MMHC. First, we do not perform the test MI(X, Y | Z) and assume independence if there are not enough samples to achieve large enough power. We require that the average sample per count is above a user-defined parameter, equal to 5, as in
(Tsamardinos et al., 2006). This heuristic is called the power rule. Second, we consider as a structural zero either the case n_+jk = 0 or n_i+k = 0. For example, if n_+jk = 0, we consider y_j as a structurally forbidden value for Y when Z = z_k, and we reduce C by 1 (as if we had one column less in the contingency table where Z = z_k). This is known as the degrees of freedom adjustment heuristic.
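A compact sketch of the test and the two heuristics just described (Python; the contingency tables are fabricated for illustration, and the function names are ours). `g2_statistic` computes G² = 2n·MI(X, Y | Z) from the observed counts n[i][j][k]; `power_rule` implements the average-sample-per-cell threshold; `adjusted_df` drops structurally zero rows and columns per stratum of Z:

```python
from math import log

def g2_statistic(n):
    """G² = 2n·MI(X, Y | Z) from a 3-way table n[i][j][k] (R x C x L)."""
    R, C, L = len(n), len(n[0]), len(n[0][0])
    n_ipk = [[sum(n[i][j][k] for j in range(C)) for k in range(L)] for i in range(R)]
    n_pjk = [[sum(n[i][j][k] for i in range(R)) for k in range(L)] for j in range(C)]
    n_ppk = [sum(row[k] for row in n_ipk) for k in range(L)]
    g2 = 0.0
    for i in range(R):
        for j in range(C):
            for k in range(L):
                if n[i][j][k] > 0:      # 0·log(0) terms contribute nothing
                    g2 += 2.0 * n[i][j][k] * log(
                        n[i][j][k] * n_ppk[k] / (n_ipk[i][k] * n_pjk[j][k]))
    return g2, (R - 1) * (C - 1) * L    # statistic and nominal df

def power_rule(n_samples, R, C, L, h_ps=5):
    """Skip the test (assume independence) if avg sample per cell < h_ps."""
    return n_samples / (R * C * L) >= h_ps

def adjusted_df(n):
    """df after removing structurally zero rows/columns in each Z stratum."""
    R, C, L = len(n), len(n[0]), len(n[0][0])
    df = 0
    for k in range(L):
        rows = sum(1 for i in range(R) if any(n[i][j][k] for j in range(C)))
        cols = sum(1 for j in range(C) if any(n[i][j][k] for i in range(R)))
        df += max(rows - 1, 0) * max(cols - 1, 0)
    return df

# A perfectly independent 2x2x2 table gives G² = 0 with nominal df = 2.
flat = [[[10, 10], [10, 10]], [[10, 10], [10, 10]]]
g2, df = g2_statistic(flat)
```

In a real test one would compare G² against the χ² quantile at the chosen significance level, using the adjusted rather than nominal degrees of freedom.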
3. The H2PC algorithm
In this section, we present our hybrid algorithm for Bayesian network structure learning, called Hybrid HPC (H2PC). It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to filter and orient the edges. It is based on a CB subroutine called HPC to learn the parents and children of a single variable. So, we shall discuss HPC first and then move to H2PC.
3.1. Parents and Children Discovery
HPC (Algorithm 1) can be viewed as an ensemble method for combining many weak PC learners in an attempt to produce a stronger PC learner. HPC is based on three subroutines: Data-Efficient Parents and Children Superset (DE-PCS), Data-Efficient Spouses Superset (DE-SPS), and Incremental Association Parents and Children with false discovery rate control (FDR-IAPC), a weak PC learner based on FDR-IAMB (Pena, 2008) that requires little computation. HPC receives a target node T, a data set D and a set of variables U as input and returns an estimation of PC_T. It is hybrid in that it combines the benefits of incremental and divide-and-conquer methods. The procedure starts by extracting a superset PCS_T of PC_T (line 1) and a superset SPS_T of SP_T (line 2) with a severe restriction on the maximum conditioning size (|Z| ≤ 2) in order to significantly increase the reliability of the tests. A first candidate PC set is then obtained by running the weak PC learner called FDR-IAPC on PCS_T ∪ SPS_T (line 3). The key idea is the decentralized search at lines 4-8 that includes, in the candidate PC set, all variables in the superset PCS_T \ PC_T that have T in their vicinity. So, HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. Note that, in theory, X is in the output of FDR-IAPC(Y) if and only if Y is in the output of FDR-IAPC(X). However, in practice, this may not always be true, particularly when working in high-dimensional domains (Pena, 2008). By loosening the criteria by which two nodes are
said adjacent, the effective restrictions on the size of the neighborhood are now far less severe. The decentralized search has a significant impact on the accuracy of HPC, as we shall see in the experiments. We proved in (Rodrigues de Morais & Aussem, 2010a) that the original HPC(T) is consistent, i.e., its output converges in probability to PC_T, if the hypothesis tests are consistent. The proof also applies to the modified version presented here.
We now discuss the subroutines in more detail. FDR-IAPC (Algorithm 2) is a fast incremental method that receives a data set D and a target node T as its input and promptly returns a rough estimation of PC_T, hence the term "weak" PC learner. In this study, we use FDR-IAPC because it aims at controlling the expected proportion of false discoveries (i.e., false positive nodes in PC_T) among all the discoveries made. FDR-IAPC is a straightforward extension of the algorithm IAMBFDR developed by Jose Pena in (Pena, 2008), which is itself a modification of the incremental association Markov boundary algorithm (IAMB) (Tsamardinos et al., 2003), to control the expected proportion of false discoveries (i.e., false positive nodes) in the estimated Markov boundary. FDR-IAPC simply removes, at lines 3-6, the spouses SP_T from the estimated Markov boundary MB_T output by IAMBFDR at line 1, and returns PC_T assuming the faithfulness condition.
The subroutines DE-PCS (Algorithm 3) and DE-SPS (Algorithm 4) search a superset of PC_T and SP_T respectively, with a severe restriction on the maximum conditioning size (|Z| ≤ 1 in DE-PCS and |Z| ≤ 2 in DE-SPS) in order to significantly increase the reliability of the tests. The variable filtering has two advantages: (i) it allows HPC to scale to hundreds of thousands of variables by restricting the search to a subset of relevant variables, and (ii) it eliminates many true or approximate deterministic relationships that produce many false negative errors in the output of the algorithm, as explained in (Rodrigues de Morais & Aussem, 2010b,a). DE-SPS works in two steps. First, a growing phase (lines 4-8) adds the variables that are d-separated from the target but still remain associated with the target when conditioned on another variable from PCS_T. The shrinking phase (lines 9-16) discards irrelevant variables that are ancestors or descendants of a target's spouse. Pruning such irrelevant variables speeds up HPC.
3.2. Hybrid HPC (H2PC)

In this section, we discuss the SS phase. The following discussion draws strongly on (Tsamardinos et al., 2006), as the SS phase in Hybrid HPC and MMHC are exactly the same. The idea of constraining the search
Algorithm 1 HPC
Require: T: target; D: data set; U: set of variables
Ensure: PC_T: parents and children of T

1: [PCS_T, dSep] ← DE-PCS(T, D, U)
2: SPS_T ← DE-SPS(T, D, U, PCS_T, dSep)
3: PC_T ← FDR-IAPC(T, D, (T ∪ PCS_T ∪ SPS_T))
4: for all X ∈ PCS_T \ PC_T do
5:   if T ∈ FDR-IAPC(X, D, (T ∪ PCS_T ∪ SPS_T)) then
6:     PC_T ← PC_T ∪ X
7:   end if
8: end for
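The decentralized correction of lines 4-8 of Algorithm 1 can be sketched independently of the underlying statistical tests (Python; `weak_pc` stands in for FDR-IAPC and is supplied by the caller, a hypothetical interface of ours rather than the paper's R code):

```python
def hpc_correction(t, pcs, sps, weak_pc):
    """Lines 3-8 of Algorithm 1: run the weak learner once, then add back
    every filtered variable X that sees t in its own neighborhood."""
    scope = {t} | pcs | sps
    pc = set(weak_pc(t, scope))
    for x in pcs - pc:           # candidates dropped by the weak learner
        if t in weak_pc(x, scope):
            pc.add(x)            # asymmetric evidence rescues x
    return pc

# Toy weak learner (hypothetical) with asymmetric neighborhoods:
# B "sees" T even though T's own call misses B.
nbrs = {"T": {"A"}, "A": {"T"}, "B": {"T"}, "C": set()}
pc_t = hpc_correction("T", {"A", "B", "C"}, set(), lambda v, s: nbrs[v] & s)
```

The toy run illustrates the false-negative compensation: B is absent from the weak learner's output for T but is recovered because T appears in the output for B.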
Algorithm 2 FDR-IAPC
Require: T: target; D: data set; U: set of variables
Ensure: PC_T: parents and children of T

* Learn the Markov boundary of T
1: MB_T ← IAMBFDR(T, D, U)
* Remove the spouses of T from MB_T
2: PC_T ← MB_T
3: for all X ∈ MB_T do
4:   if ∃Z ⊆ (MB_T \ X) such that T ⊥ X | Z then
5:     PC_T ← PC_T \ X
6:   end if
7: end for
Algorithm 3 DE-PCS
Require: T: target; D: data set; U: set of variables
Ensure: PCS_T: parents and children superset of T; dSep: d-separating sets

Phase I: remove X if T ⊥ X
1: PCS_T ← U \ T
2: for all X ∈ PCS_T do
3:   if (T ⊥ X) then
4:     PCS_T ← PCS_T \ X
5:     dSep(X) ← ∅
6:   end if
7: end for
Phase II: remove X if T ⊥ X | Y
8: for all X ∈ PCS_T do
9:   for all Y ∈ PCS_T \ X do
10:     if (T ⊥ X | Y) then
11:       PCS_T ← PCS_T \ X
12:       dSep(X) ← Y
13:       break loop FOR
14:     end if
15:   end for
16: end for
Algorithm 4 DE-SPS
Require: T: target; D: data set; U: set of variables; PCS_T: parents and children superset of T; dSep: d-separating sets
Ensure: SPS_T: superset of the spouses of T

1: SPS_T ← ∅
2: for all X ∈ PCS_T do
3:   SPS_T^X ← ∅
4:   for all Y ∈ U \ {T ∪ PCS_T} do
5:     if (T ⊥̸ Y | dSep(Y) ∪ X) then
6:       SPS_T^X ← SPS_T^X ∪ Y
7:     end if
8:   end for
9:   for all Y ∈ SPS_T^X do
10:     for all Z ∈ SPS_T^X \ Y do
11:       if (T ⊥ Y | X ∪ Z) then
12:         SPS_T^X ← SPS_T^X \ Y
13:         break loop FOR
14:       end if
15:     end for
16:   end for
17:   SPS_T ← SPS_T ∪ SPS_T^X
18: end for
Algorithm 5 Hybrid HPC
Require: D: data set; U: set of variables
Ensure: a DAG G on the variables U

1: for all pairs of nodes X, Y ∈ U do
2:   add X to PC_Y and add Y to PC_X if X ∈ HPC(Y) and Y ∈ HPC(X)
3: end for
4: starting from an empty graph, perform greedy hill-climbing with operators add-edge, delete-edge, reverse-edge; only try operator add-edge X → Y if Y ∈ PC_X
to improve time-efficiency first appeared in the Sparse Candidate algorithm (Friedman et al., 1999b). It results in efficiency improvements over the (unconstrained) greedy search. All recent hybrid algorithms build on this idea, but employ a sound algorithm for identifying the candidate parent sets. Hybrid HPC first identifies the parents and children set of each variable, then performs a greedy hill-climbing search in the space of BNs. The search begins with an empty graph. The edge addition, deletion, or direction reversal that leads to the largest increase in score (the BDeu score with uniform prior was used) is taken and the search continues recursively in a similar fashion. The important difference from standard greedy search is that the search is constrained to only consider adding an edge if it was discovered by HPC in the first phase. We extend the greedy search with a TABU list (Friedman et al., 1999b). The list keeps the last 100 structures explored. Instead of applying the best local change, the best local change that results in a structure not on the list is performed, in an attempt to escape local maxima. When 15 changes occur without an increase in the maximum score ever encountered during the search, the algorithm terminates. The overall best scoring structure is then returned. Clearly, the more false positives the heuristic allows into the candidate PC set, the more computational burden is imposed in the SS phase.
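The constrained TABU search described above can be sketched as follows (Python; `score` stands in for the BDeu score, the toy problem and all names are ours, and acyclicity checking is reduced to a plain DFS cycle test):

```python
# Simplified sketch of constrained greedy hill-climbing with a TABU list.
def is_acyclic(edges, nodes):
    """DFS cycle test on a directed graph given as a set of (u, v) arcs."""
    children = {v: set() for v in nodes}
    for u, v in edges:
        children[u].add(v)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in nodes}
    def dfs(u):
        color[u] = GREY
        for w in children[u]:
            if color[w] == GREY or (color[w] == WHITE and not dfs(w)):
                return False
        color[u] = BLACK
        return True
    return all(color[v] != WHITE or dfs(v) for v in nodes)

def hill_climb(nodes, pc, score, tabu_len=100, patience=15):
    """pc[u]: endpoints allowed for add-edge u -> v (the CB-phase filter)."""
    g = best_g = frozenset()
    tabu, stall = [g], 0
    while stall < patience:
        moves = []
        for u in nodes:
            for v in pc[u]:                        # constrained add-edge
                if (u, v) not in g and is_acyclic(g | {(u, v)}, nodes):
                    moves.append(g | {(u, v)})
        for e in g:
            moves.append(g - {e})                  # delete-edge
            rev = (g - {e}) | {(e[1], e[0])}
            if is_acyclic(rev, nodes):
                moves.append(rev)                  # reverse-edge
        moves = [m for m in moves if m not in tabu]
        if not moves:
            break
        g = max(moves, key=score)                  # best non-TABU change
        tabu = (tabu + [g])[-tabu_len:]
        if score(g) > score(best_g):
            best_g, stall = g, 0
        else:
            stall += 1
    return best_g

# Toy run (hypothetical): a score that rewards the edges of A -> B -> C.
nodes = ["A", "B", "C"]
pc = {"A": {"B"}, "B": {"C"}, "C": set()}
target = {("A", "B"), ("B", "C")}
g_best = hill_climb(nodes, pc,
                    lambda g: sum(1 for e in g if e in target) - len(g - target))
```

In the real algorithm the score is decomposable, so only the families touched by a move need rescoring; the sketch recomputes everything for brevity.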
4. Experimental validation on synthetic data
Before we proceed to the experiments on real-world multi-label data with H2PC, we first conduct an experimental comparison of H2PC against MMHC on synthetic data sets sampled from eight well-known benchmarks with various data sizes, in order to gauge the practical relevance of H2PC. These BNs have been previously used as benchmarks for BN learning algorithms (see Table 1 for details). Results are reported in terms of various performance indicators to investigate how well the network generalizes to new data and how well the learned dependence structure matches the true structure of the benchmark network. We implemented H2PC in R (R Core Team, 2013) and integrated the code into the bnlearn package from (Scutari, 2010). MMHC was implemented by Marco Scutari in bnlearn. The source code of H2PC as well as all data sets used for the empirical tests are publicly available2. The threshold considered for the type I error of the test is 0.05. Our experiments were carried out on a PC with an Intel(R)
2 https://github.com/madbix/bnlearn-clone-3.4
Table 1: Description of the BN benchmarks used in the experiments.
network      # var.   # edges   max. degree   domain   min/med/max
                                in/out        range    PC set size
child          20       25        2/7          2-6      1/2/8
insurance      27       52        3/7          2-5      1/3/9
mildew         35       46        3/3          3-100    1/2/5
alarm          37       46        4/5          2-4      1/2/6
hailfinder     56       66        4/16         2-11     1/1.5/17
munin1        186      273        3/15         2-21     1/3/15
pigs          441      592        2/39         3-3      1/2/41
link          724     1125        3/14         2-4      0/2/17
Core(TM) i5-3470M CPU @ 3.20 GHz with 4 GB RAM, running under 64-bit Linux.
We do not claim that those data sets resemble real-world problems; however, they make it possible to compare the outputs of the algorithms with the known structure. All BN benchmarks (structure and probability tables) were downloaded from the bnlearn repository3. Ten sample sizes have been considered: 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000 and 50000. All experiments are repeated 10 times for each sample size and each BN. We investigate the behavior of both algorithms using the same parametric tests as a reference.
4.1. Performance indicators

We first investigate the quality of the skeleton returned by H2PC during the CB phase. To this end, we measure the false positive edge ratio, the precision (i.e., the number of true positive edges in the output divided by the number of edges in the output), the recall (i.e., the number of true positive edges divided by the true number of edges) and a combination of precision and recall defined as sqrt((1 − precision)² + (1 − recall)²), which measures the Euclidean distance from perfect precision and recall, as proposed in (Pena et al., 2007). Second, to assess the quality of the final DAG output at the end of the SS phase, we report the five performance indicators (Scutari & Brogini, 2012) described below:
• the posterior density of the network for the data it was learned from, as a measure of goodness of fit. It is known as the Bayesian Dirichlet equivalent score (BDeu) from (Heckerman et al., 1995; Buntine, 1991) and has a single parameter, the equivalent sample size, which can be thought of as the size of an imaginary sample supporting the prior distribution. The equivalent sample size was set to 10 as suggested in (Koller & Friedman, 2009);
3 http://www.bnlearn.com/bnrepository
• the BIC score (Schwarz, 1978) of the network for the data it was learned from, again as a measure of goodness of fit;

• the posterior density of the network for a new data set, as a measure of how well the network generalizes to new data;

• the BIC score of the network for a new data set, again as a measure of how well the network generalizes to new data;
• the Structural Hamming Distance (SHD) between the learned and the true structure of the network, as a measure of the quality of the learned dependence structure. The SHD between two PDAGs is defined as the number of the following operators required to make the PDAGs match: add or delete an undirected edge, and add, remove, or reverse the orientation of an edge.
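The skeleton indicators above reduce to simple computations over undirected edge sets. A minimal sketch (in Python, not the authors' R code; the toy edge lists are invented for illustration):

```python
def skeleton_metrics(learned, true):
    """Precision, recall, and Euclidean distance from perfect
    precision/recall, for undirected edge sets given as pairs."""
    learned = {frozenset(e) for e in learned}
    true = {frozenset(e) for e in true}
    tp = len(learned & true)  # true positive edges
    precision = tp / len(learned) if learned else 0.0
    recall = tp / len(true) if true else 0.0
    dist = ((1 - precision) ** 2 + (1 - recall) ** 2) ** 0.5
    return precision, recall, dist

# Toy example: true skeleton A-B, B-C, C-D; learned A-B, B-C, A-D.
p, r, d = skeleton_metrics([("A", "B"), ("B", "C"), ("A", "D")],
                           [("A", "B"), ("B", "C"), ("C", "D")])
```

On this toy example precision = recall = 2/3 and the distance is √2/3 ≈ 0.47; the SHD is computed analogously, by counting edge additions, deletions, and orientation changes between the two PDAGs.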
For each data set sampled from the true probability distribution of the benchmark, we first learn a network structure with H2PC and MMHC and then we compute the relevant performance indicators for each pair of network structures. The data set used to assess how well the network generalizes to new data is generated again from the true probability structure of the benchmark networks and contains 50000 observations.
Notice that using the BDeu score as a metric of reconstruction quality has the following two problems. First, the score corresponds to the a posteriori probability of a network only under certain conditions (e.g., a Dirichlet distribution of the hyperparameters); it is unknown to what degree these assumptions hold in distributions encountered in practice. Second, the score is highly sensitive to the equivalent sample size (set to 10 in our experiments) and depends on the network priors used. Since, typically, the same arbitrary value of this parameter is used both during learning and for scoring the learned network, the metric favors algorithms that use the BDeu score for learning. In fact, the BDeu score does not rely on the structure of the original, gold standard network at all; instead it employs several assumptions to score the networks. For those reasons, in addition to the BDeu score we also report the BIC score and the SHD metric.
4.2. Results
In Figure 1, we report the quality of the skeleton obtained with HPC over that obtained with MMPC (before the SS phase) as a function of the sample size. Results for each benchmark are not shown here in detail due to space restrictions. For the sake of conciseness,
[Figure 1 comprises six panels plotted against sample size: the mean false negative rate (lower is better) and mean false positive rate (lower is better) for HPC and MMPC, and the HPC/MMPC increase factors for recall (higher is better), precision (higher is better), Euclidean distance (lower is better), and number of statistical tests.]
Figure 1: Quality of the skeleton obtained with HPC over that obtained with MMPC (after the CB phase). The first two figures present mean values aggregated over the 8 benchmarks. The last four figures present increase factors of HPC/MMPC, with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line).
the performance values are averaged over the 8 benchmarks depicted in Table 1. The increase factor for a given performance indicator is expressed as the ratio of the performance value obtained with HPC over that obtained with MMPC (the gold standard). Note that for some indicators, an increase is actually not an improvement but a degradation (e.g., false positive rate, Euclidean distance). For clarity, we mention explicitly on the subplots whether an increase factor > 1 should be interpreted as an improvement or not. Regarding the quality of the superstructure, the advantages of HPC over MMPC are noticeable. As observed, HPC consistently increases the recall and reduces the rate of false negative edges. As expected, this benefit comes at a small expense in terms of false positive edges. HPC also improves the Euclidean distance from perfect precision and recall on all benchmarks, while increasing the number of independence tests and thus the running time in the CB phase (see number of statistical tests). It is worth noting that HPC is capable of reducing the Euclidean distance by 50% with 50000 samples (lower left plot). These results are very much in line with other experiments presented in (Rodrigues de Morais & Aussem, 2010a; Villanueva & Maciel, 2012).
In Figure 2, we report the quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase) as a function of the sample size. Regarding BDeu and BIC on both training and test data, the improvements are noteworthy. The results in terms of goodness of fit to training data and new data using H2PC clearly dominate those obtained using MMHC, whatever the sample size considered, hence its ability to generalize better. Regarding the quality of the network structure itself (i.e., how close the DAG is to the true dependence structure of the data), it is pretty much a dead heat between the 2 algorithms on small sample sizes (i.e., 50 and 100); however, we found H2PC to perform significantly better on larger sample sizes. The SHD increase factor decays rapidly (lower is better) as the sample size increases (lower left plot). For 50000 samples, the SHD is on average only 50% that of MMHC. Regarding the computational burden involved, we may observe from Table 2 that H2PC incurs a computational overhead compared to MMHC. The running time increase factor grows somewhat linearly with the sample size. With 50000 samples, H2PC is approximately 10 times slower on average than MMHC. This is mainly due to the computational expense incurred in obtaining larger PC sets with HPC, compared to MMPC.
5. Application to multi-label learning
In this section, we address the problem of multi-label learning with H2PC. MLC is a challenging problem in many real-world application domains, where each instance can be assigned simultaneously to multiple binary labels (Dembczyński et al., 2012; Read et al., 2009; Madjarov et al., 2012; Kocev et al., 2007; Tsoumakas et al., 2010b). Formally, learning from multi-label examples amounts to finding a mapping from a space of features to a space of labels. We shall assume throughout that X (a random vector in R^d) is the feature set, Y (a random vector in {0, 1}^n) is the label set, U = X ∪ Y, and p is a probability distribution defined over U. Given a multi-label training set D, the goal of multi-label learning is to find a function which is able to map any unseen example to its proper set of labels. From the Bayesian point of view, this problem amounts to modeling the conditional joint distribution p(Y | X).
5.1. Related work
This MLC problem may be tackled in various ways (Luaces et al., 2012; Alvares-Cherman et al., 2011; Read et al., 2009; Blockeel et al., 1998; Kocev et al., 2007). Each of these approaches is supposed to capture, to some extent, the relationships between labels. The two most straightforward meta-learning methods (Madjarov et al., 2012) are: Binary Relevance (BR) (Luaces et al., 2012) and Label Powerset (LP) (Tsoumakas & Vlahavas, 2007; Tsoumakas et al., 2010b). Both methods can be regarded as opposites in the sense that BR considers each label independently, while LP considers the whole label set at once (one multi-class problem). An important question remains: what exactly shall we capture from the statistical relationships between labels to solve the multi-label classification problem? The problem has attracted a great deal of interest (Dembczyński et al., 2012; Zhang & Zhang, 2010). It is well beyond the scope and purpose of this paper to delve deeper into these approaches; we point the reader to (Dembczyński et al., 2012; Tsoumakas et al., 2010b) for a review. The second fundamental problem that we wish to address involves finding an optimal feature subset selection of a label set, w.r.t. an information-theoretic criterion (Koller & Sahami, 1996). As in the single-label case, multi-label feature selection has been studied recently and has encountered some success (Gu et al., 2011; Spolaor et al., 2013).
[Figure 2 comprises six panels of increase factors plotted against sample size: log(BDeu posterior) on train data (lower is better), BIC on train data (lower is better), log(BDeu posterior) on test data (lower is better), BIC on test data (lower is better), Structural Hamming Distance (lower is better), and number of scores.]
Figure 2: Quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase). The 6 figures present increase factors of H2PC/MMHC, with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line).
Table 2: Total running time ratio R (H2PC/MMHC). White cells indicate a ratio R < 1 (in favor of H2PC), while shaded cells indicate a ratio R > 1 (in favor of MMHC). The darker, the larger the ratio.
Network      50     100    200    500    1000   2000   5000   10000  20000  50000
child       0.94   0.87   1.14   1.99   2.26   2.12   2.36   2.58   1.75   1.78
            ±0.1   ±0.1   ±0.1   ±0.2   ±0.1   ±0.2   ±0.4   ±0.3   ±0.6   ±0.5
insurance   0.96   1.09   1.56   2.93   3.06   3.48   3.69   4.10   3.76   3.75
            ±0.1   ±0.1   ±0.1   ±0.2   ±0.3   ±0.4   ±0.3   ±0.4   ±0.6   ±0.5
mildew      0.77   0.80   0.79   0.94   1.01   1.23   1.74   2.14   3.26   6.20
            ±0.1   ±0.1   ±0.1   ±0.1   ±0.1   ±0.1   ±0.2   ±0.2   ±0.6   ±1.0
alarm       0.88   1.11   1.75   2.43   2.55   2.71   2.65   2.80   2.49   2.18
            ±0.1   ±0.1   ±0.1   ±0.1   ±0.1   ±0.1   ±0.2   ±0.2   ±0.3   ±0.6
hailfinder  0.85   0.85   1.40   1.69   1.83   2.06   2.13   2.12   1.95   1.96
            ±0.1   ±0.1   ±0.1   ±0.1   ±0.1   ±0.1   ±0.1   ±0.2   ±0.2   ±0.6
munin1      0.77   0.85   0.93   1.35   2.11   4.30   12.92  23.32  24.95  24.76
            ±0.0   ±0.0   ±0.0   ±0.0   ±0.0   ±0.2   ±0.7   ±2.6   ±5.1   ±6.7
pigs        0.80   0.80   4.55   4.71   5.00   5.62   7.63   11.10  14.02  11.74
            ±0.0   ±0.0   ±0.1   ±0.1   ±0.2   ±0.2   ±0.3   ±0.6   ±1.7   ±3.2
link        1.16   1.93   2.76   5.55   7.04   8.19   10.00  13.87  15.32  24.74
            ±0.0   ±0.0   ±0.0   ±0.1   ±0.2   ±0.2   ±0.3   ±0.4   ±2.5   ±4.2
all         0.89   1.04   1.86   2.70   3.11   3.71   5.39   7.75   8.44   9.64
            ±0.1   ±0.1   ±1.2   ±1.5   ±1.9   ±2.2   ±4.0   ±7.3   ±8.4   ±9.7
5.2. Label powerset decomposition
We shall first introduce the concept of label powerset that will play a pivotal role in the factorization of the conditional distribution p(Y | X).

Definition 1. YLP ⊆ Y is called a label powerset iff YLP ⊥⊥ Y \ YLP | X. Additionally, YLP is said to be minimal if it is non-empty and has no label powerset as a proper subset.
Let LP denote the set of all label powersets defined over U, and minLP the set of all minimal label powersets. It is easily shown that {Y, ∅} ⊆ LP. The key idea behind label powersets is the decomposition of the conditional distribution of the labels into a product of factors

p(Y | X) = ∏_{j=1}^{L} p(Y_{LP_j} | X) = ∏_{j=1}^{L} p(Y_{LP_j} | M_{LP_j})

where {Y_{LP_1}, . . . , Y_{LP_L}} is a partition of label powersets and M_{LP_j} is a Markov blanket of Y_{LP_j} in X. From the above definition, we have Y_{LP_i} ⊥⊥ Y_{LP_j} | X, ∀i ≠ j.
In the framework of MLC, one can consider a multitude of loss functions. In this study, we focus on a non label-wise decomposable loss function called the subset 0/1 loss, which generalizes the well-known 0/1 loss from the conventional to the multi-label setting. The risk-minimizing prediction for the subset 0/1 loss is simply given by the mode of the distribution (Dembczyński et al., 2012). Therefore, we seek a factorization into a product of minimal factors in order to facilitate the estimation of the mode of the conditional distribution (also called the most probable explanation (MPE)):

max_y p(y | x) = ∏_{j=1}^{L} max_{y_{LP_j}} p(y_{LP_j} | x) = ∏_{j=1}^{L} max_{y_{LP_j}} p(y_{LP_j} | m_{LP_j})
The next section aims to obtain theoretical results for the characterization of the minimal label powersets Y_{LP_j} and their Markov boundaries M_{LP_j} from a DAG, in order to be able to estimate the MPE more effectively.
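Because the powersets are conditionally independent given X, the joint mode can be found by maximizing each factor separately, which avoids enumerating the full label space. A small numeric illustration (the two factor distributions below are invented for the example):

```python
# Hypothetical conditional distributions of two label powersets given some x,
# assumed independent given x (illustrative numbers only).
p1 = {(0,): 0.3, (1,): 0.7}                                 # p(y_LP1 | x)
p2 = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.6, (1, 1): 0.1}   # p(y_LP2 | x)

# Joint mode by brute force over the full label space ...
joint_mode = max(
    ((a, b) for a in p1 for b in p2),
    key=lambda ab: p1[ab[0]] * p2[ab[1]],
)

# ... equals the concatenation of the per-factor modes (the MPE).
factor_mode = (max(p1, key=p1.get), max(p2, key=p2.get))
```

Here both computations select the assignment (1) for the first powerset and (1, 0) for the second, but the factorized search scales with the sum, not the product, of the factor sizes.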
5.3. Label powerset characterization
In this section, we show that minimal label powersets can be depicted graphically when p is faithful to a DAG,

Theorem 6. Suppose p is faithful to a DAG G. Then, Y_i and Y_j belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Y_i and Y_j in Y such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e. a collider of the form Y_p → X ← Y_q).
The proof is given in the Appendix. We shall now address the following question: what is the Markov boundary in X of a minimal label powerset? Answering
this question for general distributions is not trivial. We establish a useful relation between a label powerset and its Markov boundary in X when p is faithful to a DAG,

Theorem 7. Suppose p is faithful to a DAG G. Let Y_LP = {Y_1, Y_2, . . . , Y_n} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = ⋃_{j=1}^{n} {PC_{Y_j} ∪ SP_{Y_j}} \ Y.
The proof is given in the Appendix.
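Both theorems can be read off a learned DAG mechanically: labels are merged whenever they are adjacent or co-parents of a common feature (the collider case), and each resulting powerset's Markov boundary in X is the union of its labels' parents, children, and spouses, minus the labels. A stdlib-only sketch, not the authors' implementation (the toy DAG and helper names are ours):

```python
from itertools import combinations

# Hypothetical DAG: parents[node] = set of parents (Y* labels, X* features).
parents = {
    "Y1": set(), "Y2": {"Y1"}, "Y3": set(), "Y4": set(),
    "X1": {"Y1"}, "X2": {"Y3", "Y4"}, "X3": {"X1"}, "X4": {"Y2", "X1"},
}
labels = {n for n in parents if n.startswith("Y")}

children = {n: set() for n in parents}
for n, ps in parents.items():
    for p in ps:
        children[p].add(n)

# --- Theorem 6: minimal label powersets via union-find over the labels ---
root = {y: y for y in labels}
def find(y):
    while root[y] != y:
        y = root[y]
    return y

for a, b in combinations(labels, 2):
    linked = b in parents[a] or a in parents[b]      # direct Y-Y edge
    linked = linked or any(                           # collider Y -> X <- Y
        a in parents[x] and b in parents[x]
        for x in parents if x not in labels)
    if linked:
        root[find(a)] = find(b)

groups = {}
for y in labels:
    groups.setdefault(find(y), set()).add(y)
powersets = sorted(map(frozenset, groups.values()), key=sorted)

# --- Theorem 7: Markov boundary in X of a label powerset ---
def markov_boundary(lp):
    mb = set()
    for y in lp:
        mb |= parents[y] | children[y]                # PC: parents + children
        for c in children[y]:
            mb |= parents[c]                          # SP: spouses
    return mb - labels
```

On this toy DAG the procedure yields two minimal powersets, {Y1, Y2} (linked by a direct edge) and {Y3, Y4} (linked through the collider X2), with Markov boundaries {X1, X4} and {X2} respectively.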
5.4. Experimental setup
The MLC problem is decomposed into a series of multi-class classification problems, where each multi-class variable encodes one label powerset, with as many classes as the number of possible combinations of labels, or those present in the training data. At this point, it should be noted that the LP method is a special case of this framework since the whole label set is a particular label powerset (not necessarily minimal though). The above procedure can be summarized as follows:
1. Learn the BN local graph G around the label set;
2. Read off G the minimal label powersets and their respective Markov boundaries;
3. Train an independent multi-class classifier on each minimal LP, with the input space restricted to its Markov boundary in G;
4. Aggregate the predictions of each classifier to output the most probable explanation, i.e. arg max_y p(y | x).
To assess separately the quality of the minimal label powerset decomposition and the feature subset selection with Markov boundaries, we investigate 4 scenarios:
• BR without feature selection (denoted BR): a classifier is trained on each single label with all features as input. This is the simplest approach as it does not exploit any label dependency. It serves as a baseline learner for comparison purposes.

• BR with feature selection (denoted BR+MB): a classifier is trained on each single label with the input space restricted to its Markov boundary in G. Compared to the previous strategy, we evaluate here the effectiveness of the feature selection task.

• Minimal label powerset method without feature selection (denoted MLP): the minimal label powersets are obtained from the DAG. All features are used as inputs.

• Minimal label powerset method with feature selection (denoted MLP+MB): the minimal label powersets are obtained from the DAG. The input space is restricted to the Markov boundary of the labels in that powerset.
We use the same base learner in each meta-learning method: the well-known Random Forest classifier (Breiman, 2001). RF achieves good performance in standard classification as well as in multi-label problems (Kocev et al., 2007; Madjarov et al., 2012), and is able to handle both continuous and discrete data easily, which is much appreciated. The standard RF implementation in R (Liaw & Wiener, 2002)4 was used. For practical purposes, we restricted the forest size of RF to 100 trees, and left the other parameters at their default values.
5.4.1. Data and performance indicators

A total of 10 multi-label data sets are collected for experiments in this section, whose characteristics are summarized in Table 3. These data sets come from different problem domains including text, biology, and music. They can be found on the Mulan5 repository, except for image, which comes from Zhou6 (Maron & Ratan, 1998). Let D be the multi-label data set; we use |D|, dim(D), L(D), F(D) to represent the number of examples, number of features, number of possible labels, and feature type respectively. DL(D) = |{Y | ∃x : (x, Y) ∈ D}| counts the number of distinct label combinations appearing in the data set. Continuous values are binarized during the BN structure learning phase.
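In other words, DL(D) is just the number of distinct label vectors occurring in the data, which can be far smaller than the 2^L(D) possible combinations. A one-liner sketch on a toy label matrix (illustrative data, not one of the benchmarks):

```python
# Toy multi-label data set: 5 examples, 3 labels; rows are label vectors.
Y = [(1, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0), (0, 1, 1)]
DL = len(set(Y))  # number of distinct label combinations, here 3 of 2**3 = 8
```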
Table 3: Data sets characteristics
data set   domain    |D|   dim(D)  F(D)   L(D)  DL(D)
emotions   music      593     72   cont.     6     27
yeast      biology   2417    103   cont.    14    198
image      images    2000    135   cont.     5     20
scene      images    2407    294   cont.     6     15
slashdot   text      3782   1079   disc.    22    156
genbase    biology    662   1186   disc.    27     32
medical    text       978   1449   disc.    45     94
enron      text      1702   1001   disc.    53    753
bibtex     text      7395   1836   disc.   159   2856
corel5k    images    5000    499   disc.   374   3175
The performance of a multi-label classifier can be assessed by several evaluation measures
4 http://cran.r-project.org/web/packages/randomForest
5 http://mulan.sourceforge.net/datasets.html
6 http://lamda.nju.edu.cn/data_MIMLimage.ashx
(Tsoumakas et al., 2010a). We focus on maximizing a non-decomposable score function: the global accuracy (also termed subset accuracy, the complement of the 0/1 loss), which measures the correct classification rate of the whole label set (exact match of all labels required). Note that the global accuracy implicitly takes into account the label correlations. It is therefore a very strict evaluation measure, as it requires an exact match of the predicted and the true set of labels. It was recently proved in (Dembczyński et al., 2012) that BR is optimal for decomposable loss functions (e.g., the Hamming loss), while non-decomposable loss functions (e.g. the subset 0/1 loss) inherently require knowledge of the label conditional distribution. 10-fold cross-validation was performed for the evaluation of the MLC methods.
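Concretely, the global (subset) accuracy is the fraction of test examples whose entire predicted label vector matches the true one. A short sketch (the helper name and toy vectors are ours, not from the benchmarks):

```python
def global_accuracy(y_true, y_pred):
    """Subset accuracy: exact match of the whole label vector is required."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

acc = global_accuracy(
    [(1, 0, 1), (0, 1, 0), (1, 1, 1)],
    [(1, 0, 1), (0, 1, 1), (1, 1, 1)],
)  # 2 of 3 label vectors match exactly
```

The second example is counted as wrong even though two of its three labels are correct, which is what makes this measure so strict compared to label-wise scores such as Hamming accuracy.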
5.4.2. Results

Table 4 reports the outputs of H2PC and MMHC, Table 5 shows the global accuracy of each method on the 10 data sets. Table 6 reports the running time and the average node degree of the labels in the DAGs obtained with both methods. Figures 3 up to 7 display graphically the local DAG structures around the labels, obtained with H2PC and MMHC, for illustration purposes.
Several conclusions may be drawn from these experiments. First, we may observe by inspection of the average degree of the label nodes in Table 6 that several DAGs are densely connected, like scene or bibtex, while others are rather sparse, like genbase, medical, corel5k. The DAGs displayed in Figures 3 up to 7 lend themselves to interpretation. They can be used for encoding as well as portraying the conditional independencies, and the d-separation criterion can be used to read them off the graph. Many label powersets that are reported in Table 4 can be identified by graphical inspection of the DAGs. For instance, the two label powersets in yeast (Figure 3, bottom plot) are clearly noticeable in both DAGs. Clearly, BNs have a number of advantages over alternative methods. They lay bare useful information about the label dependencies, which is crucial if one is interested in gaining an understanding of the underlying domain. It is however well beyond the scope of this paper to delve deeper into the DAG interpretation. Overall, it appears that the structures recovered by H2PC are significantly denser compared to MMHC. On bibtex and enron, the increase of the average label degree is the most spectacular. On bibtex (resp. enron), the average label degree rose from 2.6 to 6.4 (resp. from 1.2 to 3.3). This result is in nice agreement with the experiments in the previous section, as H2PC was shown to consistently reduce the rate of false negative edges with respect to MMHC (at the cost of a slightly higher false discovery rate).
Second, Table 4 is very instructive as it reveals that on emotions, image, and scene, the MLP approach boils down to the LP scheme. This is easily seen as there is only one minimal label powerset extracted on average. In contrast, on genbase the MLP approach boils down to the simple BR scheme. This is confirmed by inspecting Table 5: the performances of MMHC and H2PC (without feature selection) are equal. An interesting observation upon looking at the distribution of the label powerset size is that, for the remaining data sets, MLP mostly decomposes the label set in two parts: one on which it performs BR and the other one on which it performs LP. Take for instance enron: it can be seen from Table 4 that there are approximately 23 label singletons and a powerset of 30 labels with H2PC, for a total of 53 labels. The gain in performance with MLP over BR, our baseline learner, can be ascribed to the quality of the label powerset decomposition, as BR is ignoring label dependencies. As expected, the results using MLP clearly dominate those obtained using BR on all data sets except genbase. The DAGs display the minimal label powersets and their relevant features.
Third, H2PC compares favorably to MMHC. On scene for instance, the accuracy of MLP+MB rose from 20% (with MMHC) to 56% (with H2PC). On yeast, it rose from 7% to 23%. Without feature selection, the difference in global accuracy is less pronounced but still in favor of H2PC. A Wilcoxon signed rank paired test reveals statistically significant improvements of H2PC over MMHC in the MLP approach without feature selection (p < 0.02). This trend is more pronounced when the feature selection is used (p < 0.001), using MLP+MB. Note however that the LP decompositions for H2PC and MMHC on yeast and image are strictly identical, hence the same accuracy values in Table 5 in the MLP column.
Fourth, regarding the utility of the feature selection, it is difficult to reach any conclusion. Whether BR or MLP is used, the use of the selected features as inputs to the classification model is not shown to be greatly beneficial in terms of global accuracy on average. The performance of BR and our MLP with all the features outperforms that with the selected features on 6 data sets, but the feature selection leads to actual improvements on 3 data sets. The difference in accuracy with and without the feature selection was not shown to be statistically significant (p > 0.20 with a Wilcoxon signed rank paired test). Surprisingly, the feature selection did exceptionally well on genbase. On this data set, the increase in accuracy is the most impressive: it rose from
7% to 98%, which is atypical. The dramatic increase in accuracy on genbase is due solely to the restricted feature set as input to the classification model. This is also observed on medical, though to a lesser extent. Interestingly, on large and densely connected networks (e.g. bibtex, slashdot and corel5k), the feature selection performed very well in terms of global accuracy and significantly reduced the input space, which is noticeable. On emotions, yeast, image, scene and genbase, the method reduced the feature set down to nearly 1/100 of its original size. The feature selection should be evaluated in view of its effectiveness at balancing the increasing error and the decreasing computational burden by drastically reducing the feature space. We should also keep in mind that feature relevance cannot be defined independently of the learner and the model-performance metric (e.g., the loss function used). Admittedly, our feature selection based on the Markov boundaries is not necessarily optimal for the base MLC learner used here, namely the Random Forest model.
As far as the overall running time performance is concerned, we see from Table 6 that for both methods, the running time grows somewhat exponentially with the size of the Markov boundary and the number of features, hence the considerable rise in total running time on bibtex. H2PC takes almost 200 times longer on bibtex (1826 variables) and enron (1001 variables), which is quite considerable but still affordable (44 hours of single-CPU time on bibtex and 13 hours on enron). We also observe that the size of the parent set with H2PC is 2.5 (on bibtex) and 3.6 (on enron) times larger than that of MMHC (which may hurt interpretability). In fact, the running time is known to increase exponentially with the parent set size of the true underlying network. This is mainly due to the computational overhead of the greedy search-and-score procedure with larger parent sets, which is the most promising part to optimize in terms of computational gains, as we discuss in the Conclusion.
6. Discussion & practical applications
Our prime conclusion is that H2PC is a promising approach to constructing BN global or local structures around specific nodes of interest, with potentially thousands of variables. Concentrating on higher recall values while keeping the false positive rate as low as possible pays off in terms of goodness of fit and structure accuracy. Historically, the main practical difficulty in the application of BN structure discovery approaches has been a lack of suitable computing resources and relevant accessible software. The number
Table 4: Distribution of the number and the size of the minimal label powersets output by H2PC (top) and MMHC (bottom). On each data set: mean number of powersets, minimum/median/maximum number of labels per powerset, and minimum/median/maximum number of distinct classes per powerset. The total number of labels and distinct label combinations is recalled for convenience.
H2PC

dataset      # LPs         # labels / LP          # classes / LP
                           min/med/max  (L(D))    min/med/max  (DL(D))
emotions      1.0 ± 0.0    6 / 6 / 6     (6)      26 / 27 / 27    (27)
yeast         2.0 ± 0.0    2 / 7 / 12    (14)     3 / 64 / 135    (198)
image         1.0 ± 0.0    5 / 5 / 5     (5)      19 / 20 / 20    (20)
scene         1.0 ± 0.0    6 / 6 / 6     (6)      14 / 15 / 15    (15)
slashdot      9.5 ± 0.8    1 / 1 / 14    (22)     1 / 2 / 111     (156)
genbase      26.9 ± 0.3    1 / 1 / 2     (27)     1 / 2 / 2       (32)
medical      38.6 ± 1.8    1 / 1 / 6     (45)     1 / 2 / 10      (94)
enron        24.5 ± 1.9    1 / 1 / 30    (53)     1 / 2 / 334     (753)
bibtex       39.6 ± 4.3    1 / 1 / 112   (159)    2 / 2 / 1588    (2856)
corel5k      97.7 ± 3.9    1 / 1 / 263   (374)    1 / 2 / 2749    (3175)

MMHC

dataset      # LPs         # labels / LP          # classes / LP
                           min/med/max  (L(D))    min/med/max  (DL(D))
emotions      1.2 ± 0.4    1 / 6 / 6     (6)      2 / 26 / 27     (27)
yeast         2.0 ± 0.0    1 / 7 / 13    (14)     2 / 89 / 186    (198)
image         1.0 ± 0.0    5 / 5 / 5     (5)      19 / 20 / 20    (20)
scene         1.0 ± 0.0    6 / 6 / 6     (6)      14 / 15 / 15    (15)
slashdot     15.1 ± 0.6    1 / 1 / 9     (22)     1 / 2 / 46      (156)
genbase      27.0 ± 0.0    1 / 1 / 1     (27)     1 / 2 / 2       (32)
medical      41.7 ± 1.4    1 / 1 / 4     (45)     1 / 2 / 7       (94)
enron        36.8 ± 2.7    1 / 1 / 12    (53)     1 / 2 / 119     (753)
bibtex       77.2 ± 3.1    1 / 1 / 28    (159)    2 / 2 / 233     (2856)
corel5k     165.3 ± 5.5    1 / 1 / 177   (374)    1 / 2 / 2294    (3175)
Table 5: Global classification accuracies using 4 learning methods (BR, MLP, BR+MB, MLP+MB) based on the DAG obtained with H2PC and MMHC. Best values between H2PC and MMHC are boldfaced.

dataset      BR      MLP             BR+MB           MLP+MB
                     MMHC    H2PC    MMHC    H2PC    MMHC    H2PC
emotions    0.327   0.371   0.391   0.147   0.266   0.189   0.285
yeast       0.169   0.273   0.271   0.062   0.155   0.073   0.233
image       0.342   0.504   0.504   0.227   0.252   0.308   0.360
scene       0.580   0.733   0.733   0.123   0.361   0.199   0.565
slashdot    0.355   0.419   0.474   0.274   0.373   0.278   0.439
genbase     0.069   0.069   0.069   0.974   0.976   0.974   0.976
medical     0.454   0.499   0.502   0.630   0.651   0.636   0.669
enron       0.134   0.165   0.180   0.024   0.079   0.079   0.157
bibtex      0.108   0.115   0.167   0.120   0.133   0.121   0.177
corel5k     0.002   0.030   0.043   0.000   0.002   0.010   0.034
Table 6: DAG learning time (in seconds) and average degree of the label nodes.

dataset      MMHC                      H2PC
             time     label degree     time        label degree
emotions       1.5     2.6 ± 1.1          16.2      4.7 ± 1.8
yeast          9.0     3.0 ± 1.6          90.9      5.0 ± 2.9
image          3.1     4.0 ± 0.8          28.4      4.6 ± 1.0
scene          9.9     3.1 ± 1.0         662.9      6.6 ± 1.8
slashdot      44.8     2.7 ± 1.4         822.3      4.0 ± 5.2
genbase       12.5     1.0 ± 0.3          12.4      1.1 ± 0.3
medical       43.8     1.7 ± 1.1         334.8      2.9 ± 2.4
enron        199.8     1.2 ± 1.5       47863.0      4.3 ± 5.2
bibtex       960.8     2.6 ± 1.3      159469.6      6.4 ± 10.3
corel5k      169.6     1.5 ± 1.4        2513.0      2.9 ± 2.8
of variables which can be included in exact BN analysis is still limited. As a guide, this might be less than about 40 variables for exact structural search techniques (Perrier et al., 2008; Kojima et al., 2010). In contrast, constraint-based heuristics like the one presented in this study are capable of processing many thousands of features within hours on a personal computer, while maintaining a very high structural accuracy. H2PC and MMHC could potentially handle up to 10,000 labels in a few days of single-CPU time, and far less by parallelizing the skeleton identification algorithm as discussed in (Tsamardinos et al., 2006; Villanueva & Maciel, 2014).
The advantages in terms of structure accuracy and its ability to scale to thousands of variables open up many avenues of future possible applications of H2PC in various domains, as we shall discuss next. For example, BNs have especially proven to be useful abstractions in computational biology (Nagarajan et al., 2013; Scutari & Nagarajan, 2013; Prestat et al., 2013; Aussem et al., 2012, 2010; Pena, 2008; Pena et al., 2005). Identifying the gene network is crucial for understanding the behavior of the cell which, in turn, can lead to better diagnosis and treatment of diseases. This is also of great importance for characterizing the function of genes and the proteins they encode in determining traits, psychology, or development of an organism. Genome sequencing uses high-throughput techniques like DNA microarrays, proteomics, metabolomics and mutation analysis to describe the function and interactions of thousands of genes (Zhang & Zhou., 2006). Learning BN models of gene networks from these huge data is still a difficult task (Badea, 2004; Bernard & Hartemink, 2005; Friedman et al., 1999a; Ott et al., 2004; Peer et al., 2001). In these studies, the authors had to decide in advance which genes were included in the learning process, in all cases fewer than 1000, and which genes were excluded from it (Pena, 2008). H2PC overcomes this problem by focusing the search around a targeted gene: the key step is the identification of the vicinity of a node X (Pena et al., 2005).
Our second objective in this study was to demonstrate the potential utility of hybrid BN structure discovery for multi-label learning. In multi-label data where many inter-dependencies between the labels may be present, explicitly modeling all relationships between the labels is intuitively far more reasonable (as demonstrated in our experiments). BNs explicitly account for such inter-dependencies, and the DAG allows us to identify an optimal set of predictors for every label powerset. The experiments presented here support the conclusion that local structural learning in the form of local neighborhood induction and Markov blanket discovery is a theoretically well-motivated approach that can serve as a powerful learning framework for label dependency analysis geared toward multi-label learning. Multi-label scenarios are found in many application domains, such as multimedia annotation (Snoek et al., 2006; Trohidis et al., 2008), tag recommendation, text categorization (McCallum, 1999; Zhang & Zhou., 2006), protein function classification (Roth & Fischer, 2007), and antiretroviral drug categorization (Borchani et al., 2013).
7. Conclusion & avenues for future research
We first discussed a hybrid algorithm for global or local (around target nodes) BN structure learning called Hybrid HPC (H2PC). Our extensive experiments showed that H2PC outperforms the state-of-the-art MMHC by a significant margin in terms of edge recall without sacrificing the number of extra edges, which is crucial for the soundness of the super-structure used during the second stage of hybrid methods, like the ones proposed in (Perrier et al., 2008; Kojima et al., 2010). The code of H2PC is open-source and publicly available online at https://github.com/madbix/bnlearn-clone-3.4. Second, we discussed an application of H2PC to the multi-label learning problem, which is a challenging problem in many real-world application domains. We established theoretical results, under the faithfulness condition, in order to characterize graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution, and their respective Markov boundaries. As far as we know, this is the first investigation of Markov boundary principles applied to the optimal variable/feature selection problem in multi-label learning. These formal results offer a simple guideline to characterize graphically: i) the minimal label powerset decomposition, i.e. into minimal subsets Y_LP ⊆ Y such that Y_LP ⊥⊥ Y \ Y_LP | X, and ii) the minimal subset of features, w.r.t. an Information Theory criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of multi-label classification. The theoretical analysis laid the foundation for a practical multi-label classification procedure. Another set of experiments was carried out on a broad range of multi-label data sets from different problem domains to demonstrate its effectiveness. H2PC was shown to outperform MMHC by a significant margin in terms of global accuracy.
We suggest several avenues for future research. As far as BN structure learning is concerned, future work will aim at: 1) ascertaining which independence test (e.g. tests targeting specific distributions, employing parametric assumptions, etc.) is most suited to the data at hand (Tsamardinos & Borboudakis, 2010; Scutari, 2011); 2) controlling the false discovery rate of the edges in the graph output by H2PC (Pena, 2008), especially when dealing with more nodes than samples, e.g. learning gene networks from gene expression data. In this study, H2PC was run independently on each node without keeping track of the dependencies found previously. This led to some loss of efficiency due to redundant calculations. The optimization of the H2PC code is currently being undertaken to lower the computational cost while maintaining its performance. These optimizations will include the use of a cache to store the (in)dependencies and the use of a global structure. Other interesting research avenues to explore are extensions and modifications of the greedy search-and-score procedure, which is the most promising part to optimize in terms of computational gains. Regarding the multi-label learning problem, experiments with several thousands of labels are currently being conducted and will be reported in due course. We also intend to work on relaxing the faithfulness assumption and to derive a practical multi-label classification algorithm based on H2PC that is correct under milder assumptions on the joint distribution (e.g. Composition, Intersection). This needs further substantiation through more analysis.
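The caching optimization mentioned above can be sketched as follows; this is an illustrative wrapper with assumed names (`ci_test`, `make_cached_ci_test`), not the actual H2PC implementation. Conditional independence queries are memoized under a canonical key, so symmetric and repeated calls across target nodes reuse one result:

```python
def make_cached_ci_test(ci_test):
    """Wrap a conditional-independence test with a memoization cache.

    ci_test(x, y, z) -> bool, where x and y are variable names and z is
    an iterable conditioning set. The cache key is canonical: (x, y) is
    unordered and z is order-insensitive, so test(x, y, z) and
    test(y, x, z) share one cache entry.
    """
    cache = {}

    def cached(x, y, z=()):
        key = (frozenset((x, y)), frozenset(z))
        if key not in cache:
            cache[key] = ci_test(x, y, z)
        return cache[key]

    cached.cache = cache  # exposed for inspection / statistics
    return cached
```

In a global-structure variant, the same dictionary would be shared across all per-node runs, which is where the redundant calculations mentioned above would be saved.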
Acknowledgments
The authors thank Marco Scutari for sharing his bnlearn package in R. The research was also supported by grants from the European ENIAC Joint Undertaking (INTEGRATE project) and from the French Rhône-Alpes Complex Systems Institute (IXXI).
Appendix
Lemma 8. ∀ Y_LPi, Y_LPj ∈ LP, we have Y_LPi ∩ Y_LPj ∈ LP.
Proof. To keep the notation uncluttered and for the sake of simplicity, consider a partition {Y1, Y2, Y3, Y4} of Y such that Y_LPi = Y1 ∪ Y2 and Y_LPj = Y2 ∪ Y3. From the label powerset assumption for Y_LPi and Y_LPj, we have (Y1 ∪ Y2) ⊥⊥ (Y3 ∪ Y4) | X and (Y2 ∪ Y3) ⊥⊥ (Y1 ∪ Y4) | X. From the Decomposition property, we have that Y2 ⊥⊥ (Y3 ∪ Y4) | X. Using the Weak union property, we obtain that Y2 ⊥⊥ Y1 | (Y3 ∪ Y4 ∪ X). From these two facts, we can use the Contraction property to show that Y2 ⊥⊥ (Y1 ∪ Y3 ∪ Y4) | X. Therefore, Y2 = Y_LPi ∩ Y_LPj is a label powerset by definition.
Lemma 9. Let Yi and Yj denote two distinct labels in Y, and denote by Y_LPi and Y_LPj their respective minimal label powersets. Then we have

∃ Z ⊆ Y \ {Yi, Yj}, {Yi} ⊥̸⊥ {Yj} | (X ∪ Z)  ⇒  Y_LPi = Y_LPj.
Proof. Let us suppose Y_LPi ≠ Y_LPj. By the label powerset definition for Y_LPi, we have Y_LPi ⊥⊥ Y \ Y_LPi | X. As Y_LPi ∩ Y_LPj = ∅ owing to Lemma 8, we have that Yj ∉ Y_LPi. ∀ Z ⊆ Y \ {Yi, Yj}, Z can be decomposed as Z_i ∪ Z_j such that Z_i = Z ∩ (Y_LPi \ {Yi}) and Z_j = Z \ Z_i. Using the Decomposition property, we have that ({Yi} ∪ Z_i) ⊥⊥ ({Yj} ∪ Z_j) | X. Using the Weak union property, we have that {Yi} ⊥⊥ {Yj} | (X ∪ Z). As this is true ∀ Z ⊆ Y \ {Yi, Yj}, there exists no Z ⊆ Y \ {Yi, Yj} such that {Yi} ⊥̸⊥ {Yj} | (X ∪ Z), which completes the proof.
Lemma 10. Consider Y_LP a minimal label powerset. Then, if p satisfies the Composition property, there exists no nonempty Z ⊂ Y_LP such that Z ⊥⊥ Y_LP \ Z | X.
Proof. By contradiction, suppose such a nonempty Z exists, i.e., Z ⊥⊥ Y_LP \ Z | X. From the label powerset assumption of Y_LP, we have that Y_LP ⊥⊥ Y \ Y_LP | X. From these two facts, we can use the Composition property to show that Z ⊥⊥ Y \ Z | X, which contradicts the minimal label powerset assumption of Y_LP. This concludes the proof.
Theorem 7. Suppose p is faithful to a DAG G. Then, Yi and Yj belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Yi and Yj in Y such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e. a collider of the form Yp → X ← Yq).
Proof. Suppose such a path exists. By conditioning on all the intermediate colliders Yk in Y of the form Yp → Yk ← Yq along an undirected path in G between nodes Yi and Yj, we ensure that ∃ Z ⊆ Y \ {Yi, Yj}, ¬dSep(Yi, Yj | X ∪ Z). Due to the faithfulness, this is equivalent to {Yi} ⊥̸⊥ {Yj} | (X ∪ Z). From Lemma 9, we conclude that Yi and Yj belong to the same minimal label powerset. To show the converse, note that owing to Lemma 10, we know that there exists no partition {Z_i, Z_j} of Z such that ({Yi} ∪ Z_i) ⊥⊥ ({Yj} ∪ Z_j) | X. Due to the faithfulness, there exists at least a link between ({Yi} ∪ Z_i) and ({Yj} ∪ Z_j) in the DAG; hence, by recursion, there exists a path between Yi and Yj such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e. a collider of the form Yp → X ← Yq).
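Under the faithfulness assumption, Theorem 7 yields a direct procedure for reading the minimal label powersets off a learned DAG: link two labels whenever they are adjacent in G or are both parents of a common feature node (a collider Yp → X ← Yq), then take connected components of that label graph. The following plain-Python sketch (with an assumed edge-list input format, not the authors' code) illustrates this reading:

```python
from collections import defaultdict, deque

def minimal_label_powersets(edges, labels):
    """Minimal label powersets via Theorem 7's path criterion.

    edges: iterable of directed edges (u, v) meaning u -> v in the DAG G.
    labels: set of label (Y) node names; every other node is a feature (X).
    Two labels are linked if they are adjacent in G, or are co-parents of
    a common feature node (collider Yp -> X <- Yq). Connected components
    of the resulting undirected label graph are the minimal powersets.
    """
    labels = set(labels)
    adj = defaultdict(set)       # undirected graph over the labels
    parents = defaultdict(set)   # node -> set of its parents in G
    for u, v in edges:
        parents[v].add(u)
        if u in labels and v in labels:
            adj[u].add(v)
            adj[v].add(u)
    for node, ps in parents.items():
        if node not in labels:   # feature node: link its label parents
            yps = [p for p in ps if p in labels]
            for i, a in enumerate(yps):
                for b in yps[i + 1:]:
                    adj[a].add(b)
                    adj[b].add(a)
    # connected components over the labels (BFS)
    seen, components = set(), []
    for y in labels:
        if y in seen:
            continue
        comp, queue = set(), deque([y])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            queue.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components
```

The reduction to connected components works because any allowed path decomposes into segments between consecutive Y nodes, each of which is either a direct edge or a Yp → X ← Yq collider.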
Theorem 8. Suppose p is faithful to a DAG G. Let Y_LP = {Y1, Y2, ..., Yn} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = ⋃_{j=1..n} (PC_Yj ∪ SP_Yj) \ Y.
Proof. First, we prove that M is a Markov boundary of Y_LP in U. Define M_i the Markov boundary of Yi in U, and M'_i = M_i \ Y. From Theorem 5, M_i is given in G by M_i = PC_Yi ∪ SP_Yi. We may now prove that M'_1 ∪ ... ∪ M'_n is a Markov boundary of {Y1, Y2, ..., Yn} in U. We show first that the statement holds for n = 2 and then conclude that it holds for all n by induction. Let W denote U \ (Y1 ∪ Y2 ∪ M'_1 ∪ M'_2). From the Markov blanket assumption for Y1 we have Y1 ⊥⊥ U \ (Y1 ∪ M_1) | M_1. Using the Weak union property, we obtain that Y1 ⊥⊥ W | M'_1 ∪ M'_2 ∪ Y2. Similarly we can derive Y2 ⊥⊥ W | M'_2 ∪ M'_1 ∪ Y1. Combining these two statements yields Y1 ∪ Y2 ⊥⊥ W | M'_1 ∪ M'_2 due to the Intersection property. Letting M = M'_1 ∪ M'_2, the last expression can be formulated as Y1 ∪ Y2 ⊥⊥ U \ (Y1 ∪ Y2 ∪ M) | M, which is the definition of a Markov blanket of Y1 ∪ Y2 in U. We shall now prove that M is minimal. Let us suppose that it is not the case, i.e., ∃ Z ⊂ M such that Y1 ∪ Y2 ⊥⊥ Z ∪ (U \ (Y1 ∪ Y2 ∪ M)) | M \ Z. Define Z_1 = Z ∩ M_1 and Z_2 = Z ∩ M_2; we may apply the Weak union property to get Y1 ⊥⊥ Z_1 | (M \ Z) ∪ Y2 ∪ Z_2 ∪ (U \ (Y ∪ M)), which can be rewritten more compactly as Y1 ⊥⊥ Z_1 | (M_1 \ Z_1) ∪ (U \ (Y1 ∪ M_1)). From the Markov blanket assumption, we have Y1 ⊥⊥ U \ (Y1 ∪ M_1) | M_1. We may now apply the Intersection property on these two statements to obtain Y1 ⊥⊥ Z_1 ∪ (U \ (Y1 ∪ M_1)) | M_1 \ Z_1. Similarly, we can derive Y2 ⊥⊥ Z_2 ∪ (U \ (Y2 ∪ M_2)) | M_2 \ Z_2. From the Markov boundary assumption of M_1 and M_2, we have necessarily Z_1 = ∅ and Z_2 = ∅, which in turn yields Z = ∅. To conclude for any n > 2, it suffices to set Y_1 = {Y1, ..., Y_{n−1}} and Y_2 = {Yn} and conclude by induction.

Second, we prove that M is a Markov blanket of Y_LP in X. Define Z = M ∩ Y. From the label powerset definition, we have Y_LP ⊥⊥ Y \ Y_LP | X. Using the Weak union property, we obtain Y_LP ⊥⊥ Z | U \ (Y_LP ∪ Z), which can be reformulated as Y_LP ⊥⊥ Z | (U \ (Y_LP ∪ M)) ∪ (M \ Z). Now, the Markov blanket assumption for Y_LP in U yields Y_LP ⊥⊥ U \ (Y_LP ∪ M) | M, which can be rewritten as Y_LP ⊥⊥ U \ (Y_LP ∪ M) | (M \ Z) ∪ Z. From the Intersection property, we get Y_LP ⊥⊥ Z ∪ (U \ (Y_LP ∪ M)) | M \ Z. From the Markov boundary assumption of M in U, we know that there exists no proper subset of M which satisfies this statement, and therefore Z = M ∩ Y = ∅. From the Markov blanket assumption of M in U, we have Y_LP ⊥⊥ U \ (Y_LP ∪ M) | M. Using the Decomposition property, we obtain Y_LP ⊥⊥ X \ M | M which, together with the fact that M ∩ Y = ∅, is the definition of a Markov blanket of Y_LP in X.

Finally, we prove that M is a Markov boundary of Y_LP in X. Let us suppose that it is not the case, i.e., ∃ Z ⊂ M such that Y_LP ⊥⊥ Z ∪ (X \ M) | M \ Z. From the label powerset assumption of Y_LP, we have Y_LP ⊥⊥ Y \ Y_LP | X, which can be rewritten as Y_LP ⊥⊥ Y \ Y_LP | (X \ M) ∪ (M \ Z) ∪ Z. Due to the Contraction property, combining these two statements yields Y_LP ⊥⊥ Z ∪ (U \ (M ∪ Y_LP)) | M \ Z. From the Markov boundary assumption of M in U, we have necessarily Z = ∅, which suffices to prove that M is a Markov boundary of Y_LP in X.
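Theorem 8's graphical characterization translates directly into code: gather the parents, children, and spouses (co-parents of a common child) of each label in the powerset, then discard the labels themselves. A minimal sketch with an assumed edge-list representation (not the published R implementation):

```python
from collections import defaultdict

def markov_boundary_of_powerset(edges, powerset):
    """Markov boundary of a label powerset in a DAG, per Theorem 8.

    edges: iterable of directed edges (u, v) meaning u -> v.
    powerset: the labels Y1..Yn whose joint Markov boundary is wanted.
    Returns PC (parents and children) plus SP (spouses, i.e. other
    parents of a common child) of every label, minus the labels.
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
        children[u].add(v)
    powerset = set(powerset)
    boundary = set()
    for y in powerset:
        boundary |= parents[y] | children[y]   # PC_y
        for c in children[y]:
            boundary |= parents[c]             # SP_y (co-parents)
    return boundary - powerset
```

Under faithfulness, this set would be the minimal conditioning set rendering the powerset independent of all remaining features, which is what makes it attractive for feature selection.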
Figure 3: The local BN structures learned by MMHC (left plot)and H2PC (right plot) on a single cross-validation split, onEmotions and Yeast.
Figure 4: The local BN structures learned by MMHC (left plot)and H2PC (right plot) on a single cross-validation split, onImage and Scene.
Figure 5: The local BN structures learned by MMHC (left plot)and H2PC (right plot) on a single cross-validation split, onSlashdot and Genbase.
Figure 6: The local BN structures learned by MMHC (left plot)and H2PC (right plot) on a single cross-validation split, onMedical and Enron.
computer
school
computing
la
concept
conceptual
context
aware
contexts
cortex
input
critical
data
genome
automation
consumption
power
average
technique
reduction
multi
policy
date
memory
reducing
dynamic
performance
embedded
design
creating
development
diffusion
ontology
hierarchical
assessment
identity
2006
basic
developmental
dynamics
physics
phys
exhibits
education
educational
learning
und
e
electrochemical
electrode
limits
polymer
current
direct
signal
vs
analytical
modified
range
oxidation
sensor
immobilized
immunoassay
antibody
linked
empirical
software
arise
engineering
proteins
identification
distinct
located
specific
fragments
expression
equation
evaluation
evaluating
a
evolution
requirements
evolutionary
changes
change
evolving
maintenance
web
share
tagging
formal
games
game
velocity
homogeneous
media
graph
interfaces
addition
separation
imaging
brain
medical
directions
visualization
image
multiple
images
matter
procedure
4
min
developed
assays
sensitivity
ml
antibodies
detection
assay
response
information
entropy
retrieval
mathematics
computers
teaching
constant
kinetic
antigen
knowledge
management
the
experiences
personal
languages
teachers
membrane
technology
was
access
devices
health
logic
description
reasoning
mapping
n
u
der
mathematical
references
culture
perspective
practices
environments
designing
methodology
metrics
model
models
modeling
international
modelling
proc
generation
06 network
equilibrium
extraction
documents
object
oriented
and
ontologies
to
pattern
patterns
phase
surface
process
programming
programs
hypermedia
course
standards
adaptation
adaptive
were
argued
novel
experiments
combination
principle
random
book
rdf
graphs
representation
representations
illustrated
re
specification
research
review
reviewed
synthesis
cause
enzymes
scientific
search
manner
semantics
sequence
partial
sequences
simulation
simulations
social
latent
probabilistic
source
statistics
xxiii
survey
system
systems
library
users
communication
adapted
theory
biological
dna
cells
financial
optimization
driven
formation
surfaces
temporal
growth
liquid
gap
glass
visual
views
exploration
graphical
w
0
workshop
background
first
p
02
associated
objects
vision
enabling
meaning
machine
automatically
discovery
training
situations
tools
experience
practice
tasks
activities
learned
learn
students
relevant
98
institute
statistical
letters
degrees
curves
chosen
conducted
prototype
fully
smaller
uniform
automated
similar
benefits
possibility
goal
discrete
expected
region
equal
connected
expansion
depend
considered
leads
content
strongly
short
origin
linear
introduced
corresponding
particular
shown
chains
induced
wave
bulk
forces
motion
stationary
force
local
coupling
observed
square
parameters
generalized
interacting
relaxation
also
symmetry
dependence
states
exact
particle
thermodynamic
equations
point
mean
are
scaling
field
numerical
finite
techniques
on
vector
we
proposed
problem
efficient
based
bioinformatics
produced
determination
this
developers
group
mediated
differential
an
flow
substrate
by
comparative
analyze
revealed
analyzing
analyses
is
of
that
combiningstability
improved
elements
domainsinfluence
components
importance
characteristics
highly
main
processes
four
further
significant
defined
among
including
applications
under
several
three
properties
s
its
other
different
both
between
for
daily
age
conclusion
clinical
creation
anti
directly
determine
capture
mixture
features
greater
detect
free
50
using
against
j
concentrations
bound
concentration
affinity
binding
recognition
directed
cell
previously
expressed
role
thermal
non
action
law
annual
rate
values
overall
identical
contrast
20
normal
amino
assessed
low
respectively
res
v
view
close
g
1noise
abstract
2005
lettour
matching
approaches
coefficients
coefficient
variations
national
genetic
genes
than
results
interact
fraction
charged
fluid
sample
as
subject
risk
individuals
major
beta
run
reference
those
l
factors
interaction
activity
extent
biol
profile
40
levels
total
amounts
composition
very
rich
contained
enhanced
modification
had
decreased
recognized
increased
receptor
paper
built
execution
platform
infrastructure
reveals
turn
kinds
namely
characterization
relatively
light
complete
across
lack
little
involved
various
space
out
terms
large
about
present
or
who
his
1998
least
european
instance
transactions
procedures
allows
show
control
behavioursiamese
splendens
processing
rev
behaviors
differ
24
pairs
trans
performed
base
hard
shift
weight
small
structural
atomic
mechanics
water
minimumground
state
improvement
mixed
iii
dependent
whereas
double
gradient
be
notes
progress
order
two
organized
distribution
magnetic
25
side
decreasepathway
acids
ions
cluster
functions
contribution
their
predict
prediction
predictions
fit
y 2
10
alpha
theoretically
effect
attractive
sites
lattice
residues
high
at
mechanism
with
has
have
implemented
identified
far
past
however
recently
frequency
configuration
modeled
intelligence
framework
no
quality
formed
case
qualitative
character
findings
intermediate
occurs
transitions
temperature
r
strong
parameter
good
correlated
correlations
5
7
8
18
11
12
3
restricted
computation
cost
excellent
experimentally
full
strength
boltzmann
applicable
reliable
best
reverse
kinetics
realistic
one
extendedconventional
estimate
applied
measurement
estimated
society
solution
combines
chain
indicates
intrinsic
involving
path
step
reserved
h
2003
2000
1997
o
d
systematic
mm
environment
give
body
treated
fock
classical
studies
obtain
tested
ab
nodes
emergence
links
degree
scale
topology
young
early
year
skills
years
play
care
developing
number
probabilityopportunities
document
articles
clusters
composed
parallel
controlled
selected
actual
line
positive
resulting
during
product
standard
availablefinally
variety
presented
single
within
function
simple
when
into
new
time
strategies
arguetheories
user
virtual
impact
technical
functionality
decisions
goals
building
project
designed
principles
implementation
co
supporting
online
success
members
errors
participants
relationships
relations
ph
determined
reduce
symposium
2002
joint
conditions
acm www
definition
higher
measure
measures
not
natural
defining
acquisition
written
query
xml
decision
sense
supported
reports
ieee
internet
mobile
de
des
descriptions
concepts
rapidly
situation
there
left
area
period
layer
responses
maps
primary
orientation
output
arbitrary
exponents
from
prior
database
sources
collected
databases
whole
achieve
strategy
off
i
sizes
error
distance
agent
rather
test
comprehensive
storage
static
task
precision
better
efficiency
created
challenges
create
successful
develop
volume
mobility
domain
enable
assess
reliability
choice
examined
ability
dynamicalsimulated
fluctuations
regime
2001
1999
while
settings
die
dem
auch
sich
eine
sind
f
ist
mit
zur
als
den
auf
im
zu
von
limit
some
future
art
trends
currently
technologies
30
wide
long
broad monitoring
contact
measurements
resonance
onto
advantages
sensitive
detected
evidence
identify
industrial
reuse
program
managing
projects
code
conf
testing
changing
supports
needs
discuss
section
integrated
concernssupport
each
derived
begin
evaluated
evaluate
criteria
researchers
every
biology
selection
events
world
pages
topic
http
usage
intelligent
tool
sharing navigation
resources
services
they
generated
popular
look
video
market
others
heterogeneous
multimedia
medium
interface
specificity
resolution
areas
regions
since
interactive
seenonlinear
17
used
protocoldevice
quantitative
cross
solid
100
achieved
rapid
described
format
being
types
enables
existing
organizational
provides
people
provide
production
maximum
indexing
life
historical
ideas
solving
apparent
offers
measuring
amount
maintaining
integrating
making
construction
representing
providing
know
service
organizations
business
al
general
over
regardingmight
technological
innovation
what
digital
demonstrated
indicated
investigated
did
showed
conclusions
describing
already
parts
roles
map
such
sci
m
cultural
distributions
these
unified
int
next
generating
automatic
presentation
resource
distributed
connections
global
aspect
applying
family
establish
extend
consequences
construct
entire
facilitate
meaningful
allowing
secondary
comparable
completely
center
easy
perform
generate
constructed
interesting
generally
extensive
itself
required
appear
leading
furthermore
established
additional
upon
points
groups
limited
alternative
rates
likely
requires
increasing
examine
following
mechanismsuses
provided
initial
improve
together
key
individual
become
investigate
respect
lead
value
increase
must
allow
need
therefore
made
type
understand
according
effective
application
make
without
thus
up
able
way
them
most
related
alldue
use
but
more
it
which
many
will
you
clear
right
work
shows
like
find
phenomena
diagramphases
continuous
etc
included
day
60
suggesting
subsequent
can
analyzed
affected
15
five
seven
either
differences
less
having
special
answer
issues
foundation
academic
needed
papers
focuses
topics
discusses
efforts
emerging
questions
overviewunderstanding
how
developments
literature
occur
engine
relevance
queries
now
explicit
define
length
peptide
monte
behavioral
public
kind
political
organization
sciences
class
advances
libraries
communications
discussion
bases
markets
spatial
relation
growing
crystal
materials
attention
exploring
explore
clearly
13
where
lower
t
9
towards
then
(a) Bibtex
city
mountain
sky
sun
water
clouds
tree
bay
lake
sea
beach
boats
people
branch
leaf
grass
plain
palm
horizon
shell
hills
waves
birds
land
dog
bridge
ships
buildings
fence
island
storm peaks
jet
plane
runway
basket
flight
flag
helicopter
boeing
prop
f−16
tails
smoke
formation
bear polar
snow
tundra
ice
head
black
reflection
ground
forest
fall
river
field
flowers
stream
meadow
rocks
hillside
shrubs
close−up
grizzly
cubs
drum
log
hut
sunset
display
plants
pool
coral
fan
anemone
fish
ocean
diver
sunrise
face
sand
rainbow
farms
reefs
vegetation
house
village
carvings
path
wood
dress
coast
sailboats
cattiger
bengal
fox
kit
run
shadows winter
autumn
cliff
bush
rockface
pair
den
coyote
light
arctic
shore
town
road
chapel
moon
harbor
windmills
restaurant
wall
skyline
window
clothes
shops
street
cafe
tables
nets
crafts
roofs
ruins
stone
cars
castle
courtyard
statue
stairs
costume
sponges
sign
palace
paintings
sheep
valley
balcony
post
gate
plaza
festival
temple
sculpture
museum
hotel
art
fountain
market
door
mural
garden
star
butterfly
angelfish
lion
cave
crab
grouper
pagoda
buddha
decoration
monastery
landscape
detail
writing
sails
food
room
entrance
fruit
night
perch
cow
figures
facadechairs
guard
pond
church
park
barn
arch
hats
cathedral
ceremony
crowd glass
shrine
model
pillar
carpet
monument
floor
vines cottage
poppies
lawn
tower
vegetables
bench
rose
tulip
canal
cheese
railing
dock
horses
petals
umbrella
column
waterfalls
elephant
monks pattern
interior
vendor
silhouette
architecture
blossoms
athlete
parade
ladder
sidewalk
store
steps
relief
fog
frost
frozen
rapids
crystals
spider needles
stick
mist
doorway
vineyard
pottery
pots
military
designs
mushrooms
terrace
tent
bulls
giant
tortoise
wingsalbatross
booby
nest
hawk
iguana
lizard
marine
penguin
deer
white−tailed
horns
slope
mule
fawn
antlers
elk
caribou
herd
moose
clearing
mare
foals
orchid
lily
stems
row
chrysanthemums
blooms
cactus
saguarogiraffe
zebra
tusks
hands
train
desertdunes
canyon
lighthousemast
seals
texture
dust
pepper
swimmers
pyramid
mosque
sphinx
truck
fly
trunk
baby
eagle
lynx
rodent
squirrel
goat
marsh
wolf
pack
dall
porcupine
whales
rabbit
tracks
crops
animals
moss
trail
locomotive
railroad
vehicle aerial range
insect
manwoman
rice
prayer
glacier
harvest
girl
indian
pole
dance
africanshirt
buddhist
tomboutside
shade
formula
turn
straightaway
prototype
steel
scotland
ceiling
furniture
lichen
pups
antelope
pebbles
remains leopard
jeep
calf
reptile
snake
cougar
oahu
kauai
maui
school
canoe
race
hawaii
Cluster442
Cluster380
Cluster20
Cluster317
Cluster243
Cluster157
Cluster364
Cluster423
Cluster21
Cluster133
Cluster113
Cluster93
Cluster412
Cluster383
Cluster399
Cluster408
Cluster252
Cluster268
Cluster426
Cluster170
Cluster446
Cluster241
Cluster166
Cluster478
Cluster82
Cluster137
Cluster341
Cluster145
Cluster471
Cluster96
Cluster79
Cluster230
Cluster260
Cluster461
Cluster362
Cluster409
Cluster492
Cluster42
Cluster59
Cluster205
Cluster458
Cluster81
Cluster347
Cluster129
Cluster427
Cluster348
Cluster128
Cluster101
Cluster235
Cluster351
Cluster165
Cluster390
Cluster150
Cluster39
Cluster144
Cluster392
Cluster70
Cluster493
Cluster9
Cluster305
Cluster281
Cluster193
Cluster267
Cluster126
Cluster232
Cluster159
Cluster138
Cluster173Cluster482
Cluster22
Cluster29
Cluster49
Cluster470
Cluster353
Cluster425Cluster217
Cluster326
Cluster209
Cluster110
Cluster278
Cluster322
Cluster256
Cluster185
Cluster282
Cluster431
Cluster17
Cluster117
Cluster318
Cluster114
Cluster149
Cluster244
Cluster95
Cluster56
Cluster331Cluster142
Cluster424
Cluster479
Cluster356
Cluster286
Cluster313
Cluster344
Cluster201
Cluster291
Cluster457
Cluster420
Cluster330
Cluster23
Cluster62
Cluster498
Cluster405
Cluster443
Cluster214
Cluster152
Cluster1
Cluster394
Cluster284
Cluster190
Cluster439
Cluster345
Cluster227
Cluster436
Cluster308
Cluster195
Cluster403
city
mountain
sky
sun
water
clouds
tree
bay
lake
sea
beach
boats
people
branch
leaf
grass
plain
palm
horizon
shell
hills
waves
birds
land
dog
bridge
ships
buildings
fence
island
storm
peaks
jet
plane
runway
basket
flight
flag
helicopter
boeing
prop
f−16
tails
smokeformation
bear
polar
snow
tundra
ice
head
black
reflection
ground
forest
fall
river
field
flowers
stream
meadow
rocks
hillside
shrubs
close−up
grizzly
cubs
drum
log
hut
sunset
display
plants
pool
coral
fan
anemone
fish
ocean
diver
sunrise
face
sand
rainbow
farms
reefs
vegetation
house
village
carvings
path
wood
dress
coast
sailboats
cat
tiger
bengal
fox
kit run
shadows
winter
autumn
cliff
bush
rockface
pair
den
coyote
light
arctic
shore
town
road
chapel
moon
harbor
windmills
restaurant
wall
skyline
window
clothes
shops
street
cafe
tables
nets
crafts
roofs
ruins
stone
cars
castle
courtyard
statuestairs
costume
sponges
sign
palace
paintings
sheep
valley
balcony
post
gate
plaza
festival
temple
sculpture
museum
hotel
art
fountain
market
door
mural
garden
star
butterfly
angelfish
lion
cave
crab
grouper
pagoda
buddha
decoration
monastery
landscape
detail
writing
sails
food
room
entrance
fruit
night
perch
cow
figures
facade
chairs
guard
pond
church
park
barnarchhats
cathedral
ceremony
crowd
glass
shrine
model
pillar
carpet
monument
floor
vines
cottage
poppies
lawn
tower
vegetables
bench
rose
tulip
canal
cheese
railing
dock
horses
petals
umbrella
column
waterfalls
elephant
monks
pattern
interior
vendor
silhouette
architecture
blossoms
athlete
parade
ladder
sidewalk
store
steps
relief
fog
frostfrozen
rapids
crystals
spider
needles
stick
mist
doorway
vineyard
pottery
pots
military
designs
mushrooms
terrace
tent
bulls
gianttortoise
wings
albatross
booby
nest
hawk
iguana
lizard
marine
penguin
deer
white−tailed
horns
slope
mule
fawn
antlers
elk
caribouherd
moose
clearing
marefoals
orchidlily
stems
row
chrysanthemums
blooms
cactus
saguaro
giraffe
zebra
tusks
hands
train
desert
dunes
canyon
lighthouse
mast
seals
texture
dustpepper
swimmers
pyramid
mosque
sphinx
truck
fly
trunk
baby
eagle
lynx
rodent
squirrel
goat
marsh
wolf
pack
dall
porcupine
whales
rabbit
tracks
crops
animals
moss
trail
locomotiverailroad
vehicleaerial
range
insect
man
woman
rice
prayer
glacier
harvest
girl
indian
pole
dance
african
shirt
buddhist
tomb
outside
shade
formula
turnstraightaway
prototype
steel
scotland
ceilingfurniture
lichen
pups
antelope
pebbles
remains
leopard
jeep
calf
reptile
snake
cougar
oahu
kauai
maui
school
canoe race
hawaii
Cluster429Cluster287
Cluster94
Cluster352
Cluster96
Cluster395
Cluster442
Cluster433
Cluster118
Cluster365
Cluster83
Cluster106
Cluster489
Cluster71
Cluster89
Cluster390 Cluster165
Cluster495
Cluster380
Cluster20
Cluster120
Cluster471
Cluster57
Cluster144
Cluster150
Cluster60
Cluster449
Cluster411
Cluster224
Cluster275
Cluster84
Cluster53
Cluster88
Cluster325
Cluster50
Cluster112
Cluster288
Cluster317Cluster9
Cluster490
Cluster66
Cluster214
Cluster397
Cluster100
Cluster336
Cluster85
Cluster18
Cluster74
Cluster413
Cluster297
Cluster364
Cluster11
Cluster369
Cluster243
Cluster185
Cluster273
Cluster427
Cluster279
Cluster194
Cluster385
Cluster231
Cluster396
Cluster5
Cluster157
Cluster22
Cluster423
Cluster159
Cluster121
Cluster180
Cluster498
Cluster496 Cluster466
Cluster358
Cluster432
Cluster254
Cluster366
Cluster456
Cluster62
Cluster493
Cluster21
Cluster494
Cluster282
Cluster492
Cluster228
Cluster133
Cluster258
Cluster204
Cluster274
Cluster440
Cluster444
Cluster113
Cluster460
Cluster412Cluster383
Cluster399
Cluster326
Cluster148
Cluster192
Cluster19
Cluster459
Cluster278
Cluster408
Cluster441
Cluster463
Cluster482
Cluster376
Cluster304
Cluster252
Cluster268
Cluster27
Cluster149
Cluster426
Cluster410
Cluster79
Cluster318
Cluster170
Cluster55
Cluster166
Cluster321
Cluster389
Cluster333
Cluster377
Cluster52
Cluster241
Cluster138
Cluster345
Cluster163
Cluster398
Cluster294
Cluster154
Cluster437
Cluster414
Cluster48
Cluster417
Cluster331
Cluster2
Cluster415
Cluster467
Cluster12
Cluster478
Cluster209
Cluster174
Cluster82
Cluster137
Cluster354
Cluster341
Cluster123
Cluster308
Cluster238
Cluster465
Cluster117
Cluster43
Cluster139 Cluster439
Cluster379
Cluster462
Cluster178
Cluster72
Cluster115
Cluster458
Cluster260
Cluster461
Cluster234
Cluster436
Cluster182
Cluster169
Cluster359
Cluster285Cluster362
Cluster51
Cluster38
Cluster244
Cluster263
Cluster295
Cluster59
Cluster205
Cluster301
Cluster293
Cluster420
Cluster230
Cluster281
Cluster160
Cluster242
Cluster37
Cluster108Cluster200
Cluster81
Cluster438
Cluster227
Cluster292
Cluster191
Cluster322
Cluster406
Cluster448
Cluster349
Cluster450
Cluster15
Cluster291
Cluster129
Cluster394
Cluster491
Cluster348
Cluster128
Cluster101
Cluster305
Cluster130
Cluster45
Cluster220
Cluster4
Cluster486
Cluster156
Cluster235
Cluster221
Cluster386
Cluster316
Cluster151
Cluster351
Cluster276
Cluster381
Cluster289
Cluster422
Cluster404
Cluster481
Cluster173
Cluster267
Cluster102Cluster320
Cluster393
Cluster454
Cluster127
Cluster477
Cluster499
Cluster378
Cluster430
Cluster35
Cluster136
Cluster416
Cluster98
Cluster119
Cluster7Cluster264
Cluster271
Cluster39
Cluster392
Cluster330
Cluster391
Cluster114
Cluster186
Cluster164
Cluster58
Cluster162
Cluster246
Cluster483
Cluster14
Cluster435
Cluster199
Cluster215
Cluster6
Cluster111
Cluster70
Cluster283Cluster286
Cluster167
Cluster451
Cluster239
Cluster202
Cluster25
Cluster306
Cluster211
Cluster168
Cluster217
Cluster457
Cluster146
Cluster485
Cluster219
Cluster47
Cluster155
Cluster67
Cluster193
Cluster363
Cluster40
Cluster75
Cluster455
Cluster299
Cluster405
Cluster323
Cluster372
Cluster468
Cluster475
Cluster313
Cluster104
Cluster223
Cluster311
Cluster124
Cluster213
Cluster8
Cluster122
Cluster284
Cluster190
Cluster350
Cluster181
Cluster210
Cluster41
Cluster434
Cluster183
Cluster474
Cluster126
Cluster44
Cluster232
Cluster147
Cluster63
Cluster342
Cluster46
Cluster140
Cluster309
Cluster403
Cluster470
Cluster357
Cluster152
Cluster207
Cluster340
Cluster203 Cluster65
Cluster262
Cluster233
Cluster269
Cluster387
Cluster68
Cluster34
Cluster208
Cluster23
Cluster375
Cluster29
Cluster310
Cluster49Cluster30
Cluster1
Cluster353
Cluster425
Cluster327
Cluster110
Cluster497
Cluster402
Cluster257
Cluster476
Cluster329Cluster266
Cluster197
Cluster175
Cluster76
Cluster256
Cluster179
Cluster338
Cluster222
Cluster187
Cluster370
Cluster384
Cluster277
Cluster431
Cluster17
Cluster109
Cluster265
Cluster447
Cluster407
Cluster226
Cluster56
Cluster360
Cluster153
Cluster184
Cluster445
Cluster400
Cluster382
Cluster91
Cluster247
Cluster255
Cluster142
Cluster424
Cluster479Cluster421
Cluster356
Cluster196
Cluster97
Cluster195
Cluster300Cluster335
Cluster361
Cluster206
Cluster259
Cluster480
Cluster3
Cluster347
Cluster419
Cluster107
Cluster443
Cluster87
Cluster240
Cluster24
Cluster280
Cluster61
Cluster225
Cluster343 Cluster332
Cluster270
Cluster296
Cluster488
Cluster16
Cluster116
Cluster302
Cluster26
Cluster487
Cluster171
Cluster143
Cluster249
Cluster77
Cluster464
(b) Corel5k
Figure 7: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Bibtex and Corel5k.
References
Agresti, A. (2002). Categorical Data Analysis. Wiley, 2nd edition.
Aliferis, C. F., Statnikov, A. R., Tsamardinos, I., Mani, S., & Koutsoukos, X. D. (2010). Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, JMLR, 11, 171–234.
Alvares-Cherman, E., Metz, J., & Monard, M. (2011). Incorporating label dependency into the binary relevance framework for multi-label classification. Expert Systems With Applications, ESWA, 39, 1647–1655.
Armen, A. P., & Tsamardinos, I. (2011). A unified approach to estimation and control of the false discovery rate in Bayesian network skeleton identification. In European Symposium on Artificial Neural Networks, ESANN.
Aussem, A., Rodrigues de Morais, S., & Corbex, M. (2012). Analysis of nasopharyngeal carcinoma risk factors with Bayesian networks. Artificial Intelligence in Medicine, 54.
Aussem, A., Tchernof, A., Rodrigues de Morais, S., & Rome, S. (2010). Analysis of lifestyle and metabolic predictors of visceral obesity with Bayesian networks. BMC Bioinformatics, 11, 487.
Badea, A. (2004). Determining the direction of causal influence in large probabilistic networks: A constraint-based approach. In Proceedings of the Sixteenth European Conference on Artificial Intelligence (pp. 263–267).
Bernard, A., & Hartemink, A. (2005). Informative structure priors: Joint learning of dynamic regulatory networks from multiple types of data. In Proceedings of the Pacific Symposium on Biocomputing (pp. 459–470).
Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Top-down induction of clustering trees. In J. W. Shavlik (Ed.), International Conference on Machine Learning, ICML (pp. 55–63). Morgan Kaufmann.
Borchani, H., Bielza, C., Toro, C., & Larrañaga, P. (2013). Predicting human immunodeficiency virus inhibitors using multi-dimensional Bayesian network classifiers. Artificial Intelligence in Medicine, 57, 219–229.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brown, L. E., & Tsamardinos, I. (2008). A strategy for making predictions under manipulation. Journal of Machine Learning Research, JMLR, 3, 35–52.
Buntine, W. (1991). Theory refinement on Bayesian networks. In B. D'Ambrosio, P. Smets, & P. Bonissone (Eds.), Uncertainty in Artificial Intelligence, UAI (pp. 52–60). San Mateo, CA, USA: Morgan Kaufmann Publishers.
Cawley, G. (2008). Causal and non-causal feature selection for ridge regression. JMLR: Workshop and Conference Proceedings, 3, 107–128.
Cheng, J., Greiner, R., Kelly, J., Bell, D. A., & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137, 43–90.
Chickering, D., Heckerman, D., & Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, JMLR, 5, 1287–1330.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, JMLR, 3, 507–554.
Cussens, J., & Bartlett, M. (2013). Advances in Bayesian Network Learning using Integer Programming. In Uncertainty in Artificial Intelligence (pp. 182–191).
Dembczyński, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2012). On label dependence and loss minimization in multi-label classification. Machine Learning, 88, 5–45.
Ellis, B., & Wong, W. H. (2008). Learning causal Bayesian network structures from experimental data. Journal of the American Statistical Association, 103, 778–789.
Friedman, N., Nachman, I., & Pe'er, D. (1999a). Learning Bayesian network structure from massive datasets: The sparse candidate algorithm. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 206–215).
Friedman, N., Nachman, I., & Pe'er, D. (1999b). Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm. In K. B. Laskey, & H. Prade (Eds.), Uncertainty in Artificial Intelligence, UAI (pp. 21–30). Morgan Kaufmann Publishers.
Gasse, M., Aussem, A., & Elghazel, H. (2012). Comparison of hybrid algorithms for Bayesian network structure learning. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD (pp. 58–73). Springer, volume 7523 of Lecture Notes in Computer Science.
Gu, Q., Li, Z., & Han, J. (2011). Correlated multi-label feature selection. In C. Macdonald, I. Ounis, & I. Ruthven (Eds.), CIKM (pp. 1087–1096). ACM.
Guo, Y., & Gu, S. (2011). Multi-label classification using conditional dependency networks. In International Joint Conference on Artificial Intelligence, IJCAI (pp. 1300–1305). AAAI Press.
Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.
Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2007). Ensembles of multi-objective decision trees. In J. N. Kok (Ed.), European Conference on Machine Learning, ECML (pp. 624–631). Springer, volume 4701 of Lecture Notes in Artificial Intelligence.
Koivisto, M., & Sood, K. (2004). Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, JMLR, 5, 549–573.
Kojima, K., Perrier, E., Imoto, S., & Miyano, S. (2010). Optimal search on clustered structural constraint for learning Bayesian network structure. Journal of Machine Learning Research, JMLR, 11, 285–310.
Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In International Conference on Machine Learning, ICML (pp. 284–292).
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18–22.
Luaces, O., Díez, J., Barranquero, J., del Coz, J. J., & Bahamonde, A. (2012). Binary relevance efficacy for multilabel classification. Progress in AI, 1, 303–313.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45, 3084–3104.
Maron, O., & Ratan, A. L. (1998). Multiple-Instance Learning for Natural Scene Classification. In International Conference on Machine Learning, ICML (pp. 341–349). Citeseer volume 7.
McCallum, A. (1999). Multi-label text classification with a mixture model trained by EM. In AAAI Workshop on Text Learning.
Moore, A., & Wong, W. (2003). Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In T. Fawcett, & N. Mishra (Eds.), International Conference on Machine Learning, ICML.
Nagarajan, R., Scutari, M., & Lèbre, S. (2013). Bayesian Networks in R: with Applications in Systems Biology (Use R!). Springer.
Neapolitan, R. E. (2004). Learning Bayesian Networks. Upper Saddle River, NJ: Pearson Prentice Hall.
Ott, S., Imoto, S., & Miyano, S. (2004). Finding optimal models for small gene networks. In Proceedings of the Pacific Symposium on Biocomputing (pp. 557–567).
Peña, J., Nilsson, R., Björkegren, J., & Tegnér, J. (2007). Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 45, 211–232.
Peña, J. M., Björkegren, J., & Tegnér, J. (2005). Growing Bayesian network models of gene networks from seed genes. Bioinformatics, 40, 224–229.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann.
Peña, J. M. (2008). Learning Gaussian graphical models of gene networks with false discovery rate control. In European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (pp. 165–176).
Peña, J. M. (2012). Finding consensus Bayesian network structures. Journal of Artificial Intelligence Research, 42, 661–687.
Perrier, E., Imoto, S., & Miyano, S. (2008). Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research, JMLR, 9, 2251–2286.
Pe'er, D., Regev, A., Elidan, G., & Friedman, N. (2001). Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17, 215–224.
Prestat, E., Rodrigues de Morais, S., Vendrell, J., Thollet, A., Gautier, C., Cohen, P., & Aussem, A. (2013). Learning the local Bayesian network structure around the ZNF217 oncogene in breast tumours. Computers in Biology and Medicine, 4, 334–341.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org.
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD (pp. 254–269). Springer, volume 5782 of Lecture Notes in Computer Science.
Rodrigues de Morais, S., & Aussem, A. (2010a). An efficient learning algorithm for local Bayesian network structure discovery. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD (pp. 164–169).
Rodrigues de Morais, S., & Aussem, A. (2010b). A novel Markov boundary based feature subset selection algorithm. Neurocomputing, 73, 578–584.
Roth, V., & Fischer, B. (2007). Improved functional prediction of proteins by learning kernel combinations in multilabel settings. BMC Bioinformatics, 8, S12+.
Schwarz, G. E. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Scutari, M. (2010). Learning bayesian networks with the bnlearn Rpackage.Journal of Statistical Software, 35, 1–22.
Scutari, M. (2011). Measures of Variability for Graphical Models.Ph.D. thesis School in Statistical Sciences, University ofPadova.
Scutari, M., & Brogini, A. (2012). Bayesian network structure learning with permutation tests. Communications in Statistics - Theory and Methods, 41, 3233–3243.
Scutari, M., & Nagarajan, R. (2013). Identifying significant edges in graphical models of molecular networks. Artificial Intelligence in Medicine, 57, 207–217.
Silander, T., & Myllymaki, P. (2006). A Simple Approach for Finding the Globally Optimal Bayesian Network Structure. In Uncertainty in Artificial Intelligence, UAI (pp. 445–452).
Snoek, C., Worring, M., Gemert, J. V., Geusebroek, J., & Smeulders, A. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In ACM International Conference on Multimedia (pp. 421–430). ACM Press.
Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search. (2nd ed.). The MIT Press.
Spolaor, N., Cherman, E. A., Monard, M. C., & Lee, H. D. (2013). A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science, 292, 135–151.
Studeny, M., & Haws, D. (2014). Learning Bayesian network structure: Towards the essential graph by integer linear programming tools. International Journal of Approximate Reasoning, 55, 1043–1071.
Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multi-label classification of music into emotions. In ISMIR (pp. 325–330).
Tsamardinos, I., Aliferis, C., & Statnikov, A. (2003). Algorithms for large scale Markov blanket discovery. In Florida Artificial Intelligence Research Society Conference, FLAIRS'03 (pp. 376–381).
Tsamardinos, I., & Borboudakis, G. (2010). Permutation testing improves Bayesian network learning. In European Conference on Machine Learning and Knowledge Discovery in Databases, ECML-PKDD (pp. 322–337).
Tsamardinos, I., Brown, L., & Aliferis, C. (2006). The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65, 31–78.
Tsamardinos, I., & Brown, L. E. (2008). Bounding the false discovery rate in local Bayesian network learning. In AAAI Conference on Artificial Intelligence (pp. 1100–1105).
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010a). Mining multi-label data. In Data Mining and Knowledge Discovery Handbook (pp. 667–685). Springer.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010b). Random k-labelsets for Multi-Label Classification. IEEE Transactions on Knowledge and Data Engineering, TKDE, 23, 1–12.
Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. In Proceedings of the 18th European Conference on Machine Learning, 4701, 406–417.
Villanueva, E., & Maciel, C. (2012). Optimized algorithm for learning Bayesian network superstructures. In International Conference on Pattern Recognition Applications and Methods, ICPRAM.
Villanueva, E., & Maciel, C. D. (2014). Efficient methods for learning Bayesian network super-structures. Neurocomputing, 3–12.
Zhang, M. L., & Zhang, K. (2010). Multi-label learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD (pp. 999–1008). ACM Press.
Zhang, M.-L., & Zhou, Z.-H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18, 1338–1351.