
CHAPTER 36

INFERENCE OF GENE REGULATORY NETWORKS BASED ON ASSOCIATION RULES

CRISTIAN ANDRÉS GALLO,1 JESSICA ANDREA CARBALLIDO,1 and IGNACIO PONZONI1,2

1 Laboratorio de Investigación y Desarrollo en Computación Científica (LIDeCC), Dept. Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina
2 Planta Piloto de Ingeniería Química (PLAPIQUI) CONICET, Bahía Blanca, Argentina

36.1 INTRODUCTION

The most important and widespread mechanism used by cells to regulate molecular functions or biological processes is the coordinated transcriptional and posttranscriptional network of the interacting genes or their products. In this way and under the command of transcription factors (TFs), each gene influences the activity of the cell by generating messenger RNA (mRNA) that guides the synthesis of proteins by ribosomes in the cytoplasm. Some of these gene products are themselves TFs that return to the nucleus (in eukaryotes) to control the expression of one or several genes. This complicated means of controlling gene expression can be represented as a gene regulatory network (GRN). GRNs are complex interaction maps that describe putative associations among gene products which orchestrate the functions of the living organism. The reverse engineering of GRNs is a paradigm with great promise for analyzing and constructing biological networks [1–3]; it is an effective way of utilizing experimental data to determine the underlying network of a given model and constitutes an open research problem in bioinformatics.

Gene network modeling uses gene expression profiling data to describe the phenotypic behavior of a system under study. In order to reconstruct such a network, the procedure involves altering the gene network in some way, observing the outcome, and using computational methods to infer the underlying principles of the network. In this context, data-mining methods are suitable approaches for performing the reverse engineering of these relational structures and, in particular, these reconstruction strategies can benefit from the application of association rule (AR) extraction techniques. Basically, an AR establishes a causal link between two or more variables, where the semantics and the interpretation of the rule depend on the input data and on the mechanisms employed for inferring the association. ARs have been extensively used for discovering interesting relationships between variables in large data sets [4]. In bioinformatics, these methods can be used to reveal biologically relevant associations among genes, at diverse environmental conditions or time point observations, from different microarray samples [5–7].

Biological Knowledge Discovery Handbook: Preprocessing, Mining, and Postprocessing of Biological Data, First Edition. Edited by Mourad Elloumi and Albert Y. Zomaya. © 2014 John Wiley & Sons, Inc. Published 2014 by John Wiley & Sons, Inc.

This chapter focuses on gene regulation and the ways that transcriptome data can be used to unravel the complex relationships between the genes that comprise a GRN. In particular, it describes the main topics that must be considered in the field of AR mining for reverse engineering of GRNs and presents the state-of-the-art techniques currently available in the literature. The organization of the chapter is as follows. In Section 36.2 the central concepts about AR mining for GRN reconstruction together with various other relevant issues are presented and discussed. In Section 36.3, different data-mining approaches used for AR inference are reviewed. Finally, in Section 36.4, the conclusions and final remarks are summarized.

36.2 DATA MINING AND INFERENCE OF GRNs BASED ON ARs

A GRN is one kind of causal regulatory network. Others include protein networks and metabolic processes [8]. GRNs have a messily robust structure as a consequence of evolution [9]. A GRN can be represented as a directed graph [10], in which the set of vertices $N$ represents the genes and the set of edges $E$ describes the regulatory relationships between each pair of genes. GRNs may also be modeled as undirected graphs [11], although the true underlying regulatory network is better represented as a directed graph. Each edge may also be decorated with additional information, such as the type of regulation (activation or inhibition) and/or the time lag in the regulation, among others. Figure 36.1 shows an example of a GRN represented as a directed graph.

Another common way to represent a GRN is by means of a list of ARs. In this way, each interaction between genes is represented as a rule of the form $g_i \to g_j$, where $g_i, g_j \in N$, and $g_i$ is the regulator gene whereas $g_j$ is the target gene. Similar to the graph representation, the ARs may contain additional information regarding the interaction between the genes. Moreover, given a graph representing a GRN, an equivalent list of ARs can be directly obtained through the edge set of the graph, with an AR for each edge. If the GRN of Figure 36.1 is considered, the following ARs represent the same GRN: +g1 → +g4, +g4 → +g1, −g4 → +g2, +g2 → +g2, +g4 → +g5, +g5 → −g3, −g3 → +g5, +g4 → +g6, +g6 → −g7, +g7 → −g4, +g7 → −g8. In the rest of this chapter, we will refer to edges and ARs interchangeably, since both represent the same kind of information.

FIGURE 36.1 GRN represented as a directed graph. The direction of the edge indicates the regulatory role (regulator or target) of the genes in each interaction. A + (−) symbol on the left side of the edge label indicates up regulation (down regulation) of the regulator gene, whereas a + (−) symbol on the right side indicates activation (inhibition) of the target gene.
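The equivalence between the edge set and the rule list can be sketched in a few lines of Python. The tuple layout (regulator, regulator state, target, effect on target), the function names, and the ASCII rule syntax are assumptions made for this illustration; only the eleven rules themselves come from the text.

```python
# Each edge is (regulator, regulator_state, target, target_effect).
# This tuple layout is a hypothetical encoding chosen for the sketch.
EDGES = [
    ("g1", "+", "g4", "+"),
    ("g4", "+", "g1", "+"),
    ("g4", "-", "g2", "+"),
    ("g2", "+", "g2", "+"),
    ("g4", "+", "g5", "+"),
    ("g5", "+", "g3", "-"),
    ("g3", "-", "g5", "+"),
    ("g4", "+", "g6", "+"),
    ("g6", "+", "g7", "-"),
    ("g7", "+", "g4", "-"),
    ("g7", "+", "g8", "-"),
]


def edges_to_rules(edges):
    """Return one association rule string per edge, in edge order."""
    return [f"{rs}{reg} -> {te}{tgt}" for reg, rs, tgt, te in edges]


def rules_to_edges(rules):
    """Inverse mapping: parse rule strings back into edge tuples."""
    edges = []
    for rule in rules:
        left, right = (side.strip() for side in rule.split("->"))
        # First character is the sign, the rest is the gene name.
        edges.append((left[1:], left[0], right[1:], right[0]))
    return edges


rules = edges_to_rules(EDGES)
print(rules[0])  # +g1 -> +g4
```

Because the two functions are inverses of each other on this encoding, either representation can be stored and the other derived when needed.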

A GRN can be considered as a stochastic system of discrete components. However, modeling a GRN in this way is computationally intractable [12]. For that reason, stochastic systems are not considered in this chapter, and the main focus is on discrete models of GRNs. In these models, a GRN is a set of genes $N$ and a set of functions $F$ such that there is one function for each gene: $\forall g_i \in N,\ \exists f_i : f_i \in F$. Each of these functions takes all or a subset of $N$ as parameters, and its output is a discrete value from a given set of gene states. Using this sort of model, the most important features of the regulatory relationships can be inferred and represented.
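A minimal sketch of such a discrete model is given below, assuming binary states and a synchronous update in which every function is applied to the same current state. The three genes, their update functions, and the synchronous scheme are all hypothetical choices for illustration; the text does not prescribe them.

```python
# Hypothetical three-gene network with binary states (0 = off, 1 = on).
GENES = ["g1", "g2", "g3"]

# One function per gene; each reads only the genes it depends on.
FUNCTIONS = {
    "g1": lambda s: s["g3"],                   # g1 follows g3
    "g2": lambda s: 1 - s["g1"],               # g1 inhibits g2
    "g3": lambda s: int(s["g1"] and s["g2"]),  # g3 needs both g1 and g2
}


def step(state):
    """Apply every f_i to the current state (synchronous update, an assumption)."""
    return {g: FUNCTIONS[g](state) for g in GENES}


state = {"g1": 1, "g2": 1, "g3": 0}
nxt = step(state)
print(nxt)  # {'g1': 0, 'g2': 0, 'g3': 1}
```

Each entry of `FUNCTIONS` plays the role of one $f_i \in F$, and its inputs are the subset of $N$ on which that gene depends.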

Data mining can be used either to infer epistasis (determine which genes interact) or to create explanatory models of the network. Epistasis is traditionally identified through synthetic lethality [13–15] and yeast two-hybrid (Y2H) experiments. Data-mining approaches are necessary in these situations because the data are often very noisy, and (as with Per1-3) phenotypic changes may be invisible unless several genes are knocked out. Inferring an explanatory model of the network is often better, with more useful applications to biological understanding, genetic engineering, and pharmaceutical design. In order to infer an explanatory model of a GRN with a data-mining approach, it is necessary to take into account several considerations. First, a few questions regarding the type of biological data from which the model will be inferred should be answered: What kind of information does it represent? Is it steady-state data or time-series data? Are the data inherently noisy or not? Second, since the objective is to infer a GRN with a data-mining approach, discretization might play an important role in the whole process. Does the algorithm require discretization of data? If so, how many states are necessary to represent the underlying behavior of the genes? How can these states be obtained? Additionally, there are various other relevant features that should be taken into account: the cardinality pattern of the associations (quantity of genes that can be linked by one rule), the manner in which the temporal behavior is modeled (time delay associations), and how to reconcile rules extracted from multiple data sources. These subjects may affect the computational complexity of the inference algorithm and the biological feasibility of the inferred model. Finally, the biological and statistical validation of the ARs obtained is the most important step in the inference of a GRN, since it determines the biological plausibility of the inferred model. Figure 36.2 summarizes all of these issues, among others, which must be considered in order to infer GRNs with data-mining approaches. In the following sections, all of these topics and other issues will be discussed.

36.2.1 Types of Biological Data Used for GRN Inference

There are few types of biological data available for addressing network modeling in bioinformatics. In this regard, the most widely used data type for data-mining techniques in the reconstruction of GRNs is gene expression data. Gene expression data represent the activity of each gene $g_i \in N$, measured as the concentration of mRNA, since the transcription activity cannot be measured directly. Thus, given that the regulatory or phenotypic protein interaction can consume some mRNA before the regulation of the gene takes place [16], this may seem to be an inaccurate measure of gene activity [17]. Moreover, a protein may bind a promoter region without producing any regulatory effect [18]. Additionally, most genes are not involved in most cellular processes [19]. This implies that several of the sampled genes may appear to randomly vary their expression levels. However, if the data set is comprehensive and the only concern is the inference of regulatory relationships, these influences are not important [12]. Increasing the amount of data or performing targeted inference can avoid the problem of irrelevant genes. The nongenetic unmodeled influences are analogous to the hidden intermediate variables in Bayesian networks [18]. An influence like this does not distort the regulatory relationship or the predictive accuracy of the inferred model [12].

FIGURE 36.2 Summary of several issues that need to be considered in order to infer GRNs with data-mining approaches. The figure groups them as follows: data availability (steady state, time series, in silico); preprocessing (discretization, feature selection, external biological information); data mining (pairwise versus many-to-one interactions, time-delayed regulation, multiple data sources); biological evaluation (biological databases, literature, experimental analysis).

36.2.1.1 Types of Expression Data There are two types of gene expression data: equilibrium (steady-state) expression levels that correspond to a static situation and time-series expression levels that are gathered during a phenotypic phase like the cell cycle [20]. The gene expression data are usually obtained employing microarrays or some similar technology. A microarray is a pre-prepared slide divided into cells. Each cell is individually coated with a chemical which fluoresces when it is mixed with the mRNA generated by just one of the genes. The brightness of each cell is used as a measurement of the level of mRNA and therefore of the gene expression level.


Generally, the steady-state expression data are represented by an $N \times M$ matrix $A'$, where the rows represent the $N$ genes sampled and the columns represent the $M$ different experimental conditions (or replicas or both). Different experimental conditions refer to different tissues, temperatures, chemical compounds, or any other condition that may produce different regulatory behavior among the sampled genes. Each element $a'_{ij}$ of $A'$ contains the expression value of gene $g_i$ in the sample or experimental condition $j$. On the other hand, the time series encoded in the gene expression data set are represented by means of a gene expression data matrix, $A'$, where the rows and columns represent genes and time points, respectively. In this case, the time-series data are gathered by using temperature- (or chemical-) sensitive mutants to pause the phenotypic process while a microarray is performed on a sample. Thus, the different columns represent the expression values of each sampled gene at different times under the same experimental condition during some phenotypic phase. The sampling times are chosen by the researcher according to the nature of the study and are not necessarily equidistant.

Microarrays can be both biologically noisy [21] and technically noisy [19]. The first one is the biological uncertainty in the form of intrinsic and extrinsic noise. The second one is the experimental noise due to the complex measurement process, ranging from hybridization conditions to microarray image processing techniques. However, the magnitude and impact of the noise are hotly debated issues that depend on the exact technology used to collect samples. Recent research [12, 22] argues that the magnitude and impact of the noise have been “gravely exaggerated.”

36.2.2 Gene Expression Discretization

Data discretization, also known as binning, is a frequently used technique in computer science and statistics applied to biological data analysis. Discretization of real data into a typically small number of finite values is often required by machine learning algorithms [23], Bayesian network applications [24], and any modeling algorithm using discrete-state models. An important advantage of using discrete states is that a significant portion of the noise is absorbed in the process.

Nonetheless, the selection of a reasonable discretization approach is not a trivial task. In general, discretization processes imply loss of information, and different strategies yield distinct discrete-state models. Therefore, the biological semantics and interpretation of the resulting models differ, even when the underlying real-valued data are the same. For this reason, the choice of a discretization method should consider the intrinsic nature of the biological data as well as the particular features of the computational method that will make use of these discretized models.

Several discretization techniques have been proposed in the literature. Binary discretization is the simplest way of discretizing data, used, for instance, for the construction of Boolean network models for GRNs [25, 26]. The expression data are discretized into two qualitative states as either present or absent. An obvious drawback of binary discretization is that labeling the real-valued data according to a present/absent scheme generally causes the loss of a large amount of information. Discrete models and modeling techniques allowing multiple states have been widely developed and studied [27, 28].

36.2.2.1 Discretization Problem Let $A'$ be an $N$-row × $M$-column gene expression matrix, where $a'_{ij}$ represents the expression level of gene $g_i$ under condition $j$. The matrix $A'$ is defined by its set of rows, $I$, and its set of columns, $J$. Moreover, let $a'_{IJ}$ denote the average value in the expression matrix $A'$ and $a'_{iJ}$ and $a'_{Ij}$ denote the mean of row $i$ and condition $j$, respectively. Let $H_{IJ}$ denote the maximum (high) value in the expression matrix $A'$ and $H_{iJ}$ and $H_{Ij}$ denote the maximum value of row $i$ and column $j$, respectively. In the same way, let $M_{IJ}$ denote the median value in the expression matrix $A'$ and $M_{iJ}$ and $M_{Ij}$ denote the median value of row $i$ and column $j$, respectively.

A discretized matrix $A$ is obtained through a mapping where each element in $A'$ is mapped to one element of an alphabet $\Sigma$, which consists of a set of different symbols, each representing a distinct activation level. In the simplest case, $\Sigma$ may contain only two symbols, one symbol used for regulation (or activation) and another symbol for no regulation (or inhibition). In this case, the expression matrix is usually transformed into a binary matrix, where 1 means regulation and 0 means no regulation. Another widely used option is to consider a set of three discretization symbols, $\{-1, 1, 0\}$, meaning DownRegulated, UpRegulated, or NoChange. Nevertheless, the values in matrix $A'$ may be discretized to an arbitrary number of symbols. After the discretization process, matrix $A'$ is transformed into matrix $A$ and $a_{ij} \in \Sigma$ represents the discretized value of the expression level of gene $g_i$ under condition $j$. Several discretization techniques have been used in expression data analysis. According to [29], these techniques can be grouped into two high-level categories:

1. Discretization using expression absolute values

2. Discretization using expression variations between time points

The approaches belonging to the first category can be used in expression data, in general, and they discretize the absolute gene expression values directly using different techniques. The second set of approaches, only applicable to time-series expression data, computes variations between each pair of consecutive time points and then discretizes these variations. In the following sections the major discretization approaches will be detailed.

36.2.2.2 Discretization Using Absolute Values

Discretization Using Average and Standard Deviation A straightforward discretization method discretizes the gene expression matrix $A'$ using the average expression value [30], or the average combined with the standard deviation of the expression values [29, 31, 32]. The limit values among bins can be computed using all the values in the expression matrix, that is, the overall average expression level and its standard deviation. Another possibility is for the average and the standard deviation to be computed for each row or column in the matrix.

When the goal is to discretize the matrix into a binary matrix with two symbols, one for regulation and another one for no regulation (e.g., 1 and 0), the average expression value is usually used alone; the discretization can be computed using all the values in the matrix, by row, or by column, using one of the following expressions:

$$a_{ij} = \begin{cases} 1 & \text{if } a'_{ij} \ge a'_{IJ} \\ 0 & \text{otherwise} \end{cases}$$

$$a_{ij} = \begin{cases} 1 & \text{if } a'_{ij} \ge a'_{iJ} \\ 0 & \text{otherwise} \end{cases}$$

$$a_{ij} = \begin{cases} 1 & \text{if } a'_{ij} \ge a'_{Ij} \\ 0 & \text{otherwise} \end{cases}$$
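As a rough illustration of the three thresholding variants (overall mean, row mean, column mean), the following Python sketch binarizes a small matrix stored as a list of rows. The function name `binarize`, its `mode` argument, and the toy matrix are assumptions made for this example and do not come from the cited works.

```python
from statistics import mean


def binarize(matrix, mode="overall"):
    """Map each value to 1 if it is >= the chosen mean, else 0.

    mode: "overall" (mean of all values), "row" (mean of the gene's row)
    or "column" (mean of the condition's column).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    overall = mean(v for row in matrix for v in row)
    row_means = [mean(row) for row in matrix]
    col_means = [mean(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]

    def threshold(i, j):
        if mode == "overall":
            return overall
        if mode == "row":
            return row_means[i]
        if mode == "column":
            return col_means[j]
        raise ValueError(f"unknown mode: {mode}")

    return [[1 if matrix[i][j] >= threshold(i, j) else 0
             for j in range(n_cols)] for i in range(n_rows)]


# Toy 2-gene x 3-condition matrix (made-up values).
A_raw = [[1.0, 2.0, 3.0],
         [4.0, 8.0, 6.0]]

print(binarize(A_raw, "overall"))  # [[0, 0, 0], [1, 1, 1]]
print(binarize(A_raw, "row"))      # [[0, 1, 1], [0, 1, 1]]
```

The example shows how strongly the choice of reference mean matters: with the overall mean the first gene is never marked as regulated, whereas the row mean captures its relative changes across conditions.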

Another possibility is to discretize the matrix using three symbols (for instance, −1, 0, and 1) meaning down regulated, no change, and up regulated, respectively. In this case, the average expression value is usually combined with its standard deviation. Let $\alpha$ be a parameter used to tune the desired deviation from the average and $\sigma_{IJ}$, $\sigma_{iJ}$, and $\sigma_{Ij}$ be the standard deviations of the overall values in the matrix, row $i$, and column $j$, respectively. Then, the discretization can be performed using one of the following equations [29, 31, 32]:

$$a_{ij} = \begin{cases} -1 & \text{if } a'_{ij} < a'_{IJ} - \alpha\sigma_{IJ} \\ 1 & \text{if } a'_{ij} > a'_{IJ} + \alpha\sigma_{IJ} \\ 0 & \text{otherwise} \end{cases}$$

$$a_{ij} = \begin{cases} -1 & \text{if } a'_{ij} < a'_{iJ} - \alpha\sigma_{iJ} \\ 1 & \text{if } a'_{ij} > a'_{iJ} + \alpha\sigma_{iJ} \\ 0 & \text{otherwise} \end{cases}$$

$$a_{ij} = \begin{cases} -1 & \text{if } a'_{ij} < a'_{Ij} - \alpha\sigma_{Ij} \\ 1 & \text{if } a'_{ij} > a'_{Ij} + \alpha\sigma_{Ij} \\ 0 & \text{otherwise} \end{cases}$$
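The three-symbol scheme can be sketched for a single list of values (one row, one column, or the flattened matrix), taking the lower cutoff as the mean minus $\alpha$ standard deviations and the upper cutoff as the mean plus $\alpha$ standard deviations. The function name, the example profile, and the use of the population standard deviation are assumptions of this sketch; the text does not say which estimator is intended.

```python
from statistics import mean, pstdev


def three_state(values, alpha=1.0):
    """Discretize a list of values into -1, 0, 1 around mean +/- alpha * sd."""
    mu = mean(values)
    sd = pstdev(values)  # population SD; the sample SD is an equally valid choice
    low, high = mu - alpha * sd, mu + alpha * sd
    return [-1 if v < low else 1 if v > high else 0 for v in values]


# Made-up expression profile of one gene over five conditions.
profile = [1.0, 5.0, 5.0, 5.0, 9.0]

print(three_state(profile, alpha=1.0))  # [-1, 0, 0, 0, 1]
print(three_state(profile, alpha=2.0))  # [0, 0, 0, 0, 0]
```

Larger values of `alpha` widen the NoChange band, so fewer values are labeled as up or down regulated.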

Discretization Based on Equal Frequency Principle The discretization process based on the equal-frequency principle considers a given number of symbols into which the expression values will be discretized. The data points are then split in such a way that each symbol receives the same number of data points, and each expression value is binned according to its corresponding symbol. This process can be applied to an arbitrary number of symbols. When only two symbols are considered, the process is equivalent to discretizing the values around the median. As before, the equal-frequency principle can be applied using all the expression values in the matrix, the expression values by row, or the expression values by column [33].
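A minimal rank-based sketch of equal-frequency binning follows. The function name and the handling of ties (equal values are ordered by their position in the input) are assumptions of this example.

```python
def equal_frequency(values, n_symbols):
    """Assign symbols 0..n_symbols-1 so that each symbol gets about
    len(values) / n_symbols data points."""
    # Indices of the values in ascending order of expression level.
    order = sorted(range(len(values)), key=lambda i: values[i])
    symbols = [0] * len(values)
    for rank, idx in enumerate(order):
        symbols[idx] = rank * n_symbols // len(values)
    return symbols


# Made-up profile with six values.
profile = [0.3, 2.5, 1.1, 0.9, 3.7, 1.8]

print(equal_frequency(profile, 3))  # [0, 2, 1, 0, 2, 1]
print(equal_frequency(profile, 2))  # [0, 1, 0, 0, 1, 1]
```

With two symbols the lower half of the sorted values receives 0 and the upper half receives 1, which corresponds to a split at the median.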

Discretization Based on Clustering Another common discretization technique for the gene expression data matrix $A'$ is based on clustering [34]. Generally, a clustering algorithm is applied to each row (gene expression profile), performing the discretization according to the partition returned by the algorithm. Employing clustering for discretization allows the consideration of multiple states in a straightforward manner, although it may introduce additional computational cost depending on the selected clustering technique.

One of the most common clustering algorithms is the k-means clustering developed in [35]. The goal of the k-means algorithm is to minimize dissimilarity in the elements within each cluster while maximizing this value among elements in different clusters. The algorithm takes as input a set of points $S$ to be clustered and a fixed integer $k$. It partitions $S$ into $k$ subsets by choosing a set of $k$ cluster centroids. The choice of centroids determines the structure of the partition since each point in $S$ is assigned to the nearest centroid. Then for each cluster the centroids are recomputed based on which elements are contained in the cluster. These steps are repeated until convergence is achieved. Many applications of the k-means clustering, such as the MultiExperiment Viewer [36], start by organizing any random partition into $k$ clusters and computing their centroids. As a consequence, a different clustering of $S$ may be obtained every time the algorithm is run. For the special case when only two states ($k = 2$) are considered in the discretization, the optimal solution can be computed in an optimized manner due to the total ordering of the elements in the gene expression profile [7].
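To make the two-state special case concrete, the sketch below relies on the fact that, for one-dimensional data, the two optimal clusters are contiguous in sorted order, so it is enough to try every split point of the sorted profile and keep the one with the lowest within-cluster sum of squares. This is a simple quadratic-time illustration with hypothetical names, not the optimized procedure of [7].

```python
def two_state_split(values):
    """Optimal 1-D two-cluster partition by exhaustive search over the
    split points of the sorted values; returns 0/1 labels in input order."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")

    def sse(chunk):
        # Sum of squared deviations from the chunk mean.
        m = sum(chunk) / len(chunk)
        return sum((v - m) ** 2 for v in chunk)

    best_cut, best_cost = None, float("inf")
    for cut in range(1, n):
        cost = sse(ordered[:cut]) + sse(ordered[cut:])
        if cost < best_cost:
            best_cut, best_cost = cut, cost

    # Threshold halfway between the two groups.
    threshold = (ordered[best_cut - 1] + ordered[best_cut]) / 2
    return [1 if v > threshold else 0 for v in values]


# Made-up profile with a clear low group and a clear high group.
profile = [0.2, 0.1, 0.3, 2.9, 3.1, 3.0]
print(two_state_split(profile))  # [0, 0, 0, 1, 1, 1]
```

Unlike a randomly initialized k-means run, this search is deterministic, so repeated executions on the same profile always return the same discretization.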

36.2.2.3 Discretization Using Expression Variations between Time Points

Several discretization techniques have been proposed based on the transitions in expression states between successive time points. These techniques usually consider either two or three states, as stated previously. Generally, the discretization of an $N \times M$ matrix $A'$ using expression variations between time points produces a discretized matrix $A$ with $M - 1$ columns, one for each pair of consecutive time points.

Transitional state discrimination (TSD) [37] is a discretization technique that uses two symbols. After standardizing the expression data $A'$ to z-scores (expression profiles are scaled to zero mean and unit standard deviation), each gene expression profile is discretized using two state transitions:

$$a_{ij} = \begin{cases} 1 & \text{if } a'_{ij} - a'_{i(j-1)} \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
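The TSD rule can be sketched for one expression profile as follows. The function name, the treatment of a completely flat profile, and the use of the population standard deviation for the z-scores are assumptions of this example.

```python
from statistics import mean, pstdev


def tsd(profile):
    """Two-symbol discretization of one profile from the sign of the
    change between consecutive (z-scored) time points."""
    mu, sd = mean(profile), pstdev(profile)
    if sd == 0:
        z = [0.0 for _ in profile]  # flat profile: nothing to scale
    else:
        z = [(v - mu) / sd for v in profile]
    # One symbol per pair of consecutive time points.
    return [1 if z[j] - z[j - 1] >= 0 else 0 for j in range(1, len(z))]


# Made-up time series with five time points.
series = [1.0, 2.0, 2.0, 0.5, 3.0]
print(tsd(series))  # [1, 1, 0, 1]
```

A profile with five time points yields four symbols, which matches the loss of one column when moving from $A'$ to $A$.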

Another discretization technique can be performed by computing variations between successive time instants as before but considering that these variations are significant whenever they exceed a given preset threshold [38, 39]. During the discretization process, the expression matrix is transformed into an $N \times (M - 1)$ matrix $A$ that reflects the changing tendency of each gene expression value over time. An arbitrary number of changing tendencies may be considered, leading to the discretization of the matrix into a set of symbols $\Sigma$. When three possible changing tendencies are considered [38, 39], an expression level may increase from time point $t_i$ to $t_{i+1}$, may decrease, or may remain unchanged. These changing tendencies are then discretized into three symbols (Increase, Decrease, NoChange, respectively). In this case, and with this set of symbols, the discretized matrix $A$ is obtained after two steps. In the first step, the expression matrix $A'$ is transformed into an $N \times (M - 1)$ matrix of variations $A''$ such that

$$a''_{ij} = \begin{cases} \dfrac{a'_{i(j-1)} - a'_{ij}}{\left| a'_{ij} \right|} & \text{if } a'_{ij} \neq 0 \\ 1 & \text{if } a'_{ij} = 0 \wedge a'_{i(j-1)} > 0 \\ -1 & \text{if } a'_{ij} = 0 \wedge a'_{i(j-1)} < 0 \\ 0 & \text{if } a'_{ij} = 0 \wedge a'_{i(j-1)} = 0 \end{cases}$$

Once matrix $A''$ is generated, the final discretized matrix $A$, also with $N$ rows and $M - 1$ columns, is obtained in a second step by binning the values of the transformed matrix considering a threshold $t > 0$ as follows:

$$a_{ij} = \begin{cases} 1 & \text{if } a''_{ij} \ge t \\ -1 & \text{if } a''_{ij} \le -t \\ 0 & \text{otherwise} \end{cases}$$
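The two steps can be sketched for a single profile as shown below. The sketch follows the piecewise definition of the variations literally, so each variation is the earlier value minus the later one, scaled by the absolute value of the later one; under that reading a positive entry marks a drop in expression, and the operands would need to be swapped if the opposite sign convention is intended. Function names and data are hypothetical.

```python
def variations(profile):
    """Step 1: relative variations between consecutive time points,
    following the piecewise definition term by term."""
    out = []
    for j in range(1, len(profile)):
        cur, prev = profile[j], profile[j - 1]
        if cur != 0:
            out.append((prev - cur) / abs(cur))
        elif prev > 0:
            out.append(1.0)
        elif prev < 0:
            out.append(-1.0)
        else:
            out.append(0.0)
    return out


def bin_variations(var, t):
    """Step 2: three-symbol binning of the variations with threshold t > 0."""
    if t <= 0:
        raise ValueError("threshold t must be positive")
    return [1 if v >= t else -1 if v <= -t else 0 for v in var]


# Made-up time series with five time points.
series = [2.0, 1.0, 1.0, 0.0, 4.0]
var = variations(series)
print(var)                       # [1.0, 0.0, 1.0, -1.0]
print(bin_variations(var, 0.5))  # [1, 0, 1, -1]
```

Raising `t` makes the binning more conservative, since only larger relative changes are kept as significant.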


36.2.3 Pairwise versus Many-to-One Associations

The arity of the ARs inferable with data-mining methods has both biological and computational implications. From the biological point of view, the GRN structure appears to be neither random nor rigidly hierarchical but scale free. This means that the probability distribution for the out-degree, $k_{\text{out}}$, follows a power law [40, 41]. In other words, the probability that a gene $g_i$ regulates $k$ other genes is $p(k) \approx k^{-\lambda}$, where $\lambda$ usually lies between 2 and 3. In the analysis of [40] regarding scale-free Boolean networks, it was shown that some very disordered systems spontaneously “crystallize” into a high degree of order, which contributes to the “evolvability” and adaptability of GRNs [41].
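To give a feel for what a power-law out-degree distribution implies, the sketch below normalizes $k^{-\lambda}$ over a finite range of out-degrees, assuming the usual decaying form of the power law. The cutoff `k_max`, the value 2.5 for the exponent, and the function name are arbitrary choices for illustration.

```python
def power_law_pmf(k_max, lam):
    """Probabilities p(k) proportional to k**(-lam) for k = 1..k_max."""
    weights = [k ** -lam for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


pmf = power_law_pmf(k_max=10, lam=2.5)
for k, p in enumerate(pmf, start=1):
    print(k, round(p, 4))
```

Most of the probability mass sits on small out-degrees, while a few genes with many targets remain possible, which is the hub-dominated pattern associated with scale-free networks.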

On the computational side, these distributions over $k_{\text{out}}$ and $k_{\text{in}}$ (the in-degree) mean that a number of assumptions have been made in previous research in order to simplify the problem and make it more tractable. For example, the exponential distribution over $k_{\text{in}}$ means that most genes are regulated by only a few other genes. Unfortunately, this holds only on average and is not an upper bound. This means that techniques which strictly limit $k_{\text{in}}$ to some arbitrary constant [7, 10, 42] might not be able to infer all networks, thus compromising their explanatory power.

36.2.3.1 “One-to-One” Regulatory Functions The one-to-one regulatory functions refer to a gene $g_i$ (target) that is regulated only by a gene $g_j$ (regulator), that is, a pairwise relationship. In this case, the regulatory function $f_i(j)$ can be roughly linear, sigmoid, or take any other form. Also, the strength of the effect of $g_j$ over $g_i$ may vary from strong to weak. However, this last feature can only be modeled if several discrete states are considered. Additionally, other nongenetic influences over $g_i$, denoted by $\phi_i$, can be considered in the modeling of the regulatory function. In this situation, the regulatory function can be expressed as $g'_i = f_i(g_j, \phi_i)$ (see, e.g., the work of Marnellos and Mjolsness [43]). However, in most situations and for simplicity it is assumed that $\partial f / \partial \phi = 0$.

The regulatory relationship between gene $g_j$ and gene $g_i$ can be of either activation or inhibition. Moreover, gene $g_j$ may both activate the gene $g_i$ when it is upregulated and inhibit $g_i$ when it is downregulated and vice versa. For the case where the opposite regulation always applies (when the regulator is overexpressed or when the regulator is underexpressed), the underlying chemical process is not clear. Moreover, in the inferred models of [44] this kind of regulation cannot be biologically verified. In either case, this type of regulation is especially complex and is not evolutionarily robust, which means that the occurrence of such a relationship is unlikely in biological terms. Other properties of the organisms also have an influence on the types of regulatory relationships that can be inferred. For example, inhibitory genes are more common in prokaryotic than in eukaryotic organisms [45].

36.2.3.2 “Many-to-One” Regulatory Functions A gene $g_i$ may be regulated by several other genes, in which case the regulatory function is usually more complex. In particular, the gene regulation for a eukaryotic organism can be enormously complex [46], in which the regulatory function can be a piecewise threshold function [16, 47, 48]. The complexity arises because of the complex indirect, multilevel, and multistage biological process underlying gene regulation. The regulatory process is detailed in works such as [49–51]. Finally, some of the logically possible regulatory relationships appear to be unlikely. For example, it appears that the exclusive-or relationships are biologically and statistically improbable [52].
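One simple way to picture a many-to-one threshold function is a weighted sum of regulator states compared against a cutoff. The sketch below is only an illustrative assumption: the weights, the cutoff, and the function name are hypothetical, and the cited works may define their threshold functions differently.

```python
def threshold_regulation(inputs, weights, theta):
    """Return 1 (target expressed) if the weighted sum of the regulator
    states reaches theta, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0


# Hypothetical target with two activators (weight +1) and one inhibitor (weight -1).
WEIGHTS = (1, 1, -1)
THETA = 2

print(threshold_regulation((1, 1, 0), WEIGHTS, THETA))  # 1: both activators on
print(threshold_regulation((1, 1, 1), WEIGHTS, THETA))  # 0: inhibitor vetoes
print(threshold_regulation((1, 0, 0), WEIGHTS, THETA))  # 0: one activator is not enough
```

In this toy setting the target behaves like a conjunction of the two activators that the inhibitor can switch off.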


36.2.4 Inference of GRNs from Multiple Data Sources

Most early research on automatic learning of transcriptional regulatory networks employs only gene expression data. Recent simulation studies suggest that regulatory networks learned from gene expression data alone can be considerably obscured by the recovery of spurious interactions when the number of observations is small [53]. Integrating findings from multiple data sources (e.g., DNA sequences, gene and protein expression profiles, protein–protein interactions, protein structural information, and protein–DNA binding data) can overcome this drawback [54]. However, there are several problems in integrating diverse genomics data into network models [55]. First, the inference from multiple different data sources can lead to missing interactions that are only present in certain experimental conditions. Also, the genomics data are heterogeneous in their sensitivity and specificity for relationships between genes. For example, experimental methods such as mass spectrometry preferentially observe abundant proteins, while comparative genomics methods apply only to evolutionarily conserved genes. Increasing the sensitivity of detection usually comes at the cost of more false-positive identifications. Thus, the systematic bias for each method should be understood and considered during data integration. Additionally, genomics data sets vary widely in their utility for reconstructing gene networks. Thus, robust benchmarking methods that can evaluate each data set and allow comparison of their relative merits are required. Finally, data sets are often correlated, which complicates integration, since it can be difficult to measure the correlation because of both data incompleteness (a common problem) and sampling biases.

Two major related approaches have been developed for jointly learning transcriptional regulation from multiple data sources. In one approach, various types of data are used to identify sets of genes that interact in the cell or are coregulated in modules [17, 56]. In the other, various types of data are used to supplement gene expression data in learning regulatory networks [57, 58]. Regarding the latter works, Bernard and Hartemink presented a method for jointly learning dynamic models of transcriptional regulatory networks from gene expression data and transcription factor binding data, based on dynamic Bayesian network inference algorithms [58]. Results obtained from analyzing yeast cell cycle data demonstrate that the recovery of dynamic regulatory networks from multiple types of data by this joint learning algorithm is more accurate than that from each data type alone. Imoto et al. proposed a statistical method for estimating a gene network based on Bayesian networks from microarray gene expression data together with biological knowledge, including protein–protein interactions, protein–DNA interactions, transcription factor binding information, and existing literature [57]. An advantage of the method is that the balance between microarray information and biological knowledge is optimized automatically by the proposed criterion. Monte Carlo simulations showed the effectiveness of the proposed method in extracting more information from microarray data and estimating the gene network more accurately. Yeang et al. [59] developed a framework for inferring transcriptional regulation. The models they developed, called physical network models, are annotated molecular interaction graphs. The attributes in the model correspond to verifiable properties of the underlying biological system, such as the existence of protein–protein and protein–DNA interactions, the directionality of signal transduction in protein–protein interactions, and the signs of the immediate effects of these interactions. Possible configurations of these variables are constrained by the available data sources. The application of this algorithm to data sets related to the pheromone response pathway in yeast demonstrated that the derived model was consistent with previous knowledge of the pathway.


36.2.5 Temporal Delayed Associations from Time-Series Data

Another important aspect of the reconstruction of GRNs is the manner in which the temporal patterns of a GRN are captured. As mentioned in [10, 60], time-delayed gene regulation is a common phenomenon. Moreover, multiple time-delayed gene regulations can be considered the norm, while single time-delayed associations can be considered the exception [54]. This occurs because, within regulation procedures, various events occur at different steps. Usually, the step of transcription (from DNA to mRNA) is fast, while the time for translation varies from protein to protein [61]. Besides, protein–DNA regulation is an accumulation process, and the threshold differs for different regulatory gene pairs [61]. Suppose gi regulates gj; a change in the expression level of gi may then affect the expression level of gj after a certain time interval. For example, based on the yeast gene expression microarray data set of Spellman et al. [20], the gene MCM1 regulates the gene CLN3, and each time the expression level of MCM1 changes, the corresponding expression level of CLN3 changes about 30 min later. Also, the time delay intervals may vary for different gene regulatory pairs. For example, the human TNF-α and iNOS genes are regulated by AP-1 and NF-κB1. Their delays in expression after the activation of AP-1 and NF-κB1 are 3 and 6 h, respectively [62]. It is further known that there should be an upper limit for the time delay in a gene network, since the length of a cell cycle is limited. The regulation of genes can form feedback loops (e.g., g1 → g2 → · · · → g1), which exist in many metabolic pathways and are critical in maintaining the stability of a gene network [63].

Variability in the timing of biological processes further complicates the inference of gene associations from time-series data. The rate at which similar underlying processes, such as the cell cycle, unfold can be expected to differ across organisms, genetic variants, and environmental conditions. For instance, Spellman et al. [20] analyze time-series data for the yeast cell cycle in which different methods were used to synchronize the cells. It is clear that the cycle lengths across the different experiments vary considerably and that the series begin and end at different phases of the cell cycle. This complicates the interpretation of the results if several time-series data sets are used in the inference of the interactions among genes [7]. Thus, a method is needed to align such series and make them comparable, for example by representing the time-series gene expression profiles as continuous curves, which allows the sampling rate to be standardized across data sets.

36.2.6 Biological and Statistical Validation of Inferred GRNs

Once a GRN is obtained by a data-mining approach, it is crucial to validate it in order to determine the correctness and/or the biological viability of the inferred ARs. The type of validation required depends on the goal of the study. First of all, it is necessary to separate those analyses in which the objective is the assessment of the data-mining algorithm from those in which the goal is the identification of promising hypotheses of new biological knowledge. In the first case, the analysis also depends on the data type employed in the inference. In silico data allow the use of well-known data-mining metrics, such as precision, sensitivity, and specificity, because the real interactions among genes are known beforehand. Thus, it is possible to compare several algorithms and determine which one of them best reconstructs a GRN from given in silico data. Although the results cannot be conclusive, since they depend on the approach used in the generation of the in silico data, they may provide insights regarding the behavior of each method.
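As an illustration of the in silico assessment, the following Python sketch compares a set of inferred regulatory links against a known network and reports precision, sensitivity, and specificity. The function name edge_metrics, the toy gene names, and the decision to treat every ordered pair of distinct genes as a candidate link are assumptions made here for illustration only; they are not taken from any of the cited works.

```python
from itertools import permutations


def edge_metrics(genes, true_edges, inferred_edges):
    """Compare inferred directed links with a known (in silico) network."""
    true_edges, inferred_edges = set(true_edges), set(inferred_edges)
    # Assumption: every ordered pair of distinct genes is a candidate link.
    candidates = set(permutations(genes, 2))
    tp = len(true_edges & inferred_edges)   # correctly inferred links
    fp = len(inferred_edges - true_edges)   # spurious links
    fn = len(true_edges - inferred_edges)   # missed links
    tn = len(candidates - true_edges - inferred_edges)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, sensitivity, specificity


# Hypothetical four-gene network and a hypothetical inference result.
genes = ["g1", "g2", "g3", "g4"]
gold = {("g1", "g2"), ("g2", "g3"), ("g3", "g1")}
inferred = {("g1", "g2"), ("g2", "g3"), ("g4", "g1")}
precision, sensitivity, specificity = edge_metrics(genes, gold, inferred)
print(precision, sensitivity, specificity)
```

Because the number of true negatives usually dominates in sparse networks, specificity tends to be close to 1 for almost any method, which is one reason precision and sensitivity are normally reported alongside it.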

When real data are employed to assess an algorithm, the real interactions among genes that are present in the data are not known beforehand. In general, only specific curated knowledge regarding the real interactions among genes is available. Moreover, those known interactions may not be present in the real data set because of the specific environmental conditions employed in the experiment. Thus, the previously mentioned metrics are, in general, not applicable. However, there are other means to assess the algorithm when real data are used. First, there is the rule-by-rule analysis of the biological relevance of the relationships obtained by the method. This is done by means of a literature search, looking for known biological interactions among the genes under consideration. This approach is sound when a single method is evaluated; however, it has drawbacks that complicate its application in most scenarios. First, it is applicable only when a small set of rules is evaluated, since the whole process is performed manually. Another disadvantage is that it cannot be used for comparing several methods, because the assessed quality of a rule is biased by the expert who evaluates it, and it is therefore impossible to establish a fair order of merit for the algorithms under consideration. Another approach consists in employing online databases of gene interactions to assess the inferred ARs. As an example, in the case of yeast, there are two well-known databases: the Kyoto Encyclopedia of Genes and Genomes (KEGG) [64] and the Gene Ontology (GO) annotation [65]. The KEGG cell cycle regulation pathway is a collection of manually drawn pathway maps representing the regulation knowledge on the molecular interaction, and the pathway contains interaction information that is relevant to the cell cycle of yeast. Thus, if an extracted AR matches KEGG regulation information, the rule can be considered correctly extracted. In the same way, the GO annotation is another source of potential associations for yeast genes. One can take all gene pairs sharing any GO biological process term between specific levels of the GO annotation and use them as another benchmarking set. Nonetheless, as stated before, it should be clear that important known interactions will not be found by any data-driven approach if the data sets do not contain correlations among the genes involved in such relations.

Independently of the data type employed in the analysis, another common technique to assess the performance of an AR mining algorithm is cross-validation [66–68]. Cross-validation is a technique widely used in data mining for assessing how the results of a statistical analysis will generalize to an independent data set. In GRN reconstruction it is mainly used to estimate how accurately the predictive model will perform in practice. One round of cross-validation involves partitioning a data set into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds. Common types of cross-validation include K-fold cross-validation and repeated random subsampling validation. The major disadvantage of this type of validation in gene AR mining is, in general, the small number of samples available in the data sets. If an inference is performed on a data set with few samples, the effective number of samples used in the inference process is even smaller because of the partition into training and test sets, which negatively affects the predictive power of the resulting model. Another disadvantage arises when time-series data are employed to infer time-delayed rules. The partition into training and test sets cannot be performed because the two sets would refer to completely different periods of time and thus would not be comparable from a biological point of view.
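The partitioning step of K-fold cross-validation can be sketched in a few lines of Python. The helper name k_fold_splits is hypothetical, and the sketch assumes contiguous, unshuffled folds; in practice the samples are usually shuffled first. With only 10 samples and K = 3, the example also makes the sample-size problem visible: each round trains on just six or seven samples.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Distribute the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print(len(train), "training samples,", len(test), "test samples")
```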


Finally, if the goal of the analysis is the inference of new biological knowledge, it is necessary to clarify that the rules inferred by any data-mining approach always represent regulatory associations supported with some confidence by the data, not confirmed interactions. That is, rule extraction can be useful for identifying promising hypotheses regarding the nature of the experiments analyzed. However, corroboration by biological experiments will always be mandatory in order to obtain curated new knowledge.

36.2.7 Advantages and Limitations of Inference of GRNs Based on ARs

Inferring GRNs with AR mining approaches presents several advantages. First of all, the inferred models are highly abstract and hence require less data than continuous models [such as general ordinary differential equations (ODEs)]. This favors their applicability, since almost all gene expression data are suitable for the extraction of ARs. Additionally, the simplicity of the inferred model allows larger models to be inferred at a higher speed and also facilitates the interpretation of the results.

However, there are also several disadvantages in the inference of ARs. The most important one is that ARs can display only qualitative dynamic behavior. This could be overcome to some extent if several states (more than two) were considered for each gene. Nonetheless, this also complicates the inference, since it requires more data to deduce the interactions, and it demands more computational resources because of the larger search space. Additionally, as ARs are highly abstract, the level of detail that can be modeled is very limited. This issue generally affects their faithfulness to biological reality and also limits their ability to model dynamics.

36.3 TECHNIQUES OF INFERENCE OF GRNs BASED ON AR

36.3.1 Frequent-Itemset-Based Methods

Frequent-itemset-based methods were originally developed to find interesting associations or correlation relationships among data in large databases, such as those of business transaction records. The discovery of interesting ARs derived from the so-called frequent itemsets is valuable in many business decision-making processes, such as catalog design, cross-marketing, and loss-leader analysis [69]. Following the definitions in [70], let ℐ = {g1, . . . , gN} be a set of distinct literals, called items. A set X ⊆ ℐ with |X| = k is called a k-itemset or simply an itemset. Let D be a set of transactions, where each transaction T is an itemset. There is a unique identifier associated with each transaction, its transaction identification (TID). A transaction T contains or supports an itemset X if X ⊆ T. As stated previously, an AR is an expression X → Y, where in this case X ⊆ ℐ, Y ⊆ ℐ, and X ∩ Y = ∅. The itemset X has a support x in the transaction database D if x% of transactions T in D contain X:

\[ \mathrm{supp}(X) = \frac{|\{T \mid X \subseteq T,\ T \in D\}|}{|D|} \]

The rule X → Y has a support s in the transaction database D if s% of transactions T in D contain X ∪ Y, that is,

\[ \mathrm{supp}(X \rightarrow Y) = \frac{|\{T \mid X \cup Y \subseteq T,\ T \in D\}|}{|D|} \]


The rule X → Y has a confidence c in the transaction database D if c% of the transactions T in D that contain X also contain Y, that is,

\[ \mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \]

Note the different meanings of the support and confidence measures. While the support of an itemset or a rule indicates the statistical significance of the itemset or the rule, the confidence is a measure of the rule’s strength. Generally, only the rules with support and confidence values above certain thresholds (minsupp and minconf, respectively) are considered.
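The two measures translate directly into code. The following Python sketch follows the definitions above; the transaction database D is a hypothetical toy example in which each transaction lists the genes found "on" in one sample, and the function names supp and conf simply mirror the notation of the text.

```python
def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def conf(x, y, transactions):
    """Confidence of the rule X -> Y, i.e., supp(X u Y) / supp(X).

    Raises ZeroDivisionError if X never occurs in the transactions.
    """
    return supp(set(x) | set(y), transactions) / supp(x, transactions)


# Hypothetical database: the genes that are "on" in each of five samples.
D = [
    frozenset({"g1", "g2", "g3"}),
    frozenset({"g1", "g2"}),
    frozenset({"g1", "g3"}),
    frozenset({"g2", "g3"}),
    frozenset({"g1", "g2", "g3"}),
]

print(supp({"g1"}, D))          # g1 occurs in 4 of 5 transactions
print(supp({"g1", "g2"}, D))    # g1 and g2 co-occur in 3 of 5 transactions
print(conf({"g1"}, {"g2"}, D))  # 3 of the 4 transactions with g1 also contain g2
```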

In this formulation, the problem of rule mining can be decomposed into two steps: frequent-itemset identification and rule generation. The first step entails the identification of all frequent itemsets F = {X | supp(X) ≥ minsupp}. Once the set of all frequent itemsets, as well as their supports, is known, the second step involves the derivation of the desired ARs from F. This procedure is very simple: For each X ∈ F, the confidence of all possible rules X − Y → Y is checked, where Y ⊂ X and Y ≠ ∅, and those rules whose confidence falls below minconf are excluded.
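The second step can be sketched as follows, assuming that the frequent itemsets and their supports are already available as a dictionary. The function name generate_rules and the toy supports are illustrative assumptions. The lookup of the antecedent's support always succeeds because every subset of a frequent itemset is itself frequent.

```python
from itertools import combinations


def generate_rules(frequent, minconf):
    """Derive rules (X - Y) -> Y from frequent itemsets.

    `frequent` maps each frequent itemset (a frozenset) to its support.
    Returns a list of (antecedent, consequent, confidence) triples.
    """
    rules = []
    for itemset, support in frequent.items():
        # Enumerate every nonempty proper subset Y of the itemset.
        for size in range(1, len(itemset)):
            for y in combinations(sorted(itemset), size):
                consequent = frozenset(y)
                antecedent = itemset - consequent
                confidence = support / frequent[antecedent]
                if confidence >= minconf:
                    rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical frequent itemsets (minsupp = 0.6) with their supports.
frequent = {
    frozenset({"g1"}): 0.8,
    frozenset({"g2"}): 0.8,
    frozenset({"g3"}): 0.8,
    frozenset({"g1", "g2"}): 0.6,
    frozenset({"g1", "g3"}): 0.6,
    frozenset({"g2", "g3"}): 0.6,
}
rules = generate_rules(frequent, minconf=0.7)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```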

The main challenge of mining ARs in this way lies in the first step, the identification of frequent itemsets. It is intuitively obvious that a linear increase in the number of items |ℐ| results in an exponential growth of the number of itemsets to be considered. Fortunately, the itemset support has the downward-closure property: All subsets of a frequent itemset must also be frequent [70]. As a result, there is a border in the lattice structure separating the frequent and infrequent itemsets [71], with the frequent itemsets located above the border and the infrequent itemsets located below. The basic principle is to employ this border to prune the search space efficiently.

Most of the proposed itemset mining methods are variants of the APRIORI algorithm [72]. The APRIORI algorithm adopts a breadth-first search of the itemset lattice and uses k-itemsets to explore (k + 1)-itemsets. The algorithm scans the database in the first round to count the occurrences of each item. It then finds the set of frequent 1-itemsets (denoted L1) with respect to a given threshold minsupp. A subsequent round of the algorithm (e.g., round k) consists of two phases. First, the frequent (k − 1)-itemsets Lk−1 found in the (k − 1)th round are used to generate the candidate itemsets Ck. Second, the database is scanned and the k-itemsets in Ck are checked. If a k-itemset X in Ck is not frequent, it is removed from Ck. The remaining k-itemsets in Ck constitute Lk and are used in the (k + 1)th round. These two phases iterate until the set of frequent k-itemsets Lk is empty.
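The level-wise search described above can be sketched compactly in Python. This is a minimal illustration under simplifying assumptions (the whole database fits in memory, and candidates are generated by a naive pairwise join followed by downward-closure pruning); it is not the optimized procedure of [72]. The function name apriori and the toy database are hypothetical.

```python
from itertools import combinations


def apriori(transactions, minsupp):
    """Return a dict mapping each frequent itemset to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(candidate):
        return sum(candidate <= t for t in transactions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= minsupp}

    # Round 1: frequent 1-itemsets (L1).
    level = keep_frequent({frozenset([i]) for t in transactions for i in t})
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        previous = set(level)
        # Phase 1: join (k-1)-itemsets into candidate k-itemsets (Ck) and
        # prune any candidate that has an infrequent (k-1)-subset.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Phase 2: scan the database and keep the frequent candidates (Lk).
        level = keep_frequent(candidates)
    return frequent


# Hypothetical database: the genes that are "on" in each of five samples.
D = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g1", "g3"},
    {"g2", "g3"},
    {"g1", "g2", "g3"},
]
frequent = apriori(D, minsupp=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this example the 3-itemset {g1, g2, g3} survives the pruning phase, because all of its 2-subsets are frequent, but it is discarded after the database scan because its support is only 0.4.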

APRIORI-based methods show good performance with sparse data sets such as market-basket data, where the frequent patterns are very short. However, with dense data sets such as microarrays, where there are many long frequent patterns, these methods scale poorly and are sometimes impractical. This drawback is due to the high computational cost of generating and testing the candidate sets used by APRIORI-based approaches. Thus, new methods like FP-GROWTH [73], which simplify the problem of finding long patterns by concatenating small ones, have emerged as a promising strategy. In fact, several methods have been devised on the basis of FP-GROWTH [4, 74]. The main idea relies on a compact tree structure called the FP-tree, which is searched recursively in order to enumerate all frequent patterns. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from a conditional FP-tree (e.g., the patterns of length 1 are used to generate those of length 2, and so on). Even tree-based methods such as FP-GROWTH may encounter difficulties when dealing with high-dimensional data sets. A frequent pattern of size k (number of items) implies the presence of 2^k − 2 additional frequent patterns as well, each of which is explicitly checked by such methods. Thus, FPM algorithms that employ sophisticated heuristics for mining long frequent itemsets constitute practical solutions for GAA.

There are currently two alternatives for mining long patterns. The first one is to mine only maximal frequent itemsets, as in MAXMINER [75] and GENMAX [76]; these itemsets are typically orders of magnitude fewer than all frequent patterns. Maximal itemsets are the longest frequent patterns found under a certain support threshold, that is, frequent itemsets with no frequent superset. Although maximal patterns help in understanding the long itemsets in dense domains, they lead to a loss of information: since the supports of their subsets are not available, maximal sets are not suitable for generating rules. The second alternative is to mine only frequent closed sets, as in CLOSE [77], CLOSE+ [78], and CHARM [79]. Closed sets are lossless in the sense that they can be used to uniquely determine the set of all frequent patterns and their exact frequencies. A closed itemset is a frequent pattern that satisfies the support threshold and has no frequent superset with the same support. Furthermore, closed-itemset algorithms can handle pattern redundancy, which is quite common in the application of association mining to high-dimensional databases [4, 74]. However, even with such a strategy, the high dimensionality of microarrays still poses great challenges for these methods.
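The difference between maximal and closed itemsets can be made concrete with a small example. The Python sketch below enumerates the frequent itemsets of a hypothetical four-transaction database by brute force (viable only for toy data) and then filters them; it does not reproduce the search strategies of the algorithms cited above, and all function names are illustrative.

```python
from itertools import combinations


def frequent_counts(transactions, min_count):
    """Brute-force enumeration of frequent itemsets with absolute counts."""
    items = sorted({i for t in transactions for i in t})
    counts = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            n = sum(candidate <= t for t in transactions)
            if n >= min_count:
                counts[candidate] = n
    return counts


def maximal_itemsets(counts):
    """Frequent itemsets with no frequent proper superset."""
    return {x for x in counts if not any(x < y for y in counts)}


def closed_itemsets(counts):
    """Frequent itemsets with no proper superset of equal support."""
    return {
        x for x in counts
        if not any(x < y and counts[y] == counts[x] for y in counts)
    }


# Hypothetical database with three genes.
D = [frozenset(t) for t in (
    {"g1", "g2", "g3"},
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g3"},
)]
counts = frequent_counts(D, min_count=2)
maximal = maximal_itemsets(counts)
closed = closed_itemsets(counts)
print(len(counts), "frequent,", len(closed), "closed,", len(maximal), "maximal")
```

Here the single maximal itemset {g1, g2, g3} says nothing about the higher support of {g1, g2}, whereas the three closed itemsets suffice to recover the support of every frequent itemset: the support of {g1}, for instance, equals that of its smallest closed superset {g1, g2}.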

It is important to consider that all the aforementioned methods enumerate combinations of the columns (i.e., genes) of the gene expression matrix, so the size of the search space grows exponentially with the number of genes. Therefore, FPM methods that do not use candidate set generation are usually more efficient. The type of patterns found also plays an important role in the strength or weakness of an FPM method. Thus, closed-itemset strategies are more reliable for gene AR mining. From such a general discussion, one could expect CLOSE+ to be the most suitable column enumeration approach for gene AR mining. However, the method has not been applied to any kind of gene expression data, although it was successfully evaluated against its counterparts on other highly dense data sets [80].

36.3.1.1 Time-Delayed ARs with Frequent-Itemset Mining There are currently two alternatives for mining time-delayed ARs with frequent-itemset methods. The first one is to mine the rules by applying the APRIORI algorithm (or any other itemset mining algorithm) to matrices of time-delayed gene expression (TdE) profiles [81], similar to those used in [54]. The TdE matrix captures the regulation among genes within W units of time. Each of its rows is a time window, and its columns contain, for each gene, the value at the start of the window and the values at the W following time points; for the N genes and M time points of Table 36.1, this gives an M × (W + 1)N matrix. For example, as Baralis et al. [81] reported, if the discrete matrix in Table 36.1 is considered, the time-delayed matrix in Table 36.2 is obtained in the case of regulations spanning two time instances (i.e., W = 2). Entries that would fall beyond the last time point tM are marked NA.

TABLE 36.1 Discrete Matrix

Gene   t1   t2   t3   t4   · · ·   tM
g1     0    1    0    0    · · ·   1
g2     1    1    0    0    · · ·   0
g3     0    0    1    1    · · ·   0
g4     1    0    1    0    · · ·   1
⋮
gN     0    0    1    0    · · ·   0

TABLE 36.2 Time-Delayed Matrix

Time   g1 + 0   g1 + 1   g1 + 2   g2 + 0   · · ·   gN + 0   gN + 1   gN + 2
W1     0        1        0        1        · · ·   0        0        1
W2     1        0        0        1        · · ·   0        1        0
⋮
WM     1        NA       NA       0        · · ·   0        NA       NA
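The construction of the time-delayed matrix can be sketched as follows. The function name time_delayed_matrix is hypothetical, None plays the role of NA, and the input reuses only the first four time points of g1 and g2 from Table 36.1, since the remaining columns of that table are elided.

```python
def time_delayed_matrix(expr, W):
    """Build a time-delayed expression matrix.

    `expr` maps each gene to its list of discrete values over M time points.
    Row t holds, for every gene, the values at times t, t + 1, ..., t + W;
    positions beyond the last time point are filled with None (NA).
    """
    genes = list(expr)
    M = len(next(iter(expr.values())))
    header = [f"{g}+{d}" for g in genes for d in range(W + 1)]
    rows = []
    for t in range(M):
        row = []
        for g in genes:
            for d in range(W + 1):
                row.append(expr[g][t + d] if t + d < M else None)
        rows.append(row)
    return header, rows


# First four time points of g1 and g2 in Table 36.1.
expr = {"g1": [0, 1, 0, 0], "g2": [1, 1, 0, 0]}
header, rows = time_delayed_matrix(expr, W=2)
print(header)
for row in rows:
    print(row)
```

The first two rows agree with the g1 + 0, g1 + 1, g1 + 2, and g2 + 0 columns of windows W1 and W2 in Table 36.2. Any itemset mining algorithm can then be run on the rows, treating each (gene, delay, value) combination as an item.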

The other alternative is to extend the concepts of itemset mining to take time-lagged rules into account. In this sense, Nam et al. [82] developed a temporal association rule mining (TARM) method, based on the APRIORI algorithm, that extends the basic concepts as follows. A temporal item is an item that has a time stamp. A temporal itemset I is a nonempty set of temporal items. Given a temporal itemset I, a set T of transactions on I, and a positive integer minsupport, I is a temporal frequent itemset with respect to T and minsupport if supp(I) ≥ minsupport. A temporal AR X(Δ) → Y is a pair of disjoint temporal itemsets in which the time stamp of each temporal item in X is ahead of those of all temporal items in Y and in which Δ is the interval between the two time stamps.

Figure 36.3 shows an example, reported in [82], of the temporal itemset mining process. Suppose a three-state discretization process, as shown in Figure 36.3a. In order to find temporally associated genes, it is first assumed that related genes may have transcriptional time delays of various sizes. Therefore, the method searches for associated genes in all possible sets of experiments at different time points, where the time interval varies from 0 to W (Figure 36.3b). For example, the temporal transaction set t0 + t2 = {+g1L, −g2L, +g1R, +g2R, −g3R} consists of the up- or down-regulated genes at time stamps t0 and t2, with a transcriptional time delay of size Δ = 2. Note that g1 is up regulated at both t0 and t2, but it is treated as two different items: g1L (g1 on the left-hand side) and g1R (g1 on the right-hand side). Next, Figure 36.3c indicates the temporal frequent itemsets extracted with a support threshold of 50%. Finally, two temporal ARs are discovered with a confidence threshold of 50%, as shown in Figure 36.3d. In this way, TARM can find transcriptional time delays of various sizes between associated genes, activation and inhibition relationships, and sets of coregulators for the target genes.
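The example of Figure 36.3 can be reproduced with a short Python sketch. The binned values are those of panel (a); everything else (function names, brute-force support counting instead of the APRIORI-based search of TARM, and the restriction to rules with left-hand items in the antecedent and right-hand items in the consequent) is an illustrative simplification and not the implementation of [82].

```python
from itertools import combinations

# Binned data of Figure 36.3a: 1 = up, -1 = down, 0 = unchanged (t0..t6).
binned = {
    "g1": [1, 0, 1, 0, 1, 0, 0],
    "g2": [-1, -1, 1, -1, 1, 0, 1],
    "g3": [0, 0, -1, -1, 0, -1, 0],
}


def temporal_transactions(binned, delta):
    """One transaction per pair (t, t + delta): L marks time t, R marks t + delta."""
    n_times = len(next(iter(binned.values())))
    transactions = []
    for t in range(n_times - delta):
        items = set()
        for side, time in (("L", t), ("R", t + delta)):
            for gene, values in binned.items():
                if values[time] != 0:
                    sign = "+" if values[time] > 0 else "-"
                    items.add(sign + gene + side)
        transactions.append(frozenset(items))
    return transactions


def frequent_itemsets(transactions, minsupp):
    """Brute-force support counting; adequate only for a toy example."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            s = sum(candidate <= t for t in transactions) / len(transactions)
            if s >= minsupp:
                frequent[candidate] = s
    return frequent


def temporal_rules(frequent, minconf):
    """Rules with only L items in the antecedent and only R items in the consequent."""
    rules = []
    for itemset, s in frequent.items():
        left = frozenset(i for i in itemset if i.endswith("L"))
        right = itemset - left
        if left and right and s / frequent[left] >= minconf:
            rules.append((left, right, s / frequent[left]))
    return rules


transactions = temporal_transactions(binned, delta=2)
frequent = frequent_itemsets(transactions, minsupp=0.5)
rules = temporal_rules(frequent, minconf=0.5)
for left, right, c in rules:
    print(sorted(left), "-(2)->", sorted(right), "confidence", c)
```

With Δ = 2 the sketch yields the five transactions of panel (b), the six frequent itemsets of panel (c), and the two rules of panel (d), +g1 → +g2 and −g2 → −g3.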

36.3.2 Classification and Regression Tree-Based Approaches

A decision tree is a decision support tool that uses a treelike graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal. In gene AR mining, a decision tree is a rooted tree in which nonleaf nodes are labeled with explaining genes, the arcs from nonleaf nodes are labeled with possible characteristics of the explaining genes, and the leaves of the tree are labeled with the states of the predicted gene. There are two kinds of decision trees: classification trees and regression trees [83]. The former are those whose outcomes are the classes to which the data belong, whereas the latter are those whose outcomes can be considered real numbers. An example of a decision tree for classification of the yeast gene CLN2 is shown in Figure 36.4. Each path from the root node to a leaf node in the tree presents a rule that defines a state of the predicted gene via the expression levels of explaining genes. It follows that every decision tree is equivalent to a list of decision rules. This method of representation allows the decomposition of decision trees from a complex structure into simple and compactly presented ARs, which can be independently compared to existing knowledge. Thereby, the decision tree of Figure 36.4 can be represented by means of the following list of ARs: −SWI5 ∧ +CDC28 ∧ +CLN1 → −CLN2; −SWI5 ∧ +CDC28 ∧ −CLN1 → +CLN2; −SWI5 ∧ −CDC28 → +CLN2; +SWI5 ∧ −CLB1 → +CLN2; +SWI5 ∧ +CLB1 → −CLN2, where the symbol ∧ stands for the logical AND.

       t0   t1   t2   t3   t4   t5   t6
g1      1    0    1    0    1    0    0
g2     −1   −1    1   −1    1    0    1
g3      0    0   −1   −1    0   −1    0

(a) Binned time-series data, with three genes and seven time points

t0 + t2 = {+g1L, −g2L, +g1R, +g2R, −g3R}
t1 + t3 = {−g2L, −g2R, −g3R}
t2 + t4 = {+g1L, +g2L, −g3L, +g1R, +g2R}
t3 + t5 = {−g2L, −g3L, −g3R}
t4 + t6 = {+g1L, +g2L, +g2R}

(b) Temporal transaction sets, transcriptional time delay Δ = 2

{+g1L}, {−g2L}, {+g2R}, {−g3R}
{+g1L, +g2R}, {−g2L, −g3R}

(c) Temporal frequent itemsets, support = 50%

+g1(2) → +g2
−g2(2) → −g3

(d) Temporal ARs, confidence = 50%

FIGURE 36.3 Temporal AR mining process with transcriptional time delay Δ = 2, support ≥ 50%, confidence ≥ 50%.

[Figure 36.4: classification tree with root SWI5 and internal nodes CDC28, CLB1, and CLN1; the arcs carry the expression thresholds 0.8, 0.5, 0.7, and 1.2, and each leaf states whether CLN2 is up regulated or down regulated.]

FIGURE 36.4 Possible classification tree for gene CLN2 of Saccharomyces cerevisiae. CLN2 is the target gene; SWI5, CLB1, CDC28, and CLN1 are the regulatory genes. Expression thresholds of the respective explaining genes mark all the arcs.
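The equivalence between a decision tree and a list of rules can be sketched by enumerating root-to-leaf paths. In the Python sketch below, the nested-tuple encoding of the tree and the function names are assumptions made for illustration; the branch structure is the one implied by the five rules listed for Figure 36.4, with "-" denoting the low-expression branch and "+" the high-expression branch, and the numeric thresholds are deliberately omitted.

```python
# Each internal node is (gene, low_branch, high_branch); each leaf is the
# predicted state of the target gene.
tree = (
    "SWI5",
    ("CDC28", "+CLN2", ("CLN1", "+CLN2", "-CLN2")),  # SWI5 low
    ("CLB1", "+CLN2", "-CLN2"),                      # SWI5 high
)


def tree_to_rules(node, path=()):
    """Return one (conditions, outcome) pair per root-to-leaf path."""
    if isinstance(node, str):  # leaf: emit the accumulated conditions
        return [(path, node)]
    gene, low, high = node
    return (
        tree_to_rules(low, path + ("-" + gene,))
        + tree_to_rules(high, path + ("+" + gene,))
    )


def format_rule(conditions, outcome):
    return " ^ ".join(conditions) + " -> " + outcome


rules = [format_rule(c, o) for c, o in tree_to_rules(tree)]
for rule in rules:
    print(rule)
```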

In the context of AR mining by means of decision trees, the functions determining the states of target genes from data are called classifiers, while the algorithms that build such classifiers from data with known states are called inducers, induction algorithms, or ensemble algorithms. Each expression profile with a known state of a predicted gene is called an example or an instance. The set of examples used for classifier creation is the training set. If a subset of the examples is separated from the training set and used to estimate classification accuracy, it is called a test set. Thereby, the inference of a decision tree for AR mining can be considered a standard classification problem in the following way. Let Y = {y1, . . . , yN} be the set of all sample expression profiles and let Y/i = {y1/i, . . . , yN/i} be the set of partial sample expression profiles for a given gene gi from the matrix A′. Let us define a classifier, C, as a function that maps a vector y to a discrete value s. Sometimes, in the context of classification, the vector y is called a feature vector, while s is called a label. The subset of y vectors with correct labels assigned to them is called a data set, D, for a particular classification problem. An induction algorithm I maps a data set D into a classifier C. Thus, in order to solve the problem described above, it is necessary to define the data sets and then choose appropriate induction algorithms. More specifically, let the goal be to predict the state of gene gi from a matrix A′. An induction algorithm I maps the data set Di = (Y/i, si) into the classifier Ci (the index i for Di and Ci is used to emphasize that they correspond to the gene gi). For the given data set Di, it is necessary to create a classifier that correctly predicts the state of gene gi, that is, I(Di, yj/i) = Ci(yj/i) = sij. Thus, for this problem, the predicted gene gi and the explaining genes belong to the same sample j.
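The roles of the data set Di, the induction algorithm I, and the classifier Ci can be illustrated with a deliberately small Python sketch. Several assumptions are made here: the partial profile yj/i is read as the profile of sample j with gene gi removed, the state of gi is obtained with a fixed threshold of 0, and the inducer is a hypothetical one-split tree (a decision stump) chosen for brevity rather than any of the algorithms used in the cited works. The expression values are invented.

```python
def build_dataset(samples, i, threshold=0.0):
    """Return D_i = (Y_/i, s_i): profiles without gene i and the state of gene i."""
    features = [y[:i] + y[i + 1:] for y in samples]
    states = [1 if y[i] > threshold else 0 for y in samples]
    return features, states


def induce_stump(features, states):
    """Stand-in inducer I: the single split with the fewest training errors.

    Assumes at least one feature takes two or more distinct values.
    """
    best = None
    for f in range(len(features[0])):
        values = sorted({x[f] for x in features})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            for state_above in (0, 1):
                errors = sum(
                    (state_above if x[f] > thr else 1 - state_above) != s
                    for x, s in zip(features, states)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, thr, state_above)
    _, f, thr, state_above = best
    # The returned function is the classifier C_i.
    return lambda x: state_above if x[f] > thr else 1 - state_above


# Hypothetical expression profiles: six samples, three genes (target: index 2).
samples = [
    [0.2, 0.9, -0.5],
    [1.1, 0.1, 0.8],
    [0.3, 0.4, -0.2],
    [1.4, 0.8, 0.6],
    [0.1, 0.2, -0.7],
    [0.9, 0.7, 0.4],
]
features, states = build_dataset(samples, i=2)
classifier = induce_stump(features, states)
predictions = [classifier(x) for x in features]
print(states, predictions)
```

In this toy data the first gene separates the two states of the target perfectly, so the induced classifier reproduces all training labels; on real data the accuracy would have to be estimated on a separate test set.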

As a part of the classification problem, it is necessary to find which genes are relevant to the prediction of a particular gene. This is known as the feature subset selection problem. Two kinds of methods for feature subset selection have generally been presented in the literature: filter and wrapper methods [84, 85]. In the filter approach, the feature set is filtered to find the “most promising” subset by evaluating an objective function before running the induction algorithm. The weak point of this approach is that the properties of the particular induction algorithm are ignored. In the wrapper approach, the selection algorithm uses the induction algorithm itself to evaluate the objective function. The wrapper approach of Kohavi was reported to perform better than the filter approach on many real and artificial data sets [85]. The idea of the wrapper algorithm is to tune the parameters of an induction algorithm, treating it as a black box, in order to optimize some objective function (e.g., the accuracy of a classifier). The set of attributes relevant to the classification may be considered as parameters of the induction algorithm. Selecting the parameters that maximize the objective function gives a list of “good” features. For the details of the selection algorithm, see [85]. The classification rules inferred in this way assume that only a limited number of gene regulators are sufficient for accurate predictions.

Soinov et al. [86] were the first authors to approach the task of inferring ARs by means of decision trees. They used classification algorithms for continuous data, in which the discretization forms part of the algorithm. This allowed them to find abundance thresholds of regulatory genes that are specific to different gene interactions in the network and sufficient for switching the target gene from one state to the other. In this way, every gene has its own discretization threshold for input signals. They used two types of induction algorithms. The first one exploits the wrapper approach for feature subset selection [85]; it combines C4.5, by Quinlan [87], with the wrappers of Kohavi [85]. The second one is C4.5 itself. C4.5 is an algorithm that constructs the classification model inductively, generalizing information from examples of correct classification. It has proved to perform well on a large variety of data sets.

A more recently published example is [88], which employs regression trees to solve the problem. The basic idea of this method is to decompose the prediction of a regulatory network among p genes into p different regression problems. In each regression problem, the expression pattern of one target gene is predicted from the expression patterns of all the other genes. The difference between this method and Soinov’s approach lies in the use of regression trees instead of classification trees. The authors compare two tree-based ensemble methods founded on randomization, namely random forests [89] and extra trees [90]. In a random forests ensemble, each tree is built on a bootstrap sample from the original learning sample and, at each test node, K attributes are selected at random among all candidate attributes before determining the best split. In the extra-trees method, on the other hand, each tree is built from the original learning sample and, at each test node, the best split is determined among K random splits, each of which is obtained by randomly selecting an input (without replacement) and a threshold.
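The difference between the two randomization schemes at a single test node can be sketched as follows. This is a simplified illustration and not the implementation used in [88]: the split score is plain variance reduction, the helper names are hypothetical, and the toy data are invented. The random forests variant searches all thresholds of K randomly chosen attributes (and would be applied to a bootstrap sample), whereas the extra-trees variant draws a single random threshold for each of K randomly chosen attributes on the original sample.

```python
import random


def variance(values):
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_reduction(rows, targets, attr, thr):
    """Score of the split `attr <= thr` at a regression tree node."""
    left = [y for x, y in zip(rows, targets) if x[attr] <= thr]
    right = [y for x, y in zip(rows, targets) if x[attr] > thr]
    weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(targets)
    return variance(targets) - weighted


def bootstrap(rows, targets, rng):
    """Random forests: each tree is grown on a sample drawn with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [targets[i] for i in idx]


def random_forest_split(rows, targets, K, rng):
    """Best threshold over K randomly chosen attributes: (score, attr, thr)."""
    best = None
    for attr in rng.sample(range(len(rows[0])), K):
        values = sorted({x[attr] for x in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            cand = (variance_reduction(rows, targets, attr, thr), attr, thr)
            if best is None or cand > best:
                best = cand
    return best


def extra_trees_split(rows, targets, K, rng):
    """Best of K random splits, one random threshold per chosen attribute."""
    best = None
    for attr in rng.sample(range(len(rows[0])), K):
        column = [x[attr] for x in rows]
        thr = rng.uniform(min(column), max(column))
        cand = (variance_reduction(rows, targets, attr, thr), attr, thr)
        if best is None or cand > best:
            best = cand
    return best


# Hypothetical data: expression of three candidate regulators (rows) and of
# the target gene (targets) in four samples.
rows = [[0.1, 0.5, 0.3], [0.2, 0.1, 0.9], [0.8, 0.4, 0.2], [0.9, 0.2, 0.8]]
targets = [0.0, 0.0, 1.0, 1.0]
rng = random.Random(0)
rf_split = random_forest_split(rows, targets, K=3, rng=rng)
et_split = extra_trees_split(rows, targets, K=3, rng=rng)
boot_rows, boot_targets = bootstrap(rows, targets, rng)
print("random forests:", rf_split)
print("extra trees:   ", et_split)
```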

36.3.2.1 Time-Delayed ARs with Decision Trees The aforementioned framework for the inference of ARs by means of decision trees does not consider possible delayed interactions. In [86] an extended definition for the problem of single time-delayed interactions was introduced. This formulation is the same as before, except that the data set is now $D_i = (Y'_{/i}, s'_i)$, where $Y'_{/i} = \{y_{1/i}, \ldots, y_{(N-1)/i}\}$ and $s'_i = (s_{i2}, \ldots, s_{iN})$. The classifier $C_i$ is said to classify gene $g_i$ for sample $j$ correctly if $C_i(y_{j/i}) = s_{i(j+1)}$. Note that, in this problem, the regulator genes belong to the sample preceding the sample of the target gene $g_i$. This formulation can be generalized to any time delay in the regulatory effect between the regulator genes and the target gene.
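The shifted data set can be sketched as follows. This is a minimal illustration under the assumption that samples are ordered in time; the function name, the `delay` parameter, and the toy values are hypothetical and do not come from [86].

```python
def time_delayed_dataset(expr, states, target, delay=1):
    """Pair each sample (without the target gene) with the target's later state.

    expr[j][g]   : expression of gene g in sample j (samples ordered in time)
    states[g][j] : discrete state of gene g in sample j
    Returns (inputs, labels), where labels[j] is the state of the target gene
    `delay` samples after inputs[j].
    """
    n_samples = len(expr)
    inputs, labels = [], []
    for j in range(n_samples - delay):
        # Input: expression of every gene except the target in sample j.
        inputs.append([v for g, v in enumerate(expr[j]) if g != target])
        # Label: state of the target gene in sample j + delay.
        labels.append(states[target][j + delay])
    return inputs, labels


expr = [
    [0.1, 0.9, 0.4],
    [0.8, 0.2, 0.5],
    [0.7, 0.3, 0.6],
    [0.2, 0.8, 0.1],
]
states = [[0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0]]
X, y = time_delayed_dataset(expr, states, target=2, delay=1)
```

With `delay=1` the last sample has no successor, so one input row is lost; a larger delay shortens the data set accordingly.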

Another approach was proposed in [54]. The authors introduce a method that allows the expression of a target gene at time t + 1 to depend on the expression of other genes at the time points {t, t − 1, . . . , t − (W − 1)}. For each target gene, its time-delayed gene expression profile is constructed. Then, a decision tree is used to discover the time-delayed regulations that modulate the activity of the target gene (see Figure 36.5 for an example).

[Figure 36.5: classification tree with test nodes CLB1(t − 1) and HCT1(t), arc thresholds 0.5 and 0.7, and leaves labeling CDC20(t + 1) as up regulated or down regulated.]

FIGURE 36.5 Possible classification tree for gene CDC20 of S. cerevisiae: CDC20 is the target gene; CLB1 and HCT1 are the regulatory genes. Expression thresholds of the respective explaining genes mark all the arcs.


TABLE 36.3 Di = (TdE, Ci) Matrix for Target Gene gi

         g1                                              gN
t + 1    t − (W − 1)  · · ·  t − 1      t        · · ·   t − (W − 1)  · · ·  t − 1      t          Ci
W + 1    d11          · · ·  d1(W−1)    d1W      · · ·   dN1          · · ·  dN(W−1)    dNW        Ci(W+1)
W + 2    d12          · · ·  d1W        d1(W+1)  · · ·   dN2          · · ·  dNW        dN(W+1)    Ci(W+2)
...      ...                 ...        ...              ...                 ...        ...        ...
M        d1(M−W)      · · ·  d1(M−2)    d1(M−1)  · · ·   dN(M−W)      · · ·  dN(M−2)    dN(M−1)    CiM

Note: The genes g1, . . . , gN are the putative regulatory genes to be assessed. The dkl values are the temporal transcription values of these genes (gene gk at time point l), and Ci denotes the phenotype (state) vector for the target gene gi at the time points t + 1 = W + 1, . . . , M.

A time-delayed gene expression profile (TdE) is an (M − W) × (N × W) matrix, where each W-column block in the N × W columns represents the activities of one of the N (regulating) genes at time points t, t − 1, . . . , t − (W − 1), and each row is therefore an (N × W)-dimensional vector. As the value of t changes from W to M − 1 (that is, as the time window slides from its first to its (M − W)th position), M − W such vectors or samples are produced. Next, it is necessary to set up the corresponding phenotype (label) for each sample, which is determined by the state of the target gene gi at time t + 1. Finally, the complete data set of time-delayed gene expression profiles for the target gene is denoted by Di = (TdE, Ci), where Ci is a column vector of states for gene gi. The Di = (TdE, Ci) matrix for the target gene gi is given in Table 36.3.
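The construction of Di = (TdE, Ci) can be sketched in a few lines. The sketch assumes that each gene profile is a list of M values ordered in time and that the target states are already discretized; the function name `build_tde` and the toy data are hypothetical.

```python
def build_tde(profiles, target_states, W):
    """Build the time-delayed expression matrix and its label vector.

    profiles[g]   : list of M expression values of regulator gene g
    target_states : list of M discrete states of the target gene
    W             : window length (number of past time points per gene)
    Returns (rows, labels) with M - W rows of N * W values each.
    """
    M = len(target_states)
    rows, labels = [], []
    # t is the 1-based "current" time point; it runs from W to M - 1.
    for t in range(W, M):
        row = []
        for series in profiles:
            # 0-based slice [t - W, t) covers 1-based time points t-(W-1)..t.
            row.extend(series[t - W:t])
        rows.append(row)
        # 0-based index t is the 1-based time point t + 1.
        labels.append(target_states[t])
    return rows, labels


profiles = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]]   # N = 2 genes, M = 5
target_states = [0, 0, 1, 1, 0]
rows, labels = build_tde(profiles, target_states, W=2)
```

Each row can then be handed to any decision tree learner together with its label; the choice of learner is independent of this preprocessing step.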

36.3.3 Bayesian Networks

A Bayesian network is a representation of a joint probability distribution as a directed acyclic graph (DAG) [91, 92]. The vertices of a DAG correspond to random variables [V1, . . . , VN] and the edges correspond to parent–child dependencies among variables. The random variables may be either discrete or continuous valued. In the context of GRNs, Vi represents the expression level of gene gi, and the edges of the DAG represent the relations among genes. Thereby, a Bayesian network can be represented by a list of ARs that correspond to parent–child dependencies among variables. The joint probability distribution can thus be written in the simple product form

$$P[V_1, \ldots, V_N] = \prod_{i=1}^{N} P[V_i \mid \mathrm{Pa}(V_i)],$$

where Pa(Vi) denotes the set of parents of Vi in the DAG.
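As a worked illustration of the product form, the following sketch evaluates the joint probability of a full assignment in a three-gene network with binary variables. The network structure, the conditional probability values, and the function name are made up for the example.

```python
from itertools import product


def joint_probability(assignment, parents, cpt):
    """Multiply P[V_i | Pa(V_i)] over all variables for a full 0/1 assignment.

    parents[v] : tuple of parent names of v
    cpt[v][k]  : P[v = 1 | parents take the values in tuple k]
    """
    p = 1.0
    for var, pa in parents.items():
        key = tuple(assignment[q] for q in pa)
        p_one = cpt[var][key]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p


# Hypothetical DAG: g1 -> g2, g1 -> g3, g2 -> g3.
parents = {"g1": (), "g2": ("g1",), "g3": ("g1", "g2")}
cpt = {
    "g1": {(): 0.6},
    "g2": {(0,): 0.2, (1,): 0.9},
    "g3": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.8},
}

# P[g1 = 1, g2 = 1, g3 = 0] = 0.6 * 0.9 * (1 - 0.8)
p = joint_probability({"g1": 1, "g2": 1, "g3": 0}, parents, cpt)

# The probabilities of all 2^3 assignments add up to one.
total = sum(
    joint_probability(dict(zip(parents, bits)), parents, cpt)
    for bits in product((0, 1), repeat=3)
)
```

Only seven numbers are stored here instead of the 2^3 − 1 needed for an unconstrained joint distribution; the saving grows quickly when each gene has few parents.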

Bayesian networks have a number of features which make them attractive candidates for modeling gene expression data: They can handle noisy or missing data, handle hidden variables such as protein levels which may have an effect on mRNA steady-state levels, describe locally interacting processes, and support causal inferences from the derived models. Friedman et al. [92] proposed modeling a gene network as a Bayesian network: Each gene is a vertex and each regulatory relationship is an edge in the Bayesian network. As learning the structure of such a sparse network is technically difficult, Friedman et al. proposed a two-step algorithm, the sparse candidate algorithm, to learn the structure and parameters: For each gene, (1) some candidate parents that are likely to be parents of the target gene are selected and (2) the Bayesian score for every possible subset of the candidate parent set is computed and the best combination is searched for. In the first step, a general pairwise dependence measure, such as mutual information (MI), is applied to find the genes with high dependence on the target gene. However, some dependences cannot be measured by MI. Thus, so-called weak parents arise, that is, genes that are parents of a target gene but do not have a high dependence with it. For this reason, the work uses the Kullback–Leibler (KL) divergence, which can be improved iteratively by using the learned network as prior knowledge in the iterated learning process, to find better dependences between gene pairs. The second step can be carried out by a heuristic method such as hill climbing [92]. Friedman et al. showed that the results obtained by the sparse candidate algorithm are biologically meaningful by examining them with a set of statistical measurements: robustness test, order relation, Markov relation, and so on. Since then, many works based on the Bayesian network framework have been proposed, and biologically relevant results have been obtained. Hartemink et al. [93] extended Friedman’s work by adding annotations to edges: +, −, or +/−, which represent positive, negative, or unknown regulation. Beal et al. [94] proposed including unmeasured genes as hidden factors when learning a gene network and implementing this step with state-space models (SSMs). Lee and Lee [95] proposed a modularized learning approach based on the assumption that most genes are more likely to be related to other genes in the same biological module than to genes in different modules. They proposed finding overlapping modules in the genes and learning the subnetworks within modules with a Bayesian network. Zhou et al. [96] proposed constructing probabilistic GRNs that emphasize network topology using a reversible jump Markov chain technique. Rogers and Girolami [97] proposed inferring the regulatory networks by a Bayesian regression approach, which works with continuous variables directly.

Bayesian networks also have the disadvantage of excluding dynamical aspects of gene regulation since they need to be acyclic graphs. To some extent, this can be overcome through generalizations like dynamic Bayesian networks (DBNs), which allow feedback relations between genes in a network. A DBN is a Bayesian network which has been temporally “unrolled.” Typically, the variables are viewed as entities whose value changes over time. However, if the variables are considered as constant (as in the hidden Markov model), it is possible to represent, for example, $g_i$ at $t$ and $g_i$ at $t + 1$ using two different variables, say $g_i^t$ and $g_i^{t+1}$. Assuming that the conditional dependencies cannot point backward or “sideways” in time, the graph must be acyclic, even if $g_i$ autoregulates. Also, assuming that the conditional dependencies are constant over time and that the prior joint distribution is the same as the temporal joint distribution [98], the network only needs to be unrolled for one time step. An example is provided in Figure 36.6.

Murphy and Mian [99] and Göransson and Koski [100] used the DBN to model gene networks. In this model, a gene at a time point is regulated by its parents at the previous time point. Thus, the acyclic limitation of the Bayesian network is overcome in the DBN. Murphy and Mian gave a thorough report in [99] on the application of the DBN in learning gene networks. Imoto et al. [91] and Kim et al. [101] further extended Bayesian networks and DBNs by integrating nonparametric regression into the models, so that the methods can use continuous gene expression values instead of the discrete values in the general Bayesian network approaches. Their method is capable of capturing the nonlinear relationships among genes. Yu et al. [102] presented an influence score to measure the magnitude of the regulatory strength of the edges. This score is useful for eliminating false positives as well as for distinguishing positive from negative regulation of edges. With more and more works using Bayesian networks as the framework to tackle the gene network reconstruction problem [17, 103–105], the Bayesian network is becoming a widely used approach in learning gene networks.


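[Figure 36.6: (a) network over genes g1, g2, and g3 containing a cycle; (b) unrolled network linking the gene variables at time t to the gene variables at time t + 1.]

FIGURE 36.6 (a) Cyclic Bayesian network impossible to factorize and (b) equivalent acyclic DBN. The prior network [98] is not shown.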

An understanding of causal relationships in a network is crucial for determining the impact of interventions at the genetic level and for performing counterfactual reasoning that leads to finding “causes.” In general, dependence relations in Bayesian networks do not give unique causal inferences: There are multiple graphs that yield the same joint distribution. Measurements of gene expression, in the absence of interventions, are insufficient to uniquely determine the underlying causal mechanisms. Recently, a few studies provided methods for uniquely inferring causal mechanisms for certain cases of Bayesian networks based on perturbation data [106–108]. However, most research on reverse engineering of GRNs by either Boolean or differential equation-based models does not take the “causal” aspect of gene connections into consideration. In wet laboratories, causal relationships between genes can be learned by knocking out all possible subsets of genes of a given set and studying the impact on the other genes in the set. This is often not feasible when the number of genes in the set is more than a handful. An alternative approach is to use time-series gene expression data. Unfortunately, such data can only be obtained for cells of particular organisms such as yeast. For human tissues, high-throughput gene expression data are generally only available for the steady state. Therefore, how to infer causal relationships between genes from steady-state data is an open question for researchers in this field.

Modern variations of Bayesian networks have also added new capabilities, particularly fuzzy Bayesian networks. These range from specialized techniques designed to reduce the complexity of hybrid Bayesian network (HBN) belief propagation with fuzzy approximations to more general formalizations [53, 109–111]. General formalizations allow variables in Bayesian networks to take fuzzy states, with all of the advantages in robustness, comprehensibility, and dimensionality reduction this provides [98, 112].

36.3.3.1 Time-Delayed GRNs with Bayesian Networks In Tiefei [113], a time-delayed Bayesian network was proposed to model a GRN, which can capture various time-delayed relationships as well as discover directed loops spanning at least one time slice. The time-delayed Bayesian network is defined as follows: Let W be the maximum time delay allowed for each regulation. A time-delayed Bayesian network can be described by T = ⟨G, θ, δ⟩, in which G = ⟨V, E⟩ is a directed graph, where V = {V1, V2, . . . , Vn} is the set of variables of G and E is the set of directed edges of G. Each variable Vi represents a gene, and each edge (Vi, Vj) represents the regulation process from Vi to Vj. For every edge (Vi, Vj) ∈ E, δ(Vi, Vj) represents the unique time delay for the edge (Vi, Vj). Note that δ(Vi, Vj) is an integer and 0 ≤ δ(Vi, Vj) ≤ W. Assume θ is the parameter set of G that stores the conditional probability distribution P(Vi | Pa(Vi)) for every Vi ∈ V, where Pa(Vi) is the parent set of Vi in G. A directed cycle is allowed if at least one of its edges has time delay ≥ 1. Figure 36.7a shows an example of a directed cycle with four genes in a time-delayed Bayesian network.

In order to learn time-delayed Bayesian networks, a relationship between them and the traditional Bayesian network can be established [113]. Given a maximum time delay W, a variable at a time slice can only be affected by variables in the current time slice and the previous W time slices. For each variable $V_i$, let $V_{i,0}, V_{i,1}, \ldots, V_{i,W-1}, V_{i,W}$ be its states in the previous W time slices and the current time slice. Learning whether the edge $(V_j, V_i)$ has a time delay $\tau$ is equivalent to learning whether $(V_{j,W-\tau}, V_{i,W})$ is an edge. The formal transformation is described as follows: Given a time-delayed network $T = \langle G, \theta, \delta \rangle$, where $G = \langle V, E \rangle$, with the maximum time delay W, T can be represented using a traditional network $U = \langle H, \theta' \rangle$ such that $H = \langle V', E' \rangle$, where $V'$ is the vertex set and $E'$ is the edge set. Assume $V' = \{V_{i,t} \mid V_i \in V,\ t = 0, 1, \ldots, W\}$. Thus, each vertex $V_i \in V$ is transformed into W + 1 vertices $\{V_{i,0}, \ldots, V_{i,W}\}$. Consider a variable $V_i \in V$, with $\mathrm{Pa}(V_i) = \{V_{i_1}, \ldots, V_{i_s}\}$ being the parent set of $V_i$ in G. In H, the variable $V_{i,W}$ has parents $V_{i_1,(W-\tau_1)}, V_{i_2,(W-\tau_2)}, \ldots, V_{i_s,(W-\tau_s)}$, where $\tau_j$ is the time delay $\delta(V_{i_j}, V_i)$ associated with the edge from $V_{i_j}$ to $V_i$. In the parameter set $\theta'$, the conditional probability distribution $P(V_{i,W} \mid V_{i_1,(W-\tau_1)}, V_{i_2,(W-\tau_2)}, \ldots, V_{i_s,(W-\tau_s)})$ of U is the same as the conditional probability distribution $P(V_i \mid V_{i_1}, \ldots, V_{i_s})$ of T.
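The transformation can be sketched as a small function that maps every delayed edge to an edge between time-indexed vertices. The four-gene cycle mirrors the structure described for Figure 36.7, but the assignment of delays to individual edges is an assumption made for the example, as are the function and variable names.

```python
def unroll(delays, W):
    """Turn a time-delayed graph into a graph over (gene, time slice) vertices.

    delays : dict mapping an edge (src, dst) to its integer delay in [0, W]
    Returns (vertices, edges); an edge with delay d becomes
    ((src, W - d), (dst, W)).
    """
    genes = sorted({g for edge in delays for g in edge})
    vertices = [(g, t) for g in genes for t in range(W + 1)]
    edges = []
    for (src, dst), d in delays.items():
        if not 0 <= d <= W:
            raise ValueError(f"delay {d} outside [0, {W}]")
        edges.append(((src, W - d), (dst, W)))
    return vertices, edges


# Directed cycle V1 -> V2 -> V3 -> V4 -> V1 with illustrative delays.
delays = {
    ("V1", "V2"): 1,
    ("V2", "V3"): 2,
    ("V3", "V4"): 1,
    ("V4", "V1"): 2,
}
vertices, edges = unroll(delays, W=2)
```

Because every delay in this example is at least 1, each transformed edge points from an earlier slice to slice W, so no directed cycle can remain.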

Figure 36.7 shows an example of the transformation reported in [113]. It can be easily verified that the transformed network U is a directed acyclic graph and that U contains all the parameters of T. Once the network U is learned, the parameters of the network T can be easily recovered. Additionally, if the maximum time delay W = 0, the time-delayed network is a traditional Bayesian network, and if W = 1, the time-delayed network is a dynamic Bayesian network. A related approach is the k-DBN model, which was proposed in [114] for finding hidden variables in a network. Though k-DBN was not used for learning causal relationships as is the case in a gene network, it can be extended to learn the structure of a gene network, allowing more than one edge with different time delays from gene gi to gene gj.

FIGURE 36.7 Example of network transformation. (a) The time-delayed network contains 4 variables and 4 edges. The integer on each edge indicates the time delay, and the maximum time delay W is assumed to be 2. This network has one cycle: V1 → V2 → V3 → V4 → V1. (b) The transformed network contains 12 variables and 4 edges. Each variable Vi is transformed into 3 variables: Vi,0, Vi,1, and Vi,2. The edge (Vi, Vj) with time delay τ is transformed into the edge (Vi,W−τ, Vj,W). After the transformation, no cycle exists.

FIGURE 36.8 A Boolean network. For clarity, each f ∈ F has been put into a node. Normally the functions are implicit in the edges among N; f1 = ¬g1 ∧ g2 ∧ g3, f2 = ¬g1 ∨ g2 ∧ g3, f3 = g1 ∨ g3.

36.3.4 Boolean Networks

Boolean network models, originally introduced by Kauffman [115, 116], can provide useful insights into network dynamics at a coarse level. In a Boolean network, an entity can attain two alternative levels: active (1) or inactive (0). For example, a gene can be described as expressed or not expressed at any time. The level of each entity is updated according to the levels of several entities via a specific Boolean function. The 0–1 vector that describes the levels of all entities is called a system state or global state. It is assumed to change synchronously such that at every time step the level of each entity is determined according to the levels of its regulators at the previous time step and to the regulation function. Figure 36.8 is an example of a Boolean network. In many cases, the regulatory relationships between network components have not been established and therefore need to be derived from experimental data. For any entity under a Boolean network model, both its regulators and a regulatory function that is consistent with a set of gene expression profiles can be found efficiently provided that the number of regulators of each entity does not exceed a set limit [117].
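The synchronous update can be made concrete with the three functions given in the caption of Figure 36.8. The sketch assumes that the leading sign on g1 denotes logical negation and that ∧ binds more tightly than ∨; the function names are hypothetical.

```python
from itertools import product


def step(state):
    """Synchronous update of the global state (g1, g2, g3)."""
    g1, g2, g3 = state
    f1 = int((not g1) and g2 and g3)      # NOT g1 AND g2 AND g3
    f2 = int((not g1) or (g2 and g3))     # NOT g1 OR (g2 AND g3)
    f3 = int(g1 or g3)                    # g1 OR g3
    return (f1, f2, f3)


def trajectory(state, n_steps):
    """Global states visited from `state`, including the start state."""
    states = [state]
    for _ in range(n_steps):
        state = step(state)
        states.append(state)
    return states


path = trajectory((1, 0, 0), 4)

# Transition table over all 2^3 global states.
transitions = {s: step(s) for s in product((0, 1), repeat=3)}
```

Starting from (1, 0, 0), the network passes through (0, 0, 1) and then alternates between (0, 1, 1) and (1, 1, 1). The transition table has 2^N entries, which is why exhaustive analysis of the dynamics is only practical for small N.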

Boolean networks do not correctly model the dynamics of a transcription factor that down regulates its own expression due to the model’s limited level of detail [118]. Another problem is that it is computationally expensive to analyze the dynamics of large networks, as the number of global states is exponential in the number of entities. However, when the number of entities is small and only qualitative knowledge is available, Boolean networks can provide important insights, such as the existence and nature of steady states or network robustness. Furthermore, for modeling large-scale genetic regulatory systems, Boolean networks may represent the only practical alternative [112].

Microarray data exhibit uncertainty on several levels, as stated before. First, there is biological uncertainty in the form of intrinsic and extrinsic noise. Second, there is experimental noise due to the complex measurement process, ranging from hybridization conditions to microarray image processing techniques. Third, there may be interacting latent variables, such as proteins, various environmental conditions, or other genes that are not measured, which are further sources of variability in the measurements. To address the uncertainty, Shmulevich et al. [119] introduced probabilistic Boolean networks (PBNs) by associating several predictors with each target gene. If target gene $g_i$ has $l(i)$ associated predictor functions, $f_1^{(i)}, f_2^{(i)}, \ldots, f_{l(i)}^{(i)}$, then at each point in time t one of these functions is selected to form the transition rule for $g_i$ at time t + 1. Clearly, if $l(i) = 1$ for all i = 1, 2, . . . , N, the PBN simply reduces to a standard Boolean network. The basic building block of a PBN is shown in Figure 36.9. The wiring diagram for the entire PBN consists of N such building blocks. Conceptually, the probabilistic predictor of each target gene can be thought of as a random switch, where at each time point in the network the function $f_k^{(i)}$ is chosen with probability $c_k^{(i)}$ to predict gene $g_i$. One way to assign these probabilities is to employ the coefficient of determination (CoD) [119], normalized in such a way that $\sum_{k=1}^{l(i)} c_k^{(i)} = 1$. That is, $c_k^{(i)} = \theta_k^{(i)} / \sum_{j=1}^{l(i)} \theta_j^{(i)}$, where $\theta_k^{(i)}$ is the CoD for the target gene $g_i$ relative to the genes used as inputs to predictor $f_k^{(i)}$.

[Figure 36.9: the genes g1, g2, g3, . . . , gN feed the predictors $f_1^{(i)}, f_2^{(i)}, \ldots, f_{l(i)}^{(i)}$, one of which is selected with probabilities $c_1^{(i)}, c_2^{(i)}, \ldots, c_{l(i)}^{(i)}$ to determine gi.]

FIGURE 36.9 Basic building block of PBN.
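The normalization of the CoD values and the random switch can be sketched as follows. The two predictors, the CoD values, and the function names are hypothetical; in practice the CoDs would be estimated from expression data as in [119].

```python
import random


def selection_probabilities(cods):
    """Normalize CoD values so that they sum to 1: c_k = theta_k / sum_j theta_j."""
    total = sum(cods)
    if total <= 0:
        raise ValueError("at least one CoD must be positive")
    return [theta / total for theta in cods]


def next_value(state, predictors, probs, rng):
    """Random switch: draw one predictor with the given probabilities and apply it."""
    chosen = rng.choices(predictors, weights=probs, k=1)[0]
    return chosen(state)


# Two candidate predictors for one target gene with inputs g1 and g2.
predictors = [
    lambda s: s["g1"] & s["g2"],   # AND of the inputs
    lambda s: s["g1"] | s["g2"],   # OR of the inputs
]
cods = [0.6, 0.2]                  # made-up CoD values
probs = selection_probabilities(cods)

rng = random.Random(0)             # fixed seed for reproducibility
state = {"g1": 1, "g2": 0}
samples = [next_value(state, predictors, probs, rng) for _ in range(1000)]
```

With these values the AND predictor is selected with probability 0.75 and the OR predictor with probability 0.25. Passing a single predictor with probability 1 makes the update deterministic, which corresponds to the case l(i) = 1 of a standard Boolean network.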

Within the context of PBNs, Hashimoto et al. [120] have developed a method to grow a network starting from a small number of genes of interest, or seed genes. The proposed algorithm is flexible and permits various design choices regarding how to proceed, such as the measure of connection strength between genes, the search protocol, and the stopping conditions. As an example, the CoD [119] can be used as the strength-measuring function. Identifying the seed genes of interest is a critical step in this algorithm. The seed genes are usually selected with the aid of prior biological knowledge.

While good at abstracting uncertainty in biological systems, the PBN model fails to describe the context-specific determinism of regulatory systems. A context can be defined as a certain condition under which a limited number of genes are tightly regulated by each other for a specific cellular mechanism or a specific task. This specific task can be a different developmental stage or a tissue-specific function, resulting in a specific cell type. A change of context will result in a change in the set of genes that are highly interactive and probably in their connectivity and relationships. Different biological contexts can also correlate with different diseases or might be a reason why certain patients respond to a therapy while others do not. Li et al. [121] developed a context-sensitive Boolean network (cBN) model to describe the behavior of cellular systems. A cBN can be considered as a constrained PBN, where the constraint lies in the way the probabilities are assigned in the model. The rule inference is based on the assumption that the inferred rules and the observations are consistent within a (given) context. Figure 36.10 shows an example of a cBN that contains two contexts and 15 genes.

FIGURE 36.10 Example of cBN with two contexts.

36.3.4.1 Time-Delayed GRNs with Boolean Networks In order to deal with time delay in gene regulation, Silvescu and Honavar [10] proposed an algorithm that uses time-series data to find temporal Boolean networks (TBoNs). TBoNs were developed to model regulatory delays, which may come about due to missing intermediary genes and spatial or biochemical delays between transcription and regulation, as stated before. An example of a temporal Boolean network is presented in Figure 36.11.

A TBoN is very similar to a regular Boolean network except that the functions f ∈ F can refer to past gene expression levels. Rather than depending just on $N_t$ to infer $N_{t+1}$, parameters to $f_i$ can be annotated with an integer temporal delay. For example, $g_3 = f_3(g_1, g_2, g_3) = g_1^{0} \lor g_3^{-2}$ means $g_3$ is expressed at t + 1 if $g_1$ at t is expressed or $g_3$ was expressed at time t − 2. The TBoN can also be reformulated and inferred as a decision tree.
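The delayed function from the example above can be evaluated on a short state history as follows. The representation of the history as a list of dictionaries and the helper names are assumptions made for this sketch.

```python
def delayed_value(history, t, gene, delay):
    """State of `gene` at time t - delay, where history[k] is the state at time k."""
    if t - delay < 0:
        raise IndexError("not enough past time points for this delay")
    return history[t - delay][gene]


def f3(history, t):
    """g3 at time t + 1: g1 at time t (delay 0) OR g3 at time t - 2 (delay 2)."""
    return delayed_value(history, t, "g1", 0) | delayed_value(history, t, "g3", 2)


history = [
    {"g1": 0, "g2": 1, "g3": 1},   # t = 0
    {"g1": 0, "g2": 0, "g3": 0},   # t = 1
    {"g1": 0, "g2": 0, "g3": 0},   # t = 2
]
g3_next = f3(history, 2)
```

Here g3 is switched on at time 3 although no gene is expressed at time 2, because g3 was expressed at time 0. A regular Boolean network, which only sees the state at time t, could not represent this dependency.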

FIGURE 36.11 Temporal Boolean network. Presentation and functions are as in Figure 36.8, but delays are shown in brackets between genes and functions. The default delay if no annotation is present is assumed to be zero.


36.3.5 Other Techniques

Over the past few years, other approaches have been proposed that do not fit into the previous classification. However, these methods can also be used to infer gene ARs from microarray data. These techniques are briefly described in the following sections.

36.3.5.1 Clustering One of the main problems that hinder research on gene network reconstruction is the dimension problem; that is, there are many genes with only a few replicates. A useful approach is to group genes with similar expression patterns into clusters and then infer the regulatory relationships among the clusters [113]. Researchers believe that genes with similar expression patterns have similar functions or are involved in the same biological events [122]. Currently, several clustering methods are used for this purpose. Different clustering methods can generate very different results. Each combination of distance measure and clustering algorithm tends to emphasize a different type of regularity in the data. There is no single criterion for choosing the best clustering method; the choice depends on the particular emphasis of the study.

Given clusters, there are also several methods to find the interactions among them. Chen et al. [123] reduced 3131 yeast genes into 308 clusters by average linkage clustering. Then, they used simulated annealing to optimize a qualitative network based on the timing of peaks in the data. Wahde and Hertz [122] clustered 65 genes from rat spinal cord and hippocampus data sets [124] into four “waves” using Fitch’s hierarchical clustering algorithm [125]. Then, by means of a genetic algorithm, they built a four-node continuous-time recurrent neural network. Someren et al. [126] reduced 2467 yeast genes into t − 1 clusters and represented each cluster by a “prototype” gene calculated from the cluster. A linear model of the prototype genes is then generated by linear regression. Toh and Horimoto [11] proposed averaging the gene expression values of each cluster and then discovered the regulatory relationships by graphical Gaussian modeling (GGM). Guthke et al. [127] proposed grouping genes into clusters and then finding representative genes for the clusters. The connections among the representative genes are modeled by differential equations.

36.3.5.2 Pairwise Methods Pairwise methods seek to discover the relationships among genes solely by pairwise comparisons. They do not consider interactions where the expression of one gene results from the combined effects of multiple other genes. Arkin et al. [128] proposed correlation metric construction (CMC), which quantifies the association between the genes of each pair by cross-correlation. A distance matrix is constructed for each gene pair by comparing their similarities to other genes. Then a diagram is constructed to summarize the strength of interaction and predict mechanistic connections between the genes. Chen et al. [123] proposed activation/inhibition networks to find regulation based on whether peaks in one signal precede peaks in another signal. First, the genes with similar expression profiles are grouped. Then a prototype is generated for each group of genes by averaging the expression values of the genes in the group. Each prototype represents a group of genes with similar expression patterns and is represented as a series of peaks. The correlations between prototype pairs are calculated to determine the type of regulatory relationship (activation, inhibition, or unmatched) and to measure the strength of the regulatory relationship between any two prototypes. Finally, the regulation matrix is generated from the scores. Ponzoni et al. [129] proposed a machine learning algorithm called GRNCOP, based on combinatorial optimization, that does not assume arbitrary or uniform discretizations of the gene expression values. The thresholds are calculated dynamically by applying the same continuous-valued attribute discretization techniques as those used for classification algorithms based on decision trees. Then, each possible pair of genes is evaluated, and an AR with a particular accuracy based on an objective function is obtained. Finally, only the rules that achieve an accuracy value over a preselected threshold are reported. Gallo et al. [7] proposed an extension of this approach allowing the inference of ARs with multiple time delays and from multiple data sources.
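A pairwise score with a time lag can be sketched with a lagged Pearson correlation. This is only a generic illustration of a pairwise measure; it is not the CMC procedure of [128] nor the GRNCOP objective function of [129], and the function names and toy profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lagged_correlation(x, y, lag):
    """Correlation between x(t) and y(t + lag) for a non-negative lag."""
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])


# y repeats the profile of x one time point later.
x = [0, 1, 0, -1, 0, 1, 0, -1]
y = [-1, 0, 1, 0, -1, 0, 1, 0]
same_time = lagged_correlation(x, y, 0)
one_step = lagged_correlation(x, y, 1)
```

In this toy case the two profiles are uncorrelated at lag 0 but perfectly correlated at lag 1, which is the kind of relation a method restricted to simultaneous comparisons would miss.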

36.3.5.3 Support Vector Machine Methods Support vector machine (SVM) methods have attracted great interest within the bioinformatics community in recent years due to their good prediction performance on various tasks. They rely on principles from statistical learning theory [130]. The idea is to construct an optimal hyperplane between two classes, +1 and −1, such that the margin, that is, the distance of the hyperplane to the point closest to it, is maximized. To allow for nonlinear classification, so-called kernel functions are employed, which can be thought of as special similarity measures. They implicitly map the original data into some high-dimensional feature space in which the optimal hyperplane can be found. As an example, suppose that linear kernels k(x, x′) = ⟨x, x′⟩ as well as polynomial kernels of degree 2, k(x, x′) = ⟨x, x′⟩², are considered, where x and x′ are vectors of the expression levels of all genes except gene gi in two samples. The polynomial kernel implicitly computes all pairwise products between the expression levels of two genes. In this way, not only linear but also nonlinear dependencies between gene expressions can be captured. In addition to a kernel function, a soft margin parameter C has to be fixed. Guyon et al. [131] proposed an algorithm called recursive feature elimination (RFE), which is capable of determining, for each gene gi, which genes are best suited to predict its state. This algorithm successively eliminates the gene that influences the size of the margin least. The termination of this procedure can be determined by a 10-fold cross-validation.
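The statement that the degree-2 polynomial kernel implicitly computes all pairwise products can be checked numerically. The sketch compares the kernel value with the inner product of explicit feature vectors made of all ordered pairwise products; the function names and the two toy expression vectors are hypothetical.

```python
from itertools import product


def dot(x, y):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: <x, y> squared."""
    return dot(x, y) ** 2


def pairwise_products(x):
    """Explicit feature map: all ordered products x_a * x_b."""
    return [a * b for a, b in product(x, repeat=2)]


x = [1, 2, 3]        # expression levels in one sample
y = [4, 5, 6]        # expression levels in another sample
implicit = poly2_kernel(x, y)
explicit = dot(pairwise_products(x), pairwise_products(y))
```

Both values coincide, yet the kernel needs only one inner product of length n, whereas the explicit map has n² components. This is the reason the feature space is never constructed explicitly.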

36.4 CONCLUDING REMARKS

The development of computational methods for GRN modeling is a hot research topic. In this chapter, a specific family of algorithms for extracting ARs among genes is reviewed. The reverse engineering of GRNs from ARs has an important methodological advantage: It allows a model-free reconstruction of the networks. In other words, these techniques do not require, in general, any constraint or previous knowledge about the structure of the relationship network, nor do they make assumptions about the physicochemical principles that govern the gene interactions. These methods only need the gene expression information as data source for the inference process.

All of the techniques reviewed here come from diverse data-mining approaches, but most of them share common aspects, such as some preprocessing steps. In particular, the discretization of the gene expression data constitutes a central point for these methods, with important semantic implications. As described, there are several algorithms for addressing this problem, which range from simple and straightforward arbitrary discretizations to elaborate adaptive methods. Moreover, additional complex aspects, which emerge as part of this preprocessing step when transitional states are modeled from time-series data, were also presented.

Regarding the inference methodologies, a wide variety of techniques were illustrated here, such as frequent-itemset-based methods, classification and regression trees, Bayesian networks, Boolean networks, SVMs, clustering approaches, and some pairwise algorithms. For most of these data-mining approaches, several algorithms were reviewed, emphasizing the advantages and limitations of these contributions.

Another relevant point is the inference of temporal associations among genes. This point was addressed transversally throughout the chapter, illustrating how different data-mining methods include these types of time delay rules. Additional topics such as the cardinality model for the rules, the statistical and biological validation of the network, and how to extract associations from multiple data sources were also analyzed in detail.

Finally, an important question probably remains in the reader’s mind: How realistic can these discretized regulatory models be? At this point, as a corollary of this chapter, it is important to remark that in any network modeling methodology there is awareness and acceptance that a model describes only some properties of the real biological system and ignores many others. In other words, a model emphasizes particular aspects of reality, leaving out details that are not always relevant for the purpose of the study. In this context, the AR inference algorithms constitute a valuable tool for understanding and discovering potential hidden relationships among genes, and the discretization of the gene expression values into states allows the modeler to focus on the relevant information patterns. Therefore, the discretized view of the data can help to capture the gene behavior in a more easily interpretable way. Nevertheless, a realistic reconstruction of the complex regulatory machinery that operates in the cell will need to attack the problem from different perspectives and with complementary computational approaches. Moreover, the final biological validation of each novel association will always be required in order to obtain a feasible and reliable network.

ACKNOWLEDGMENTS

This work was supported by Research Project PIP 112-2009-0100322 funded by CONICET (National Research Council of Argentina) and by Research Project PGI 24/ZN15 funded by the Universidad Nacional del Sur (Bahía Blanca, Argentina).

REFERENCES

1. H. Kitano. Perspectives on systems biology. New Generat. Comput., 18:199–216, 2000.

2. K. H. Cho, S. M. Choo, S. H. Jung, J. R. Kim, H. S. Choi, and J. Kim. Reverse engineering of gene regulatory networks. IET Syst. Biol., 1:149–163, 2007.

3. M. E. Csete and J. C. Doyle. Reverse engineering of biological complexity. Science, 295:1664–1669, 2002.

4. A. Ceglar and J. F. Roddick. Association mining. ACM Comput. Surv., 38:2, 2006.

5. C. Creighton and S. Hanash. Mining gene expression databases for association rules. Bioinformatics, 19:79–86, 2003.

6. P. Carmona-Saez, M. Chagoyen, A. Rodriguez, O. Trelles, J. M. Carazo, and A. Pascual-Montano. Integrated analysis of gene expression by association rules discovery. BMC Bioinformatics, 7:54, 2006.

7. C. A. Gallo, J. A. Carballido, and I. Ponzoni. Discovering time-lagged rules from microarray data using gene profile classifiers. BMC Bioinformatics, 12(123):1–21, 2011.

8. D. C. McShan, M. Updadhayaya, and I. Shah. Symbolic inference of xenobiotic metabolism. In R. B. Altman, A. K. Dunker, L. Hunter, T. A. Jung, and T. E. Klein (Eds.), Pacific Symposium Biocomputing. World Scientific, Hawaii, 2004, pp. 545–556.

9. K. Sterelny and P. E. Griffiths. Sex and Death: An Introduction to Philosophy of Biology. Science and Its Conceptual Foundations Series. University of Chicago Press, Chicago, IL, 1999.

10. A. Silvescu and V. Honavar. Temporal Boolean network models of genetic networks and their inference from gene expression time series. Complex Syst., 13:54–70, 2001.

11. H. Toh and K. Horimoto. Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics, 18(2):287–297, 2002.

12. C. Fogelberg and V. Palade. Machine learning and genetic regulatory networks: A review and a roadmap. In A. Abraham, A. E. Hassanien, A. Vasilakos, W. Pedrycz, F. Herrera, P. Siarry, A. de Carvalho, and A. P. Engelbrecht (Eds.), Foundations of Computational Intelligence. Springer, Heidelberg, 2009, pp. 3–34.

13. G. Giaever et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature, 418(6896):387–391, 2002.

14. P. Y. Lum, C. D. Armour, S. B. Stepaniants, G. Cavet, M. K. Wolf, J. S. Butler, J. C. Hinshaw, P. Garnier, G. D. Prestwich, A. Leonardson, P. Garrett-Engele, C. M. Rush, M. Bard, G. Schimmack, J. W. Phillips, C. J. Roberts, and D. D. Shoemaker. Discovering modes of action for therapeutic compounds using a genome-wide screen of yeast heterozygotes. Cell, 116(1):121–137, 2004.

15. A. H. Tong, M. Evangelista, A. B. Parsons, H. Xu, G. D. Bader, N. Page, M. Robinson, S. Raghibizadeh, C. W. Hogue, H. Bussey, B. Andrews, M. Tyers, and C. Boone. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science, 294(5550):2364–2368, 2001.

16. E. Segal, N. Friedman, N. Kaminski, A. Regev, and D. Koller. From signatures to models: Understanding cancer using microarrays. Nat. Genet., 37:S38–S45, 2005.

17. E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, and N. Friedman. Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet., 34(2):166–176, 2003.

18. A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young. Combining location and expression data for principled discovery of genetic regulatory network models. In R. B. Altman, A. K. Dunker, and L. Hunter (Eds.), Pacific Symposium Biocomputing. World Scientific, Hawaii, 2002, pp. 437–449.

19. D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: A survey. IEEE Trans. Knowledge Data Eng., 16(11):1370–1386, 2004.

20. P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9(12):3273–3297, 1998.

21. M. Nykter, T. Aho, M. Ahdesmaki, P. Ruusuvuori, A. Lehmussola, and O. Yli-Harja. Simulation of microarray data with realistic characteristics. BMC Bioinformatics, 7:349, 2006.

22. L. Klebanov and A. Yakovlev. How high is the level of technical noise in microarray data? Biol. Direct, 2:9, 2007.

23. J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In A. Prieditis and S. Russell (Eds.), Machine Learning: Proceedings of the 12th International Conference. Morgan Kaufmann, San Francisco, CA, 1995.

24. N. Friedman and M. Goldszmidt. Discretization of continuous attributes while learning Bayesian networks. In L. Saitta (Ed.), Proceedings of the 13th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 1996, pp. 157–165.

25. S. A. Kauffman. Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theor. Biol., 22:437–467, 1969.

26. R. Albert and H. G. Othmer. The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J. Theor. Biol., 223:1–18, 2003.

27. D. Thieffry and R. Thomas. Qualitative analysis of gene networks. In R. B. Altman, A. K. Dunker, L. Hunter, and T. E. Klein (Eds.), Pacific Symposium Biocomputing. World Scientific, Singapore, 1997, pp. 77–88.

28. R. Laubenbacher and B. Stigler. A computational algebra approach to the reverse engineering of gene regulatory networks. J. Theor. Biol., 229:523–537, 2004.

29. S. C. Madeira and A. L. Oliveira. An evaluation of discretization methods for non-supervised analysis of time-series gene expression data. Technical Report. INESC-ID, University of Beira Interior, Portugal, 2005.

30. X. Li, S. Rao, W. Jiang, C. Li, Y. Xiao, Z. Guo, Q. Zhang, L. Wang, L. Du, J. Li, L. Li, T. Zhang, and Q. K. Wang. Discovery of time-delayed gene regulatory networks based on temporal gene expression profiling. BMC Bioinformatics, 7:26, 2006.

31. M. Koyuturk, W. Szpankowski, and A. Grama. Biclustering gene-feature matrices for statistically significant dense patterns. In Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology, 2004, pp. 480–484.

32. G. Park and W. Szpankowski. Analysis of biclusters with applications to gene expression data. In C. Martinez (Ed.), 2005 International Conference on Analysis of Algorithms. DMTCS Proceedings, Nancy, 2005, pp. 267–274.

33. S. Lonardi, W. Szpankowski, and Q. Yang. Finding biclusters by random projections. In S. C. Sahinalp, S. Muthukrishnan, and U. Dogrusoz (Eds.), Combinatorial Pattern Matching. Springer, Berlin, 2004, pp. 102–116.

34. A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, Upper Saddle River, NJ, 1988, pp. 58–89.

35. J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium of Mathematical Statistics and Probability, Vol. 1, University of California Press, Berkeley, CA, 1967, pp. 281–297.

36. A. Saeed, V. Sharov, J. White, J. Li, W. Liang, N. Bhagabati, J. Braisted, M. Klapa, T. Currier, M. Thiagarajan, A. Sturn, M. Snuffin, A. Rezantsev, D. Popov, A. Ryltsov, E. Kostukovich, I. Borisovsky, Z. Liu, A. Vinsavich, V. Trush, and J. Quackenbush. TM4: A free, open-source system for microarray data management and analysis. BioTechniques, 34(2):374–378, 2003.

37. C. Moller-Levet, S. Cho, and O. Wolkenhauer. Microarray data clustering based on temporal variation: Fcv and tsd preclustering. Appl. Bioinformatics, 2(1):35–45, 2003.

38. L. Ji and K. Tan. Identifying time-lagged gene clusters using gene expression data. Bioinformatics, 21(4):509–516, 2005.

39. L. Ji and K. Tan. Mining gene expression data for positive and negative co-regulated gene clusters. Bioinformatics, 20(16):2711–2718, 2004.

40. S. A. Kauffman. Antichaos and adaptation. Sci. Am., 265(2):78–84, 1991.

41. A. L. Barabasi and Z. N. Oltvai. Network biology: Understanding the cell’s functional organisation. Nat. Rev. Genet., 5(2):101–113, 2004.

42. J. Tegner, M. K. Yeung, J. Hasty, and J. J. Collins. Reverse engineering gene networks: Integrating genetic perturbations with dynamical modeling. Proc. Natl. Acad. Sci. USA, 100(10):5944–5949, 2003.

43. G. Marnellos and E. Mjolsness. A gene network approach to modeling early neurogenesis in Drosophila. In R. B. Altman, A. K. Dunker, L. Hunter, and T. E. Klein (Eds.), Pacific Symposium Biocomputing. World Scientific, Hawaii, 1998, pp. 30–41.

44. T. J. Perkins, J. Jaeger, and J. Reinitz. Reverse engineering the gap gene network of Drosophila melanogaster. PLoS Comput. Biol., 2(5):e51, 2006.

45. M. J. Herrgård, M. W. Covert, and B. Palsson. Reconciling gene expression data with known genome-scale regulatory network structures. Genome Res., 13(11):2423–2434, 2003.

46. P. C. FitzGerald, D. Sturgill, A. Shyakhtenko, B. Oliver, and C. Vinson. Comparative genomics of Drosophila and human core promoters. Genome Biol., 7:R53, 2006.

47. Q. Cui, B. Liu, T. Jiang, and S. Ma. Characterizing the dynamic connectivity between genes by variable parameter regression and Kalman filtering based on temporal gene expression data. Bioinformatics, 21(8):1538–1541, 2005.

48. P. Spirtes, C. Glymour, R. Scheines, S. Kauffman, V. Airmale, and F. Wimberly. Constructing Bayesian network models for gene expression networks from microarray data. Proceedings of the Atlantic Symposium on Computational Biology, Genome Information System and Technology, 2000. Retrieved from http://repository.cmu.edu/philosophy/290/.

49. H. de Jong. Modeling and simulation of genetic regulatory systems: A literature review. J. Comput. Biol., 9(1):67–103, 2002.

50. M. E. Driscoll and T. S. Gardner. Identification and control of gene networks in living organisms via supervised and unsupervised learning. J. Process Control, 16(3):303–311, 2006.

51. J. Vohradsky. Neural network model of gene expression. FASEB J., 15:846–854, 2001.

52. S. Liang, S. Fuhrman, and R. Somogyi. REVEAL: A general reverse engineering algorithm for inference of genetic network architectures. In R. B. Altman, A. K. Dunker, L. Hunter, and T. E. Klein (Eds.), Pacific Symposium Biocomputing. World Scientific, Hawaii, 1998, pp. 18–29.

53. D. Husmeier. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19(17):2271–2282, 2003.

54. J. Li, X. Li, H. Su, H. Chen, and D. W. Galbraith. A framework of integrating gene relations from heterogeneous data sources: An experiment on Arabidopsis thaliana. Bioinformatics, 22:2037–2043, 2006.

55. I. Lee, Z. Li, and E. M. Marcotte. An improved, bias-reduced probabilistic functional gene network of baker’s yeast, Saccharomyces cerevisiae. PLoS ONE, 2(Suppl. 10):e988, 2007.

56. Z. Bar-Joseph, G. K. Gerber, T. I. Lee, N. J. Rinaldi, J. Y. Yoo, F. Robert, D. B. Gordon, E. Fraenkel, T. S. Jaakkola, R. A. Young, and D. K. Gifford. Computational discovery of gene modules and regulatory networks. Nat. Biotechnol., 21:1337–1342, 2003.

57. S. Imoto, T. Higuchi, T. Goto, K. Tashiro, S. Kuhara, and S. Miyano. Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. J. Bioinform. Comput. Biol., 2:77–98, 2004.

58. A. Bernard and A. J. Hartemink. Informative structure priors: Joint learning of dynamic regulatory networks from multiple types of data. Pac. Symp. Biocomput., 10:459–470, 2005.

59. C. H. Yeang, T. Ideker, and T. Jaakkola. Physical network models. J. Comput. Biol., 11:243–262, 2004.

60. C. H. Yeang and T. Jaakkola. Time series analysis of gene expression and location data. In Third IEEE Symposium on BioInformatics and BioEngineering (BIBE’03). IEEE, Bethesda, MD, 2003, pp. 305–312.

61. B. Lewin. Genes. 7th ed. Oxford University Press, 1999.

62. A. K. Lee, S. H. Sung, Y. C. Kim, and S. G. Kim. Inhibition of lipopolysaccharide-inducible nitric oxide synthase TNF-α and COX-2 expression by Sauchinone effects on NF-κB1 phosphorylation, C/EBP and AP-1 activation. Br. J. Pharmacol., 139:11–20, 2003.

63. O. Cinquin and J. Demongeot. Positive and negative feedback: Striking a balance between necessary antagonists. J. Theor. Biol., 216:229–241, 2002.

64. M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa. From genomics to chemical genomics: New developments in KEGG. Nucleic Acids Res., D354–357, 2006.

65. S. S. Dwight, M. A. Harris, K. Dolinski, C. A. Ball, G. Binkley, K. R. Christie, D. G. Fisk, L. Issel Tarver, M. Schroeder, G. Sherlock, A. Sethuraman, S. Weng, D. Botstein, and J. M. Cherry. Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res., 30:69–72, 2002.

66. S. Geisser. Predictive Inference. Chapman and Hall, New York, 1993.

67. R. Kohavi. Wrappers for performance enhancement and oblivious decision graphs. Ph.D. Thesis. Stanford University, Computer Science Department, 1995.

68. P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, London, 1982.

69. J. Han and M. Kamber. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, San Francisco, CA, 2000.

70. A. Zhang. Advanced Analysis of Gene Expression Microarray. World Scientific, Singapore, 2006.

71. J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms for association rule mining: A general survey and comparison. ACM SIGKDD Explor., 2(1):58–64, 2000.

72. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM-SIGMOD International Conference on Management of Data. ACM Press, Washington, DC, 1993, pp. 207–216.

73. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. ACM Press, Dallas, TX, 2000, pp. 1–12.

74. J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: Current status and future directions. Data Min. Knowl. Discov., 15:55–86, 2007.

75. R. J. Bayardo. Efficiently mining long patterns from databases. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. ACM Press, Seattle, WA, 1998, pp. 88–93.

76. K. Gouda and M. J. Zaki. GenMax: An efficient algorithm for mining maximal frequent itemsets. Data Min. Knowl. Discov., 11:223–242, 2005.

77. N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Inf. Syst., 24:25–46, 1999.

78. J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, Washington, DC, 2003, pp. 236–245.

79. M. J. Zaki and C. J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In Proceedings of the SIAM International Conference on Data Mining. SIAM Press, Arlington, VA, 2002, pp. 457–473.

80. R. Alves, D. S. Rodriguez-Baena, and J. S. Aguilar-Ruiz. Gene association analysis: A survey of frequent pattern mining from gene expression data. Brief. Bioinformatics, 11(2):210–224, 2009.

81. E. Baralis, G. Bruno, and E. Ficarra. Temporal association rules for gene regulatory networks. In Intelligent Systems. IEEE, Varna, 2008, pp. 2–7.

82. H. Nam, K. Lee, and D. Lee. Identification of temporal association rules from time series microarray data sets. BMC Bioinformatics, 10(Suppl. 3):S6, 2009.

83. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, London, 1984.

84. I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA, 1999.

85. R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proc. 14th Int. Joint Conf. Artif. Intell., 2(12):1137–1143, 1995.

86. L. A. Soinov, M. A. Krestyaninova, and A. Brazma. Towards reconstruction of gene networks from expression data by supervised learning. Genome Biol., 4(1):R6, 2003.

87. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, 1992.

88. V. A. Huynh-Thu, A. Irrthum, L. Wehenkel, and P. Geurts. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE, 5(9):e12776, 2010.

89. L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.

90. P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63:3–42, 2006.

91. S. Imoto, T. Goto, and S. Miyano. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. In R. B. Altman, A. K. Dunker, and L. Hunter (Eds.), Pacific Symposium Biocomputing. World Scientific, Hawaii, 2002, pp. 175–186.

92. N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using Bayesian networks to analyze expression data. J. Comput. Biol., 7:601–620, 2000.

93. A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. In R. B. Altman, K. Lauderdale, A. K. Dunker, L. Hunter, and T. E. Klein (Eds.), Pacific Symposium Biocomputing. World Scientific, Hawaii, 2001, pp. 422–433.

94. M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21(3):349–356, 2005.

95. P. H. Lee and D. Lee. Modularized learning of genetic interaction networks from biological annotations and mRNA expression data. Bioinformatics, 21(11):2739–2747, 2005.

96. X. B. Zhou, X. D. Wang, R. Pal, I. Ivanov, M. Bittner, and E. R. Dougherty. A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks. Bioinformatics, 20(17):2918–2927, 2004.

97. S. Rogers and M. Girolami. A Bayesian regression approach to the inference of regulatory networks from gene expression data. Bioinformatics, 21(14):3131–3137, 2005.

98. N. Friedman. The Bayesian structure EM algorithm. In G. F. Cooper and S. Moral (Eds.), Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, San Francisco, CA, 1998, pp. 129–138.

99. K. Murphy and S. Mian. Modelling gene expression data using dynamic Bayesian networks. Technical Report. Division of Computer Science, University of California, Berkeley, CA, 1999.

100. L. Göransson and T. Koski. Using a dynamic Bayesian network to learn genetic interactions. Technical Report. Graduate School of Biomedical Research, Linköping University, 2002.

101. S. Kim, S. Imoto, and S. Miyano. Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Biosystems, 75:57–65, 2004.

102. J. Yu, V. A. Smith, P. P. Wang, A. J. Hartemink, and E. D. Jarvis. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics, 20(18):3594–3603, 2004.

103. N. Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303(6):799–805, 2004.

104. J. Hasty, D. McMillen, F. Isaacs, and J. J. Collins. Computational studies of gene regulatory networks: In numero molecular biology. Nat. Rev. Genet., 2(4):268–279, 2001.

105. D. Pe’er, A. Regev, G. Elidan, and N. Friedman. Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17(Suppl. 1):S215–224, 2001.

106. N. Guelzim, S. Bottani, P. Bourgine, and F. Kepes. Topological and causal structure of the yeast transcriptional regulatory network. Nat. Genet., 31:60–63, 2002.

107. C. Yoo and G. F. Cooper. Discovery of gene-regulation pathways using local causal search. In I. S. Kohane (Ed.), Proc. AMIA Symp. Hanley & Belfus, San Antonio, TX, 2002, pp. 914–918.

108. B. Xing and M. J. van der Laan. A causal inference approach for constructing transcriptional regulatory networks. Bioinformatics, 21:4007–4013, 2005.

109. T. Akutsu, S. Miyano, and S. Kuhara. Algorithms for inferring qualitative models of biological networks. In R. B. Altman, A. K. Dunker, and L. Hunter (Eds.), Pacific Symposium Biocomputing. World Scientific, Hawaii, 2000, pp. 293–304.

110. I. Simon, J. Barnett, N. Hannett, C. T. Harbison, N. J. Rinaldi, T. L. Volkert, J. J. Wyrick, J. Zeitlinger, D. K. Gifford, T. S. Jaakkola, and R. A. Young. Serial regulation of transcriptional regulators in the yeast cell cycle. Cell, 106(6):697–708, 2001.

111. K. Sivakumar, R. Chen, and H. Kargupta. Learning Bayesian network structure from distributed data. In SIAM International Data Mining Conference. Society for Industrial and Applied Mathematics, San Francisco, CA, 2003, pp. 284–288.

112. P. Smolen, D. A. Baxter, and J. H. Byrne. Modeling transcriptional control in gene networks: Methods, recent results and future directions. Bull. Math. Biol., 62:247–292, 2000.

113. L. Tiefei. Learning gene network using Bayesian network framework. Ph.D. Thesis. National University of Singapore, 2005.

114. X. Boyen, N. Friedman, and D. Koller. Discovering the hidden structure of complex dynamic systems. In K. Laskey and H. Prade (Eds.), Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, San Francisco, CA, 1999, pp. 91–100.

115. S. A. Kauffman. Requirements for evolvability in complex systems: Orderly dynamics and frozen components. Phys. D, 42:135–152, 1990.

116. S. A. Kauffman. The Origins of Order, Self-Organization and Selection in Evolution. Oxford University Press, New York, NY, 1993.

117. H. Lähdesmäki, I. Shmulevich, and O. Yli-Harja. On learning gene regulatory networks under the Boolean network model. Machine Learning, 52:147–167, 2003.

118. S. Kauffman, C. Peterson, B. Samuelsson, and C. Troein. Random Boolean network models and the yeast transcriptional network. Proc. Natl. Acad. Sci. USA, 100:14796–14799, 2003.

119. I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang. Probabilistic Boolean networks: A rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18:261–274, 2002.

120. R. Hashimoto, S. Kim, I. Shmulevich, W. Zhang, M. L. Bittner, and E. R. Dougherty. Growing genetic regulatory networks from seed genes. Bioinformatics, 20:1241–1247, 2004.

121. H. Li, J. Xuan, Y. Wang, and M. Zhan. Inferring regulatory networks. Front. Biosci., 13:263–275, 2008.

122. M. Wahde and J. Hertz. Coarse-grained reverse engineering of genetic regulatory networks. Biosystems, 55:129–136, 2000.

123. T. Chen, V. Filkov, and S. S. Skiena. Identifying gene regulatory networks from experimental data. In Research in Computational Molecular Biology. ACM, Lyon, 1999, pp. 94–103.

124. X. Wen, S. Fuhrman, G. S. Michaels, D. B. Carr, S. Smith, J. L. Barker, and R. Somogyi. Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci., 95(1):334–339, 1998.

125. W. M. Fitch and E. Margoliash. Construction of phylogenetic trees. Science, 155:279–284, 1967.

126. E. P. van Someren, L. F. A. Wessels, and M. J. T. Reinders. Linear modeling of genetic networks from experimental data. In R. Altman, T. L. Bailey, P. Bourne, M. Gribskov, T. Lengauer, I. N. Shindyalov, L. F. Ten Eyck, and H. Weissig (Eds.), International Conference on Intelligent Systems for Molecular Biology. AAAI Press, La Jolla, CA, 2000, pp. 355–366.

127. R. Guthke, U. Moller, M. Hoffmann, F. Thies, and S. Topfer. Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection. Bioinformatics, 21(8):1626–1634, 2005.

128. A. Arkin, P. Shen, and J. Ross. A test case of correlation metric construction of a reaction pathway from measurements. Science, 277:1275–1279, 1997.

129. I. Ponzoni, F. Azuaje, J. Augusto, and D. Glass. Inferring adaptive regulation thresholds and association rules from gene expression data through combinatorial optimization learning. IEEE/ACM Trans. Comp. Biol. Bioinformatics, 4(Suppl. 4):624–634, 2007.

130. B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

131. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.