Improved Biclustering Of Microarray Data


    SADIQ HUSSAIN* and G.C. HAZARIKA**

*System Administrator, Examination Branch, Dibrugarh University, Dibrugarh-786004 (India), [email protected], 09435246192 (M)

**Director, Centre for Computer Studies, Dibrugarh University, Dibrugarh-786004 (India), [email protected]

    ABSTRACT

Biclustering algorithms simultaneously cluster both rows and columns. These types of algorithms are applied to gene expression data analysis to find a subset of genes that exhibit similar expression patterns under a subset of conditions. Cheng and Church introduced the mean squared residue measure to capture the coherence of a subset of genes over a subset of conditions. They provided a set of heuristic algorithms, based primarily on node deletion, to find one bicluster or a set of biclusters after masking discovered biclusters with random values.

The mean squared residue is a popular measure of bicluster quality. One drawback, however, is that it is biased toward flat biclusters with low row variance. In this paper, we introduce an improved bicluster score that removes this bias and promotes the discovery of the most significant biclusters in the dataset. We employ this score within a new biclustering approach based on a bottom-up search strategy. We believe that the bottom-up search approach better models the underlying functional modules of the gene expression dataset.

J. Comp. & Math. Sci. Vol. 1(2), 111-120 (2010).

    1. INTRODUCTION

Advances in gene expression microarray technologies over the last decade or so have made it possible to measure the expression levels of thousands of genes over many experimental conditions (e.g., different patients, tissue types and growth environments). The data produced in these experiments are usually arranged in a data matrix of genes (rows) and conditions (columns). Results from multiple microarray experiments may be combined, and the resulting data matrix may easily exceed thousands of genes and hundreds of conditions in size.

As datasets increase in size, however, it becomes increasingly unlikely that genes will retain correlation across the full set of conditions, making clustering problematic. The gene expression context further exacerbates this problem, as it is not uncommon for the expression of related genes to be highly similar under one set of conditions and yet independent under another set [4]. Given these issues, it is perhaps more prudent to cluster genes over a significant subset of experimental conditions.

This two-way clustering technique has been termed biclustering and was first introduced to gene expression analysis by Cheng and Church [5]. They developed a two-way correlation metric called the mean squared residue score to measure the bicluster quality of selected rows and columns from the gene expression data matrix. They employed this metric within a top-down greedy node deletion algorithm aimed at discovering all the significant biclusters within a gene expression data matrix. Following this seminal work, other metrics and biclustering frameworks were developed [6, 7, 8]. However, approaches based on Cheng and Church's mean squared residue score remain the most prevalent in the literature [9, 10, 11, 12].

One notable drawback of the mean squared residue score, however, is that it is also affected by variance, favouring correlations with low variances. Furthermore, because variance changes by the square of the change in scale, the score tends to discover correlations over lower scales. These effects culminate in a bias toward 'flat' biclusters containing genes with relatively unfluctuating expression levels within the lower scales (fold changes) of the gene expression dataset. This issue has been articulated previously in [13]. We illustrate this mean squared residue (H-Score) bias below. In this paper, we introduce an improved bicluster scoring metric which compensates for this bias and enables the discovery of biclusters throughout the expression data, including the potentially more interesting correlations over the higher scales (fold changes).
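As an informal illustration of this scale effect, the following sketch computes the mean squared residue for the same coherent (additive) expression pattern at two scales and for a near-constant 'flat' submatrix. The data are synthetic and the h_score function simply implements Cheng and Church's definition; it is not taken from any particular implementation.

```python
import numpy as np

def h_score(a):
    """Cheng and Church's mean squared residue of a bicluster (sub)matrix."""
    row_means = a.mean(axis=1, keepdims=True)   # a_iJ
    col_means = a.mean(axis=0, keepdims=True)   # a_Ij
    residues = a - row_means - col_means + a.mean()
    return np.mean(residues ** 2)

rng = np.random.default_rng(0)
gene_effects = rng.normal(size=(10, 1))
cond_effects = rng.normal(size=(1, 8))
noise = rng.normal(scale=0.05, size=(10, 8))
pattern = gene_effects + cond_effects + noise   # coherent additive bicluster

print(h_score(pattern))        # small residue
print(h_score(10 * pattern))   # ~100x larger for the identical correlation pattern

flat = 0.01 * rng.normal(size=(10, 8))          # near-constant 'flat' submatrix
print(h_score(flat))           # can beat genuine structure expressed at higher scales
```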

2. Bottom-Up Biclustering by Locality Expansion

In general, as with conventional clustering, biclustering techniques fall into top-down and bottom-up approaches. Top-down approaches begin by finding an initial approximation of the solution over the full set of objects and features, in this case all the rows and columns of the gene expression data matrix. There is a possibility that this method may discover an artificially large bicluster solution which is a combination of parts of two or more local solutions.

Top-down methods tend to produce a more uniform set of clusters in terms of size and shape. This representation may not accurately model the diverse set of functional modules that may be present in expression data. Also, because they deal initially with the full set of dimensions, top-down approaches may not scale well as datasets increase in size.

Our framework is built around the bottom-up search approach. It is founded on the premise that searching for interesting structure in higher dimensions can first be reduced to a search for structure in lower dimensions. In general, this search method is more adept at discovering several local solutions rather than global solutions. In this way, we hope to discover a more natural set of biclusters that captures the diversity and organization of functional groups in the cell. This approach also has the advantage that it is computationally more efficient to search for solutions over a smaller set of dimensions.

Our approach can be divided into two phases. First, we perform a stochastic search for a set of highly correlated sub-biclusters. These are then used as a set of seeds for the next phase, a deterministic expansion into higher dimensions by adding the best-fitting rows and columns. These algorithms are discussed in this section, but first we introduce the improved bicluster scoring metric that is essential in locating significant bicluster seeds.

A. An Improved Bicluster Scoring Metric :

As already mentioned, the mean squared residue (H-Score) is affected by the scale of the variance within biclusters. To address this problem, our bicluster scoring metric, the Hv-Score, takes into account each entry's distance from its row mean. The sum of the squares of these distances is computed and used as the denominator in a new bicluster scoring measure, given in equation (1). The numerator is simply the sum of the squared residues in the matrix. Minimizing this function for the selected sub-matrix solution minimizes the sum of the squared residues while maximizing the sum of the squared distances from the row means. The Hv-Score is defined as:

Hv(I, J) = \frac{\sum_{i \in I, j \in J} R(a_{ij})^2}{\sum_{i \in I, j \in J} (a_{ij} - a_{iJ})^2}    (1)

where R(a_ij) = a_ij − a_iJ − a_Ij + a_IJ is the residue score of each entry a_ij, a_iJ is the row mean for each entry, a_Ij the column mean and a_IJ the mean of the whole sub-matrix. We evaluate this new metric and use it to locate highly correlated bicluster seeds across the scales of the dataset in the seed search step of our approach.
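As a concrete reference for equation (1), the following is a minimal NumPy sketch of the Hv-Score, assuming the residue R(a_ij) = a_ij − a_iJ − a_Ij + a_IJ that also appears in equations (3) and (4); the function name and the toy data are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def hv_score(a):
    """Hv-Score of a bicluster (sub)matrix: sum of squared residues divided by
    the sum of squared distances of each entry from its row mean (equation 1)."""
    row_means = a.mean(axis=1, keepdims=True)        # a_iJ
    col_means = a.mean(axis=0, keepdims=True)        # a_Ij
    residues = a - row_means - col_means + a.mean()  # R(a_ij)
    return np.sum(residues ** 2) / np.sum((a - row_means) ** 2)

rng = np.random.default_rng(1)
rows, cols = rng.normal(size=(12, 1)), rng.normal(size=(1, 9))
coherent = rows + cols + rng.normal(scale=0.05, size=(12, 9))  # fluctuating but correlated
flat = rng.normal(scale=0.05, size=(12, 9))                    # little row variance

print(hv_score(coherent))   # low: residues are small relative to the row variance
print(hv_score(flat))       # high: a flat submatrix no longer scores well
```

Note that a perfectly constant submatrix would give a zero denominator, so such degenerate candidates would need to be screened out before scoring.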

    B. Seed Search :

The goal of the initial seed search algorithm is to locate a diverse set of highly correlated bicluster seeds throughout the gene expression data matrix. If these seeds are significant in size, it is more probable that they represent part of a larger bicluster of rows and columns. For example, a seed that consists of 10 genes actively correlating over 10 conditions is likely to be part of a larger set of related genes and would be a good base on which to expand.

Our seed search is based on the well-known stochastic search technique of simulated annealing. It has been shown that simulated annealing can discover improved bicluster solutions over the best-first greedy search method [11]. Unlike greedy search techniques, simulated annealing allows the acceptance of reversals in fitness (worse solutions) and backtracking, which gives it the capability to bypass locally good solutions until it converges on a solution near the global optimum. The acceptance of reversals is probabilistic, as defined by the Boltzmann equation:

P(\Delta E) = e^{-\Delta E / T}    (2)

From the equation, it can be seen that this probability depends on two variables: the size of the reversal, ΔE, and the temperature of the system, T. Generally, T is first given a high value to initially allow a large percentage of reversals to be accepted; this helps the search explore the entire search space. Gradually, this T is lowered and the potential search space shrinks until it converges on the global optimum. The number of solutions evaluated (attempts) and the number of solutions accepted (successes) at each temperature are also input parameters. These determine how extensive a search is carried out before T is lowered.

If we reduce both the initial temperature and the depth of search at each temperature, we can confine the search to a more local area of the data. Sequential searches will then be able to uncover several local optima rather than one solution close to the global optimum. With regard to gene expression data, these regional optima may correspond to the diverse set of bicluster seeds.
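For reference, the acceptance rule of equation (2) can be expressed in a few lines; the reversal size and the temperatures below are arbitrary illustrative values, not parameters from the paper.

```python
import math
import random

def accept_reversal(delta_e: float, t: float) -> bool:
    """Accept a fitness reversal of size delta_e with probability exp(-delta_e / t)."""
    return math.exp(-delta_e / t) > random.random()

# The same reversal is accepted far more often while the system is hot.
for t in (10.0, 1.0, 0.1):
    accepted = sum(accept_reversal(0.5, t) for _ in range(10_000))
    print(f"T = {t:>4}: ~{accepted / 100:.0f}% of reversals of size 0.5 accepted")
```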

    C. Seed Expansion :

Upon locating a good set of seeds, the deterministic phase of seed expansion involves adding the best-fitting rows and columns to each seed. Prior to seed expansion, the correlation of the rows and columns in the remainder of the dataset is measured relative to the seed. The following formulae, based on the residue score, are used to score these rows and columns.

All rows (genes) not in the seed (I, J) are scored as follows:

H(i) = \frac{1}{|J|} \sum_{j \in J} (a_{ij} - a_{Ij} - a_{iJ} + a_{IJ})^2    (3)

where i ∉ I. All columns (conditions) not in the seed (I, J) are similarly scored:

H(j) = \frac{1}{|I|} \sum_{i \in I} (a_{ij} - a_{Ij} - a_{iJ} + a_{IJ})^2    (4)

where j ∉ J.

An important attribute of this expansion phase is that it adds correlated rows and columns regardless of scale. Rows and columns are standardized by subtracting the mean and dividing by the standard deviation. These scores are then sorted and the best-fitting column or row is added in each iteration. As our main objective in this study is to capture gene (row) correlations, we only add rows during seed expansion. We used Fuzzy Logic and a Z-test to speed up the process.
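The row-scoring step can be sketched as follows, assuming the score of equation (3); the standardization helper, the synthetic data and the seed indices are illustrative, and the fuzzy-logic and Z-test speed-ups mentioned above are not shown.

```python
import numpy as np

def standardize_rows(m):
    """Subtract each row's mean and divide by its standard deviation."""
    return (m - m.mean(axis=1, keepdims=True)) / m.std(axis=1, keepdims=True)

def score_candidate_rows(data, seed_rows, seed_cols):
    """Equation (3): mean squared residue of each row not in the seed,
    measured over the seed's columns."""
    sub = data[np.ix_(seed_rows, seed_cols)]
    col_means = sub.mean(axis=0)                 # a_Ij over the seed rows
    overall = sub.mean()                         # a_IJ
    scores = {}
    for i in range(data.shape[0]):
        if i in seed_rows:
            continue
        row = data[i, seed_cols]
        row_mean = row.mean()                    # a_iJ over the seed columns
        scores[i] = np.mean((row - row_mean - col_means + overall) ** 2)
    return scores

rng = np.random.default_rng(2)
data = standardize_rows(rng.normal(size=(50, 17)))
seed_rows, seed_cols = [0, 1, 2, 3], [0, 2, 5, 7, 9]
scores = score_candidate_rows(data, seed_rows, seed_cols)
best_row = min(scores, key=scores.get)           # lowest residue fits the seed best
```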

Algorithm I : Seed Search

Variable definitions:
x : current solution,
t0 : initial temperature,
t : current temperature,
rate : temperature fall rate,
a : attempts, s : successes,
acount : attempt count, scount : success count,
H : fitness function (the Hv-Score),
M : data matrix,
row : seed row size, col : seed column size.

SeedSearch(t0, rate, row, col, s, a, M)
    fuzzification
    x ← randomSolution(row, col)
    t ← t0
    while (x not converging)
        while (acount < a AND scount < s)
            do the Z-test to find the significance
            xnew ← GenerateNewSolution(M, x)
            if H(xnew) < H(x) then
                x ← xnew
            else if exp(−ΔE / t) > random(0, 1) then
                x ← xnew
        t ← Cool(t, rate)
    defuzzification
    return x
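For readers who prefer runnable code, the following Python sketch follows the overall structure of Algorithm I under several assumptions of ours: a solution is a pair of row and column index sets, a new solution is generated by swapping one seed row or column with one from outside the seed, the Hv-Score is used as the fitness H, and "not converging" is approximated by a temperature floor. The fuzzification/defuzzification and Z-test steps are omitted.

```python
import math
import random
import numpy as np

def hv_score(a):
    """Equation (1): sum of squared residues over sum of squared row-mean distances."""
    row_means = a.mean(axis=1, keepdims=True)
    col_means = a.mean(axis=0, keepdims=True)
    residues = a - row_means - col_means + a.mean()
    return np.sum(residues ** 2) / np.sum((a - row_means) ** 2)

def fitness(m, solution):
    rows, cols = solution
    return hv_score(m[np.ix_(sorted(rows), sorted(cols))])

def new_solution(m, solution):
    """Swap one randomly chosen seed row or column with one from outside the seed."""
    rows, cols = set(solution[0]), set(solution[1])
    if random.random() < 0.5:
        outside = [r for r in range(m.shape[0]) if r not in rows]
        rows.remove(random.choice(list(rows)))
        rows.add(random.choice(outside))
    else:
        outside = [c for c in range(m.shape[1]) if c not in cols]
        cols.remove(random.choice(list(cols)))
        cols.add(random.choice(outside))
    return frozenset(rows), frozenset(cols)

def seed_search(m, t0=1.0, rate=0.9, row=10, col=10, s=20, a=100, t_min=1e-3):
    x = (frozenset(random.sample(range(m.shape[0]), row)),
         frozenset(random.sample(range(m.shape[1]), col)))
    t = t0
    while t > t_min:                              # proxy for "x not converging"
        acount = scount = 0
        while acount < a and scount < s:
            acount += 1
            xnew = new_solution(m, x)
            delta = fitness(m, xnew) - fitness(m, x)
            if delta < 0 or math.exp(-delta / t) > random.random():
                x, scount = xnew, scount + 1
        t *= rate                                 # Cool(t, rate)
    return x
```

With lower values of t0 and of the attempt/success limits, repeated runs confine the search locally and return a set of distinct seeds, as described above.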


Algorithm II : The Seed Expansion Phase

Variable definitions:
r : row, c : column,
R : rows in seed, C : columns in seed,
rVarT : row variance threshold, cVarT : column variance threshold,
rT : seed row threshold, cT : seed column threshold.

SeedExpansion(Seed(R, C), rVarT, cVarT, rT, cT)
    fuzzification
    Score all r ∉ R
    Score all c ∉ C
    Sort r scores
    Sort c scores
    Select best scoring r or c
        if (variance of r/c > r/cVarT)
            if (r/c number < r/cT)
                add r/c to Seed(R, C)
                re-score Seed(R, C)
    defuzzification
    return expanded Seed(R, C)

D. The Yeast Cell Cycle Dataset :

In our evaluation, we used the yeast cell cycle dataset from Cho et al. [18]. This dataset has been used in previous biclustering [5, 12] and clustering [19] studies. Cheng and Church used a subset of 2,884 of the most variable genes from Cho's dataset. Unlike expression data from the larger human genome, the yeast data gives a more complete representation in which all the ORFs and functional modules are covered.

Cho's dataset contains 6,178 genes, 17 conditions and 6,453 missing values. Rows containing many missing values were removed, giving 6,145 rows with 5,846 missing values. Missing values were replaced by random values generated between the first and third quartiles of the data range. This reduces the impact of these imputed values on the structure of the dataset.
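A short sketch of this preprocessing on illustrative data follows; the paper does not state the exact row-removal criterion, so the max_missing_per_row parameter below is an assumption.

```python
import numpy as np

def preprocess(data, max_missing_per_row=3, seed=0):
    """Drop rows with many missing values, then impute the remaining missing
    entries with random values drawn between the first and third quartiles
    of the observed data."""
    rng = np.random.default_rng(seed)
    data = data[np.isnan(data).sum(axis=1) <= max_missing_per_row].copy()
    q1, q3 = np.nanpercentile(data, [25, 75])
    missing = np.isnan(data)
    data[missing] = rng.uniform(q1, q3, size=missing.sum())
    return data
```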


Table 1. Comparison between the H-Score and the Hv-Score: 10 bicluster seeds discovered by Seed Search with each metric. The percentage of the Dominant Functional Category (D.F.C.) and the MIPS code and name are listed for each seed.

              H-Score                            Hv-Score
Seed   D.F.C.   MIPS Category             D.F.C.   MIPS Category
1      40%      99: Unclass. ORFs         70%      12: Protein Synth.
2      50%      14: Protein Fate          60%      10: Cell Cycle
3      30%      01: Metabolism            40%      99: Unclass. ORFs
4      40%      14: Protein Fate          90%      11: Transcription
5      60%      14: Protein Fate          60%      12: Protein Synth.
6      40%      32: Cell Rescue           50%      99: Unclass. ORFs
7      40%      99: Unclass. ORFs         50%      14: Protein Fate
8      20%      12: Protein Synth.        40%      99: Unclass. ORFs
9      50%      14: Protein Fate          50%      12: Protein Synth.
10     40%      14: Protein Fate          70%      10: Cell Cycle

Figure I. Best scoring seed found with the Hv-Score (genes: 216, 217, 526, 616, 1022, 1184, 1476, 1623, 1795, 2086, 2278, 2375, 2538).


We annotated all genes in our dataset using the MIPS (Munich Information Centre for Protein Sequences) online functional catalogue [20]. Over 1,500 genes in the dataset were labelled as category 99 (unclassified). These annotations were used to evaluate the correspondence of the biclusters to gene functional modules.

    3. Evaluation

In the first part of our evaluation, we evaluate our improved metric, the Hv-Score, against Cheng and Church's original H-Score. We then carry out comparative evaluations with recent clustering and biclustering approaches, and end by using our model to putatively annotate several unclassified yeast genes.

    4. Discussion & Future Work

In this paper, we have demonstrated improvements to the popular mean squared residue scoring function by removing the bias it shows toward so-called 'flat' biclusters. Furthermore, our Hv-Score was shown to be a more effective objective function, as judged from both qualitative graphical illustrations and quantitative functional analysis of biclusters.

In future work, it is our hope to strengthen these and further putative classifications by cross-validation across multiple expression datasets. Cho's yeast cell cycle dataset is well studied and a good benchmark, but it may lack the number of conditions needed to fully test a subspace clustering method such as biclustering. Using larger datasets, we hope to further validate our biclustering method and also increase the support for gene relationships and putative classifications by discovering biclusters over a larger number of conditions.

    REFERENCES

1. B. Berrer, W. Dubitzky and S. Draghici, A Practical Approach to Microarray Data Analysis, Kluwer Academic Publishers, ch. 1, pp. 15-19 (2003).
2. S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M.E. McLaughlin, J.Y. Kim, L.C. Goumnerova, B. Black, C. Lau, J.C. Allen, D. Zagzag, J.M.O., T. Curran, C. Wetmore, J.A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, D. Stolovitzky, D. Louis, J. Mesirow, E. Lander and T. Golub, "Prediction of central nervous system embryonal tumour outcome based on gene expression", Nature, vol. 415, pp. 436-442 (2002).
3. I.S. Dhillon, S. Mallela and D.S. Modha, "Information theoretic co-clustering", in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003).
4. A. Ben-Dor, B. Chor, R. Karp and Z. Yakhini, "Discovering local structure in gene expression data: the order-preserving submatrix problem", Journal of Computational Biology, vol. 10, No. 3-4, pp. 373-384 (2003).
5. Y. Cheng and G.M. Church, "Biclustering of expression data", in Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), pp. 93-103 (2000).
6. A. Tanay, R. Sharan and R. Shamir, "Discovering statistically significant biclusters in gene expression data", Bioinformatics, vol. 18 (Suppl. 1), pp. S136-S144 (2002).
7. L. Lazzeroni and A. Owen, "Plaid models for gene expression data", Statistica Sinica, vol. 12, pp. 61-86 (2002).
8. Y. Kluger, R. Basri, J.T. Chang and M. Gerstein, "Spectral biclustering of microarray data: Coclustering genes and conditions", Genome Research, vol. 13, pp. 703-716 (2003).
9. J. Yang, H. Wang, W. Wang and P. Yu, "Enhanced biclustering on expression data", in IEEE Third Symposium on Bioinformatics and Bioengineering (2003).
10. H. Cho, I.S. Dhillon, Y. Guan and S. Sra, "Minimum sum-squared residue co-clustering of gene expression data", in SIAM International Conference on Data Mining (2004).
11. K. Bryan, P. Cunningham and N. Bolshakova, "Biclustering of expression data using simulated annealing", in Proceedings of the Eighteenth IEEE Symposium on Computer-Based Medical Systems (2005).
12. S. Bleuler, A. Prelic and E. Zitzler, "An EA framework for biclustering of gene expression data", in Congress on Evolutionary Computation (CEC 2004), Piscataway, NJ: IEEE, pp. 166-173 (2004).
13. J. Aguilar-Ruiz, "Shifting and scaling patterns from gene expression data", Bioinformatics, vol. 21, No. 20, pp. 3840-3845 (2005).
14. B. Mirkin, Mathematical Classification and Clustering, Dordrecht, Kluwer (1996).
15. D.S. Johnson, "The NP-completeness column: an ongoing guide", Journal of Algorithms, vol. 8, pp. 438-448 (1987).
16. I.S. Dhillon, E.M. Marcotte and U. Roshan, "Diametrical clustering for identifying anti-correlated gene clusters", Bioinformatics, vol. 19, No. 13, pp. 1612-1619 (2003).
17. S.C. Madeira and A.L. Oliveira, "Biclustering algorithms for biological data analysis: A survey", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, No. 1, pp. 24-45 (2004).
18. R. Cho, M. Campbell, E. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. Wolfsberg, A. Gabrielian, D. Landsman and D. Lockhart, "A genome-wide transcriptional analysis of the mitotic cell cycle", Molecular Cell, vol. 2, pp. 65-73, July (1998).
19. R. Balasubramaniyan, E. Hullermeier, N. Weskamp and J. Kamper, "Clustering of gene expression data using a local shape-based similarity measure", Bioinformatics, vol. 21, No. 7, pp. 1069-1077 (2005).
20. H.W. Mewes, D. Frishman, U. Guldener, G. Mannhaupt, K.F.X. Mayer, M. Mokrejs, B. Morgenstern, M. Munsterkotter, S. Rudd and B. Weil, "MIPS: a database for genomes and protein sequences", Nucleic Acids Research, vol. 30, No. 1, pp. 31-34 (2002).
21. R. Shamir, A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh and R. Elkon, "EXPANDER - an integrative program suite for microarray data analysis", BMC Bioinformatics, vol. 6, p. 232 (2005).