Improved Biclustering Of Microarray Data
SADIQ HUSSAIN* and G.C. HAZARIKA**
*System Administrator, Examination Branch, Dibrugarh University, Dibrugarh-786004 (India)
[email protected], 09435246192 (M)
**Director, Centre for Computer Studies, Dibrugarh University, Dibrugarh-786004 (India)
[email protected]
ABSTRACT
Biclustering algorithms simultaneously cluster both rows and columns. These algorithms are applied to gene expression data analysis to find a subset of genes that exhibit similar expression patterns under a subset of conditions. Cheng and Church introduced the mean squared residue measure to capture the coherence of a subset of genes over a subset of conditions. They provided a set of heuristic algorithms, based primarily on node deletion, to find one bicluster or a set of biclusters after masking discovered biclusters with random values. The mean squared residue is a popular measure of bicluster quality. One drawback, however, is that it is biased toward flat biclusters with low row variance. In this paper, we introduce an improved bicluster score that removes this bias and promotes the discovery of the most significant biclusters in the dataset. We employ this score within a new biclustering approach based on the bottom-up search strategy. We believe that the bottom-up search approach better models the underlying functional modules of the gene expression dataset.
J. Comp. & Math. Sci. Vol. 1(2), 111-120 (2010).
1. INTRODUCTION
Advances in gene expression microarray technologies over the last decade or so have made it possible to measure the expression levels of thousands of genes over many experimental conditions (e.g., different patients, tissue types and growth environments). The data produced in these experiments are usually arranged in a data matrix of genes (rows) and conditions (columns). Results from multiple microarray experiments may be combined and the
Journal of Computer and Mathematical Sciences Vol. 1, Issue 2, 31 March, 2010, Pages (103-273)
data matrix may easily exceed thousands of genes and hundreds of conditions in size.
As datasets increase in size, however, it becomes increasingly unlikely that genes will retain correlation across the full set of conditions, making clustering problematic. The gene expression context further exacerbates this problem, as it is not uncommon for the expression of related genes to be highly similar under one set of conditions and yet independent under another set [4]. Given these issues, it is perhaps more prudent to cluster genes over a significant subset of experimental conditions.
This two-way clustering technique has been termed biclustering and was first introduced to gene expression analysis by Cheng and Church [5]. They developed a two-way correlation metric called the mean squared residue score to measure the bicluster quality of selected rows and columns from the gene expression data matrix. They employed this metric within a top-down greedy node deletion algorithm aimed at discovering all the significant biclusters within a gene expression data matrix. Following this seminal work, other metrics and biclustering frameworks were developed [6,7,8]. However, approaches based on Cheng and Church's mean squared residue score remain most prevalent in the literature [9,10,11,12].
One notable drawback of the mean squared residue score, however, is that it is also affected by variance, favouring correlations with low variance. Furthermore, because variance changes by the square of the change in scale, the score tends to discover correlations over lower scales. These effects culminate in a bias toward 'flat' biclusters containing genes with relatively unfluctuating expression levels within the lower scales (fold changes) of the gene expression dataset. This issue has been articulated previously in [13]. We illustrate this mean squared residue (H-Score) bias. In this paper, we introduce an improved bicluster scoring metric which compensates for this bias and enables the discovery of biclusters throughout the expression data, including those potentially more interesting correlations over the higher scales (fold changes).
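To make the flat-bicluster bias concrete, the following sketch (Python with NumPy; the function and variable names are ours, not from the paper's implementation) computes the mean squared residue and shows that a perfectly flat bicluster scores just as well as a genuinely coherent additive pattern, and that rescaling a noisy bicluster by a factor k multiplies its score by k squared:

```python
import numpy as np

def h_score(A):
    """Cheng and Church's mean squared residue (H-Score) of sub-matrix A."""
    row_means = A.mean(axis=1, keepdims=True)   # a_iJ
    col_means = A.mean(axis=0, keepdims=True)   # a_Ij
    overall = A.mean()                          # a_IJ
    residues = A - row_means - col_means + overall
    return float((residues ** 2).mean())

# A perfectly additive (shifted) pattern scores 0 ...
coherent = np.array([[1.0, 2.0, 3.0],
                     [2.0, 3.0, 4.0],
                     [3.0, 4.0, 5.0]])
# ... but so does a completely flat, biologically uninteresting bicluster.
flat = np.full((3, 3), 7.0)
print(h_score(coherent), h_score(flat))      # both 0.0

# Rescaling a noisy bicluster by 10 multiplies its H-Score by 100,
# so the score favours correlations over lower scales.
noisy = coherent + np.random.default_rng(1).normal(0.0, 0.1, (3, 3))
print(h_score(10 * noisy) / h_score(noisy))  # ~ 100, the square of the scale change
```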
2. Bottom-Up Biclustering By Locality Expansion
In general, as with conventional clustering, biclustering techniques fall into top-down and bottom-up approaches. Top-down approaches begin by finding an initial approximation of the solution over the full set of objects and features, in this case all the rows and columns of the gene expression data matrix. This method may, however, discover an artificially large bicluster solution which is in fact a combination of parts of two or more local solutions.

Top-down methods also tend to produce a more uniform set of clusters in terms of size and shape. This representation may not accurately model
the diverse set of functional modules that may be present in expression data. Also, because they deal initially with the full set of dimensions, top-down approaches may not scale well as the dataset increases in size.
Our framework is built around the bottom-up search approach. It is founded on the premise that searching for interesting structure in higher dimensions can first be reduced to a search for structure in lower dimensions. In general, this search method is more adept at discovering several local solutions rather than a single global solution. In this way, we hope to discover a more natural set of biclusters that capture the diversity and organization of functional groups in the cell. This approach also has the advantage that it is computationally more efficient to search for solutions over a smaller set of dimensions.

Our approach can be divided into two phases. First, we perform a stochastic search for a set of highly correlated sub-biclusters. These are then used as a set of seeds for the next phase, a deterministic expansion into higher dimensions by adding the best fitting rows and columns. These algorithms will be discussed in this section, but first we introduce our improved bicluster scoring metric, which is essential in locating these significant bicluster seeds.
A. An Improved Bicluster Scoring Metric :
As already mentioned, the mean squared residue (H-Score) is affected by the scale of the variance within biclusters. To address this problem, our bicluster scoring metric, the Hv-Score, takes into account each entry's distance from its row mean. The sum of the squares of these distances is computed and used as the denominator in a new bicluster scoring measure, given in equation 1. The numerator is simply the sum of the squared residues in the matrix. Minimizing this function for the selected sub-matrix solution minimizes the sum of the squared residues while maximizing the sum of the squared distances from the row means. The Hv-Score is defined as :
$$H_v(I,J) = \frac{\sum_{i \in I,\, j \in J} R(a_{ij})^2}{\sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ})^2} \qquad (1)$$
where $R(a_{ij})$ is the residue score of each entry $a_{ij}$, and $a_{iJ}$ is the mean of row $i$ over the selected columns $J$. We evaluate this new metric and use it to locate highly correlated bicluster seeds across the scales of the dataset in the seed search step of our approach.
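Equation 1 can be sketched as follows (Python/NumPy; names are ours, and the demonstration matrices are illustrative, not from the paper). Because the denominator is the squared deviation from the row means, the same amount of noise is penalized far more on a 'flat' bicluster than on one carrying a strong, high-row-variance correlated pattern:

```python
import numpy as np

def hv_score(A):
    """Hv-Score (equation 1): sum of squared residues divided by the
    sum of squared distances of each entry from its row mean."""
    row_means = A.mean(axis=1, keepdims=True)       # a_iJ
    col_means = A.mean(axis=0, keepdims=True)       # a_Ij
    overall = A.mean()                              # a_IJ
    residues = A - row_means - col_means + overall  # R(a_ij)
    return float((residues ** 2).sum() / ((A - row_means) ** 2).sum())

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.1, (5, 5))
flat = 7.0 + noise                                       # unfluctuating genes
pattern = np.add.outer(3.0 * np.arange(5),
                       2.0 * np.arange(5)) + noise       # strong additive pattern
print(hv_score(flat) > hv_score(pattern))  # True: the flat seed is now penalized
```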
B. Seed Search :
The goal of the initial seed search algorithm is to locate a diverse set of highly correlated bicluster seeds throughout the gene expression data matrix. If these seeds are significant in size, it is more probable that they represent part of a larger bicluster of
rows and columns. For example, a seed that consists of 10 genes actively correlating over 10 conditions is likely to be part of a larger set of related genes and would be a good base on which to expand.
Our seed search is based on the well known stochastic search technique of simulated annealing. It has been shown that simulated annealing can discover improved bicluster solutions over the best-first greedy search method [11]. Unlike greedy search techniques, simulated annealing allows the acceptance of reversals in fitness (worse solutions) and backtracking, which gives it the capability to bypass locally good solutions
until it converges on a solution near the global optimum. The acceptance of reversals is probabilistic, as defined by Boltzmann's equation :
$$P(\Delta E) = e^{-\Delta E / T} \qquad (2)$$
From the equation, it can be seen that this probability depends on two variables: the size of the reversal, ΔE, and the temperature of the system, T. Generally, T is first given a high value to allow a large percentage of reversals to be accepted initially; this helps the search explore the entire search space. Gradually, T is lowered and the potential search space shrinks until the search converges on the global optimum. The number of solutions evaluated (attempts) and the number of solutions accepted (successes) at each temperature are also input parameters. These determine how extensive a search is carried out before T is lowered.
If we reduce both the initial temperature and the depth of search at each temperature, we can confine the search to a more local area of the data. Sequential searches will then be able to uncover several local optima rather than one solution close to the global optimum. With regard to gene expression data, these regional optima may correspond to the diverse set of bicluster seeds.
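A minimal sketch of the Boltzmann acceptance rule (equation 2) together with a geometric cooling schedule; the function names and the particular cooling rate are our illustrative assumptions, not the authors' exact parameters:

```python
import math
import random

def accept_reversal(delta_e, t, rng=random):
    """Accept a fitness reversal (delta_e > 0) with probability
    exp(-delta_e / t), per Boltzmann's equation (2); improvements
    (delta_e <= 0) are always accepted."""
    return delta_e <= 0 or rng.random() < math.exp(-delta_e / t)

# A fixed-size reversal (delta_e = 1.0) is accepted often at a high
# temperature but effectively never once the system has cooled.
print(math.exp(-1.0 / 10.0))    # ~ 0.905 at T = 10
print(math.exp(-1.0 / 0.001))   # underflows to 0.0 at T = 0.001

# Geometric cooling: T is multiplied by a rate < 1 after each round
# of attempts/successes, shrinking the reachable search space.
t, rate = 10.0, 0.9
for _ in range(50):
    t *= rate
```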
C. Seed Expansion :
Upon locating a good set of seeds, the deterministic phase of seed expansion involves adding the best fitting rows and columns to each seed. Prior to seed expansion, the correlation of the rows and columns in the remainder of the dataset is measured relative to the seed. The following formulae, based on the residue score, are used to score these rows and columns.

All rows (genes) not in the seed (I, J) are scored as follows :

$$H(i) = \frac{1}{|J|} \sum_{j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 \qquad (3)$$

where $i \notin I$. All columns (conditions) not in the seed (I, J) are similarly scored :

$$H(j) = \frac{1}{|I|} \sum_{i \in I} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 \qquad (4)$$
where $j \notin J$.
An important attribute of this expansion phase is that it adds correlated rows and columns regardless of scale. Rows and columns are standardized by subtracting the mean and dividing by the standard deviation. These scores are then sorted and the best fitting column or row is added in each iteration. As our main objective in this study is to capture gene (row) correlations, we only add rows during seed expansion. We used fuzzy logic and the Z-test to speed up the process.
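The row scoring of equation 3 can be sketched as follows (Python/NumPy; helper names and the example matrix are ours). A candidate row that follows the seed's additive pattern scores near zero, while an unrelated row scores highly:

```python
import numpy as np

def row_scores(M, I, J):
    """Score each row i not in I by equation 3: the mean squared
    residue of row i (restricted to columns J) against the seed (I, J)."""
    sub = M[:, J]                                # a_ij over j in J, all rows
    a_iJ = sub.mean(axis=1, keepdims=True)       # candidate row means
    seed = M[np.ix_(I, J)]
    a_Ij = seed.mean(axis=0)                     # seed column means
    a_IJ = seed.mean()                           # seed mean
    residues = sub - a_iJ - a_Ij + a_IJ
    h = (residues ** 2).mean(axis=1)
    return {i: float(h[i]) for i in range(M.shape[0]) if i not in set(I)}

M = np.array([[1.0, 2.0, 3.0, 50.0],
              [2.0, 3.0, 4.0, 60.0],
              [5.0, 6.0, 7.0, 70.0],   # shifted copy of the seed pattern
              [9.0, 1.0, 4.0, 80.0]])  # unrelated expression profile
scores = row_scores(M, I=[0, 1], J=[0, 1, 2])
print(scores)  # row 2 scores ~0 and would be added first; row 3 scores highly
```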
Algorithm I : Seed Search

Variable definitions:
x : current solution
t0 : initial temperature
t : current temperature
rate : temperature fall rate
a : attempts, s : successes
acount : attempt count, scount : success count
H : fitness function
M : data matrix
row : seed row size, col : seed column size

SeedSearch(t0, rate, row, col, s, a, M)
    fuzzification
    x <- randomSolution(row, col)
    t <- t0
    while (x not converging)
        while (acount < a AND scount < s)
            do the Z-test to find the significance
            xnew <- GenerateNewSolution(M, x)
            if H(xnew) < H(x) then
                x <- xnew
            else if exp(-ΔE / t) > random(0, 1) then
                x <- xnew
        t <- Cool(t, rate)
    defuzzification
    return x
D. The Yeast Cell Cycle Dataset :
In our evaluation, we used the yeast cell cycle dataset from Cho et al. [18]. This dataset has been used in previous biclustering [5,12] and clustering [19] studies. Cheng and Church used a subset of 2,884 of the most variable genes from Cho's dataset. Unlike expression data from the larger human genome, yeast data gives a more complete representation in which all the ORFs and functional modules are covered.

Algorithm II : The Seed Expansion Phase

Variable definitions:
r : row, c : column
R : rows in seed, C : columns in seed
rVarT : row variance threshold, cVarT : column variance threshold
rT : seed row threshold, cT : seed column threshold

SeedExpansion(Seed(R,C), rVarT, cVarT, rT, cT)
    fuzzification
    Score all r ∉ R
    Score all c ∉ C
    Sort r scores
    Sort c scores
    Select best scoring r or c
        if (variance of r/c > rVarT/cVarT)
            if (r/c count < rT/cT)
                add r/c to Seed(R,C)
                re-score Seed(R,C)
    defuzzification
    return expanded Seed(R,C)
Cho's dataset contains 6,178 genes, 17 conditions and 6,453 missing values. Rows containing many missing values were removed, giving 6,145 rows with 5,846 missing values. Missing values were replaced by random values generated between the first and third quartiles of the data range. This reduces the impact of these imputed values on the structure of the dataset. We annotated all genes in our dataset
Table 1. Comparison between the H-Score and the Hv-Score: 10 bicluster seeds discovered by Seed Search with each metric. The percentage of the Dominant Functional Category (D.F.C.) and the MIPS code and name are also listed.

Seed | D.F.C. (H-Score) | MIPS Category      | D.F.C. (Hv-Score) | MIPS Category
  1  |       40%        | 99: Unclass. ORFs  |       70%         | 12: Protein Synth.
  2  |       50%        | 14: Protein Fate   |       60%         | 10: Cell Cycle
  3  |       30%        | 01: Metabolism     |       40%         | 99: Unclass. ORFs
  4  |       40%        | 14: Protein Fate   |       90%         | 11: Transcription
  5  |       60%        | 14: Protein Fate   |       60%         | 12: Protein Synth.
  6  |       40%        | 32: Cell Rescue    |       50%         | 99: Unclass. ORFs
  7  |       40%        | 99: Unclass. ORFs  |       50%         | 14: Protein Fate
  8  |       20%        | 12: Protein Synth. |       40%         | 99: Unclass. ORFs
  9  |       50%        | 14: Protein Fate   |       50%         | 12: Protein Synth.
 10  |       40%        | 14: Protein Fate   |       70%         | 10: Cell Cycle
Figure I. Best scoring seed found with the Hv-Score (genes: 216, 217, 526, 616, 1022, 1184, 1476, 1623, 1795, 2086, 2278, 2375, 2538).
using the MIPS (Munich Information Centre for Protein Sequences) online functional catalogue [20]. Over 1,500 genes in the dataset were labelled as category 99 (unclassified). These annotations were used to evaluate the correspondence of the biclusters to gene functional modules.
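The missing-value handling described above, random imputation between the first and third quartiles of the data range, can be sketched as follows (Python/NumPy; the function name, seed and example matrix are ours):

```python
import numpy as np

def impute_interquartile(M, seed=0):
    """Replace missing values (NaNs) with uniform random draws between
    the first and third quartiles of the observed entries, limiting the
    imputed values' impact on the structure of the dataset."""
    rng = np.random.default_rng(seed)
    observed = M[~np.isnan(M)]
    q1, q3 = np.percentile(observed, [25, 75])
    out = M.copy()
    mask = np.isnan(out)
    out[mask] = rng.uniform(q1, q3, size=mask.sum())
    return out

M = np.array([[0.1, np.nan, 0.7],
              [0.4, 0.9, np.nan],
              [0.2, 0.5, 0.8]])
imputed = impute_interquartile(M)
print(np.isnan(imputed).any())  # False: every gap now holds a mid-range value
```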
3. Evaluation
In the first part of our evaluation, we evaluate our improved metric, the Hv-Score, against Cheng and Church's original H-Score. We then carry out comparative evaluations with recent clustering and biclustering approaches, and end by using our model to putatively annotate several unclassified yeast genes.
4. Discussion & Future Work
In this paper, we have demonstrated improvements to the popular mean squared residue scoring function by removing the bias it shows toward so-called 'flat' biclusters. Furthermore, our Hv-Score was shown to be a more effective objective function in both qualitative graphical illustrations and quantitative functional analysis of biclusters.
In future work, it is our hope to compound these and further putative classifications by cross-validation across multiple expression datasets. Cho's yeast cell cycle dataset is a well studied dataset and a good benchmark, but it may lack the condition number to fully test a subspace clustering method such as biclustering. Using larger datasets, we hope to further validate our biclustering method and also increase the support for gene relationships and putative classifications by discovering biclusters over a larger number of conditions.
REFERENCES

1. B. Berrer, W. Dubitzky, and S. Draghici, A Practical Approach to Microarray Data Analysis. Kluwer Academic Publishers, ch. 1, pp. 15-19 (2003).
2. S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M.E. McLaughlin, J.Y. Kim, L.C. Goumnerova, B. Black, C. Lau, J.C. Allen, D. Zagzag, J.M.O., T. Curran, C. Wetmore, J.A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, D. Stolovitzky, D. Louis, J. Mesirov, E. Lander, and T. Golub, "Prediction of central nervous system embryonal tumour outcome based on gene expression", Nature, vol. 415, pp. 436-442 (2002).
3. I.S. Dhillon, S. Mallela, and D.S. Modha, "Information theoretic co-clustering", in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003).
4. A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini, "Discovering local structure in gene expression data: the order-preserving submatrix problem", Journal of Computational Biology, vol. 10, No. 3-4, pp. 373-384 (2003).
5. Y. Cheng and G.M. Church, "Biclustering of expression data", in Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), pp. 93-103 (2000).
6. A. Tanay, R. Sharan, and R. Shamir, "Discovering statistically significant biclusters in gene expression data", Bioinformatics, vol. 18, pp. 36-44 (2002).
7. L. Lazzeroni and A. Owen, "Plaid models for gene expression data", Statistica Sinica, vol. 12, pp. 61-86 (2002).
8. Y. Kluger, R. Basri, J.T. Chang, and M. Gerstein, "Spectral biclustering of microarray data: Coclustering genes and conditions", Genome Research, vol. 13, pp. 703-716 (2003).
9. J. Yang, H. Wang, W. Wang, and P. Yu, "Enhanced biclustering on expression data", in IEEE Third Symposium on Bioinformatics and Bioengineering (2003).
10. H. Cho, I.S. Dhillon, Y. Guan, and S. Sra, "Minimum sum squared residue co-clustering of gene expression data", in SIAM International Conference on Data Mining (2004).
11. K. Bryan, P. Cunningham, and N. Bolshakova, "Biclustering of expression data using simulated annealing", in Proceedings of the Eighteenth IEEE Symposium on Computer Based Medical Systems (2005).
12. S. Bleuler, A. Prelic, and E. Zitzler, "An EA framework for biclustering of gene expression data", in Congress on Evolutionary Computation (CEC-2004), Piscataway, NJ: IEEE, pp. 166-173 (2004).
13. J. Aguilar-Ruiz, "Shifting and scaling patterns from gene expression data", Bioinformatics, vol. 21, No. 20, pp. 3840-3845 (2005).
14. B. Mirkin, Mathematical Classification and Clustering, Dordrecht: Kluwer (1996).
15. D.S. Johnson, "The NP-completeness column: an ongoing guide", Journal of Algorithms, vol. 8, pp. 438-448 (1987).
16. I.S. Dhillon, E.M. Marcotte, and U. Roshan, "Diametrical clustering for identifying anti-correlated gene clusters", Bioinformatics, vol. 19, No. 13, pp. 1612-1619 (2003).
17. S.C. Madeira and A.L. Oliveira, "Biclustering algorithms for biological data analysis: A survey", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, No. 1, pp. 24-45 (2004).
18. R. Cho, M. Campbell, E. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. Wolfsberg, A. Gabrielian, D. Landsman, and D. Lockhart, "A genome-wide transcriptional analysis of the mitotic cell cycle", Molecular Cell, vol. 2, pp. 65-73, July (1998).
19. R. Balasubramaniyan, E. Hullermeier, N. Weskamp, and J. Kamper, "Clustering of gene expression data using a local shape-based similarity measure", Bioinformatics, vol. 21, No. 7, pp. 1069-1077 (2005).
20. H.W. Mewes, D. Frishman, U. Guldener, G. Mannhaupt, K.F.X. Mayer, M. Mokrejs, B. Morgenstern, M. Munsterkotter, S. Rudd, and B. Weil, "MIPS: a database for genomes and protein sequences", Nucleic Acids Research, vol. 30, No. 1, pp. 31-34 (2002).
21. R. Shamir, A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh, and R. Elkon, "EXPANDER - an integrative program suite for microarray data analysis", BMC Bioinformatics, vol. 6, p. 232 (2005).