Improved Biclustering Of Microarray Data
SADIQ HUSSAIN* and G.C. HAZARIKA**
*System Administrator, Examination Branch, Dibrugarh University, Dibrugarh-786004 (India)
[email protected], 09435246192 (M)
**Director, Centre for Computer Studies, Dibrugarh University, Dibrugarh-786004 (India)
[email protected]
ABSTRACT
Biclustering algorithms simultaneously cluster both rows and columns. These algorithms are applied to gene expression data analysis to find a subset of genes that exhibit similar expression patterns under a subset of conditions. Cheng and Church introduced the mean squared residue measure to capture the coherence of a subset of genes over a subset of conditions. They provided a set of heuristic algorithms, based primarily on node deletion, to find one bicluster or a set of biclusters after masking discovered biclusters with random values. The mean squared residue is a popular measure of bicluster quality. One drawback, however, is that it is biased toward flat biclusters with low row variance. In this paper, we introduce an improved bicluster score that removes this bias and promotes the discovery of the most significant biclusters in the dataset. We employ this score within a new biclustering approach based on the bottom-up search strategy. We believe that the bottom-up search approach better models the underlying functional modules of the gene expression dataset.
J. Comp. & Math. Sci. Vol. 1(2), 111-120 (2010).
1. INTRODUCTION
Advances in gene expression microarray technologies over the last decade or so have made it possible to measure the expression levels of thousands of genes over many experimental conditions (e.g., different patients, tissue types and growth environments). The data produced in these experiments are usually arranged in a data matrix of genes (rows) and conditions (columns). Results from multiple microarray experiments may be combined and the
Journal of Computer and Mathematical Sciences Vol. 1, Issue 2, 31 March, 2010, Pages (103-273)
data matrix may easily exceed thousands of genes and hundreds of conditions in size.
As datasets increase in size, however, it becomes increasingly unlikely that genes will retain correlation across the full set of conditions, making clustering problematic. The gene expression context further exacerbates this problem, as it is not uncommon for the expression of related genes to be highly similar under one set of conditions and yet independent under another set [4]. Given these issues, it is perhaps more prudent to cluster genes over a significant subset of experimental conditions.
This two-way clustering technique has been termed biclustering and was first introduced to gene expression analysis by Cheng and Church [5]. They developed a two-way correlation metric called the mean squared residue score to measure the bicluster quality of selected rows and columns from the gene expression data matrix. They employed this metric within a top-down greedy node deletion algorithm aimed at discovering all the significant biclusters within a gene expression data matrix. Following this seminal work, other metrics and biclustering frameworks were developed [6,7,8]. However, approaches based on Cheng and Church's mean squared residue score remain most prevalent in the literature [9,10,11,12].
One notable drawback of the mean squared residue score, however, is that it is also affected by variance, favouring correlations with low variance. Furthermore, because variance changes by the square of the change in scale, the score tends to discover correlations over lower scales. These effects culminate in a bias toward 'flat' biclusters containing genes with relatively unfluctuating expression levels within the lower scales (fold changes) of the gene expression dataset. This issue has been articulated previously in [13]. We illustrate this mean squared residue (H-Score) bias. In this paper, we introduce an improved bicluster scoring metric which compensates for this bias and enables the discovery of biclusters throughout the expression data, including those potentially more interesting correlations over the higher scales (fold changes).
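To make the flat-bicluster bias concrete, the following sketch (Python with NumPy; the function and variable names are ours, not from the paper's implementation) computes the mean squared residue and shows that a perfectly flat bicluster scores just as well as a genuinely coherent additive pattern, and that rescaling a noisy bicluster by a factor k multiplies its score by k squared:

```python
import numpy as np

def h_score(A):
    """Cheng and Church's mean squared residue (H-Score) of sub-matrix A."""
    row_means = A.mean(axis=1, keepdims=True)   # a_iJ
    col_means = A.mean(axis=0, keepdims=True)   # a_Ij
    overall = A.mean()                          # a_IJ
    residues = A - row_means - col_means + overall
    return float((residues ** 2).mean())

# A perfectly additive (shifted) pattern scores 0 ...
coherent = np.array([[1.0, 2.0, 3.0],
                     [2.0, 3.0, 4.0],
                     [3.0, 4.0, 5.0]])
# ... but so does a completely flat, biologically uninteresting bicluster.
flat = np.full((3, 3), 7.0)
print(h_score(coherent), h_score(flat))      # both 0.0

# Rescaling a noisy bicluster by 10 multiplies its H-Score by 100,
# so the score favours correlations over lower scales.
noisy = coherent + np.random.default_rng(1).normal(0.0, 0.1, (3, 3))
print(h_score(10 * noisy) / h_score(noisy))  # ~ 100, the square of the scale change
```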
2. Bottom-Up Biclustering By Locality Expansion
In general, as with conventional clustering, biclustering techniques fall into top-down and bottom-up approaches. Top-down approaches begin by finding an initial approximation of the solution over the full set of objects and features, in this case all the rows and columns of the gene expression data matrix. This method may, however, discover an artificially large bicluster solution which is in fact a combination of parts of two or more local solutions.

Top-down methods also tend to produce a more uniform set of clusters in terms of size and shape. This representation may not accurately model
the diverse set of functional modules that may be present in expression data. Also, because they deal initially with the full set of dimensions, top-down approaches may not scale well as the dataset increases in size.
Our framework is built around the bottom-up search approach. It is founded on the premise that searching for interesting structure in higher dimensions can first be reduced to a search for structure in lower dimensions. In general, this search method is more adept at discovering several local solutions rather than a single global solution. In this way, we hope to discover a more natural set of biclusters that capture the diversity and organization of functional groups in the cell. This approach also has the advantage that it is computationally more efficient to search for solutions over a smaller set of dimensions.

Our approach can be divided into two phases. First, we perform a stochastic search for a set of highly correlated sub-biclusters. These are then used as a set of seeds for the next phase, a deterministic expansion into higher dimensions by adding the best fitting rows and columns. These algorithms will be discussed in this section, but first we introduce our improved bicluster scoring metric, which is essential in locating these significant bicluster seeds.
A. An Improved Bicluster Scoring Metric :
As already mentioned, the mean squared residue (H-Score) is affected by the scale of the variance within biclusters. To address this problem, our bicluster scoring metric, the Hv-Score, takes into account each entry's distance from its row mean. The sum of the squares of these distances is computed and used as the denominator in a new bicluster scoring measure, given in equation 1. The numerator is simply the sum of the squared residues in the matrix. Minimizing this function for the selected sub-matrix solution minimizes the sum of the squared residues while maximizing the sum of the squared distances from the row means. The Hv-Score is defined as :
$$H_v(I,J) = \frac{\sum_{i \in I,\, j \in J} R(a_{ij})^2}{\sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ})^2} \qquad (1)$$
where $R(a_{ij})$ is the residue score of each entry $a_{ij}$, and $a_{iJ}$ is the mean of row $i$ over the selected columns $J$. We evaluate this new metric and use it to locate highly correlated bicluster seeds across the scales of the dataset in the seed search step of our approach.
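Equation 1 can be sketched as follows (Python/NumPy; names are ours, and the demonstration matrices are illustrative, not from the paper). Because the denominator is the squared deviation from the row means, the same amount of noise is penalized far more on a 'flat' bicluster than on one carrying a strong, high-row-variance correlated pattern:

```python
import numpy as np

def hv_score(A):
    """Hv-Score (equation 1): sum of squared residues divided by the
    sum of squared distances of each entry from its row mean."""
    row_means = A.mean(axis=1, keepdims=True)       # a_iJ
    col_means = A.mean(axis=0, keepdims=True)       # a_Ij
    overall = A.mean()                              # a_IJ
    residues = A - row_means - col_means + overall  # R(a_ij)
    return float((residues ** 2).sum() / ((A - row_means) ** 2).sum())

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.1, (5, 5))
flat = 7.0 + noise                                       # unfluctuating genes
pattern = np.add.outer(3.0 * np.arange(5),
                       2.0 * np.arange(5)) + noise       # strong additive pattern
print(hv_score(flat) > hv_score(pattern))  # True: the flat seed is now penalized
```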
B. Seed Search :
The goal of the initial seed search algorithm is to locate a diverse set of highly correlated bicluster seeds throughout the gene expression data matrix. If these seeds are significant in size, it is more probable that they represent part of a larger bicluster of
rows and columns. For example, a seed that consists of 10 genes actively correlating over 10 conditions is likely to be part of a larger set of related genes and would be a good base on which to expand.
Our seed search is based on the well known stochastic search technique of simulated annealing. It has been shown that simulated annealing can discover improved bicluster solutions over the best-first greedy search method [11]. Unlike greedy search techniques, simulated annealing allows the acceptance of reversals in fitness (worse solutions) and backtracking, which gives it the capability to bypass locally good solutions
until it converges on a solution near the global optimum. The acceptance of reversals is probabilistic, as defined by Boltzmann's equation :
$$P(\Delta E) = e^{-\Delta E / T} \qquad (2)$$
From the equation, it can be seen that this probability depends on two variables: the size of the reversal, ΔE, and the temperature of the system, T. Generally, T is first given a high value to allow a large percentage of reversals to be accepted initially; this helps the search explore the entire search space. Gradually, T is lowered and the potential search space shrinks until the search converges on the global optimum. The number of solutions evaluated (attempts) and the number of solutions accepted (successes) at each temperature are also input parameters. These determine how extensive a search is carried out before T is lowered.
If we reduce both the initial temperature and the depth of search at each temperature, we can confine the search to a more local area of the data. Sequential searches will then be able to uncover several local optima rather than one solution close to the global optimum. With regard to gene expression data, these regional optima may correspond to the diverse set of bicluster seeds.
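A minimal sketch of the Boltzmann acceptance rule (equation 2) together with a geometric cooling schedule; the function names and the particular cooling rate are our illustrative assumptions, not the authors' exact parameters:

```python
import math
import random

def accept_reversal(delta_e, t, rng=random):
    """Accept a fitness reversal (delta_e > 0) with probability
    exp(-delta_e / t), per Boltzmann's equation (2); improvements
    (delta_e <= 0) are always accepted."""
    return delta_e <= 0 or rng.random() < math.exp(-delta_e / t)

# A fixed-size reversal (delta_e = 1.0) is accepted often at a high
# temperature but effectively never once the system has cooled.
print(math.exp(-1.0 / 10.0))    # ~ 0.905 at T = 10
print(math.exp(-1.0 / 0.001))   # underflows to 0.0 at T = 0.001

# Geometric cooling: T is multiplied by a rate < 1 after each round
# of attempts/successes, shrinking the reachable search space.
t, rate = 10.0, 0.9
for _ in range(50):
    t *= rate
```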
C. Seed Expansion :
Upon locating a good set of seeds, the deterministic phase of seed expansion involves adding the best fitting rows and columns to each seed. Prior to seed expansion, the correlation of the rows and columns in the remainder of the dataset is measured relative to the seed. The following formulae, based on the residue score, are used to score these rows and columns.

All rows (genes) not in the seed (I, J) are scored as follows :

$$H(i) = \frac{1}{|J|} \sum_{j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 \qquad (3)$$

where $i \notin I$. All columns (conditions) not in the seed (I, J) are similarly scored :

$$H(j) = \frac{1}{|I|} \sum_{i \in I} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 \qquad (4)$$
where $j \notin J$.
An important attribute of this expansion phase is that it adds correlated rows and columns regardless of scale. Rows and columns are standardized by subtracting the mean and dividing by the standard deviation. These scores are then sorted and the best fitting column or row is added in each iteration. As our main objective in this study is to capture gene (row) correlations, we only add rows during seed expansion. We used fuzzy logic and the Z-test to speed up the process.
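The row scoring of equation 3 can be sketched as follows (Python/NumPy; helper names and the example matrix are ours). A candidate row that follows the seed's additive pattern scores near zero, while an unrelated row scores highly:

```python
import numpy as np

def row_scores(M, I, J):
    """Score each row i not in I by equation 3: the mean squared
    residue of row i (restricted to columns J) against the seed (I, J)."""
    sub = M[:, J]                                # a_ij over j in J, all rows
    a_iJ = sub.mean(axis=1, keepdims=True)       # candidate row means
    seed = M[np.ix_(I, J)]
    a_Ij = seed.mean(axis=0)                     # seed column means
    a_IJ = seed.mean()                           # seed mean
    residues = sub - a_iJ - a_Ij + a_IJ
    h = (residues ** 2).mean(axis=1)
    return {i: float(h[i]) for i in range(M.shape[0]) if i not in set(I)}

M = np.array([[1.0, 2.0, 3.0, 50.0],
              [2.0, 3.0, 4.0, 60.0],
              [5.0, 6.0, 7.0, 70.0],   # shifted copy of the seed pattern
              [9.0, 1.0, 4.0, 80.0]])  # unrelated expression profile
scores = row_scores(M, I=[0, 1], J=[0, 1, 2])
print(scores)  # row 2 scores ~0 and would be added first; row 3 scores highly
```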
Algorithm I : Seed Search

Variable definitions:
x : current solution
t0 : initial temperature
t : current temperature
rate : temperature fall rate
a : attempts, s : successes
acount : attempt count, scount : success count
H : fitness function
M : data matrix
row : seed row size, col : seed column size

SeedSearch(t0, rate, row, col, s, a, M)
    fuzzification
    x <- randomSolution(row, col)
    t <- t0
    while (x not converging)
        while (acount < a AND scount < s)
            do the Z-test to find the significance
            xnew <- GenerateNewSolution(M, x)
            if H(xnew) < H(x) then
                x <- xnew
            else if exp(-ΔE / t) > random(0, 1) then
                x <- xnew
        t <- Cool(t, rate)
    defuzzification
    return x
D. The Yeast Cell Cycle Dataset :
In our evaluation, we used the yeast cell cycle dataset from Cho et al. [18]. This dataset has been used in previous biclustering [5,12] and clustering [19] studies. Cheng and Church used a subset of 2,884 of the most variable genes from Cho's dataset. Unlike expression data from the larger human genome, yeast data gives a more complete representation in which all the ORFs and functional modules are covered.

Algorithm II : The Seed Expansion Phase

Variable definitions:
r : row, c : column
R : rows in seed, C : columns in seed
rVarT : row variance threshold, cVarT : column variance threshold
rT : seed row threshold, cT : seed column threshold

SeedExpansion(Seed(R,C), rVarT, cVarT, rT, cT)
    fuzzification
    Score all r ∉ R
    Score all c ∉ C
    Sort r scores
    Sort c scores
    Select best scoring r or c
        if (variance of r/c > rVarT/cVarT)
            if (r/c count < rT/cT)
                add r/c to Seed(R,C)
                re-score Seed(R,C)
    defuzzification
    return expanded Seed(R,C)
Cho's dataset contains 6,178 genes, 17 conditions and 6,453 missing values. Rows containing many missing values were removed, giving 6,145 rows with 5,846 missing values. Missing values were replaced by random values generated between the first and third quartiles of the data range. This reduces the impact of these imputed values on the structure of the dataset. We annotated all genes in our dataset
Table 1. Comparison between the H-Score and the Hv-Score: 10 bicluster seeds discovered by Seed Search with each metric. The percentage of the Dominant Functional Category (D.F.C.) and the MIPS code and name are also listed.

Seed | D.F.C. (H-Score) | MIPS Category      | D.F.C. (Hv-Score) | MIPS Category
  1  |       40%        | 99: Unclass. ORFs  |       70%         | 12: Protein Synth.
  2  |       50%        | 14: Protein Fate   |       60%         | 10: Cell Cycle
  3  |       30%        | 01: Metabolism     |       40%         | 99: Unclass. ORFs
  4  |       40%        | 14: Protein Fate   |       90%         | 11: Transcription
  5  |       60%        | 14: Protein Fate   |       60%         | 12: Protein Synth.
  6  |       40%        | 32: Cell Rescue    |       50%         | 99: Unclass. ORFs
  7  |       40%        | 99: Unclass. ORFs  |       50%         | 14: Protein Fate
  8  |       20%        | 12: Protein Synth. |       40%         | 99: Unclass. ORFs
  9  |       50%        | 14: Protein Fate   |       50%         | 12: Protein Synth.
 10  |       40%        | 14: Protein Fate   |       70%         | 10: Cell Cycle
Figure I. Best scoring seed found with the Hv-Score (genes: 216, 217, 526, 616, 1022, 1184, 1476, 1623, 1795, 2086, 2278, 2375, 2538).
using the MIPS (Munich Information Centre for Protein Sequences) online functional catalogue [20]. Over 1,500 genes in the dataset were labelled as category 99 (unclassified). These annotations were used to evaluate the correspondence of the biclusters to gene functional modules.
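The missing-value handling described above, random imputation between the first and third quartiles of the data range, can be sketched as follows (Python/NumPy; the function name, seed and example matrix are ours):

```python
import numpy as np

def impute_interquartile(M, seed=0):
    """Replace missing values (NaNs) with uniform random draws between
    the first and third quartiles of the observed entries, limiting the
    imputed values' impact on the structure of the dataset."""
    rng = np.random.default_rng(seed)
    observed = M[~np.isnan(M)]
    q1, q3 = np.percentile(observed, [25, 75])
    out = M.copy()
    mask = np.isnan(out)
    out[mask] = rng.uniform(q1, q3, size=mask.sum())
    return out

M = np.array([[0.1, np.nan, 0.7],
              [0.4, 0.9, np.nan],
              [0.2, 0.5, 0.8]])
imputed = impute_interquartile(M)
print(np.isnan(imputed).any())  # False: every gap now holds a mid-range value
```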
3. Evaluation
In the first part of our evaluation, we evaluate our improved metric, the Hv-Score, against Cheng and Church's original H-Score. We then carry out comparative evaluations with recent clustering and biclustering approaches, and end by using our model to putatively annotate several unclassified yeast genes.
4. Discussion & Future Work
In this paper, we have demonstrated improvements to the popular mean squared residue scoring function by removing the bias it shows toward so-called 'flat' biclusters. Furthermore, our Hv-Score was shown to be a more effective objective function in both qualitative graphical illustrations and quantitative functional analysis of biclusters.
In future work, it is our hope to compound these and further putative classifications by cross-validation across multiple expression datasets. Cho's yeast cell cycle dataset is a well studied dataset and a good benchmark, but it may lack the condition number to fully test a subspace clustering method such as biclustering. Using larger datasets, we hope to further validate our biclustering method and also increase the support for gene relationships and putative classifications by discovering biclusters over a larger number of conditions.
REFERENCES

1. B. Berrer, W. Dubitzky, and S. Draghici, A Practical Approach to Microarray Data Analysis. Kluwer Academic Publishers, ch. 1, pp. 15-19 (2003).
2. S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M.E. McLaughlin, J.Y. Kim, L.C. Goumnerova, B. Black, C. Lau, J.C. Allen, D. Zagzag, J.M.O., T. Curran, C. Wetmore, J.A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, D. Stolovitzky, D. Louis, J. Mesirov, E. Lander, and T. Golub, "Prediction of central nervous system embryonal tumour outcome based on gene expression", Nature, vol. 415, pp. 436-442 (2002).
3. I.S. Dhillon, S. Mallela, and D.S. Modha, "Information theoretic co-clustering", in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003).
4. A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini, "Discovering local structure in gene expression data: the order-preserving submatrix problem", Journal of Computational Biology, vol. 10, No. 3-4, pp. 373-384 (2003).
5. Y. Cheng and G.M. Church, "Biclustering of expression data", in Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), pp. 93-103 (2000).
6. A. Tanay, R. Sharan, and R. Shamir, "Discovering statistically significant biclusters in gene expression data", Bioinformatics, vol. 18, pp. 36-44 (2002).
7. L. Lazzeroni and A. Owen, "Plaid models for gene expression data", Statistica Sinica, vol. 12, pp. 61-86 (2002).
8. Y. Kluger, R. Basri, J.T. Chang, and M. Gerstein, "Spectral biclustering of microarray data: Coclustering genes and conditions", Genome Research, vol. 13, pp. 703-716 (2003).
9. J. Yang, H. Wang, W. Wang, and P. Yu, "Enhanced biclustering on expression data", in IEEE Third Symposium on Bioinformatics and Bioengineering (2003).
10. H. Cho, I.S. Dhillon, Y. Guan, and S. Sra, "Minimum sum squared residue co-clustering of gene expression data", in SIAM International Conference on Data Mining (2004).
11. K. Bryan, P. Cunningham, and N. Bolshakova, "Biclustering of expression data using simulated annealing", in Proceedings of the Eighteenth IEEE Symposium on Computer Based Medical Systems (2005).
12. S. Bleuler, A. Prelic, and E. Zitzler, "An EA framework for biclustering of gene expression data", in Congress on Evolutionary Computation (CEC-2004), Piscataway, NJ: IEEE, pp. 166-173 (2004).
13. J. Aguilar-Ruiz, "Shifting and scaling patterns from gene expression data", Bioinformatics, vol. 21, No. 20, pp. 3840-3845 (2005).
14. B. Mirkin, Mathematical Classification and Clustering, Dordrecht: Kluwer (1996).
15. D.S. Johnson, "The NP-completeness column: an ongoing guide", Journal of Algorithms, vol. 8, pp. 438-448 (1987).
16. I.S. Dhillon, E.M. Marcotte, and U. Roshan, "Diametrical clustering for identifying anti-correlated gene clusters", Bioinformatics, vol. 19, No. 13, pp. 1612-1619 (2003).
17. S.C. Madeira and A.L. Oliveira, "Biclustering algorithms for biological data analysis: A survey", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, No. 1, pp. 24-45 (2004).
18. R. Cho, M. Campbell, E. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. Wolfsberg, A. Gabrielian, D. Landsman, and D. Lockhart, "A genome-wide transcriptional analysis of the mitotic cell cycle", Molecular Cell, vol. 2, pp. 65-73, July (1998).
19. R. Balasubramaniyan, E. Hullermeier, N. Weskamp, and J. Kamper, "Clustering of gene expression data using a local shape-based similarity measure", Bioinformatics, vol. 21, No. 7, pp. 1069-1077 (2005).
20. H.W. Mewes, D. Frishman, U. Guldener, G. Mannhaupt, K.F.X. Mayer, M. Mokrejs, B. Morgenstern, M. Munsterkotter, S. Rudd, and B. Weil, "MIPS: a database for genomes and protein sequences", Nucleic Acids Research, vol. 30, No. 1, pp. 31-34 (2002).
21. R. Shamir, A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh, and R. Elkon, "EXPANDER - an integrative program suite for microarray data analysis", BMC Bioinformatics, vol. 6, p. 232 (2005).