Clustering Short Gene Expression Profiles
Ling Wang, Marco Ramoni, Paolo Sebastiani
Abdullah Mueen 1
The Problem: Input
       Gene Symbol       0h     0.5h       3h       6h      12h
    1  ZFX           -0.027    0.158    0.169    0.193   -0.165
    2  ZNF133         0.183   -0.068   -0.134   -0.252    0.177
    3  USP2          -0.67    -0.709   -0.347   -0.779   -0.403
    4  DSCR1L1       -0.923   -0.51    -0.718   -0.512   -0.668
    5  WNT5A         -0.471   -0.264   -0.269   -0.154   -0.254
  ...
24180                 0.983    1.55     2.541    2.187    0.147
Gene expression profiles for J genes from microarray experiments [1]
The Problem: Output
• A clustering of the genes that groups functionally related genes in the same cluster.
Previous Works
• Hierarchical clustering (Eisen et al., 1998)
• K-means and self-organizing maps (Tamayo et al., 1999)
• Standard measures: Euclidean distance, correlation coefficient.
• Problem
– Ignores the sequential nature of the profiles.
– Different pairs of time series can have the same measure.
[3]
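The first problem can be illustrated with a minimal sketch. The two profiles below are made-up values (not taken from [1]). Reordering the time points of both profiles in the same way destroys their temporal shape, yet the Euclidean distance between them does not change.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical profiles over 5 time points (illustrative values only).
rising = [0.0, 0.2, 0.4, 0.6, 0.8]
flat = [0.4, 0.4, 0.4, 0.4, 0.4]

# Apply the same reordering of time points to both profiles.
order = [3, 0, 4, 1, 2]
rising_shuffled = [rising[i] for i in order]
flat_shuffled = [flat[i] for i in order]

d_original = euclidean(rising, flat)
d_shuffled = euclidean(rising_shuffled, flat_shuffled)

# The distance is the same although "rising" is no longer monotone.
print(d_original, d_shuffled)
```

The distance only sums squared differences point by point, so it cannot tell a steadily rising profile from the same values in scrambled order.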
Previous Works
• Continuous representation of the profile using
– Autoregressive models.
– Hidden Markov models.
• Advantages:
– Account for the temporal information.
– Good for long profiles (10 points or more).
– Combine easily with Bayesian clustering.
[3]
Autoregressive Model: Definition
• Each time point is correlated with the p previous time points.
• Combining the models of all the time points for a gene
• Xj is the regression matrix of size (n-p)×(p+1) and βj is the coefficient matrix.
[2]
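As a rough sketch of where the (n-p)×(p+1) size comes from, assume the standard AR(p) form with an intercept, x_t = β0 + β1 x_(t-1) + ... + βp x_(t-p) + error. The intercept column and the function name ar_design are assumptions of this sketch. Only time points p+1, ..., n have p predecessors, which gives n-p rows, and each row holds a constant plus p lagged values.

```python
def ar_design(series, p):
    """Build the response vector and regression matrix for an AR(p) model.

    The row for time index t is [1, x[t-1], ..., x[t-p]], so the matrix
    has (n - p) rows and (p + 1) columns.
    """
    n = len(series)
    if not 0 < p < n:
        raise ValueError("need 0 < p < n")
    X = [[1.0] + [series[t - i] for i in range(1, p + 1)] for t in range(p, n)]
    y = [series[t] for t in range(p, n)]
    return y, X

# ZFX profile from the input table, treated as an evenly indexed sequence.
zfx = [-0.027, 0.158, 0.169, 0.193, -0.165]
y, X = ar_design(zfx, p=2)

for target, row in zip(y, X):
    print(target, row)
```

With n = 5 and p = 2 only three rows remain to estimate three coefficients, which is why the regression order cannot be large for short profiles.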
Autoregressive Model: Problems
• Problems
– The AR model is for stationary time series. Intervals between time points are ignored.
– For short gene expression profiles (5 time points), the regression order cannot be large.
– For a large number of genes with short expression profiles, there may be random patterns. The AR model overfits these random patterns.
The Algorithm
The algorithm has three components:
1. A model describing the dynamics of gene expression temporal profiles.
2. A probabilistic metric to score different clustering models based on the posterior probability of each clustering model.
3. A heuristic to make the search for the best clustering model feasible.
Polynomial Model: Definition
• Each time point is approximated by a polynomial of degree p.
• The combined model for a gene is
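One way to sketch such a model is to assume that the polynomial is in the sampling time t, so that the expected expression at time t is β0 + β1 t + ... + βp t^p. The matrix name F, the function names, and the coefficient values below are hypothetical. Stacking one row per time point gives a design matrix with n rows and p+1 columns.

```python
def poly_design(times, p):
    """Design matrix with one row [1, t, t**2, ..., t**p] per time point."""
    return [[float(t) ** k for k in range(p + 1)] for t in times]

def predict(F, beta):
    """Evaluate the polynomial at every time point (matrix-vector product)."""
    return [sum(f * b for f, b in zip(row, beta)) for row in F]

# Sampling times in hours from the input table; note the uneven spacing.
times = [0, 0.5, 3, 6, 12]
F = poly_design(times, p=2)

# Hypothetical coefficients, for illustration only.
beta = [0.1, 0.05, -0.005]
fitted = predict(F, beta)

for t, value in zip(times, fitted):
    print(t, round(value, 4))
```

Because the rows are built from the actual sampling times, uneven intervals between measurements enter the model directly.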
Polynomial Model: Assumptions
• The uncorrelated errors are normally distributed with mean 0 and variance 1/τj, where τj is the precision.
• The coefficients are normally distributed.
• β0, α1 and α2 are hyper-parameters of the prior distributions of the parameters.
Hyper-parameters
• Around 25-50% of the total number of genes/probes in the microarrays are disregarded because of their low confidence level.
• To avoid overfitting random patterns, the hyper-parameters are estimated from random data.
• If σ_a² is the sample variance of the disregarded genes, then the hyper-parameters are related through
Scoring Method
• The scoring function is calculated using the marginal likelihood of each gene, which is
• For the current model, the marginal likelihood of a gene is
Marginal Likelihood
• With the polynomial model, assumed prior distribution and hyper parameters, the marginal likelihood function is computed.
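A closed-form marginal likelihood of this kind can be sketched with the textbook conjugate result for Bayesian linear regression. The sketch assumes a specific prior: precision τ ~ Gamma(shape a0, rate b0) and coefficients β | τ ~ Normal(β0, (τ r0 I)^(-1)). The names a0, b0, r0 and the helper functions are hypothetical, and the slides' α1 and α2 may be parameterized differently, so this is an illustration rather than the authors' exact formula.

```python
import math

def solve_spd(A, b):
    """Solve A x = b for a symmetric positive definite A by Gaussian
    elimination; also return log(det(A)) as the sum of log pivots."""
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    logdet = 0.0
    for k in range(m):
        pivot = M[k][k]  # positive for positive definite matrices
        logdet += math.log(pivot)
        for i in range(k + 1, m):
            factor = M[i][k] / pivot
            for j in range(k, m + 1):
                M[i][j] -= factor * M[k][j]
    x = [0.0] * m
    for i in reversed(range(m)):
        s = M[i][m] - sum(M[i][j] * x[j] for j in range(i + 1, m))
        x[i] = s / M[i][i]
    return x, logdet

def log_marginal_likelihood(y, F, beta0, r0, a0, b0):
    """Log marginal likelihood of y under y = F beta + error with the
    assumed Normal-Gamma prior (see the paragraph above)."""
    n, q = len(y), len(beta0)
    # Posterior precision matrix: R_n = r0 * I + F^T F
    Rn = [[(r0 if i == j else 0.0) + sum(row[i] * row[j] for row in F)
           for j in range(q)] for i in range(q)]
    # Right-hand side: r0 * beta0 + F^T y
    rhs = [r0 * beta0[i] + sum(row[i] * yi for row, yi in zip(F, y))
           for i in range(q)]
    beta_n, logdet_Rn = solve_spd(Rn, rhs)
    # y^T y + beta0^T R0 beta0 - beta_n^T R_n beta_n
    quad = (sum(yi * yi for yi in y) + r0 * sum(b * b for b in beta0)
            - sum(bn * r for bn, r in zip(beta_n, rhs)))
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * quad
    return (-0.5 * n * math.log(2 * math.pi)
            + 0.5 * (q * math.log(r0) - logdet_Rn)
            + a0 * math.log(b0) - a_n * math.log(b_n)
            + math.lgamma(a_n) - math.lgamma(a0))

# ZFX profile and sampling times from the input table, degree-2 polynomial.
times = [0, 0.5, 3, 6, 12]
F = [[float(t) ** k for k in range(3)] for t in times]
zfx = [-0.027, 0.158, 0.169, 0.193, -0.165]
log_ml = log_marginal_likelihood(zfx, F, [0.0, 0.0, 0.0], r0=1.0, a0=1.0, b0=1.0)
print(log_ml)
```

Working on the log scale avoids underflow when the per-gene values are later combined across thousands of genes.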
Scoring the Model
• The weighted average of the marginal likelihood of each gene is the scoring function for a clustering model.
• The weight for each cluster varies with the size of the cluster.
Agglomerative Clustering
• The clustering phase starts with singleton clusters.
• It computes and
• Iteratively merges time series into clusters until the scoring function no longer increases.
• While merging, it takes the average of the cluster representatives.
Heuristic Search
• Computing the scoring function for all the models is expensive, so a heuristic is adopted.
• Instead of evaluating every possible merge pair, it looks for a merge pair that increases the scoring function. Candidate pairs are tried in order of similarity, most similar first, under a measure such as Euclidean distance or dynamic time warping.
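The search loop can be sketched as follows. Several parts are assumptions of this sketch: toy_score is a stand-in (negative within-cluster squared error minus a per-cluster penalty), not the marginal-likelihood score of the slides. Cluster representatives are point-wise means, candidate pairs are tried closest-first by Euclidean distance, and the first merge that raises the score is accepted.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_profile(cluster):
    """Point-wise average of the profiles in a cluster (its representative)."""
    return [sum(vals) / len(cluster) for vals in zip(*cluster)]

def toy_score(clusters, penalty=1.0):
    """Stand-in score, NOT the marginal-likelihood score: negative
    within-cluster squared error minus a penalty per cluster."""
    sse = 0.0
    for c in clusters:
        rep = mean_profile(c)
        sse += sum(euclidean(p, rep) ** 2 for p in c)
    return -sse - penalty * len(clusters)

def agglomerate(profiles, score=toy_score):
    clusters = [[p] for p in profiles]  # start with singleton clusters
    best = score(clusters)
    while len(clusters) > 1:
        reps = [mean_profile(c) for c in clusters]
        # Candidate merges, closest representatives first (an assumption).
        pairs = sorted(combinations(range(len(clusters)), 2),
                       key=lambda ij: euclidean(reps[ij[0]], reps[ij[1]]))
        for i, j in pairs:
            rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
            candidate = rest + [clusters[i] + clusters[j]]
            s = score(candidate)
            if s > best:  # accept the first merge that improves the score
                clusters, best = candidate, s
                break
        else:
            break  # no merge improves the score, so stop
    return clusters

# Made-up profiles: two near zero and two near five.
profiles = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0], [5.0, 5.1, 5.0]]
result = agglomerate(profiles)
print(result)
```

Accepting the first improving merge means that most iterations evaluate the score only once or twice instead of once per pair, which is the point of ordering the candidates.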
Evaluation: Simulation
Simulated Dataset       Number of true patterns   Number of profiles this Algorithm found   Number of significant profiles STEM found
Noise                   0                         0                                         17
4 patterns with noise   4                         4                                         4
6 patterns with noise   6                         6                                         11
Evaluation: Real Data
• The gene expression profiles from [1] are used. Clusters are tested using a Gene Ontology enrichment test with EASE (Hosack et al., 2003).
Method           Number of Clusters   % significance
This Algorithm   11                   63%
STEM             10                   40%
Conclusion
• Short gene expression profiles are modeled using polynomials.
• A clustering model is evaluated using the marginal likelihood of the genes with respect to the polynomial model.
• An agglomerative clustering is done with a heuristic search strategy.
• Output clusters are enriched for Gene Ontology terms.
References
1. K. Guillemin, N. R. Salama, L. S. Tompkins, and S. Falkow. Cag pathogenicity island-specific responses of gastric epithelial cells to Helicobacter pylori infection. Proc. Natl. Acad. Sci. USA, 99:15136–15141, 2002.
2. M. Ramoni, P. Sebastiani, and I. S. Kohane. Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA, 99(14):9121–9126, 2002.
3. J. Ernst, G. J. Nau, and Z. Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 21 Suppl. 1:i159–i168, 2005.