Clustering Short Gene Expression Profiles

Ling Wang, Marco Ramoni, Paolo Sebastiani

Abdullah Mueen 1

The Problem: Input

Gene  Symbol            0h    0.5h      3h      6h     12h
1     ZFX           -0.027   0.158   0.169   0.193  -0.165
2     ZNF133         0.183  -0.068  -0.134  -0.252   0.177
3     USP2           -0.67  -0.709  -0.347  -0.779  -0.403
4     DSCR1L1       -0.923   -0.51  -0.718  -0.512  -0.668
5     WNT5A         -0.471  -0.264  -0.269  -0.154  -0.254
...   ...              ...     ...     ...     ...     ...
24180                0.983    1.55   2.541   2.187   0.147

Gene expression profiles for J genes from microarray experiments [1]

The Problem: Output

• A clustering of the genes that groups functionally related genes in the same cluster.

Previous Works

• Hierarchical clustering (Eisen et al., 1998)

• K-means and self-organizing maps (Tamayo et al., 1999)

• Standard measures: Euclidean distance, correlation coefficient.

• Problems
  – They ignore the sequential nature of the profiles.
  – Different pairs of time series can have the same measure.
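
The first problem can be seen in a small sketch: shuffling the time points of two profiles in the same way leaves both the Euclidean distance and the correlation coefficient unchanged, so neither measure sees the temporal order. The two profiles are the ZFX and ZNF133 rows of the input table; the permutation `order` and the helper names are illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient between two profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

zfx = [-0.027, 0.158, 0.169, 0.193, -0.165]
znf133 = [0.183, -0.068, -0.134, -0.252, 0.177]

# Apply the same arbitrary reordering of time points to both profiles.
order = [3, 0, 4, 1, 2]
zfx_shuffled = [zfx[i] for i in order]
znf133_shuffled = [znf133[i] for i in order]

print(euclidean(zfx, znf133), euclidean(zfx_shuffled, znf133_shuffled))
print(pearson(zfx, znf133), pearson(zfx_shuffled, znf133_shuffled))
```

Each printed pair agrees up to floating-point rounding, although the shuffled profiles describe a completely different temporal behaviour.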

[3]

Previous Works

• Continuous representation of the profile using
  – Autoregressive models.
  – Hidden Markov models.

• Advantages:
  – They account for the temporal information.
  – Good for long profiles (10 points or more).
  – They combine easily with Bayesian clustering.

[3]

Autoregressive Model: Definition

• Each time point is correlated with the p previous time points.

• Combining the models of all the time points for a gene gives a single linear regression for that gene.

• Xj is the regression matrix of size (n-p)x(p+1) and βj is the vector of regression coefficients.
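
As a minimal sketch of the dimensions stated above, the function below builds an (n-p)x(p+1) regression matrix for one profile. It assumes the usual AR(p) form, in which each value is regressed on an intercept and its p predecessors. The function name `ar_regression_matrix` and the column order (intercept first, then lags 1 to p) are assumptions, not taken from the slides; the profile is the WNT5A row of the input table.

```python
def ar_regression_matrix(profile, p):
    """Return (X, y) for an AR(p) regression on a single profile.

    X has n - p rows and p + 1 columns: a leading 1.0 for the intercept,
    followed by the p previous values (most recent first).
    y holds the values being predicted, i.e. profile[p:].
    """
    n = len(profile)
    if not 0 < p < n:
        raise ValueError("need 0 < p < len(profile)")
    X = [[1.0] + [profile[t - i] for i in range(1, p + 1)]
         for t in range(p, n)]
    y = list(profile[p:])
    return X, y

wnt5a = [-0.471, -0.264, -0.269, -0.154, -0.254]
X, y = ar_regression_matrix(wnt5a, p=2)
for row, target in zip(X, y):
    print(row, "->", target)
```

With five time points and p = 2, X has only three rows for three coefficients, which illustrates why the regression order cannot be large for short profiles.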

[2]

Autoregressive Model: Problems

• Problems
  – The AR model is for stationary time series. Intervals between time points are ignored.
  – For short gene expression profiles (5 time points) the regression order cannot be large.
  – For a large number of genes with short expression profiles, there may be random patterns. The AR model overfits these random patterns.

The Algorithm

The algorithm has three components:

1. A model describing the dynamics of gene expression temporal profiles.

2. A probabilistic metric to score different clustering models based on the posterior probability of each clustering model.

3. A heuristic to make the search for the best clustering model feasible.

Polynomial Model: Definition

• Each time point is approximated by a polynomial of degree p.

• The combined model for a gene is a linear regression of its n observations on the powers of time from 0 to p.
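
A minimal sketch of such a fit, assuming the polynomial is in the measurement time t and is fitted by ordinary least squares rather than by the Bayesian treatment the deck describes. The design matrix name `F`, the use of NumPy and the choice p = 2 are illustrative assumptions; the time points and the profile are the ZFX row of the input table.

```python
import numpy as np

def polynomial_design_matrix(times, p):
    """n x (p + 1) matrix whose columns are t**0, t**1, ..., t**p."""
    return np.vander(np.asarray(times, dtype=float), N=p + 1, increasing=True)

times = [0.0, 0.5, 3.0, 6.0, 12.0]   # hours, unevenly spaced
zfx = np.array([-0.027, 0.158, 0.169, 0.193, -0.165])

F = polynomial_design_matrix(times, p=2)
beta, *_ = np.linalg.lstsq(F, zfx, rcond=None)   # least-squares coefficients
fitted = F @ beta                                 # fitted profile at the 5 time points

print(F.shape)
print(beta)
```

Unlike the AR model, the actual spacing of the time points (0h, 0.5h, 3h, 6h, 12h) enters the model through the columns of F.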

Polynomial Model: Assumptions

• The uncorrelated errors are normally distributed with mean 0 and variance 1/τj, where the precision τj is itself given a prior distribution.

• The coefficients are normally distributed.

• β0, α1 and α2 are hyper-parameters of the prior distributions of the parameters.

Hyper-parameters

• Around 25-50% of the total number of genes/probes in the microarrays are disregarded because of their low confidence level.

• To avoid overfitting random patterns, the hyper-parameters are estimated from random data.

• If σ_a² is the sample variance of the disregarded genes, then the hyper-parameters are related through a constraint that depends on σ_a².

Scoring Method

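• The scoring function is calculated from the marginal likelihood of each gene, which is the likelihood of the gene's profile averaged over the prior distribution of the model parameters.

• For the current model, the marginal likelihood of a gene is obtained by integrating out βj and τj.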

Marginal Likelihood

• Given the polynomial model, the assumed prior distributions and the hyper-parameters, the marginal likelihood function can be computed.
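
One way to sketch this computation is with the standard conjugate result for Bayesian linear regression. The code assumes a normal prior on the coefficients with mean β0 and precision τj·R0, and a gamma prior on τj with shape `a` and rate `b`. The matrix `R0`, the gamma parameterization, the hyper-parameter values in the usage lines and all function and argument names are assumptions made for illustration, since the exact prior form is not given here.

```python
import numpy as np
from math import lgamma, log, pi

def log_marginal_likelihood(F, x, beta0, R0, a, b):
    """Log marginal likelihood of profile x under a normal-gamma prior.

    Assumed model: x = F @ beta + eps, eps ~ N(0, I / tau),
    beta | tau ~ N(beta0, inv(tau * R0)), tau ~ Gamma(shape=a, rate=b).
    """
    n = len(x)
    Rn = R0 + F.T @ F                                   # posterior precision, up to tau
    beta_n = np.linalg.solve(Rn, R0 @ beta0 + F.T @ x)  # posterior mean of beta
    a_n = a + n / 2.0
    b_n = b + 0.5 * (x @ x + beta0 @ R0 @ beta0 - beta_n @ Rn @ beta_n)
    _, logdet_R0 = np.linalg.slogdet(R0)
    _, logdet_Rn = np.linalg.slogdet(Rn)
    return (-0.5 * n * log(2 * pi)
            + 0.5 * (logdet_R0 - logdet_Rn)
            + a * log(b) - a_n * log(b_n)
            + lgamma(a_n) - lgamma(a))

# Usage on the ZFX profile with a degree-2 polynomial and made-up hyper-parameters.
times = np.array([0.0, 0.5, 3.0, 6.0, 12.0])
F = np.vander(times, N=3, increasing=True)
x = np.array([-0.027, 0.158, 0.169, 0.193, -0.165])
score = log_marginal_likelihood(F, x, beta0=np.zeros(3), R0=np.eye(3), a=2.0, b=1.0)
print(score)
```

Working on the log scale avoids numerical underflow when the likelihoods of many genes are later combined into a score for a clustering model.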

Scoring the Model

• The scoring function for a clustering model is the weighted average of the marginal likelihoods of the genes.

• The weight for each cluster varies with the size of the cluster.

Agglomerative Clustering

• The clustering phase starts with singleton clusters.

• It computes the scoring function and iteratively merges time series into clusters until the scoring function no longer increases.

• When merging, it takes the average of the cluster representatives.

Heuristic Search

• Computing the scoring function for every possible clustering model is expensive, so a heuristic is adopted.

• Instead of scoring all the possible merge pairs, the algorithm looks for a merge pair that increases the scoring function. Candidate pairs are searched in descending order of similarity, as measured by Euclidean distance, dynamic time warping, etc.
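
The search strategy can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `score` is any caller-supplied function of a clustering (the `toy_score` used here is a hypothetical stand-in, not the marginal-likelihood score), candidate pairs are tried closest first by Euclidean distance between cluster means, and the first merge that raises the score is accepted.

```python
import math
from itertools import combinations

def mean_profile(cluster):
    """Average the profiles in a cluster, time point by time point."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def heuristic_agglomerate(profiles, score):
    """Greedy merging: accept the first candidate merge that raises the score."""
    clusters = [[p] for p in profiles]          # start with singleton clusters
    best = score(clusters)
    while len(clusters) > 1:
        reps = [mean_profile(c) for c in clusters]
        # Candidate pairs, closest representatives first (assumed ordering).
        pairs = sorted(combinations(range(len(clusters)), 2),
                       key=lambda ij: euclidean(reps[ij[0]], reps[ij[1]]))
        for i, j in pairs:
            merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
            merged.append(clusters[i] + clusters[j])
            new = score(merged)
            if new > best:                      # first improving merge wins
                clusters, best = merged, new
                break
        else:
            break                               # no merge improves the score
    return clusters

def toy_score(clusters, penalty=0.5):
    """Hypothetical stand-in: tight clusters score well, extra clusters cost."""
    sse = sum(euclidean(p, mean_profile(c)) ** 2 for c in clusters for p in c)
    return -sse - penalty * len(clusters)

profiles = [[0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [2.0, 1.0, 0.0], [2.1, 0.9, 0.1]]
clusters = heuristic_agglomerate(profiles, toy_score)
print(clusters)
```

On this toy input the two rising profiles end up in one cluster and the two falling profiles in another, after which the only remaining merge lowers the score and the search stops.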

Evaluation: Simulation

Simulated dataset       Number of true patterns   Profiles found by this algorithm   Significant profiles found by STEM
Noise                   0                         0                                  17
4 patterns with noise   4                         4                                  4
6 patterns with noise   6                         6                                  11

Evaluation: Real Data

• The gene expression profiles from [1] are used. Clusters are tested using a Gene Ontology enrichment test with EASE (Hosack et al., 2003).

Method           Number of clusters   % significance
This algorithm   11                   63%
STEM             10                   40%

Conclusion

• Short gene expression profiles are modeled using polynomials.

• A clustering model is evaluated using the marginal likelihood of the genes with respect to the polynomial model.

• An agglomerative clustering is done with a heuristic search strategy.

• The output clusters are enriched for Gene Ontology terms.

References

1. K. Guillemin, N. R. Salama, L. S. Tompkins, and S. Falkow. Cag pathogenicity island-specific responses of gastric epithelial cells to Helicobacter pylori infection. Proc. Natl. Acad. Sci. USA, 99:15136-15141, 2002.

2. M. Ramoni, P. Sebastiani, and I. S. Kohane. Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA, 99(14):9121-9126, 2002.

3. J. Ernst, G. J. Nau, and Z. Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 21 Suppl. 1:i159-i168, 2005.
