Mining Phenotype Structures

University at Buffalo The State University of New York


Chun Tang and Aidong Zhang

Bioinformatics Journal, 20(6):829-838, 2004


Microarray Data Analysis

Analysis from two angles:

sample as object, gene as attribute

gene as object, sample/condition as attribute


Supervised Analysis

• Select training samples (hold out…)
• Sort genes (t-test, ranking…)
• Select informative genes (top 50 ~ 200)
• Cluster based on informative genes

[Figure: gene expression matrix with genes g1 … g4132 as rows and samples grouped into Class 1 and Class 2. Rows show patterns such as 1 1 … 1 0 0 … 0 or 0 0 … 0 1 1 … 1 across the two classes.]
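
The supervised steps listed on this slide (sort genes by a test statistic, then keep the top-ranked ones) can be sketched as follows. This is only an illustration under stated assumptions: the slide says "t-test" without naming a variant, so Welch's two-sample t statistic is assumed, and `t_statistic`, `rank_genes`, and the toy matrix are hypothetical names and made-up data.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Welch two-sample t statistic (an assumption; the slide only says "t-test")."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0.0:
        # Both groups are constant: infinite evidence only if the means differ.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def rank_genes(matrix, class1, class2, top):
    """Return the row indices of the `top` genes with the largest |t|."""
    scored = []
    for i, row in enumerate(matrix):
        a = [row[j] for j in class1]
        b = [row[j] for j in class2]
        scored.append((abs(t_statistic(a, b)), i))
    # Largest |t| first; ties are broken by gene index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in scored[:top]]


# Toy matrix: rows are genes, columns are samples (made-up values).
toy = [
    [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # separates the two classes
    [3.0, 1.0, 2.0, 2.5, 1.5, 2.0],  # identical class means
    [0.0, 0.1, 0.2, 0.3, 0.1, 0.0],  # weak difference
]
top_genes = rank_genes(toy, class1=[0, 1, 2], class2=[3, 4, 5], top=2)
```

In practice `top` would be in the range the slide mentions (roughly 50 to 200), and the class labels would come from held-out training samples.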


Unsupervised Analysis

We will focus on unsupervised sample partitioning, which assumes that no phenotype information has been assigned to any sample.

Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns is a significant contribution to microarray data analysis.

Many mature statistical methods cannot be applied unless the phenotypes of the samples are known in advance.


[Figure: gene expression matrix with samples as columns and gene1 to gene7 as rows; the rows are reordered so that the informative genes are separated from the non-informative genes.]

An informative gene is a gene which manifests samples' phenotype distinction.

Phenotype structure: sample partition + informative genes.

Automatic Phenotype Structure Mining

[Figure: mining a gene expression matrix (gene1 to gene3 over samples 1 to 7) yields the result: the phenotype distinction (groups 1, 2, 3) and the informative genes.]

Given an n × m data matrix M (n genes, m samples) and the number of sample phenotypes K.

The goal is to find K mutually exclusive groups of the samples matching their empirical phenotypes, and to find the set of informative genes which manifests this phenotype distinction.


Requirements

The expression levels of each informative gene should be similar over the samples within each phenotype

The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes


Challenges (1)

The volume of genes is very large while the number of samples is very limited, so no distinct class structure of the samples can be properly detected by the existing techniques.


Challenges (2)

The limited informative genes are buried in a large amount of noise.

[Figure: expression profiles of gene1 to gene15, with gene5, gene9, and gene12 singled out.]


Challenges (3)

[Figure: expression profiles of four genes: PROTEASOME IOTA (X59417), C-myb (U22376), Fumarylacetoacetate (M55150), and LTC4 synthase (U50136).]

The values within data matrices are all real numbers.

None of the informative genes follows an ideal “high-low” pattern.


Related Work

New tools using traditional methods:

Tools: TreeView, CLUTO, CIT, CNIO, GeneSpring, J-Express, CLUSFAVOR

Methods:
• SOM
• K-means
• Hierarchical clustering
• Graph based clustering
• PCA

The similarity measures used in these methods are based on the full gene space.

PCs (principal components) do not necessarily have strong correlation with informative genes.


Related Work (Cont’d)

Clustering with feature selection: (CLIFF, two-way ordering, SamCluster)

1. Filtering the invariant genes
• Rank variance
• PCA
• CV

2. Partition the samples
• Ncut, Min-Max Cut
• Hierarchical Clustering

3. Pruning genes based on the partition
• Markov blanket filter
• T-test


Related Work (Cont’d)

Subspace clustering:
• Bi-clustering
• δ-clustering

Subspace clustering only measures trend similarity. In our model, we require each gene to show consistent signals on the samples of the same phenotype.

Subspace clustering algorithms only detect locally correlated features and objects, without considering the dissimilarity between different clusters. We want the genes that can differentiate all phenotypes.


Our Contributions

We transform the phenotype structure mining problem into an optimization problem.

A series of statistics-based metrics are defined as objective functions.

A heuristic searching method and a mutual reinforcing adjustment approach are proposed to find phenotype structures.


Model - Measurements

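[Figure: gene1 to gene3 over the samples, illustrating intra-consistency (within a phenotype) and inter-divergency (between phenotypes), which together determine the phenotype quality.]

[Figure: intra-consistency of a gene subset G’ over the sample groups S1 and S2.]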


Intra-consistency

Measurement   Data(A) (NOT consistent)   Data(B) (consistent)
residue       0.1975                     0.4506
MSR           0.0494                     0.4012
Ours          339.0667                   5.3000


Intra-pattern-consistency (Cont’d)

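Variance of a single gene on the samples within one phenotype:

\[ Var(i, S') = \frac{1}{|S'| - 1} \sum_{j \in S'} \left( w_{i,j} - \bar{w}_{i,S'} \right)^2 \]

where \bar{w}_{i,S'} is the mean of gene i over the samples in S'.

Intra-pattern-consistency: average row variance

\[ Con(G', S') = \frac{1}{|G'|} \sum_{g_i \in G'} Var(i, S') = \frac{1}{|G'| \cdot (|S'| - 1)} \sum_{g_i \in G'} \sum_{s_j \in S'} \left( w_{i,j} - \bar{w}_{i,S'} \right)^2 \]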

In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?

Con is the average variance over the subset of genes: the smaller the value, the more consistent the genes are within the phenotype.
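
As an illustration of the intra-pattern-consistency measure described above (average, over a gene subset, of each gene's sample variance within one sample group), here is a minimal sketch. The function name `intra_consistency`, the list-of-rows matrix layout, and the toy values are assumptions made for the example.

```python
from statistics import variance


def intra_consistency(matrix, genes, samples):
    """Con(G', S'): mean over genes in G' of the sample variance over S'.

    `matrix[i][j]` is the expression of gene i in sample j.
    `statistics.variance` uses the (n - 1) denominator, matching Var(i, S').
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a variance")
    row_vars = [variance([matrix[i][j] for j in samples]) for i in genes]
    return sum(row_vars) / len(row_vars)


# Toy matrix (made-up values): rows are genes, columns are samples.
toy = [
    [1.0, 1.0, 1.0, 5.0],  # perfectly flat on samples 0..2
    [2.0, 4.0, 6.0, 0.0],  # variance 4 on samples 0..2
]
con_flat = intra_consistency(toy, genes=[0], samples=[0, 1, 2])
con_both = intra_consistency(toy, genes=[0, 1], samples=[0, 1, 2])
```

A lower value means the selected genes behave more uniformly inside the sample group, which is what an informative gene is expected to do within one phenotype.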


Inter-pattern-divergence

Both “inter-pattern-consistency” and “intra-pattern-divergence” on the same gene are reflected.

Average block distance:

\[ Div(G', S_1, S_2) = \frac{\sum_{g_i \in G'} \left| \bar{w}_{i,S_1} - \bar{w}_{i,S_2} \right|}{|G'|} \]

How well can a subset of genes (candidate informative genes) discriminate two phenotypes of samples?

Div averages, over the gene subset, the absolute difference between the two phenotypes' mean expression: the larger the inter-phenotype divergence, the better.
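
The average block distance can be sketched in a few lines. This is a hedged illustration: `block_mean`, `inter_divergence`, and the toy matrix are hypothetical names and data, and the matrix is assumed to be a list of gene rows.

```python
def block_mean(matrix, gene, samples):
    """Mean expression of one gene over a group of samples."""
    return sum(matrix[gene][j] for j in samples) / len(samples)


def inter_divergence(matrix, genes, s1, s2):
    """Div(G', S1, S2): mean absolute difference of the two block means."""
    total = sum(
        abs(block_mean(matrix, i, s1) - block_mean(matrix, i, s2))
        for i in genes
    )
    return total / len(genes)


# Toy matrix (made-up values): rows are genes, columns are samples.
toy = [
    [1.0, 1.0, 5.0, 5.0],  # block means 1 and 5, difference 4
    [2.0, 2.0, 2.0, 2.0],  # block means equal, difference 0
]
div = inter_divergence(toy, genes=[0, 1], s1=[0, 1], s2=[2, 3])
```

Here the first gene separates the two groups and the second does not, so the average over both genes is half of the first gene's difference.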


Pattern Quality

The purpose of pattern discovery is to identify the empirical patterns where the intra-pattern-consistency inside each phenotype is high and the inter-pattern-divergence between each pair of phenotypes is large.

\[ \Omega = \frac{1}{\displaystyle \sum_{S_i, S_j \; (1 \le i, j \le K; \; i \ne j)} \frac{\sqrt{Con(G', S_i) + Con(G', S_j)}}{Div(G', S_i, S_j)}} \]

The higher the value, the better the quality.


Measurements

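Intra-consistency:

\[ Con(G', S') = \frac{1}{|G'| \cdot (|S'| - 1)} \sum_{g_i \in G'} \sum_{s_j \in S'} \left( w_{i,j} - \bar{w}_{i,S'} \right)^2 \]

Inter-divergence:

\[ Div(G', S_1, S_2) = \frac{\sum_{g_i \in G'} \left| \bar{w}_{i,S_1} - \bar{w}_{i,S_2} \right|}{|G'|} \]

Phenotype Quality:

\[ \Omega = \frac{1}{\displaystyle \sum_{S_i, S_j \; (1 \le i, j \le K; \; i \ne j)} \frac{\sqrt{Con(G', S_i) + Con(G', S_j)}}{Div(G', S_i, S_j)}} \]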


Phenotype Quality

       Data(A)    Data(B)    Data(C)
Con    4.25       3.44       4.52
Div    41.60      25.20      46.16
Ω      14.2687    9.6074     15.3526

Highest phenotype quality: Data(C)
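
The Ω row of the table can be checked numerically. The sketch below is one reading of the quality measure, not a confirmed implementation: it assumes K = 2 phenotypes, that the reported Con value applies to both phenotypes, that each unordered pair of phenotypes is counted once, and that Ω is the reciprocal of the summed ratios sqrt(Con_i + Con_j) / Div_ij. The name `phenotype_quality` is hypothetical. Under these assumptions the computed values agree with the table to four decimals.

```python
import math
from itertools import combinations


def phenotype_quality(con, div):
    """Quality of a K-partition.

    con[k]      : Con(G', S_k) for phenotype k
    div[(a, b)] : Div(G', S_a, S_b) for a < b
    Assumes each unordered pair of phenotypes contributes one term.
    """
    penalty = sum(
        math.sqrt(con[a] + con[b]) / div[(a, b)]
        for a, b in combinations(range(len(con)), 2)
    )
    return 1.0 / penalty


# (Con, Div) pairs copied from the table on this slide.
table = {
    "Data(A)": (4.25, 41.60),
    "Data(B)": (3.44, 25.20),
    "Data(C)": (4.52, 46.16),
}
quality = {
    name: phenotype_quality([con, con], {(0, 1): div})
    for name, (con, div) in table.items()
}
best = max(quality, key=quality.get)
```

Note that Data(C) wins even though its Con is the worst of the three: its larger divergence more than compensates, which is why the two measures are combined rather than optimized separately.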


Model - Formalized Problem

Input

m samples and n genes

the corresponding gene expression matrix M

the number of phenotypes K

Output

A K-partition of samples (phenotypes) and a subset of genes (informative space) such that the phenotype quality is maximized.


Strategy

Maintain a candidate phenotype structure and iteratively adjust the candidate structure toward the optimal solution.

Basic elements:

A candidate structure:
• A partition of samples {S1, S2, …, SK}
• A subset of genes G′ ⊆ G
• The corresponding phenotype quality Ω

An adjustment:
• For a gene gi ∉ G′, insert it into G′
• For a gene gi ∈ G′, remove it from G′
• For a sample si in a group S′, move it to another group

The quality gain ΔΩ measures the change in phenotype quality before and after the adjustment.


Heuristic Searching

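[Flowchart: candidate structure generation → iterative adjusting: pick up an object (gene/sample) → evaluate the adjustment → if ΔΩ > 0 (Y), conduct the adjustment; otherwise (N), conduct it with probability p = exp(ΔΩ / (Ω × T(i))) → intermediate candidate structure.]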


Heuristic Searching

Starts with a random K-partition of samples and a subset of genes as the candidate of the informative space.

Iteratively adjust the partition and the gene set toward a better solution (random order of genes and samples):
• for each gene, try possible insert/remove
• for each sample, try the best movement

[Figure: the three adjustment types: insert a gene, remove a gene, move a sample.]


Heuristic Search

For each possible adjustment, compute ΔΩ:
• For each gene, try possible insert/remove
• For each sample, try the best movement

ΔΩ > 0: conduct the adjustment
ΔΩ < 0: conduct the adjustment with probability

\[ p = \exp\left( \frac{\Delta\Omega}{\Omega \times T(i)} \right) \]

T(i) is a decreasing simulated annealing function and i is the iteration number. T(0) = 1, T(i) = 1/(i+1) in our implementation.
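
The acceptance rule on this slide can be sketched as follows. This is a minimal illustration, assuming the acceptance probability has the form p = exp(ΔΩ / (Ω × T(i))) with T(i) = 1/(i+1) and a positive current quality Ω; the function names are hypothetical, and the case ΔΩ = 0 (which the slide does not specify) falls through to the probabilistic branch with p = 1.

```python
import math
import random


def temperature(i):
    """Annealing schedule from the slide: T(0) = 1, T(i) = 1 / (i + 1)."""
    return 1.0 / (i + 1)


def acceptance_probability(delta, omega, i):
    """Probability of conducting an adjustment whose quality gain is not positive."""
    return math.exp(delta / (omega * temperature(i)))


def accept(delta, omega, i, rng=random):
    """Always accept improvements; accept other moves with probability p."""
    if delta > 0:
        return True
    return rng.random() < acceptance_probability(delta, omega, i)


# The same negative gain becomes less likely to be accepted in later iterations.
p_early = acceptance_probability(-1.0, 10.0, 0)
p_late = acceptance_probability(-1.0, 10.0, 9)
```

Dividing by the current quality Ω makes the rule depend on the relative loss rather than the absolute one, and the shrinking temperature turns the search greedy as the iterations proceed.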


Mutual Reinforcing Adjustment - Motivation

Drawbacks of the heuristic searching method: blind initialization, equal chance of samples and genes, noisy samples.

The phenotype quality of a subset of the informative genes over a partial phenotype partition should also be high.

Mining phenotypes and informative genes directly from high-dimensional noisy data is difficult, so we start from small groups whose data distribution and patterns are much easier to detect.

The mining of phenotypes and the mining of informative genes should mutually reinforce each other.




Mutual Reinforcing Adjustment - Major Steps

Partition the Matrix: divide the original matrix into a series of exclusive sub-matrices by partitioning both the samples and the genes.

Reference Partition Detection: post a partial or approximate phenotype structure, called a reference partition of samples.
• compute the reference degree for each sample group;
• select K groups of samples;
• do partition adjustment.

Gene Adjustment: adjust the candidate informative genes.
• compute Ω for the reference partition on G′
• perform possible adjustments of each gene

Refinement Phase


Method Detail - Iteration Phase

[Figure: one iteration. Partitioning the matrix (all samples × informative genes G’) → reference partition detection (reference partition × informative genes G’) → gene adjustment (reference partition × informative genes G’’) → all samples × informative genes G’’ → to next iteration.]


Partitioning the Matrix

• Partition the samples and genes into multiple groups
• Use CAST
• A threshold t decides the size of each group
• Based on the Pearson’s correlation coefficient
• Outliers will be filtered out from any group
• Samples or genes in the same group share similar patterns

\[ \rho_{X,Y} = \frac{\sum_{i=1}^{k} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{k} (y_i - \bar{y})^2}} \]
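
Pearson's correlation coefficient, the similarity on which the grouping is based, can be computed directly from its definition. The sketch below is standalone and does not implement CAST itself; `pearson` and the example vectors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    if denominator == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return numerator / denominator


# Two profiles with the same trend, and one with the opposite trend.
r_same = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the coefficient ignores scale and offset, two genes (or samples) with the same trend score close to 1 even when their absolute expression levels differ, which is why a threshold t on this value controls how tight each group is.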


Reference Partition Detection

Select the groups of samples as potential phenotypes

Pick the first group with the highest reference degree

\[ ref(S_j) = \log|S_j| \times \sum_{G_i \subseteq G'} \frac{1}{Con(G_i, S_j)} \]

Select the other groups by considering the inter-phenotype divergence w.r.t. selected groups

\[ Ran(S_{x_p}) = \log|S_{x_p}| \times \sum_{G_i \subseteq G'} \frac{\sum_{t=0}^{p-1} Div(G_i, S_{x_p}, S_{x_t})}{Con(G_i, S_{x_p})} \]


Check the Missing Samples

Probabilistically insert the remaining samples, which are not in the selected groups, into their most probable matching group

In iterations, use the gene candidate sets to improve the reference partition


Gene Adjustment

Gene adjustment: Test the possible adjustments that lead to improvement

Insert a gene Remove a gene


Method - Refinement Phase

The partition corresponding to the best state may not cover all the samples.

Add every sample not covered by the reference partition into its matching group; the resulting groups are taken as the phenotypes of the samples.

Then, a gene adjustment phase is conducted. We execute all adjustments with a positive quality gain to obtain the informative space.

Time complexity: O(n · m² · I)


Mining Multiple Phenotype Structures

[Figure: expression matrix with genes (gene1, gene6, gene7, gene2, gene4, gene3, gene8, gene9) as rows and samples (ordered 1, 4, 8, 2, 3, 5, 6, 7, 9, 10) as columns, showing an empirical phenotype structure and a hidden phenotype structure.]


Extended Algorithm Strategy

Maintain p candidate phenotype structures and iteratively adjust them toward the optimal solution.

Basic elements of each candidate structure:

A candidate structure
• A Kt-partition of samples
• A subset of genes Gt ⊆ G
• The corresponding phenotype quality Ωt

An adjustment
• For a gene gi ∉ Gt, insert it into Gt
• For a gene gi ∈ Gt, move it to Gt′ (t ≠ t′) or remove it from all structures
• For a sample si in a group S′, move it to another group

The quality gain measures the change in pattern quality before and after the adjustment.


The Extended Algorithm (Cont’d)

[Figure: possible adjustments across candidate structure 1 and candidate structure 2. Gene: insert, move, remove. Sample: move.]


Mining Multiple Phenotype Structures (Cont’d)

Partially informative genes


Formalized Problem

Input

• m samples and n genes

• the corresponding gene expression matrix M

• the number of phenotype structures p

• the set of numbers {K1, K2, …, Kp}

Output

p phenotype structures where the tth structure is a Kt-partition of samples (phenotypes) and a subset of genes (informative space) which manifest the sample partition. The overall phenotype quality is maximized.


The Algorithm

Candidate Structure Generation
• cluster genes into p’ groups (p’ > p) using CAST
• generate sample partitions one by one on clusters of genes, select best quality genes.

Iterative Adjustment
• for each gene, try possible insert/move/remove
• for each sample,
  - examine all possible adjustments
  - select the best movement.


The Algorithm (Cont’d)

[Figure: adjustments per object. Gene (p possible adjustments): insert, move, remove. Sample (Kt - 1 possible adjustments for each partition): move.]


The Algorithm (Cont’d)

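Data Standardization

the original gene intensity values → relative values

\[ w'_{i,j} = \frac{w_{i,j} - \bar{w}_i}{\sigma_i}, \quad \text{where } \bar{w}_i = \frac{\sum_{j=1}^{m} w_{i,j}}{m}; \quad \sigma_i = \sqrt{\frac{\sum_{j=1}^{m} (w_{i,j} - \bar{w}_i)^2}{m - 1}} \]

Random order of genes and samples

Conduct negative action with a probability

\[ p = \exp\left( \frac{\Delta\Omega}{\Omega \times T(i)} \right) \]

Simulated annealing technique

\[ T(0) = 1; \quad T(i) = \frac{1}{1 + i} \]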
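
The data standardization step (turning each gene's raw intensities into relative values) can be sketched as a per-row z-score. This assumes the standardization subtracts the gene's mean and divides by its sample standard deviation with an (m - 1) denominator; `standardize_row` and the example row are hypothetical.

```python
import math


def standardize_row(row):
    """Convert one gene's intensities into zero-mean, unit-deviation values."""
    m = len(row)
    if m < 2:
        raise ValueError("need at least two samples")
    mean = sum(row) / m
    sd = math.sqrt(sum((w - mean) ** 2 for w in row) / (m - 1))
    if sd == 0.0:
        raise ValueError("a constant gene cannot be standardized")
    return [(w - mean) / sd for w in row]


# Mean 4, standard deviation 2, so the row maps to -1, 0, 1.
relative = standardize_row([2.0, 4.0, 6.0])
```

Standardizing each gene separately puts genes with very different intensity ranges on a common scale, so that variance-based measures are comparable across genes.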


Experiments

Data Sets:

Multiple-sclerosis data
• MS-IFN: 4132 * 28 (14 MS vs. 14 IFN)
• MS-CON: 4132 * 30 (15 MS vs. 15 Control)

Leukemia data
• 7129 * 38 (27 ALL vs. 11 AML)
• 7129 * 34 (20 ALL vs. 14 AML)

Colon cancer data
• 2000 * 62 (22 normal vs. 40 tumor colon tissue)

Hereditary breast cancer data
• 3226 * 22 (7 BRCA1, 8 BRCA2, 7 Sporadics)


Rand Index

Rand Index: a measurement of “agreement” between the ground truth (P) and the results (Q):

“a”: the number of pairs of objects that are in the same class in P and in the same class in Q;

“b”: the number of pairs of objects that are in the same class in P but not in the same class in Q;

“c”: the number of pairs of objects that are in the same class in Q but not in the same class in P;

“d”: the number of pairs of objects that are in different classes in P and in different classes in Q.

\[ RI = \frac{a + d}{a + b + c + d} \]

[Figure: placement of a sample pair (s1, s2) in P and Q for each of the four cases.]
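
The Rand index follows directly from the four pair counts defined above. A minimal sketch, assuming the two partitions are given as label lists of equal length (the function name `rand_index` and the example labels are hypothetical):

```python
from itertools import combinations


def rand_index(p, q):
    """Rand index between ground-truth labels p and result labels q."""
    if len(p) != len(q) or len(p) < 2:
        raise ValueError("label lists must have the same length (at least 2)")
    a = b = c = d = 0
    for i, j in combinations(range(len(p)), 2):
        same_p = p[i] == p[j]
        same_q = q[i] == q[j]
        if same_p and same_q:
            a += 1  # together in both partitions
        elif same_p:
            b += 1  # together in P only
        elif same_q:
            c += 1  # together in Q only
        else:
            d += 1  # separated in both partitions
    return (a + d) / (a + b + c + d)


truth = [0, 0, 1, 1]
ri_identical = rand_index(truth, [0, 0, 1, 1])
ri_relabelled = rand_index(truth, [1, 1, 0, 0])
ri_crossed = rand_index(truth, [0, 1, 0, 1])
```

Only pair co-membership matters, so swapping the cluster names leaves the index unchanged, while a partition that cuts across the true classes scores much lower.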


Phenotype Structure Detection

Data Set     MS-IFN   MS-CON   Leukemia-G1  Leukemia-G2  Colon    Breast
Data Size    4132*28  4132*30  7129*38      7129*34      2000*62  3226*22
J-Express    0.4815   0.4851   0.5092       0.4965       0.4939   0.4112
CLUTO        0.4815   0.4828   0.5775       0.4866       0.4966   0.6364
CIT          0.4841   0.4851   0.6586       0.4920       0.4966   0.5844
CNIO         0.4815   0.4920   0.6017       0.4920       0.4939   0.4112
CLUSFAVOR    0.5238   0.5402   0.5092       0.4920       0.4939   0.5844
δ-cluster    0.4894   0.4851   0.5007       0.4538       0.4796   0.4719
Heuristic    0.8052   0.6230   0.9761       0.7086       0.6293   0.8638
Mutual       0.8387   0.6513   0.9778       0.7558       0.6827   0.8749


Experiments

             Number of iterations          Running time (seconds)
Data Size    mean    standard deviation    mean    standard deviation
4132*28      158     27.2                  180     35.1
4132*30      168     29.5                  195     37.8
7129*38      171     16.1                  436     51.9
7129*34      198     35.9                  458     101.2
2000*62      133     17.8                  479     98.5
3226*22      157     22.2                  167     35.6

The mean value and standard deviation of the number of iterations and of the response time (in seconds) with respect to the matrix size.


Experimental Results (5): Phenotype Structure Detection (Cont’d)

The mutual reinforcing approach as applied to the MS-IFN group.

(A) shows the distribution of the original 28 samples. Each point represents a sample with 4132 genes mapped to two-dimensional space.

(B) shows the distribution in the middle of the adjustment.

(C) shows the distribution of the same 28 samples after the iterations. 76 genes were selected as the informative space.


Experimental Results (5): Informative Gene Selection


Phenotype Structures


Experimental Results (5): Informative Gene Selection (Cont’d)


Experimental Results (5): Scalability Evaluation


Conclusion from the Experiments

The work is motivated by the needs of emerging microarray data analysis.

The strategy is designed for data which have the following properties:
• The number of samples is limited but the gene dimension is very large.
• Large volumes of irrelevant and redundant genes prevent accurate grouping of samples.
• Analyzing the objects of one dimension can enhance the detection of meaningful patterns in the other dimension.
