University at Buffalo The State University of New York Mining Phenotype Structures Chun Tang and Aidong Zhang Bioinformatics Journal, 20(6):829-838, 2004


Page 1: Mining Phenotype Structures

University at Buffalo The State University of New York

Mining Phenotype Structures

Chun Tang and Aidong Zhang

Bioinformatics Journal, 20(6):829-838, 2004

Page 2: Mining Phenotype Structures

Microarray Data Analysis

Analysis from two angles:

sample as object, gene as attribute

gene as object, sample/condition as attribute

Page 3: Mining Phenotype Structures

Supervised Analysis

• Select training samples (hold out…)
• Sort genes (t-test, ranking…)
• Select informative genes (top 50 ~ 200)
• Cluster based on informative genes

[Figure: binary expression matrix with genes g1 … g4132 as rows and the samples grouped into Class 1 and Class 2.]
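The gene-sorting step above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a Welch two-sample t-statistic as the ranking score (the slide only says "t-test, ranking…"), and the names `t_statistic`, `rank_genes` and the toy matrix are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(class1, class2):
    """Welch two-sample t-statistic for one gene (assumed scoring choice)."""
    n1, n2 = len(class1), len(class2)
    se = math.sqrt(variance(class1) / n1 + variance(class2) / n2)
    return (mean(class1) - mean(class2)) / se

def rank_genes(matrix, labels, top=50):
    """Return row indices sorted by |t|, largest first, truncated to `top`."""
    scores = []
    for g, row in enumerate(matrix):
        c1 = [v for v, lab in zip(row, labels) if lab == 1]
        c2 = [v for v, lab in zip(row, labels) if lab == 2]
        scores.append((abs(t_statistic(c1, c2)), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:top]]

# Toy data (hypothetical): 3 genes x 6 samples, first three samples in class 1.
toy = [
    [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],  # flat across classes
    [9.0, 9.1, 8.9, 1.0, 1.2, 0.9],  # high in class 1, low in class 2
    [2.0, 2.5, 1.5, 3.0, 3.5, 2.5],  # weak difference
]
labels = [1, 1, 1, 2, 2, 2]
ranking = rank_genes(toy, labels, top=2)
```

A gene with zero variance in both classes would make the denominator zero; a real pipeline would filter such genes first.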

Page 4: Mining Phenotype Structures

Unsupervised Analysis

We will focus on unsupervised sample partition, which assumes that no phenotype information is assigned to any sample.

Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns would be a significant contribution to microarray data analysis.

Many mature statistical methods cannot be applied without the phenotypes of samples being known in advance.

Page 5: Mining Phenotype Structures

Unsupervised Analysis

[Figure: expression matrix of gene1 to gene7 over samples 1 to 10, with the rows labeled as informative genes and non-informative genes.]

An informative gene is a gene which manifests samples' phenotype distinction.

Phenotype structure: sample partition + informative genes.

Automatic Phenotype Structure Mining

Page 6: Mining Phenotype Structures

Automatic Phenotype Structure Mining

[Figure: a gene expression matrix (samples 1 to 7) is mined to produce the result: the informative genes (gene1, gene2, gene3) and the phenotype distinction.]

Given an n × m data matrix M and the number of sample phenotypes K, the goal is to find K mutually exclusive groups of the samples matching their empirical phenotypes, and to find the set of informative genes which manifests this phenotype distinction.

Page 7: Mining Phenotype Structures


Requirements

The expression levels of each informative gene should be similar over the samples within each phenotype

The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes

Page 8: Mining Phenotype Structures

Challenges (1)

The volume of genes is very large while the number of samples is very limited, so no distinct class structures of samples can be properly detected by the existing techniques.

Page 9: Mining Phenotype Structures

Challenges (2)

The limited informative genes are buried in a large amount of noise.

[Figure: gene1 to gene15, of which gene5, gene9 and gene12 are singled out.]

Page 10: Mining Phenotype Structures

Challenges (3)

[Figure: expression profiles of four genes: PROTEASOME IOTA (X59417), C-myb (U22376), Fumarylacetoacetate (M55150), LTC4 synthase (U50136).]

The values within data matrices are all real numbers.

None of the informative genes follows an ideal “high-low” pattern.

Page 11: Mining Phenotype Structures

Related Work

New tools using traditional methods:

Tools: TreeView, CLUTO, CIT, CNIO, GeneSpring, J-Express, CLUSFAVOR

Methods:

• SOM
• K-means
• Hierarchical clustering
• Graph based clustering
• PCA

The similarity measures used in these methods are based on the full gene space.

PCs do not necessarily have strong correlation with informative genes.

Page 12: Mining Phenotype Structures

Related Work (Cont’d)

Clustering with feature selection (CLIFF, two-way ordering, SamCluster):

1. Filtering the invariant genes
   • Rank variance
   • PCA
   • CV
2. Partition the samples
   • Ncut, Min-Max Cut
   • Hierarchical Clustering
3. Pruning genes based on the partition
   • Markov blanket filter
   • T-test

Page 13: Mining Phenotype Structures

Related Work (Cont’d)

Subspace clustering:

• Bi-clustering
• δ-clustering

Page 14: Mining Phenotype Structures

Related Work (Cont’d)

Subspace clustering only measures trend similarity. In our model, however, we require each gene to show consistent signals on the samples of the same phenotype.

Page 15: Mining Phenotype Structures

Related Work (Cont’d)

Subspace clustering algorithms only detect locally correlated features and objects, without considering the dissimilarity between different clusters. We want to find the genes which can differentiate all phenotypes.

Page 16: Mining Phenotype Structures

Our Contributions

We transformed the phenotype structure mining problem into an optimization problem.

A series of statistics-based metrics are defined as objective functions.

A heuristic searching method and a mutual reinforcing adjustment approach are proposed to find phenotype structures.

Page 17: Mining Phenotype Structures

Model - Measurements

[Figure: expression of gene1 to gene3 (gene subset G’) over samples grouped into S1 and S2, annotated with intra-consistency within each group, inter-divergency between the groups, and the resulting phenotype quality.]

Page 18: Mining Phenotype Structures

Intra-consistency

Measurement   Data(A)     Data(B)
residue       0.1975      0.4506
MSR           0.0494      0.4012
Ours          339.0667    5.3000

Data(A): NOT consistent. Data(B): consistent.

Page 19: Mining Phenotype Structures

Intra-pattern-consistency (Cont’d)

Variance of a single gene on the samples within one phenotype:

$$Var(i, S') = \frac{1}{|S'| - 1} \sum_{s_j \in S'} \left(w_{i,j} - \bar{w}_{i,S'}\right)^2$$

where $\bar{w}_{i,S'}$ is the mean expression of gene $g_i$ over the samples in $S'$.

Intra-pattern-consistency: average row variance

$$Con(G', S') = \frac{1}{|G'|} \sum_{g_i \in G'} Var(i, S') = \frac{1}{|G'| \cdot (|S'| - 1)} \sum_{g_i \in G'} \sum_{s_j \in S'} \left(w_{i,j} - \bar{w}_{i,S'}\right)^2$$

In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?

Con is the average variance over the subset of genes: the smaller the value, the better the intra-phenotype consistency.
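The average-row-variance measure can be computed directly. The sketch below is illustrative only: it assumes the sample variance with denominator |S'| - 1, stores the matrix as a list of gene rows, and uses hypothetical names (`var_gene`, `con`) and toy values.

```python
from statistics import mean, variance

def var_gene(row, samples):
    """Var(i, S'): sample variance of one gene's values over the samples in S'."""
    return variance([row[j] for j in samples])

def con(matrix, genes, samples):
    """Con(G', S'): average of Var(i, S') over the genes in G'."""
    return mean([var_gene(matrix[i], samples) for i in genes])

# Toy matrix (hypothetical): rows are genes, columns are samples.
matrix = [
    [1.0, 1.0, 1.0, 9.0],
    [2.0, 4.0, 6.0, 0.0],
]
# Gene 0 is constant on samples 0..2 (variance 0); gene 1 has variance 4 there.
value = con(matrix, genes=[0, 1], samples=[0, 1, 2])
```

Each sample group needs at least two samples, since the sample variance is undefined for a single value.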

Page 20: Mining Phenotype Structures

Inter-pattern-divergence

Both “inter-pattern-consistency” and “intra-pattern-divergence” on the same gene are reflected.

Average block distance:

$$Div(G', S_1, S_2) = \frac{\sum_{g_i \in G'} \left|\bar{w}_{i,S_1} - \bar{w}_{i,S_2}\right|}{|G'|}$$

How well can a subset of genes (candidate informative genes) discriminate two phenotypes of samples?

Div is the average, over the genes, of the difference in mean expression between the two phenotypes: the larger the inter-phenotype divergence, the better.
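The average block distance translates into a few lines of code. This is a sketch under the assumption that the matrix is a list of gene rows and that each sample group is a list of column indices; the name `div` and the toy values are hypothetical.

```python
from statistics import mean

def div(matrix, genes, s1, s2):
    """Div(G', S1, S2): mean absolute difference of per-group gene means."""
    diffs = []
    for i in genes:
        mean1 = mean([matrix[i][j] for j in s1])
        mean2 = mean([matrix[i][j] for j in s2])
        diffs.append(abs(mean1 - mean2))
    return sum(diffs) / len(genes)

# Toy matrix (hypothetical): gene 0 separates the two groups, gene 1 does not.
matrix = [
    [1.0, 1.0, 5.0, 5.0],
    [2.0, 2.0, 2.0, 2.0],
]
value = div(matrix, genes=[0, 1], s1=[0, 1], s2=[2, 3])
```

Here the non-informative gene 1 contributes nothing and dilutes the average, which is why removing it from the gene subset raises the divergence.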

Page 21: Mining Phenotype Structures

Pattern Quality

The purpose of pattern discovery is to identify the empirical patterns where the intra-pattern-consistency inside each phenotype is high and the inter-pattern-divergence between each pair of phenotypes is large.

$$\Omega = \frac{1}{\sum_{S_i, S_j\,(1 \le i, j \le K;\ i \ne j)} \frac{\sqrt{Con(G', S_i) + Con(G', S_j)}}{Div(G', S_i, S_j)}}$$

The higher the value, the better the quality.
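Putting the two measures together gives the quality score. The sketch below assumes the form Ω = 1 / Σ sqrt(Con(G', S_i) + Con(G', S_j)) / Div(G', S_i, S_j), with the sum taken once per unordered pair of groups (an assumption about how pairs are counted). The helper names and the toy data are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance

def con(m, genes, s):
    """Average sample variance of the selected genes over sample group s."""
    return mean([variance([m[i][j] for j in s]) for i in genes])

def div(m, genes, s1, s2):
    """Average absolute difference of group means over the selected genes."""
    return mean([abs(mean([m[i][j] for j in s1]) - mean([m[i][j] for j in s2]))
                 for i in genes])

def quality(m, genes, partition):
    """Phenotype quality; each unordered pair of groups is counted once (assumed)."""
    total = 0.0
    for sa, sb in combinations(partition, 2):
        total += math.sqrt(con(m, genes, sa) + con(m, genes, sb)) / div(m, genes, sa, sb)
    return 1.0 / total

# Toy data (hypothetical): one gene, six samples, two obvious groups.
m = [[1.0, 2.0, 3.0, 11.0, 12.0, 13.0]]
good = quality(m, [0], [[0, 1, 2], [3, 4, 5]])   # groups match the structure
bad = quality(m, [0], [[0, 1, 3], [2, 4, 5]])    # groups mix the two levels
```

The sketch does not guard against a zero divergence or an all-zero variance, either of which would cause a division by zero; a fuller implementation would need to handle those cases.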

Page 22: Mining Phenotype Structures

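Measurements

Intra-consistency:

$$Con(G', S') = \frac{1}{|G'| \cdot (|S'| - 1)} \sum_{g_i \in G'} \sum_{s_j \in S'} \left(w_{i,j} - \bar{w}_{i,S'}\right)^2$$

Inter-divergence:

$$Div(G', S_1, S_2) = \frac{\sum_{g_i \in G'} \left|\bar{w}_{i,S_1} - \bar{w}_{i,S_2}\right|}{|G'|}$$

Phenotype Quality:

$$\Omega = \frac{1}{\sum_{S_i, S_j\,(1 \le i, j \le K;\ i \ne j)} \frac{\sqrt{Con(G', S_i) + Con(G', S_j)}}{Div(G', S_i, S_j)}}$$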

Page 23: Mining Phenotype Structures

Phenotype Quality

              Data(A)    Data(B)    Data(C)
Con           4.25       3.44       4.52
Div           41.60      25.20      46.16
Quality (Ω)   14.2687    9.6074     15.3526

Highest phenotype quality: Data(C)
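As a sanity check, the tabulated quality values can be reproduced from the Con and Div rows. The arithmetic below assumes two phenotypes, a single pair in the sum, and that the listed Con applies to each of the two groups; the slide does not state this, so treat it as an interpretation that happens to match all three columns. Under those assumptions Ω reduces to Div / sqrt(2 · Con):

$$\Omega_A = \frac{41.60}{\sqrt{4.25 + 4.25}} \approx 14.2687, \qquad \Omega_B = \frac{25.20}{\sqrt{3.44 + 3.44}} \approx 9.6074, \qquad \Omega_C = \frac{46.16}{\sqrt{4.52 + 4.52}} \approx 15.3526$$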

Page 24: Mining Phenotype Structures

Model - Formalized Problem

Input

m samples and n genes

the corresponding gene expression matrix M

the number of phenotypes K

Output

A K-partition of samples (phenotypes) and a subset of genes (informative space) such that the phenotype quality is maximized.

Page 25: Mining Phenotype Structures

Strategy

Maintain a candidate phenotype structure and iteratively adjust the candidate structure toward the optimal solution.

Basic elements:

A candidate structure:
• A partition of samples {S1, S2, …, SK}
• A subset of genes G’ ⊆ G
• The corresponding phenotype quality Ω

An adjustment:
• For a gene g_i ∉ G’, insert into G’
• For a gene g_i ∈ G’, remove from G’
• For a sample s_i in a group S’, move to other group

The quality gain ΔΩ measures the change of phenotype quality before and after the adjustment.

Page 26: Mining Phenotype Structures

Heuristic Searching

[Flowchart: candidate structure generation produces an intermediate candidate structure. Iterative adjusting then picks up an object (gene/sample) and evaluates an adjustment: if ΔΩ > 0 (Y), the adjustment is conducted; otherwise (N), it is conducted with probability p = exp(ΔΩ / (Ω × T(i))).]

Page 27: Mining Phenotype Structures

Heuristic Searching

Starts with a random K-partition of samples and a subset of genes as the candidate of the informative space.

Iteratively adjust the partition and the gene set toward a better solution. (Random order of genes and samples.)
• for each gene, try possible insert/remove
• for each sample, try best movement.

[Figure panels: Insert a gene; Remove a gene; Move a sample]
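The search loop described above might look roughly like the following. This is a sketch under several assumptions, not the authors' implementation: the quality is taken as 1 / Σ sqrt(Con_i + Con_j) / Div_ij over unordered group pairs, a worsening adjustment is accepted with probability exp(ΔΩ / (Ω × T(i))) where T(i) = 1/(i+1), degenerate states (empty gene set, a group with fewer than two samples, zero divergence) are scored 0, and a tiny `EPS` keeps the sum positive. All names and the toy matrix are hypothetical, and `k` must be at least 2.

```python
import math
import random
from itertools import combinations
from statistics import mean, variance

EPS = 1e-12  # assumption: avoids a zero sum when every variance is zero

def con(m, genes, s):
    return mean([variance([m[i][j] for j in s]) for i in genes])

def div(m, genes, s1, s2):
    return mean([abs(mean([m[i][j] for j in s1]) - mean([m[i][j] for j in s2]))
                 for i in genes])

def groups_of(labels, k):
    return [[j for j, c in enumerate(labels) if c == g] for g in range(k)]

def quality(m, genes, groups):
    """Phenotype quality; degenerate states score 0.0 (assumption of this sketch)."""
    if not genes or any(len(s) < 2 for s in groups):
        return 0.0
    total = 0.0
    for a, b in combinations(groups, 2):
        d = div(m, genes, a, b)
        if d == 0:
            return 0.0
        total += math.sqrt(con(m, genes, a) + con(m, genes, b) + EPS) / d
    return 1.0 / total

def heuristic_search(m, k, iterations=20, seed=0):
    rng = random.Random(seed)
    n_genes, n_samples = len(m), len(m[0])
    labels = [j % k for j in range(n_samples)]
    rng.shuffle(labels)                                  # random K-partition
    genes = set(rng.sample(range(n_genes), max(1, n_genes // 2)))
    omega = quality(m, sorted(genes), groups_of(labels, k))
    best = (omega, set(genes), list(labels))
    for it in range(iterations):
        temp = 1.0 / (it + 1)                            # T(i) = 1/(i+1)
        objects = [("gene", g) for g in range(n_genes)]
        objects += [("sample", s) for s in range(n_samples)]
        rng.shuffle(objects)                             # random order of genes and samples
        for kind, idx in objects:
            if kind == "gene":                           # insert if absent, remove if present
                cands = [(genes ^ {idx}, labels)]
            else:                                        # every move to another group
                cands = [(genes, labels[:idx] + [c] + labels[idx + 1:])
                         for c in range(k) if c != labels[idx]]
            scored = [(quality(m, sorted(g), groups_of(lab, k)), g, lab)
                      for g, lab in cands]
            new, g, lab = max(scored, key=lambda t: t[0])  # best candidate adjustment
            gain = new - omega
            if gain > 0 or (omega > 0 and
                            rng.random() < math.exp(gain / (omega * temp))):
                genes, labels, omega = g, lab, new
                if omega > best[0]:
                    best = (omega, set(genes), list(labels))
    return best

# Toy matrix (hypothetical): genes 0 and 1 separate samples 0..2 from 3..5.
toy = [
    [1.0, 1.2, 0.9, 8.0, 8.3, 7.9],
    [6.0, 6.1, 5.8, 2.0, 2.2, 1.9],
    [3.0, 5.0, 1.0, 4.0, 2.0, 6.0],
    [7.0, 2.0, 5.0, 3.0, 6.0, 1.0],
]
omega, genes, labels = heuristic_search(toy, k=2)
```

The function returns the best state seen rather than the final one, because an annealing step may deliberately move to a worse structure.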

Page 28: Mining Phenotype Structures

Heuristic Search

For each possible adjustment, compute ΔΩ:
• For each gene, try possible insert/remove
• For each sample, try the best movement

ΔΩ > 0: conduct the adjustment

ΔΩ < 0: conduct the adjustment with probability

$$p = \exp\left(\frac{\Delta\Omega}{\Omega \times T(i)}\right)$$

T(i) is a decreasing simulated annealing function and i is the iteration number. T(0)=1, T(i)=1/(i+1) in our implementation.
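The acceptance rule can be isolated into a small function. The sketch assumes the probability has the form p = exp(ΔΩ / (Ω × T(i))) with T(i) = 1/(i+1); the slide leaves the case ΔΩ = 0 unspecified, and accepting it here is an arbitrary choice. The names `temperature`, `accept` and `FixedRng` are hypothetical.

```python
import math
import random

def temperature(i):
    """T(i) = 1 / (i + 1), so T(0) = 1."""
    return 1.0 / (i + 1)

def acceptance_probability(gain, omega, i):
    """Assumed form: p = exp(gain / (omega * T(i))) for a negative gain."""
    return math.exp(gain / (omega * temperature(i)))

def accept(gain, omega, i, rng=random):
    """Conduct improving adjustments always, worsening ones with probability p."""
    if gain >= 0:  # treating a zero gain as acceptable is an assumption
        return True
    return rng.random() < acceptance_probability(gain, omega, i)

class FixedRng:
    """Stand-in random source that always returns the same draw (for the demo)."""
    def __init__(self, value):
        self.value = value
    def random(self):
        return self.value

early = acceptance_probability(-1.0, 2.0, 0)   # exp(-0.5)
late = acceptance_probability(-1.0, 2.0, 9)    # exp(-5.0)
```

Because T(i) shrinks as i grows, the same loss in quality becomes much less likely to be accepted in later iterations, which is what lets the search settle.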

Page 29: Mining Phenotype Structures

Mutual Reinforcing Adjustment - Motivation

Drawbacks of the heuristic searching method: blind initialization, equal chance of samples and genes, noisy samples.

The phenotype quality of a subset of the informative genes and a partial phenotype partition should also be high.

Mining phenotypes and informative genes directly from high-dimensional noisy data is difficult, so we start from small groups whose data distributions and patterns are much easier to detect.

Mining of phenotypes and informative genes should be mutually reinforced.

Page 30: Mining Phenotype Structures

Mutual Reinforcing Adjustment - Motivation

[Figure: panels A, B, C]

Page 31: Mining Phenotype Structures

Mutual Reinforcing Adjustment - Major Steps

Partition the Matrix: divide the original matrix into a series of exclusive sub-matrices based on partitioning both the samples and genes.

Reference Partition Detection: posit a partial or approximate phenotype structure called a reference partition of samples.
• compute the reference degree for each sample group;
• select k groups of samples;
• do partition adjustment.

Gene Adjustment: adjust the candidate informative genes.
• compute Ω for the reference partition on G
• perform possible adjustments of each gene

Refinement Phase

Page 32: Mining Phenotype Structures

Method Detail - Iteration Phase

[Diagram: starting from informative genes G’ over all samples, each iteration runs three steps: partitioning the matrix, reference partition detection, and gene adjustment. The result is informative genes G’’ over all samples, which passes to the next iteration.]

Page 33: Mining Phenotype Structures

Partitioning the Matrix

Partition the samples and genes into multiple groups
• Use CAST
• A threshold t decides the size of each group
• Based on Pearson’s correlation coefficient
• Outliers will be filtered out from any group
• Samples or genes in the same group share similar patterns

$$\rho_{X,Y} = \frac{\sum_{i=1}^{k} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{k} (y_i - \bar{y})^2}}$$
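Pearson's correlation coefficient, which serves as the similarity measure for the grouping step, is straightforward to compute. The function below is a plain sketch of the standard formula; the name `pearson` and the sample vectors are illustrative, and it does not implement CAST itself.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                            sum((b - mean_y) ** 2 for b in y))
    return numerator / denominator

same_trend = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_trend = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A constant vector has zero variance and makes the denominator zero, so such rows or columns would need to be filtered before grouping.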

Page 34: Mining Phenotype Structures

Reference Partition Detection

Select the groups of samples as potential phenotypes

• Pick the first group with the highest reference degree

$$ref(S_j) = \log |S_j| \times \sum_{G_i \in G'} \frac{1}{Con(G_i, S_j)}$$

• Select the other groups by considering the inter-phenotype divergence w.r.t. selected groups

$$Ran(S_{x_p}) = \log |S_{x_p}| \times \sum_{t=0}^{p-1} \sum_{G_i \in G'} \frac{Div(G_i, S_{x_p}, S_{x_t})}{Con(G_i, S_{x_p})}$$

Page 35: Mining Phenotype Structures


Check the Missing Samples

Probabilistically insert the remaining samples not in the selected groups into the most probably matching group

In iterations, use the gene candidate sets to improve the reference partition

Page 36: Mining Phenotype Structures


Gene Adjustment

Gene adjustment: Test the possible adjustments that lead to improvement

Insert a gene Remove a gene

Page 37: Mining Phenotype Structures

Method - Refinement Phase

The partition corresponding to the best state may not cover all the samples.

Add every sample not covered by the reference partition into its matching group; the resulting groups are the phenotypes of the samples.

Then, a gene adjustment phase is conducted. We execute all adjustments with a positive quality gain; the resulting gene set is the informative space.

Time complexity: O(n · m² · I)

Page 38: Mining Phenotype Structures

Mining Multiple Phenotype Structures

[Figure: expression matrix of gene1, gene6, gene7, gene2, gene4, gene3, gene8, gene9 over samples ordered 1 4 8 2 3 5 6 7 9 10, showing an empirical phenotype structure and a hidden phenotype structure.]

Output: p phenotype structures, where the t-th structure is a Kt-partition of samples (phenotypes) and a subset of genes (informative space) which manifests the sample partition. The overall phenotype quality is maximized.

Page 39: Mining Phenotype Structures

Extended Algorithm Strategy

Maintain p candidate phenotype structures and iteratively adjust them toward the optimal solution.

Basic elements of each candidate structure:

A candidate structure:
• A Kt-partition of samples
• A subset of genes G’ ⊆ G
• The corresponding phenotype quality Ω_t

An adjustment:
• For a gene g_i ∉ G_t, insert into G_t
• For a gene g_i ∈ G_t, move to G_t’ (t ≠ t’) or remove from all structures
• For a sample s_i in group S’, move to other group

The quality gain ΔΩ measures the change of pattern quality between the states before and after the adjustment.

Page 40: Mining Phenotype Structures

The Extended Algorithm (Cont’d)

[Diagram: adjustments across candidate structure 1 and candidate structure 2. Gene: insert, move, remove. Sample: move.]

Page 41: Mining Phenotype Structures


Mining Multiple Phenotype Structures (Cont’d)

Partially informative genes

Page 42: Mining Phenotype Structures

Formalized Problem

Input

• m samples and n genes
• the corresponding gene expression matrix M
• the number of phenotype structures p
• the set of numbers {K1, K2, …, Kp}

Output

p phenotype structures, where the t-th structure is a Kt-partition of samples (phenotypes) and a subset of genes (informative space) which manifests the sample partition. The overall phenotype quality is maximized.

Page 43: Mining Phenotype Structures

The Algorithm

Candidate Structure Generation

• cluster genes into p’ groups (p’ > p) (CAST)
• generate sample partitions one by one on clusters of genes, select best quality genes.

Iterative Adjustment

• for each gene, try possible insert/move/remove
• for each sample,
  - examine all possible adjustments
  - select best movement.

Page 44: Mining Phenotype Structures

The Algorithm (Cont’d)

[Diagram: Gene (p possible adjustments): insert, move, remove. Sample (Kt − 1 possible adjustments for each partition).]

Page 45: Mining Phenotype Structures

The Algorithm (Cont’d)

Data Standardization

the original gene intensity values → relative values

$$w'_{i,j} = \frac{w_{i,j} - \bar{w}_i}{\sigma_i}, \quad \text{where } \bar{w}_i = \frac{\sum_{j=1}^{m} w_{i,j}}{m}; \quad \sigma_i = \sqrt{\frac{\sum_{j=1}^{m} (w_{i,j} - \bar{w}_i)^2}{m - 1}}$$

Random order of genes and samples

Conduct negative action with a probability

$$p = \exp\left(\frac{\Delta\Omega}{\Omega \times T(i)}\right)$$

Simulated annealing technique

$$T(0) = 1; \quad T(i) = \frac{1}{1 + i}$$
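The standardization step converts each gene's intensities to values relative to that gene's own mean and standard deviation. A minimal sketch, assuming the m − 1 (sample) form of the standard deviation and a list-of-rows matrix; the name `standardize` and the toy rows are hypothetical.

```python
from statistics import mean, stdev

def standardize(matrix):
    """Row-wise z-scores: (w_ij - mean_i) / sigma_i for every gene row i."""
    result = []
    for row in matrix:
        mu = mean(row)
        sigma = stdev(row)  # sample standard deviation, denominator m - 1
        result.append([(value - mu) / sigma for value in row])
    return result

# Toy matrix (hypothetical): two genes on very different scales.
raw = [
    [1.0, 2.0, 3.0],
    [100.0, 200.0, 300.0],
]
relative = standardize(raw)
```

After standardization both toy genes have identical rows, which illustrates why the step is useful: genes are compared by pattern rather than by absolute intensity.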

Page 46: Mining Phenotype Structures

Experiments

Data Sets:

Multiple-sclerosis data
• MS-IFN: 4132 * 28 (14 MS vs. 14 IFN)
• MS-CON: 4132 * 30 (15 MS vs. 15 Control)

Leukemia data
• 7129 * 38 (27 ALL vs. 11 AML)
• 7129 * 34 (20 ALL vs. 14 AML)

Colon cancer data
• 2000 * 62 (22 normal vs. 40 tumor colon tissue)

Hereditary breast cancer data
• 3226 * 22 (7 BRCA1, 8 BRCA2, 7 Sporadics)

Page 47: Mining Phenotype Structures

Rand Index

Rand Index: a measurement of “agreement” between the ground-truth (P) and the results (Q):

“a”: the number of pairs of objects that are in the same class in P and in the same class in Q;

“b”: the number of pairs of objects that are in the same class in P but not in the same class in Q;

“c”: the number of pairs of objects that are in the same class in Q but not in the same class in P;

“d”: the number of pairs of objects that are in different classes in P and in different classes in Q.

$$RI = \frac{a + d}{a + b + c + d}$$

[Figure: placement of a sample pair s1, s2 in P and in Q for each of the four cases.]
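The four pair counts and the index can be computed by enumerating all sample pairs. This is a small sketch of the definition above; the name `rand_index` and the label vectors are illustrative.

```python
from itertools import combinations

def rand_index(p, q):
    """Rand Index between two labelings p (ground truth) and q (result)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(p)), 2):
        same_p = p[i] == p[j]
        same_q = q[i] == q[j]
        if same_p and same_q:
            a += 1          # together in both
        elif same_p:
            b += 1          # together in P only
        elif same_q:
            c += 1          # together in Q only
        else:
            d += 1          # separated in both
    return (a + d) / (a + b + c + d)

identical = rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # same grouping, renamed labels
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on which samples are grouped together, so relabeling the clusters does not change it.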

Page 48: Mining Phenotype Structures

Phenotype Structure Detection

Data Set     MS-IFN   MS-CON   Leukemia-G1  Leukemia-G2  Colon    Breast
Data Size    4132*28  4132*30  7129*38      7129*34      2000*62  3226*22
J-Express    0.4815   0.4851   0.5092       0.4965       0.4939   0.4112
CLUTO        0.4815   0.4828   0.5775       0.4866       0.4966   0.6364
CIT          0.4841   0.4851   0.6586       0.4920       0.4966   0.5844
CNIO         0.4815   0.4920   0.6017       0.4920       0.4939   0.4112
CLUSFAVOR    0.5238   0.5402   0.5092       0.4920       0.4939   0.5844
δ-cluster    0.4894   0.4851   0.5007       0.4538       0.4796   0.4719
Heuristic    0.8052   0.6230   0.9761       0.7086       0.6293   0.8638
Mutual       0.8387   0.6513   0.9778       0.7558       0.6827   0.8749

Page 49: Mining Phenotype Structures

Experiments

             Number of iterations          Running time
Data Size    mean    standard deviation    mean    standard deviation
4132*28      158     27.2                  180     35.1
4132*30      168     29.5                  195     37.8
7129*38      171     16.1                  436     51.9
7129*34      198     35.9                  458     101.2
2000*62      133     17.8                  479     98.5
3226*22      157     22.2                  167     35.6

The mean value and standard deviation of the number of iterations and the response time (in seconds) with respect to the matrix size.

Page 50: Mining Phenotype Structures

Experimental Results (5)

Phenotype Structure Detection (Cont’d)

The mutual reinforcing approach as applied to the MS-IFN group.

(A) shows the distribution of the original 28 samples. Each point represents a sample with 4132 genes mapped to two-dimensional space.

(B) shows the distribution in the middle of the adjustment.

(C) shows the distribution of the same 28 samples after the iterations. 76 genes were selected as the informative space.

Page 51: Mining Phenotype Structures

Experimental Results (5)

Informative Gene Selection

Page 52: Mining Phenotype Structures


Phenotype Structures

Page 53: Mining Phenotype Structures

Experimental Results (5)

Informative Gene Selection (Cont’d)

Page 54: Mining Phenotype Structures

Experimental Results (5)

Scalability Evaluation

Page 55: Mining Phenotype Structures

Conclusion from the Experiments

The work is motivated by the needs of emerging microarray data analysis.

The strategy is designed for data which have the following properties:
• The number of samples is limited but the gene dimension is very large.
• Large volumes of irrelevant and redundant genes prevent accurate grouping of samples.
• Analyzing the objects of one dimension can enhance the detection of meaningful patterns in the other dimension.