Pattern-based Clustering

Page 1: Pattern-based Clustering

University at Buffalo The State University of New York

Pattern-based Clustering

How to cluster the five objects?
Hard to define a global similarity measure.

Page 2: Pattern-based Clustering

What Is Pattern-based Clustering?

A cluster: a set of objects following the same pattern in a subset of dimensions (Wang et al, 2002)

Page 3: Pattern-based Clustering

Challenges

  • Most clustering approaches do not address the temporal variations in time-series gene expression data, which are an important feature and affect clustering performance.
  • Previous approaches try to find coherent patterns and clusters with respect to the entire set of attributes.
  • Patterns may be embedded in subspaces of the attributes:
      • Only a subset of genes participates in any cellular process of interest.
      • Any cellular process may take place only in a subset of the experimental conditions.

[Figure: (a) raw data, (b) shifting patterns, (c) scaling patterns]

Page 4: Pattern-based Clustering

Gene-Sample-Time (GST) Microarray Data

2D time-series data

3D gene-sample-time data

  • The GST microarray data consist of three dimensions: genes, samples, and time points.

  • The samples often exhibit various phenotypes, e.g., cancer vs. control.

A collection of samples

Page 5: Pattern-based Clustering

Challenges of Mining GST Data

Challenges      | 2D data                | 3D data
Mining process  | Partition genes        | Partition genes and samples simultaneously
Cluster model   | Two types of variables | Three types of variables

Most clustering algorithms were designed for 2D data and cannot be directly extended to 3D data.

Page 6: Pattern-based Clustering

Coherent Gene Cluster

  • The group of samples (sj1, sj2, sj3) may exhibit the same phenotype.
  • The group of genes (gi1, gi2, gi3) may be strongly correlated with the phenotype shared by (sj1, sj2, sj3).

[Figure: a 3D GST data set, its 2D representation, and a coherent gene cluster]

Page 7: Pattern-based Clustering

Results from a Real Data Set

The Multiple Sclerosis (MS) data consist of:

  • 4324 genes
  • 13 MS patients
  • 10 time points before and after IFN-β treatment

25 coherent gene clusters were reported.

[Figure: an example of a coherent gene cluster (107 genes, 8 samples), one panel per sample, Sample A through Sample H]

Page 8: Pattern-based Clustering

Other Types of Coherent Clusters

Page 9: Pattern-based Clustering

Problem Definition

Given a GST microarray data matrix M, a maximal coherent gene cluster C = (G × S) is a combination of a subset of genes G and a subset of samples S such that:

  • Coherent: the genes in G are coherent across the samples in S;
  • Significant: |G| ≥ min_g and |S| ≥ min_s, where min_g and min_s are user-specified parameters;
  • Maximal: inserting any gene g ∉ G or any sample s ∉ S makes C no longer coherent.

The problem of mining coherent gene clusters is to find the complete set of maximal coherent gene clusters in M.

Page 10: Pattern-based Clustering

Coherence Measure

  • Various coherence measures exist; the choice of measure is application dependent.
  • A general coherence model: given a coherence measure sim(·) and a user-specified threshold δ,
      • a gene ga is coherent on samples si and sj if sim(pai, paj) ≥ δ;
      • (G1, S1) is a coherent gene matrix if every gene gi ∈ G1 is coherent across the samples in S1;
      • trivial coherent gene matrices: ({gi}, {sj}) and (G, {sj}).
  • We choose Pearson's correlation coefficient; other coherence measures are also applicable.
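The coherence test above can be illustrated with a minimal sketch. It assumes that each gene has one time-series profile per sample and that two samples are coherent for the gene when the Pearson correlation of the two profiles reaches the threshold δ. The function names, the example profiles, and the handling of flat profiles are illustrative assumptions, not part of the slides; the default δ = 0.8 is the value that appears in the parameter study later in the deck.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a completely flat profile is treated as not coherent.
        return 0.0
    return cov / sqrt(var_x * var_y)


def is_coherent(profile_i, profile_j, delta=0.8):
    """True if one gene's profiles on two samples are coherent at threshold delta."""
    return pearson(profile_i, profile_j) >= delta


# Hypothetical time series of a single gene measured on three samples.
p1 = [1.0, 2.0, 3.0, 4.0]
p2 = [2.0, 4.0, 6.0, 8.0]  # scaled copy of p1
p3 = [4.0, 3.0, 2.0, 1.0]  # opposite trend

coherent_12 = is_coherent(p1, p2)
coherent_13 = is_coherent(p1, p3)
```

Because Pearson's correlation is unchanged by shifting a profile or scaling it by a positive factor, shifted and scaled versions of the same profile score as coherent under this measure.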

Page 11: Pattern-based Clustering

Related Work

  • Clustering algorithms on gene-sample or gene-time microarray data
      • The cluster model is completely different.
  • Subspace clustering
      • Finds subsets of objects coherent with subsets of attributes.
  • Frequent pattern mining
      • Finds subsets of items frequently appearing in transaction databases.

Page 12: Pattern-based Clustering

Algorithm Outline

Phase 1 (Pre-processing) : For each gene g, find the complete set of maximal coherent sample sets of gene g.

Phase 2: Compute the complete set of maximal coherent gene clusters based on pre-processing results.

Page 13: Pattern-based Clustering

Coherent Sample Sets

Given a gene g, a maximal coherent sample set of g is a subset of samples Si such that:

  • Coherent: g is coherent across Si;
  • Significant: |Si| ≥ min_s;
  • Maximal: there exists no superset S′ ⊃ Si such that g is also coherent across S′.

(g × Si) is a building block for coherent gene clusters that include g.

Page 14: Pattern-based Clustering

Preprocessing Phase

The coherence matrix of gene g (suppose min_s = 3):

      s1  s2  s3  s4  s5  s6
s1     1   1   0   1   0   0
s2     1   1   0   0   0   0
s3     0   0   1   1   1   1
s4     1   0   1   1   1   1
s5     0   0   1   1   1   1
s6     0   0   1   1   1   1

[Figure: the coherence graph of gene g, with nodes s1 through s6 and an edge between each pair of samples marked 1 in the matrix]

{s3, s4, s5, s6} is a coherent sample set of gene g.
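One way to read the coherence graph is that a maximal coherent sample set is a maximal clique with at least min_s nodes. The slides do not say which clique algorithm is used, so the sketch below uses the basic Bron-Kerbosch recursion as a stand-in; the function name and data layout are assumptions made for illustration.

```python
def maximal_coherent_sample_sets(matrix, names, min_s):
    """Return the maximal cliques of size >= min_s in a 0/1 coherence matrix."""
    n = len(names)
    # Adjacency sets of the coherence graph (self-loops on the diagonal ignored).
    adj = {i: {j for j in range(n) if j != i and matrix[i][j]} for i in range(n)}
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidate nodes, x: nodes already explored.
        if not p and not x:
            if len(r) >= min_s:
                found.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(range(n)), set())
    return [[names[i] for i in clique] for clique in sorted(found)]


# Coherence matrix of gene g, copied from the slide.
NAMES = ["s1", "s2", "s3", "s4", "s5", "s6"]
MATRIX = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]

result = maximal_coherent_sample_sets(MATRIX, NAMES, min_s=3)
```

With min_s = 3 only {s3, s4, s5, s6} survives; the smaller cliques {s1, s2} and {s1, s4} are maximal but not significant.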

Page 15: Pattern-based Clustering

Sample-gene Search

  • Set enumeration tree
      • Enumerate all subsets of samples systematically.
      • Each node of the tree corresponds to a subset of samples.
  • For each node S
      • Find the maximal set of genes G_S that is coherent with S.

Page 16: Pattern-based Clustering

Set Enumeration Tree

The set enumeration tree for {a,b,c,d}

{}

{a} {b} {c} {d}

{a,b} {a,c} {a,d} {b,c} {b,d} {c,d}

{a,b,c} {a,b,d} {a,c,d} {b,c,d}

{a,b,c,d}
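The tree for {a,b,c,d} can be walked depth-first with a few lines of code. This is a generic sketch of set enumeration, in which each node extends its parent only with items that come later in a fixed order; the function name is hypothetical.

```python
def enumerate_subsets(items):
    """Yield every subset of `items` in depth-first set-enumeration order."""

    def visit(prefix, start):
        yield prefix
        # A node is extended only with items that follow its last element,
        # so every subset is generated exactly once.
        for k in range(start, len(items)):
            yield from visit(prefix + [items[k]], k + 1)

    yield from visit([], 0)


order = ["".join(subset) for subset in enumerate_subsets(["a", "b", "c", "d"])]
```

The first nodes visited are {}, {a}, {a,b}, {a,b,c}, {a,b,c,d}, after which the walk backtracks to {a,b,d}.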

Page 17: Pattern-based Clustering

Find the Maximal Coherent Subset of Genes

After the pre-processing phase:

Given a subset of samples S, how do we find the maximal coherent set of genes G_S?

  • Expensive approach: scan the table once for each S.
      • For each S, G_S can be derived by a single scan of the maximal coherent sample sets of all genes.
      • If S ⊆ Sj for some maximal coherent sample set Sj of gene g, then g ∈ G_S.
  • Efficient approach: use the inverted list.

Gene | Maximal coherent sample sets
g1   | {s1, s2, s3, s4, s5}
g2   | {s1, s2, s4}, {s1, s5}
g3   | {s1, s2, s3, s4, s5}
g4   | {s1, s2, s3}, {s5, s6}
g5   | {s1, s5, s6}

Page 18: Pattern-based Clustering

The Inverted List

The table of maximal coherent sample sets for genes:

Gene | Maximal coherent sample sets
g1   | {s1, s2, s3, s4, s5}
g2   | {s1, s2, s4}, {s1, s5}
g3   | {s1, s2, s3, s4, s5}
g4   | {s1, s2, s3}, {s5, s6}
g5   | {s1, s5, s6}

The table of inverted lists for samples:

Sample | Inverted list
s1     | {g1.b1, g2.b1, g2.b2, g3.b1, g4.b1, g5.b1}
s2     | {g1.b1, g2.b1, g3.b1, g4.b1}
s3     | {g1.b1, g3.b1, g4.b1}
s4     | {g1.b1, g2.b1, g3.b1}
s5     | {g1.b1, g2.b2, g3.b1, g4.b2, g5.b1}
s6     | {g4.b2, g5.b1}

Here g2.b1 and g2.b2 denote the first and second maximal coherent sample sets of gene g2.
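The inverted lists can be derived mechanically from the per-gene table. The sketch below assumes the table is held as a dictionary from gene name to a list of sample sets, and labels the k-th set of gene g as "g.bk" to match the slide's notation; the function name and data layout are illustrative.

```python
def build_inverted_lists(sample_sets):
    """Map each sample to the labels of the maximal coherent sample sets containing it."""
    inverted = {}
    for gene, blocks in sample_sets.items():
        for k, block in enumerate(blocks, start=1):
            for sample in block:
                inverted.setdefault(sample, set()).add(f"{gene}.b{k}")
    return inverted


# Maximal coherent sample sets per gene, copied from the slide.
SAMPLE_SETS = {
    "g1": [{"s1", "s2", "s3", "s4", "s5"}],
    "g2": [{"s1", "s2", "s4"}, {"s1", "s5"}],
    "g3": [{"s1", "s2", "s3", "s4", "s5"}],
    "g4": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "g5": [{"s1", "s5", "s6"}],
}

INVERTED = build_inverted_lists(SAMPLE_SETS)
```

Running this on the slide's table reproduces the six inverted lists, for example {g4.b2, g5.b1} for s6.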

Page 19: Pattern-based Clustering

Intersection Instead of Scanning

  • Given a subset of samples S = {si1, …, sik}, intersect the inverted lists of si1, …, sik.
  • For example, given S = {s1, s2, s3}, Ls1 ∩ Ls2 ∩ Ls3 = {g1.b1, g3.b1, g4.b1}, so G_S = {g1, g3, g4}.
  • Suppose the parent of S is S′ = {si1, …, sik−1}; then L_S = L_S′ ∩ L_sik.
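The intersection step can be checked directly against the slide's example. In this sketch the inverted lists are hard-coded from the earlier table, and the helper names are hypothetical.

```python
# Inverted lists copied from the slide.
INVERTED = {
    "s1": {"g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"},
    "s2": {"g1.b1", "g2.b1", "g3.b1", "g4.b1"},
    "s3": {"g1.b1", "g3.b1", "g4.b1"},
    "s4": {"g1.b1", "g2.b1", "g3.b1"},
    "s5": {"g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"},
    "s6": {"g4.b2", "g5.b1"},
}


def coherent_blocks(samples):
    """L_S: labels of the sample sets that contain every sample in `samples`."""
    return set.intersection(*(INVERTED[s] for s in samples))


def genes_of(blocks):
    """G_S: strip the '.bk' suffix to recover gene names."""
    return sorted({label.split(".")[0] for label in blocks})


l_s = coherent_blocks(["s1", "s2", "s3"])
g_s = genes_of(l_s)

# Incremental form: reuse the parent's list instead of starting over.
l_parent = coherent_blocks(["s1", "s2"])
l_child = l_parent & INVERTED["s3"]
```

Intersecting on set labels rather than gene names matters: s2 lists g2.b1 and s5 lists g2.b2, so g2 correctly drops out of the result for {s2, s5}, because no single maximal coherent sample set of g2 contains both samples.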

Page 20: Pattern-based Clustering

Anti-monotonic Property

  • Given a combination (G × S), if G is not coherent on S, then for any superset S′ ⊃ S, G cannot be coherent on S′.
  • For any descendant S′ of S on the tree, let G_S be the maximal coherent gene set of S and G_S′ the maximal coherent gene set of S′. Since S′ ⊃ S, we have G_S′ ⊆ G_S.

Page 21: Pattern-based Clustering

Pruning Irrelevant Samples

Given a subset of samples S = {si1, …, sik}, a sample sj ∈ tail_S if:

  • j > ik, and
  • there exist at least min_g genes g such that g is coherent with S ∪ {sj}.

Samples sl ∉ tail_S (irrelevant samples) cannot be used to extend S.
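The tail of a node can be computed from the inverted lists. The following sketch hard-codes the lists from the earlier slide and assumes min_g = 3; that value is not stated in the deck, but it reproduces the tails drawn in the search-tree example. The function name is hypothetical.

```python
ORDER = ["s1", "s2", "s3", "s4", "s5", "s6"]

# Inverted lists copied from the slide.
INVERTED = {
    "s1": {"g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"},
    "s2": {"g1.b1", "g2.b1", "g3.b1", "g4.b1"},
    "s3": {"g1.b1", "g3.b1", "g4.b1"},
    "s4": {"g1.b1", "g2.b1", "g3.b1"},
    "s5": {"g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"},
    "s6": {"g4.b2", "g5.b1"},
}


def tail(samples, min_g):
    """Samples after the last element of S that keep at least min_g coherent genes."""
    current = set.intersection(*(INVERTED[s] for s in samples))
    last = max(ORDER.index(s) for s in samples)
    kept = []
    for s in ORDER[last + 1:]:
        genes = {label.split(".")[0] for label in current & INVERTED[s]}
        if len(genes) >= min_g:
            kept.append(s)
    return kept


MIN_G = 3  # assumption, chosen to match the example search tree
tail_s1 = tail(["s1"], MIN_G)
tail_s1_s2 = tail(["s1", "s2"], MIN_G)
```

Under this assumption, s6 is irrelevant for {s1} because only g5 remains coherent with {s1, s6}.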

Page 22: Pattern-based Clustering

Pruning Unpromising Nodes

Given a subset of samples S = {si1, …, sik}:

  • If |S| + |tail_S| < min_s, then prune the subtree of S.
  • Let the maximal coherent subset of genes of S be G_S. If there exists a cluster (G′ × S′) such that (S ∪ tail_S) ⊆ S′ and G_S ⊆ G′, then prune the subtree of S.

Page 23: Pattern-based Clustering

Determination of Maximal Coherent Gene Clusters

  • The depth-first search strategy: any superset S′ of S is either visited before S or lies in the subtree rooted at S.
  • To determine whether a coherent gene cluster (G_S × S) is maximal, check (G_S × S) after visiting all of its children, and report it if it is not subsumed.
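Putting the pieces together, a simplified sample-gene search might look like the sketch below. It is not the authors' implementation: it prunes only irrelevant samples, omits the two unpromising-node rules, and checks subsumption in a final pass rather than right after each node's children are visited. The thresholds min_g = 3 and min_s = 3 and all function names are assumptions.

```python
ORDER = ["s1", "s2", "s3", "s4", "s5", "s6"]

# Inverted lists copied from the slide.
INVERTED = {
    "s1": {"g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"},
    "s2": {"g1.b1", "g2.b1", "g3.b1", "g4.b1"},
    "s3": {"g1.b1", "g3.b1", "g4.b1"},
    "s4": {"g1.b1", "g2.b1", "g3.b1"},
    "s5": {"g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"},
    "s6": {"g4.b2", "g5.b1"},
}


def genes_of(blocks):
    """Strip the '.bk' suffix to recover the set of gene names."""
    return frozenset(label.split(".")[0] for label in blocks)


def sample_gene_search(min_g, min_s):
    """Depth-first search over sample subsets; assumes min_s >= 1."""
    candidates = []

    def visit(samples, blocks, start):
        if len(samples) >= min_s:
            candidates.append((genes_of(blocks), frozenset(samples)))
        for k in range(start, len(ORDER)):
            lst = INVERTED[ORDER[k]]
            child = lst if blocks is None else blocks & lst
            # Extend only with samples that keep at least min_g genes coherent.
            if len(genes_of(child)) >= min_g:
                visit(samples + [ORDER[k]], child, k + 1)

    visit([], None, 0)
    # Simplification: drop subsumed clusters in one final pass.
    return [
        (g, s) for g, s in candidates
        if not any((g, s) != (g2, s2) and g <= g2 and s <= s2
                   for g2, s2 in candidates)
    ]


clusters = sample_gene_search(min_g=3, min_s=3)
```

On the slide's data this sketch reports two clusters, ({g1, g3, g4} × {s1, s2, s3}) and ({g1, g2, g3} × {s1, s2, s4}). A post-hoc subsumption pass is easier to read, but it keeps every candidate in memory until the search finishes.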

Page 24: Pattern-based Clustering

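[Figure: part of the sample-gene search tree. Each node shows a sample set S followed by a second set, which appears to be its tail, and, for nodes with two or more samples, the intersected inverted list L_S.]

Node S       | tail_S           | L_S
{ }          | not shown        | not shown
{s1}         | {s2, s3, s4, s5} | not shown
{s2}         | {s3, s4}         | not shown
{s3}         | {}               | not shown
{s4}         | {}               | not shown
{s1, s2}     | {s3, s4}         | {g1.b1, g2.b1, g3.b1, g4.b1}
{s1, s3}     | {}               | {g1.b1, g3.b1, g4.b1}
{s1, s4}     | {}               | {g1.b1, g2.b1, g3.b1}
{s2, s3}     | {}               | {g1.b1, g3.b1, g4.b1}
{s2, s4}     | {}               | {g1.b1, g2.b1, g3.b1}
{s1, s2, s3} | {}               | {g1.b1, g3.b1, g4.b1}
{s1, s2, s4} | {}               | {g1.b1, g2.b1, g3.b1}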

Page 25: Pattern-based Clustering

Mining Coherent Gene Clusters

  • Systematic enumeration of genes and samples
      • Sample-gene search
      • Gene-sample search
  • Pruning rules
  • Determination of whether a coherent gene cluster (G × S) is maximal

Page 26: Pattern-based Clustering

Gene-sample Search

                                | Sample-gene search                     | Gene-sample search
Subjects to enumerate           | Samples                                | Genes
Number of subjects to enumerate | 10^1 to 10^2                           | 10^3 to 10^4
Coherent objects                | A single set of maximal coherent genes | Single or multiple sets of maximal coherent samples
Efficiency on GST data          | High                                   | Low

Page 27: Pattern-based Clustering

Experiment Data Sets

  • Real-world gene expression data
      • 4324 genes
      • 13 multiple sclerosis (MS) patients
      • Measured before and at 1, 2, 4, 8, 24, 48, 120, and 168 hours after IFN-β treatment
  • Synthetic data
      • Given the number of genes NG, samples NS, and coherent gene clusters NC:
      • Simulate the pre-processing results.
      • Embed NC maximal coherent gene clusters (G × S).

Page 28: Pattern-based Clustering

A Coherent Gene Cluster from Real Data

Page 29: Pattern-based Clustering

Effect of Parameters

[Figure panels]

  • Number of clusters vs. min_g (min_s = 3, δ = 0.8)
  • Number of clusters vs. min_s (min_g = 10, δ = 0.8)
  • Number of clusters vs. δ (min_g = 10, min_s = 3)

Page 30: Pattern-based Clustering

Scalability

[Figure panels]

  • Scalability of phase 1
  • Scalability w.r.t. number of genes (number of samples: 30)
  • Scalability w.r.t. number of samples (number of genes: 3,000)

Page 31: Pattern-based Clustering

Conclusion

We define the new problem of mining coherent gene clusters from the novel gene-sample-time microarray data.

We propose two approaches: the sample-gene search and the gene-sample search.

We conduct an extensive empirical evaluation on both real and synthetic data sets.

Page 32: Pattern-based Clustering

Future Work

New problems from the gene-sample-time microarray data:

  • Coherent sample clusters (G × S): for each s ∈ S, any pair of genes gi, gj ∈ G has coherent patterns.
  • Coherent gene-sample clusters (G × S): both a coherent gene cluster and a coherent sample cluster.