Download pdf - Statistical Genomics and Bioinformatics Workshop: … · Statistical Genomics and Bioinformatics Workshop 8/16/2013 2 Clustering Basics • Clustering is the process of grouping a

Statistical Genomics and Bioinformatics Workshop8/16/2013

1

Statistical Genomics and Bioinformatics Workshop:

Genetic Association and RNA-Seq Studies

Genomic Clustering and Signature Development

Brooke L. Fridley, PhDUniversity of Kansas Medical Center

1

Genomic Clustering

2


2

Clustering Basics• Clustering is the process of grouping a set of

physical or abstract objects into classes of similar objects.– It is also called unsupervised learning.

• Exploratory tool: use these methods for visualization, hypothesis generation, selection of genes for further consideration– We should not use these methods inferentially.

• Hierarchical clustering specifically: we are provided with a picture from which we can make many/any conclusions.

3

Why cluster genes?

• Identify groups of possibly co-regulated genes

• Identify typical temporal or spatial gene expression patterns

• Arrange a set of genes in a linear order that is at least not totally meaningless (we hope).

• Aids in the interpretation

4


3

Why cluster samples?

• Quality control: Detect experimental artifacts/bad hybridizations

• Check whether samples are grouped according to known categories (though this might be better addressed using a supervised approach: statistical tests,

classification)

• Identify new “classes” of biological samples– tumor subtypes

– Disease heterogeneity

5

Human breast tumors cluster into 6 distinct molecular subtypes of breast cancer with differences

in patient survival.

Hallett, et al. (2012) A gene signature for predicting outcome in patients with basal-like breast cancer. Scientific Reports

6


4

Clustering vs Classification

• Clustering is ‘unsupervised’:– We don’t use any information about what class the samples

belong to (e.g. disease status, cancer type) to determine cluster structure

– Hierarchical, K-Means, PCA, SOM, model-based clustering– Clustering finds groups in the data

• Classification methods are ‘supervised’:– identifying to which of a set of groups/categories a new

observation belongs, on the basis of a training set of data containing observations whose category membership is known.

– Discriminant analysis, PAM (Shrunken centroids), Random Forests/CART, K-Nearest Neighbor,

– Classification methods finds ‘classifiers’7

Cluster analysis

Generally, cluster analysis is based on two components:

1. Distance measure: Quantification of dissimilarity / similarity of objects.

2. Cluster algorithm: A procedure to group objects. Aim: small within-cluster distances, large between-cluster distances.

8


5

Distance and Similarity• Every clustering method is based solely on the measure

of distance or similarity.

– The clustering is only as good as the distance matrix

• Generally, not enough thought and time is spent on choosing and estimating the distance/similarity matrix.

– Applying correlation to highly skewed data will provide misleading results.

– Applying Euclidean distance to data measured on categorical scale will be invalid.

9

Hierarchical Clustering• The most over used statistical

method in gene expression analysis

• Tends to be pretty unstable.– Many different ways to perform

hierarchical clustering– Sensitive to small changes in

the data• Provided with clusters of every

size– where to “cut” the dendrogram is

user-determined

Huse, Kwon, et al (2010) Parallel Evolution in Pseudomonas aeruginosa over 39,000 generations in vivo. mBio. 1(4). 10


6

Distances between clusters used for hierarchical clustering

• Distance between two clusters is based on the pairwise distances between members of the clusters.

• Complete linkage: largest distance

• Average linkage: average distance

• Single linkage: smallest distance

• Complete linkage gives preference to compact / spherical clusters.

• Single linkage can produce long stretched clusters.11

Hierarchical Clustering Example: Visualizing DNA Methylation Data

Christensen et al. PLoS Genetics, 2009

CpG loci

Tissue sam

ples

Unmethylated

Methylated

Unsupervised hierarchical clustering based on Manhattan distance and average linkage. 12


7

K-Means

• Intuitive and very easy to implement

• Pre-specification of the number of clusters K

- K is typically unknown for most practical purposes

- Misspecification may lead to poor results

• Choice of a distance measure may be difficult to justify

• Clusters are expected to be of similar size

• May not work well for irregular clusters

- Clusters based on the first-moment (i.e., mean) only

13

K-means clustering illustration

Gene B

Gen

e A

14


8

Gene B

Gen

e A

m1(1)

m2(1)

• Need to first pre-specify the number of clusters • For simplicity, assume k = 2

• Step 1: Initialize the means for the two clusters

15

Gene B

Gen

e A

m1(1)

m2(1)

• Compute the distance of each of the points to each of the two means

• Assign points to the cluster with the closest mean

16


9

Gene B

Gen

e A

m1(1)

m2(1)



17

Gene B

Ge

ne A

m1(2)

m2(2)

• Re-compute the means based on the observations within that cluster (i.e., m(2)

1 and m(2)2)



18


10

• Re-compute the means based on the observations within that cluster (i.e., m(2)

1 and m(2)2)



Gene B

Gen

e A

m1(2)

m2(2)

19

Gene B

Gen

e A

m1(5)

m2(5)

• After 5 iterations, we converge on our final solution!• Consists of class labels for each of the n

observations

20


11

Cautions about Clustering

• Clustering can be a useful exploratory tool

• Cluster results are very sensitive to noise in the data

• Need to assess cluster structure and stability of results

• Different clustering approaches can give quite different results

– Methods

– Number Clusters

– Distance measures

• For hierarchical clustering, interpretation is almost always subjective

• Doesn’t tell us anything about what features should be used for clustering

21

Genomic Classification

22


12

?Bad prognosis

recurrence < 5yrsGood Prognosisrecurrence > 5yrs

ReferenceL van’t Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

ObjectsArray

Feature vectorsGene

expression

Predefine classesClinical

outcome

new array

Learning set

Classificationrule

Good PrognosisMatesis > 5

23

Decision tree classifiers (CART)

Gene 1Mi1 < ‐0.67

Gene 2Mi2 > 0.18

0

2

1

yes

yes

no

no

0.18

Transparent rules and easy to interpret and implement

G1 0.1 -0.2 0.3 G2 0.3 0.4 0.4G3 … ……Class 0 1 0

24


13

Ensemble classifiers

Training Set

X1, X2, … X100

Classifier 1Resample 1




Examples:BaggingBoosting

Random Forest

Aggregateclassifier

25

Feature Selection

• Lead to better classification performance by removing variables that are noise with respect to the outcome

• May provide useful biological insights

• Can eventually lead to the diagnostic tests

26


14

Classifier Performance Assessment

• Any classification rule needs to be evaluated for its performance on the future samples.

• One needs to estimate future performance based on what is available

• Assessing performance of the classifier based on

– Cross-validation• V-fold CV

• Leave-one-out cross validation (LOOCV)

– Training vs Testing set

– Testing on independent dataset

27

Diagram of performance assessment

Training set

Performance assessment

TrainingSet

Independenttest set

(CV) Learningset

(CV) Test set

Classifier

Classifier

Classifier

Resubstitution estimation

Test set estimation

Cross Validation

28


15

Pharmacogenomic (PGx) Classifiers

Benefits:• Enables patients to be

treated with drugs that actually work for them

• Avoids false negative trials for heterogeneous populations

• Avoids erroneous generalizations of conclusions from positive trials

Develop a PGx classifier to determine patients likely to benefit

from a new TRT

Establish reproducibility of the PGx classifier

Use the PGx classifier to design and analyze a new clinical trial to

evaluate effectiveness of TRT in the overall population or pre‐defined

subsets determined by the classifier.

29

Coming Full Circle:Integrating Many Methods & Data Types

30


16

Molecular Phenotype Based GWAS

• Molecular Subtype GWAS: – For risk with existing controls– For clinical outcome – For quantitative trait

All CasesAll Cases

Molecular Case Subtype A

Molecular Case Subtype A

GWASGWAS

Molecular Case Subtype B

Molecular Case Subtype B

GWASGWAS

Novel Subtype Specific Loci

Novel Subtype Specific Loci

Based on Clustering

31

Example: Integrative AnalysisMultiple Methods and Multiple Data Types

Association analysis to determine epigenetic features associated with clinical outcome (TTR)

Association analysis to determine epigenetic features associated with clinical outcome (TTR)

Model-based clustering (semi-supervised) to determine clinically relevant methylation-based subtypes

Model-based clustering (semi-supervised) to determine clinically relevant methylation-based subtypes

Nearest shrunken centroid (PAM) (supervised) analysis to determine genes in which mRNA levels differ between

methylation subgroups

Nearest shrunken centroid (PAM) (supervised) analysis to determine genes in which mRNA levels differ between

methylation subgroups

Pathway Analysis of resulting differential expressed genes to determine enriched pathways.

Pathway Analysis of resulting differential expressed genes to determine enriched pathways.

32


17

Training setN=168

Recurrences = 110

Testing setN=169

Recurrences = 116

337 HGS Ovarian

Cancer Tumors

Application to Ovarian Cancer• Restricted to High Grade

Serous (HGS) histology

• Pre-chemo tumor sample

• 450K IlluminaMethylation Array

• Similar stage and recurrence status between testing and training data sets

Wang, Cicek, … ,Fridley, Goode (2013). Tumor Hypo-Methylation at 6p21.3 Associates with Longer Time to Recurrence of High-Grade Serous Epithelial Ovarian Cancer. Under review. 33

Epigenetics

Step 1

Step 2

Step 3

Step 4

34


18

Training set(168)

Testing set(169)

337 high‐grade serous tumors

Random split

1. Loci ranking2. Cross‐validation3. Clustering and

signature generation

SS‐RPMM

rL: worse outcomerR: better outcome

Train Testoptimal number of CpG loci = 60

Analysis workflow of semi‐supervised clustering used in this study. 35

169 samples / 104 eventsL: worse outcomeR: better outcome

All the HGS samples in testing set

Kaplan Meier plot of association between groups (R or L) and recurrence time, for all the samples in testing set (n=169).

p-value=5.3E-4, HR=0.5

36


19

HGS testing samples with platinum and taxane treatment

130 samples/ 90 eventsL: worse outcomeR: better outcome

Kaplan Meier plot of association between groups (R or L) and recurrence time, for the samples in testing set and received platinum and taxane treatment (n=130).

p-value=1.20E-5, HR=0.39

37

Nearest Shrunken Centroid Method(PAM Analysis)

• Goal: In a sample with K different classes and p variables, what variables contribute to the separation of these classes?

• PAM: Shrinks each class centroid towards the overall centroid. The shrinkage factor is determined by CV.

• The shrinkage de-noises large effects while setting small ones to zero (i.e., selection of key genes)

38


20

39

Ovarian Cancer Study

• 104 of the HGS cases have Agilent gene expression data

rL (poor outcome): 48

rR (better outcome): 56

40


21

Gene expression heatmap of PAM selected genes

921 probes with higher expression in L group

1413 probes with higher expression in R group

L: worse outcomeR: better outcome

Expression heatmaps of signature genes selected by PAM analysis using shrinkage factor 1.5, which was selected based on minimum cross‐validation error.

41

What are the genes that distinguish between the molecular subtypes?

• 712 genes are over expressed in patients with poor outcome

– moderately enriched in signaling pathways, such as Wnt/beta-catenin Signaling (p-value=8.71E-5).

• 958 genes are over expressed in patients with better outcome

– extremely enriched in immune related pathways, such as Antigen Presentation Pathway (p-value=1.6E-32), Crosstalk between Dendritic Cells and Natural Killer Cells (p-value=2E-24), and Communication between Innate and Adaptive Immune Cells (p-value=5E-24).

– Might explain why this group is associated with better outcome (blessed by protection of boosted immune mechanism)

42


22

Questions?

Thank You for Attending this Workshop.

43