Statistical Genomics and Bioinformatics Workshop, 8/16/2013
Statistical Genomics and Bioinformatics Workshop:
Genetic Association and RNA-Seq Studies
Genomic Clustering and Signature Development
Brooke L. Fridley, PhD
University of Kansas Medical Center
Genomic Clustering
Clustering Basics
• Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
– It is also called unsupervised learning.
• Exploratory tool: use these methods for visualization, hypothesis generation, selection of genes for further consideration
– We should not use these methods inferentially.
• Hierarchical clustering specifically: we are provided with a picture from which we can make many/any conclusions.
Why cluster genes?
• Identify groups of possibly co-regulated genes
• Identify typical temporal or spatial gene expression patterns
• Arrange a set of genes in a linear order that is at least not totally meaningless (we hope).
• Aids in the interpretation
Why cluster samples?
• Quality control: Detect experimental artifacts/bad hybridizations
• Check whether samples are grouped according to known categories (though this might be better addressed using a supervised approach: statistical tests, classification)
• Identify new “classes” of biological samples
– Tumor subtypes
– Disease heterogeneity
Human breast tumors cluster into 6 distinct molecular subtypes of breast cancer with differences in patient survival.
Hallett, et al. (2012) A gene signature for predicting outcome in patients with basal-like breast cancer. Scientific Reports.
Clustering vs Classification
• Clustering is ‘unsupervised’:
– We don’t use any information about what class the samples belong to (e.g. disease status, cancer type) to determine cluster structure
– Hierarchical, K-Means, PCA, SOM, model-based clustering
– Clustering finds groups in the data
• Classification methods are ‘supervised’:
– Identifying to which of a set of groups/categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
– Discriminant analysis, PAM (Shrunken centroids), Random Forests/CART, K-Nearest Neighbor
– Classification methods find ‘classifiers’
Cluster analysis
Generally, cluster analysis is based on two components:
1. Distance measure: Quantification of dissimilarity / similarity of objects.
2. Cluster algorithm: A procedure to group objects. Aim: small within-cluster distances, large between-cluster distances.
Distance and Similarity
• Every clustering method is based solely on the measure of distance or similarity.
– The clustering is only as good as the distance matrix
• Generally, not enough thought and time is spent on choosing and estimating the distance/similarity matrix.
– Applying correlation to highly skewed data will provide misleading results.
– Applying Euclidean distance to data measured on a categorical scale will be invalid.
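How much the choice of distance matters can be seen in a small sketch. The three profiles below are hypothetical values invented for illustration, and the helper names are not taken from any particular library. Euclidean distance responds to magnitude, while a correlation-based distance responds only to the shape of the profile, so the two can disagree about which genes are "close".

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: near 0 when two profiles have the same shape."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

# Hypothetical profiles: gene_b has the same shape as gene_a at 10x the scale;
# gene_c is close to gene_a in magnitude but trends in the opposite direction.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

print("Euclidean   a-b:", euclidean(gene_a, gene_b), " a-c:", euclidean(gene_a, gene_c))
print("Correlation a-b:", pearson_distance(gene_a, gene_b), " a-c:", pearson_distance(gene_a, gene_c))
```

Under Euclidean distance gene_c is the nearer neighbor of gene_a; under the correlation-based distance gene_b is. Neither answer is wrong in itself; which one is appropriate depends on the question being asked of the data.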
Hierarchical Clustering
• The most overused statistical method in gene expression analysis
• Tends to be pretty unstable.
– Many different ways to perform hierarchical clustering
– Sensitive to small changes in the data
• Provided with clusters of every size
– Where to “cut” the dendrogram is user-determined
Huse, Kwon, et al. (2010) Parallel Evolution in Pseudomonas aeruginosa over 39,000 generations in vivo. mBio. 1(4).
Distances between clusters used for hierarchical clustering
• Distance between two clusters is based on the pairwise distances between members of the clusters.
• Complete linkage: largest distance
• Average linkage: average distance
• Single linkage: smallest distance
• Complete linkage gives preference to compact / spherical clusters.
• Single linkage can produce long stretched clusters.
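The linkage rules above can be sketched as a small agglomerative procedure. This is a minimal illustration written for this purpose, not the implementation of any particular package; the function names and the four one-dimensional positions are hypothetical. It repeatedly merges the two closest clusters, where "closest" is defined by the chosen linkage.

```python
def cluster_distance(a, b, dist, linkage):
    """Distance between clusters a and b (lists of point indices)."""
    pairwise = [dist[i][j] for i in a for j in b]
    if linkage == "complete":
        return max(pairwise)                   # largest pairwise distance
    if linkage == "single":
        return min(pairwise)                   # smallest pairwise distance
    if linkage == "average":
        return sum(pairwise) / len(pairwise)   # average pairwise distance
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(dist, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Hypothetical points on a line with slowly growing gaps (1.0, 1.1, 1.2).
positions = [0.0, 1.0, 2.1, 3.3]
dist = [[abs(a - b) for b in positions] for a in positions]

for method in ("single", "average", "complete"):
    print(method, agglomerate(dist, n_clusters=2, linkage=method))
```

On this toy input single linkage chains the first three points together and leaves the last one alone, while complete and average linkage split the points into two pairs, which illustrates the "long stretched clusters" behavior noted above.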
Hierarchical Clustering Example: Visualizing DNA Methylation Data
Christensen et al. PLoS Genetics, 2009
[Figure: heatmap with CpG loci on one axis and tissue samples on the other; color scale from unmethylated to methylated.]
Unsupervised hierarchical clustering based on Manhattan distance and average linkage.
K-Means
• Intuitive and very easy to implement
• Pre-specification of the number of clusters K
- K is typically unknown for most practical purposes
- Misspecification may lead to poor results
• Choice of a distance measure may be difficult to justify
• Clusters are expected to be of similar size
• May not work well for irregular clusters
- Clusters based on the first-moment (i.e., mean) only
K-means clustering illustration
[Figure series: scatter plot of the observations on Gene A vs. Gene B axes, with the current cluster means m1 and m2 marked at each iteration.]
• Need to first pre-specify the number of clusters
• For simplicity, assume k = 2
• Step 1: Initialize the means for the two clusters, m1(1) and m2(1)
• Compute the distance of each of the points to each of the two means
• Assign points to the cluster with the closest mean
• Re-compute the means based on the observations within that cluster (i.e., m1(2) and m2(2))
• Compute the distance of each of the points to each of the two means
• Assign points to the cluster with the closest mean
• After 5 iterations, we converge on our final solution, with means m1(5) and m2(5)!
• Consists of class labels for each of the n observations
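The iteration illustrated above can be sketched in a few lines. This is a minimal version of the standard alternating procedure (often called Lloyd's algorithm), assuming Euclidean distance; the data points and the starting means are hypothetical values chosen for illustration, not the ones in the figures.

```python
import math

def kmeans(points, initial_means, max_iter=100):
    """Alternate assignment and mean update until the labels stop changing."""
    means = [tuple(m) for m in initial_means]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the cluster with the closest mean.
        new_labels = [
            min(range(len(means)), key=lambda k: math.dist(p, means[k]))
            for p in points
        ]
        if new_labels == labels:   # converged: assignments did not change
            break
        labels = new_labels
        # Re-compute each mean from the observations currently in that cluster.
        for k in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:            # an empty cluster keeps its previous mean
                means[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, means

# Hypothetical (Gene A, Gene B) measurements and starting means, k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
labels, means = kmeans(points, initial_means=[(0.0, 0.0), (5.0, 5.0)])
print(labels)
print(means)
```

The result depends on the starting means, so in practice the procedure is usually run from several random starts and the solutions are compared.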
Cautions about Clustering
• Clustering can be a useful exploratory tool
• Cluster results are very sensitive to noise in the data
• Need to assess cluster structure and stability of results
• Different clustering approaches can give quite different results
– Methods
– Number of clusters
– Distance measures
• For hierarchical clustering, interpretation is almost always subjective
• Clustering doesn’t tell us anything about which features should be used for clustering
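One simple way to quantify how much two clusterings of the same samples differ is the Rand index: the fraction of sample pairs on which the two results agree (both put the pair together, or both keep it apart). The sketch below is an illustration only; the label vectors are hypothetical, and other agreement measures could be used instead.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs treated the same way by two clusterings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical results from three runs on the same four samples.
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]   # same grouping as run_1, only the label names differ
run_3 = [0, 0, 0, 1]   # a genuinely different grouping

print(rand_index(run_1, run_2))
print(rand_index(run_1, run_3))
```

Comparing labels pair by pair avoids the problem that cluster numbers are arbitrary: run_1 and run_2 describe the same partition even though no label matches.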
Genomic Classification
[Figure: supervised classification workflow. A learning set of objects (arrays) with feature vectors (gene expression) and predefined classes (clinical outcome: bad prognosis, recurrence < 5 yrs; good prognosis, recurrence > 5 yrs) is used to build a classification rule. The rule assigns a new array of unknown class ("?") to a class, e.g. good prognosis, metastasis > 5.]
Reference: L van’t Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
Decision tree classifiers (CART)
[Figure: decision tree with a split on Gene 1 (Mi1 < -0.67) and a split on Gene 2 (Mi2 > 0.18), yes/no branches, and leaves labeled with classes 0, 2, and 1.]
Transparent rules and easy to interpret and implement
G1     0.1   -0.2   0.3
G2     0.3    0.4   0.4
G3     …      …     …
Class  0      1     0
Ensemble classifiers
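[Figure: the training set (X1, X2, … X100) is resampled 500 times (Resample 1, Resample 2, …, Resample 499, Resample 500). One classifier is built on each resample (Classifier 1, …, Classifier 500), and the 500 classifiers are combined into an aggregate classifier.]
Examples: Bagging, Boosting, Random Forest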
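The resample-and-aggregate idea can be sketched with bagging, the simplest of the examples listed. The sketch below is an assumption-laden illustration: it uses a nearest-centroid rule as the base classifier (any classifier could be substituted), a majority vote as the aggregation step, and a tiny hypothetical training set.

```python
import math
import random
from collections import Counter

def fit_centroids(X, y):
    """Base classifier: one mean vector (centroid) per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

def bagged_predict(X, y, x_new, n_resamples=500, seed=0):
    """Train one classifier per bootstrap resample and take a majority vote."""
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        model = fit_centroids([X[i] for i in idx], [y[i] for i in idx])
        votes[predict_centroid(model, x_new)] += 1
    return votes.most_common(1)[0][0]

# Hypothetical two-gene training set with two classes.
X = [(0.1, 0.3), (-0.2, 0.4), (0.3, 0.4), (2.0, 2.1), (2.2, 1.9), (1.8, 2.3)]
y = [0, 0, 0, 1, 1, 1]
print(bagged_predict(X, y, x_new=(1.9, 2.0)))
```

Boosting and random forests follow the same aggregate pattern but differ in how the resamples are drawn and how the individual classifiers are built and weighted.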
Feature Selection
• Can lead to better classification performance by removing variables that are noise with respect to the outcome
• May provide useful biological insights
• Can eventually lead to diagnostic tests
Classifier Performance Assessment
• Any classification rule needs to be evaluated for its performance on future samples.
• One needs to estimate future performance based on what is available.
• Assessing performance of the classifier based on
– Cross-validation
    • V-fold CV
    • Leave-one-out cross validation (LOOCV)
– Training vs Testing set
– Testing on independent dataset
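V-fold cross-validation can be sketched as follows: split the samples into V folds, hold each fold out in turn, train on the rest, and count the errors on the held-out fold. Setting V equal to the number of samples gives LOOCV. The nearest-centroid classifier and the data are hypothetical stand-ins chosen to keep the example short.

```python
import math
import random

def fit_centroids(X, y):
    """Stand-in classifier: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

def cross_val_error(X, y, v, seed=0):
    """V-fold CV estimate of the misclassification rate (v == len(X) is LOOCV)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::v] for f in range(v)]        # v roughly equal folds
    errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in idx if i not in held_out]
        # The classifier is rebuilt from scratch without the held-out fold.
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in test_idx)
    return errors / len(X)

# Hypothetical two-gene data set with two well-separated classes.
X = [(0.0, 0.2), (0.3, -0.1), (-0.2, 0.1), (0.1, 0.4),
     (3.0, 3.2), (2.8, 2.9), (3.3, 3.1), (3.1, 2.7)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print("4-fold CV error:", cross_val_error(X, y, v=4))
print("LOOCV error:", cross_val_error(X, y, v=len(X)))
```

The key property is that every sample is predicted by a classifier that never saw it during training, which is what makes the error estimate relevant to future samples.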
Diagram of performance assessment
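[Figure: three routes to performance assessment. Resubstitution estimation: the classifier is built on the training set and assessed on that same set. Test set estimation: the classifier is built on the training set and assessed on an independent test set. Cross validation: the classifier is built on the (CV) learning set and assessed on the (CV) test set.]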
Pharmacogenomic (PGx) Classifiers
Benefits:
• Enables patients to be treated with drugs that actually work for them
• Avoids false negative trials for heterogeneous populations
• Avoids erroneous generalizations of conclusions from positive trials
1. Develop a PGx classifier to determine patients likely to benefit from a new treatment (TRT)
2. Establish reproducibility of the PGx classifier
3. Use the PGx classifier to design and analyze a new clinical trial to evaluate effectiveness of TRT in the overall population or pre-defined subsets determined by the classifier.
Coming Full Circle: Integrating Many Methods & Data Types
Molecular Phenotype Based GWAS
• Molecular Subtype GWAS:
– For risk with existing controls
– For clinical outcome
– For quantitative trait
[Figure: all cases are divided, based on clustering, into Molecular Case Subtype A and Molecular Case Subtype B; a GWAS is run within each subtype to identify novel subtype-specific loci.]
Example: Integrative Analysis
Multiple Methods and Multiple Data Types
• Association analysis to determine epigenetic features associated with clinical outcome (TTR)
• Model-based clustering (semi-supervised) to determine clinically relevant methylation-based subtypes
• Nearest shrunken centroid (PAM) (supervised) analysis to determine genes in which mRNA levels differ between methylation subgroups
• Pathway analysis of the resulting differentially expressed genes to determine enriched pathways
Application to Ovarian Cancer
• Restricted to High Grade Serous (HGS) histology
• Pre-chemo tumor sample
• 450K Illumina Methylation Array
• Similar stage and recurrence status between testing and training data sets
[Figure: 337 HGS ovarian cancer tumors split into a training set (N=168, recurrences = 110) and a testing set (N=169, recurrences = 116).]
Wang, Cicek, … , Fridley, Goode (2013). Tumor Hypo-Methylation at 6p21.3 Associates with Longer Time to Recurrence of High-Grade Serous Epithelial Ovarian Cancer. Under review.
Epigenetics
[Figure: Step 1, Step 2, Step 3, Step 4.]
[Figure: 337 high-grade serous tumors are randomly split into a training set (168) and a testing set (169). Steps: 1. Loci ranking; 2. Cross-validation; 3. Clustering and signature generation (SS-RPMM). Resulting groups: rL, worse outcome; rR, better outcome. Optimal number of CpG loci = 60.]
Analysis workflow of semi-supervised clustering used in this study.
All the HGS samples in testing set
[Figure: Kaplan-Meier curves for groups L (worse outcome) and R (better outcome); 169 samples / 104 events; p-value = 5.3E-4, HR = 0.5.]
Kaplan-Meier plot of association between groups (R or L) and recurrence time, for all the samples in the testing set (n=169).
HGS testing samples with platinum and taxane treatment
[Figure: Kaplan-Meier curves for groups L (worse outcome) and R (better outcome); 130 samples / 90 events; p-value = 1.20E-5, HR = 0.39.]
Kaplan-Meier plot of association between groups (R or L) and recurrence time, for the samples in the testing set that received platinum and taxane treatment (n=130).
Nearest Shrunken Centroid Method (PAM Analysis)
• Goal: In a sample with K different classes and p variables, what variables contribute to the separation of these classes?
• PAM: Shrinks each class centroid towards the overall centroid. The shrinkage factor is determined by CV.
• The shrinkage de-noises large effects while setting small ones to zero (i.e., selection of key genes)
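The shrinkage step can be sketched with soft thresholding. The version below is deliberately simplified and should be read as an illustration of the idea only: it shrinks the raw difference between each class centroid and the overall centroid, and omits the per-gene standardization that the full PAM method applies before thresholding. The data, class labels, and threshold `delta` are hypothetical; in practice the threshold would be chosen by cross-validation, as noted above.

```python
def soft_threshold(d, delta):
    """Move d toward zero by delta; values within delta of zero become 0."""
    if d > delta:
        return d - delta
    if d < -delta:
        return d + delta
    return 0.0

def shrunken_centroids(X, y, delta):
    """Shrink each class centroid toward the overall centroid, gene by gene."""
    n_genes = len(X[0])
    overall = [sum(row[g] for row in X) / len(X) for g in range(n_genes)]
    shrunk = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroid = [sum(r[g] for r in rows) / len(rows) for g in range(n_genes)]
        shrunk[label] = [
            overall[g] + soft_threshold(centroid[g] - overall[g], delta)
            for g in range(n_genes)
        ]
    return overall, shrunk

def selected_genes(overall, shrunk):
    """Genes whose shrunken centroid still differs from the overall centroid."""
    return [
        g for g in range(len(overall))
        if any(c[g] != overall[g] for c in shrunk.values())
    ]

# Hypothetical data: rows are samples, columns are genes. Gene 0 separates the
# two classes strongly; gene 1 differs only slightly between them.
X = [[2.0, 0.1], [2.2, -0.1], [-2.0, 0.0], [-2.2, 0.2]]
y = ["L", "L", "R", "R"]
overall, shrunk = shrunken_centroids(X, y, delta=0.5)
print(shrunk)
print(selected_genes(overall, shrunk))
```

With this threshold the small class difference in gene 1 is set to zero, so only gene 0 remains in the classifier; a larger threshold removes more genes.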
Ovarian Cancer Study
• 104 of the HGS cases have Agilent gene expression data
rL (poor outcome): 48
rR (better outcome): 56
Gene expression heatmap of PAM selected genes
[Figure: expression heatmap; 921 probes with higher expression in the L group and 1413 probes with higher expression in the R group. L: worse outcome; R: better outcome.]
Expression heatmaps of signature genes selected by PAM analysis using shrinkage factor 1.5, which was selected based on minimum cross-validation error.
What are the genes that distinguish between the molecular subtypes?
• 712 genes are overexpressed in patients with poor outcome
– moderately enriched in signaling pathways, such as Wnt/beta-catenin Signaling (p-value=8.71E-5).
• 958 genes are overexpressed in patients with better outcome
– extremely enriched in immune related pathways, such as Antigen Presentation Pathway (p-value=1.6E-32), Crosstalk between Dendritic Cells and Natural Killer Cells (p-value=2E-24), and Communication between Innate and Adaptive Immune Cells (p-value=5E-24).
– Might explain why this group is associated with better outcome (possible protection from a boosted immune response)
Questions?
Thank You for Attending this Workshop.