More Microarray Analysis: Unsupervised Approaches Matt Hibbs Troyanskaya Lab

Page 1: More Microarray Analysis: Unsupervised Approaches

More Microarray Analysis: Unsupervised Approaches

Matt Hibbs

Troyanskaya Lab

Page 2: More Microarray Analysis: Unsupervised Approaches

Outline

• Gene Expression vs. DNA applications

• A little more normalization (missing values)

• Unsupervised Analysis
– Basic Clustering
– Statistical Enrichment
– PCA/SVD
– Advanced Clustering
– Search-based Approaches

Page 3: More Microarray Analysis: Unsupervised Approaches

Expression / DNA

• Some analysis concepts are shared, but the goals are often very different

• Expression – clustering, guilt by association, functional enrichment

• DNA – signal processing, spatial relationships, motif finding

• Visualized differently (Heat maps vs. karyoscope)

Page 4: More Microarray Analysis: Unsupervised Approaches

The missing value problem

• Microarrays can have systematic or random missing values

• Some algorithms can’t deal with missing values (PCA/SVD in particular)

• Instead of hoping missing values won’t bias the analysis, better to estimate them accurately

Page 5: More Microarray Analysis: Unsupervised Approaches

Spatial Defects

Page 6: More Microarray Analysis: Unsupervised Approaches

KNN Impute

• Idea: use genes with similar expression profiles to estimate missing values

Before imputation:

Gene X | 2 | 4   | 5 | 7 | 3 | 2
Gene A | 2 |     | 5 | 7 | 3 | 1
Gene B | 8 | 9   | 2 | 1 | 4 | 9
Gene C | 3 | 5   | 6 | 7 | 3 | 2

After imputation:

Gene X | 2 | 4   | 5 | 7 | 3 | 2
Gene A | 2 | 4.3 | 5 | 7 | 3 | 1
Gene B | 8 | 9   | 2 | 1 | 4 | 9
Gene C | 3 | 5   | 6 | 7 | 3 | 2
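The KNN idea can be sketched in a few lines of Python. This is a minimal illustration, not the KNNimpute implementation itself: the function name `knn_impute`, the use of `None` for a missing value, the root-mean-square distance over shared columns, and the inverse-distance weighting are all assumptions made for the example.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries using the k most similar rows (genes)."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            neighbors = []
            for other_index, other in enumerate(rows):
                # A neighbor must be a different gene with a value in column j.
                if other_index == i or other[j] is None:
                    continue
                # Compare only the columns observed in both genes.
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                neighbors.append((dist, other[j]))
            if not neighbors:
                continue  # nothing to borrow from, so the gap stays
            neighbors.sort(key=lambda pair: pair[0])
            nearest = neighbors[:k]
            # Assumption: weight each neighbor by inverse distance.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            filled[i][j] = sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
    return filled

data = [
    [2, 4, 5, 7, 3, 2],     # Gene X
    [2, None, 5, 7, 3, 1],  # Gene A
    [8, 9, 2, 1, 4, 9],     # Gene B
    [3, 5, 6, 7, 3, 2],     # Gene C
]
filled = knn_impute(data, k=2)
print(round(filled[1][1], 2))
```

With k = 2 the estimate for Gene A comes from Gene X and Gene C, so it lands between 4 and 5. The exact value depends on k and on the weighting scheme, so it need not match the 4.3 shown on the slide.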

Page 7: More Microarray Analysis: Unsupervised Approaches

Imputation affects downstream analysis

[Figure panels: complete data set; data set with 30% of entries missing and filled with zeros (zero values appear black); data set with missing values estimated by the KNNimpute algorithm]

Page 8: More Microarray Analysis: Unsupervised Approaches

Unsupervised Analysis

• Supervised techniques are great if you have starting information (e.g. labels)
– But we often don’t know enough beforehand to apply these methods

• Unsupervised techniques are exploratory
– Let the data organize itself, then try to find biological meaning
– Approaches to understand the whole data set
– Visualization is often helpful

Page 9: More Microarray Analysis: Unsupervised Approaches

Clustering

• Let the data organize itself

• Reordering of genes (or conditions) in the dataset so that similar patterns are next to each other (or in separate groups)

• Identify subsets of genes (or experiments) that are related by some measure

Page 10: More Microarray Analysis: Unsupervised Approaches

Quick Example

[Figure: example data matrix; axes: Genes, Conditions]

Page 11: More Microarray Analysis: Unsupervised Approaches

Why cluster?

• “Guilt by association” – if unknown gene X is similar in expression to known genes A and B, maybe they are involved in the same/related pathway

• Visualization: datasets are too large to interpret without reorganizing the data

Page 12: More Microarray Analysis: Unsupervised Approaches

Clustering Techniques

• Algorithm (Method)
– Hierarchical
– K-means
– Self Organizing Maps
– QT-Clustering
– NNN
– ...

• Distance Metric
– Euclidean (L2)
– Pearson Correlation
– Spearman Correlation
– Manhattan (L1)
– Kendall’s τ
– ...

Page 13: More Microarray Analysis: Unsupervised Approaches

Distance Metrics

• Choice of distance measure is important for most clustering techniques

• Pair-wise metrics – compare vectors of numbers
– e.g. genes x & y, each with n measurements

Euclidean Distance

Pearson Correlation

Spearman Correlation
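The three metrics named above can be written out directly. The sketch below uses only the standard library; the function names are chosen for this example, and it assumes both vectors have the same length with no missing values. Note that the two correlations are similarities (1 means identical shape), so clustering tools commonly convert them to a distance such as 1 - r; that conversion is an assumption here, not something the slides state.

```python
import math

def euclidean(x, y):
    """Straight-line (L2) distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a constant profile."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Average ranks (1-based), so tied values share a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average
        start = end + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Euclidean distance is sensitive to magnitude, Pearson to linear shape, and Spearman only to rank order, which is why the choice changes which genes end up looking similar.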

Page 14: More Microarray Analysis: Unsupervised Approaches

Distance Metrics

Euclidean Distance

Pearson Correlation

Spearman Correlation

Page 15: More Microarray Analysis: Unsupervised Approaches

Hierarchical clustering

• Imposes (pair-wise) hierarchical structure on all of the data

• Often good for visualization

• Basic Method (agglomerative):
1. Calculate all pair-wise distances
2. Join the closest pair
3. Calculate the joined pair’s distance to all others
4. Repeat from 2 until all joined
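The four steps can be sketched as a short, deliberately naive Python function. The names `agglomerate` and `cluster_distance` are made up for this example, Euclidean distance is just a default, and the use of average linkage for step 3 is an assumption. It recomputes every distance on each pass, which is fine for illustration but far too slow for a real dataset.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerate(profiles, distance=euclidean):
    """Return the merges in order as (members_a, members_b, distance)."""
    # Each cluster is a tuple of row indices; start with one cluster per gene.
    clusters = [(i,) for i in range(len(profiles))]
    merges = []

    def cluster_distance(a, b):
        # Assumption: average linkage (mean of all pair-wise distances).
        total = sum(distance(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Steps 1 and 3: distances between all current clusters.
        dist, a, b = min(
            ((cluster_distance(a, b), a, b)
             for index, a in enumerate(clusters)
             for b in clusters[index + 1:]),
            key=lambda item: item[0],
        )
        # Step 2: join the closest pair into one cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, dist))
        # Step 4: loop until a single cluster remains.
    return merges

profiles = [[0, 0], [0, 1], [10, 10], [10, 11]]
for a, b, dist in agglomerate(profiles):
    print(a, "+", b, "at distance", round(dist, 2))
```

The ordered list of merges is exactly the information a dendrogram draws: each merge is an interior node, and its distance is the node's height.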

Page 16: More Microarray Analysis: Unsupervised Approaches

Hierarchical clustering

Page 17: More Microarray Analysis: Unsupervised Approaches

Hierarchical clustering

Page 18: More Microarray Analysis: Unsupervised Approaches

Hierarchical clustering

Page 19: More Microarray Analysis: Unsupervised Approaches

Hierarchical clustering

Page 20: More Microarray Analysis: Unsupervised Approaches

Hierarchical clustering

Page 21: More Microarray Analysis: Unsupervised Approaches

Hierarchical clustering

Page 22: More Microarray Analysis: Unsupervised Approaches

HC – Interior Distances

• Three typical variants to calculate interior distances within the tree
– Average linkage: mean/median over all possible pair-wise values
– Single linkage: minimum pair-wise distance
– Complete linkage: maximum pair-wise distance
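A small sketch makes the difference between the three variants concrete. The function names are illustrative, Euclidean distance is assumed as the underlying metric, and the average variant here uses the mean (not the median).

```python
import math
from statistics import mean

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(cluster_a, cluster_b):
    """All distances between a member of one cluster and a member of the other."""
    return [euclidean(x, y) for x in cluster_a for y in cluster_b]

def average_linkage(cluster_a, cluster_b):
    return mean(pairwise(cluster_a, cluster_b))

def single_linkage(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))

# Two tiny one-dimensional clusters: pair-wise distances are 4, 6, 3 and 5.
left = [[0], [1]]
right = [[4], [6]]
print(single_linkage(left, right))    # 3.0
print(average_linkage(left, right))   # 4.5
print(complete_linkage(left, right))  # 6.0
```

Single linkage tends to chain clusters together through their closest members, while complete linkage favors compact clusters; average linkage sits between the two.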

Page 23: More Microarray Analysis: Unsupervised Approaches

Hierarchical clustering: problems

• Hard to define distinct clusters

• Genes assigned to clusters on the basis of all experiments

• Optimizing node ordering is hard (finding the optimal solution is NP-hard)

• Can be driven by one strong cluster – a problem for gene expression because data in row space is often highly correlated

Page 24: More Microarray Analysis: Unsupervised Approaches

HC: Real Example

• Demo in JavaTreeView & HIDRA
– Spellman et al., 1998: yeast alpha-factor sync cell cycle time course

Page 25: More Microarray Analysis: Unsupervised Approaches

HC: Another Example

• Expression of tumors hierarchically clustered

• Expression groups by clinical class

Garber et al.

Page 26: More Microarray Analysis: Unsupervised Approaches

K-means Clustering

• Groups genes into a pre-defined number of independent clusters

• Basic algorithm:
1. Define k = number of clusters
2. Randomly initialize each cluster with a seed (often with a random gene)
3. Assign each gene to the cluster with the most similar seed
4. Recalculate all cluster seeds as means (or medians) of genes assigned to the cluster
5. Repeat 3 & 4 until convergence (e.g. no genes move, means don’t change much, etc.)
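The five steps map almost line for line onto code. This is a minimal sketch under stated assumptions: Euclidean distance as the similarity measure, means (not medians) as seeds, "no genes move" as the convergence test, and a fixed random seed so runs are repeatable. The function and parameter names are invented for the example.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(profiles, k, max_iterations=100, seed=0):
    """Return (assignment, centers) for a list of expression profiles."""
    rng = random.Random(seed)
    # Steps 1 and 2: k is given; seed each cluster with a randomly chosen gene.
    centers = [list(p) for p in rng.sample(profiles, k)]
    assignment = None
    for _ in range(max_iterations):
        # Step 3: assign each gene to the cluster with the closest seed.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(profile, centers[c]))
            for profile in profiles
        ]
        # Step 5: stop when no gene changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each seed as the mean of its assigned genes.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # keep the old seed if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centers = kmeans(profiles, k=2)
print(assignment)
```

Because the seeds are random, different runs can give different clusterings on real data; running several times with different seeds and keeping the tightest result is a common precaution.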

Page 27: More Microarray Analysis: Unsupervised Approaches

K-means example

Page 28: More Microarray Analysis: Unsupervised Approaches

K-means example

Page 29: More Microarray Analysis: Unsupervised Approaches

K-means example

Page 30: More Microarray Analysis: Unsupervised Approaches

K-means: problems

• Have to set k ahead of time
– Ways to choose “optimal” k: minimize within-cluster variation compared to random data or held out data

• Each gene only belongs to exactly 1 cluster

• One cluster has no influence on the others (one dimensional clustering)

• Genes assigned to clusters on the basis of all experiments

Page 31: More Microarray Analysis: Unsupervised Approaches

K-means: Real Example

• Demo in TIGR MeV
– Spellman et al. alpha-factor cell cycle

Page 32: More Microarray Analysis: Unsupervised Approaches

Clustering “Tweaks”

• Fuzzy clustering – allows genes to be “partially” in different clusters

• Dependent clusters – consider between-cluster distances as well as within-cluster

• Bi-clustering – look for patterns across subsets of conditions
– Very hard problem (NP-complete)
– Practical solutions use heuristics/simplifications that may affect biological interpretation

Page 33: More Microarray Analysis: Unsupervised Approaches

Cluster Evaluation

• Mathematical consistency
– Compare coherency of clusters to background

• Look for functional consistency in clusters
– Requires a gold standard, often based on GO, MIPS, etc.

• Evaluate likelihood of enrichment in clusters
– Hypergeometric distribution, etc.
– Several tools available

Page 34: More Microarray Analysis: Unsupervised Approaches

Gene Ontology

• Organization of curated biological knowledge
– 3 branches: biological process, molecular function, cellular component

Page 35: More Microarray Analysis: Unsupervised Approaches

Hypergeometric Distribution

• Probability of observing x or more genes in a cluster of n genes with a common annotation

– N = total number of genes in genome
– M = number of genes with annotation
– n = number of genes in cluster
– x = number of genes in cluster with annotation

• Multiple hypothesis correction required if testing multiple functions (Bonferroni, FDR, etc.)

• Additional genes in clusters with strong enrichment may be related
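The enrichment probability can be computed directly from binomial coefficients using the N, M, n, x defined above. The sketch below is a plain standard-library illustration; the function names and the example numbers are hypothetical, and Bonferroni is shown only as the simplest of the corrections mentioned.

```python
from math import comb

def enrichment_p_value(N, M, n, x):
    """P(X >= x): chance of seeing x or more annotated genes in a cluster.

    Sums C(M, i) * C(N - M, n - i) / C(N, n) for i = x .. min(n, M).
    """
    upper = min(n, M)
    favorable = sum(comb(M, i) * comb(N - M, n - i) for i in range(x, upper + 1))
    return favorable / comb(N, n)

def bonferroni(p_value, number_of_tests):
    """Simplest multiple-hypothesis correction: scale by the number of tests."""
    return min(1.0, p_value * number_of_tests)

# Hypothetical numbers: 6000 genes, 100 carry the annotation,
# and a 50-gene cluster contains 10 of them.
p = enrichment_p_value(N=6000, M=100, n=50, x=10)
print(p, bonferroni(p, number_of_tests=1000))
```

By chance alone a 50-gene cluster would be expected to hold fewer than one annotated gene here, so finding 10 gives a very small p-value even after correction.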

Page 36: More Microarray Analysis: Unsupervised Approaches

GO term Enrichment Tools

• SGD’s & Princeton’s GoTermFinder
– http://go.princeton.edu

• GOLEM (http://function.princeton.edu/GOLEM)

• HIDRA

Sealfon et al., 2006

Page 37: More Microarray Analysis: Unsupervised Approaches

More Unsupervised Methods

• Search-based approaches
– Starting with a query gene/condition, find most related group

• Singular Value Decomposition (SVD) & Principal Component Analysis (PCA)
– Decomposition of data matrix into “patterns”, “weights” and “contributions”
– Real names are “principal components”, “singular values” and “left/right eigenvectors”
– Used to remove noise, reduce dimensionality, identify common/dominant signals
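The search-based idea from the first bullet can be illustrated with a simple ranking: score every gene against a query gene and return the best matches. This is only a sketch of the general approach, not how any particular tool works; Pearson correlation as the score, the function names, and the toy data are all assumptions.

```python
import math

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def search(query_name, expression, top=3):
    """Rank all other genes by correlation with the query gene's profile."""
    query = expression[query_name]
    scores = [(pearson(query, profile), name)
              for name, profile in expression.items() if name != query_name]
    scores.sort(reverse=True)  # most correlated first
    return [(name, score) for score, name in scores[:top]]

# Toy data with hypothetical gene names.
expression = {
    "query": [1, 2, 3, 4],
    "up": [2, 4, 6, 9],
    "down": [4, 3, 2, 1],
    "flat": [1, 3, 1, 3],
}
for name, score in search("query", expression):
    print(name, round(score, 2))
```

Unlike clustering, this looks only at the neighborhood of one gene of interest, so it never has to decide how many groups the whole dataset contains.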

Page 38: More Microarray Analysis: Unsupervised Approaches

SVD (& PCA)

• SVD is the method, PCA is performing SVD on centered data

• Projects data into another orthonormal basis

• New basis ordered by variance explained

X = U Σ Vt

– X: original data matrix
– U: “eigen-conditions”
– Σ: singular values
– Vt: “eigen-genes”
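The decomposition can be tried out with NumPy's `numpy.linalg.svd`. The sketch below assumes genes are rows and conditions are columns, and that "centered" means subtracting each column's mean; the helper names `svd_patterns` and `low_rank` are made up for this example.

```python
import numpy as np

def svd_patterns(X, center=True):
    """Decompose X into U, singular values s, and Vt, plus variance explained."""
    X = np.asarray(X, dtype=float)
    if center:
        # Assumption: PCA-style centering subtracts each column's mean.
        X = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Singular values come back sorted, largest first.
    variance_explained = s ** 2 / np.sum(s ** 2)
    return U, s, Vt, variance_explained

def low_rank(U, s, Vt, rank):
    """Rebuild the data from only the first `rank` patterns (noise removal)."""
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

# Toy matrix: the second condition is exactly twice the first.
X = [[1, 2], [2, 4], [3, 6], [4, 8]]
U, s, Vt, variance_explained = svd_patterns(X)
print(variance_explained)
```

In this toy matrix a single pattern captures essentially all of the variance, which is the situation dimensionality reduction exploits: keep the first few patterns and discard the rest.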

Page 39: More Microarray Analysis: Unsupervised Approaches

SVD

SVD

Page 40: More Microarray Analysis: Unsupervised Approaches

SVD: Real Example

• Demo in TIGR MeV
– Spellman et al., 1998 cell cycle time courses
  • alpha-factor sync
  • cdc15 sync

Page 41: More Microarray Analysis: Unsupervised Approaches

DNA arrays / Sequence-based Analysis

• Methods so far focused on expression data

• Other uses of microarrays often sequence based: CGH, ChIP-chip, SNP scanner
– Data has important, inherent order
– Most analysis methods developed from signal processing techniques (e.g. sound)
– View data in chromosomal order (karyoscope)

• Tools: JavaTreeView, IGB, ChIPpy

Page 42: More Microarray Analysis: Unsupervised Approaches

CGH Example

• Demo in JavaTreeView

Page 43: More Microarray Analysis: Unsupervised Approaches

Aneuploidy affects expression too

rpl20arpl20a, Chromosome XV

(data from Hughes et al. (2000))

Page 44: More Microarray Analysis: Unsupervised Approaches

Software Tools

• JavaTreeView – viz, karyoscope

• HIDRA – viz, mult. datasets, search

• Cluster (Eisen lab) – clustering

• TIGR MeV – clustering, viz

• IGB – Affy’s CGH browser

• ChIPpy – ChIP-chip analysis

Page 45: More Microarray Analysis: Unsupervised Approaches

Summary

• Unsupervised Analysis
– Let the data organize itself, find patterns
– Clustering: Distance Metric + Algorithm
– SVD/PCA – automatically find dominant patterns

• Impute missing values (KNN)

• CGH – Karyoscope view

• Questions?