MeV Advanced

Analysis of Multiple Experiments

TIGR Multiple Experiment Viewer (MeV)

Advanced Course Coverage

• Introduction - fundamental concepts, expression vectors and distance metrics - fundamental statistical concepts encountered in MeV analysis modules

• Algorithm Coverage-Lecture / Hands on Exercises (refer to algorithm handout for order…)

Microarray Data Flow

[Figure: microarray data flow diagram. Labels recoverable from the figure:
Microarray Printers: IAS-1, IAS-2, MD, MD3, Lucidea, Others.
Microarray Scanners: Axon-1, Axon-2, ScanArray, Others.
Software: Spotfinder (Image Analysis), MIDAS (Normalization), MeV (Data Analysis), Miner (.tav File Creator), MABCOS (Barcode System), Exp Designer, MADAM (Data Manager), Scheduler (Machine Scheduling), SliTrack (Machine Control), GenePix Converter, PCR Score.
Other labels: Probe Source, Data Entry Pages, Query Window, Reports, Probe, Study, Slide, Slidetype, Scan, Hybridization, Experiment, Expression Analysis.
Databases: MAD, MAGE-ML Database, MUSAGE Database, Others.
Files: .tiff Image File, Raw .tav File, Normalized .tav File, Interpretation.]

The Institute for Genomic Research (TIGR)

Expression Data

The Expression Matrix is a representation of data from multiple microarray experiments.

Each element is a log ratio (usually log2(Cy5 / Cy3)).

Red indicates a positive log ratio, i.e., Cy5 > Cy3.

Green indicates a negative log ratio, i.e., Cy5 < Cy3.

Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value.

[Figure: expression matrix shown as a colored grid, rows Gene 1 to Gene 6, columns Exp 1 to Exp 6.]

Gray indicates missing data.

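Expression Vectors

-Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.

Example vector of log2(Cy5/Cy3) values over 8 experiments:

-0.8 0.8 1.5 1.8 0.5 -1.3 -0.4 1.5

[Figure: line plot of the vector above; x-axis: experiments 1 to 8; y-axis: log2(Cy5/Cy3), from -2 to 2.]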

Expression Vectors As Points in ‘Expression Space’

[Figure: a table of log ratios for genes G1 to G5 over Exp 1, Exp 2 and Exp 3, next to a 3-D plot with axes Experiment 1, Experiment 2 and Experiment 3. Each gene is one point; genes with similar expression lie close together.]

Distance and Similarity

-the ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms

-distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression

-selection of a distance metric defines the concept of distance

Distance: a measure of dissimilarity between genes; the smaller the distance, the more similar the expression patterns.


Gene A: $x_{1A}, x_{2A}, x_{3A}, x_{4A}, x_{5A}, x_{6A}$

Gene B: $x_{1B}, x_{2B}, x_{3B}, x_{4B}, x_{5B}, x_{6B}$

Some distances: (MeV provides 11 metrics)

1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$

2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$

3. Pearson correlation
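The three metrics named above can be sketched in a few lines of Python. This is an illustrative sketch, not MeV's implementation; the function names and the two example vectors are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_r(a, b):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical example: gene_b has the same shape as gene_a at half the amplitude.
gene_a = [-0.8, 0.8, 1.5, 1.8, 0.5, -1.3]
gene_b = [v * 0.5 for v in gene_a]

print(euclidean(gene_a, gene_b))   # non-zero: the magnitudes differ
print(manhattan(gene_a, gene_b))
print(pearson_r(gene_a, gene_b))   # close to 1: the patterns have the same shape
```

The example illustrates why the choice of metric matters: the two genes are far apart in Euclidean terms but almost perfectly correlated.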

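Distance is Defined by a Metric

[Figure: two pairs of expression profiles plotted as log2(Cy5/Cy3) on a scale from -2 to 2, with the distance D for each pair reported under two distance metrics. Values shown: Euclidean 4.2 and 1.4; Pearson (r*-1) -1.00 and -0.90.]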

Statistical Concepts

Probability distributions

The probability of an event is the likelihood of its occurring. It is sometimes computed as a relative frequency (rf), where

$$rf = \frac{\text{the number of “favorable” outcomes for an event}}{\text{the total number of possible outcomes for that event}}$$

The probability of an event can sometimes be inferred from a theoretical probability distribution, such as a normal distribution.

Normal distribution

X = μ (mean of the distribution)

σ = std. deviation of the distribution

[Figure: normal curves for Population 1 (Mean 1) and Population 2 (Mean 2), with the sample mean “s” marked.]

If the sample had come from population 1, a sample mean as far from “mean 1” as s would be expected less than 5% of the time, i.e., s is significantly different from “mean 1” at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population 2.

Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution.

But expression measurements? Probably not.

Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests.

Randomization / resampling based tests can be used to get around the violation of the normality assumption.

Even when parametric statistical tests (the ones that make use of normal and other distributions) are valid, randomization tests are still useful.

Outline of a randomization test - 1

1. Compute the value of interest (i.e., the test-statistic s) from your data set.

2. Make “fake” data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion.

3. Re-compute s from the “fake” data set.

[Figure: the original data set yields s; each randomized data set yields a “fake” s.]

4. Repeat steps 2 and 3 many times (often several hundred to several thousand times). Keep a record of the “fake” s values from step 3.

5. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized (“fake”) s values.

[Figure: range of randomized s values, with the original s value marked. The original s value could be significant, as it exceeds most of the randomized s values.]
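A minimal Python sketch of the five steps. The test statistic chosen here (difference between the means of the first and second halves of the data), the function names and the example data are all assumptions for illustration; any statistic could be substituted.

```python
import random

def mean_difference(values, n_first):
    """Test statistic s: mean of the first n_first values minus mean of the rest."""
    first, second = values[:n_first], values[n_first:]
    return sum(first) / len(first) - sum(second) / len(second)

def randomization_test(values, n_first, n_iter=2000, seed=0):
    rng = random.Random(seed)
    observed = mean_difference(values, n_first)              # step 1
    shuffled = list(values)
    fake_stats = []
    for _ in range(n_iter):                                  # step 4
        rng.shuffle(shuffled)                                # step 2: re-arrange at random
        fake_stats.append(mean_difference(shuffled, n_first))  # step 3
    # Step 5: how often is a "fake" s at least as extreme as the original s?
    n_extreme = sum(1 for s in fake_stats if abs(s) >= abs(observed))
    return observed, n_extreme / n_iter

# Hypothetical data: five high values followed by five values near zero.
data = [2.1, 1.8, 2.4, 2.0, 1.9, -0.2, 0.1, -0.4, 0.3, 0.0]
observed, fraction = randomization_test(data, n_first=5)
print(observed, fraction)
```

A small fraction means the original s exceeds most of the randomized s values.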


Rationale

Ideally, we want to know the “behavior” of the larger population from which the sample is drawn, in order to make statistical inferences. Here, we don’t know that the larger population “behaves” like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand.

Our “fake” data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample)


The problem of multiple testing

(adapted from presentation by Anja von Heydebreck, Max-Planck-Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany, http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)

• Let’s imagine there are 10,000 genes on a chip, AND

• None of them is differentially expressed.

• Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.


The problem of multiple testing – 2

• Let’s say that applying this test to gene “G1” yields a p-value of p = 0.01

• Remember that a p-value of 0.01 means that, if the gene were not differentially expressed, a result at least this extreme would occur by chance only 1% of the time, i.e.,

• Even though we conclude that the gene is differentially expressed (because p < 0.05), a gene with no real change would still produce such a result about 1% of the time, so our conclusion could be wrong.

• We might be willing to live with such a low probability of being wrong

BUT .....

The problem of multiple testing – 3

• We are testing 10,000 genes, not just one!!!

• Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to “live with” a p-value of 0.05

• If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn’t sound too good.
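The 500-gene figure can be checked with a small simulation. The sketch below assumes that, when no gene is differentially expressed, each gene's p-value is uniformly distributed between 0 and 1 (a standard property of p-values under the null hypothesis); the variable names are illustrative.

```python
import random

rng = random.Random(42)
n_genes = 10_000
alpha = 0.05

# Under the null hypothesis, p-values are uniform on [0, 1].
p_values = [rng.random() for _ in range(n_genes)]

# Genes wrongly called "differentially expressed" at p < 0.05.
false_positives = sum(1 for p in p_values if p < alpha)
print(false_positives)   # close to alpha * n_genes = 500
```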

The problem of multiple testing - 4

• There are “tricks” we can use to reduce the severity of this problem.

• They all involve “slashing” the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value.

• We’ll go into some of these techniques later.

• Don’t get too hung up on p-values.

• Ultimately, what matters is biological relevance. P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance. Statistical significance is not necessarily the same as biological significance.

• i.e., you don’t want to belong to “that group of people whose aim in life is to be wrong 5% of the time”!!! *

*Kempthorne, O., and T.E. Doerfler. 1969. The behaviour of some significance tests under experimental randomization. Biometrika 56:231-248, as cited in Manly, B.F.J. 1997. Randomization, bootstrap and Monte Carlo methods in biology: pg. 1. Chapman and Hall / CRC

Pearson correlation coefficient – r

• Indicates the degree to which a linear relationship can be approximated between two variables.

• Can range from (–1.0) to (+1.0).

• Positive r between two variables X and Y: as X increases, so does Y on the whole.

• Negative r: as X increases, Y generally decreases.

• The higher the magnitude of r (in the positive or negative direction), the more linear the relationship.

[Figure: scatter plots of Y against X illustrating these cases.]

Pearson correlation - 2

• Sometimes, a p-value is associated with the correlation coefficient r.

• This p-value is computed from a theoretical distribution of the correlation coefficient, similar to the normal distribution.

[Figure: distribution of the sample correlation coefficient r, centered on a population correlation coefficient of 0. In the p < 0.05 range we reject the null hypothesis that the variables are not correlated, since the sample correlation coefficient is in the rejection range of the correlation coefficient distribution that has a mean = 0.]

• This is the p-value for the null hypothesis that the X and Y data for our sample come from a population in which their correlation is zero, i.e., the null hypothesis is that there is no linear relationship between X and Y.

• If p is sufficiently small (often p < 0.05), we can reject the null hypothesis, i.e., we conclude that there is indeed a linear relationship between X and Y.

Pearson correlation - 3

The square of the Pearson correlation, r², also known as the coefficient of determination, is a measure of the “strength” of the linear relationship between X and Y.

It is the proportion of the total variation in one variable that is explained by its linear relationship with the other.

Algorithms…

Hierarchical Clustering (HCL)

HCL is an agglomerative clustering method which joins similar genes into groups. The iterative process continues with the joining of resulting groups based on their similarity until all groups are connected in a hierarchical tree.

(HCL-1)

Hierarchical Clustering

[Figure: genes g1 to g8 in a row, reordered as clusters are joined.]

g1 is most like g8

g4 is most like {g1, g8}

(HCL-2)

[Figure: genes g1 to g8 reordered again as further clusters are joined.]

g5 is most like g7

{g5, g7} is most like {g1, g4, g8}

(HCL-3)

Hierarchical Tree

[Figure: the completed hierarchical tree over genes g1 to g8.]

(HCL-4)


During construction of the hierarchy, decisions must be made to determine which clusters should be joined. The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.

(HCL-5)

Agglomerative Linkage Methods

Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.

Three linkage methods that are commonly used are:

• Single Linkage

• Average Linkage

• Complete Linkage

(HCL-6)

Single Linkage

Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.

$$D_{AB} = \min_{i,j} \, d(u_i, v_j)$$

where $u_i \in A$ and $v_j \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$

(HCL-7)

Average Linkage

Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.

$$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$$

(HCL-8)

Complete Linkage

Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.

$$D_{AB} = \max_{i,j} \, d(u_i, v_j)$$

(HCL-9)

Comparison of Linkage Methods

[Figure: trees obtained with single, average and complete linkage.]

(HCL-10)
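The three linkage rules, and the agglomerative joining process they drive, can be sketched as follows. This is a naive illustration rather than MeV's code: Euclidean distance is assumed as the underlying metric, and the names `cluster_distance` and `agglomerate` are invented for the example.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters of expression vectors under a linkage rule."""
    pairwise = [euclidean(u, v) for u in cluster_a for v in cluster_b]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(vectors, linkage="average"):
    """Repeatedly join the two closest clusters; return the list of merges."""
    clusters = [[i] for i in range(len(vectors))]   # start with one gene per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance([vectors[k] for k in clusters[i]],
                                     [vectors[k] for k in clusters[j]], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Hypothetical one-experiment "expression vectors" forming two obvious groups.
vectors = [[0.0], [0.1], [5.0], [5.3]]
for left, right, d in agglomerate(vectors, "single"):
    print(left, right, round(d, 2))
```

The list of merges, read top to bottom, is the hierarchical tree from the leaves up to the root.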

Bootstrapping (ST)

Bootstrapping – resampling with replacement

Original expression matrix:

[Figure: expression matrix, rows Gene 1 to Gene 6, one column per experiment.]

Various bootstrapped matrices (by experiments):

[Figure: bootstrapped matrices with the same genes, whose columns are drawn from the original experiments with replacement, so an experiment (e.g., Exp 4) can appear more than once while others are left out.]

Jackknifing (ST)

Jackknifing – resampling without replacement

Original expression matrix:

[Figure: expression matrix, rows Gene 1 to Gene 6, one column per experiment.]

Various jackknifed matrices (by experiments):

Exp 1 Exp 3 Exp 4 Exp 5 Exp 6 (Exp 2 left out)

Exp 1 Exp 2 Exp 3 Exp 4 Exp 6 (Exp 5 left out)

Analysis of Bootstrapped and Jackknifed Support Trees

• Bootstrapped or jackknifed expression matrices are created many times by randomly resampling the original expression matrix, using either the bootstrap or jackknife procedure.

• Each time, hierarchical trees are created from the resampled matrices.

• The trees are compared to the tree obtained from the original data set.

• The more frequently a given cluster from the original tree is found in the resampled trees, the stronger the support for the cluster.

• As each resampled matrix lacks some of the original data, high support for a cluster means that the clustering is not biased by a small subset of the data.
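The two resampling schemes can be sketched as below. The sketch resamples experiments (columns) only and leaves out a single experiment per jackknife, as in the examples above; the function names are assumptions, and building and comparing the trees is not shown.

```python
import random

def bootstrap_columns(matrix, rng):
    """Draw as many experiments as the original has, with replacement."""
    n = len(matrix[0])
    cols = [rng.randrange(n) for _ in range(n)]
    return cols, [[row[c] for c in cols] for row in matrix]

def jackknife_columns(matrix, rng):
    """Leave one randomly chosen experiment out; keep the rest once each."""
    n = len(matrix[0])
    left_out = rng.randrange(n)
    cols = [c for c in range(n) if c != left_out]
    return cols, [[row[c] for c in cols] for row in matrix]

# Hypothetical 2-gene x 6-experiment matrix.
matrix = [[0.5, -1.3, 2.4, 1.2, -0.2, 0.8],
          [1.1, 0.4, -0.7, 0.0, 0.9, -1.5]]
rng = random.Random(1)
print(bootstrap_columns(matrix, rng)[0])   # column indices, repeats allowed
print(jackknife_columns(matrix, rng)[0])   # five distinct column indices
```

Support for a cluster would then be the fraction of resampled trees in which that cluster reappears.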

K-Means / K-Medians Clustering (KMC) – 1

1. Specify number of clusters, e.g., 5.

2. Randomly assign genes to clusters.

[Figure: genes G1 to G13 randomly assigned to clusters.]

K-Means Clustering – 2

3. Calculate mean / median expression profile of each cluster.

4. Shuffle genes among clusters such that each gene is now in the cluster whose mean / median expression profile (calculated in step 3) is the closest to that gene’s expression profile.

[Figure: genes G1 to G13 regrouped into clusters after shuffling.]

5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.

K-Means / K-Medians is most useful when the user has an a-priori hypothesis about the number of clusters the genes should group into.
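Steps 1 to 5 can be sketched as follows for the K-Means variant (mean profiles, squared Euclidean distance). This is an illustrative sketch, not MeV's implementation; re-seeding an empty cluster with a random gene is an assumption made here to keep the example well defined.

```python
import random

def mean_profile(profiles):
    """Mean expression profile of a list of gene vectors."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(genes, k, max_iter=50, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in genes]            # step 2: random assignment
    for _ in range(max_iter):                             # step 5: iteration limit
        centroids = []
        for c in range(k):                                # step 3: cluster means
            members = [g for g, lab in zip(genes, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random gene.
            centroids.append(mean_profile(members) if members else list(rng.choice(genes)))
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_labels = [min(range(k), key=lambda c: squared_distance(g, centroids[c]))
                      for g in genes]
        if new_labels == labels:                          # step 5: nothing moved
            break
        labels = new_labels
    return labels

# Hypothetical data: three over-expressed and three under-expressed genes.
genes = [[1.0, 1.2, 0.9], [1.1, 1.0, 1.0], [0.9, 1.1, 1.2],
         [-1.0, -0.8, -1.1], [-1.2, -1.0, -0.9], [-0.9, -1.1, -1.0]]
print(k_means(genes, k=2))
```

A K-Medians version would replace `mean_profile` with a per-experiment median.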

Principal Components (PCAG and PCAE) – 1

1. PCA simplifies the “views” of the data.

2. Suppose we have measurements for each gene on multiple experiments.

3. Suppose some of the experiments are correlated.

4. PCA will ignore the redundant experiments, and will take a weighted average of some of the experiments, thus possibly making the trends in the data more interpretable.

5. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.

PCAG and PCAE - 2

[Figure: “cloud” of data points (e.g., genes) in 3-dimensional space, with axes x, y and z.]

Data points resolved along 3 principal component axes. In this example,

x-axis could mean a continuum from over- to under-expression (“blue” and “green” genes over-expressed, yellow genes under-expressed)

y-axis could mean that “gray” genes are over-expressed in the first five expts and under-expressed in the remaining expts, while “brown” genes are under-expressed in the first five expts, and over-expressed in the remaining expts.

z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for “purple” genes.

Interpretation of components is somewhat subjective.

Cluster Affinity Search Technique (CAST)

-uses an iterative approach to segregate elements with ‘high affinity’ into a cluster

-the process iterates through two phases

-addition of high affinity elements to the cluster being created

-removal or clean-up of low affinity elements from the cluster being created

Cluster Affinity Search Technique (CAST) - 1

Affinity = a measure of similarity between a gene and all the genes in a cluster.

Threshold affinity = user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.

1. Create a new empty cluster C1.

2. Set initial affinity of all genes to zero.

3. Move the two most similar genes into the new cluster.

4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to the cluster C1).

ADD GENES:

5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the user-specified threshold affinity, pick the unassigned gene whose affinity is the highest, and add it to cluster C1. Update the affinities of all the genes accordingly.

[Figure: empty cluster C1 next to the unassigned genes G1 to G15.]

CAST – 2

REMOVE GENES:

6. When there are no more unassigned high-affinity genes, check to see if cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene’s affinity its similarity to the removed gene.

7. Repeat step 6 while C1 contains a low-affinity gene.

8. Repeat steps 5-7 as long as changes occur to the cluster C1.

9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps 1-8.

10. Keep forming new clusters following steps 1-9, until all genes have been assigned to a cluster.

[Figure: current cluster C1 next to the remaining unassigned genes.]

QT-Clust (from Heyer et al. 1999) (HJC) - 1

1. Compute a jackknifed distance between all pairs of genes.

(Jackknifed distance: The data from one experiment are excluded from both genes, and the distance is calculated. Each experiment is thus excluded in turn, and the maximum distance between the two genes (over all exclusions) is the jackknifed distance. This is a conservative estimate of distance that accounts for bias that might be introduced by single outlier experiments.)

2. Choose a gene as the seed for a new cluster. Add the gene which increases cluster diameter the least. Continue adding genes until additional genes will exceed the specified cluster diameter limit.

[Figure: the current cluster growing from the “seed” gene G1, next to the currently unassigned genes.]

3. Repeat step 2 for every gene, so that each gene has the chance to be the seed of a new cluster. All clusters are provisional at this point.

QT-Clust – 2

4. Choose the largest cluster obtained from steps 2 and 3. In case of a tie, pick one of the largest clusters at random.

5. All genes that are not in the cluster selected above are treated as currently unassigned. Repeat steps 2-4 on these unassigned genes.

6. Stop when the last cluster thus formed has fewer genes than a user-specified number. All genes that are not in a cluster at this point are treated as unassigned.

[Figure: provisional clusters grown from different “seed” genes; the largest one is marked “Pick this cluster”.]
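The jackknifed distance from step 1 can be sketched as below. The underlying metric is assumed here to be 1 minus the Pearson correlation, which makes the effect of a single outlier experiment easy to see; the function names and example vectors are illustrative, and the sketch assumes no vector becomes constant when an experiment is removed.

```python
import math

def pearson_distance(a, b):
    """Assumed base metric: 1 - Pearson r (0 = identical shape, 2 = opposite)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def jackknifed_distance(a, b, distance=pearson_distance):
    """Exclude each experiment in turn and keep the maximum distance."""
    worst = 0.0
    for skip in range(len(a)):
        a_rest = a[:skip] + a[skip + 1:]
        b_rest = b[:skip] + b[skip + 1:]
        worst = max(worst, distance(a_rest, b_rest))
    return worst

# Hypothetical genes that only look alike because of one outlier experiment.
gene_a = [0.1, -0.2, 0.3, -0.1, 5.0]
gene_b = [-0.2, 0.1, -0.1, 0.3, 5.0]
print(pearson_distance(gene_a, gene_b))      # small: dominated by the outlier
print(jackknifed_distance(gene_a, gene_b))   # large: the outlier's influence is exposed
```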

Self Organizing Tree Algorithm

• Dopazo, J., and J.M. Carazo. Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44:226-233, 1997.

• Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2):126-136, 2001.

SOTA - 1

SOTA Characteristics

• Divisive clustering, allowing high level hierarchical structure to be revealed without having to completely partition the data set down to single gene vectors

• Data set is reduced to clusters arranged in a binary tree topology

• The number of resulting clusters is not fixed before clustering

• Neural network approach which has advantages similar to SOMs such as handling large data sets that have large amounts of ‘noise’

SOTA - 2

SOTA Topology

[Figure: a parent node (p) with two offspring cells, the winning cell (w) and the sister cell (s).]

migration factor (s < p < w)

SOTA - 3

[Figure: a cell, showing its centroid vector and its members.]

Adaptation Overview

-each gene vector associated with the parent is compared to the centroid vector of its offspring cells.

-the most similar cell’s centroid and its neighboring cells are adapted using the appropriate migration weights.

SOTA - 4

-following the presentation of all genes to the system a measure of system diversity is used to determine if training has found an optimal position for the offspring.

-if the system diversity improves (decreases) then another training epoch is started otherwise training ends and a new cycle starts with a cell division.

SOTA - 5

The most ‘diverse’ cell is selected for division at the start of the next training cycle.

SOTA - 6

Growth Termination

Expansion stops when the most diverse cell’s diversity falls below a threshold.

SOTA - 7

[Figure: tree diversity (0 to 0.2) plotted against adaptation epoch number (0 to 500).]

Each training cycle ends when the overall tree diversity ‘stabilizes’. This triggers a cell division and possibly a new training cycle.

SOTA - 8

Self-organizing maps (SOMs) – 1

1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal

N = Nodes, G = Genes

[Figure: genes G1 to G29 scattered in expression space, with nodes N1 to N6 arranged in a grid.]

SOMs – 2

2. Choose a random gene, e.g., G9.

3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The further away the node is from N2, the less it is moved.

[Figure: genes G1 to G29 with nodes N1 to N6 shifted toward gene G9.]

SOM Neighborhood Options

Bubble Neighborhood: some nodes move (those within the neighborhood radius), alpha is constant.

Gaussian Neighborhood: all nodes move, alpha is scaled.

[Figure: nodes N1 to N6 near genes G7 to G11, shown once for each neighborhood option.]

SOMs – 3

4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased.

5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.

[Figure: after training, each of the nodes N1 to N6 sits among a cluster of genes.]
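One update (steps 2 and 3) and the shrinking step size (step 4) can be sketched as follows, with both neighborhood options. The Gaussian scaling formula, the linear decay of alpha and all names are assumptions for illustration and may differ from MeV's implementation.

```python
import math
import random

def som_step(nodes, grid, gene, alpha, radius, neighborhood="gaussian"):
    """Move the nodes toward one gene; returns the index of the closest node.

    nodes: node vectors in expression space (updated in place)
    grid:  fixed 2-D grid coordinates of the nodes
    """
    winner = min(range(len(nodes)),
                 key=lambda i: sum((n - g) ** 2 for n, g in zip(nodes[i], gene)))
    for i, node in enumerate(nodes):
        grid_dist = math.dist(grid[i], grid[winner])
        if neighborhood == "bubble":
            # Only nodes within the radius move, all by the same alpha.
            rate = alpha if grid_dist <= radius else 0.0
        else:
            # All nodes move; alpha is scaled down with grid distance.
            rate = alpha * math.exp(-grid_dist ** 2 / (2 * radius ** 2))
        nodes[i] = [n + rate * (g - n) for n, g in zip(node, gene)]
    return winner

def train_som(nodes, grid, genes, n_iter=2000, alpha0=0.5, radius=1.0, seed=0):
    rng = random.Random(seed)
    for t in range(n_iter):
        alpha = alpha0 * (1 - t / n_iter)   # allowed movement shrinks each iteration
        som_step(nodes, grid, rng.choice(genes), alpha, radius)
    return nodes

# Hypothetical two-node map and one gene.
nodes = [[0.0, 0.0], [1.0, 1.0]]
grid = [(0, 0), (0, 1)]
winner = som_step(nodes, grid, [1.0, 0.8], alpha=0.5, radius=1.0)
print(winner, nodes)
```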

Template Matching

-template matching allows one to find expression vectors which match a provided template

-a template can be derived from

- a gene known to be central to the area of study

- a sample or set of samples of a particular type

- a cluster with a mean pattern of interest

- a pattern constructed to reveal trends based on knowledge of the experimental design

PTM-2

-Sometimes it is useful to identify elements that have complementary patterns by selecting to use the absolute value of r.
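A minimal sketch of template matching with the Pearson correlation. Selecting matches with a fixed threshold on r is an assumption made for this example, and the gene names, the template and the function names are hypothetical.

```python
import math

def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def match_template(genes, template, threshold=0.8, use_absolute=False):
    """Return (name, r) for genes whose correlation with the template passes the threshold."""
    matches = []
    for name, profile in genes.items():
        r = pearson_r(profile, template)
        score = abs(r) if use_absolute else r   # |r| also picks up complementary patterns
        if score >= threshold:
            matches.append((name, r))
    return matches

# Hypothetical template: steadily increasing expression.
template = [0, 1, 2, 3]
genes = {"up": [0, 1, 2, 3], "down": [3, 2, 1, 0], "zigzag": [1, 0, 1, 0]}

print(match_template(genes, template))                      # only the "up" gene
print(match_template(genes, template, use_absolute=True))   # "up" and "down"
```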

K-Means / K-Medians Support (KMS)

Because of the random initialization of K-Means / K-Medians, clustering results may vary somewhat between successive runs on the same dataset. KMS helps us validate the clustering results obtained from K-Means / K-Medians.

1. Run K-Means / K-Medians multiple times.

2. The KMS module generates clusters in which the member genes frequently group together in the same clusters (“consensus clusters”) across multiple runs of K-Means / K-Medians.

3. The consensus clusters consist of genes that clustered together in at least x% of the K-Means / K-Medians runs, where x is the threshold percentage input by the user.

Gene Shaving

Compute the first principal component of the expression matrix

Shave off a percentage (default 10%) of genes with the lowest values of dot product with the 1st principal component

Orthogonalize the expression matrix with respect to the average gene in the cluster and repeat the shaving procedure

Repeat until only one gene remains

Results in a series of nested clusters

Choose a cluster of appropriate size as determined by the gap statistic calculation

Gap statistic calculation (choosing cluster size)

Quality measure for clusters:

$$R^2 = \frac{\text{between variance}}{\text{within variance}}$$

between variance: variance of the mean gene across experiments

within variance: variance of each gene about the cluster average

Large R² implies a tight cluster of coherent genes

Create random permutations of the expression matrix and calculate R² for each

Compare R² of each cluster to that of the entire expression matrix

Choose the cluster whose R² is furthest from the average R² of the permuted expression matrices.

Gene Shaving

The final cluster contains a set of genes that are greatly affected by the experimental conditions in a similar way.

Relevance Networks

Set of genes whose expression profiles are predictive of one another.

Genes with low entropy (least variable across experiments) are excluded from analysis.

$$H = -\sum_{x=1}^{10} p(x) \log_2 p(x)$$

Can be used to identify negative correlations between genes
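The entropy filter can be sketched as below. The upper limit of 10 in the sum is taken to be the number of histogram bins; dividing a gene's range into 10 equal-width bins is an assumption about the binning scheme, and the function name is illustrative.

```python
import math

def expression_entropy(values, n_bins=10):
    """Entropy H of a gene's expression values, binned into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0                       # a constant gene carries no information
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls into the last bin.
        idx = min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Hypothetical genes: one spread over the whole range, one nearly constant.
variable_gene = [0.05 + 0.1 * i for i in range(10)]
flat_gene = [1.0, 1.0, 1.0, 1.0, 5.0]
print(expression_entropy(variable_gene))   # high entropy: kept
print(expression_entropy(flat_gene))       # low entropy: candidate for exclusion
```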

Relevance Networks

The expression pattern of each gene is compared to that of every other gene.

The ability of each gene to predict the expression of each other gene is assigned a correlation coefficient.

Correlation coefficients outside the boundaries defined by the minimum and maximum thresholds are eliminated.

Tmin = 0.50, Tmax = 0.90

The remaining relationships between genes define the subnets.

[Figure: network of genes A to E with pairwise correlation coefficients (.28, .75, .15, .37, .40, .02, .51, .11, .63, .92), shown before and after thresholding.]

T-Tests (TTEST) – Between subjects (or unpaired) - 1

1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.

[Figure: expression matrix of Gene 1 to Gene 6 over Exp 1 to Exp 6, shown again with the experiments rearranged under Group A and Group B.]

2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

3. Calculate t-statistic for each gene

4. Calculate probability value of the t-statistic for each gene either from:

A. Theoretical t-distribution

OR

B. Permutation tests.

TTEST – Between subjects - 2

Permutation tests

i) For each gene, compute t-statistic

ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B.

[Figure: values of Gene 1 under the original grouping and under a randomized grouping of the experiments into Group A and Group B.]

TTEST - Between subjects - 3

Permutation tests - continued

iii) Compute t-statistic for the randomized gene

iv) Repeat steps ii-iii n times (where n is specified by the user).

v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute values of the randomized t-statistic over n randomizations.

vi) Then, the p-value associated with the gene = 1 – (x/n)
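Steps i to vi can be sketched as follows. The form of the t-statistic is not specified above; Welch's unequal-variance version is assumed here, and the function names and example values are illustrative.

```python
import math
import random
import statistics

def t_statistic(group_a, group_b):
    """Two-sample t-statistic (assumption: Welch's unequal-variance form)."""
    var_a = statistics.variance(group_a) / len(group_a)
    var_b = statistics.variance(group_b) / len(group_b)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(var_a + var_b)

def permutation_p_value(group_a, group_b, n=1000, seed=0):
    rng = random.Random(seed)
    t_orig = abs(t_statistic(group_a, group_b))                  # step i
    pooled = list(group_a) + list(group_b)
    x = 0
    for _ in range(n):                                           # step iv
        rng.shuffle(pooled)                                      # step ii: same group sizes
        t_rand = abs(t_statistic(pooled[:len(group_a)],
                                 pooled[len(group_a):]))         # step iii
        if t_orig > t_rand:                                      # step v
            x += 1
    return 1 - x / n                                             # step vi

# Hypothetical log ratios for one gene in groups A and B.
group_a = [2.1, 1.8, 2.4, 2.0]
group_b = [0.1, -0.3, 0.2, -0.1]
print(permutation_p_value(group_a, group_b))
```

With very small groups the number of distinct regroupings is limited (70 for two groups of four), which bounds how small the permutation p-value can get.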


5. Determine whether a gene’s expression levels are significantly different between the two groups by one of three methods:

A) Just alpha: If the calculated p-value for a gene is less than or equal to the user-input alpha (critical p-value), the gene is considered significant.

OR

Use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant.

B) Standard Bonferroni correction: The user-input alpha is divided by the total number of genes to give a critical p-value that is used as above.

TTEST - Between subjects – 6

C) Adjusted Bonferroni:

i) The t-values for all the genes are ranked in descending order.

ii) For the gene with the highest t-value, the critical p-value becomes (alpha / N), where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be (alpha / (N-1)), and so on.
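The two corrections can be sketched as follows. Ranking genes by the absolute value of t is an assumption made here (the text says only that t-values are ranked in descending order); the function names are illustrative.

```python
def bonferroni_threshold(alpha, n_genes):
    """Standard Bonferroni: one critical p-value shared by all genes."""
    return alpha / n_genes

def adjusted_bonferroni_thresholds(t_values, alpha):
    """Adjusted Bonferroni: alpha/N for the top-ranked gene, alpha/(N-1) for the next, ..."""
    n = len(t_values)
    # Assumption: genes are ranked by |t|, largest first.
    order = sorted(range(n), key=lambda i: abs(t_values[i]), reverse=True)
    thresholds = [0.0] * n
    for rank, gene in enumerate(order):
        thresholds[gene] = alpha / (n - rank)
    return thresholds

# Hypothetical t-values for four genes.
t_values = [1.0, -5.0, 3.0, 0.5]
print(bonferroni_threshold(0.05, len(t_values)))
print(adjusted_bonferroni_thresholds(t_values, 0.05))
```

Each gene's p-value is then compared with its own critical p-value, as in method A.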

TTEST – 1-class (or One-sample t-test) - 1

1. Used to test if the mean expression of a gene over all experiments is different from a hypothesized mean.

[Figure: expression matrix in which the rows Gene 1, Gene 2 and Gene 3 are labeled Vector 1, Vector 2 and Vector 3.]

2. Question: Is the mean of the values of a given gene vector significantly different from a hypothesized mean?

TTEST- 1 Class - 2

3. Often, the hypothesized mean in gene expression studies is zero, meaning that we are looking for genes whose mean log2 ratio across all experiments is significantly different from zero.

4. Using 1-sample t-tests, we can select genes which, on average, show differential expression across all experiments (since genes with no differential expression should have a mean log2 ratio of zero across all expts).

5. Calculate t-value, where

$$t = \frac{\text{Observed mean of gene vector} - \text{Hypothesized mean of gene vector}}{\text{Standard error of the mean of the gene vector}}$$

TTEST – 1 class - 3

6. Calculate p-value from a theoretical t-distribution, OR

7. By permutation:

7a. Randomly pick some elements of the gene vector, and change their values, such that the new value of the changed element is

[original value – 2 x (original value - hypothesized mean)]

(i.e., “flip” the element’s deviation around the hypothesized mean)

Thus, if the original gene values are:

0.5 -1.3 2.4 1.2 -0.2 0.8

and the hypothesized mean is zero, then the randomized gene values could be:

-0.5 -1.3 2.4 -1.2 0.2 -0.8

Here the first, fourth, fifth and sixth elements were randomly chosen and flipped around zero, the hypothesized mean.

TTEST – 1 class - 4

7b. Calculate t-value from the randomized gene.

7c. Repeat 7a and 7b as many times as desired. If all permutations are chosen, then every possible combination of elements in the gene vector is chosen for flipping.

7d. The p-value = 1 – (the proportion of times that the original absolute t-value exceeds the randomized absolute t-value over all the permutations conducted).

8. If a gene’s p-value is less than or equal to the user-specified critical p-value, the gene’s mean expression over all experiments is significantly different from the hypothesized mean.

9. Bonferroni and adjusted Bonferroni corrections may be applied just as in the two-sample t-test.
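Steps 5 to 7d can be sketched as follows. Flipping each element independently with probability 0.5 is an assumption about how the elements are "randomly picked"; the function names and the second example vector are illustrative.

```python
import math
import random
import statistics

def one_sample_t(values, hypothesized_mean=0.0):
    """t = (observed mean - hypothesized mean) / standard error of the mean."""
    se = statistics.stdev(values) / math.sqrt(len(values))
    return (statistics.mean(values) - hypothesized_mean) / se

def flip(value, hypothesized_mean=0.0):
    """Reflect a value's deviation around the hypothesized mean (step 7a)."""
    return value - 2 * (value - hypothesized_mean)

def sign_flip_p_value(values, hypothesized_mean=0.0, n=1000, seed=0):
    rng = random.Random(seed)
    t_orig = abs(one_sample_t(values, hypothesized_mean))
    x = 0
    for _ in range(n):                                        # step 7c
        # Assumption: each element is flipped with probability 0.5.
        randomized = [flip(v, hypothesized_mean) if rng.random() < 0.5 else v
                      for v in values]
        if t_orig > abs(one_sample_t(randomized, hypothesized_mean)):   # step 7b
            x += 1
    return 1 - x / n                                          # step 7d

original = [0.5, -1.3, 2.4, 1.2, -0.2, 0.8]
print([flip(v) if i in (0, 3, 4, 5) else v for i, v in enumerate(original)])
print(sign_flip_p_value(original))
```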

One Way Analysis of Variance (ANOVA)

1. Assign experiments to > 2 groups

2. Question: Is mean expression level of a gene the same across all groups?

[Figure: expression matrix of Gene 1 to Gene 7 over Ex 1 to Ex 9, shown again with the experiments rearranged under Group 1, Group 2 and Group 3.]

ANOVA - 2

3. Calculate an F-ratio for each gene, where

$$F = \frac{\text{Mean square (groups)}}{\text{Mean square (error)}}$$

which is a measure of

$$\frac{\text{Between groups variability}}{\text{Within groups variability}}$$

The larger the value of F, the greater the difference among the group means relative to the sampling error variability (which is the within groups variability).

i.e., the larger the value of F, the more likely it is that the differences among the group means reflect “real” differences among the means of the populations they are drawn from, rather than being due to random sampling error.
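The F-ratio for one gene can be sketched as follows, using the usual one-way ANOVA definitions of the two mean squares. The function name and example values are illustrative.

```python
def f_ratio(groups):
    """One-way ANOVA F = mean square (groups) / mean square (error) for one gene."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between groups variability: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within groups variability: values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical expression values of one gene in three groups of experiments.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(f_ratio(groups))
```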

ANOVA - 3

4. The p-value associated with an F-value is the probability that an F-value that large would be obtained if there were no differences among group means (i.e., given the null hypothesis).

Therefore, the smaller the p-value, the less likely it is that the null hypothesis is valid, i.e., the differences among group means are more likely to reflect real population differences as p-values decrease in magnitude.

ANOVA - 4

5. P-values can be obtained for the F-values from a theoretical F-distribution, assuming that the populations from which the data are obtained

• are normally distributed, and

• have homogeneous variances.

The test is considered robust to violations of these assumptions, provided sample sizes are relatively large and similar across groups.

ANOVA – 5

6. P-values can be obtained from permutation tests (just like in t-tests), if one does not want to rely on the assumptions needed for using the F-distribution.

P-values can also be corrected for multiple comparisons (using Bonferroni or other procedures).

These features will soon be implemented in MeV.

Two-factor ANOVA (TFA)

- Can be used to find genes whose expression is significantly different over two factors (e.g., sex and strain), as well as to look for genes with a significant interaction for these two factors.

[Table: design grid with rows Male and Female, and columns Strain A, Strain B, Strain C.]

TFA - 2

[Figure: two plots of gene expression (y-axis) against Strain 1, 2, 3 (x-axis), each with one line for Female and one for Male. Panel titles: "No interaction" and "Interaction".]

TFA - 3

• Ideally, the design should be balanced, i.e., equal numbers of samples in each factor A – factor B combination.

• If unbalanced, the analysis can still be conducted, but F-tests will be somewhat biased. May need to use smaller p-values.

• Can have balanced designs with no replication (see below). In this case, interaction cannot be tested.

[Table: design grid with no replication; rows Male and Female, columns Strain A, Strain B, Strain C.]

Significance analysis of microarrays (SAM)

• SAM can be used to pick out significant genes based on differential expression between sets of samples.

Currently implemented for the following designs:

- two-class unpaired

- two-class paired

- multi-class

- censored survival

- one-class

SAM - 2

• SAM gives estimates of the False Discovery Rate (FDR), which is the proportion of genes likely to have been wrongly identified by chance as being significant.

• It is a very interactive algorithm – allows users to dynamically change thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic.

• The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, should make the data-mining process more sensitive.

SAM designs

Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to between subjects t-test).

Two-class paired: samples are split into two groups, and there is a 1-to-1 correspondence between a sample in group A and one in group B (analogous to paired t-test).

SAM designs - 2

Multi-class: picks up genes whose mean expression is different across > 2 groups of samples (analogous to one-way ANOVA)

Censored survival: picks up genes whose expression levels are correlated with duration of survival.

One-class: picks up genes whose mean expression across experiments is different from a user-specified mean.

SAM Two-Class Unpaired

1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.

[Figure: expression matrix of Genes 1-6.]

2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

[Figure: the same matrix of Genes 1-6 with the columns arranged under Group A and Group B.]

SAM Two-Class Unpaired - 2

Permutation tests

i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene.

ii) Rank the genes in ascending order of their d-values.

iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene.

[Figure: Gene 1 shown under the original grouping and under a randomized grouping of its values into Group A and Group B; the experiment labels read Exp 1, Exp 4, Exp 5, Exp 2, Exp 3, Exp 6.]

SAM Two-Class Unpaired - 3

iv) Rank the permuted d-values of the genes in ascending order

v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.

vi) Plot the observed d-values vs. the expected d-values
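
Steps i) to v) can be sketched as below. This is a hedged illustration rather than the MeV code. It assumes the d-value takes the usual t-like form (difference of group means divided by a pooled standard error plus a small constant `s0`), that `s0` is supplied by the caller, and that randomization shuffles the group assignment of whole experiments. The function names and the toy matrix are hypothetical.

```python
import random
from math import sqrt
from statistics import mean

def d_value(a, b, s0=0.1):
    """t-like statistic: (mean B - mean A) / (pooled standard error + s0)."""
    ss = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    s = sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean(b) - mean(a)) / (s + s0)

def sorted_d_values(matrix, in_b):
    """d-value of every gene (row), ranked in ascending order."""
    ds = []
    for row in matrix:
        a = [x for x, flag in zip(row, in_b) if not flag]
        b = [x for x, flag in zip(row, in_b) if flag]
        ds.append(d_value(a, b))
    return sorted(ds)

def observed_and_expected(matrix, in_b, n_perm=200, seed=0):
    """Observed d-values by rank, and the mean permuted d-value at each rank."""
    rng = random.Random(seed)
    observed = sorted_d_values(matrix, in_b)
    totals = [0.0] * len(matrix)
    flags = list(in_b)
    for _ in range(n_perm):
        rng.shuffle(flags)                 # group sizes stay the same
        for rank, d in enumerate(sorted_d_values(matrix, flags)):
            totals[rank] += d
    expected = [t / n_perm for t in totals]
    return observed, expected

# Hypothetical matrix: 4 genes x 6 experiments; experiments 1, 2, 5 are in
# group A and experiments 3, 4, 6 are in group B.
matrix = [
    [0.1, -0.2, 2.1, 1.9, 0.0, 2.3],
    [1.5, 1.7, -0.4, -0.6, 1.6, -0.5],
    [0.2, -0.1, 0.1, 0.3, -0.3, 0.0],
    [0.4, 0.6, 0.5, 0.2, 0.7, 0.3],
]
in_b = [False, False, True, True, False, True]
observed, expected = observed_and_expected(matrix, in_b)
```

Plotting `observed` against `expected` gives the SAM plot of step vi); both lists are aligned by rank, not by gene identity.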

SAM Two-Class Unpaired - 4

[Figure: SAM plot of observed d-values vs. expected d-values, with the annotations listed here.]

Significant positive genes (i.e., mean expression of group B > mean expression of group A) in red

Significant negative genes (i.e., mean expression of group A > mean expression of group B) in green

“Observed d = expected d” line

Tuning parameter “delta” limits, which can be dynamically changed by using the slider bar or entering a value in the text field.

The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Any gene beyond the first gene in the positive or negative direction on the x-axis (including the first gene) whose observed d-value differs from its expected d-value by at least delta is considered significant.

SAM Two-Class Unpaired – 5

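For each permutation of the data, compute the number of positive and negative significant genes for a given delta as explained in the previous slide. The median of these counts over all permutations is the median number of falsely significant genes. Dividing it by the number of genes called significant in the observed data gives the median False Discovery Rate as a proportion.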

The rationale behind this is, any genes designated as significant from the randomized data are being picked up purely by chance (i.e., “falsely” discovered). Therefore, the median number picked up over many randomizations is a good estimate of false discovery rate.
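
The counting described above can be sketched as follows. This is a simplified illustration under stated assumptions: the observed and expected d-values are already sorted and aligned by rank, the cutoffs are taken at the first gene on each side that deviates from the expected value by at least delta, and the false discovery rate is reported as the median false count divided by the number of genes called significant. All names and numbers are hypothetical.

```python
from statistics import median

def cutoffs(observed, expected, delta):
    """Return (lower, upper) d-value cutoffs for a given delta.

    `observed` and `expected` are sorted ascending and aligned by rank.
    A cutoff is None when no gene on that side reaches delta.
    """
    upper = lower = None
    for o, e in zip(observed, expected):            # walk in the positive direction
        if e >= 0 and o - e >= delta:
            upper = o
            break
    for o, e in zip(reversed(observed), reversed(expected)):  # negative direction
        if e <= 0 and e - o >= delta:
            lower = o
            break
    return lower, upper

def count_called(d_values, lower, upper):
    """Number of d-values at or beyond either cutoff."""
    return sum(
        1 for d in d_values
        if (upper is not None and d >= upper) or (lower is not None and d <= lower)
    )

def median_false_count(permuted_sets, lower, upper):
    """Median number of genes called significant across permuted data sets."""
    return median([count_called(ds, lower, upper) for ds in permuted_sets])

# Hypothetical d-values for 6 genes and 3 permutations
observed = [-4.0, -1.0, -0.2, 0.3, 1.2, 5.0]
expected = [-1.5, -0.8, -0.3, 0.3, 0.8, 1.5]
permuted_sets = [
    [-1.6, -0.9, -0.1, 0.2, 0.7, 1.4],
    [-4.5, -0.5, 0.0, 0.4, 0.9, 1.6],
    [-1.2, -0.7, -0.4, 0.1, 1.0, 5.5],
]
lower, upper = cutoffs(observed, expected, delta=1.0)
called = count_called(observed, lower, upper)
false_count = median_false_count(permuted_sets, lower, upper)
fdr = false_count / called if called else 0.0
```

Raising delta moves both cutoffs outward, which lowers the number of genes called and usually the false count as well.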

SAM Two-Class Paired

• Samples fall into two groups

• Each member of group A is associated with a member of group B in a 1-to-1 relationship

[Figure: groups A and B, with one A-B pair marked.]

• e.g., groups A and B could respectively represent “before” and “after” a drug treatment, and each A-B pair of samples could come from the same patient before and after the treatment.

• or, groups A and B could represent two strains for which samples were collected at several time points over a time course study. A sample collected from each of strain A and B at the same time point could form an A-B pair.

SAM Two-Class Paired - 2

• The rest of the analysis is similar to two-class unpaired SAM. Positive significant genes are those for which Mean (Group B) is significantly larger than Mean (Group A), and the reverse is true for negative significant genes.

SAM Multi-Class

• Extension of SAM two-class unpaired to more than 2 groups

• Experiments belong to one of at least three groups

• Analogous to one-way between subjects ANOVA

[Figure: expression matrix of Genes 1-7 across Ex 1 to Ex 9, shown first in the original column order and then with the columns reordered as Ex 3, Ex 6, Ex 8, Ex 4, Ex 5, Ex 9, Ex 2, Ex 1, Ex 7 under the labels Group 1, Group 2, Group 3.]

SAM Multi-Class - 2

• This analysis yields only positive significant genes

• These are genes whose means are significantly different across some combination of the groups of experiments.

SAM Censored Survival

• Each experiment (sample) is associated with an observation time, and a state at the time of observation.

• The state is either “dead” or “censored”

• “Censored” means that the subject survived beyond the timepoint at which the sample was taken.

• A positive score means that a higher expression level for that gene implies shorter survival (i.e., higher risk), whereas a negative score means that higher expression implies longer survival.

SAM One-Class

• used to pick up genes whose mean expression across experiments is different from a user-specified mean.

• analogous to a one-sample t-test

• positive genes are those whose means are greater than the specified mean, while negative genes have means smaller than the specified mean

Support Vector Machines (SVM)

• supervised learning technique

• uses supplied information such as presumptive biological relationships between a set of elements, and the expression profiles of elements to produce a binary classification of elements.

Supervised Learning

- Begins with the definition of a class, which specifies in advance which elements should cluster together.

- E.g., genes may code for enzymes in a common pathway or be part of a regulatory system, or samples may be of a tissue type or from a particular strain.

- This information is used to train the SVM to discriminate members from non-members.

SVM Process Overview

[Figure: flow diagram with the boxes "SVM Training" and "SVM Classification" and the labels "Initial Classification", "Data", "Weights", "Data", "Elements In Classification", and "Elements Out of Classification".]

SVM Classification

• SVM attempts to find an optimal separating hyperplane between members of the two initial classifications.

[Figure: separating hyperplane]

Separation Problem

- An optimal hyperplane partitions the initial classification correctly and maximizes the distance from the plane to the elements on either ‘side’, the positive and negative examples.

- When the training examples (initial classification) consist of very diverse expression patterns, finding an optimal hyperplane can be impossible…

SVM Kernel Construction

The expression data can be transformed to a higher dimensional space (feature space) by applying a kernel function. This transformation can have the effect of allowing a separating hyperplane to be found.
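
As a small illustration of what a kernel function does, the sketch below uses a homogeneous polynomial kernel of power 2 on two-dimensional vectors. The choice of kernel, the explicit feature map, and the sample vectors are assumptions made for the example and are not taken from the MeV implementation. The kernel value equals the dot product of the two vectors after they are mapped into a three-dimensional feature space, without that mapping ever being computed during training.

```python
from math import isclose, sqrt

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, power=2):
    """Homogeneous polynomial kernel: (u . v) ** power."""
    return dot(u, v) ** power

def feature_map(u):
    """Explicit feature space for power 2 and two-dimensional input."""
    x1, x2 = u
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

u = (1.0, 2.0)
v = (3.0, -1.0)
kernel_value = poly_kernel(u, v)                        # computed in the input space
explicit_value = dot(feature_map(u), feature_map(v))    # computed in the feature space
print(isclose(kernel_value, explicit_value))
```

Raising `power` enlarges the implied feature space, which makes separation easier to achieve and, at high powers, easier to achieve artificially.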

Practical SVM Issues

• Results depend heavily on the input parameters.

• Using a high degree kernel function risks artificial separation of the data.

• An iterative approach to increasing the kernel power is advisable.

SVM Results

• Two classes are produced

– Positive Class: contains elements with expression patterns similar to those in the positive examples in the training set.

– Negative Class: contains all other members of the input set.

• Each of these classes has elements that fall in two groups

– Those initially in the class (true positives and true negatives)

– Those recruited into the class (false positives and false negatives)

K-Nearest Neighbor Classification – KNNC - 1

• supervised classification scheme

• user specifies the number of expected classes

• a training set of vectors is provided as input

• user specifies classes of training vectors

• training set should contain examples of each class

KNNC – 2 – pre-classification filters

• Prior to classification, variance filtering can optionally be applied to all vectors (training set + vectors to be classified). This will filter out genes with low variance across experiments. Note that this might filter out some genes in the training set as well.

• Correlation filtering can also be applied to the vectors to be classified. This would filter out those vectors in the set to be classified that are not significantly correlated with any gene in the training set.

• Significance for correlation filtering is determined by a permutation test.

KNNC - 3 - correlation filtering randomization test

1. The Pearson correlation coefficient r is computed between a given vector to be classified and each member of the training set.

2. The maximum such r is called the rmax for that vector.

3. The vector is randomized a user-specified number of times, and each time, an rmax is calculated using the randomized vector (call it rmax*), just as in steps 1 and 2.

4. The proportion of times rmax* exceeds rmax over all randomizations is the p-value for that vector.

5. If the p-value for a vector < the user-specified p-value, that vector is retained for further analysis.

6. Steps 1-5 are repeated for every vector in the set to be classified.
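
The randomization test for one vector can be sketched as below. This is a minimal illustration, not the MeV code; the function names, the number of randomizations, and the example vectors are assumptions. It also assumes no vector is constant, since the Pearson correlation is undefined in that case.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_max(vector, training):
    """Steps 1 and 2: largest r between the vector and any training vector."""
    return max(pearson(vector, t) for t in training)

def correlation_p_value(vector, training, n_random=500, seed=0):
    """Steps 3 and 4: proportion of randomizations with rmax* > rmax."""
    rng = random.Random(seed)
    observed = r_max(vector, training)
    shuffled = list(vector)
    exceed = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)
        if r_max(shuffled, training) > observed:
            exceed += 1
    return exceed / n_random

# Hypothetical training set and one vector to be classified
training = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
]
vector = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]
p = correlation_p_value(vector, training)
retained = p < 0.05      # step 5, with a user-specified threshold of 0.05
```

Step 6 amounts to calling `correlation_p_value` once for every vector in the set to be classified.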

KNNC – 4 - Classification parameters

• Let v be a vector that needs to be classified, and T = {t1, t2, …, t10} be the set of training vectors.

• The user specifies the class of each element of T. Say there are 4 classes.

• The user also specifies the number of neighbors k. Say, k = 5.

[Figure: training set T containing vectors t1 to t10, labeled Class 1 to Class 4, together with the vector v to be classified.]

KNNC - 5 - Classification

[Figure: training set T containing vectors t1 to t10, labeled Class 1 to Class 4, together with the vector v to be classified.]

• Suppose v’s 5 nearest neighbors in set T (by Euclidean distance) are t1, t4, t8, t2, and t5.

• Since class 1 is most frequently represented among v’s nearest neighbors, v is assigned to class 1.

• If there is a tie in frequency of classes represented among nearest neighbors, the vector remains unassigned.
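
The voting rule, including the tie case, can be sketched as below. This is a minimal illustration with hypothetical names and toy two-dimensional vectors, and it requires Python 3.8 or later for `math.dist`.

```python
from collections import Counter
from math import dist

def knn_classify(v, training, k):
    """Assign v to the most frequent class among its k nearest neighbors.

    `training` is a list of (vector, class_label) pairs. Returns None when
    the top classes tie, i.e., the vector remains unassigned.
    """
    neighbors = sorted(training, key=lambda item: dist(v, item[0]))[:k]
    counts = Counter(label for _, label in neighbors).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Hypothetical training set with three classes
training = [
    ((0.0, 0.0), 1), ((0.0, 1.0), 1), ((1.0, 0.0), 1),
    ((5.0, 5.0), 2), ((5.0, 6.0), 2),
    ((9.0, 9.0), 3),
]
assigned = knn_classify((0.5, 0.5), training, k=5)

# Tie: one neighbor from each of two classes, so no class wins the vote
tied = knn_classify((1.0, 0.0), [((0.0, 0.0), 1), ((2.0, 0.0), 2)], k=2)
```

In the first call, three of the five nearest neighbors belong to class 1 and two to class 2, so the vector is assigned to class 1.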

EASE (Expression Analysis Systematic Explorer)

EASE analysis identifies prevalent biological themes within gene clusters.

The significance of each identified theme is determined by its prevalence in the cluster and in the population of genes from which the cluster was created.

Consider a population of genes representing a diverse set of biological roles or themes shown below as different colors.

Diverse Biological Roles

Many algorithms can be applied to expression data to partition genes based on expression profiles over multiple conditions.

Many of these techniques work solely on expression data and disregard biological information.

- What are some of the predominant biological themes represented in the cluster, and how should significance be assigned to a discovered biological theme?

Consider a particular cluster…

Example:

Population size: 40 genes

Cluster size: 12 genes

10 genes, shown in green, have a common biological theme and 8 occur within the cluster.

Consider the Outcome

The frequency of the theme in the population is 10/40 = 25%

The frequency of the theme within the cluster is 8/12 = 67%

AND

* 80% of the genes related to the theme in the population ended up within the relatively small cluster.

Contingency Matrix

A 2x2 contingency matrix is typically used to capture the relationships between cluster membership and membership in a biological theme.

              Cluster in    Cluster out
Theme in           8             2
Theme out          4            26

Assigning Significance to the Findings

Fisher’s Exact Test permits us to determine whether there are non-random associations between the two variables, expression-based cluster membership and membership in a particular biological theme.

( 2x2 contingency matrix )

              Cluster in    Cluster out
Theme in           8             2
Theme out          4            26

p ≈ .0002

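Hypergeometric Distribution

For the matrix

a   b
c   d

with row totals a+b and c+d, column totals a+c and b+d, and n = a+b+c+d:

$$
p = \frac{\dfrac{(a+c)!}{a!\,c!}\cdot\dfrac{(b+d)!}{b!\,d!}}{\dfrac{n!}{(a+b)!\,(c+d)!}}
  = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}
$$

The probability of any particular matrix occurring by random selection, given no association between the two variables, is given by the hypergeometric rule.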

Probability Computation

For our matrix,

8   2
4  26

we are interested not only in the probability of getting exactly 8 annotation hits in the cluster, but rather in the probability of having 8 or more hits. In this case the probabilities of each of the possible matrices are summed.

8   2        9   1        10   0
4  26        3  27         2  28

$$ 0.0002207 + 7.27 \times 10^{-6} + 7.79 \times 10^{-8} \approx 0.000228 $$
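
The sum above can be reproduced with a short sketch using the hypergeometric rule. The function names are hypothetical, and this is only the one-sided Fisher's exact calculation shown on the slide, not a description of everything EASE computes. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb

def matrix_probability(a, b, c, d):
    """Hypergeometric probability of the 2x2 matrix [[a, b], [c, d]]."""
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)

def one_sided_p(a, b, c, d):
    """Probability of a or more hits, holding the row and column totals fixed."""
    total = 0.0
    while b >= 0 and c >= 0:
        total += matrix_probability(a, b, c, d)
        # Move one theme gene into the cluster; all margins stay unchanged
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return total

exact_8 = matrix_probability(8, 2, 4, 26)    # exactly 8 hits
p_value = one_sided_p(8, 2, 4, 26)           # 8, 9, or 10 hits
print(f"{exact_8:.7f} {p_value:.6f}")
```

The loop stops after the matrix with 10 hits, because the theme has only 10 genes in the population.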

EASE Results

• Consider all of the Results

EASE reports all themes represented in a cluster, and although some themes may not meet statistical significance, it may still be important to note that particular biological roles or pathways are represented in the cluster.

• Independently Verify Roles

Once found, biological themes should be independently verified using annotation resources.

Basic EASE Requirements

Annotation keys: identifiers for each gene must be loaded with the data into MeV.

EASE file system: EASE uses a file system to link annotation keys to biological themes.

EASE File System

EASE (Expression Analysis Systematic Explorer)

Hosack et al. Identifying biological themes within lists of genes with EASE. Genome Biol., 4:R70-R70.8, 2003.

NIAID graciously provided the foundation Java classes upon which the MeV version was built.

Coming Attractions

• Algorithm scripting

• Discriminant analysis

• Chromosome Viewers

etc.
