Microarray analysis Algorithms in Computational Biology Spring 2006 Written by Itai Sharon

Page 1: Microarray analysis

Microarray analysis

Algorithms in Computational Biology, Spring 2006

Written by Itai Sharon

Page 2: Microarray analysis

Microarrays

- Measure the expression of genes in the cell
  - "Count" the number of mRNA molecules that attach to biological probes
- Expression data is gathered for many (thousands of) genes at once
- Data is gathered for several experiments
  - Either at several time points or under different conditions

Page 3: Microarray analysis

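The expression matrix (N genes x M experiments):

$$
\begin{pmatrix}
e_{11} & e_{12} & e_{13} & \cdots & e_{1M} \\
e_{21} &        &        &        & e_{2M} \\
e_{31} &        & e_{ij} &        & \vdots \\
\vdots &        &        & \ddots &        \\
e_{N1} & \cdots &        &        & e_{NM}
\end{pmatrix}
$$

Rows correspond to the N genes and columns to the M experiments; $e_{ij}$ is the relative expression of gene i in experiment j.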

Page 4: Microarray analysis

Detecting Patterns in Expression Data

- Genes may have similar expression patterns because
  - They are part of the same complex (protein-protein interactions)
  - They are part of the same pathway
  - They have similar regulatory elements
  - They have similar functions (part of a fail-safe mechanism)
- A popular solution: clustering (already covered)
  - Hierarchical clustering, K-means, agglomerative, ...
- Today: dimensionality reduction
  - PCA
  - SVD

Page 5: Microarray analysis

Why Dimensionality Reduction

- Using irrelevant data may harm accuracy
- Clustering algorithms do not perform well on high-dimensional data
- Visualizing high-dimensional data

[Figure: a data point $(x_i, y_i)$ reduced to a single coordinate $x_i'$]

Page 6: Microarray analysis

Principal Component Analysis (PCA)

- PCA seeks a linear projection that best describes the data in the least-squares sense
- Finds a set of principal components (PCs)
  - A PC defines a projection that encapsulates the maximum amount of variation in a dataset
  - Each PC is orthogonal to all other PCs
- Reduce dimensionality by picking the most informative PCs
  - Namely, to reduce from dimension d to dimension d', pick the d' most informative PCs

Page 7: Microarray analysis

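PCA - Steps

- Input: a dataset

$$S = \{s_1, \ldots, s_n\}, \qquad s_i = (s^i_1, \ldots, s^i_d)$$

- Subtract the mean from each dimension
- Compute the covariance matrix for the d dimensions
  - The covariance of two variables X and Y:

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

  - The covariance matrix: the d x d matrix whose entry for the pair of dimensions (X, Y) is $\operatorname{cov}(X, Y)$; it is symmetric, since $\operatorname{cov}(X, Y) = \operatorname{cov}(Y, X)$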
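The first two steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's implementation; the function names `mean_center` and `covariance_matrix` and the toy dataset are assumptions made for the example.

```python
def mean_center(samples):
    """Subtract the per-dimension mean from every sample (row)."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[k] for s in samples) / n for k in range(d)]
    centered = [[s[k] - means[k] for k in range(d)] for s in samples]
    return centered, means


def covariance_matrix(samples):
    """d x d matrix of cov(X, Y) with the (n - 1) denominator used on the slide."""
    centered, _ = mean_center(samples)
    n, d = len(centered), len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / (n - 1)
             for b in range(d)]
            for a in range(d)]


# Hypothetical toy dataset: n = 3 samples, d = 2 dimensions.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
C = covariance_matrix(toy)
print(C)
```

For this toy input the second dimension is exactly twice the first, so the off-diagonal entries equal twice the variance of the first dimension.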

Page 8: Microarray analysis

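PCA - Steps (cont.)

- Compute the eigenvectors and eigenvalues of the covariance matrix
- Choose the most informative PCs and construct a feature vector
  - Eigenvectors with the highest eigenvalues carry the most information
  - The feature vector is simply the combination of all chosen eigenvectors: FeatureVector = (eig_1, eig_2, ..., eig_d')
- Transform the dataset to the new axis system
  - For each $s \in S$:

$$
s' = \mathrm{FeatureVector}^T \cdot s =
\begin{pmatrix} \mathit{eig}_1 \\ \mathit{eig}_2 \\ \vdots \\ \mathit{eig}_{d'} \end{pmatrix}
\begin{pmatrix} s_1 \\ s_2 \\ \vdots \\ s_d \end{pmatrix},
\qquad
s' = \begin{pmatrix} s'_1 \\ s'_2 \\ \vdots \\ s'_{d'} \end{pmatrix}
$$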
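The remaining steps can be sketched as follows. The slides do not say how the eigenvectors are computed; this sketch swaps in power iteration with deflation, chosen only because it needs no external library. All function names and the toy data are assumptions for illustration.

```python
import math


def covariance_matrix(centered):
    """Covariance matrix of already mean-centered samples (rows)."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def dominant_eigenpair(matrix, iterations=500):
    """Power iteration: approximates the largest eigenvalue and its eigenvector."""
    d = len(matrix)
    vec = [float(k + 1) for k in range(d)]  # arbitrary start vector (assumption)
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[r][c] * vec[c] for c in range(d)) for r in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # matrix maps vec to ~0: nothing left to extract
            break
        vec = [x / norm for x in nxt]
    image = [sum(matrix[r][c] * vec[c] for c in range(d)) for r in range(d)]
    eigenvalue = sum(v * w for v, w in zip(vec, image))  # Rayleigh quotient
    return eigenvalue, vec


def pca(samples, n_components):
    """Return (eigenvalues, feature_vector, projected samples)."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[k] for s in samples) / n for k in range(d)]
    centered = [[s[k] - means[k] for k in range(d)] for s in samples]
    cov = covariance_matrix(centered)
    eigenvalues, feature_vector = [], []
    for _ in range(n_components):
        lam, vec = dominant_eigenpair(cov)
        eigenvalues.append(lam)
        feature_vector.append(vec)
        # Deflation: remove the component just found so the next call
        # converges to the next-largest eigenvalue.
        cov = [[cov[r][c] - lam * vec[r] * vec[c] for c in range(d)]
               for r in range(d)]
    # s' = FeatureVector^T . s for every mean-centered sample s
    projected = [[sum(e[k] * s[k] for k in range(d)) for e in feature_vector]
                 for s in centered]
    return eigenvalues, feature_vector, projected


# Hypothetical toy data: most of the variance lies along the first axis.
toy = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
eigenvalues, feature_vector, projected = pca(toy, n_components=1)
print(eigenvalues, projected)
```

Power iteration can converge slowly when the top eigenvalues are close, so a library eigensolver would normally be preferred on real expression data.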

Page 9: Microarray analysis

When Things Get Messy...

- PCA is fine when the initial dimension is not too big
  - Space and time complexity are $O(d^2)$, the size of the covariance matrix
- Otherwise we have a problem...
  - E.g. when $d = 10^4$, time/space complexity is $O(10^8)$...
- Luckily an alternative exists: SVD
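To make the numbers concrete, here is a small back-of-the-envelope calculation. The 8 bytes per entry (a double-precision float) is an assumption, not something stated on the slide, and `covariance_memory_bytes` is a hypothetical helper name.

```python
def covariance_memory_bytes(d, bytes_per_entry=8):
    """Memory needed to store a dense d x d covariance matrix."""
    return d * d * bytes_per_entry


d = 10 ** 4
entries = d * d
print(entries)                            # 100000000
print(covariance_memory_bytes(d) / 1e6)   # 800.0 (megabytes)
```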

Page 10: Microarray analysis

Eigengenes, Eigenarrays and SVD

- The idea: use the singular value decomposition (SVD) theorem to transform the dataset from the gene/array space to the eigengene/eigenarray space
- Eigengenes, eigenarrays and eigenvalues:
  - Each dimension is represented by an eigengene/eigenarray/eigenvalue triplet
  - Eigenvalues are used for ranking dimensions
- Paper: Alter et al., 2000

Page 11: Microarray analysis

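Singular Value Decomposition (SVD)

Theorem: if E is a real M by N matrix, then there exist orthogonal matrices

$$U = [u_1, \ldots, u_M] \in \mathbb{R}^{M \times M} \quad \text{and} \quad V = [v_1, \ldots, v_N] \in \mathbb{R}^{N \times N}$$

s.t.

$$E = U W V^T$$

where

$$W = \operatorname{diag}(\sigma_1, \ldots, \sigma_p), \qquad p = \min(M, N)$$

and

$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$$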

Page 12: Microarray analysis

SVD

- $\sigma_i$ is the ith singular value of E; $u_i$ and $v_i$ are the ith left singular vector and right singular vector of E, respectively
- It holds that

$$E v_i = \sigma_i u_i, \qquad E^T u_i = \sigma_i v_i, \qquad i = 1, \ldots, \min(M, N)$$

- Efficient algorithms for calculating the SVD exist

Page 13: Microarray analysis

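Orthogonality of Decomposition

$$E = U W V^T$$

$$
\begin{pmatrix}
e_{11} & e_{12} & \cdots & e_{1N} \\
e_{21} & \ddots &        & \vdots \\
\vdots &        &        &        \\
e_{M1} & \cdots &        & e_{MN}
\end{pmatrix}
=
\begin{pmatrix}
u^1_1  & u^2_1  & \cdots & u^M_1  \\
u^1_2  & \ddots &        & \vdots \\
\vdots &        &        &        \\
u^1_M  & \cdots &        & u^M_M
\end{pmatrix}
\begin{pmatrix}
\sigma_1 & 0        & \cdots & 0      \\
0        & \sigma_2 &        & \vdots \\
\vdots   &          & \ddots & 0
\end{pmatrix}
\begin{pmatrix}
v^1_1  & v^1_2  & \cdots & v^1_N  \\
v^2_1  & \ddots &        & \vdots \\
\vdots &        &        &        \\
v^N_1  & \cdots &        & v^N_N
\end{pmatrix}
$$

where

$$U = [u_1, \ldots, u_M], \qquad u_i = (u^i_1, \ldots, u^i_M)^T$$

$$V = [v_1, \ldots, v_N], \qquad v_i = (v^i_1, \ldots, v^i_N)^T$$

$$W = \operatorname{diag}(\sigma_1, \ldots, \sigma_p)$$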

Page 14: Microarray analysis

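Orthogonality of Decomposition

The product of the first two factors (shown for the case M <= N, so that p = M):

$$
U W =
\begin{pmatrix}
\sigma_1 u^1_1 & \sigma_2 u^2_1 & \cdots & \sigma_M u^M_1 & 0      & \cdots & 0      \\
\sigma_1 u^1_2 & \sigma_2 u^2_2 &        & \vdots         & \vdots &        & \vdots \\
\vdots         &                & \ddots &                &        &        &        \\
\sigma_1 u^1_M & \sigma_2 u^2_M & \cdots & \sigma_M u^M_M & 0      & \cdots & 0
\end{pmatrix},
\qquad
V^T =
\begin{pmatrix}
v^1_1  & v^1_2  & \cdots & v^1_N  \\
v^2_1  & \ddots &        & \vdots \\
\vdots &        &        &        \\
v^N_1  & \cdots &        & v^N_N
\end{pmatrix}
$$

$$
U W V^T =
\begin{pmatrix}
e_{11} & e_{12} & \cdots & e_{1N} \\
e_{21} &        & e_{ij} & \vdots \\
\vdots &        &        &        \\
e_{M1} & \cdots &        & e_{MN}
\end{pmatrix}
$$

Entry by entry:

$$e_{ij} = \sum_{k=1}^{p} u^k_i \, \sigma_k \, v^k_j$$

Equivalently, E is a sum of p rank-one matrices:

$$E = \sum_{k=1}^{p} \sigma_k \, u_k v_k^T$$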
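The rank-one sum can be checked numerically on a small hand-built example. The sketch below rebuilds E from given triplets and verifies the relation $E v_k = \sigma_k u_k$ from the SVD slide; it does not compute an SVD. The 2 x 2 triplets and the function names are assumptions chosen so the arithmetic is easy to follow.

```python
def reconstruct(singular_values, left_vectors, right_vectors):
    """E = sum_k sigma_k * u_k * v_k^T, i.e. e_ij = sum_k u_k[i] * sigma_k * v_k[j]."""
    m, n = len(left_vectors[0]), len(right_vectors[0])
    E = [[0.0] * n for _ in range(m)]
    for sigma, u, v in zip(singular_values, left_vectors, right_vectors):
        for i in range(m):
            for j in range(n):
                E[i][j] += sigma * u[i] * v[j]
    return E


def mat_vec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


# Hypothetical triplets: each list entry is one singular vector.
sigmas = [2.0, 1.0]                 # sigma_1 >= sigma_2 >= 0
U_cols = [[1.0, 0.0], [0.0, 1.0]]   # orthonormal u_1, u_2
V_cols = [[0.6, 0.8], [-0.8, 0.6]]  # orthonormal v_1, v_2

E = reconstruct(sigmas, U_cols, V_cols)
print(E)

# Check E v_k = sigma_k u_k for every triplet (up to rounding error).
for sigma, u, v in zip(sigmas, U_cols, V_cols):
    lhs = mat_vec(E, v)
    rhs = [sigma * x for x in u]
    print(all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs)))
```

Dropping the trailing terms of the sum (those with the smallest singular values) gives a low-rank approximation of E, which is the sense in which the decomposition reduces dimensionality.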

Page 15: Microarray analysis

SVD and Microarray analysis

- Reduction from the N genes x M arrays space to the p eigengenes x p eigenarrays space
  - W is the eigenexpression matrix
  - U represents the expression of genes over eigenarrays
  - V represents the expression of eigengenes over arrays
- The "fraction of eigenexpression":

$$p_i = \frac{\sigma_i^2}{\sum_{k=1}^{p} \sigma_k^2}$$

- "Shannon entropy" of the dataset:

$$0 \le d = \frac{-1}{\log(p)} \sum_{k=1}^{p} p_k \log(p_k) \le 1$$
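Both quantities follow directly from the singular values. A minimal sketch, assuming the singular values are already available as a list; the function names are hypothetical, and terms with $p_k = 0$ are skipped on the usual convention that $0 \log 0 = 0$.

```python
import math


def eigenexpression_fractions(singular_values):
    """p_i = sigma_i^2 / sum_k sigma_k^2."""
    total = sum(s * s for s in singular_values)
    return [s * s / total for s in singular_values]


def shannon_entropy(singular_values):
    """d = -1/log(p) * sum_k p_k log(p_k), with p the number of singular values (p > 1)."""
    p = len(singular_values)
    fractions = eigenexpression_fractions(singular_values)
    return -sum(f * math.log(f) for f in fractions if f > 0) / math.log(p)


# Hypothetical singular values.
print(eigenexpression_fractions([4.0, 3.0]))
print(shannon_entropy([1.0, 1.0, 1.0, 1.0]))  # evenly spread expression
print(shannon_entropy([3.0, 0.0, 0.0]))       # one dominant eigengene
```

An entropy near 0 means a single eigengene captures almost all of the expression, while a value near 1 means the expression is spread evenly over all eigengenes.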

Page 16: Microarray analysis

Example: Cell cycle of Saccharomyces cerevisiae

- Data is available for 5981 genes over 14 time steps (at ½-hour intervals)
- 784 genes were classified as cell-cycle regulated (with no missing values)

Page 17: Microarray analysis

Data Sorting

- For eigengenes 1 and 2, plot the correlation of each gene with both on a 2-D plot
  - The X-axis represents the correlation with eigengene 2, the Y-axis the correlation with eigengene 1
- Sort the genes by angular distance
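The sorting step can be sketched as follows. This is an illustration under stated assumptions: Pearson correlation is used as the correlation measure, the angle is measured counter-clockwise from the positive X-axis, and the eigengene profiles and gene data are made up for the example.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def sort_genes_by_angle(expression, eigengene1, eigengene2):
    """Return gene indices ordered by angle in the (corr. with 2, corr. with 1) plane."""
    keyed = []
    for index, profile in enumerate(expression):
        x = pearson(profile, eigengene2)  # X-axis: correlation with eigengene 2
        y = pearson(profile, eigengene1)  # Y-axis: correlation with eigengene 1
        angle = math.atan2(y, x) % (2 * math.pi)  # map to [0, 2*pi)
        keyed.append((angle, index))
    return [index for _, index in sorted(keyed)]


# Hypothetical eigengenes: a sine-like and a cosine-like profile over 4 time points.
eigengene1 = [0.0, 1.0, 0.0, -1.0]
eigengene2 = [1.0, 0.0, -1.0, 0.0]
genes = [
    [0.0, 1.0, 0.0, -1.0],   # gene 0: follows eigengene 1
    [1.0, 0.0, -1.0, 0.0],   # gene 1: follows eigengene 2
    [-1.0, 0.0, 1.0, 0.0],   # gene 2: opposite of eigengene 2
    [0.0, -1.0, 0.0, 1.0],   # gene 3: opposite of eigengene 1
]
order = sort_genes_by_angle(genes, eigengene1, eigengene2)
print(order)
```

With periodic eigengenes like these, the angle acts as a phase, so sorting by it arranges the genes by when in the cycle their expression peaks.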

Page 18: Microarray analysis

Further Reading

- PCA:
  - L. Smith: A Tutorial on Principal Components Analysis
- Eigengenes, eigenvectors and SVD:
  - O. Alter, P. Brown & D. Botstein: Singular Value Decomposition for Genome-wide Expression Data Processing and Modeling, PNAS 97:18, 2000