3/3/08 CAP5510 1 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 [email protected] www.cis.fiu.edu/~giri/teach/BioinfS07.html

CAP 5510: Introduction to Bioinformatics

Page 1: CAP 5510: Introduction to Bioinformatics

3/3/08 CAP5510 1

CAP 5510: Introduction to Bioinformatics

Giri Narasimhan

ECS 254; Phone: x3748

[email protected]

www.cis.fiu.edu/~giri/teach/BioinfS07.html

Page 2: CAP 5510: Introduction to Bioinformatics

Gene g

Probe 1 Probe 2 Probe N…

Page 3: CAP 5510: Introduction to Bioinformatics

Microarray Data

Table columns: Gene, Expression Level

Rows: Gene1, Gene2, Gene3

Page 4: CAP 5510: Introduction to Bioinformatics

Sample

Treated Sample(t1) Expt 1

Treated Sample(t2) Expt 2

Treated Sample(t3) Expt 3

Treated Sample(tn) Expt n

Study effect of treatment over time

Page 5: CAP 5510: Introduction to Bioinformatics

http://www.arabidopsis.org/info/2010_projects/comp_proj/AFGC/RevisedAFGC/Friday/

Page 6: CAP 5510: Introduction to Bioinformatics

How to compare 2 cell samples with Two-Color Microarrays?

mRNA from sample 1 is extracted and labeled with a red fluorescent dye.

mRNA from sample 2 is extracted and labeled with a green fluorescent dye.

Mix the samples and apply the mixture to every spot on the microarray. Hybridize the sample mixture to the probes.

Use an optical detector to measure the amount of green and red fluorescence at each spot.
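
Once the red and green fluorescence at a spot has been measured, one common way to compare the two samples is the base-2 log ratio of the two channels. The sketch below assumes background-corrected, strictly positive intensities. The function name `log_ratio` and the intensity values are hypothetical and do not come from the slides.

```python
import math

def log_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical (red, green) intensities for three spots.
spots = {"Gene1": (800.0, 200.0), "Gene2": (150.0, 150.0), "Gene3": (100.0, 400.0)}
ratios = {gene: log_ratio(r, g) for gene, (r, g) in spots.items()}
```

A positive value means the gene is more abundant in sample 1 (red), a negative value means it is more abundant in sample 2 (green), and zero means equal abundance.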

Page 7: CAP 5510: Introduction to Bioinformatics

Sources of Variations & Experimental Errors

Variations in cells/individuals

Variations in mRNA extraction, isolation, introduction of dye, variation in dye incorporation, dye interference

Variations in probe concentration, probe amounts, substrate surface characteristics

Variations in hybridization conditions and kinetics

Variations in optical measurements, spot misalignments, discretization effects, noise due to scanner lens and laser irregularities

Cross-hybridization of sequences with high sequence identity

Limit of factor 2 in precision of results

Variation changes with intensity: larger variation at low or high expression levels

Need to Normalize data
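
The slides do not say which normalization method is used. As a minimal sketch, assuming that most genes on an array are not differentially expressed, the log ratios of one array can be shifted so that their median is zero. The function name `median_center` and the values are hypothetical.

```python
from statistics import median

def median_center(log_ratios):
    """Subtract the array-wide median so the centered log ratios have median 0."""
    m = median(log_ratios)
    return [v - m for v in log_ratios]

# Hypothetical log2(red/green) values carrying a dye bias of about +0.5.
raw = [0.5, 0.4, 0.6, 2.5, -1.5]
centered = median_center(raw)
```

This removes only a constant offset between the two channels. It does not address the intensity-dependent variation listed above, which would need an intensity-aware method.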

Page 8: CAP 5510: Introduction to Bioinformatics

Clustering

Clustering is a general method to study patterns in gene expression.

Several known methods:

Hierarchical Clustering (Bottom-Up Approach)

K-means Clustering (Top-Down Approach)

Self-Organizing Maps (SOM)

Page 9: CAP 5510: Introduction to Bioinformatics

Hierarchical Clustering: Example

[Figure: three plots, each with both axes running from 0 to 10]

Page 10: CAP 5510: Introduction to Bioinformatics

A Dendrogram

Page 11: CAP 5510: Introduction to Bioinformatics

Hierarchical Clustering [Johnson, SC, 1967]

Given n points in R^d, compute the distance between every pair of points.

While (not done)

Pick the closest pair of points s_i and s_j and make them part of the same cluster.

Replace the pair by an average of the two, s_ij.

Try the applet at: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
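
The loop above can be sketched in a few lines of Python. This is an illustrative version, not the slide author's code. It assumes Euclidean distance and replaces a merged pair by the size-weighted average of the two clusters, whereas the slide only says "an average". The name `hierarchical_cluster` and the sample points are hypothetical.

```python
import math
from itertools import combinations

def hierarchical_cluster(points):
    """Agglomerative clustering: repeatedly merge the closest pair of clusters.

    Each cluster is stored as (centroid, size, member_indices).
    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [(tuple(p), 1, (i,)) for i, p in enumerate(points)]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current cluster centroids.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: math.dist(clusters[ab[0]][0], clusters[ab[1]][0]))
        (ca, na, ma), (cb, nb, mb) = clusters[a], clusters[b]
        merges.append((ma, mb, math.dist(ca, cb)))
        # Replace the pair by its size-weighted average.
        centroid = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((centroid, na + nb, ma + mb))
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = hierarchical_cluster(points)
```

The recorded merge order and distances are exactly the information a dendrogram displays. Recomputing all pairwise distances on every pass is simple but slow, which matches the "not very efficient" remark made later in the comparison.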

Page 12: CAP 5510: Introduction to Bioinformatics

Distance Metrics

For clustering, define a distance function:

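Euclidean distance metrics:

\[ D_k(X, Y) = \left( \sum_{i=1}^{d} |X_i - Y_i|^k \right)^{1/k} \]

k=2: Euclidean Distance

Pearson correlation coefficient:

\[ \rho_{xy} = \frac{1}{d} \sum_{i=1}^{d} \left( \frac{X_i - \bar{X}}{\sigma_x} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_y} \right) \]

\[ -1 \le \rho_{xy} \le 1 \]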
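
Both measures are short to compute directly. The sketch below assumes the standard definitions: the distance D_k with absolute differences, and the Pearson coefficient with population standard deviations (dividing by d). The function names and the vectors g1, g2, g3 are hypothetical.

```python
import math

def minkowski(x, y, k=2):
    """D_k(X, Y) = (sum_i |X_i - Y_i|^k)^(1/k); k=2 gives Euclidean distance."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)

def pearson(x, y):
    """Pearson correlation using population standard deviations (divide by d)."""
    d = len(x)
    mx, my = sum(x) / d, sum(y) / d
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / d)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / d)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (d * sx * sy)

g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]   # same shape as g1, twice the magnitude
g3 = [4.0, 3.0, 2.0, 1.0]   # g1 reversed
```

Here g1 and g2 are far apart in Euclidean distance yet perfectly correlated, while g1 and g3 are perfectly anti-correlated. The choice of measure therefore decides whether clustering groups genes by magnitude or by the shape of their expression profile.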

Page 13: CAP 5510: Introduction to Bioinformatics

Page 14: CAP 5510: Introduction to Bioinformatics

Clustering of gene expressions

Represent each gene as a vector or a point in d-space, where d is the number of arrays or experiments being analyzed.
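
As a small illustration of this representation, each gene's measurements across the experiments can be stored as one tuple. The gene names, experiment labels, and values below are hypothetical.

```python
# Hypothetical expression levels: one row per gene, one column per experiment.
experiments = ["Expt 1", "Expt 2", "Expt 3"]
expression = {
    "Gene1": [2.1, 0.3, -1.2],
    "Gene2": [1.9, 0.4, -1.0],
    "Gene3": [-0.8, 0.0, 1.5],
}

d = len(experiments)  # dimensionality of the space
# Each gene is a point in d-space; collect the points in a fixed gene order.
genes = sorted(expression)
points = [tuple(expression[g]) for g in genes]
assert all(len(p) == d for p in points)
```

A list of points in this form is the kind of input a clustering routine operates on.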

Page 15: CAP 5510: Introduction to Bioinformatics

From Eisen MB, et al, PNAS 1998 95(25):14863-8

Clustering Random vs. Biological Data

Page 16: CAP 5510: Introduction to Bioinformatics

K-Means Clustering: Example

Example from Andrew Moore’s tutorial on Clustering.

Page 17: CAP 5510: Introduction to Bioinformatics

Start

Page 18: CAP 5510: Introduction to Bioinformatics

Page 19: CAP 5510: Introduction to Bioinformatics

Page 20: CAP 5510: Introduction to Bioinformatics

Start

End

Page 21: CAP 5510: Introduction to Bioinformatics

K-Means Clustering [MacQueen 67]

Start with randomly chosen cluster centers

Repeat

Assign points to give greatest increase in score

Recompute cluster centers

Reassign points

until (no changes)

Try the applet at: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
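
The iteration can be sketched as follows. This is a minimal version, not the slide author's code. It assumes Euclidean distance and takes "greatest increase in score" to mean assigning each point to its nearest center, which is the usual reading. The function name, the `seed` and `max_iter` parameters, and the sample points are hypothetical.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """K-means: alternate point assignment and center recomputation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # randomly chosen initial centers
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:      # no changes: stop
            break
        assignment = new_assignment
        # Recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                       # keep the old center if a cluster empties
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = kmeans(points, k=2)
```

On well-separated data such as this the two groups are recovered regardless of the starting centers. On harder data the result depends on the random start, so it is common to run the procedure several times and keep the best clustering.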

Page 22: CAP 5510: Introduction to Bioinformatics

Comparisons

Hierarchical clustering

Number of clusters not preset.

Complete hierarchy of clusters

Not very robust, not very efficient.

K-Means

Need definition of a mean. Categorical data?

More efficient and often finds a good clustering, but is only guaranteed to reach a local optimum.

Page 23: CAP 5510: Introduction to Bioinformatics

Functionally related genes behave similarly across experiments

Page 24: CAP 5510: Introduction to Bioinformatics

Self-Organizing Maps [Kohonen]

Kind of neural network.

Clusters data and finds complex relationships between clusters.

Helps reduce the dimensionality of the data.

Map of 1 or 2 dimensions produced.

Unsupervised Clustering

Like K-Means, except for visualization

Page 25: CAP 5510: Introduction to Bioinformatics

SOM Architectures

2-D Grid

3-D Grid

Hexagonal Grid

Page 26: CAP 5510: Introduction to Bioinformatics

SOM Algorithm

Select SOM architecture, and initialize weight vectors and other parameters.

While (stopping condition not satisfied) do for each input point x

winning node q has weight vector closest to x.

Update weight vector of q and its neighbors.

Reduce neighborhood size and learning rate.

Page 27: CAP 5510: Introduction to Bioinformatics

SOM Algorithm Details

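Distance between x and weight vector:

\[ \lVert x - w_i \rVert \]

Winning node:

\[ q(x) = \arg\min_i \lVert x - w_i \rVert \]

Weight update function (for neighbors):

\[ w_i(k+1) = w_i(k) + \mu(k, x, i)\,[x(k) - w_i(k)] \]

Learning rate:

\[ \mu(k, x, i) = \eta_0(k) \exp\left( -\frac{\lVert r_i - r_{q(x)} \rVert^2}{\sigma^2} \right) \]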
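
A minimal sketch of SOM training on a 1-D map follows, assuming Euclidean distance, a Gaussian neighborhood over grid positions, and a linear decay of both learning rate and neighborhood width. The decay schedule, the parameter names `eta0` and `sigma0`, the function names, and the data are assumptions made for illustration, not taken from the slides.

```python
import math
import random

def winner(weights, x):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))

def train_som(data, n_nodes, epochs=50, eta0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D SOM; node i sits at grid position i."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        # Reduce learning rate and neighborhood size as training proceeds.
        decay = 1.0 - epoch / epochs
        eta, sigma = eta0 * decay, max(sigma0 * decay, 0.1)
        for x in data:
            q = winner(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighborhood on the grid distance between i and q.
                mu = eta * math.exp(-((i - q) ** 2) / sigma ** 2)
                weights[i] = [wj + mu * (xj - wj) for wj, xj in zip(w, x)]
    return weights

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
weights = train_som(data, n_nodes=4)
```

After training, calling `winner(weights, x)` maps any input to a position on the map. Because neighboring nodes are updated together, nearby positions tend to hold similar weight vectors, which is what makes the map useful for visualization.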

Page 28: CAP 5510: Introduction to Bioinformatics

World Bank Statistics

Data: World Bank statistics of countries in 1992.

39 indicators considered, e.g., health, nutrition, educational services, etc.

The complex joint effect of these factors can be visualized by organizing the countries using the self-organizing map.

Page 29: CAP 5510: Introduction to Bioinformatics

World Poverty PCA

Page 30: CAP 5510: Introduction to Bioinformatics

World Poverty SOM

Page 31: CAP 5510: Introduction to Bioinformatics

World Poverty Map

Page 32: CAP 5510: Introduction to Bioinformatics

Page 33: CAP 5510: Introduction to Bioinformatics

Page 34: CAP 5510: Introduction to Bioinformatics

Viewing SOM Clusters on PCA axes

Page 35: CAP 5510: Introduction to Bioinformatics

SOM Example [Xiao-rui He]

[Figure: grid with one axis labeled 1 to 5 and the other 1 to 6]

Page 36: CAP 5510: Introduction to Bioinformatics

Neural Networks

Input X

Synaptic Weights W

ƒ(•)

Bias

Output y
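
The labels above describe a single artificial neuron, whose output can be written as y = ƒ(W·X + bias). The sketch below assumes a logistic sigmoid for ƒ when none is given, since the slide leaves the activation function open. The function name `neuron` and the numbers are hypothetical.

```python
import math

def neuron(x, w, bias, f=None):
    """Output y = f(sum_i w_i * x_i + bias) for a single artificial neuron."""
    if f is None:
        f = lambda v: 1.0 / (1.0 + math.exp(-v))   # logistic sigmoid (assumed)
    return f(sum(wi * xi for wi, xi in zip(w, x)) + bias)

y = neuron(x=[1.0, 2.0], w=[0.5, -0.25], bias=0.0)
```

Passing a different `f` (for example a step or identity function) changes the neuron's behavior without changing the weighted-sum structure.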

Page 37: CAP 5510: Introduction to Bioinformatics

Learning NN

[Figure: block diagram with labels Input X, Weights W, Desired Response, Error, and Adaptive Algorithm]
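
The diagram labels suggest error-driven weight adaptation: the output is compared with a desired response and the error is fed to an adaptive algorithm that adjusts W. The slide does not name a specific rule. One classic instance, shown here purely as an assumed example, is the delta (LMS) rule for a linear unit. The function name `lms_step`, the learning rate, and the training samples are hypothetical.

```python
def lms_step(w, x, desired, lr=0.1):
    """One delta-rule update for a linear unit: w <- w + lr * error * x."""
    output = sum(wi * xi for wi, xi in zip(w, x))
    error = desired - output
    return [wi + lr * error * xi for wi, xi in zip(w, x)], error

# Hypothetical samples of y = 2*x1 - x2; the last input is a constant 1 for the bias.
samples = [([1.0, 0.0, 1.0], 2.0), ([0.0, 1.0, 1.0], -1.0),
           ([1.0, 1.0, 1.0], 1.0), ([2.0, 1.0, 1.0], 3.0)]
w = [0.0, 0.0, 0.0]
for _ in range(500):
    for x, d in samples:
        w, err = lms_step(w, x, d)
```

After repeated passes over the samples the weights approach 2, -1, and 0, and the error on each sample shrinks toward zero.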

Page 38: CAP 5510: Introduction to Bioinformatics

Types of NNs

Recurrent NN

Feed-forward NN

Layered

Other issues

Hidden layers possible

Different activation functions possible

Page 39: CAP 5510: Introduction to Bioinformatics

Application: Secondary Structure Prediction