
Computational Genomics

Fall 2005/6 

www.cs.tau.ac.il/~bchor/CG05/comp-genom.html

Lecturers: Benny Chor (benny AT cs.tau.ac.il) Eytan Ruppin (ruppin AT post.tau.ac.il )

TA: Tomer Shlomi (shlomito AT post.tau.ac.il)

Lectures: Wednesday 11:00-14:00, Kaploon 324 Tutorials: Sunday 15:00-16:00, Kaploon 118

Course Information

Requirements & Grades:

• 20-25% homework, in five to six assignments containing both “dry” and “wet” problems. Submission is due two weeks from posting. Homework submission is obligatory. You are strongly encouraged to solve the assignments independently (or at least give it a serious try).

• 75-80% exam. The exam grade must be above 55 for the homework grade to count.

Bibliography

Biological Sequence Analysis, R. Durbin et al., Cambridge University Press, 1998.

Introduction to Computational Molecular Biology, J. Setubal and J. Meidanis, PWS Publishing Company, 1997.

Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, D. Gusfield, Cambridge University Press, 1997.

Post-genome Informatics, M. Kanehisa, Oxford University Press, 2000.

More refs on course page.

Course Prerequisites

Computer Science and Probability Background

• Computational Models
• Algorithms (“efficiency of computation”)
• Probability (any course)

Some Biology Background

• Formally: None, to allow CS students to take this course.
• Recommended: Some molecular biology course, and/or a serious desire to complement your knowledge in Biology by reading the appropriate material.

Studying the algorithms in this course while acquiring enough biology background is far more rewarding than ignoring the biological context.


What is Computational Biology?

Computational biology is the application of computational tools and techniques to molecular biology (primarily).  It enables new ways of study in life sciences, allowing analytic and predictive methodologies that support and enhance laboratory work. It is a multidisciplinary area of study that combines Biology, Computer Science, and Statistics.

Computational biology is also called Bioinformatics, although many practitioners define Bioinformatics somewhat more narrowly, restricting it to the application of specialized software for deducing meaningful biological information.

Why Bioinformatics?

An explosive growth in the amount of biological information necessitates the use of computers for cataloging, retrieving and analyzing mega-data (> 3 billion bps, > 30,000 genes).

• The human genome project.

• Improved technologies, e.g. automated sequencing.

• GenBank is now approximately doubling every year !!!

New Biotechnologies & Data

• Micro arrays - gene expression.

• 2D gels - protein expression.

• Multi-level maps - genetic, physical: sequence, annotation.

• Networks of protein-protein interactions.

• Cross-species relationships:
  • Homologous genes.
  • Chromosome organization.

http://www.the-scientist.com/yr2002/apr/research020415.html

BioInformatics Tools are Crucial!

• New biotechnology tools generate explosive growth in the amount of biological data.

• Impossible to analyze the data manually.

• Novel mathematical, statistical, algorithmic and computational tools are necessary!

Areas of Interest (very partial list)

• Building evolutionary trees from molecular (and other) data
• Efficiently reconstructing the genome sequence from sub-parts (mapping, assembly, etc.)
• Understanding the structure of genomes (Genes, SNP, SSR)
• Understanding function of genes in the cell cycle and disease
• Deciphering structure and function of proteins
• Diagnosing cancer based on DNA microarrays (“chips”)

SNP: Single Nucleotide Polymorphism
SSR: Simple Sequence Repeat

Much of this class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor. Additional slides from Zohar Yakhini and Metsada Pasmanik.

Growth of DNA Sequence Data: GenBank

[Figure: GenBank growth as of Dec 19, 2001. Axes: sequences (millions) and base pairs of DNA (millions).]

Most Sequenced Organisms: Human (51%), Mouse (15%), Fruit Fly (4%), Rat (4%), Rice (2%), Wheat (2%), Worm (1%), Chimp (1%), Others (18%).

The Protein Data Bank: PDB Content Growth

[Figure: growth of PDB content (experimentally determined structures). Source: http://www.rcsb.org/pdb/]

Four Aspects

Biological: What is the task?

Algorithmic: How to perform the task at hand efficiently?

Learning: How to adapt/estimate/learn parameters and models describing the task from examples?

Statistics: How to differentiate true phenomena from artifacts?

Example: Sequence Comparison

Biological: Evolution preserves sequences, thus similar genes might have similar function.

Algorithmic: Consider all ways to “align” one sequence against another.

Learning: How do we define “similar” sequences? Use examples to define similarity.

Statistical: When we compare to ~10^6 sequences, what is a random match and what is a true one?
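To make the algorithmic aspect concrete, here is a minimal sketch of scoring the best global alignment of two sequences by dynamic programming (the Needleman-Wunsch recurrence). The scoring values (match = 1, mismatch = -1, gap = -2) and the function name are illustrative assumptions, not parameters given in these slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch recurrence).

    The scoring parameters are illustrative assumptions.
    """
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))  # three matches and one gap
```

Although the number of possible alignments grows exponentially, the table has only (len(a)+1) x (len(b)+1) cells, so all alignments are implicitly considered in quadratic time.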

Topics I

Dealing with DNA/Protein sequences:

• Finding similar sequences
• Models of sequences: Hidden Markov Models
• Genome projects and how sequences are found
• Transcription regulation
• Protein Families
• Gene finding

Topics II

High throughput biotechnologies - potentials and computational challenges:

• DNA microarrays
• Applications to diagnostics
• Applications to understanding gene networks

Topics III (Structural BioInfo Course)

Protein World:

• How proteins fold - secondary & tertiary structure
• How to predict protein folds from sequence data
• How to predict protein function from its structure
• How to analyze protein changes from raw experimental measurements (MassSpec)

Algorithmics

Will introduce algorithmic techniques that are useful in computational genomics (and elsewhere):

• Dynamic programming, dynamic programming, dynamic...
• Suffix trees and arrays
• Probabilistic models: PSSM (Position Specific Scoring Matrices), HMM (Hidden Markov Models)
• Learning and classification, SVM (Support Vector Machines)
• Heuristics for solving hard optimization problems (many problems in comp. genomics are NP-hard)
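As a small taste of the suffix array idea from the list above, here is a sketch that builds a suffix array naively (by sorting suffixes) and then finds all occurrences of a pattern by binary search. The function names are hypothetical, and the naive construction is chosen for brevity; efficient constructions exist but are not shown.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, via binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "GATTACATTA"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "TTA"))
```

Once the array is built, each query needs only a logarithmic number of string comparisons, which is what makes the structure attractive for long genomic texts.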

Human Genome

Most human cells contain 46 chromosomes:

• 2 sex chromosomes (X, Y): XY in males, XX in females.

• 22 pairs of chromosomes named autosomes.


Watson and Crick

… On Feb. 28, 1953, Francis Crick walked into the Eagle pub in Cambridge, England, and, as James Watson later recalled, announced that "we had found the secret of life."

"The structure was too pretty not to be true."

-- JAMES D. WATSON, "The Double Helix"

DNA - the Code for Life (1953)

[Photographs. One caption reads “1920-1958, died from ovarian cancer”; the dates match Rosalind Franklin.]

http://www.nobel.se/medicine/laureates/1962/index.html

The Double Helix

[Figure: the DNA double helix. Source: Alberts et al.]

The Central Dogma of Molecular Biology

[Figure: DNA (replication) → transcription → mRNA → translation → protein → phenotype.]

Watson-Crick Complementarity

[Table: base ratios by DNA source (Human, Sheep, Turtle, Sea urchin, Wheat, E. coli), giving the % of each base and the purines/pyrimidines ratio.]

Conclusion: DNA strands are complementary (1953).
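The link between complementarity and base ratios can be checked with a short sketch: if the second strand is the Watson-Crick complement of the first, the double-stranded molecule must contain as many A as T and as many G as C, so purines and pyrimidines balance. The example strand and the function names are made up for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def duplex_base_counts(strand):
    """Base counts over both strands of the double helix."""
    return Counter(strand + reverse_complement(strand))


counts = duplex_base_counts("AATGCTTAGTCGGA")  # arbitrary example strand
purines = counts["A"] + counts["G"]
pyrimidines = counts["C"] + counts["T"]
print(counts, purines / pyrimidines)
```

Whatever single strand is used as input, the duplex counts satisfy A = T and G = C, which is the pattern the base-ratio table points to.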

Genome Sizes

E. coli (bacteria): 4.6 x 10^6 bases
Yeast (simple fungi): 15 x 10^6 bases
Smallest human chromosome: 50 x 10^6 bases
Entire human genome: 3 x 10^9 bases


Genetic Information

Genome – the collection of genetic information.

Chromosomes – storage units of genes.

Gene – basic unit of genetic information. Genes determine the inherited characteristics.

What is a Gene?

DNA contains various recognition sites:

• Promoter signals.
• Transcription start signals.
• Start codon.
• Exon, intron boundaries.
• Transcription termination signal.

[Figure: gene structure, with labels: promoter, start codon, terminal codon, transcribed region, two non-coding regions, three exons separated by two introns.]


Control of the Human -Globin Gene

Alternative Splicing

Genes: How Many?

The DNA strings include:

• Coding regions (“genes”)
  • E. coli has ~4,000 genes
  • Yeast has ~6,000 genes
  • C. elegans has ~13,000 genes
  • Humans have ~32,000 genes

• Control regions
  • These typically are adjacent to the genes
  • They determine when a gene should be “expressed”

• So-called “junk” DNA (unknown function; ~90% of the DNA in human chromosomes)

Gene Finding

• Only 4% of the human genome encodes functional genes.

• Genes are found along large non-coding DNA regions.

• Repeats, pseudo-genes, introns, and vector contamination are confusing.

Gene Finding

Existing programs for locating genes within genomic sequences utilize a number of statistical signals and employ statistical models such as hidden Markov models (HMMs).

The problem is not solved yet, especially for the newly discovered “RNA genes”.
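For intuition only, here is a deliberately naive sketch of the simplest gene-finding signal: scanning one strand for open reading frames (ORFs) that run from an ATG start codon to the first in-frame stop codon. This is not how the HMM-based programs mentioned above work; it ignores introns, the reverse strand, and all statistical evidence. The function name and the `min_codons` threshold are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs of ORFs on the given strand.

    An ORF starts at ATG and runs in frame to the first stop codon.
    end is exclusive and includes the stop codon. min_codons counts the
    codons before the stop, including the ATG.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


dna = "CCATGAAATGATT"  # toy sequence
print([dna[s:e] for s, e in find_orfs(dna)])
```

On real genomes such a scan produces many spurious candidates, which is one reason statistical models are needed.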


Diversity of Tissues in Stomach

How is this variety encoded and expressed ?

Central Dogma

Gene → Transcription → mRNA → Translation → Protein

Cells express different subsets of their genes in different tissues and under different conditions.

Transcription

• Coding sequences can be transcribed to RNA.

• RNA nucleotides: similar to DNA, with a slightly different backbone, and Uracil (U) instead of Thymine (T).

[Figure. Source: Mathews & van Holde]
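At the level of sequences, transcription can be sketched in a few lines. The mRNA reads like the coding strand with U in place of T; equivalently, it is the reverse complement of the template strand. Both function names are hypothetical, and both inputs are assumed to be written 5' to 3'.

```python
def transcribe(coding_strand):
    """mRNA with the same sequence as the coding strand, with U for T."""
    return coding_strand.upper().replace("T", "U")


def transcribe_template(template_strand):
    """mRNA built by pairing against the template strand (given 5' to 3')."""
    pairs = {"A": "U", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(template_strand.upper()))


print(transcribe("ATGGCT"))           # from the coding strand
print(transcribe_template("AGCCAT"))  # from the matching template strand
```

Both calls describe the same toy gene, so they yield the same mRNA.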

Transcription: RNA Editing

1. Transcribe to RNA
2. Eliminate introns
3. Splice (connect) exons
* Alternative splicing exists

Exons hold information; they are more stable during evolution. This process takes place in the nucleus. The mRNA molecules then pass through the nuclear membrane into the cytoplasm.
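The intron removal and exon joining on this slide can be sketched as plain string slicing, assuming the exon coordinates are already known (finding them is the hard part). The toy transcript, its coordinates, and the function name are invented for illustration.

```python
def splice(pre_mrna, exons):
    """Concatenate the exon intervals (start, end), end exclusive, in order."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical toy transcript: three exons separated by two introns.
exon1, intron1, exon2, intron2, exon3 = "AUGGCC", "GUAAGUAG", "AAACCC", "GUUUAG", "GGGUAA"
pre_mrna = exon1 + intron1 + exon2 + intron2 + exon3

# Exon coordinates within the pre-mRNA (assumed known here).
exons = [(0, 6), (14, 20), (26, 32)]

mature = splice(pre_mrna, exons)                  # all three exons
skipped = splice(pre_mrna, [exons[0], exons[2]])  # alternative splicing: exon 2 skipped
print(mature, skipped)
```

The second call illustrates alternative splicing: the same pre-mRNA yields a different mature mRNA when an exon is left out.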

RNA Roles

Messenger RNA (mRNA): Encodes protein sequences. Every three nucleotides (a codon) translate to one amino acid (the protein building block).

Transfer RNA (tRNA): Decodes the mRNA molecules to amino acids. It connects to the mRNA on one side and holds the appropriate amino acid on its other side.

Ribosomal RNA (rRNA): Part of the ribosome, a machine for translating mRNA to proteins. It catalyzes (like enzymes) the reaction that attaches the amino acid hanging from the tRNA to the amino acid chain being created.

...

New Roles of RNA: Cellular Regulation

http://www.nature.com/nature/journal/v408/n6808/fig_tab/408037a0_F1.html

http://www.sciencemag.org/content/vol298/issue5602/cover.shtml

COVER: Researchers are discovering that small RNA molecules play a surprising variety of key roles in cells. They can inhibit translation of messenger RNA into protein, cause degradation of other messenger RNAs, and even initiate complete silencing of gene expression from the genome.


Translation in Eukaryotes

http://www1.imim.es/courses/Lisboa01/slide1.6_translation.html

Animation: http://cbms.st-and.ac.uk/academics/ryan/Teaching/medsci/Medsci6.htm

Translation

• Translation is mediated by the ribosome.
• The ribosome is a complex of protein & rRNA molecules.
• The ribosome attaches to the mRNA at a translation initiation site.
• The ribosome then moves along the mRNA sequence and in the process constructs a sequence of amino acids (polypeptide), which is released and folds into a protein.

Genetic Code

There are 20 amino acids from which proteins are built.
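A compact sketch of the standard genetic code as a lookup table follows. The 64-letter string lists the amino acid (one-letter code, `*` for stop) for each codon in T, C, A, G order, a common way of writing the standard table. The function name is hypothetical, and translation here simply starts at the first base and stops at the first stop codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA (or DNA) string codon by codon until a stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCAAACCCGGGUAA"))
```

The table maps 64 codons onto 20 amino acids plus stop, so several codons encode the same amino acid.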


Protein Structure

Proteins are poly-peptides of 70-3000 amino-acids

This structure is (mostly) determined by the sequence of amino-acids that make up the protein


Protein Structure


The Central Paradigm of Bio-informatics

Genetic information → Molecular structure → Biochemical function → Symptoms

Similarity Search in Databanks

Find sequences similar to a working draft.

As databanks grow, finding homologies gets harder, and quality is reduced.

Alignment Tools: BLAST & FASTA (time-saving heuristic approximations).

Pairwise alignment:

>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus

Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
           |||||||||||||||| | ||| | ||| || ||| | ||||  ||||| |||||||||
Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
           |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
            |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
           ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
           || || ||||| ||  ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
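The "Identities" and "Gaps" figures in a report like this one can be recomputed from the aligned rows. The sketch below does so for the first Query/Sbjct pair on this slide; the function name is hypothetical. The 86% in the header is consistent with 258/297 (about 86.9%) being rounded down.

```python
def alignment_stats(query_row, subject_row):
    """Return (matches, gaps, length) for two equal-length aligned rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have equal length")
    pairs = list(zip(query_row, subject_row))
    matches = sum(q == s and q != "-" for q, s in pairs)
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    return matches, gaps, len(pairs)


# First row of the alignment shown on this slide.
query = "aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca"
sbjct = "aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc"

matches, gaps, length = alignment_stats(query, sbjct)
print(matches, gaps, length)
print(100 * 258 // 297)  # whole-alignment percentage, rounded down
```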


Multiple Sequence Alignment

Multiple alignment: Basis for phylogenetic tree construction. Useful to find protein families and functional domains.

Evolution

Evolution - a process in which small changes occur within species over time.

These changes are mainly monitored today using molecular sequences (DNA/proteins).

The Tree of Life: A classical, basic science problem, since Darwin’s 1859 “Origin of Species”.

Evolution

• Related organisms have similar DNA
  • Similarity in sequences of proteins
  • Similarity in organization of genes along the chromosomes

• Evolution plays a major role in biology
  • Many mechanisms are shared across a wide range of organisms
  • During the course of evolution existing components are adapted for new functions

The Tree of Life

[Figure: the tree of life. Source: Alberts et al.]

Phylogeny Reconstruction

Goal: Given a set of species, reconstruct the tree which best explains their evolutionary history.

Trees are Based on What?

Today most phylogenetic trees are based on molecular sequence data (DNA or proteins).

Darwin (Origin of Species, 1859) and his contemporaries based their work on morphological and physiological properties (e.g. cold/warm blood, existence of scales, number of teeth, existence of wings, etc.). Paleontological data is still in use when constructing trees for certain extinct species (e.g. dinosaurs, mammoths, moas, unicorns, etc.).


www.tomchalk.com/evolution.gif

Evolution

Example for Phylogenetic Analysis

Input: four nucleotide sequences: AAG, AAA, GGA, AGA taken from four species.

Question: Which evolutionary tree best explains these sequences?

One Answer (the parsimony principle): Pick a tree that has a minimum total number of substitutions of symbols between species and their originator in the evolutionary tree (also called phylogenetic tree).

[Figure: a candidate tree with leaves AGA, AAA, GGA, AAG and internal nodes all labeled AAA; edge labels 2, 1, 1. Total #substitutions = 4.]
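The parsimony count for a given tree shape can be computed with Fitch's algorithm, a standard method that these slides do not name. The sketch below assumes rooted binary trees written as nested tuples with the sequences at the leaves; the three groupings tried are illustrative choices, not necessarily the tree drawn on the slide.

```python
def fitch_site(tree, site):
    """Return (candidate_states, substitutions) for one alignment column."""
    if isinstance(tree, str):  # leaf: a sequence
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:  # children agree on some state: no substitution needed here
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, length):
    """Minimum number of substitutions over all labelings of internal nodes."""
    return sum(fitch_site(tree, site)[1] for site in range(length))


topologies = [
    (("AAG", "AAA"), ("GGA", "AGA")),
    (("AAG", "GGA"), ("AAA", "AGA")),
    (("AAG", "AGA"), ("AAA", "GGA")),
]
for tree in topologies:
    print(tree, parsimony_score(tree, 3))
```

Under this sketch, grouping (AAG, AAA) with (GGA, AGA) needs 3 substitutions, one fewer than the 4 counted for the labeled tree in the figure, while the other two groupings need 4.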


DNA Microarrays (Chips)

A Modern Use of WC Complementarity

A binds to T
C binds to G

Perfect match:
AATGCTTAGTC
TTACGAATCAG

One base mismatch:
AATGCGTAGTC
TTACGAATCAG
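The two examples on this slide can be checked mechanically. The sketch below counts positions at which a probe and a target fail to form a Watson-Crick pair, with the strands written position by position as on the slide; the function name is hypothetical.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def count_mismatches(probe, target):
    """Number of positions where probe and target are not Watson-Crick pairs."""
    if len(probe) != len(target):
        raise ValueError("probe and target must have equal length")
    return sum(PAIRS[p] != t for p, t in zip(probe, target))


print(count_mismatches("AATGCTTAGTC", "TTACGAATCAG"))  # perfect match
print(count_mismatches("AATGCGTAGTC", "TTACGAATCAG"))  # one base mismatch
```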

Array Based Hybridization Assays (DNA Chips)

Unknown sequence (target): many copies.

Array of probes

Array Based Hyb Assays

• The target hybridizes to WC complementary probes only.

• Therefore, the fluorescence pattern is indicative of the target sequence.

Microarrays (“DNA Chips”)

Leading edge, future technologies (since 1988):

• In a single experiment, measure the expression level of thousands of genes.

• Find informative genes that may have predictive power for medical diagnosis.

• Potential for personalized medicine, e.g. kits for identifying cancer types and prescribing “personal” treatment.

DNA Chips - Structure

• Each chip has n “pixels” on it. Every pixel contains copies of a probe from a single gene.

• Do m experiments: cells in each experiment are taken from different conditions (different phase of cell cycle, different patient, different type of tissue, etc.).

• Purpose: Measure mRNA expression levels (color coded) of all n genes in one experiment.

Gene Expression Matrix

• Rows correspond to genes. (Typically n between 500 and 15,000.)

• Columns correspond to experiments. (Typically m between 10 and 200.)

• Entry (i, j) = expression level of gene i in experiment j.
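A toy version of such a matrix, with one very simple way to rank genes, might look as follows. The data are made up, and ranking by variance across experiments is only an assumed stand-in for "informative"; it is not a method prescribed in these slides.

```python
from statistics import pvariance

# Hypothetical toy data: n = 4 genes (rows), m = 3 experiments (columns).
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
expression = [
    [1.0, 1.1, 0.9],
    [0.2, 2.5, 4.8],
    [3.0, 3.0, 3.0],
    [4.0, 1.0, 2.5],
]


def most_variable_genes(matrix, names, k):
    """Names of the k genes whose expression varies most across experiments."""
    ranked = sorted(
        zip(names, matrix),
        key=lambda item: pvariance(item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


print(expression[1][2])  # entry (i, j): gene_b in the third experiment
print(most_variable_genes(expression, genes, 2))
```

A gene with constant expression, such as gene_c here, carries no information for telling the experiments apart, which is why it ranks last under this criterion.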

Algorithmic Challenge

Analyse the vast amount of data in gene expression matrices.

Discover meaningful biological structures and functions.

And now, time for a break