What is computational biology?

Mona Singh

Page 1: What is computational biology?

1 Mona Singh

What is computational biology?

Page 2: What is computational biology?

Genome

• The entire hereditary information content of an organism

Page 3: What is computational biology?

DNA

• String over 4 letter alphabet A, T, G, C

• Organism’s genome is distributed over chromosomes (e.g., 46 chromosomes in humans: 22 pairs of autosomes plus two sex chromosomes, XX or XY)

• Genome size: number of base pairs in an organism’s genome

Page 4: What is computational biology?

Genome Sizes

Human           3 billion bps
Mouse           3 billion bps
Fruit fly       165 million bps
Nematode worm   97 million bps
Yeast           15 million bps
E. coli         5 million bps

~400 genomes sequenced

Page 5: What is computational biology?

How are genomes sequenced?

• Can only sequence a few hundred base pairs at a time

• Make many copies of the DNA and cut into smaller (overlapping) pieces

• Assemble pieces: certain substrings occur in multiple fragments
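
The assembly idea in the bullets above can be sketched as a greedy merge: repeatedly join the two fragments whose suffix and prefix overlap the most. This is a toy illustration only. The names `overlap` and `greedy_assemble` and the `min_len` threshold are assumptions made for this sketch, and real assemblers also have to cope with sequencing errors, repeats, and reads from the opposite strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> str:
    """Repeatedly merge the pair of fragments with the largest suffix/prefix overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave the pieces unjoined
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return max(frags, key=len)  # longest contig found


genome = "ATGCCTTACGTACCCTGCGGCAGCACT"  # example string from the "Genomes to Life" slide
reads = [genome[0:12], genome[7:19], genome[14:27]]  # three overlapping pieces
print(greedy_assemble(reads) == genome)
```

Greedy merging is simple but can pick the wrong join when the genome contains repeated substrings longer than the overlaps.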

Page 6: What is computational biology?

Genomes to Life

ATGCCTTACGTACCCTGCGGCAGCACT

?Genome

Page 7: What is computational biology?

• Portions of DNA code for genes, which carry the information for making proteins

• Proteins play key roles in most biological processes (e.g., signaling, catalysis, immune response)

Page 8: What is computational biology?

gucgcuaccauuaccaguuggucuggugucaaaaauaauaauaaccgggcaggccaugucugcccguauuucgcguaaggaaauccauuauguacuauuuaaaaaacacaaacuuuuggauguucgguuuauucuuuuucuuuuacuuuuuuaucaugggagccuacuucccguuuuucccgauuuggcuacaugacaucaaccauaucagcaaaagugauacggguauuauuuuugccgcuauuucucuguucucgcuauuauuccaaccgcuguuuggucugcuuucugacaaacucgggcugcgcaaauaccugcuguggauuauuaccggcauguuagugauguuugcgccguucuuuauuuuuaucuucgggccacuguuacaauacaacauuuuaguaggaucgauuguuggugguauuuaucuaggcuuuuguuuuaacgccggugcgccagcaguagaggcauuuauugagaaagucagccgucgcaguaauuucgaauuuggucgcgcgcggauguuuggcuguguuggcugggcgcugugugccucgauugucggcaucauguucaccaucaauaaucaguuuguuuucuggcugggcucuggcugugcacucauccucgccguuuuacucuuuuucgccaaaacggaugcgcccucuucugccacgguugccaaugcgguaggugccaaccauucggcauuuagccuuaagcuggcacuggaacuguucagacagccaaaacugugguuuuugucacuguauguuauuggcguuuccugcaccuacgauGuuuuugaccaacaguuugcuaauuucuuuacuucguucugucaggugaa...gcaaucaaugucggaugcggcgcgacgcu

Gene Finding

Page 9: What is computational biology?

gucgcuaccauuaccaguuggucuggugucaaaaauaauaauaaccgggcaggccaugucugcccguauuucgcguaaggaaauccauuauguacuauuuaaaaaacacaaacuuuuggauguucgguuuauucuuuuucuuuuacuuuuuuaucaugggagccuacuucccguuuuucccgauuuggcuacaugacaucaaccauaucagcaaaagugauacggguauuauuuuugccgcuauuucucuguucucgcuauuauuccaaccgcuguuuggucugcuuucugacaaacucgggcugcgcaaauaccugcuguggauuauuaccggcauguuagugauguuugcgccguucuuuauuuuuaucuucgggccacuguuacaauacaacauuuuaguaggaucgauuguuggugguauuuaucuaggcuuuuguuuuaacgccggugcgccagcaguagaggcauuuauugagaaagucagccgucgcaguaauuucgaauuuggucgcgcgcggauguuuggcuguguuggcugggcgcugugugccucgauugucggcaucauguucaccaucaauaaucaguuuguuuucuggcugggcucuggcugugcacucauccucgccguuuuacucuuuuucgccaaaacggaugcgcccucuucugccacgguugccaaugcgguaggugccaaccauucggcauuuagccuuaagcuggcacuggaacuguucagacagccaaaacugugguuuuugucacuguauguuauuggcguuuccugcaccuacgauGuuuuugaccaacaguuugcuaauuucuuuacuucguucugucaggugaa...gcaaucaaugucggaugcggcgcgacgcu

MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDTGIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNILVGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFTINNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKLWFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPLIINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYIT

Gene Finding

Page 10: What is computational biology?

AUG = Methionine/start
UUA = Leucine
UUG = Leucine

UAA = Stop
UAG = Stop
UGA = Stop
...

The Genetic Code

Stryer, Biochemistry
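
The codon-to-amino-acid mapping on this slide can be written as a lookup table. The sketch below builds the standard genetic code from the conventional U, C, A, G ordering and translates an RNA string codon by codon until the first stop. The names `CODON_TABLE` and `translate` are illustrative and not from the slides, and characters other than A, C, G, U (or T) will raise a `KeyError`.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids for all 64 codons in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(rna: str) -> str:
    """Translate from the first base, three letters at a time, stopping at a stop codon."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(CODON_TABLE["AUG"], CODON_TABLE["UUA"], CODON_TABLE["UAA"])
print(translate("auguuauuguaa"))
```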

Page 11: What is computational biology?

Gene Finding

(RNA sequence repeated from Page 8)

Page 12: What is computational biology?

Gene Finding

Reading off from 1st start triplet:
aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa ...

Translating (3-letter amino acid code):
Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys ...

(1-letter code):
M S A R I S R K E I H Y V L F K ...

Page 13: What is computational biology?

Gene Finding

Reading off from 1st start triplet:
aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa ...

Translating (3-letter amino acid code):
Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys ...

(1-letter code):
M S A R I S R K E I H Y V L F K ...

Actual protein sequence:
M Y Y L K N T N F W M F G L F F ...
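
The two translations on this slide differ because the first start triplet is only one candidate: the actual protein appears to begin at a later AUG in the same RNA. A minimal sketch of the underlying search is to list every open reading frame (an AUG followed in-frame by a stop codon) and inspect the candidates. The function name `find_orfs` is an assumption for illustration, only the given strand is scanned, and choosing among candidates (for example, by length) is a heuristic rather than the method used in the slides.

```python
from itertools import product

# Standard genetic code in UCAG order; "*" marks a stop codon.
CODON_TABLE = dict(zip(
    ("".join(c) for c in product("UCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))


def find_orfs(rna: str) -> list[tuple[int, str]]:
    """Return (start index, protein) for every AUG that reaches an in-frame stop codon."""
    rna = rna.upper()
    orfs = []
    for start in range(len(rna) - 2):
        if rna[start:start + 3] != "AUG":
            continue
        protein = []
        for i in range(start, len(rna) - 2, 3):
            aa = CODON_TABLE[rna[i:i + 3]]
            if aa == "*":
                orfs.append((start, "".join(protein)))
                break
            protein.append(aa)
        # An AUG that runs off the end without a stop is not reported.
    return orfs


# Two AUGs in different frames; only the second reaches a stop codon.
for start, protein in find_orfs("GAUGCAUGUUUUGAGG"):
    print(start, protein)
```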

Page 14: What is computational biology?

Computational Gene Finding Methods

• Statistical bias: protein coding regions “look different”
  – compare coding vs. non-coding regions (Hidden Markov Models, Neural Nets)

• Sequence similarity
  – similar to known protein?
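
One way to make “look different” concrete is a log-odds score: train one simple model on known coding sequence and one on non-coding sequence, then ask which model explains a new stretch better. The sketch below uses plain first-order Markov chains over nucleotides, which is a simplification swapped in for the Hidden Markov Models and neural nets named on the slide. The training strings are made-up toy data, and `train_markov` and `log_odds` are hypothetical names.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs: list[str]) -> dict[str, dict[str, float]]:
    """Estimate P(next base | current base) with add-one smoothing."""
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for cur, nxt in zip(s, s[1:]):
            counts[cur][nxt] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_odds(seq: str, coding: dict, noncoding: dict) -> float:
    """Positive when the coding model fits seq better than the non-coding model."""
    return sum(
        math.log(coding[a][b] / noncoding[a][b]) for a, b in zip(seq, seq[1:])
    )


# Toy training data (an assumption for illustration, not real genomic sequence).
coding_model = train_markov(["GCGCGCGCGC"])
noncoding_model = train_markov(["ATATATATAT"])
print(log_odds("GCGCGC", coding_model, noncoding_model))
print(log_odds("ATATAT", coding_model, noncoding_model))
```

In practice the models would be trained on large sets of annotated sequence, and the score would be computed in a sliding window along the genome.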

Page 15: What is computational biology?

Gene finding is hard

• In some genomes, only a small portion of the genome codes for protein (needle in a haystack)

• Some genes contain introns and exons: exons are the parts that actually encode the protein, and they can be short

• Have to get the precise boundaries to get the correct protein

Page 16: What is computational biology?

Number of genes

Human           ~30,000
Mouse           ~30,000
Fruit fly       ~13,500
Nematode worm   ~19,000
Yeast           ~6,000
E. coli         ~4,000

Page 17: What is computational biology?

MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDTGIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNILVGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFTINNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKLWFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPLIINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYIT

Predicting Protein Function

DNA binding protein

Page 18: What is computational biology?

Functions of Human Proteins

Science, 2001

Page 19: What is computational biology?

Sequence similarity

CF: EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL-----
NT: QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR

CF: NTEGEIQIDGVSWDSITL---------QQWRKAFGVIPQKVFIFSG
NT: QTAGEILADGKPVSPCALRGIKIATIMQNPRSAFNPL---------

CF: TFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFP-GKLDFVLVDGG
NT: ---HTMHTHARETCLALGKPADDATLTAAIEAVGLENAARVLKLYP

CF: CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV
NT: FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV

Ex: cystic fibrosis gene and bacterial nickel transport gene
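
A simple way to quantify the similarity in an alignment like the one above is percent identity: the fraction of aligned columns, ignoring gaps, where both sequences carry the same residue. This is a minimal sketch; `percent_identity` is a hypothetical helper, and real similarity searches score substitutions with matrices rather than exact matches only.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of gap-free alignment columns in which the two residues are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# First row of the CF / NT alignment from the slide.
cf = "EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL-----"
nt = "QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR"
print(f"{percent_identity(cf, nt):.1f}% identical over gap-free columns")
```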

Page 20: What is computational biology?

Database Searches

http://www.ncbi.nlm.nih.gov

Page 21: What is computational biology?

Database Searches

Sequences producing significant alignments:  E-Value

gi|5523990|gb|AAD44047.1|AF108138_1 (AF108138) DNA helicase  4e-84
gi|7511524|pir||T37310 PIF1 protein - Caenorhabditis elegans helicase  1e-77
gi|7493349|pir||T40739 rrm3-pif1 helicase homolog - fission...  3e-59
gi|11282390|pir||T47241 RRM3/PIF1 helicase homolog - fission yeast  3e-59
gi|6321820|ref|NP_011896.1| DNA helicase; Rrm3p [Saccharomyces  4e-43
gi|6323579|ref|NP_013650.1| 5' to 3' DNA helicase; Pif1p [Saccharo  1e-41
gi|558414|emb|CAA86260.1| (Z38114) len: 750, CAI: 0.14, inc...  1e-41
gi|7687929|emb|CAB89609.1| (AL354532) possible DNA helicase...  4e-41

Page 22: What is computational biology?

Protein Structure

Sequence: KETAAAKFERQHMDSSTSAASSSN…

Structure: (figure)

Page 23: What is computational biology?

Primary       Secondary   Tertiary            Quaternary
Amino acids   helix       Polypeptide chain   Assembled subunits

Proteins

Lehninger, Principles of Biochemistry

Page 24: What is computational biology?

Protein Structure Prediction

• Physics-based methods

• Statistics-based methods

Page 25: What is computational biology?

Statistics & Protein Structure Prediction

Given a new sequence and a library of folds, figure out which (if any) is a good fit to the sequence.

Page 26: What is computational biology?

Secondary structure prediction

• Given a protein sequence, can you tell its secondary structure?
  – E.g., LKVVAKRELVQNNQ
          aaaa bbbb aaaaaaa

a = alpha, b = beta: ~70% accuracy (neural nets or other learning techniques)

Page 27: What is computational biology?

Genome annotation

• Many other important features of DNA
  – E.g., proteins bind DNA regulatory elements: determines which genes are “on” when

• Statistical & comparative approaches for finding them
  – Motif finding
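
As a small illustration of the statistical side, the sketch below scores a DNA sequence against a position weight matrix built from a few example binding sites and reports the best-scoring window. Note the limits: this scans for a motif that is already known, whereas motif finding proper must discover the motif from unaligned sequences. The example sites, the uniform background of 0.25, and the names `build_pwm` and `best_hit` are all assumptions for illustration.

```python
import math


def build_pwm(sites: list[str], pseudocount: float = 1.0, background: float = 0.25):
    """Per-position log2 odds of each base versus a uniform background."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm


def best_hit(seq: str, pwm) -> tuple[float, int]:
    """Return (score, start position) of the highest-scoring window in seq."""
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        (sum(pwm[i][seq[p + i]] for i in range(width)), p)
        for p in range(len(seq) - width + 1)
    )


# Hypothetical aligned binding sites (toy data, not from the slides).
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
score, position = best_hit("GGCGTATAATGCGC", build_pwm(sites))
print(position, round(score, 2))
```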

Page 28: What is computational biology?

Prokaryotes Eukaryotes

Universal phylogenetic tree

Woese et al.

Page 29: What is computational biology?

Building phylogenetic trees

Use DNA (or protein) sequences from various organisms

e.g.,
human  ATCGAGGC
mouse  ATCCAGCC
yeast  ATTAAGTA
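
A first step toward a tree is turning the sequences into pairwise distances. The slides do not state which distance measure is used. Counting mismatched positions (Hamming distance) is one simple choice, and it is assumed here; it gives 2 for human and mouse and 4 for each comparison with yeast. The names `hamming` and `distance_matrix` are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: dict[str, str]) -> dict[str, dict[str, int]]:
    """All-against-all Hamming distances, keyed by organism name."""
    return {a: {b: hamming(seqs[a], seqs[b]) for b in seqs} for a in seqs}


seqs = {"human": "ATCGAGGC", "mouse": "ATCCAGCC", "yeast": "ATTAAGTA"}
for name, row in distance_matrix(seqs).items():
    print(name, [row[other] for other in seqs])
```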

Page 30: What is computational biology?

Building phylogenetic trees

E.g., Distance Matrix:

        Human  Mouse  Yeast
Human     0      2      4
Mouse     2      0      4
Yeast     4      4      0

Tree: (figure; branch lengths 1, 1, 1, 2; leaves Human, Mouse, Yeast)
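
The slides do not say which method turns the distance matrix into a tree. UPGMA (average-linkage clustering) is one standard distance-based method, used here as an assumption. It repeatedly joins the two closest clusters and places their parent at half the distance between them. The function name `upgma` and the Newick output format are choices made for this sketch.

```python
from itertools import combinations


def upgma(dist: dict[tuple[str, str], float]) -> str:
    """Average-linkage clustering; returns the tree as a Newick string with branch lengths."""
    # Symmetric lookup keyed by unordered pairs of cluster ids.
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    # cluster id -> (Newick text, height of the cluster's root, number of leaves)
    clusters = {name: (name, 0.0, 1) for pair in dist for name in pair}
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        (ta, ha, na), (tb, hb, nb) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2
        new_id = f"{a}+{b}"
        # Size-weighted average distance from the merged cluster to each remaining one.
        for c in clusters:
            d[frozenset((new_id, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        text = f"({ta}:{height - ha:g},{tb}:{height - hb:g})"
        clusters[new_id] = (text, height, na + nb)
    text, _, _ = next(iter(clusters.values()))
    return text + ";"


matrix = {("Human", "Mouse"): 2, ("Human", "Yeast"): 4, ("Mouse", "Yeast"): 4}
print(upgma(matrix))
```

For this matrix the sketch joins Human and Mouse first, with branches of length 1 each, then attaches Yeast with a branch of length 2 and an internal branch of length 1, which is consistent with the branch lengths on the slide.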

Page 31: What is computational biology?

Intracellular networks

(Figure: Stimulus, DNA, RNA, Protein)

Page 32: What is computational biology?

Network of cells

(Figure: six repeated units, each labeled DNA, RNA, Protein, fn)

Page 33: What is computational biology?

(Figure: DNA, RNA, Protein, fn, fn)

Page 34: What is computational biology?

Lecture Notes

• www.cs.princeton.edu/~mona/computational_biology_notes.html