What is computational biology?

1 Mona Singh

What is computational biology?

2 Mona Singh

Genome

• The entire hereditary information content of an organism

3 Mona Singh

DNA

• String over 4 letter alphabet A, T, G, C

• An organism’s genome is distributed over chromosomes (e.g., 46 chromosomes in human: 22 pairs of autosomes plus the sex chromosomes, XX or XY)

• Genome size: number of base pairs in an organism’s genome

4 Mona Singh

Genome Sizes

Human          3 billion bps
Mouse          3 billion bps
Fruit fly      165 million bps
Nematode worm  97 million bps
Yeast          15 million bps
E. coli        5 million bps

~ 400 genomes sequenced

5 Mona Singh

How are genomes sequenced?

• Can only sequence a few hundred base pairs at a time

• Make many copies of the DNA and cut into smaller (overlapping) pieces

• Assemble pieces: certain substrings occur in multiple fragments
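The assembly step in the bullets above can be sketched as a greedy merge: repeatedly join the two fragments whose suffix and prefix overlap the most. This is only an illustration of the idea, not how real assemblers are built; the names `overlap` and `greedy_assemble`, the `min_len` threshold, and the three hand-cut fragments of the example string ATGCCTTACGTACCCTGCGGCAGCACT are assumptions made for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Greedily merge the pair of fragments with the largest overlap until none overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge; return the remaining contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical overlapping fragments cut from one short sequence.
pieces = ["CCCTGCGGCAGCACT", "ATGCCTTACGTA", "TTACGTACCCTGCG"]
print(greedy_assemble(pieces))
```

Greedy merging can be misled by repeated substrings, which is one reason real genomes are much harder to assemble than this toy case.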

6 Mona Singh

Genomes to Life

ATGCCTTACGTACCCTGCGGCAGCACT

?Genome

7 Mona Singh

• Portions of DNA code for genes, which carry the information for making proteins

• Proteins play key roles in most biological processes (e.g., signaling, catalysis, immune response, etc.)

8 Mona Singh

gucgcuaccauuaccaguuggucuggugucaaaaauaauaauaaccgggcaggccaugucugcccguauuucgcguaaggaaauccauuauguacuauuuaaaaaacacaaacuuuuggauguucgguuuauucuuuuucuuuuacuuuuuuaucaugggagccuacuucccguuuuucccgauuuggcuacaugacaucaaccauaucagcaaaagugauacggguauuauuuuugccgcuauuucucuguucucgcuauuauuccaaccgcuguuuggucugcuuucugacaaacucgggcugcgcaaauaccugcuguggauuauuaccggcauguuagugauguuugcgccguucuuuauuuuuaucuucgggccacuguuacaauacaacauuuuaguaggaucgauuguuggugguauuuaucuaggcuuuuguuuuaacgccggugcgccagcaguagaggcauuuauugagaaagucagccgucgcaguaauuucgaauuuggucgcgcgcggauguuuggcuguguuggcugggcgcugugugccucgauugucggcaucauguucaccaucaauaaucaguuuguuuucuggcugggcucuggcugugcacucauccucgccguuuuacucuuuuucgccaaaacggaugcgcccucuucugccacgguugccaaugcgguaggugccaaccauucggcauuuagccuuaagcuggcacuggaacuguucagacagccaaaacugugguuuuugucacuguauguuauuggcguuuccugcaccuacgauGuuuuugaccaacaguuugcuaauuucuuuacuucguucugucaggugaa...gcaaucaaugucggaugcggcgcgacgcu

Gene Finding

9 Mona Singh

MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDTGIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNILVGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFTINNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKLWFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPLIINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYIT

Gene Finding

10 Mona Singh

The Genetic Code

AUG = Methionine / start
UUA = Leucine
UUG = Leucine
UAA = Stop
UAG = Stop
UGA = Stop
...

Stryer, Biochemistry

11 Mona Singh

Gene Finding

(The RNA sequence from slide 8 is shown again.)

12 Mona Singh

Gene Finding

Reading off from 1st start triplet:
aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa ...

Translating (3 letter amino acid code):
Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys ...

(1 letter code):
M S A R I S R K E I H Y V L F K ...

13 Mona Singh

Gene Finding

Reading off from 1st start triplet:
aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa ...

Translating (3 letter amino acid code):
Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys ...

(1 letter code):
M S A R I S R K E I H Y V L F K ...

Actual protein sequence:
M Y Y L K N T N F W M F G L F F ...
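The reading-off step above can be sketched in a few lines. The codon table is the standard genetic code; the names `CODON_TABLE`, `start_positions`, and `translate_from` are made up for this example, and the input is just the 48-base stretch shown on the slide. It illustrates why the first start triplet is not necessarily the right one: a second `aug` in a different reading frame begins the actual protein.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order u, c, a, g.
BASES = "ucag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def start_positions(rna):
    """Indices of every 'aug' triplet, in any reading frame."""
    return [i for i in range(len(rna) - 2) if rna[i:i + 3] == "aug"]


def translate_from(rna, start):
    """Translate codon by codon from `start` until a stop codon or the end of the string."""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


# The stretch of RNA shown on the slide, written without spaces.
rna = "aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa".replace(" ", "")

for pos in start_positions(rna):
    print(pos, translate_from(rna, pos))
```

On this short stretch the first start gives MSARISRKEIHYVLFK, while the start at index 34 gives MYYL, the first residues of the actual protein sequence.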

14 Mona Singh

Computational Gene Finding Methods

• Statistical bias: protein coding regions “look different”
  – compare coding vs. non-coding regions (Hidden Markov Models, Neural Nets)

• Sequence similarity
  – similar to known protein?
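A much simpler relative of the models named above gives the flavor of the "statistical bias" idea: fit one first-order Markov chain to known coding sequence and one to non-coding sequence, then score a new stretch by its log-odds under the two. This is a plain Markov chain, not a hidden Markov model or a neural net, and the training strings below are invented toy data, not real genomic sequence; `train_markov` and `log_odds` are hypothetical names.

```python
import math
from collections import Counter

ALPHABET = "acgu"


def train_markov(seqs):
    """Estimate P(next base | previous base) with add-one smoothing."""
    counts = Counter()
    for s in seqs:
        counts.update(zip(s, s[1:]))  # count adjacent base pairs
    probs = {}
    for prev in ALPHABET:
        total = sum(counts[(prev, nxt)] for nxt in ALPHABET) + len(ALPHABET)
        for nxt in ALPHABET:
            probs[(prev, nxt)] = (counts[(prev, nxt)] + 1) / total
    return probs


def log_odds(seq, coding, noncoding):
    """Positive when seq looks more like the coding model, negative otherwise."""
    return sum(math.log(coding[p] / noncoding[p]) for p in zip(seq, seq[1:]))


# Toy training data (assumption): "coding" is GC-rich, "non-coding" is AU-rich.
coding_model = train_markov(["gcgcgcgcgc", "cgcgccgcgg"])
noncoding_model = train_markov(["auauauauau", "uauaauuaua"])

print(log_odds("gcgcgc", coding_model, noncoding_model))
print(log_odds("auauau", coding_model, noncoding_model))
```

Real gene finders train far richer models on annotated genomes and must also locate exact gene boundaries, which this score does not attempt.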

15 Mona Singh

Gene finding is hard

• In some genomes, only a small portion of the genome codes for protein (needle in a haystack)

• Some genes contain introns and exons; exons are the parts that actually encode the protein, and they can be short

• Have to get the precise boundaries to get the correct protein

16 Mona Singh

Number of genes

Human          ~30,000
Mouse          ~30,000
Fruit fly      ~13,500
Nematode worm  ~19,000
Yeast          ~6,000
E. coli        ~4,000

17 Mona Singh

MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDTGIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNILVGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFTINNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKLWFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPLIINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYIT

Predicting Protein Function

DNA binding protein

18 Mona Singh

Functions of Human Proteins

Science, 2001

19 Mona Singh

Sequence similarity

CF: EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL-----
NT: QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR

CF: NTEGEIQIDGVSWDSITL---------QQWRKAFGVIPQKVFIFSG
NT: QTAGEILADGKPVSPCALRGIKIATIMQNPRSAFNPL---------

CF: TFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFP-GKLDFVLVDGG
NT: ---HTMHTHARETCLALGKPADDATLTAAIEAVGLENAARVLKLYP

CF: CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV
NT: FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV

Ex: cystic fibrosis gene and bacterial nickel transport gene
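Alignments like the one above are computed by dynamic programming. As a minimal sketch, the code below implements textbook global alignment (Needleman-Wunsch) with a flat match/mismatch/gap score, plus a helper that counts identical columns in an existing alignment. The scoring values and function names are assumptions for illustration; real protein comparisons use substitution matrices and local alignment, and nothing here claims to be the method that produced the alignment on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def identical_columns(x, y):
    """Count alignment columns where both rows carry the same residue (gaps never count)."""
    return sum(1 for p, q in zip(x, y) if p == q and p != "-")


print(needleman_wunsch("ACGT", "AGT"))

# Last block of the CF / NT alignment shown on the slide.
cf = "CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV"
nt = "FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV"
print(identical_columns(cf, nt), "identical columns out of", len(cf))
```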

20 Mona Singh

Database Searches

http://www.ncbi.nlm.nih.gov

21 Mona Singh

Database Searches

Sequences producing significant alignments: E-Value

gi|5523990|gb|AAD44047.1|AF108138_1 (AF108138) DNA helicase 4e-84
gi|7511524|pir||T37310 PIF1 protein - Caenorhabditis elegans helicase 1e-77
gi|7493349|pir||T40739 rrm3-pif1 helicase homolog - fission... 3e-59
gi|11282390|pir||T47241 RRM3/PIF1 helicase homolog - fission yeast 3e-59
gi|6321820|ref|NP_011896.1| DNA helicase; Rrm3p [Saccharomyces 4e-43
gi|6323579|ref|NP_013650.1| 5' to 3' DNA helicase; Pif1p [Saccharo 1e-41
gi|558414|emb|CAA86260.1| (Z38114) len: 750, CAI: 0.14, inc... 1e-41
gi|7687929|emb|CAB89609.1| (AL354532) possible DNA helicase... 4e-41

22 Mona Singh

Protein Structure

Sequence: KETAAAKFERQHMDSSTSAASSSN…

Structure: (figure)

23 Mona Singh

Proteins (figure: levels of protein structure)

Primary structure: amino acids
Secondary structure: α helix
Tertiary structure: polypeptide chain
Quaternary structure: assembled subunits

Lehninger, Principles of Biochemistry

24 Mona Singh

Protein Structure Prediction

• Physics-based methods

• Statistics-based methods

25 Mona Singh

Statistics & Protein Structure Prediction

Given a new sequence and a library of folds, figure out which (if any) is a good fit to the sequence.

26 Mona Singh

Secondary structure prediction

• Given a protein sequence, can you tell its secondary structure?
  – E.g., LKVVAKRELVQNNQ
          aaaa bbbb aaaaaaa

a = alpha, b = beta: ~70% accuracy (neural nets or other learning techniques)

27 Mona Singh

Genome annotation

• Many other important features of DNA
  – E.g., proteins bind DNA regulatory elements: determines which genes are “on” when

• Statistical & comparative approaches for finding them
  – Motif finding

28 Mona Singh

Prokaryotes Eukaryotes

Universal phylogenetic tree

Woese et al.

29 Mona Singh

Building phylogenetic trees

Use DNA (or protein) sequences from various organisms

e.g.,
human ATCGAGGC
mouse ATCCAGCC
yeast ATTAAGTA
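One simple way to compare the three example sequences above is to count the positions at which two of them differ (the Hamming distance). The sketch below assumes the sequences are already aligned and of equal length, as they are here; `hamming` and `distance_matrix` are illustrative names.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Pairwise Hamming distances, rows and columns in the order of the dict."""
    names = list(seqs)
    return names, [[hamming(seqs[p], seqs[q]) for q in names] for p in names]


seqs = {"human": "ATCGAGGC", "mouse": "ATCCAGCC", "yeast": "ATTAAGTA"}
names, matrix = distance_matrix(seqs)
for name, row in zip(names, matrix):
    print(name, row)
```

For these sequences, human and mouse differ at 2 positions, and each differs from yeast at 4.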

30 Mona Singh

Building phylogenetic trees

E.g., distance matrix:

        Human  Mouse  Yeast
Human     0      2      4
Mouse     2      0      4
Yeast     4      4      0

Tree (branch lengths): ((Human:1, Mouse:1):1, Yeast:2)
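The slide does not say which method turns the distance matrix into the tree. One standard choice that reproduces the branch lengths shown (1 and 1 for Human and Mouse, 1 for the internal branch, 2 for Yeast) is UPGMA: repeatedly merge the two closest clusters and place their ancestor at half their distance. The sketch below is written under that assumption; `upgma` and its nested-tuple tree format are made up for the example.

```python
def upgma(names, dist):
    """Build a rooted tree from a symmetric distance matrix with UPGMA.

    A leaf is its name; an internal node is a tuple
    (left, left_branch_length, right, right_branch_length).
    """
    # cluster id -> (subtree, number of leaves, height above the leaves)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # distances between current clusters, keyed by (smaller id, larger id)
    d = {(i, j): float(dist[i][j])
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])  # closest pair
        ti, si, hi = clusters.pop(i)
        tj, sj, hj = clusters.pop(j)
        height = dij / 2  # the new ancestor sits halfway between the two clusters
        # Distance from the merged cluster to every other one: size-weighted average.
        new_d = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (si * dik + sj * djk) / (si + sj)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = ((ti, height - hi, tj, height - hj), si + sj, height)
        next_id += 1
    tree, _, _ = next(iter(clusters.values()))
    return tree


tree = upgma(["Human", "Mouse", "Yeast"], [[0, 2, 4], [2, 0, 4], [4, 4, 0]])
print(tree)
```

UPGMA assumes all lineages change at the same rate; when that does not hold, other tree-building methods are usually preferred.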

31 Mona Singh

Intracellular networks

(Figure labels: DNA, RNA, Protein, Stimulus)

32 Mona Singh

Network of cells

(Figure: six repeated panels, each labeled DNA, RNA, Protein, fn)

33 Mona Singh

(Figure labels: DNA, RNA, Protein, fn, fn)

34 Mona Singh

Lecture Notes

• www.cs.princeton.edu/~mona/computational_biology_notes.html
