1 Mona Singh
What is computational biology?
2 Mona Singh
Genome
• The entire hereditary information content of an organism
3 Mona Singh
DNA
• String over 4 letter alphabet A, T, G, C
• Organism’s genome is distributed over chromosomes (e.g., 46 chromosomes in human: 22 pairs of autosomes plus the sex chromosomes, XX or XY)
• Genome size: number of base pairs in an organism’s genome
4 Mona Singh
Genome Sizes
Human          3 billion bps
Mouse          3 billion bps
Fruit fly      165 million bps
Nematode worm  97 million bps
Yeast          15 million bps
E. coli        5 million bps
~400 genomes sequenced
5 Mona Singh
How are genomes sequenced?
• Can only sequence a few hundred base pairs at a time
• Make many copies of the DNA and cut into smaller (overlapping) pieces
• Assemble pieces: certain substrings occur in multiple fragments
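The assembly step can be sketched as a greedy merge of the fragments that overlap the most. This is only an illustration under simple assumptions (error-free reads, no repeats, all fragments from the same strand); the function names and the `min_len` cutoff are hypothetical, and real assemblers are far more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0  # no overlap of at least min_len letters


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = [f for f in dict.fromkeys(fragments)
             if not any(f != g and f in g for g in fragments)]
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return frags  # nothing left to merge; may be several contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)


# Four overlapping pieces cut from a short example string.
pieces = ["ATGCCTTACGTA", "TACGTACCCTGC", "CCCTGCGGCAGC", "GCGGCAGCACT"]
contigs = greedy_assemble(pieces)
print(contigs)
```

Greedy merging can misassemble when the same substring occurs in several places in the genome, which is one reason real genome assembly is hard.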
6 Mona Singh
Genomes to Life
Genome: ATGCCTTACGTACCCTGCGGCAGCACT
?
7 Mona Singh
• Portions of DNA code for genes, which carry the information for making proteins
• Proteins play key roles in most biological processes (e.g., signaling, catalysis, immune response)
9 Mona Singh
gucgcuaccauuaccaguuggucuggugucaaaaauaauaauaaccgggcaggccaugucugcccguauuucgcguaaggaaauccauuauguacuauuuaaaaaacacaaacuuuuggauguucgguuuauucuuuuucuuuuacuuuuuuaucaugggagccuacuucccguuuuucccgauuuggcuacaugacaucaaccauaucagcaaaagugauacggguauuauuuuugccgcuauuucucuguucucgcuauuauuccaaccgcuguuuggucugcuuucugacaaacucgggcugcgcaaauaccugcuguggauuauuaccggcauguuagugauguuugcgccguucuuuauuuuuaucuucgggccacuguuacaauacaacauuuuaguaggaucgauuguuggugguauuuaucuaggcuuuuguuuuaacgccggugcgccagcaguagaggcauuuauugagaaagucagccgucgcaguaauuucgaauuuggucgcgcgcggauguuuggcuguguuggcugggcgcugugugccucgauugucggcaucauguucaccaucaauaaucaguuuguuuucuggcugggcucuggcugugcacucauccucgccguuuuacucuuuuucgccaaaacggaugcgcccucuucugccacgguugccaaugcgguaggugccaaccauucggcauuuagccuuaagcuggcacuggaacuguucagacagccaaaacugugguuuuugucacuguauguuauuggcguuuccugcaccuacgauGuuuuugaccaacaguuugcuaauuucuuuacuucguucugucaggugaa...gcaaucaaugucggaugcggcgcgacgcu
MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDTGIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNILVGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFTINNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKLWFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPLIINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYIT
Gene Finding
10 Mona Singh
The Genetic Code
AUG = Methionine / start
UUA = Leucine
UUG = Leucine
UAA = Stop
UAG = Stop
UGA = Stop
...
Stryer, Biochemistry
11 Mona Singh
Gene Finding
gucgcuaccauuaccaguuggucuggugucaaaaauaauaauaaccgggcaggccaugucugcccguauuucgcguaaggaaauccauuauguacuauuuaaaaaacacaaacuuuuggauguucgguuuauucuuuuucuuuuacuuuuuuaucaugggagccuacuucccguuuuucccgauuuggcuacaugacaucaaccauaucagcaaaagugauacggguauuauuuuugccgcuauuucucuguucucgcuauuauuccaaccgcuguuuggucugcuuucugacaaacucgggcugcgcaaauaccugcuguggauuauuaccggcauguuagugauguuugcgccguucuuuauuuuuaucuucgggccacuguuacaauacaacauuuuaguaggaucgauuguuggugguauuuaucuaggcuuuuguuuuaacgccggugcgccagcaguagaggcauuuauugagaaagucagccgucgcaguaauuucgaauuuggucgcgcgcggauguuuggcuguguuggcugggcgcugugugccucgauugucggcaucauguucaccaucaauaaucaguuuguuuucuggcugggcucuggcugugcacucauccucgccguuuuacucuuuuucgccaaaacggaugcgcccucuucugccacgguugccaaugcgguaggugccaaccauucggcauuuagccuuaagcuggcacuggaacuguucagacagccaaaacugugguuuuugucacuguauguuauuggcguuuccugcaccuacgauguuuuugaccaacaguuugcuaauuucuuuacuucguucugucaggugaa...gcaaucaaugucggaugcggcgcgacgcu
13 Mona Singh
Gene Finding
Reading off from 1st start triplet:
aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa ...
Translating (3-letter amino acid code):
Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys ...
(1-letter code):
M S A R I S R K E I H Y V L F K ...
Actual protein sequence:
M Y Y L K N T N F W M F G L F F ...
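The mismatch between the two readings can be reproduced with a few lines of code. The sketch below builds the standard genetic code as a lookup table and translates a short excerpt of the RNA sequence from the slides, once from the first AUG and once from the second. The names `translate`, `start_positions`, and `EXCERPT` are illustrative choices, not part of the lecture.

```python
# Standard genetic code, with codons enumerated in U, C, A, G order ("*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def start_positions(rna: str) -> list[int]:
    """Indices of every AUG triplet, in any reading frame."""
    rna = rna.upper()
    return [i for i in range(len(rna) - 2) if rna[i:i + 3] == "AUG"]


def translate(rna: str, start: int) -> str:
    """Translate codon by codon from `start` until a stop codon or the end."""
    rna = rna.upper()
    protein = []
    for pos in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Excerpt of the slide's sequence, beginning at its first AUG.
EXCERPT = ("augucugcccguauuucgcguaaggaaauccauuauguacuauuuaaaaaacacaaac"
           "uuuuggauguucgguuuauuc")

starts = start_positions(EXCERPT)
from_first = translate(EXCERPT, starts[0])
from_second = translate(EXCERPT, starts[1])
print(starts)
print(from_first)
print(from_second)
```

The second AUG lies in a different reading frame from the first, so the two translations share no residues. Only the second matches the actual protein, which is why simply taking the first start triplet is not enough.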
14 Mona Singh
Computational Gene Finding Methods
• Statistical bias: protein coding regions “look different”
- compare coding vs. non-coding regions (Hidden Markov Models, Neural Nets)
• Sequence similarity
- similar to known protein?
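The statistical idea can be sketched with something much simpler than a Hidden Markov Model or neural net: a first-order Markov chain trained on each class, with a log-odds score deciding which class a new sequence resembles. Everything here is an assumption for illustration. The training sequences are made up (GC-rich "coding", AT-rich "non-coding"), and that contrast is a toy, not a claim about real genomes.

```python
import math
from collections import Counter


def train_markov(seqs, alphabet="ACGT", pseudo=1.0):
    """Estimate P(next letter | current letter) with pseudocounts."""
    counts = Counter()
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a, b] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a, b] for b in alphabet) + pseudo * len(alphabet)
        for b in alphabet:
            probs[a, b] = (counts[a, b] + pseudo) / total
    return probs


def log_odds(seq, coding, noncoding):
    """Positive: looks more like the coding model; negative: non-coding."""
    return sum(math.log(coding[a, b] / noncoding[a, b])
               for a, b in zip(seq, seq[1:]))


# Made-up toy training data, for illustration only.
coding_model = train_markov(["GCGGCCGCGGCGCC", "CCGCGGGCGCGC"])
noncoding_model = train_markov(["ATATTAATAATTTA", "TTAATATATAAT"])

score_gc = log_odds("GCGCGGCC", coding_model, noncoding_model)
score_at = log_odds("ATATTAAT", coding_model, noncoding_model)
print(score_gc, score_at)
```

A real gene finder would train on annotated genomes and model codon structure, start and stop signals, and intron/exon boundaries.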
15 Mona Singh
Gene finding is hard
• In some genomes, only a small portion of genome codes for protein (needle in haystack)
• Some genes contain both introns and exons; only the exons actually encode the protein, and exons can be short
• The exon boundaries must be located precisely to get the correct protein
16 Mona Singh
Number of genes
Human          ~30,000
Mouse          ~30,000
Fruit fly      ~13,500
Nematode worm  ~19,000
Yeast          ~6,000
E. coli        ~4,000
17 Mona Singh
MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDTGIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNILVGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFTINNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKLWFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPLIINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYIT
Predicting Protein Function
DNA binding protein
18 Mona Singh
Functions of Human Proteins
Science, 2001
19 Mona Singh
Sequence similarity
Ex: cystic fibrosis gene (CF) and bacterial nickel transport gene (NT)

CF: EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL-----
NT: QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR

CF: NTEGEIQIDGVSWDSITL---------QQWRKAFGVIPQKVFIFSG
NT: QTAGEILADGKPVSPCALRGIKIATIMQNPRSAFNPL---------

CF: TFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFP-GKLDFVLVDGG
NT: ---HTMHTHARETCLALGKPADDATLTAAIEAVGLENAARVLKLYP

CF: CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV
NT: FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV
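An alignment like this one is typically found by dynamic programming. The sketch below computes a global alignment score and the percent identity of two already-aligned rows. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions; real protein comparisons use substitution matrices such as BLOSUM62 and more refined gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all global alignments of a and b (dynamic programming)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag,                # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def percent_identity(row_a, row_b):
    """Percent of gap-free alignment columns in which the two rows agree."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    same = sum(x == y for x, y in pairs)
    return 100.0 * same / len(pairs) if pairs else 0.0


# Last block of the CF / NT alignment shown on the slide.
cf = "CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV"
nt = "FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV"
print(percent_identity(cf, nt))
print(global_alignment_score("GSGKS", "GSKS"))
```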
20 Mona Singh
Database Searches
http://www.ncbi.nlm.nih.gov
21 Mona Singh
Database Searches
Sequences producing significant alignments:                              E-Value
gi|5523990|gb|AAD44047.1|AF108138_1 (AF108138) DNA helicase              4e-84
gi|7511524|pir||T37310 PIF1 protein - Caenorhabditis elegans helicase    1e-77
gi|7493349|pir||T40739 rrm3-pif1 helicase homolog - fission...           3e-59
gi|11282390|pir||T47241 RRM3/PIF1 helicase homolog - fission yeast       3e-59
gi|6321820|ref|NP_011896.1| DNA helicase; Rrm3p [Saccharomyces           4e-43
gi|6323579|ref|NP_013650.1| 5' to 3' DNA helicase; Pif1p [Saccharo       1e-41
gi|558414|emb|CAA86260.1| (Z38114) len: 750, CAI: 0.14, inc...           1e-41
gi|7687929|emb|CAB89609.1| (AL354532) possible DNA helicase...           4e-41
22 Mona Singh
Protein Structure
Sequence: KETAAAKFERQHMDSSTSAASSSN…
Structure:
23 Mona Singh
Proteins
Primary structure: amino acids
Secondary structure: helix
Tertiary structure: polypeptide chain
Quaternary structure: assembled subunits
Lehninger, Principles of Biochemistry
24 Mona Singh
Protein Structure Prediction
• Physics-based methods
• Statistics-based methods
25 Mona Singh
Statistics & Protein Structure Prediction
Given a new sequence and a library of folds, figure out which (if any) is a good fit to the sequence.
26 Mona Singh
Secondary structure prediction
• Given a protein sequence, can you tell its secondary structure?
– E.g., LKVVAKRELVQNNQ
        aaaa bbbb aaaaaaa
  a=alpha, b=beta
• ~70% accuracy (neural nets or other learning techniques)
27 Mona Singh
Genome annotation
• Many other important features of DNA
– E.g., proteins bind DNA regulatory elements: determines which genes are “on” when
• Statistical & comparative approaches for finding them
– Motif finding
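One statistical building block is a position weight matrix: count the letters at each position of known binding sites, then scan a sequence for the window that best matches those counts. The sketch below covers only this scanning step. Discovering a motif when no sites are known in advance is harder and needs other methods. The example sites and the scanned sequence are made up, and the uniform background is an assumption.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudo=1.0):
    """Log-odds (base 2) of each letter per position vs. a uniform background."""
    total = len(sites) + pseudo * len(alphabet)
    background = 1.0 / len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({b: math.log2((column.count(b) + pseudo) / total / background)
                    for b in alphabet})
    return pwm


def best_match(seq, pwm):
    """Return (start, word, score) of the highest-scoring window in seq."""
    width = len(pwm)
    scored = [(sum(pwm[i][seq[s + i]] for i in range(width)), s)
              for s in range(len(seq) - width + 1)]
    score, start = max(scored)
    return start, seq[start:start + width], score


# Made-up aligned binding sites and a made-up sequence to scan.
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_pwm(toy_sites)
start, word, score = best_match("GGCGCTATAATGCGCC", pwm)
print(start, word)
```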
28 Mona Singh
Universal phylogenetic tree
(figure labels: Prokaryotes, Eukaryotes)
Woese et al.
29 Mona Singh
Building phylogenetic trees
Use DNA (or protein) sequences from various organisms
e.g.,
human ATCGAGGC
mouse ATCCAGCC
yeast ATTAAGTA
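A simple way to turn such sequences into pairwise distances is to count the positions at which two equal-length sequences differ (the Hamming distance). This is a minimal sketch: real analyses first align the sequences and often correct for multiple substitutions at the same site. The function names are illustrative.

```python
SEQS = {"human": "ATCGAGGC", "mouse": "ATCCAGCC", "yeast": "ATTAAGTA"}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: dict[str, str]) -> dict[tuple[str, str], int]:
    """Distance for every ordered pair of names (0 on the diagonal)."""
    return {(p, q): hamming(seqs[p], seqs[q]) for p in seqs for q in seqs}


dist = distance_matrix(SEQS)
for p in SEQS:
    print(p, [dist[p, q] for q in SEQS])
```

For these three sequences the distances are 2 (human, mouse), 4 (human, yeast), and 4 (mouse, yeast).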
30 Mona Singh
Building phylogenetic trees
E.g., distance matrix:
        Human  Mouse  Yeast
Human     0      2      4
Mouse     2      0      4
Yeast     4      4      0
Tree (with branch lengths): ((Human:1, Mouse:1):1, Yeast:2)
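The slides do not say which tree-building method is used. One classic distance-based method is UPGMA, which repeatedly joins the two closest clusters and places their ancestor at half the distance between them. The sketch below assumes UPGMA and writes the tree in Newick notation; for the matrix above it yields branch lengths 1, 1, 1, and 2.

```python
from itertools import combinations


def upgma(dist):
    """dist maps (name_a, name_b) pairs to distances; returns a Newick string."""
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    # Each cluster id maps to (newick_text, leaf_count, height_above_leaves).
    clusters = {name: (name, 1, 0.0) for pair in dist for name in pair}
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda p: d[frozenset(p)])
        (ta, na, ha), (tb, nb, hb) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2
        merged = f"({ta}:{height - ha:g},{tb}:{height - hb:g})"
        new_id = a + "+" + b
        for c in clusters:
            # Average linkage, weighted by the number of leaves in each cluster.
            d[frozenset((new_id, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[new_id] = (merged, na + nb, height)
    (text, _, _), = clusters.values()
    return text + ";"


tree = upgma({("Human", "Mouse"): 2, ("Human", "Yeast"): 4, ("Mouse", "Yeast"): 4})
print(tree)
```

UPGMA assumes all lineages evolve at the same rate; other methods, such as neighbor joining, drop that assumption.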
31 Mona Singh
Intracellular networks
(figure labels: Stimulus, DNA, RNA, Protein)
32 Mona Singh
Network of cells
(figure: six cells, each labeled DNA, RNA, Protein, fn)
33 Mona Singh
(figure labels: DNA, RNA, Protein, fn, fn)
34 Mona Singh
Lecture Notes
• www.cs.princeton.edu/~mona/computational_biology_notes.html