Proteins, Pair HMMs, and Alignment
CS262 Lecture 8, Win06, Batzoglou
A state model for alignment
x:      -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
y:      TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
states: IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
M(+1,+1)
I(+1, 0)
J(0, +1)
Alignments correspond 1-to-1 with sequences of states M, I, J
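The correspondence can be checked mechanically. Below is a minimal Python sketch, assuming the labelling used in the example alignment (M for a letter/letter column, I where the top row has a gap, J where the bottom row has a gap). The function names are illustrative and not part of the lecture.

```python
def alignment_to_states(row_x, row_y, gap="-"):
    """Return the M/I/J state string for two gapped rows of equal length."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    states = []
    for a, b in zip(row_x, row_y):
        if a == gap and b == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if a == gap:
            states.append("I")  # gap in the top row (labelling of the example)
        elif b == gap:
            states.append("J")  # gap in the bottom row
        else:
            states.append("M")  # match or mismatch column
    return "".join(states)


def states_to_alignment(states, x, y, gap="-"):
    """Rebuild the two gapped rows from a state string and the raw sequences."""
    xi, yi = iter(x), iter(y)
    row_x, row_y = [], []
    for s in states:
        row_x.append(gap if s == "I" else next(xi))
        row_y.append(gap if s == "J" else next(yi))
    return "".join(row_x), "".join(row_y)
```

Going from rows to states and back returns the original rows, which is the 1-to-1 correspondence stated above.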
Let’s score the transitions
[Figure: the same alignment and M/I/J state diagram, now with scores on the transitions]
Every transition into M (from M, I, or J) scores s(x_i, y_j)
Opening a gap (M → I or M → J) scores -d
Extending a gap (I → I or J → J) scores -e
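Scoring a given alignment under this scheme is a single pass over its columns. The sketch below assumes that d and e are positive penalties that are subtracted, that the first column of a gap costs d and each further column costs e, and that switching directly between I and J is charged as a new gap opening. The name `score_alignment` and the scoring function `s` are illustrative.

```python
def score_alignment(row_x, row_y, s, d, e, gap="-"):
    """Sum substitution scores and affine gap penalties over aligned columns."""
    total = 0.0
    prev = None  # state of the previous column (None before the first column)
    for a, b in zip(row_x, row_y):
        if a != gap and b != gap:
            state = "M"
            total += s(a, b)          # transition into M: substitution score
        else:
            state = "I" if a == gap else "J"
            # staying in the same gap state extends the gap, otherwise it opens one
            total -= e if prev == state else d
        prev = state
    return total
```

For example, with +1 for a match, -1 for a mismatch, d = 2 and e = 1, the alignment of AAAA against A--- scores 1 - 2 - 1 - 1 = -3.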
Alignment with affine gaps – state version
Dynamic Programming:
M(i, j): Score of the optimal alignment of x_1…x_i to y_1…y_j ending in state M
I(i, j): Score of the optimal alignment of x_1…x_i to y_1…y_j ending in state I
J(i, j): Score of the optimal alignment of x_1…x_i to y_1…y_j ending in state J
The score is additive, therefore we can apply DP recurrence formulas
Alignment with affine gaps – state version
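Here d > 0 is the gap initiation penalty and e > 0 the gap extension penalty, so a gap of length k costs d + (k - 1)e.

Initialization:
$$M(0,0) = 0; \qquad M(i,0) = M(0,j) = -\infty \ \text{ for } i, j > 0$$
$$I(i,0) = -d - (i-1)\,e; \qquad J(0,j) = -d - (j-1)\,e$$
All other boundary entries of I and J are $-\infty$.

Iteration:
$$M(i,j) = s(x_i, y_j) + \max\{\, M(i-1,j-1),\ I(i-1,j-1),\ J(i-1,j-1) \,\}$$
$$I(i,j) = \max\{\, I(i-1,j) - e,\ M(i-1,j) - d \,\}$$
$$J(i,j) = \max\{\, J(i,j-1) - e,\ M(i,j-1) - d \,\}$$

Termination:
Optimal alignment score given by max { M(m, n), I(m, n), J(m, n) }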
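The three-matrix recurrence can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's reference implementation: d and e are taken as positive penalties that are subtracted, a gap of length k costs d + (k - 1)e, there are no direct transitions between I and J, and only the optimal score (no traceback) is returned. The name `affine_align_score` is hypothetical.

```python
NEG_INF = float("-inf")


def affine_align_score(x, y, s, d, e):
    """Global alignment score with affine gaps using states M, I, J."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    # Initialization: empty alignment scores 0; borders are pure gaps.
    M[0][0] = 0.0
    for i in range(1, m + 1):
        I[i][0] = -d - (i - 1) * e
    for j in range(1, n + 1):
        J[0][j] = -d - (j - 1) * e

    # Iteration: each cell looks only at the three neighbouring cells.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = s(x[i - 1], y[j - 1]) + max(
                M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1]
            )
            I[i][j] = max(I[i - 1][j] - e, M[i - 1][j] - d)
            J[i][j] = max(J[i][j - 1] - e, M[i][j - 1] - d)

    # Termination: the alignment may end in any of the three states.
    return max(M[m][n], I[m][n], J[m][n])
```

With +1 for a match, -1 for a mismatch, d = 2 and e = 1, aligning ACGT to AGT gives 1 (three matches and one gap of length 1), and aligning AAAA to A gives -3 (one match and one gap of length 3).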
Brief introduction to the evolution of proteins
• Protein sequence and structure
• Protein classification
• Phylogeny trees
• Substitution matrices
Muscle cells and contraction
Actin and myosin during muscle movement
Actin structure
Actin sequence
• Actin is ancient and abundant
  Most abundant protein in cells
  1-2 actin genes in bacteria, yeasts, amoebas
  Humans: 6 actin genes
  α-actin in muscles; β-actin, γ-actin in non-muscle cells
• ~4 amino acids different between each version
MUSCLE ACTIN Amino Acid Sequence
  1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR
 61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI
121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG
181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL
241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG
301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY
361 DESGPSIVHR KCF
A related protein in bacteria
Relation between sequence and structure
Protein Phylogenies
• Proteins evolve by both duplication and species divergence
Protein Phylogenies – Example
Structure Determines Function
What determines structure?
• Energy
• Kinematics
How can we determine structure?
• Experimental methods
• Computational predictions
The Protein Folding Problem
Primary Structure: Sequence
• The primary structure of a protein is the amino acid sequence
Primary Structure: Sequence
• Twenty different amino acids have distinct shapes and properties
Primary Structure: Sequence
A useful mnemonic for the hydrophobic amino acids is "FAMILY VW"
Secondary Structure: α, β, & loops
α helices and β sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms
Tertiary Structure: A Protein Fold
PDB Growth
[Figure: plot with axis label "New PDB structures"]
Only a few folds are found in nature
Protein classification
• Number of protein sequences grows exponentially
• Number of solved structures grows exponentially
• Number of new folds identified is very small (and close to constant)
• Protein classification can
  Generate an overview of structure types
  Detect similarities (evolutionary relationships) between protein sequences
  Help predict the 3D structure of new protein sequences
Morten Nielsen, CBS, BioCentrum, DTU
SCOP release 1.69: classification of 25,973 protein structures in PDB

Class                            # folds   # superfamilies   # families
All alpha proteins                   218               376          608
All beta proteins                    144               290          560
Alpha and beta proteins (a/b)        136               222          629
Alpha and beta proteins (a+b)        279               409          717
Multi-domain proteins                 46                46           61
Membrane & cell surface               47                88           99
Small proteins                        75               108          171
Total                                945              1539         2845
Protein structure classification
[Figure: hierarchy from the protein world down to protein fold, protein superfamily, and protein family]
Structure Classification Databases
• SCOP: manual classification (A. Murzin), scop.berkeley.edu
• CATH: semi-manual classification (C. Orengo), www.biochem.ucl.ac.uk/bsm/cath
• FSSP: automatic classification (L. Holm), www.ebi.ac.uk/dali/fssp/fssp.html
Major classes in SCOP
• Classes
  All α proteins
  All β proteins
  α and β proteins (α/β)
  α and β proteins (α+β)
  Multi-domain proteins
  Membrane and cell surface proteins
  Small proteins
  Coiled coil proteins
All α: Hemoglobin (1bab)
All β: Immunoglobulin (8fab)
α/β: Triosephosphate isomerase (1hti)
α+β: Lysozyme (1jsf)
Families
• Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)
• Families are further subdivided into Proteins
• Proteins are divided into Species
  The same protein may be found in several species
[Figure: SCOP hierarchy, Fold → Superfamily → Family → Proteins]
Superfamilies
• Proteins which are (remotely) evolutionarily related
Sequence similarity low
Share function
Share special structural features
• Relationships between members of a superfamily may not be readily recognizable from the sequence alone
Folds
• Proteins with >~50% of their secondary structure elements arranged in the same order in sequence and in 3D share a fold
• No evolutionary relationship is required between proteins that share a fold
Substitutions of Amino Acids
Substitution rates differ dramatically between different pairs of amino acids!
Substitution Matrices
BLOSUM matrices:
1. Start from BLOCKS database (curated, gap-free alignments)
2. Cluster sequences according to > X% identity
3. Calculate A_ab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes
4. Estimate
$$P(a) = \frac{\sum_b A_{ab}}{\sum_{c \le d} A_{cd}}; \qquad P(a, b) = \frac{A_{ab}}{\sum_{c \le d} A_{cd}}$$
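Steps 3 and 4 can be sketched in Python on a toy block. This is an illustration under explicit assumptions, not the published BLOSUM procedure: the clustering of step 2 is taken as already done, every aligned pair is counted in both orders (a, b) and (b, a), and the estimates are normalised over all ordered pairs (rather than over c ≤ d) so that the probabilities sum to 1. The function names and the base-2 log-odds score are illustrative choices.

```python
from collections import defaultdict
from itertools import combinations
from math import log2


def pair_counts(clusters):
    """Weighted counts A[(a, b)] of aligned letters from distinct clusters.

    clusters: list of clusters, each a list of equal-length gap-free sequences.
    Each pair of sequences from clusters of sizes m and n has weight 1/(m*n).
    """
    A = defaultdict(float)
    for c1, c2 in combinations(clusters, 2):
        w = 1.0 / (len(c1) * len(c2))
        for s1 in c1:
            for s2 in c2:
                for a, b in zip(s1, s2):
                    A[(a, b)] += w  # count the pair in both orders
                    A[(b, a)] += w
    return A


def estimate(A):
    """Return (P(a), P(a, b)) normalised over all ordered pairs."""
    total = sum(A.values())
    p_pair = {ab: v / total for ab, v in A.items()}
    p_single = defaultdict(float)
    for (a, _b), v in p_pair.items():
        p_single[a] += v  # marginal of the joint distribution
    return dict(p_single), p_pair


def log_odds(a, b, p_single, p_pair):
    """Log-odds score; raises KeyError for pairs never observed in the data."""
    return log2(p_pair[(a, b)] / (p_single[a] * p_single[b]))
```

For the toy input `[["AC", "AC"], ["AA"]]` the two sequences of the first cluster each contribute with weight 1/2, giving P(A) = 0.75 and P(C) = 0.25.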
Probabilistic interpretation of an alignment
An alignment is a hypothesis that the two sequences are related by evolution
Goal:
Produce the most likely alignment
Assess the likelihood that the sequences are indeed related
A Pair HMM for alignments
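[Figure: pair HMM with three states M, I, J]
M emits an aligned pair with probability P(x_i, y_j); I emits x_i with probability P(x_i); J emits y_j with probability P(y_j)
Transitions: M → M with probability 1 - 2δ; M → I and M → J with probability δ each; I → I and J → J with probability ε; I → M and J → M with probability 1 - ε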
This model generates two sequences simultaneously
Match/Mismatch state M: P(x, y) reflects substitution frequencies between pairs of amino acids
Insertion states I, J: P(x), P(y) reflect frequencies of each amino acid
δ: set so that 1/(2δ) is the avg. length of a run of matches before the next gap
ε: set so that 1/(1 - ε) is the avg. length of a gap
Model M
optional
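To see what "generates two sequences simultaneously" means, the model can be sampled. The sketch below makes several assumptions not fixed by the slide: the chain starts in M, it runs for a fixed number of columns `n_columns` (the slide's model has no end state), M moves to I or J with probability δ each, a gap state stays put with probability ε, and there are no direct transitions between I and J. All names are illustrative.

```python
import random


def sample_pair_hmm(n_columns, delta, eps, pair_probs, single_probs, seed=0):
    """Sample (x, y, state path) from a three-state pair HMM.

    pair_probs:   dict mapping (a, b) -> P(a, b), emitted in state M
    single_probs: dict mapping a -> P(a), emitted in states I and J
    """
    rng = random.Random(seed)
    pairs, pair_w = zip(*pair_probs.items())
    letters, letter_w = zip(*single_probs.items())
    x, y, path = [], [], []
    state = "M"  # assumption: start in the match state
    for _ in range(n_columns):
        path.append(state)
        if state == "M":
            a, b = rng.choices(pairs, weights=pair_w)[0]
            x.append(a)
            y.append(b)
            # stay in M with 1 - 2*delta, open a gap in either sequence with delta
            state = rng.choices("MIJ", weights=[1 - 2 * delta, delta, delta])[0]
        else:
            letter = rng.choices(letters, weights=letter_w)[0]
            (x if state == "I" else y).append(letter)  # I emits into x, J into y
            # extend the gap with eps, return to M with 1 - eps
            state = rng.choices([state, "M"], weights=[eps, 1 - eps])[0]
    return "".join(x), "".join(y), "".join(path)
```

Each M column adds one letter to both sequences, each I column adds a letter to x only, and each J column adds a letter to y only, so the lengths of x and y follow directly from the sampled state path.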
A Pair HMM for unaligned sequences
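[Figure: two states, I emitting P(x_i) and J emitting P(y_j), each with transition probability 1]
The two sequences are generated independently of one another
$$P(x, y \mid R) = P(x_1) \cdots P(x_m)\, P(y_1) \cdots P(y_n) = \prod_i P(x_i) \prod_j P(y_j)$$
Model R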
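Under the random model the probability is a plain product over letters, which is best computed as a sum of logs. A minimal sketch, assuming a dictionary `p` of background letter frequencies (the function name is illustrative):

```python
from math import log


def log_prob_random(x, y, p):
    """log P(x, y | R): every letter of x and y is emitted independently."""
    return sum(log(p[a]) for a in x) + sum(log(p[b]) for b in y)
```

For uniform DNA frequencies (0.25 each), the sequences "AC" and "G" together give 3 · log 0.25.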
To compare ALIGNMENT vs. RANDOM hypothesis
Every pair of letters contributes:
Model M
• (1 - 2δ) P(x_i, y_j) when matched
• ε P(x_i) P(y_j) when gapped
Model R
• P(x_i) P(y_j) in random model
Focus on comparison of
P(x_i, y_j) vs. P(x_i) P(y_j)
To compare ALIGNMENT vs. RANDOM hypothesis
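Idea: We will divide the alignment score by the random score, and take logarithms

Let
$$s(x_i, y_j) = \log \frac{P(x_i, y_j)}{P(x_i)\,P(y_j)} + \log(1 - 2\delta) \qquad \text{(substitution score)}$$
$$d = -\log \frac{\delta\,(1 - \varepsilon)\,P(x_i)}{(1 - 2\delta)\,P(x_i)} = -\log \frac{\delta\,(1 - \varepsilon)}{1 - 2\delta} \qquad \text{(gap initiation penalty)}$$
$$e = -\log \frac{\varepsilon\,P(x_i)}{P(x_i)} = -\log \varepsilon \qquad \text{(gap extension penalty)}$$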
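The three definitions translate directly into code. A minimal sketch, assuming δ is the probability of leaving M for one particular gap state, ε is the probability of extending a gap, and logarithms are taken base 2 (the slide does not fix the base). The function names are illustrative.

```python
from math import log2


def substitution_score(p_ab, p_a, p_b, delta):
    """s(a, b) = log P(a,b) / (P(a) P(b)) + log(1 - 2*delta)."""
    return log2(p_ab / (p_a * p_b)) + log2(1 - 2 * delta)


def gap_open_penalty(delta, eps):
    """d = -log [ delta * (1 - eps) / (1 - 2*delta) ]; emission terms cancel."""
    return -log2(delta * (1 - eps) / (1 - 2 * delta))


def gap_extend_penalty(eps):
    """e = -log eps; emission terms cancel."""
    return -log2(eps)
```

For instance, δ = 0.1 and ε = 0.5 give d = -log2(0.05 / 0.8) = 4 bits and e = 1 bit, so opening a gap is considerably more expensive than extending one.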
The meaning of alignment scores
• The Viterbi algorithm for Pair HMMs corresponds exactly to global alignment DP with affine gaps
$$V_M(i,j) = \max\{\, V_M(i-1,j-1),\ V_I(i-1,j-1),\ V_J(i-1,j-1) \,\} + s(x_i, y_j)$$
$$V_I(i,j) = \max\{\, V_M(i-1,j) - d,\ V_I(i-1,j) - e \,\}$$
$$V_J(i,j) = \max\{\, V_M(i,j-1) - d,\ V_J(i,j-1) - e \,\}$$
• s(.,.) ~ how often a pair of letters substitute one another
• 1 - ε ~ 1/(mean length of a gap)
• δ ~ 1/(mean arrival time of the next gap)
The meaning of alignment scores
Match/mismatch scores:
$$s(a, b) \approx \log \frac{P(a, b)}{P(a)\,P(b)}$$
(ignore log(1 - 2δ) for the moment)

Example: DNA regions between human and mouse genes have average conservation of 80%

1. What is the substitution score for a match?
P(a, a) + P(c, c) + P(g, g) + P(t, t) = 0.8, so P(x, x) = 0.2
P(a) = P(c) = P(g) = P(t) = 0.25
s(x, x) = log2 [ 0.2 / 0.25² ] = 1.68

2. What is the substitution score for a mismatch?
P(a, c) + … + P(t, g) = 0.2, so P(x, y≠x) = 0.2/12 = 0.0167
s(x, y≠x) = log2 [ 0.0167 / 0.25² ] = -1.91

3. What ratio matches/(matches + mism.) gives score 0?
1.91/(1.68 + 1.91) = 53% (~halfway between random and conserved model)
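The arithmetic of this example is easy to recompute. The sketch below assumes base-2 logarithms, uniform base frequencies of 0.25, and that the conserved mass is split evenly over the 4 identical pairs and the remainder evenly over the 12 non-identical ordered pairs (so each mismatch pair has probability 0.2/12 ≈ 0.0167). The function name is illustrative.

```python
from math import log2


def dna_scores(conservation, p_base=0.25):
    """Return (match score, mismatch score, break-even match fraction)."""
    p_match = conservation / 4            # P(x, x) for each of the 4 identical pairs
    p_mismatch = (1 - conservation) / 12  # P(x, y) for each of the 12 other pairs
    s_match = log2(p_match / p_base**2)
    s_mismatch = log2(p_mismatch / p_base**2)
    # fraction f of matches with f*s_match + (1 - f)*s_mismatch = 0
    break_even = -s_mismatch / (s_match - s_mismatch)
    return s_match, s_mismatch, break_even


s_match, s_mismatch, break_even = dna_scores(0.8)
print(round(s_match, 2), round(s_mismatch, 2), round(break_even, 2))
# prints: 1.68 -1.91 0.53
```

The break-even identity of about 53% sits roughly halfway between the 25% expected for random sequence and the 80% of the conserved model.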
BLOSUM matrices
[Figure: the BLOSUM 50 and BLOSUM 62 matrices]
(The two are scaled differently)