Example
Mitochondrial cytochrome b: transports electrons
From the NCBI protein web page, search for cytb and
  Loxodonta africana (African elephant)
  Elephas maximus (Indian elephant)
  Mammuthus primigenius (Siberian woolly mammoth)
Which modern elephant is closer to the mammoth?
Use ClustalW to do the alignment
Chap. 3: Sequence Alignment
>0012AAX12542.1| cytochrome b [Elephas maximus]
MTHTRKSHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILILGLMPLLHTSKHRSMMLRPLSQVLFWALTMDLLMLTWIGSQPVEYPYIAIGQMASILYFSIILAFLPIAGMIENYLIK
>gi|56578537|gb|AAW01445.1| cytochrome b [Loxodonta africana]
MTHIRKSYPLLKIINKSFIDLPTPSNISAWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFALHFILPFTMTALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILILGLMPLLHTSKYRSMMLRPLSQVLFWTLTMDLLMLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGMIENYLIK
>gi|2924604|dbj|BAA25008.1| cytochrome b [Mammuthus primigenius]
MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFALHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILILGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFSIILAFLPIAGMIENYLIK
Pairwise sequence alignment is the most fundamental operation of bioinformatics
It is used to decide if two proteins (or genes) are related structurally or functionally
It is used to identify domains or motifs that are shared among proteins
It is the basis of BLAST searching (next)
It is used in the analysis of genomes
Globin
Globins carry oxygen and were among the first proteins to be sequenced
  Hemoglobins: in red blood cells
  Myoglobin: in muscle cells of mammals
  Leghemoglobin: in legumes (beans, etc.)
Similarity and Homology
Similarity
  Observation or measurement of resemblance, independent of the source of the resemblance
  Can be observed now and involves no historical hypothesis
Homology
  Specifies that the sequences, and the organisms that carry them, descended from a common ancestor
  Implies that similarities are shared ancestral characteristics
  Cannot be asserted from direct historical evidence, and is thus an inference from observations of similarity
Homology
Similarity attributed to descent from a common ancestor
Two types of homology
  Orthologs: homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function
  Paralogs: homologous sequences within a single species that arose by gene duplication
Paralogs: members of a gene (protein) family within a species. This tree shows human globin paralogs.
Direct Alignment
Given two sequences, score +1 if the letters in the same position match, -1 otherwise
Extremely simple, but what if there is a gap?
  A gap arises when a base is inserted or deleted (indel)
  Maybe only in biological data
  Maybe a more significant mutation: give a more negative score as a penalty
RNDKPFSTARN
RNQKPKWWTA
++-++------
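The +1/-1 rule can be sketched in a few lines of Python. The function name `direct_score` and the two example strings are illustrative assumptions, not taken from the slides.

```python
def direct_score(v, w, match=1, mismatch=-1):
    """Score two equal-length sequences position by position, with no gaps."""
    if len(v) != len(w):
        raise ValueError("direct alignment needs sequences of equal length")
    # +1 for every position where the letters agree, -1 otherwise.
    return sum(match if a == b else mismatch for a, b in zip(v, w))


# Hypothetical example: 5 matching positions and 2 mismatches.
print(direct_score("GATTACA", "GACTATA"))
```

A single inserted or deleted letter shifts every later position, which is why this score breaks down as soon as a gap is needed.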
Visual Alignment: Dotplot
One sequence on the x axis and the other on the y axis
Dot on a crosspoint if the residue is identical in both sequences
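A minimal text-mode sketch of a dotplot, assuming `*` marks a dot and `.` an empty crosspoint (both symbols and the function name `dotplot` are illustrative choices):

```python
def dotplot(x_seq, y_seq):
    """Return one string per residue of y_seq; columns follow x_seq."""
    rows = []
    for b in y_seq:
        # Put a dot wherever the x-axis residue equals this y-axis residue.
        rows.append("".join("*" if a == b else "." for a in x_seq))
    return rows


for row in dotplot("GCTGAACG", "CTATAATC"):
    print(row)
```

Runs of dots along a diagonal correspond to stretches that match without gaps.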
Sequence Alignment
Direct alignment
  gctgaacg
  ctataatc
An alignment with gaps
  gctg-aa-c-g
  --ct-ataatc

  gctg-aa-cg
  -ctataatc-
What are the criteria for a good alignment?
  Use a score to check for optimality
  May not produce a unique optimal alignment
General approach to pairwise alignment
Given two sequences
Select an algorithm that generates a score
Allow gaps (insertions, deletions)
Score reflects degree of similarity
Alignments can be global or local
Estimate the probability that the alignment occurred by chance
Pairwise alignment: protein sequences can be more informative than DNA
Protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
Protein sequences offer a longer "look-back" time
DNA sequences can be translated into protein, and then used in pairwise alignments
DNA alignments are still appropriate in many cases
  to confirm the identity of a cDNA
  to study noncoding regions of DNA
  to study DNA polymorphisms
  example: Neanderthal vs modern human DNA
Scoring Matrix
Dotplot
  Incredibly useful in identifying biological significance and interesting regions
  Does not provide a measure of statistical similarity
A numerical method
  Does not just provide position-by-position overlap
  But also captures the nature and characteristics of the residues being aligned
Scoring matrices
  Empirical weighting schemes
Scoring Matrix
Three biological factors in constructing a scoring matrix
Conservation
  Account for conservation between proteins, but provide a way to assess conservative substitutions
  Score represents which residues can substitute for other residues without adversely affecting the function of the native protein (determined by charge, size, hydrophobicity, etc.)
Frequency
  Reflect how often residues occur among proteins
  Rare residues are given more weight
Evolution
  By design, implicitly represent evolutionary patterns
Review
  http://books.google.com/books?hl=en&lr=&id=9p3E2sS1aJUC&oi=fnd&pg=PA73&ots=eJ0lzjEg_b&sig=Fl2kBl5QBq7VIoy-eDgDqXhaZ14#v=onepage&q&f=false
Scoring Matrix
Log-Odds Score
  q_ij: probability that residues i and j are seen aligned
  p_i: probability of observing amino acid i among all proteins

$$ s_{ij} = \log \frac{q_{ij}}{p_i \, p_j} $$

The score represents the ratio of the observed versus the random frequency of substituting i by j
  Positive score: the two residues replace each other more often than by chance
  Negative score: the substitution is less likely than by chance
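A small numerical sketch of the log-odds score. The slides do not state the base of the logarithm, so base 2 is an assumption here, and the probabilities in the example are made up for illustration.

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0):
    """Log-odds score: observed pair frequency over the frequency expected by chance."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical frequencies: the pair is seen twice as often as chance predicts,
# so the score is positive (about +1 in base 2).
print(log_odds(0.02, 0.1, 0.1))
# Seen half as often as chance predicts, so the score is negative (about -1).
print(log_odds(0.005, 0.1, 0.1))
```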
Other Scores

Gap penalty
  Gap initiation and extension
ClustalW recommendations
  For DNA sequences: identity matrix (1 for a match, 0 for a mismatch), gap penalty of 10 for initiation and 0.1 for extension per residue
  For AA sequences: BLOSUM62 matrix for substitution, gap penalty of 11 for initiation and 1 for extension per residue

A gap of one residue:
  aaagaaa
  aaa-aaa
A gap of three residues:
  aaagggaaa
  aaa---aaa
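A sketch of how an initiation-plus-extension penalty grows with gap length. It assumes the penalty for a gap of length x is a + b*x, with a the initiation cost and b the per-residue extension cost; programs differ on whether the first gapped residue is also charged the extension cost, so treat this convention as an assumption.

```python
def gap_penalty(length, initiation, extension):
    """Penalty for one contiguous gap of `length` residues (assumed form: a + b*x)."""
    if length <= 0:
        return 0.0
    return initiation + extension * length


# Amino acid settings quoted on the slide (initiation 11, extension 1).
print(gap_penalty(1, 11, 1))  # one-residue gap
print(gap_penalty(3, 11, 1))  # three-residue gap

# DNA settings quoted on the slide (initiation 10, extension 0.1).
print(gap_penalty(3, 10, 0.1))
```

Under this scheme one gap of three residues costs far less than three separate one-residue gaps, which matches the idea that a single indel event can insert or delete several residues at once.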
Pairwise Alignment: Global and Local
Given a scoring scheme, find alignments maximizing the score
Global
  Covers the entire protein or DNA sequence
  Needleman and Wunsch (dynamic programming)
Local
  Focuses on the regions of greatest similarity
  Smith and Waterman
  In general, preferable to global alignment, because often only portions of proteins align
Dynamic Programming

Guaranteed to yield an optimal global alignment
Drawback: many alignments may give the same optimal score, and none of them may correspond to the biologically correct alignment
  W. Fitch and T. Smith found 17 alignments of the alpha- and beta-chains of chicken haemoglobin, one of which is correct based on structures
Drawback: complexity O(nm) for sequences of length n and m
Dynamic Programming

Rock removal game
  Two piles of rocks, each with 10 rocks
  A and B alternately remove one rock from a single pile or one rock each from both piles
  The player who removes the last rock(s) wins the game
Use a reduction strategy, starting with smaller problems
Consider the 2+2 problem
  A removes one rock from each pile, B removes one rock from each pile: B wins
  A removes one rock, B takes one rock from the same pile: B wins
3+3 problem?

Rock Removal with 10+10
  ↑ A takes one from pile X
  ← A takes one from pile Y
  ↖ A takes one from each pile
  * A will lose
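The reduction strategy can be sketched as a table filled from small piles to large ones. The function name `first_player_wins` is an illustrative assumption; a position is winning when at least one legal move leaves the opponent in a losing position.

```python
def first_player_wins(n, m):
    """True if the player about to move wins with piles of n and m rocks."""
    # win[i][j]: does the player to move win with piles of size i and j?
    win = [[False] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue  # no rocks left: the previous player took the last one
            outcomes = []
            if i > 0:
                outcomes.append(win[i - 1][j])      # one rock from pile X
            if j > 0:
                outcomes.append(win[i][j - 1])      # one rock from pile Y
            if i > 0 and j > 0:
                outcomes.append(win[i - 1][j - 1])  # one rock from each pile
            # Winning if some move hands the opponent a losing position.
            win[i][j] = any(not o for o in outcomes)
    return win[n][m]


print(first_player_wins(2, 2))    # the 2+2 problem
print(first_player_wins(3, 3))    # the 3+3 problem
print(first_player_wins(10, 10))  # the 10+10 game
```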
Manhattan Tourist Problem

Visit as many tourist sites as possible in a Manhattan grid
Move to the east or south only
Start at the upper left corner
End at # 15, the lower right corner
Problem Statement
Given a weighted grid G with two vertices (nodes) for a source and a sink
Find the longest path in a weighted grid
Weight: # of attraction sites on an edge (link)
Each vertex (node) can be identified by (i,j)
  Source at (0,0)
  Sink at (n, m)
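[Figure: 4 x 4 weighted grid. Eastward edge weights by row: (3 2 4), (3 2 4), (0 7 3), (3 3 0). Southward edge weights between consecutive rows: (1 0 2 4), (4 6 5 2), (4 4 5 2).]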
Solution
Define si,j: the length of the longest path from the source to vertex (i,j) (0 ≤ i ≤ n, 0 ≤ j ≤ m)
Solve for smaller problems first
Solving for s0,j and si,0 is easy
[Figure: the same weighted grid with the first row filled in, s0,j = 0, 3, 5, 9, and the first column filled in, si,0 = 0, 1, 5, 9, starting from (0,0).]
Solution (2)
Iteratively solve for neighboring nodes: si,1, then si,2, etc.

$$ s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\ s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j) \end{cases} $$
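[Figure: the same weighted grid with the first row (0, 3, 5, 9), the first column (0, 1, 5, 9) and the second column (s1,1 = 4, s2,1 = 10, s3,1 = 14) filled in; vertices (0,0), (1,0), (2,0), (3,0) and (0,1) are labeled.]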
Algorithm
Given Weast(i,j) and Wsouth(i,j), the weights of the edges entering (i,j) from the west and from the north:

  s0,0 = 0
  for i = 1 to n
      si,0 = si-1,0 + Wsouth(i,0)
  for j = 1 to m
      s0,j = s0,j-1 + Weast(0,j)
  for i = 1 to n
      for j = 1 to m
          si,j = max[si-1,j + Wsouth(i,j),
                     si,j-1 + Weast(i,j)]
  return sn,m
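A runnable sketch of the algorithm in Python. The edge weights are read from the numbers that accompany the grid figure on these slides, interpreted as alternating rows of eastward and southward weights; that reading is an assumption, although it reproduces the s values 0, 3, 5, 9 and 1, 5, 9 and 4, 10, 14 printed with the figure. Here `east[i][j]` is the weight of the edge leaving (i,j) to the east and `south[i][j]` the weight of the edge leaving (i,j) to the south, so the indices are shifted by one relative to Weast and Wsouth.

```python
def manhattan_tourist(east, south):
    """Fill the table s, where s[i][j] is the longest path weight from (0,0) to (i,j)."""
    n = len(south)    # number of southward steps from source to sink
    m = len(east[0])  # number of eastward steps from source to sink
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only southward moves
        s[i][0] = s[i - 1][0] + south[i - 1][0]
    for j in range(1, m + 1):          # first row: only eastward moves
        s[0][j] = s[0][j - 1] + east[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + south[i - 1][j],   # arrive from the north
                          s[i][j - 1] + east[i][j - 1])    # arrive from the west
    return s


east = [[3, 2, 4], [3, 2, 4], [0, 7, 3], [3, 3, 0]]
south = [[1, 0, 2, 4], [4, 6, 5, 2], [4, 4, 5, 2]]
table = manhattan_tourist(east, south)
print(table[-1][-1])  # weight of the longest path to the sink
```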
Directed Acyclic Graph
DAG: Directed Acyclic Graph, G = (V, E)
Longest Path Problem

$$ s_v = \max_{u \in \mathrm{Predecessors}(v)} \left( s_u + \text{weight of the edge from } u \text{ to } v \right) $$

The predecessor relationship has to be established ahead of time
[Figure: vertex v with weighted incoming edges from its predecessors u1, u2 and u3.]
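A sketch of the longest path computation on a general DAG. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to establish the predecessor order; the graph, its weights and the function name `longest_paths` are hypothetical.

```python
from graphlib import TopologicalSorter


def longest_paths(pred, source):
    """pred maps each vertex to a dict {predecessor: edge weight}."""
    # Visit vertices so that every predecessor comes before its successors.
    order = TopologicalSorter({v: set(us) for v, us in pred.items()}).static_order()
    s = {}
    for v in order:
        if v == source:
            s[v] = 0
            continue
        # s_v = max over predecessors u of (s_u + weight of edge u -> v);
        # vertices that cannot be reached from the source stay at -infinity.
        s[v] = max((s[u] + w for u, w in pred.get(v, {}).items()),
                   default=float("-inf"))
    return s


# Hypothetical graph: source -> u1, u2, u3 -> v.
pred = {
    "u1": {"source": 5},
    "u2": {"source": 3},
    "u3": {"source": 1},
    "v": {"u1": 5, "u2": 6, "u3": 3},
}
print(longest_paths(pred, "source")["v"])
```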
Graph Problem applied to Alignment
Measure of similarity
  Hamming distance: equal-length sequences
  Levenshtein or edit distance, 1966: unequal-length sequences
    Min. # of 'edit operations' (insertion, deletion, alteration of a single character in either sequence) required to change one string into the other
e.g.
  a g - t c c
  c g c t c a
  Levenshtein distance = 3
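A compact sketch of the edit distance computed by dynamic programming, where each insertion, deletion or substitution costs 1. The function name `levenshtein` is an illustrative choice.

```python
def levenshtein(v, w):
    """Minimum number of single-character edits that turn v into w."""
    prev = list(range(len(w) + 1))  # distances from the empty prefix of v
    for i, a in enumerate(v, start=1):
        curr = [i]                  # deleting the first i characters of v
        for j, b in enumerate(w, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # match or substitute
        prev = curr
    return prev[-1]


print(levenshtein("agtcc", "cgctca"))  # the two strings from the example
```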
Edit Distance and Alignment

Two strings, v and w
Gaps are allowed in either string, except that two gaps are not allowed at the same character position

  v: A T - G T T A T -
  w: A T C G T - A - G

Each column is labeled with the position of its character in the original string without gaps (a gap repeats the previous position)
  v: (1 2 2 3 4 5 6 7 7)
  w: (1 2 3 4 5 5 6 6 7)
For both strings together:
  (0,0) (1,1) (2,2) (2,3) (3,4) (4,5) (5,5) (6,6) (7,6) (7,7)
This represents a path in a grid
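The correspondence between an alignment and a grid path can be sketched as follows. The function name `alignment_to_path` is an illustrative assumption; the two rows are the gapped strings from the example.

```python
def alignment_to_path(v_row, w_row, gap="-"):
    """Turn two gapped rows of equal length into a list of grid vertices (i, j)."""
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(v_row, w_row):
        if a != gap:
            i += 1  # consumed one character of v: move one step along v
        if b != gap:
            j += 1  # consumed one character of w: move one step along w
        path.append((i, j))
    return path


print(alignment_to_path("AT-GTTAT-", "ATCGT-A-G"))
```

A column without gaps is a diagonal step, and a column with a gap is a horizontal or vertical step.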
Edit Distance

Vertex (i,j) corresponds to the pair of positions (i, j), that is, to (vi, wj)
G = (V, E)
Longest Path Problem

$$ s_v = \max_{u \in \mathrm{Predecessors}(v)} \left( s_u + \text{weight of the edge from } u \text{ to } v \right) $$

The predecessor relationship has to be established ahead of time
Global Alignment

A string is a sequence of characters drawn from an alphabet A of size k
Scoring matrix, δ, of size (k+1) x (k+1); the extra row and column are for the gap character
Problem Statement
  Given two strings, v and w, and a scoring matrix δ,
  find the longest (max. score) path
Dynamic programming kernel
  Recurrence relationship

$$ s_{i,j} = \max \begin{cases} s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \\ s_{i-1,j-1} + \delta(v_i, w_j) \end{cases} $$
Global Alignment

Example of scoring matrix
  Match: +1; mismatch: -μ; indels: -σ

$$ s_{i,j} = \max \begin{cases} s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu, & \text{otherwise} \end{cases} $$

Indels are frequent, and gap penalties proportional to indel size are considered too severe
  Affine gap penalties soften the penalty rate
  Can be linear, -(a + bx), for an indel of length x
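A minimal sketch of the global alignment score with match +1, mismatch -μ and indel -σ. It returns only the score, not the alignment itself, and the function name and default parameter values are illustrative assumptions.

```python
def global_alignment_score(v, w, mu=1, sigma=1):
    """Best global alignment score with match +1, mismatch -mu, indel -sigma."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i   # a prefix of v aligned against gaps only
    for j in range(1, m + 1):
        s[0][j] = -sigma * j   # a prefix of w aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(s[i - 1][j] - sigma,   # v_i against a gap
                          s[i][j - 1] - sigma,   # w_j against a gap
                          diagonal)              # v_i against w_j
    return s[n][m]


print(global_alignment_score("gattaca", "gattaca"))
print(global_alignment_score("ab", "b"))
```

Keeping a pointer to the term that achieved each maximum would allow the alignment to be recovered by tracing back from (n,m).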
Local Alignment

Global sequence alignment is useful for aligning sequences from the same protein family, for example
Substrings from two sequences may be highly conserved in biological applications
  Temple Smith and Michael Waterman, 1981
  Biologically irrelevant diagonal matches are likely to have a higher score
Local Alignment Problem
  Given two strings v and w, and a scoring matrix δ
  Find substrings of v and w whose global alignment is maximal among all substrings of v and w
Seemingly harder, because the global alignment is to find the longest path from (0,0) to (n,m), whereas the local alignment is to find the longest path among all paths between two arbitrary points, (i,j) and (i', j')
Add edges of weight 0 from (0,0) to every other vertex, so that vertex (0,0) is a predecessor of every vertex
Local Alignment Solution
Recurrence kernel becomes

$$ s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \\ s_{i-1,j-1} + \delta(v_i, w_j) \end{cases} $$

Select the largest si,j
Other non-maximal local alignments may have biological significance
  Select the k best nonoverlapping local alignments
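A sketch of the local alignment score using the recurrence with the extra 0 option. For simplicity it assumes a scoring scheme of match +1, mismatch -1 and indel -1 in place of a general matrix δ; these values and the function name are illustrative assumptions.

```python
def local_alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Largest s[i][j] over the whole table, i.e. the best local alignment score."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            s[i][j] = max(0,                     # free ride from (0,0): start here
                          s[i - 1][j] + indel,   # v_i against a gap
                          s[i][j - 1] + indel,   # w_j against a gap
                          diagonal)              # v_i against w_j
            best = max(best, s[i][j])            # the alignment may end anywhere
    return best


# Hypothetical strings that share only the substring ACGT.
print(local_alignment_score("xxxACGTyyy", "zzACGTzz"))
```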