Proteins Determine Function
• Proteins make us tick
• Problems occur when
  – Proteins are missing
  – Proteins are malfunctioning
  – Proteins are present that should not be there
• Controlling disease
  – Understanding protein function
  – Stopping protein function
  – Supplying missing/desired protein function
04/21/23 1Sequence Alignment
The Central Dogma
The Human Genome
• The entire collection of our DNA
  – Consists of about 3.5 × 10^9 base pairs
• Our genome is split across chromosomes
  – Makes packaging easier/more efficient
Human Genome Project
• Began in 1990, a 13-year effort coordinated by the DOE and the NIH. Project goals included:
  – identify all the approximately 30,000 genes in human DNA,
  – determine the sequences of the 3 billion chemical base pairs that make up human DNA,
  – store this information in databases,
  – improve tools for data analysis,
  – transfer related technologies to the private sector, and
  – address the ethical, legal, and social issues.
• The Human Genome Project ended in 2003 with the completion of the human genetic sequence.
DNA Sequencing
• Determining the sequence of nucleotides in a strand of DNA
  – 1978 Sanger
    • Determined the DNA sequence for phi-x 174
    • 3.5 × 10^3 base pairs
    • Took about 2 years to sequence
  – 2001 Venter/Celera
    • Sequenced the human genome
    • 3 × 10^9 base pairs
    • Took about 9 months to sequence
Sequence DataTCTAGGAGGGAAGCACCCACCTCCCCTAAGCTCCATCTCCCTGAGCACTCATTTCCCAATGACCATACCAGGTTTTGGCCCTAGAGAGTTTATTACAAAATAAGAAAGAGAAGTCTGGGGAAGGTTCACTCATCATAGAATTTTGGCAGTTCATTGCCCAAGATGACTCGATGGTCCACACCGGCAGCTGTAATAGTGACCAGGTAGATGACACCCCCGCTTGAGCCATCCCGGCTCATGGCCAGAGCAATAGCTGCAGAGGGTTTCAAGTTGGAGAGAGGGAGAGAGAGGATGGCTTAGCTTCAAAAATCTTTTTACTCCCCCTCCATCCATATGCCTACTACCACTTTCACCTCAAAACTCATCTTCCAGGAAGGCATATTTAGTGGTGTGCTGGTAAATCAGTTTTTTTACAAAAAGGCTTCCATATGTGGCATCTGCTGATGTCCGTGGTGTAAATGCTCCCGCTATGATGAATTGCAAGTTACAAATAGCTAAGCAGTTCACAAATCCTTGACTATTTAACAGTCCGCTCTCATGAGTGGTCCCAAGCCAGCCTCAGCACACCTCAGCACACCACTGGTTCTTTTTTTTTTTTTTTTTCTCCAGACAGGGTCTCTCTCTGTCACCTAGGCTGCAGCGCAGTGGTGCAATCACCGCTCACTACAGCCTTGATCTCCCCGGCTCAGATGATCTTTCCACCTCAGCCTCCTGAGTAGCTGGGACTACAGGTGTGCACCACTATGCCCAGTTCATTTTTTTTTTTACTTTTTTTTATTGTTTTTTGTGGAGACAGGGTTTCACCATCTTGCCTAGGCTGGCCTCAAACTCCTGGGCTCAAGTAATCCTCCTGCCTCAGCCTCCCAAATTGTTGGCATTACAGGTGTGAGCCACTGTGCTTAGCACACCACTGGTTCTCACAGTGACTGTGTATCCTCATTTGATTTACTCAGAACAGCCCTGGTTTATCCGTATTGCCCAAGAACCCCATTGAGCTTTGCATTTGTCCTGCCCCTTTTCACTCTTAAAAGTGTACCAGGCCCGGCATTAACTTAAATGGCCACCCCTGTATTTCTCTTCCTGTTCCTCATAATCTACTTCCTTCCCATGTTTCAAAGCCCTCCCCAGGTACCCTTCCACTTGGCTGGTTACCGTCTGTGGTGAAGCGCCTGCACTCCTCGGGAGACATGCCTGGCTTATATGCTGCATCCACATAACCATAGATAAAGGTGCTGCCGGAGCCACCAATGGCAAAAGGCTGTCGAGTCAGCATTCCTCCCAGGGTTCCATATACCTGGGAAAGGGATCCTCAGGTTAAAGAATCATCAAGCCCTTCCTTCCCACTGAGACATTAAGTGGTCTCTGCACCCTGCAATGAAGCCCTGGTATCTCATATCCCCAAAGTACTATGCTTTCAGAGGTAGTGTCCTTGGAACTCATTGCTAGAATGACATAGGACTTCCATCTTCCTCTGCAGGAGAGTGGGGAAGCCCAGAGGAGAGAGTGCTTTGGGAGAAACTCACCTGACCTCCTTCACGTTGGTCCCAGCCAGCTACCATGAGATGTGCAGACAAGTCCTCTCGATATTTATAGCTGATATTTCTCACCACATTTGCAGCAGCCAAAACAAGTGGAGGTTCCTCCAGTTCTATCCTGAGGGAAATATTAGGAATAAAGGTTGATAGAATTTTAAGTCTCATTCTCCTATACTGTTACCATCATCCCTGCTAAACGACCCCTGAAAACTGTAACTGCAATAGCTCAAACTGCAGCCTCCCTCCCACATGTACAGGGGAACCAGAGTCCCACACCACCAACTGGTAAGAAGCTTTCAATTGCTCACTCTTTTGCTCAGCCCCACCCACATAACTTTCTTTTGGCTGCAAGGACCCTGCTCTTATGGGGAAAAGCAGATAAGGTTCACTCGGTTCACCACCGCCTCGCTGTCAGGAGGGAGTCAACAGTCACCAAGTTAAAACTCAGGTTT
TTTTTTTTTTTTTTTTTTTTTGAGACAGTCTCACTCTGTCACCCAGGCTGGAGTGCAGTGGATCAATCTTGGGCTCACTGCAAACTTCGCCTCCCTGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCGAATAGCTGGGATTACAGGCACCCACCACCAAGCCCAGCTAATGTTTGTATTTTCAGTAGAGACAAGGTCCCAACATGTTGGCCAGGCTGGTCTCAAACTCCTGACCTCAAATATCTGCCCACCTCGGCATCCCAAAGTGCTGAGATTATAGATGTGAGCCACTGCACCCAACCAGAACTCAGGAATTTTTGAGGGTGATCATTCAATGTCTCTCAAATTTCTTTGACAAGAGAATAGCATGAAGTTTAATGCTTGGATTAAAGCAGGAGGCAAATAATCATCTCAGATATTATTAATCACTGCAGATGTTAATCAAAATTAGGCTTATTTTTCAGGCTTAGATTTTATAACAAAGCAAAAAATGCTAAGGTAAGAAAAATATGCCTCATCAATTTTCTTTGCTATTAACAATCTTGAGAGAGTTATGTTCTATGGAACATAATGTCAGTAATATTGACCTAACCCCATATACTCATTTTGCATGTGAGGAAATTGGTTAGGAGTGGGAGAAGAGACAAAATAGTTCAATATATGGTAAATGAGAAACCAGGTATCTGCTTGACAGAATCATCTTTTTGATCCCTAAGCACAGATGGAAAGAAGACCCTCAAAAATCTATCTCCTGTCCCCCTCTCAGACCCTATTCCTTTACTCATCCCTGTACACTACTGGGACAGGTCACATACACATTCAGACCCCAGATCCTCCTCCACAAATTCAGAGACCCAAGCACCCACCAAATAGCTTATCATAGTGGCTTTTGGGGAAGGTCAACTCCATTCCTCCAAGGCTCCAGTTTGCCAGTCTTTTCATGAATGGGTAAGGAAAGTGTGTATTTGAGGCCATTAGCTTCTTTCCAAATGCATACATCTTCACTTTTACTCACCCTGCAGACACTCGGGAATCAGAACCCATCACAACGCCCCCGTCAAACTCCACTGCCATGATGGTGGTCTGCAGAGACACAGAATATGGAATGTCAGGGCAAGAACAGCCTTGATGCCCTCATGTTAGAGAAGAAGAAACATTCCCAGAGAGGCGAAGTGACTGGCTCAAAGATTACACAGTAACAGGCCAGAGCTGACTGTCAGTACAGGCTTTTTTTCCCTTCATCTTTCCACTTTCTCTATTGCTTCATCCGGCTGCAGGGGAATGCCACAGCCCAGCTGTGATACAACACAGAAAGAACTGTGTCCCTAAGTTCCAACTTGCCTAGTGGAATCCTCTCCACTGTAGAGAGGTGGAG…….. 19,000 base pairs omitted!!!
Genes
• Definition varies
  – To a geneticist, a gene is the region of a chromosome that confers a particular trait
  – To a molecular biologist, a gene is a sequence of DNA that encodes a protein or RNA and includes all of the relevant regulatory sequences
• Genes can be turned on/off, up/down
• Not all parts of the DNA sequence are involved with the production of proteins
Gene Structure
Locate Genes
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 2.00 Prom +   5833   5872   40                             -14.22
 2.01 Init +   6023   6620  598  1  1   57   87   371 0.621  27.34
 2.02 Intr +   7157   7271  115  0  1  122   41    76 0.997   5.81
 2.03 Intr +   7420   7550  131  1  2   83    9   151 0.979   7.04
 2.04 Intr +   8510   8715  206  0  2  123   91    98 0.826  12.52
 2.05 Intr +   9142   9339  198  0  0   69   39   276 0.998  20.35
 2.06 Intr +  10541  10669  129  1  0   89   78   131 0.992  13.09
 2.07 Intr +  10819  11007  189  0  0  137   87   125 0.999  17.08
 2.08 Intr +  11567  11740  174  1  0  101   80   233 0.966  23.94
 2.09 Intr +  11984  12146  163  1  1  103   78   108 0.999  10.85
 2.10 Intr +  12455  12591  137  0  2  101   94   127 0.999  14.89
 2.11 Intr +  13874  14050  177  1  0  113   16   188 0.043  14.22
 2.12 Intr +  16570  16717  148  0  1   98   64   116 0.984  10.01
 2.13 Intr +  16876  16987  112  2  1  103  115   134 0.999  17.04
 2.14 Intr +  17396  17525  130  2  1  100   98   174 0.999  20.30
 2.15 Intr +  17924  18128  205  1  1   81   55   295 0.999  24.27
 2.16 Term +  18612  18700   89  1  2   96   39   148 0.999   8.52
 2.17 PlyA +  18919  18924    6                               1.05
Codons
Alanine, Arginine, Aspartic Acid, Asparagine, Cysteine, Glutamic Acid, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine
Conversion to Protein
Predict Protein Sequence>18:21:37|GENSCAN_predicted_peptide_2|966_aaMASSRCPAPRGCRCLPGASLAWLGTVLLLLADWVLLRTALPRIFSLLVPTALPLLRVWAVGLSRWAVLWLGACGVLRATVGSKSENAGAQGWLAALKPLAAALGLALPGLALFRELISWGAPGSADSTRLLHWGSHPTAFVVSYAAALPAAALWHKLGSLWVPGGQGGSGNPVRRLLGCLGSETRRLSLFLVLVVLSSLGEMAIPFFTGRLTDWILQDGSADTFTRNLTLMSILTIASAVLEFVGDGIYNNTMGHVHSHLQGEVFGAVLRQETEFFQQNQTGNIMSRVTEDTSTLSDSLSENLSLFLWYLVRGLCLLGIMLWGSVSLTMVTLITLPLLFLLPKKVGKWYQLLEVQVRESLAKSSQVAIEALSAMPTVRSFANEEGEAQKFREKLQEIKTLNQKEAVAYAVNSWTTSISGMLLKVGILYIGGQLVTSGAVSSGNLVTFVLYQMQFTQAVEVLLSIYPRVQKAVGSSEKIFEYLDRTPRCPPSGLLTPLHLEGLVQFQDVSFAYPNRPDVLVLQGLTFTLRPGEVTALVGPNGSGKSTVAALLQNLYQPTGGQLLLDGKPLPQYEHRYLHRQVAAVGQEPQVFGRSLQENIAYGLTQKPTMEEITAAAVKSGAHSFISGLPQGYDTEVDEAGSQLSGGQRQAVALARALIRKPCVLILDDATSALDANSQLQVEQLLYESPERYSRSVLLITQHLSLVEQADHILFLEGGAIREGGTHQQLMEKKGCYWAMPTEFFQSLGGDGERNVQIEMAHGTTTLAFKFQHGVIAAVDSRASAGSYISALRVNKVIEINPYLLGTMSGCAADCQYWERLLAKECRLYYLRNGERISVSAASKLLSNMMCQYRGMGLSMGSMICGWDKKGPGLYYVDEHGTRLSGNMFSTGSGNTYAYGVMDSGYRPNLSPEEAYDLGRRAIAYATHRDSYSGGVVNMYHMKEDGWVKVESTDVSDLLHQYREANQ
hello.exe011110100100101111111110000001111110010101000011110100010001101010100011101110101101110100000010101100011010011101001110001010011010000001000100011101
100010001001011110010100111000101101101000010101010111001101000110000000101101001011111100000111011111001011100110011010101111111010101011111101111010
001010101000010101011100011110110110001010010110101011001111100010010110100110110110100011001000111101001111011110110101101111111110101010110011010011
001011011010101001010000011011101101100010110100001110110011001001000101101111001010111001101000011000110001110111011111111001101110000100110100011001
111010000011111011000010011010011110010010001100101110111001000100100011000001000111101100110100110101001111000111001100000010000010000101110110010110
111011100110110100010000010100001001001010001100010010000010101000000110111011111111111010110110000001010001011110000101011011101101000011000110010101
011001110001001100100101111100110001010000011000010010110011010000111011110100000011000110111000000010000101010011101101010010111000011110101100101101
111010001101011111011100000001011101010101000101110101011110000011001010000011001000000101011000101100011101101010100011111100010000010001001100011100
101111000101000101100110010001100100110010011010100100111001010001010000101000011001101110100001010001010010010111000101110010100000111100100011111100
001100110010001010001100000100100111111100001110001010111000110000011100011100001101100010000111111001111011111101001011000100100100011001011011111111
100011000000111101010000110010110110001011101110101111000001111101111110111001011010110111001010000101001010101001100100000111010001101100100101101001
101111111011000101011111010010011001010110111100111111011100100010000101110100000100111110001001111010110101000001000111101110001100100011010011100111
011101100101100001111001110111010010101001111111001011111101100101110110111000000100101100110110100010001001011011101011001100001000000110100010011001
000010110110111111101000011000010001010000111000010001111001000111111011110001011100100011010111001011101000111011101110000100001011000011010110001110
111001100101111010110111001010110010010010001010100010001100010000001001110101100011011010100111000010001010000011101011111110111010010101010110001101
010111110010011101010111010011110110000101011110001111000101010011101110101001101100011010001000000101001010011001000110101001010011010110010111100101
001110100001010101010111000100000001101110111000010001101001111111100010100010011111001000001101001110101000110010110010001010100100011111100010111010
101011101100111010101010101001001000110111110011001100100101001011000101100111000110110011100101100111010001010101010100111101011100011010011001101100
000001100100001000010110001100111001101010001001001110101100110010011101001101001011110110100111000110011001100001011011000111011111000000011010010100
111011000001101001001101010101000110110110100011101011001000100111100111111111010000000011100010100110010110010111110011011010100000111000001010010010
000110000101101110111010110010011010001111100110001000100001011011001000101011101111010111111100001111110100111001001000101100110100000100001100110101
101011010110100010111011111010101100101001110100101100011010001111110110001000110100111010110010100010010001100100110000101111001100100011000100010010
Find Similar Proteins
gi|549042|sp|Q03518|TAP1_HUMAN ANTIGEN PEPTIDE TRANSPORTER ...  1221  0.0
gi|2506117|sp|P36370|TAP1_RAT ANTIGEN PEPTIDE TRANSPORTER 1...   831  0.0
gi|2506116|sp|P21958|TAP1_MOUSE ANTIGEN PEPTIDE TRANSPORTER...   827  0.0
gi|1172602|sp|P28062|PRCY_HUMAN PROTEASOME COMPONENT C13 PR...   455  e-127
Sequence Alignment
• Sequences of genes and proteins are compared to infer
  – Structural, functional, and evolutionary relationships between the sequences
• Over time evolution causes
  – Substitutions, which change residues in a sequence
  – Insertions/deletions, which add or remove residues (gaps)
Sequence Alignment
• Aligning two sequences is the cornerstone of Bioinformatics
• What are we looking for when aligning sequences?
  – Identity: Two sequences that have a certain number of identical residues at aligned positions
  – Similarity: Often a number of positions will be replaced by ones of similar chemical properties
  – Homology: Two sequences that are evolutionarily related and stem from a common ancestor
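Identity can be turned into a number once two sequences have been aligned. The sketch below is illustrative only: the function name `percent_identity` is hypothetical, and dividing by the alignment length is one assumed convention among several in use.

```python
def percent_identity(row1, row2):
    """Fraction of aligned columns that hold the same residue in both rows.

    Assumption: gaps ('-') never count as identical, and the denominator
    is the full alignment length.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    same = sum(1 for a, b in zip(row1, row2) if a == b and a != "-")
    return same / len(row1)


# Two aligned rows built from the sequences HEAGAWGHEE and PAWHEAE.
print(percent_identity("HEAGAWGHE-E", "-PA--W-HEAE"))
```

Here five of the eleven columns (A, W, H, E, E) are identical, so the function returns 5/11.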
Possible Alignments

HEAGAWGHE-E
-PA--W-HEAE

HEAGAWGHE-E
---PAW-HEAE

HEAGAWGHE-E
--P-AW-HEAE

[Figure labels: substitution, insertion, deletion]
How Do We Choose?
• Obviously there are many alignments; how do we choose?
• The biologists provide a scoring mechanism which can be used to determine which alignment is best
• A simple scoring mechanism:
  – 0 for a match
  – 1 for a mismatch or gap
  – Lowest score is the best
Alignments With Score

HEAGAWGHE-E
-PA--W-HEAE

Score: 6
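The simple scoring mechanism (0 for a match, 1 for a mismatch or gap) can be sketched in a few lines. The function name `alignment_score` is an assumption made for illustration; it takes the two rows of an alignment as equal-length strings with "-" marking gaps.

```python
def alignment_score(row1, row2):
    """Score an alignment: 0 per matching column, 1 per mismatch or gap.

    Lower scores are better under this scheme.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return sum(0 if (a == b and a != "-") else 1 for a, b in zip(row1, row2))


# The alignment of HEAGAWGHEE with PAWHEAE shown on this slide.
print(alignment_score("HEAGAWGHE-E", "-PA--W-HEAE"))  # prints 6
```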
Computing The Result
• We know how to generate different alignments
• We know how to score alignments
• One algorithm might be to generate all possible alignments and choose the one with the best score
  – Not feasible!!
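To see why exhaustive enumeration is not feasible, one can count the alignments instead of generating them. The sketch below assumes that an alignment is built from three kinds of steps (match/mismatch, gap in one sequence, gap in the other), so the count for two prefixes is the sum of the counts for the three shorter prefix pairs. The function name `count_alignments` is hypothetical.

```python
def count_alignments(m, n):
    """Number of distinct alignments of a length-m and a length-n sequence.

    Each alignment ends with a diagonal, across, or down step, so the
    count at (i, j) is the sum of the counts at the three predecessors.
    """
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (counts[i - 1][j - 1]
                            + counts[i - 1][j]
                            + counts[i][j - 1])
    return counts[m][n]


for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length))
```

Two sequences of length 10 already have more than eight million alignments, and the count keeps growing exponentially with length.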
Alignment
• At every step in the process of aligning two sequences, S1 and S2, you have to make one of three decisions
  – Match/mismatch
  – Add gap to S1
  – Add gap to S2
• For example, when aligning ABC with BC

  ABC    -ABC    ABC
  BC     BC      -BC
Using An Array
• An array could be used to keep track of the possible moves
  – Diagonal: match/mismatch
  – Across: gap in sequence 2
  – Down: gap in sequence 1
Example
[Figure, built up over ten slides: a grid with the letters of ABC across the top and the letters of BC down the side. Each cell lists every alignment of the two prefixes that meet at that cell, with its score in parentheses. The number of alignments per cell grows quickly: the cell for A and B holds three alignments, the cell for AB and B holds five.]

Alignments in the cell for A (across) and B (down):

A-     A      -A
-B     B      B-
(2)    (1)    (2)
Dynamic Programming
• The word Programming in the name has nothing to do with writing computer programs.
  – Mathematicians use the word to describe a set of rules which anyone can follow to solve a problem.
  – They do not have to be written in a computer language.
• Dynamic programming was the brainchild of an American mathematician, Richard Bellman
  – It stores the results for small sub-problems and looks them up, rather than recomputing them, when they are needed later to solve larger sub-problems
Fib(6)
[Figure: recursion tree for Fib(6). The same sub-problems, such as Fib(2), Fib(3), and Fib(4), appear in several branches and are recomputed each time.]
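The Fib(6) tree can be made concrete with a small comparison between plain recursion and the store-and-look-up approach of dynamic programming. This is a minimal sketch; the function names and the call counter are illustrative choices, not part of the slides.

```python
def fib_naive(n, counter):
    """Plain recursion: every sub-problem is recomputed each time it is needed."""
    counter[0] += 1  # count how many calls are made in total
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_memo(n, table=None):
    """Dynamic programming: store each result once and look it up afterwards."""
    if table is None:
        table = {0: 0, 1: 1}
    if n not in table:
        table[n] = fib_memo(n - 1, table) + fib_memo(n - 2, table)
    return table[n]


calls = [0]
print(fib_naive(6, calls), "computed with", calls[0], "calls")
print(fib_memo(6), "computed with one stored result per sub-problem")
```

For Fib(6) the plain recursion makes 25 calls, while the stored table only ever holds the seven values Fib(0) through Fib(6).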
Keep Only the Best Match
• There is no need to remember every possible extension
• We only need to keep the best one
  – There may be ties, which is okay
• How do you determine which one is the best?
  – Ask a Biologist!!!!
Example
    -  A  B  C
 -  0  1  2  3
 B  1  1
 C  2

A-          A           -A
-B          B           B-
Score==2    Score==1    Score==2
Example
    -  A  B  C
 -  0  1  2  3
 B  1  1  1
 C  2

AB-         AB          AB
--B         -B          B-
Score==3    Score==1    Score==2
Example
    -  A  B  C
 -  0  1  2  3
 B  1  1  1  2
 C  2

ABC-        ABC         ABC
---B        --B         -B-
Score==4    Score==3    Score==2
Example
    -  A  B  C
 -  0  1  2  3
 B  1  1  1  2
 C  2  2  2  1

Note:
Down or across always adds one to the score
Diagonal will add either one or a zero depending on whether or not the bases match

ABC
-BC
Score==1
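The matrix for ABC and BC can be reproduced by keeping only the best (lowest) score in each cell. The sketch below assumes the simple scoring used in these examples: down or across adds one, and a diagonal step adds zero for a match or one for a mismatch. The function name `fill_cost_matrix` is hypothetical.

```python
def fill_cost_matrix(across, down):
    """Fill the score matrix, keeping the lowest score reachable in each cell."""
    rows, cols = len(down) + 1, len(across) + 1
    m = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        m[0][j] = j  # first row: only across moves, one gap each
    for i in range(rows):
        m[i][0] = i  # first column: only down moves, one gap each
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = 0 if down[i - 1] == across[j - 1] else 1
            m[i][j] = min(m[i - 1][j - 1] + mismatch,  # diagonal
                          m[i - 1][j] + 1,             # down
                          m[i][j - 1] + 1)             # across
    return m


for label, row in zip("-BC", fill_cost_matrix("ABC", "BC")):
    print(label, *row)
```

The bottom-right cell holds 1, the score of the best alignment of ABC with BC.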
Dynamic Programming
• Three steps
  – Initialization
  – Matrix Fill
  – Traceback
• These steps apply in general whether you are doing sequence alignment or folding predictions
Recurrence
• A mathematical relationship that defines f_n as some combination of f_i with i < n
• For our next alignment we will use the recurrence

$$M_{i,j} = \max\bigl( M_{i-1,j-1} + S_{i,j},\; M_{i,j-1} + w,\; M_{i-1,j} + w \bigr)$$

  – The M_{i,j-1} + w term is a gap in sequence 1; the M_{i-1,j} + w term is a gap in sequence 2
  – Where
    • S_{i,j} = 1 if match, 0 if mismatch
    • w = 0 (gap penalty)
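The recurrence translates directly into a matrix fill. The sketch below uses the two sequences that label the matrices on the following slides, GAATTCAGTTA across and GGATCGA down, with a match score of 1, a mismatch score of 0, and a gap penalty w of 0. The function name `fill_matrix` is an assumption; the all-zero first row and column are only appropriate because w is 0 here.

```python
def fill_matrix(across, down, w=0):
    """Fill M using M[i][j] = max(diagonal + S, left + w, up + w)."""
    rows, cols = len(down) + 1, len(across) + 1
    m = [[0] * cols for _ in range(rows)]  # initialization: first row/column are 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = 1 if down[i - 1] == across[j - 1] else 0
            m[i][j] = max(m[i - 1][j - 1] + s,  # match/mismatch
                          m[i][j - 1] + w,      # gap (across move)
                          m[i - 1][j] + w)      # gap (down move)
    return m


matrix = fill_matrix("GAATTCAGTTA", "GGATCGA")
for label, row in zip("-GGATCGA", matrix):
    print(label, *row)
```

The bottom-right value, 6, is the score of the best alignment of the two sequences.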
Initialization
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0
G 0
A 0
T 0
C 0
G 0
A 0
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1
G 0
A 0
T 0
C 0
G 0
A 0
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1
G 0 1
A 0 1
T 0 1
C 0 1
G 0 1
A 0 1
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1
G 0 1 1
A 0 1
T 0 1
C 0 1
G 0 1
A 0 1
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1
G 0 1 1
A 0 1 2
T 0 1 2
C 0 1 2
G 0 1 2
A 0 1 2
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1
G 0 1 1 1
A 0 1 2 2
T 0 1 2 2
C 0 1 2 2
G 0 1 2 2
A 0 1 2 3
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1
G 0 1 1 1 1
A 0 1 2 2 2
T 0 1 2 2 3
C 0 1 2 2 3
G 0 1 2 2 3
A 0 1 2 3 3
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1
G 0 1 1 1 1 1
A 0 1 2 2 2 2
T 0 1 2 2 3 3
C 0 1 2 2 3 3
G 0 1 2 2 3 3
A 0 1 2 3 3 3
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1
G 0 1 1 1 1 1 1
A 0 1 2 2 2 2 2
T 0 1 2 2 3 3 3
C 0 1 2 2 3 3 4
G 0 1 2 2 3 3 4
A 0 1 2 3 3 3 4
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1
A 0 1 2 2 2 2 2 2
T 0 1 2 2 3 3 3 3
C 0 1 2 2 3 3 4 4
G 0 1 2 2 3 3 4 4
A 0 1 2 3 3 3 4 5
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2
A 0 1 2 2 2 2 2 2 2
T 0 1 2 2 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4
G 0 1 2 2 3 3 4 4 5
A 0 1 2 3 3 3 4 5 5
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2
A 0 1 2 2 2 2 2 2 2 2
T 0 1 2 2 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5
A 0 1 2 3 3 3 4 5 5 5
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2
T 0 1 2 2 3 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5 5
A 0 1 2 3 3 3 4 5 5 5 5
Matrix Fill
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5 5 5
A 0 1 2 3 3 3 4 5 5 5 5 6
Traceback
• We now know the score of the alignment
• Need to work back to find the actual alignment
  – Look at possible predecessors
  – Pick a valid one
  – Repeat until at position 0,0
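The traceback steps can be sketched as follows. This is one possible implementation under stated assumptions: it refills the matrix with match = 1, mismatch = 0, and gap penalty 0, then walks back from the bottom-right cell, preferring a diagonal step on a match, then an across step, then a down step. The function name `align` and the tie-breaking order are illustrative choices, so the alignment it returns is only one of several with the same score.

```python
def align(across, down):
    """Fill the matrix, then trace back from the last cell to position 0,0."""
    rows, cols = len(down) + 1, len(across) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = 1 if down[i - 1] == across[j - 1] else 0
            m[i][j] = max(m[i - 1][j - 1] + s, m[i][j - 1], m[i - 1][j])

    top, bottom = [], []
    i, j = len(down), len(across)
    while i > 0 or j > 0:
        is_match = i > 0 and j > 0 and down[i - 1] == across[j - 1]
        if is_match and m[i][j] == m[i - 1][j - 1] + 1:
            # diagonal predecessor: the two letters are aligned
            top.append(across[j - 1])
            bottom.append(down[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and m[i][j] == m[i][j - 1]:
            # across predecessor: gap in the sequence written down the side
            top.append(across[j - 1])
            bottom.append("_")
            j -= 1
        else:
            # down predecessor: gap in the sequence written across the top
            top.append("_")
            bottom.append(down[i - 1])
            i -= 1
    return m[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, row1, row2 = align("GAATTCAGTTA", "GGATCGA")
print(score)
print(row1)
print(row2)
```

Removing the "_" characters from each returned row gives back the original sequences, and the number of columns with identical letters equals the score.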
Traceback
G A A T T C A G T T A
0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5 5 5
A 0 1 2 3 3 3 4 5 5 5 5 6
Traceback
[Animation over twelve slides: starting at the bottom-right cell (score 6), the traceback moves one cell at a time to a valid predecessor until it reaches position 0,0. The scores on the chosen path, listed by row:]
  0
G 1
G 1
A 2 2
T 3
C 4 4
G 5 5 5
A 6
An Answer
• So one possible alignment is:
G _ A A T T C A G T T A
G G _ A _ T C _ G _ _ A
• There are other possible answers
RNA Folding
• It is not the number of genes that matters, it is how we use them
  – RNA editing
  – Splicing
  – RNA Folding
• Predicting how RNA folds is extremely computationally intensive
• Many algorithms have been written to do this
• Want to look at ways to do this on high performance computing platforms
RNA Secondary Structure
[Figure: an RNA sequence drawn as a folded secondary structure, with the 5′ and 3′ ends labeled.]
Dynalign (a 4-D Dynamic Programming Algorithm):
• Zuker Algorithm for Secondary Structure Prediction (2D dynamic programming algorithm)
• Algorithm for Sequence Alignment (2D dynamic programming algorithm)
• Simultaneously finds the sequence alignment and thermodynamically favorable common secondary structure.

Sankoff, D. "Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems". SIAM Journal on Applied Mathematics. 45: 810-825 (1985)
Mathews & Turner. Journal of Molecular Biology. 317: 191-203 (2002)
Inputs, Optimization, and Outputs:
Input: Sequence A, Sequence B
Optimization (minimize ΔG°total):

$$\Delta G^{\circ}_{\text{total}} = \Delta G^{\circ}_{\text{sequence A}} + \Delta G^{\circ}_{\text{sequence B}} + (\Delta G^{\circ}_{\text{gap}})(\text{number of gaps})$$

Output: Sequence Alignment, Structure of A, Structure of B,
where each BP in A must be homologous to a BP in B
V Array
V(i,j,k,l) = min
A Total of 16 Energy Equations
Why V?
• For any set of pairs i to j and k to l, the lowest free energy is – V(i,j,k,l) + V(j,I+N,l,N2)
Traceback
Simplifications to Make Calculation Tractable:
Two sequences at most are considered.
The maximum distance allowed between aligned nucleotides in the two sequences is restricted by a parameter, M.
Calculation time scales as O(M^3 N^3).
Storage increases proportionally to M^2 N^2, where N is the length of the shorter of the two sequences.
(*Pentium III 600 MHz, 512 MB RAM.)
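A quick calculation shows what these scaling rules mean in practice. The sketch below assumes that the O(M^3 N^3) term describes running time and the M^2 N^2 term describes storage; it reports only ratios relative to a baseline, because the constant factors are not given. The function name `relative_cost` is hypothetical.

```python
def relative_cost(n, m, n0, m0):
    """Return (time ratio, storage ratio) of problem (n, m) versus baseline (n0, m0).

    Assumes time grows as M^3 * N^3 and storage as M^2 * N^2.
    """
    time_ratio = (m ** 3 * n ** 3) / (m0 ** 3 * n0 ** 3)
    storage_ratio = (m ** 2 * n ** 2) / (m0 ** 2 * n0 ** 2)
    return time_ratio, storage_ratio


# Doubling the sequence length N while keeping M fixed.
print(relative_cost(200, 75, 100, 75))  # prints (8.0, 4.0)
```

Under these assumptions, doubling N costs eight times the running time and four times the storage, which is why both M and N are kept small.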
Parallelization
• Reduce array sizes
  – Arrays are sparse
  – Pre-filter canonical pairs as unimportant
    • N=388, M=75, less than 4GB
• Fill routine can be easily parallelized
  – Distributed memory means larger sequences
• Traceback is a problem