Sequence Alignments. DNA Replication Prior to cell division, all the genetic instructions must be copied so that each new cell will have a complete set

  • View
    214

  • Download
    0

Embed Size (px)

Text of Sequence Alignments. DNA Replication Prior to cell division, all the genetic instructions must be...

  • Slide 1

Sequence Alignments Slide 2 DNA Replication Prior to cell division, all the genetic instructions must be copied so that each new cell will have a complete set DNA polymerase is the enzyme that copies DNA Reads the old strand in the 3 to 5 direction Slide 3 Over time, genes accumulate mutations Environmental factors Radiation Oxidation Mistakes in replication or repair Deletions, Duplications Insertions Inversions Point mutations Slide 4 Codon deletion: ACG ATA GCG TAT GTA TAG CCG Effect depends on the protein, position, etc. Almost always deleterious Sometimes lethal Frame shift mutation: ACG ATA GCG TAT GTA TAG CCG ACG ATA GCG ATG TAT AGC CG? Almost always lethal Deletions Slide 5 Why align sequences? The draft human genome is available Automated gene finding is possible Gene: AGTACGTATCGTATAGCGTAA What does it do? One approach: Is there a similar gene in another species? Align sequences with known genes Find the gene with the best match Slide 6 Are there other sequences like this one? 1) Huge public databases - GenBank, Swissprot, etc. 2) Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes 3) Similarity searching is based on alignment 4) BLAST and FASTA provide rapid similarity searching a. rapid = approximate (heuristic) b. false + and - scores Slide 7 Similarity Homology 1) 25% similarity 100 AAs is strong evidence for homology 2) Homology is an evolutionary statement which means descent from a common ancestor common 3D structure usually common function homology is all or nothing, you cannot say "50% homologous" Slide 8 Global vs Local similarity 1) Global similarity uses complete aligned sequences - total % matches GCG GAP program, Needleman & Wunch algorithm 2) Local similarity looks for best internal matching region between 2 sequences GCG BESTFIT program, Smith-Waterman algorithm, BLAST and FASTA 3) dynamic programming optimal computer solution, not approximate Slide 9 Search with Protein, not DNA Sequences 1) 4 DNA bases vs. 20 amino acids - less chance similarity 2) can have varying degrees of similarity between different AAs - # of mutations, chemical similarity, PAM matrix 3) protein databanks are much smaller than DNA databanks Slide 10 Similarity is Based on Dot Plots 1) two sequences on vertical and horizontal axes of graph 2) put dots wherever there is a match 3) diagonal line is region of identity (local alignment) 4) apply a window filter - look at a group of bases, must meet % identity to get a dot Slide 11 Simple Dot Plot Slide 12 Dot plot filtered with 4 base window and 75% identity Slide 13 Dot plot of real data Slide 14 Indels Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known: ACGTCTGATACGCCGTATCGTCTATCT ACGTCTGAT---CCGTATCGTCTATCT Slide 15 The Genetic Code Substitutions Substitutions are mutations accepted by natural selection. Synonymous: CGC CGA Non-synonymous: GAU GAA Slide 16 Comparing two sequences Point mutations, easy: ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGATTCGCCCTATCGTCTATCT Indels are difficult, must align sequences: ACGTCTGATACGCCGTATAGTCTATCT CTGATTCGCATCGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT ----CTGATTCGC---ATCGTCTATCT Slide 17 Scoring a sequence alignment Match score:+1 Mismatch score:+0 Gap penalty:1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 (+1) Mismatches: 2 0 Gaps: 7 ( 1) Score = +11 Slide 18 Origination and length penalties We want to find alignments that are evolutionarily likely. Which of the following alignments seems more likely to you? ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGAT-------ATAGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT AC-T-TGA--CG-CGT-TA-TCTATCT We can achieve this by penalizing more for a new gap, than for extending an existing gap Slide 19 Scoring a sequence alignment (2) Match/mismatch score:+1/+0 Origination/length penalty:2/1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 (+1) Mismatches: 2 0 Origination: 2 (2) Length: 7 (1) Score = +7 Slide 20 Scoring Similarity 1) Can only score aligned sequences 2) DNA is usually scored as identical or not 3) modified scoring for gaps - single vs. multiple base gaps (gap extension) 4) AAs have varying degrees of similarity a. # of mutations to convert one to another b. chemical similarity c. observed mutation frequencies 5) PAM matrix calculated from observed mutations in protein families Slide 21 DNA Scoring Matrix ATCG A1000 T0100 C0010 G0001 ATCG A5-4 T 5 C 5 G 5 ATCG A1-5 T-51-5 C 1-5 G-5 1 IdentityBLASTTransition/Transversion Slide 22 The PAM 250 scoring matrix Slide 23 How can we find an optimal alignment? Finding the alignment is computationally hard: ACGTCTGATACGCCGTATAGTCTATCT CTGAT---TCGCATCGTC--T-ATCT C(27,7) gap positions = ~888,000 possibilities Its possible, as long as we dont repeat our work! Dynamic programming: The Needleman & Wunsch algorithm Slide 24 What is the optimal alignment? ACTCG ACAGTAG Match: +1 Mismatch: 0 Gap: 1 Slide 25 The dynamic programming concept Suppose we are aligning: ACTCG ACAGTAG Last position choices: G+1ACTC GACAGTA G-1ACTC -ACAGTAG --1ACTCG GACAGTA Slide 26 We can use a table Suppose we are aligning: A with A A 0 A Slide 27 Needleman-Wunsch: Step 1 Each sequence along one axis Mismatch penalty multiples in first row/column 0 in [1,1] Slide 28 Needleman-Wunsch: Step 2 Vertical/Horiz. move: Score + (simple) gap penalty Diagonal move: Score + match/mismatch score Take the MAX of the three possibilities Slide 29 Needleman-Wunsch: Step 2 (contd) Fill out the rest of the table likewise Slide 30 Needleman-Wunsch: Step 2 (contd) Fill out the rest of the table likewise The optimal alignment score is calculated in the lower-right corner Slide 31 But what is the optimal alignment To reconstruct the optimal alignment, we must determine of where the MAX at each step came from Slide 32 A path corresponds to an alignment = GAP in top sequence = GAP in left sequence = ALIGN both positions One path from the previous table: Corresponding alignment (start at the end): AC--TCG ACAGTAG Score = +2 Slide 33 Semi-global alignment Suppose we are aligning: GCG GGCG Which do you prefer? G-CG-GCG GGCGGGCG Semi-global alignment allows gaps at the ends for free. Slide 34 Initialize first row and column to all 0s Allow free horizontal/vertical moves in last row and column Semi-global alignment Slide 35 Local alignment Global alignments score the entire alignment Semi-global alignments allow unscored gaps at the beginning or end of either sequence Local alignment find the best matching subsequence CGATG AAATGGA This is achieved by allowing a 4 th alternative at each position in the table: zero, if alternative neg. Smith-Waterman Algorithm (1981). Slide 36 Local alignment Mismatch = 1 this time CGATG AAATGGA Slide 37 Practice Problem Find an optimal alignment for these two sequences: GCGGTT GCGT Match: +1 Mismatch: 0 Gap: 1