Upload
dena
View
34
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Pairwise sequence comparison. Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington [email protected]. One-minute responses. Need more clarification on the alignment of sequences. - PowerPoint PPT Presentation
Citation preview
Pairwise sequence comparison
Prof. William Stafford NobleDepartment of Genome Sciences
Department of Computer Science and EngineeringUniversity of Washington
One-minute responses• Need more clarification on the alignment
of sequences.• More precise definitions, included on
slides.• Python programming very useful.• A lot to learn at once.• I did not get the Python part, but it’s easy
when I try it myself.• Please write on the board sometimes.• I liked the part about evolutionary theory.• I understood about 50% of the lecture.• You speak too fast.• I did not understand Moore’s law.• I liked the biology video and the analogy
with Moore’s law.• The Python code is different from what
we are used to.• Give a practical example of BLAST usage.
• More explanation on sys.argv.• I like the summary of the previous lecture.• Sometimes you talk a long time before
asking for questions.• Biology of cells and genomes wasn’t clear
enough.• Take time explaining the biology concepts.• The last part of the lecture was not clear.• Give more examples.• The class could go a bit slower on difficult
concepts.• It is a good habit to accept comments.• We may forget what you are saying
because we are not taking notes.• I didn’t understand the mechanism by
which we can physically read the DNA sequence.– This topic is outside the scope of this class
to cover. If you’d like to read about this, check out “Overview of DNA sequencing strategies.”
Other questions
• How many assignments will we do?
• Will there be a test?• Is there a lot of
mathematics here, or just Python and biology?
• Are theoretical understanding of genetics and biochemistry vital for studying bioinformatics?
• Are we going to write this feedback every day?
• How can I make a dictionary from a .txt file?
• When you compare two protein sequences, how do you guess where the first codon begins?
• What is evolution? Is it dogma?
Outline
• Responses from last class• Sequence alignment– Motivation– Scoring alignments
• Python
Revision
• What is the difference between the DNA and RNA alphabets?– In RNA, “T” is changed to “U.”
• What is a codon, and why is it significant?– A set of three adjacent RNA nucleotides. One codon codes for
a single amino acid.• What is the universal genetic code?– A set of rules for translating codons into amino acids.
• What is the purpose of aligning two DNA or protein sequences?– To infer common ancestry, function or structure.
Sequence comparison overview• Problem: Find the “best” alignment between a query
sequence and a target sequence.• To solve this problem, we need– a method for scoring alignments, and– an algorithm for finding the alignment with the best score.
• The alignment score is calculated using– a substitution matrix, and– gap penalties.
• The algorithm for finding the best alignment is dynamic programming.
A simple alignment problem.
• Problem: find the best pairwise alignment of GAATC and CATAC.
Scoring alignments
• We need a way to measure the quality of a candidate alignment.
• Alignment scores consist of two parts: a substitution matrix, and a gap penalty.
GAATCCATAC
GAATC-CA-TAC
GAAT-CC-ATAC
GAAT-CCA-TAC
-GAAT-CC-A-TAC
GA-ATCCATA-C
Scoring aligned bases
A C G TA 10 -5 0 -5C -5 10 -5 0G 0 -5 10 -5T -5 0 -5 10
A hypothetical substitution matrix:
GAATC | |CATAC
-5 + 10 + -5 + -5 + 10 = 5
A R N D C Q E G H I L K M F P S T W Y V B Z XA 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1
BLOSUM62
• Linear gap penalty: every gap character receives a score of d.
• Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e.
Scoring gaps
GAAT-C d=-4CA-TAC
-5 + 10 + -4 + 10 + -4 + 10 = 17
G--AATC d=-4CATA--C e=-1
-5 + -4 + -1 + 10 + -4 + -1 + 10 = 5
A simple alignment problem.
• Problem: find the best pairwise alignment of GAATC and CATAC.
• Use a linear gap penalty of -4.• Use the following substitution matrix:
A C G TA 10 -5 0 -5C -5 10 -5 0G 0 -5 10 -5T -5 0 -5 10
How many possibilities?
• How many different alignments of two sequences of length n exist?
GAATCCATAC
GAATC-CA-TAC
GAAT-CC-ATAC
GAAT-CCA-TAC
-GAAT-CC-A-TAC
GA-ATCCATA-C
Too many to enumerate!
DP matrix
G A A T C
C
A
T
A
C
-8
The value in position (i,j) is the score of the best alignment of the first i
positions of the first sequence versus the first j positions of the second
sequence.
-G-CAT
DP matrix
G A A T C
C
A
T -8 -12A
C
Moving horizontally in the matrix introduces a
gap in the sequence along the left edge.
-G-ACAT-
DP matrix
G A A T C
C
A
T -8
A -12C
Moving vertically in the matrix introduces a gap in the sequence along
the top edge.
-G--CATA
Initialization
G A A T C
0
C
A
T
A
C
Introducing a gap
G A A T C
0 -4
C
A
T
A
C
G-
DP matrix
G A A T C
0 -4
C -4
A
T
A
C
-C
DP matrix
G A A T C
0 -4
C -4 -8
A
T
A
C
DP matrix
G A A T C
0 -4
C -4 -5
A
T
A
C
GC
Three legal moves
• A diagonal move aligns a character from the left sequence with a character from the top sequence.
• A vertical move introduces a gap in the sequence along the top edge.
• A horizontal move introduces a gap in the sequence along the left edge.
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8
T -12
A -16
C -20
-----CATAC
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8 ?
T -12
A -16
C -20
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8 -4
T -12
A -16
C -20
-4
0 -4
-GCA
G-CA
--GCA-
-4 -9 -12
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8 -4
T -12 ?
A -16 ?
C -20 ?
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5
A -8 -4
T -12 -8
A -16 -12
C -20 -16
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 ?
A -8 -4 ?
T -12 -8 ?
A -16 -12 ?
C -20 -16 ?
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 -9
A -8 -4 5
T -12 -8 1
A -16 -12 2
C -20 -16 -2
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 -9
A -8 -4 5
T -12 -8 1
A -16 -12 2
C -20 -16 -2 ?
Find the optimal alignment, and its
score.
DP matrix
G A A T C
0 -4 -8 -12 -16 -20
C -4 -5 -9 -13 -12 -6
A -8 -4 5 1 -3 -7
T -12 -8 1 0 11 7
A -16 -12 2 11 7 6
C -20 -16 -2 7 11 17
DP in equation form
• Align sequence x and y.• F is the DP matrix; s is the substitution matrix;
d is the linear gap penalty.
djiF
djiF
yxsjiF
jiF
F
ji
1,
,1
,1,1
max,
00,0
DP in equation form
1,1 jiF
jiF , jiF ,1
1, jiF
d
d ji yxs ,
Summary
• Scoring a pairwise alignment requires a substition matrix and gap penalties.
• Dynamic programming is an efficient algorithm for finding the optimal alignment.
• Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions.
• DP iteratively fills in the matrix using a simple mathematical rule.
A simple exampleA C G T
A 2 -7 -5 -7
C -7 2 -7 -5
G -5 -7 2 -7
T -7 -5 -7 2
A A G
A
G
C 1,1 jiF
jiF , jiF ,1
1, jiF
d
d ji yxs ,
Find the optimal alignment of AAG and AGC.Use a gap penalty of d=-5.
Some useful Python tidbits
sys.argv
• sys.argv is a list containing the strings given on the command line
• Write a program that adds up all of the numbers on the command line.
> add-numbers.py 1 2 36
sys.stdout versus sys.stderr
• sys.stdout and sys.stderr are two different streams to print to– Use sys.stdout for the primary output of your
program.– Use sys.stderr to report errors or give status
updates.
write() versus print()
• The write() function differs from print()– write() does not automatically add an end-of-line.– write() requires that you specify the name of the
file or stream to be written to.– write() only accepts a single string.
%
• The “%” operator substitutes values into a string based on the presence of format strings.– %s = string– %d = integer– %g = float
• Place “%” between a string and a tuple of values.
>>> "%s %s %s" % ("larry", "curly", "moe")'larry curly moe’>>> "%d + %d" % (21, 15)'21 + 15'
Using sys.stderr• Write a program to divide one number by a second number.> divide.py 8 32.66667
• Print the usage message if exactly two numbers are not given.> ./divide.pyUSAGE: divide.py <value1> <value2>
Divide <value1> by <value2> and report the result.
• Stop and print an error if the second number is zero.> divide.py 8 0Divide by zero error.
Using write() and %
• Write a program that prints the command line arguments with “+” signs between, and then reports their sum.
> add-numbers2.py 1 2 31 + 2 + 3 = 6
One-minute response
At the end of each class• Write for about one minute.• Provide feedback about the class.• Was part of the lecture unclear?• What did you like about the class?• Do you have unanswered questions?• Sign your nameI will begin the next class by responding to the one-minute responses