DNA sequences alignment measurement Lecture 13


Page 1: DNA sequences alignment measurement

DNA sequences alignment measurement

Lecture 13

Page 2: DNA sequences alignment measurement

Introduction

• Measurement of “strength” alignment

• Nucleic acid and amino acid substitutions

• Measurement of alignment gaps

Page 3: DNA sequences alignment measurement

Measurement of aligned sequences

• When aligning sequences (DNA/AA) it is assumed that:
  – they have a common ancestor;
  – the differences between the sequences are the result of mutations;
  – important areas such as coding sequences (CDS) will be conserved, i.e. there is a bias “against” mutations in these areas;
  – there is also a bias in the types of mutation: substitutions are more likely than insertions/deletions.

• The dot plot gives a visual representation of the regions where sequences align. But how do we measure the strength of these alignments?

Page 4: DNA sequences alignment measurement

Measurement of aligned sequences

• One way is to count the mismatches: the “difference” between the sequences.
  – Hamming distance: the number of mismatching positions between strings of equal length.
      agtc
      cgta
    Distance is 2 (give another example)

• If the sequences (strings) are not of equal length then use:
  – The Levenshtein distance: the minimum number of edit operations (alter/insert/delete) required to turn one string into another:
      ag-tcc
      cgctca
    What is the Levenshtein distance?

• The latter technique has the advantage of allowing the inclusion of gaps.
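
The two distances on this slide can be sketched in Python. This is a minimal illustration rather than part of the original lecture: the function names `hamming` and `levenshtein` are chosen here, and the Levenshtein routine assumes a unit cost for each alter/insert/delete operation, using the standard dynamic-programming recurrence.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of alter/insert/delete operations turning a into b."""
    # prev[j] is the distance between the prefix of a handled so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # alter x into y (free if equal)
            ))
        prev = curr
    return prev[-1]


print(hamming("agtc", "cgta"))         # the equal-length pair from the slide
print(levenshtein("agtcc", "cgctca"))  # the second pair, with the gap removed
```

Note that the Levenshtein distance treats every operation as equally costly, which is exactly the assumption the next slide questions for DNA.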

Page 5: DNA sequences alignment measurement

Measurement of matching

• But what about the biological plausibility of these approaches to measuring “differences” between sequences (strings)?

• Differences between DNA sequences are not the same as plain string mismatches:
  – the probabilities of substitutions, insertions and deletions are not the same;
  – certain types of mutation, such as inversions, translocations and duplications, complicate the assessment of similarity; e.g. how would you treat tandem repeats or inverted repeats?

Page 6: DNA sequences alignment measurement

Nucleic acid mutations

• In sequence alignment we are trying to determine whether the differences (similarities) have occurred because:
  – of chance (random mutations);
  – the sequences had a common origin (degree of conservation).

• One approach would be to count the percentage of matches, but there is also a need to include the bias associated with possible substitutions.

• However, similarity does not necessarily imply a common ancestor, or vice versa. Zvelebil and Baum (2008 p. 74) suggest this can occur in convergent/divergent evolution.

• So the findings of alignment tests need to be put in context (bat and bird both have wings…).

Page 7: DNA sequences alignment measurement

Alignment scoring methods

• In general, alignments are given a score at each matching position; the alignment with the largest total score is optimal and is chosen. However, suboptimal alignments may also need to be considered.

• The most basic approach is to measure the percentage of similarity.

• Given that not all “changes” occur with equal chance, there is a need to develop:
  – a nucleotide substitution matrix.
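
As a rough illustration of the “percentage of similarity” approach, the sketch below counts identical columns in two already-aligned rows. The function name `percent_identity` and the choice to count gap columns (`-`) as non-matching are assumptions made here, not something the slides specify.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of aligned columns that hold the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    if not row_a:
        return 0.0
    # A column counts as a match only if both residues agree and neither is a gap.
    matches = sum(x == y and x != "-" for x, y in zip(row_a, row_b))
    return 100.0 * matches / len(row_a)


print(percent_identity("agtc", "cgta"))
print(percent_identity("ag-tcc", "cgctca"))
```

Every mismatch is weighted equally here, which is why a substitution matrix is needed once some changes are known to be more likely than others.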

Page 8: DNA sequences alignment measurement

Nucleotide scoring matrix

• It is known that certain mutations are more likely to occur than others: e.g. transitions (a<->g, c<->t) are more common than transversions (e.g. a<->c).

• However, since this difference in probability is insignificant in relation to the chance of a mutation occurring at all, the differences are mostly ignored. The following shows a typical scoring matrix for nucleotides.

[Figure: typical nucleotide scoring matrix. Adapted from Baxevanis p. 303]

Page 9: DNA sequences alignment measurement

Nucleic acid scoring matrix

• The values are based on the probability of a type of substitution occurring (expected value); this includes a nucleotide substituting with itself.

• These expected values are calculated as the ratio:
  – number of “observed changes” / number of changes “due to chance”.

• These values are obtained by examining large numbers of DNA sequences.

Page 10: DNA sequences alignment measurement

Nucleic acid scoring Matrix

• Then calculate 10 × log10(“expected value”).

• This ensures that the values for adjacent nucleotides can now be added, as opposed to being multiplied, in determining the alignment score.
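
The reason for taking the logarithm can be checked numerically. In this sketch the function name `log_odds_score` and the three ratios are made up for illustration; only the 10 × log10 form comes from the slide.

```python
import math


def log_odds_score(observed: float, by_chance: float) -> float:
    """10 * log10 of the ratio observed changes / changes due to chance."""
    return 10 * math.log10(observed / by_chance)


# Hypothetical expected values (ratios) for three adjacent aligned positions.
ratios = [2.0, 0.5, 4.0]

# Multiplying the ratios and then taking the log ...
score_from_product = 10 * math.log10(math.prod(ratios))
# ... gives the same result as adding the per-position scores.
score_from_sum = sum(10 * math.log10(r) for r in ratios)

print(math.isclose(score_from_product, score_from_sum))
```

Adding per-position scores is cheaper and numerically safer than multiplying many small ratios along a long alignment.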

Page 11: DNA sequences alignment measurement

Nucleic acid scoring matrix

• An expected value of 1 indicates that the substitution has the same chance of occurring as if it were occurring randomly.

• A value greater than 1 indicates a bias in favour of the substitution.

• A value less than 1 indicates a bias against the substitution.

• A score of 5 corresponds to what expected value?
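
Going from a matrix score back to the underlying ratio simply inverts the 10 × log10 step. The helper name `expected_value_from_score` below is an assumption for illustration.

```python
def expected_value_from_score(score: float) -> float:
    """Invert score = 10 * log10(expected value)."""
    return 10 ** (score / 10)


for score in (-3, 0, 2, 5):
    print(score, round(expected_value_from_score(score), 2))
```

A score of 0 maps back to an expected value of 1 (no bias), positive scores to values above 1 and negative scores to values below 1.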

Page 12: DNA sequences alignment measurement

Measuring Protein similarity

• Deriving a matrix for proteins is more complex because:
  – there are 20 amino acids, so there is a much larger set of possible substitutions;
  – the amino acids have properties that affect the structure and so the functionality of the protein.

• Therefore substitutions can be conserved or semi-conserved.

• Observations show that conserved substitutions, e.g. hydrophobic <-> hydrophobic, are more common than semi-conserved substitutions, e.g. hydrophilic <-> hydrophobic.

Page 13: DNA sequences alignment measurement

The PAM matrix

• The PAM matrix, developed in the 1970s by M. Dayhoff, was derived from all types of sequences, in both conserved and unconserved regions of the protein.

• The PAM (point accepted mutation) matrix is a measure of the chances of one amino acid being substituted by another.

• The PAM “1” matrix was “derived” from very similar sequences (close in terms of evolution) and represents one “accepted” point mutation per 100 residues.

• It assumes that a substitution is independent of previous mutations at that position.

• This means it can be used to derive matrices for longer “evolutionary distances”, i.e. more mutations per 100 residues.

Page 14: DNA sequences alignment measurement

The PAM matrix

• The number of times the matrix is “multiplied” by itself represents the number of mutations per 100 residues.

• So a PAM 100 represents 100 mutations per 100 residues; this of course does not mean that all 100 residues changed, since the same position can mutate more than once.

• The most common, the PAM 250, represents 250 mutations per 100 residues.

• The original PAM 1 was derived in a similar way to the nucleotide substitution matrix:
  – 10 log10(expected value)
  – expected value: the number of observed changes divided by the number of changes based on AA frequency (as if the changes occurred by chance).
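
The effect of repeatedly multiplying the matrix by itself can be seen with a toy example. The sketch below is not the Dayhoff data: it assumes a made-up alphabet of three residue types and a made-up “PAM 1” in which each residue is unchanged with probability 0.99, and the helper names `mat_mul` and `mat_power` are chosen here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, times):
    """Multiply m by itself the given number of times (identity for 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(times):
        result = mat_mul(result, m)
    return result


# Toy "PAM 1": rows are the original residue, columns the residue after one step.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam100 = mat_power(toy_pam1, 100)
# Probability that a position still shows its original residue after 100 steps.
print(f"{toy_pam100[0][0]:.2f}")
```

In this toy model almost half of the positions still show their original residue after 100 steps, which illustrates why 100 mutations per 100 residues does not mean that every residue has changed.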

Page 15: DNA sequences alignment measurement

Dayhoff PAM (250) matrix

• H <-> N has a value of 2, i.e. an expected value of 1.6 (it occurred 1.6 times more often than would be expected by chance).

• Same amino acid: the score varies, e.g. C <-> C has a high score (12), while V <-> V scores (4).

• Hydrophobic <-> hydrophobic: V <-> A (0), while V <-> I is (4).

• Hydrophilic <-> hydrophilic: K <-> T (0); K <-> R (3).

• Hydrophilic <-> hydrophobic: K <-> V (-2).

Page 16: DNA sequences alignment measurement

Blosum matrix

• Another matrix, the Blosum (Blocks Substitution) matrix, used a larger data set than the PAM.

• This allowed the Blosum to look at mutations within “blocks” of conserved sequences, as opposed to point mutations on individual sequences in both conserved and variable regions.

• The Blosum number (e.g. 62), unlike the PAM number (e.g. 120), indicates that the substitution probabilities are derived from sequences with that much similarity or conservation.

• So the Blosum 62 is obtained from blocks sharing 62% conservation.

• In addition, these results are obtained directly, as opposed to being derived as in the case of the PAM matrix.

Page 17: DNA sequences alignment measurement

The Blosum and PAM matrices

[Figure: the two matrices opposite are the Blosum 62 and PAM 120. Adapted from Zvelebil and Baum p. 82]

The following shows scores from the Blosum 62 matrix:

• Same residues: C <-> C (9); V <-> V (4)

• Hydrophobic to hydrophobic: V <-> A (0); V <-> I (3)

• Hydrophilic to hydrophilic: K <-> T (-1); K <-> R (2)

• Hydrophobic to hydrophilic: K <-> V (-2)

The trend is the same.
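
The claim that the trend is the same can be checked by putting the scores quoted on these slides side by side. The sketch below uses only the PAM 250 and Blosum 62 values listed in the slides; the dictionary layout and variable names are chosen here for illustration.

```python
# Scores as quoted on the PAM 250 and Blosum 62 slides.
pam250 = {"C-C": 12, "V-V": 4, "V-A": 0, "V-I": 4, "K-T": 0, "K-R": 3, "K-V": -2}
blosum62 = {"C-C": 9, "V-V": 4, "V-A": 0, "V-I": 3, "K-T": -1, "K-R": 2, "K-V": -2}

for pair in pam250:
    print(f"{pair}: PAM 250 = {pam250[pair]:>3}, Blosum 62 = {blosum62[pair]:>3}")

# The highest- and lowest-scoring pairs agree between the two matrices.
best = (max(pam250, key=pam250.get), max(blosum62, key=blosum62.get))
worst = (min(pam250, key=pam250.get), min(blosum62, key=blosum62.get))
largest_difference = max(abs(pam250[p] - blosum62[p]) for p in pam250)
print(best, worst, largest_difference)
```

The absolute values differ because the two matrices use different scales, but identical residues score highest and the hydrophilic/hydrophobic swap scores lowest in both.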

Page 18: DNA sequences alignment measurement

PAM and Blosum matrices

• In the PAM matrix, as the number increases so does the evolutionary distance, while it is the reverse in the Blosum matrix.

• According to Baxevanis (2003) the following shows the equivalence and most appropriate use of both matrices:
  – PAM 250 and the Blosum 45
  – PAM 160 and the Blosum 62

Page 19: DNA sequences alignment measurement

PAM and Blosum matrix

Matrix                 Best in determining
PAM 40 / Blosum 90     Short similar (conserved) alignments
PAM 250                Longer, more divergent alignments
PAM 160 / Blosum 80    Detecting members of protein families
Blosum 62              Finding all potential similarities

Adapted from Baxevanis 2005

An excellent review of scoring matrices can be found in Henikoff and Henikoff 2000.

Page 20: DNA sequences alignment measurement


Dot plot matrix: imperfect match

• Some alignments require gaps to increase the matching score; the gaps are used to represent insertion/deletion mutations.

• The diagram shows that most of the 2 sequences are aligned. Breaks in the diagonal indicate areas of non-alignment or mismatch: gaps or substitutions.

[Figure: dot plot of an imperfect match. Adapted from: dotplot example]

Page 21: DNA sequences alignment measurement

Measurement of alignment gaps

• Gaps represent insertions and deletions.

• Baxevanis (2005) suggests that no more than “one gap in 20 pairs is a good rule of thumb”.

• Gaps in alignments are penalised, i.e. given a negative scoring value.

• The penalty associated with using gaps depends on:
  – opening the gap (introducing an insertion or deletion);
  – extending the gap (as opposed to opening a new gap);
  – the length of the gap (the number of deletions/insertions).
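
One common way to combine these three factors is an affine gap penalty, where opening a gap costs more than extending it. The slides do not name a specific scheme, so the affine form, the function name `gap_penalty` and the default values below are assumptions for illustration only.

```python
def gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Affine penalty: one opening cost plus an extension cost per extra position."""
    if length < 1:
        raise ValueError("a gap must cover at least one position")
    return gap_open + gap_extend * (length - 1)


print(gap_penalty(1))      # a single-position gap
print(gap_penalty(3))      # one gap of length 3
print(3 * gap_penalty(1))  # three separate single-position gaps
```

With these made-up values, one long gap is penalised less than several short ones of the same total length, which reflects the idea that a single insertion or deletion event can span several positions.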

Page 22: DNA sequences alignment measurement

Gap penalties

• There is no overall agreement on what values should be assigned to gap penalties (Zvelebil and Baum 2008).

• The purpose of inserting a gap is to increase the strength of the alignment.

• So choosing a high penalty will eliminate alignments with gaps, while if the penalty is too low then alignments with more and larger gaps will be chosen.

• The value should also depend on how closely “related” the aligned sequences must be:
  – sequences requiring a very strict match would use a high gap penalty;
  – alignment between distantly related species would use a low gap penalty.
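
The effect of the penalty size on which alignment is preferred can be shown with a small scoring sketch. The sequences, the match/mismatch values of +1/-1 and the function name `score_alignment` are all made up for illustration; the slides give no specific values.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Score two aligned rows; '-' marks a gap. Penalties are subtracted."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    prev_gap_row = None  # which row held the gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            # Continuing a gap in the same row is an extension, otherwise an opening.
            score -= gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


# Two candidate alignments of the same pair of sequences.
ungapped = ("acgtacgta", "acgacgtaa")
gapped = ("acgtacgta-", "acg-acgtaa")

for gap_open in (1, 5):
    print(gap_open,
          score_alignment(*ungapped, gap_open=gap_open),
          score_alignment(*gapped, gap_open=gap_open))
```

With the low opening penalty the gapped alignment scores higher, and with the high penalty the ungapped alignment wins, matching the trade-off described on this slide.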

Page 23: DNA sequences alignment measurement

Potential exam questions

• What is the purpose of measuring the strength of an alignment? (3 marks)

• Explain two differences between analysing a string (sequence) and a DNA string. (4 marks)

• Describe how you would measure the similarity between two DNA sequences. (10 marks)

• Discuss the use of gap penalties in a sequence alignment score. (13 marks)

Page 24: DNA sequences alignment measurement

References

• Baxevanis, A.D. (2005) Bioinformatics: a practical guide to the analysis of genes and proteins, chapter 11; Wiley

• Lesk, A. (2008) Introduction to bioinformatics, 3rd edition; Oxford University Press

• Zvelebil and Baum (2008) Understanding Bioinformatics