10/10/2016
1
Sequence Comparison
Abhishek Niroula
Protein Structure and Bioinformatics
Department of Experimental Medical Science
Lund University
1 2016-10-11
Learning goals
• What is a sequence alignment?
• What approaches are used for aligning sequences?
• How to choose the best alignment?
• What are substitution matrices?
• Which tools are available for aligning two or more sequences?
• How to use the alignment tools?
• How to interpret results obtained from the tools?
What is sequence alignment?
• A way of arranging two or more sequences to identify regions of similarity
• Shows locations of similarities and differences between the sequences
• Aligned residues are assumed to derive from the same residue in the common ancestor of the sequences
• Insertions and deletions are represented by gaps in the alignment
• An 'optimal' alignment exhibits the most similarities and the least differences
• Examples
Protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***
Nucleotide sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
*** **** **** ** ******
Why align sequences?
• Reveal structural, functional and evolutionary relationships between
biological sequences
• Similar sequences may have similar structure and function
• Similar sequences are likely to have a common ancestral sequence
• Annotation of new sequences
• Modelling of protein structures
Sequence alignment: Types
• Global alignment
– Aligns the sequences over their full length, introducing gaps where needed
– Example: Needleman-Wunsch algorithm
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A
Sequence alignment: Types
• Local alignment
– Finds regions with the highest density of matches locally
– Example: Smith-Waterman algorithm
- - - - - - - T G K G H R R K S P
R S D E L K A A G K G - - - - - -
F T F T A L I L L - A V A V
- - F T A L - L L A A V - -
How to find the best alignment?
• Seq1: TACGGGCAG
• Seq2: ACGGCG
T A C G G G C A G
- A C - G G C - G
Option 1
T A C G G G C A G
- A C G G - C - G
Option 2
T A C G G G C A G
- A C G - G C - G
Option 3
Find the alignment score!!!
How to find the alignment score?
• Scoring matrices are used to assign scores to each comparison of a pair of
characters
• Identities and substitutions by similar amino acids are assigned positive scores
• Mismatches, or matches that are unlikely to have been a result of evolution, are given
negative scores
A C D E F G H I K
A C Y E F G R I K
+5 +5 -5 +5 +5 +5 -5 +5 +5
Matches +5
Mismatches -5
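The match/mismatch scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration that assumes the two rows are already aligned, have equal length and contain no gaps; the function name `alignment_score` is hypothetical.

```python
def alignment_score(row_a: str, row_b: str, match: int = 5, mismatch: int = -5) -> int:
    """Sum the per-column scores of two aligned rows (equal length, no gaps)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(row_a, row_b))


# The example from the slide: 7 identical columns and 2 mismatches (D/Y and H/R)
score = alignment_score("ACDEFGHIK", "ACYEFGRIK")
print(score)  # 7 * 5 + 2 * (-5) = 25
```

The alignment with the highest total score is the one to prefer; a substitution matrix replaces the fixed +5/-5 values with a score per pair of residues.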
PAM-1 substitution matrix
What is a PAM matrix?
• PAM matrices
– PAM - Point Accepted Mutation
– A PAM matrix gives the probability that a given amino acid is replaced by each other amino
acid over a given evolutionary distance
– An accepted point mutation in a protein is a replacement of one amino acid by
another, accepted by natural selection
– Derived from global alignments of closely related sequences
– The numbers with the matrix (PAM40, PAM100) refer to the evolutionary distance
(greater numbers mean greater distances)
– 1-PAM matrix refers to the amount of evolution that would change 1% of the
residues/bases (on average)
– 2-PAM matrix does NOT refer to a change in 2% of residues
• It refers to applying 1-PAM twice
• Some variations may change back to the original residue
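Why two rounds of 1-PAM change slightly less than 2% of the residues can be checked numerically. The sketch below uses a made-up two-letter alphabet with a 1% replacement probability per step; it is not a real PAM matrix, only an illustration of the matrix multiplication involved.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical 1-PAM-like matrix: each letter stays unchanged with p = 0.99
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]

# "2-PAM" is 1-PAM applied twice, i.e. the matrix multiplied by itself
pam2 = matmul(pam1, pam1)

changed_after_two_steps = 1.0 - pam2[0][0]
print(round(changed_after_two_steps, 4))  # 0.0198, slightly below 0.02
```

The small difference comes from residues that mutate in the first step and change back in the second.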
BLOSUM62
What is BLOSUM?
• BLOSUM matrices
– BLOSUM - Blocks Substitution Matrix
– The score for each pair of amino acids is derived from the observed frequencies of
substitutions in blocks of local alignments of protein sequences.
– For example, BLOSUM62 is derived from sequence alignments with no
more than 62% identity.
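How an observed frequency turns into a matrix score can be sketched as a log-odds calculation. The example assumes scores in half-bit units (2 x log2 of observed over expected) and treats `pair_freq` as the frequency of the ordered pair; all frequencies below are invented for illustration and are not taken from the BLOCKS data.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float) -> int:
    """Half-bit log-odds score: 2 * log2(observed / expected), rounded."""
    expected = freq_a * freq_b  # chance of seeing the pair if residues were independent
    return round(2 * math.log2(pair_freq / expected))


# Hypothetical numbers: a pair seen 8 times more often than expected by chance
print(log_odds_score(0.02, 0.05, 0.05))   # 6

# Hypothetical numbers: a pair seen less often than expected by chance
print(log_odds_score(0.001, 0.05, 0.05))  # -3
```

Pairs observed more often than chance get positive scores and pairs observed less often get negative scores, which matches the sign convention described for scoring matrices earlier.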
Which scoring matrix to use?
For global alignments use PAM matrices.
• Lower PAM matrices tend to find short alignments of highly similar regions
• Higher PAM matrices will find weaker, longer alignments
For local alignments use BLOSUM matrices
• BLOSUM matrices with a HIGH number are better for similar sequences
• BLOSUM matrices with a LOW number are better for distant sequences
Sequence alignment: Methods
• Pairwise alignment
– Finding best alignment of two sequences
– Often used for searching sequences with highest similarity in the sequence databases
• Dot Matrix Analysis
• Dynamic Programming (DP)
• Short word matching
• Multiple Sequence Alignment (MSA)
– Alignment of more than two sequences
– Often used to find conserved domains, regions or sites among many sequences
• Dynamic programming
• Progressive methods
• Iterative methods
• Structural alignments
– Alignments based on structure
What is a Dot Matrix?
• Method for comparing two sequences (amino acid or nucleotide)
• Let's align two sequences using a dot matrix
A: A G C T A G G A
B: G A C T A G G C
– Sequence A is placed along the X-axis
and sequence B along the Y-axis
Columns: sequence A; rows: sequence B
  A G C T A G G A
G
A
C
T
A
G
G
C
Find the matching nucleotides
– Starting from the first nucleotide in B,
move along the first row, placing a dot in
every column with a matching nucleotide
Columns: sequence A; rows: sequence B
  A G C T A G G A
G   ●       ● ●
A
C
T
A
G
G
C
Continue to fill the table
– Repeat the procedure for all the
nucleotides in B
Columns: sequence A; rows: sequence B
  A G C T A G G A
G   ●       ● ●
A ●       ●     ●
C
T
A
G
G
C
Why are some cells empty in a dot matrix?
Columns: sequence A; rows: sequence B
  A G C T A G G A
G   ●       ● ●
A ●       ●     ●
C     ●
T       ●
A ●       ●     ●
G   ●       ● ●
G   ●       ● ●
C     ●
Cells corresponding to mismatching nucleotides are empty
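The filling procedure can be sketched in Python as follows. The function name `dot_matrix` is hypothetical, and `*` stands in for the dot so the output stays plain ASCII.

```python
def dot_matrix(seq_a: str, seq_b: str):
    """Rows follow seq_b (Y-axis), columns follow seq_a (X-axis).

    A cell is True where the two nucleotides match and False otherwise.
    """
    return [[b == a for a in seq_a] for b in seq_b]


seq_a, seq_b = "AGCTAGGA", "GACTAGGC"
matrix = dot_matrix(seq_a, seq_b)

# Print the matrix with sequence A as the header row
print("  " + " ".join(seq_a))
for base, row in zip(seq_b, matrix):
    print(base + " " + " ".join("*" if hit else " " for hit in row))
```

Every comparison is made independently, so the matrix needs len(A) x len(B) cells; this is the reason dot matrices become space-inefficient for long sequences.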
Is there something interesting in the matrix?
– Region of similarity is revealed by a
diagonal row of dots
– Other isolated dots represent random
matches
Columns: sequence A; rows: sequence B
  A G C T A G G A
G   ●       ● ●
A ●       ●     ●
C     ●
T       ●
A ●       ●     ●
G   ●       ● ●
G   ●       ● ●
C     ●
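The diagonal row of dots can also be located programmatically. The sketch below looks for the longest run of consecutive matches along any diagonal; it is one simple way to do this, not the method used by any particular dot plot tool, and the function name is hypothetical.

```python
def longest_diagonal(seq_a: str, seq_b: str):
    """Return (length, start_in_a, start_in_b) of the longest diagonal run of matches.

    Start positions are 0-based. Ties are resolved in favour of the first run found.
    """
    best = (0, 0, 0)
    prev = [0] * (len(seq_a) + 1)  # run lengths ending in the previous row
    for i, b in enumerate(seq_b, start=1):
        cur = [0] * (len(seq_a) + 1)
        for j, a in enumerate(seq_a, start=1):
            if a == b:
                # Extend the run that ended one step up and to the left
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], j - cur[j], i - cur[j])
        prev = cur
    return best


length, start_a, start_b = longest_diagonal("AGCTAGGA", "GACTAGGC")
print(length, start_a, start_b)                   # 5 2 2
print("AGCTAGGA"[start_a:start_a + length])       # CTAGG
```

For the two example sequences the diagonal corresponds to the shared segment CTAGG; the remaining dots are the isolated random matches.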
How to interpret dot plots?
• Two similar, but not identical, sequences
• An insertion or deletion
• A tandem duplication
• An inversion
• Joining sequences
Limitations of dot matrix
• Sequences with low-complexity regions give false diagonals
– Sequence regions with little diversity
• Noisy and space inefficient
• Limited to 2 sequences
Dotplot exercise
• Use the following three tools to generate dot plots for two sequences
• YASS: genomic similarity search tool
– http://bioinfo.lifl.fr/yass/yass.php
• Lalign/Palign
– http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
• multi-zPicture
– http://zpicture.dcode.org/
Dynamic programming
• Breaks down the alignment problem into smaller problems
• Example
– Needleman-Wunsch algorithm: global alignment
– Smith-Waterman algorithm: local alignment
• Three steps
– Initialization
– Scoring
– Traceback
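The three steps can be sketched for the Needleman-Wunsch (global) case as below. The scoring values (match +2, mismatch -1, gap -1) and the two input sequences are illustrative assumptions, and the gap penalty is linear (the same cost for every gap position).

```python
def needleman_wunsch(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # 1. Initialization: first row and column hold cumulative gap penalties
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # 2. Scoring: each cell is the best of diagonal, up and left moves
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # 3. Traceback: walk from the bottom-right cell back to the top-left cell
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row_a, row_b = needleman_wunsch("AGTTA", "AGTGCA")
print(total)  # 6
print(row_a)
print(row_b)
```

Several alignments can share the optimal score; which one the traceback returns depends on the order in which the three moves are tested.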
Where to place gaps in the alignment?
• Insertion of gaps in the alignment
• Gaps should be penalized
• Gap opening should be penalized higher than gap extension (or at
least equal)
• Gap scores commonly used with BLOSUM62 (e.g. the BLAST protein defaults)
– Gap opening score = -11
– Gap extension score = -1
A A A G A G A A A
A A A - - A A A A
Gap extension
A A A G A G A A A
A A A - A - A A A
Gap initiation
A A A G A G A A A -
A A A - - - A A A A
Gap extension
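The difference between extending a gap and opening a new one can be made concrete with a small calculation. The sketch assumes one common convention: the first position of each gap costs the opening score and every further position costs the extension score. Tools differ on this detail (some also charge the extension score for the first position), so the absolute numbers are illustrative.

```python
import re


def gap_cost(aligned_row: str, gap_open: int = -11, gap_extend: int = -1) -> int:
    """Total gap score of one aligned row under an affine gap model."""
    total = 0
    for run in re.findall(r"-+", aligned_row):  # each run of '-' is one gap
        total += gap_open + (len(run) - 1) * gap_extend
    return total


print(gap_cost("AAA--AAAA"))   # one gap of length 2:  -11 + -1  = -12
print(gap_cost("AAA-A-AAA"))   # two gaps of length 1: -11 + -11 = -22
print(gap_cost("AAA---AAAA"))  # one gap of length 3:  -11 + -2  = -13
```

Under these scores a single longer gap is much cheaper than two separate gaps, which is why alignments with few, contiguous gaps are favoured.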
Local and global pairwise alignment
• Needleman-Wunsch (global)
– Match = +2
– Mismatch = -1
– Gap = -1
   -  A  G  T  T  A
-  0 -1 -2 -3 -4 -5
A -1  2
G -2
T -3
G -4
C -5
A -6
• Smith-Waterman (local)
– Match = +2
– Mismatch = -1
– Gap = -1
– All negative values are replaced by 0
– Traceback starts at the highest value and ends at 0
   -  A  G  T  T  A
-  0  0  0  0  0  0
A  0  2
G  0
T  0
G  0
C  0
A  0
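A minimal Smith-Waterman sketch with the scores listed above (match +2, mismatch -1, gap -1) shows the two differences from the global method: negative cell values are replaced by 0, and the traceback runs from the highest cell until a 0 is reached. The function name is hypothetical and the gap penalty is linear.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # max(0, ...) replaces every negative value by 0
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback starts at the highest value and stops at the first 0
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


local_score, local_a, local_b = smith_waterman("AGTTA", "AGTGCA")
print(local_score)  # 6
print(local_a)
print(local_b)
```

When several cells share the highest value, this sketch starts the traceback from the first one encountered; other implementations may report a different, equally scoring local alignment.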
Needleman-Wunsch vs Smith-Waterman
Sequence alignment teacher (http://melolab.org/websoftware/web/?sid=3)
Dynamic programming: example
• http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
• Scoring
– Match = +2
– Mismatch = -2
– Gap = -1
Dynamic programming exercise
• Generate a scoring matrix for nucleotides (A, C, G, and T)
• Align two sequences using dynamic programming and the scoring
matrix you just generated
• Align two sequences using the following tools
– EMBOSS Needle
• http://www.ebi.ac.uk/Tools/psa/emboss_needle/
– EMBOSS Water
• http://www.ebi.ac.uk/Tools/psa/emboss_water/
DP exercise: Needle vs Water
MSA: What and Why?
• A multiple sequence alignment (MSA) is an alignment of three or
more sequences
• Why MSA?
– To identify patterns of conservation across more than 2 sequences
– To characterize protein families and generate profiles of protein families
– To infer relationships within and among gene families
– To predict secondary and tertiary structures of new sequences
– To perform phylogenetic studies
MSA application in variation interpretation
• Interpreting the impacts of genetic variants
• All tools for variation interpretation use MSAs of protein and
nucleotide sequences
• http://structure.bmc.lu.se/services.php
• PON-P2
• PON-BTK
• PON-MMR2
• PON-mt-tRNA
• PON-Diso
• PON-Sol
• PON-PS
• PPSC
Are there tools to perform MSA?
Do you remember dynamic programming?
2 sequences
http://ai.stanford.edu/~serafim/CS262_2005/LectureNotes/Lecture17.pdf
3 sequences
How to align three sequences using DP?
• Dynamic programming
– Align each pair of sequences
– Sum scores for each pair at each position
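Summing the pairwise scores at each position is often called a sum-of-pairs score. The sketch below scores an already aligned set of rows; the match, mismatch and gap values are illustrative assumptions, and scoring a gap against a gap as 0 is a common convention rather than something stated on the slide.

```python
from itertools import combinations


def sum_of_pairs(column, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Score one alignment column by summing the scores of all pairs in it."""
    total = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue  # assumption: gap against gap scores 0
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total


def msa_score(rows) -> int:
    """Sum-of-pairs score of a whole alignment (all rows have equal length)."""
    return sum(sum_of_pairs(column) for column in zip(*rows))


# Hypothetical three-sequence alignment
print(msa_score(["AGT", "AGT", "A-T"]))  # 6 + 0 + 6 = 12
```

Finding the alignment that maximizes this score with full dynamic programming requires a matrix with one dimension per sequence, which is why the exact approach is practical only for very few sequences.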
MSA: Progressive alignment
• Progressive sequence alignment
– Hierarchical or tree based method
– E.g. ClustalW, T-Coffee, Clustal Omega
• Basic steps
– Calculate pairwise distances based on pairwise alignments between the
sequences
– Build a guide tree, which is an inferred phylogeny for the sequences
– Align the sequences
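The first step, calculating pairwise distances, can be sketched as below. As a simplification, the example skips the pairwise alignment and uses made-up sequences of equal length, so the distance is simply the fraction of differing positions; real tools derive the distances from pairwise alignments or faster approximations.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical input sequences
seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TCGAACCA"}

distances = {pair: p_distance(seqs[pair[0]], seqs[pair[1]])
             for pair in combinations(seqs, 2)}
for pair, d in distances.items():
    print(pair, d)

# The closest pair is the first to be joined in the guide tree
closest = min(distances, key=distances.get)
print(closest)  # ('s1', 's2')
```

The guide tree then determines the order of alignment: the most similar sequences are aligned first and the more distant ones are added progressively.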
Progressive MSA
MSA: Iterative alignment
• Iterative sequence alignment
– Improved progressive alignment
– Aligns the sequences repeatedly
– E.g. MUSCLE
Iterative MSA
• Follows 3 steps
– Progressive alignment
– Second progressive alignment
– Refinement
Phylogenetic tree
• A phylogenetic tree shows evolutionary relationships between the
sequences
• Types:
– Rooted
• Internal nodes represent the most recent common ancestors
• Edge lengths represent time estimates
– Unrooted
• No ancestry and time estimates
• Algorithms to generate phylogenetic trees
– Neighbor-joining
– Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
– Maximum parsimony
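Of the listed algorithms, UPGMA is the simplest to sketch: repeatedly join the two closest clusters and average their distances to all other clusters, weighted by cluster size. The version below returns only the tree topology as nested tuples (no branch lengths), and the distance values are hypothetical.

```python
def upgma(names, dist):
    """Build a rooted tree (nested tuples) from pairwise distances.

    dist maps (name_a, name_b) to a distance for every pair of names.
    """
    # Each cluster id maps to (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {}
    for (a, b), value in dist.items():
        i, j = names.index(a), names.index(b)
        d[(min(i, j), max(i, j))] = value

    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        # Distance from the merged cluster to every remaining cluster:
        # average of the two old distances, weighted by cluster size
        for k in clusters:
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del d[(i, j)]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1

    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical distances between three sequences
example = {("s1", "s2"): 0.125, ("s1", "s3"): 0.5, ("s2", "s3"): 0.375}
print(upgma(["s1", "s2", "s3"], example))  # ('s3', ('s1', 's2'))
```

UPGMA assumes that all lineages evolve at the same rate; neighbor-joining drops that assumption and produces an unrooted tree.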
Neighbor joining method
http://en.wikipedia.org/wiki/Neighbor_joining
MSA exercise
• Align the protein sequences SET 1 and SET 2 using MSA tools and
compare the alignments
• Clustal Omega
– http://www.ebi.ac.uk/Tools/msa/clustalo/
• MUSCLE
– http://www.ebi.ac.uk/Tools/msa/muscle/
What to align: DNA or protein sequence?
• Many mismatches between DNA sequences are
synonymous
• DNA sequences contain non-coding regions, which
should be avoided in homology searching
• Matches are more reliable in protein sequences
– Probability of a match occurring by chance at any position in a sequence
• Amino acids: 1/20 = 0.05
• Nucleotides: 1/4 = 0.25
• Searching at the protein level: in case of frameshifts, the
alignment score for the protein sequences may be very
low even though the DNA sequences are similar
ACT TTT CAT GGG ...
Thr Phe His Gly ...
ACT TTT TCA TGG G..
Thr Phe Ser Trp
If an ORF exists, then always align at the protein level
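The frameshift example can be reproduced with a short translation sketch. The codon table below contains only the six codons that occur in the example; a full genetic code table has 64 entries.

```python
# Partial codon table: only the codons used in this example
CODONS = {"ACT": "Thr", "TTT": "Phe", "CAT": "His",
          "GGG": "Gly", "TCA": "Ser", "TGG": "Trp"}


def translate(dna: str):
    """Translate complete codons in reading frame 1; trailing bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return [CODONS[dna[i:i + 3]] for i in range(0, usable, 3)]


original = "ACTTTTCATGGG"
shifted = original[:6] + "T" + original[6:]  # one base inserted after the second codon

print(translate(original))  # ['Thr', 'Phe', 'His', 'Gly']
print(translate(shifted))   # ['Thr', 'Phe', 'Ser', 'Trp']
```

The two DNA sequences differ by a single inserted base, yet every amino acid after the insertion changes, so the protein-level alignment score drops sharply.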
Searching bioinformatics databases
Learning goals
• How to find information using Keywords?
• How to find information using Sequences?
• How to find correct gene names?
Search strategy
• Keyword search
– Find information related to specific keywords
– Each bioinformatics database has its own search tool
– Some search tools cover a wide spectrum: they query multiple
databases and gather the results together
– E.g. Gquery, EBI search
• Sequence search
– Use a sequence of interest to find more information about the sequence
– BLAST, FASTA
Keyword search
• Find information related to specific keywords
• Gquery
– A central search tool to find information in NCBI databases
– Searches a large number of NCBI databases and shows the results on one
page
– http://www.ncbi.nlm.nih.gov/gquery
• EBI search
– Search tool to find information from databases developed, managed
and hosted by EMBL-EBI
– http://www.ebi.ac.uk/services
Gquery
EBI search
Limitations
• Synonyms
• Misspellings
• Old and new names/terms
• NOTES:
– Use different synonyms and read literature to find more appropriate keywords
– Use boolean operators to combine different keywords
– Do not expect to find all the information using keyword search alone
– Note the database version or the version of entries in the databases you used
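Combining synonyms with a boolean operator can be sketched as follows. The helper `or_query` is hypothetical. The URL targets the NCBI E-utilities `esearch` endpoint with its `db` and `term` parameters; the sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode


def or_query(synonyms) -> str:
    """Join synonyms with OR; terms containing spaces are quoted as phrases."""
    terms = [f'"{s}"' if " " in s else s for s in synonyms]
    return "(" + " OR ".join(terms) + ")"


# Old and new symbol of the same gene, searched together
term = or_query(["ELA2", "ELANE"])
print(term)  # (ELA2 OR ELANE)

# Spelling variants of the same keyword
print(or_query(["HIV-1", "HIV 1"]))  # (HIV-1 OR "HIV 1")

url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
       + urlencode({"db": "pubmed", "term": term}))
print(url)
```

Searching with all known synonyms at once reduces the chance of missing entries that were annotated under an older name.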
[Figure: search results for synonymous terms in PubMed and ClinVar. ELA2 / ELANE: 8, 64, 110; HIV 1 / HIV-1: 59, 20]
Gene nomenclature
• HUGO Gene Nomenclature Committee (HGNC)
– Assigns standardized nomenclature to human genes
– Each symbol is unique and each gene is given only one name
• Species specific nomenclature committees
– Mouse Genome Informatics Database
• http://www.informatics.jax.org/mgihome/nomen/
– Rat Genome Database
• http://rgd.mcw.edu/nomen/nomen.shtml
HGNC symbol report
• Approved symbol
• Approved name
• Synonyms
– Terms used in literature to indicate the gene
– HGNC, Ensembl, Entrez Gene, OMIM
• Previous symbols and names
– Previous HGNC approved symbol
• NOTE: HGNC does not approve protein names. Usually genes and proteins
have the same name and gene names are written in italics.
HGNC search
Keyword search
• Exercise