Sequence Alignment and Comparison - Lunds universitet

10/10/2016


Sequence Comparison

Abhishek Niroula

Protein Structure and Bioinformatics

Department of Experimental Medical Science

Lund University

1 2016-10-11

Learning goals

• What is a sequence alignment?

• What approaches are used for aligning sequences?

• How to choose the best alignment?

• What are substitution matrices?

• Which tools are available for aligning two or more sequences?

• How to use the alignment tools?

• How to interpret results obtained from the tools?

What is sequence alignment?

• A way of arranging two or more sequences to identify regions of similarity

• Shows locations of similarities and differences between the sequences

• Aligned residues are assumed to derive from the same residue in the common ancestor

• Insertions and deletions are represented by gaps in the alignment

• An 'optimal' alignment exhibits the most similarities and the least differences

• Examples

Protein sequence alignment

MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***

Nucleotide sequence alignment

attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    * ****

Why align sequences?

• Reveal structural, functional and evolutionary relationships between biological sequences

• Similar sequences may have similar structure and function

• Similar sequences are likely to have a common ancestral sequence

• Annotation of new sequences

• Modelling of protein structures

Sequence alignment: Types

• Global alignment

– Aligns every residue of every sequence over the full length, introducing gaps where needed

– Example: Needleman-Wunsch algorithm

L G P S S K Q T G K G S - S R I W D N

L N - I T K S A G K G A I M R L G D A

Sequence alignment: Types

• Local alignment

– Finds regions with the highest density of matches locally

– Example: Smith-Waterman algorithm

- - - - - - - T G K G H R R K S P

R S D E L K A A G K G - - - - - -

F T F T A L I L L - A V A V

- - F T A L - L L A A V - -

How to find the best alignment?

• Seq1: TACGGGCAG

• Seq2: ACGGCG

Option 1

T A C G G G C A G
- A C - G G C - G

Option 2

T A C G G G C A G
- A C G G - C - G

Option 3

T A C G G G C A G
- A C G - G C - G

Find the alignment score!

How to find the alignment score?

• Scoring matrices are used to assign a score to each comparison of a pair of characters

• Identities and substitutions by similar amino acids are assigned positive scores

• Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores

A   C   D   E   F   G   H   I   K
A   C   Y   E   F   G   R   I   K
+5  +5  -5  +5  +5  +5  -5  +5  +5

Matches +5

Mismatches -5
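The per-position scoring above can be sketched in a few lines of Python. This is a minimal illustration, assuming the flat +5/-5 scheme from the slide. The function name `score_alignment` and the gap score of -5 are hypothetical additions, since the slide does not define a gap score at this point.

```python
def score_alignment(a, b, match=5, mismatch=-5, gap=-5):
    """Sum per-column scores for two aligned sequences of equal length.

    match/mismatch follow the slide; the gap score is an assumption.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap        # a residue facing a gap
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


# The example from the slide: 7 matches and 2 mismatches.
print(score_alignment("ACDEFGHIK", "ACYEFGRIK"))

# The three candidate alignments of TACGGGCAG and ACGGCG.
for option in ("-AC-GGC-G", "-ACGG-C-G", "-ACG-GC-G"):
    print(option, score_alignment("TACGGGCAG", option))
```

Under this simple scheme the three candidate alignments receive the same score, because each has six matches and three single-position gaps.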

PAM-1 substitution matrix

What is a PAM matrix?

• PAM matrices

– PAM - Point Accepted Mutation (also expanded as Percent Accepted Mutation)

– A PAM matrix gives the probability that a given amino acid will be replaced by each other amino acid

– An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection

– Derived from global alignments of closely related sequences

– The number attached to the matrix (PAM40, PAM100) refers to the evolutionary distance (greater numbers mean greater distances)

– The 1-PAM matrix corresponds to the amount of evolution that would change 1% of the residues/bases (on average)

– The 2-PAM matrix does NOT correspond to a change in 2% of the residues

• It corresponds to applying 1-PAM twice

• Some positions may change back to the original residue
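Why 2-PAM is not a 2% change can be checked numerically. The sketch below is an illustration only: it uses a hypothetical 3-letter alphabet and a made-up symmetric 1-PAM-like matrix (99% chance of staying unchanged), not real amino acid data.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Toy "1-PAM" matrix (assumption): each residue stays the same with
# probability 0.99 and changes to each of the 2 others with 0.005.
pam1 = [[0.99 if i == j else 0.005 for j in range(3)] for i in range(3)]

# Applying 1-PAM twice corresponds to multiplying the matrix by itself.
pam2 = matmul(pam1, pam1)

changed = 1 - pam2[0][0]
print(f"fraction changed after 2 PAM steps: {changed:.5f}")
```

The fraction of changed residues comes out slightly below 2%, because some residues that changed in the first step change back in the second.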

BLOSUM62

What is BLOSUM?

• BLOSUM matrices

– BLOSUM - Blocks Substitution Matrix

– Each score is derived from the observed frequencies of substitutions in blocks of local alignments of protein sequences.

– For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity.
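How an observed substitution frequency becomes a matrix score can be sketched as a log-odds calculation. This is a simplified illustration: the frequencies below are made up, `log_odds` is a hypothetical helper, and the scaling factor of 2 (half-bit units) is stated here as an assumption about how BLOSUM62 is usually scaled.

```python
import math


def log_odds(q_ij, p_i, p_j, scale=2.0):
    """Rounded log-odds score for an ordered residue pair.

    q_ij: observed frequency of the aligned pair (toy value)
    p_i, p_j: background frequencies of the two residues (toy values)
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))


# Pair seen 8 times more often than expected by chance -> positive score.
print(log_odds(0.02, 0.05, 0.05))

# Pair seen half as often as expected by chance -> negative score.
print(log_odds(0.00125, 0.05, 0.05))
```

Pairs observed more often than chance predicts get positive scores, and pairs observed less often get negative scores.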

Which scoring matrix to use?

For global alignments use PAM matrices.

• Lower PAM matrices tend to find short alignments of highly similar regions

• Higher PAM matrices will find weaker, longer alignments

For local alignments use BLOSUM matrices

• BLOSUM matrices with a HIGH number are better for similar sequences

• BLOSUM matrices with a LOW number are better for distant sequences

Sequence alignment: Methods

• Pairwise alignment

– Finding best alignment of two sequences

– Often used for searching sequences with highest similarity in the sequence databases

• Dot Matrix Analysis

• Dynamic Programming (DP)

• Short word matching

• Multiple Sequence Alignment (MSA)

– Alignment of more than two sequences

– Often used to find conserved domains, regions or sites among many sequences

• Dynamic programming

• Progressive methods

• Iterative methods

• Structural alignments

– Alignments based on structure

What is a Dot Matrix?

• Method for comparing two sequences (amino acid or nucleotide)

• Let's align two sequences using a dot matrix

A: A G C T A G G A

B: G A C T A G G C

– Sequence A is placed along the X-axis and sequence B along the Y-axis

    A G C T A G G A
G
A
C
T
A
G
G
C

Find the matching nucleotides

– Starting from the first nucleotide in B, move along the first row, placing a dot in each column with a matching nucleotide

    A G C T A G G A
G     ●       ● ●
A
C
T
A
G
G
C

Continue to fill the table




– Repeat the procedure for all the nucleotides in B

    A G C T A G G A
G     ●       ● ●
A   ●       ●     ●
C
T
A
G
G
C

Why are some cells empty in a dot matrix?





    A G C T A G G A
G     ●       ● ●
A   ●       ●     ●
C       ●
T         ●
A   ●       ●     ●
G     ●       ● ●
G     ●       ● ●
C       ●

Cells corresponding to mismatching nucleotides are empty

Is there something interesting in the matrix?





– A region of similarity is revealed by a diagonal row of dots

– Other isolated dots represent random matches

    A G C T A G G A
G     ●       ● ●
A   ●       ●     ●
C       ●
T         ●
A   ●       ●     ●
G     ●       ● ●
G     ●       ● ●
C       ●
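The dot matrix for sequences A and B can be reproduced with a short script. This is a minimal sketch: the function names `dot_matrix` and `render` are hypothetical, and `*` is used in place of the dot symbol to keep the output plain ASCII.

```python
def dot_matrix(seq_a, seq_b):
    """One row per residue of seq_b; True where it matches a residue of seq_a."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text with seq_a on the X-axis and seq_b on the Y-axis."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        cells = " ".join("*" if hit else " " for hit in row)
        lines.append((b + " " + cells).rstrip())
    return "\n".join(lines)


print(render("AGCTAGGA", "GACTAGGC"))

# The shared region CTAGG shows up as a run of matches on the main diagonal.
matrix = dot_matrix("AGCTAGGA", "GACTAGGC")
print([matrix[i][i] for i in range(8)])
```

Cells on the main diagonal at positions 3 to 7 are all matches, which corresponds to the shared stretch CTAGG. The remaining dots are chance matches.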

How to interpret dot plots?

Patterns shown (example plots not reproduced):

• Two similar, but not identical, sequences

• An insertion or deletion

• A tandem duplication

• An inversion

• Joining sequences

Limitations of dot matrix

• Sequences with low-complexity regions give false diagonals

– Sequence regions with little diversity

• Noisy and space inefficient

• Limited to 2 sequences

Dotplot exercise

• Use the following three tools to generate dot plots for two sequences

• YASS: genomic similarity search tool

– http://bioinfo.lifl.fr/yass/yass.php

• Lalign/Palign

– http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign

• multi-zPicture

– http://zpicture.dcode.org/

Dynamic programming

• Breaks down the alignment problem into smaller problems

• Example

– Needleman-Wunsch algorithm: global alignment

– Smith-Waterman algorithm: local alignment

• Three steps

– Initialization

– Scoring

– Traceback

Where to place gaps in the alignment?

• Insertions and deletions are modelled by inserting gaps into the alignment

• Gaps should be penalized

• Gap opening should be penalized more heavily than gap extension (or at least equally)

• With BLOSUM62, commonly used gap scores are

– Gap opening score = -11

– Gap extension score = -1

A A A G A G A A A
A A A - - A A A A
Gap extension (one gap, two positions long)

A A A G A G A A A
A A A - A - A A A
Gap initiation (two separate gaps)

A A A G A G A A A -
A A A - - - A A A A
Gap extension (one gap, three positions long)
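The effect of separate opening and extension scores can be sketched as follows. This is an illustration under one stated convention: the first position of a gap costs the opening score and every further position costs the extension score. Tools differ in how they combine the two values, so treat the convention and the helper names `gap_cost` and `total_gap_score` as assumptions.

```python
import re


def gap_cost(length, gap_open=-11, gap_extend=-1):
    """Score of one contiguous gap of the given length (assumed convention)."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def total_gap_score(aligned, gap_open=-11, gap_extend=-1):
    """Sum the gap scores over every run of '-' in one aligned sequence."""
    return sum(gap_cost(len(run), gap_open, gap_extend)
               for run in re.findall(r"-+", aligned))


# The three aligned sequences from the slide.
print(total_gap_score("AAA--AAAA"))   # one gap of length 2
print(total_gap_score("AAA-A-AAA"))   # two gaps of length 1
print(total_gap_score("AAA---AAAA"))  # one gap of length 3
```

With these scores, two separate single-position gaps cost almost twice as much as one gap of length two, which is why the alignment with the extended gap is preferred.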

Local and global pairwise alignment

• Needleman-Wunsch (global)

– Match = +2

– Mismatch = -1

– Gap = -1

      -   A   G   T   T   A
  -   0  -1  -2  -3  -4  -5
  A  -1   2
  G  -2
  T  -3
  G  -4
  C  -5
  A  -6

• Smith-Waterman (local)

– Match = +2

– Mismatch = -1

– Gap = -1

– All negative values are replaced by 0

– Traceback starts at the highest value and ends at 0

      -   A   G   T   T   A
  -   0   0   0   0   0   0
  A   0   2
  G   0
  T   0
  G   0
  C   0
  A   0
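The two matrices above can be filled in and traced back with one short routine. This is a sketch, not a reference implementation: it assumes the slide's scores (match +2, mismatch -1, linear gap -1), the function name `align` is hypothetical, and when several tracebacks are equally good it simply takes the first one it finds.

```python
def align(x, y, match=2, mismatch=-1, gap=-1, local=False):
    """Global (Needleman-Wunsch) or local (Smith-Waterman) alignment.

    x labels the columns and y the rows, as on the slide.
    Returns (score, aligned_x, aligned_y).
    """
    rows, cols = len(y) + 1, len(x) + 1

    # 1. Initialization: gap penalties (global) or zeros (local).
    score = [[0] * cols for _ in range(rows)]
    if not local:
        for j in range(cols):
            score[0][j] = j * gap
        for i in range(rows):
            score[i][0] = i * gap

    # 2. Scoring: best of the diagonal, up and left moves.
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + s,
                       score[i - 1][j] + gap,
                       score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # negative values are replaced by 0
                if cell > best:
                    best, best_pos = cell, (i, j)
            score[i][j] = cell

    # 3. Traceback: from the last cell (global) or the highest cell (local).
    i, j = best_pos if local else (rows - 1, cols - 1)
    final = score[i][j]
    ax, ay = [], []
    while (i > 0 or j > 0) and not (local and score[i][j] == 0):
        if i > 0 and j > 0:
            s = match if y[i - 1] == x[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            ax.append(x[j - 1])
            ay.append(y[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append("-")
            ay.append(y[i - 1])
            i -= 1
        else:
            ax.append(x[j - 1])
            ay.append("-")
            j -= 1
    return final, "".join(reversed(ax)), "".join(reversed(ay))


print(align("AGTTA", "AGTGCA"))              # global
print(align("AGTTA", "AGTGCA", local=True))  # local
```

The only differences between the two modes are the initialization, the clamping of negative cells to 0, and where the traceback starts and stops.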

Needleman-Wunsch vs Smith-Waterman

Sequence alignment teacher (http://melolab.org/websoftware/web/?sid=3)

Dynamic programming: example

• http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html

• Scoring

– Match = +2

– Mismatch = -2

– Gap = -1

Dynamic programming exercise

• Generate a scoring matrix for nucleotides (A, C, G, and T)

• Align two sequences using dynamic programming and the scoring matrix you just generated

• Align two sequences using the following tools

– EMBOSS Needle

• http://www.ebi.ac.uk/Tools/psa/emboss_needle/

– EMBOSS Water

• http://www.ebi.ac.uk/Tools/psa/emboss_water/

DP exercise: Needle vs Water

MSA: What and Why?

• A multiple sequence alignment (MSA) is an alignment of three or more sequences

• Why MSA?

– To identify patterns of conservation across more than 2 sequences

– To characterize protein families and generate profiles of protein families

– To infer relationships within and among gene families

– To predict secondary and tertiary structures of new sequences

– To perform phylogenetic studies

MSA application in variation interpretation

• Interpreting the impacts of genetic variants

• All tools for variation interpretation use MSAs of protein and

nucleotide sequences

• http://structure.bmc.lu.se/services.php

• PON-P2

• PON-BTK

• PON-MMR2

• PON-mt-tRNA

• PON-Diso

• PON-Sol

• PON-PS

• PPSC

Are there tools to perform MSA?

Do you remember dynamic programming?

[Figure panels: 2 sequences; 3 sequences. Source: http://ai.stanford.edu/~serafim/CS262_2005/LectureNotes/Lecture17.pdf]

How to align three sequences using DP?


– Align each pair of sequences

– Sum scores for each pair at each position
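Summing the pairwise scores at each position can be sketched as a sum-of-pairs column score. This is an illustration only: the match, mismatch and gap values, the rule that two facing gaps score 0, the helper names and the toy three-sequence alignment are all assumptions.

```python
from itertools import combinations


def pair_score(x, y, match=2, mismatch=-1, gap=-1):
    """Score one pair of characters taken from the same alignment column."""
    if x == "-" and y == "-":
        return 0  # assumption: two gaps facing each other are not penalized
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(column):
    """Add up the scores of every pair of characters in one column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


def msa_score(rows):
    """Score a multiple alignment given as equal-length gapped strings."""
    return sum(sum_of_pairs(col) for col in zip(*rows))


toy_alignment = ["AGT-A", "AGTTA", "A-TTA"]  # hypothetical example
for col in zip(*toy_alignment):
    print("".join(col), sum_of_pairs(col))
print("total:", msa_score(toy_alignment))
```

For three sequences every column contributes three pairwise comparisons, so the work per cell grows quickly as more sequences are added.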

MSA: Progressive alignment

• Progressive sequence alignment

– Hierarchical or tree based method

– E.g. ClustalW, T-Coffee, Clustal Omega

• Basic steps

– Calculate pairwise distances based on pairwise alignments between the sequences

– Build a guide tree, which is an inferred phylogeny for the sequences

– Align the sequences progressively, following the order given by the guide tree
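The first step, turning pairwise comparisons into distances, can be sketched as follows. This is a simplified illustration: the sequences are made-up toy data that are assumed to be already aligned, the distance is the plain fraction of differing positions, and `p_distance` is a hypothetical helper rather than what any particular tool computes.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences.

    Columns containing a gap in either sequence are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


# Toy sequences (assumption), already aligned and of equal length.
seqs = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACGAACTA",
    "s4": "TCGAACTT",
}

distances = {(a, b): p_distance(seqs[a], seqs[b])
             for a, b in combinations(seqs, 2)}
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, d)

# The closest pair is the natural first pair to align.
closest = min(distances, key=distances.get)
print("closest pair:", closest)
```

A guide tree built from such distances determines the order in which sequences, and later groups of sequences, are aligned.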

Progressive MSA

[Figure not reproduced: progressive alignment of five sequences, numbered 1 to 5]

MSA: Iterative alignment

• Iterative sequence alignment

– Improved progressive alignment

– Aligns the sequences repeatedly

– E.g. MUSCLE

Iterative MSA

• Follows 3 steps

– Progressive alignment

– Second progressive alignment

– Refinement

Phylogenetic tree

• A phylogenetic tree shows evolutionary relationships between the sequences

• Types:

– Rooted

• Nodes represent the most recent common ancestors

• Edge lengths represent time estimates

– Unrooted

• No ancestry or time estimates

• Algorithms to generate phylogenetic tree

– Neighbor-joining

– Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

– Maximum parsimony
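Of the listed algorithms, UPGMA is the simplest to sketch: repeatedly join the two closest clusters and average their distances to all others, weighted by cluster size. The code below is a minimal illustration on a made-up distance matrix. The function name `upgma` and the nested-tuple tree representation are assumptions, and branch lengths are not computed.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a tree by UPGMA.

    dist maps frozenset({a, b}) to the distance between a and b.
    Returns the tree as nested tuples.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            # Size-weighted average of the distances to the merged clusters.
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Toy distances (assumption): A and B are closest, D is the most distant.
toy = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6,
    frozenset(("B", "C")): 6, frozenset(("A", "D")): 10,
    frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
tree = upgma(["A", "B", "C", "D"], toy)
print(tree)
```

UPGMA assumes a constant rate of change along all branches. Neighbor-joining drops that assumption, which is one reason it is often preferred.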

Neighbor joining method

http://en.wikipedia.org/wiki/Neighbor_joining

MSA exercise

• Align the protein sequences SET 1 and SET 2 using MSA tools and compare the alignments

• Clustal Omega

– http://www.ebi.ac.uk/Tools/msa/clustalo/

• MUSCLE

– http://www.ebi.ac.uk/Tools/msa/muscle/

What to align: DNA or protein sequence?

• Many mismatches in DNA sequences are synonymous

• DNA sequences contain non-coding regions, which should be avoided in homology searching

• Matches are more reliable in protein sequences

– Probability of a match occurring by chance at any position in a sequence

• Amino acids: 1/20 = 0.05

• Nucleotides: 1/4 = 0.25

• Searching at the protein level: in case of frameshifts, the alignment score for the protein sequences may be very low even though the DNA sequences are similar

ACT TTT CAT GGG ...
Thr Phe His Gly ...

ACT TTT TCA TGG G..
Thr Phe Ser Trp

If an ORF exists, then always align at the protein level
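The frameshift example above can be replayed in code. This is a small sketch: the codon table contains only the six codons that appear on the slide, and the function name `translate` is hypothetical.

```python
# Partial codon table: only the codons used in the slide's example.
CODONS = {
    "ACT": "Thr", "TTT": "Phe", "CAT": "His",
    "GGG": "Gly", "TCA": "Ser", "TGG": "Trp",
}


def translate(dna):
    """Translate the complete codons in reading frame 1; leftover bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return [CODONS[dna[i:i + 3]] for i in range(0, usable, 3)]


original = "ACTTTTCATGGG"
shifted = "ACTTTTTCATGGG"  # one extra T inserted after the second codon

print(translate(original))
print(translate(shifted))
```

A single inserted base leaves the DNA sequences almost identical, but every codon after the insertion is read differently, so the protein sequences diverge from that point on.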

Searching bioinformatics databases

Learning goals

• How to find information using Keywords?

• How to find information using Sequences?

• How to find correct gene names?

Search strategy

• Keyword search

– Find information related to specific keywords

– Each bioinformatics database has its own search tool

– Some search tools have a broad scope: they access multiple databases and gather the results together

– E.g. Gquery, EBI search

• Sequence search

– Use a sequence of interest to find more information about the sequence

– BLAST, FASTA

Keyword search

• Find information related to specific keywords

• Gquery

– A central search tool to find information in NCBI databases

– Searches a large number of NCBI databases and shows the results on one page

– http://www.ncbi.nlm.nih.gov/gquery

• EBI search

– Search tool to find information from databases developed, managed and hosted by EMBL-EBI

– http://www.ebi.ac.uk/services

Limitations

• Synonyms

• Misspellings

• Old and new names/terms

• NOTES:

– Use different synonyms and read the literature to find more appropriate keywords

– Use boolean operators to combine different keywords

– Do not expect to find all the information using keyword search alone

– Note the database version or the version of entries in the databases you used
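Combining synonyms with boolean operators can be sketched as simple string building. This is an illustration only: the helper names `any_of` and `all_of` are hypothetical, the extra keyword `mutation` is made up for the example, and the exact quoting and operator syntax depends on the database being searched.

```python
def any_of(synonyms):
    """Join synonyms with OR, quoting each one so multi-word names stay together."""
    return "(" + " OR ".join(f'"{s}"' for s in synonyms) + ")"


def all_of(*groups):
    """Require every group of terms by joining them with AND."""
    return " AND ".join(groups)


gene = any_of(["ELA2", "ELANE"])    # old and new symbol for the same gene
virus = any_of(["HIV 1", "HIV-1"])  # spelling variants

print(gene)
print(virus)
print(all_of(gene, "mutation"))     # "mutation" is an illustrative keyword
```

Searching for all known synonyms at once avoids missing entries that were annotated under an older name or a different spelling.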

[Figure not reproduced: overlap of search hits for the synonyms ELA2 / ELANE and for HIV 1 / HIV-1; panels labelled PubMed and ClinVar]

Gene nomenclature

• HUGO Gene Nomenclature Committee (HGNC)

– Assigns standardized nomenclature to human genes

– Each symbol is unique and each gene is given only one name

• Species specific nomenclature committees

– Mouse Genome Informatics Database

• http://www.informatics.jax.org/mgihome/nomen/

– Rat Genome Database

• http://rgd.mcw.edu/nomen/nomen.shtml

HGNC symbol report

• Approved symbol

• Approved name

• Synonyms

– Terms used in literature to indicate the gene

– HGNC, Ensembl, Entrez Gene, OMIM

• Previous symbols and names

– Previous HGNC approved symbol

• NOTE: HGNC does not approve protein names. Usually genes and proteins have the same name, and gene names are written in italics.

HGNC search

Keyword search

• Exercise
