
Basic Bioinformatics: Multiple Sequence Alignment

Rafael Dias Mesquita [email protected]

Laboratório de Bioinformática

Departamento de Bioquímica Instituto de Química - UFRJ

Why do we do alignments?

- Correspondence: find out which parts "do the same thing"
  - Similar genes are conserved across widely divergent species, often performing similar functions
- Structure prediction
  - Use knowledge of the structure of one or more members of a protein MSA to predict the structure of other members
  - Structure is more conserved than sequence
- Create "profiles" for protein families
  - Allow us to search for other members of the family
- Genome assembly: automated reconstruction of "contig" maps of genomic fragments such as ESTs
- MSA is the starting point for phylogenetic analysis

An example of Multiple Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
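As a rough illustration of what a "profile" built from such an alignment might contain, the sketch below counts the residues in each column of the example and lists the columns where every sequence agrees. The function names `column_counts` and `conserved_columns` are hypothetical, and real profile methods (PSSMs, HMMs) use weighted, log-odds scores rather than raw counts.

```python
from collections import Counter

# The eight aligned sequences from the example above.
ALIGNMENT = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def column_counts(alignment):
    """Return one Counter per alignment column (residue -> number of sequences)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*alignment)]


def conserved_columns(alignment):
    """Return 0-based indices of columns where all sequences share one residue."""
    return [
        index
        for index, counts in enumerate(column_counts(alignment))
        if len(counts) == 1 and "-" not in counts
    ]


if __name__ == "__main__":
    for index in conserved_columns(ALIGNMENT):
        print(index, column_counts(ALIGNMENT)[index])
```

In this example only two columns are fully conserved: the C at position 5 and the W at position 21 (counting from 1).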

Global vs. Local

- Global: both sequences are aligned along their entire lengths
- Local: the best subsequence alignment is found
- A global alignment of two genomic sequences may not align exons
- A local alignment would only pick out the maximum-scoring exon

Global Alignment

- Based on the Needleman-Wunsch algorithm
- An example: align COELACANTH and PELICAN
- Scoring scheme: +1 if letters match, -1 for mismatches, -1 for gaps

Two alignments with the same total score (0 under this scheme):

COELACANTH
P-ELICAN--

COELACANTH
-PELICAN--

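Needleman-Wunsch Details

- Two-dimensional matrix
- Diagonal step when two letters align
- Horizontal (or vertical) step when a letter is paired with a gap

[Figure: path through the COELACANTH x PELICAN matrix. E, L, C, A and N align on the diagonal, I is paired with A, and T and H are paired with gaps.]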

Needleman-Wunsch

- In reality, each cell of the matrix contains a score and a pointer
- The score is derived from the scoring scheme (-1 or +1 in our example)
- The pointer is an arrow that points up, left, or diagonally
- After initializing the matrix, compute the score and arrow for each cell

Algorithm

- For each cell, compute
  - Match score: sum of the preceding diagonal cell and the score of aligning the two letters (+1 if match, -1 if no match)
  - Horizontal gap score: sum of the score to the left and the gap score (-1)
  - Vertical gap score: sum of the score above and the gap score (-1)
- Choose the highest score and point the arrow towards the maximum cell
- When you finish, trace the arrows back from the lower right to get the alignment
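The steps above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `needleman_wunsch` is hypothetical, the first row and column are assumed to be initialized with cumulative gap penalties (the usual convention), and ties between equally good moves are broken in favor of the diagonal, so only one of the possible optimal alignments is returned.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    pointer = [[None] * cols for _ in range(rows)]

    # Assumed initialization: leading gaps accumulate the gap penalty.
    for i in range(1, rows):
        score[i][0], pointer[i][0] = i * gap, "up"
    for j in range(1, cols):
        score[0][j], pointer[0][j] = j * gap, "left"

    # Fill each cell with the best of the three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, "diag"),  # align two letters
                (score[i][j - 1] + gap, "left"),       # gap in a
                (score[i - 1][j] + gap, "up"),         # gap in b
            ]
            score[i][j], pointer[i][j] = max(candidates, key=lambda c: c[0])

    # Trace the pointers back from the lower-right cell.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        move = pointer[i][j]
        if move == "diag":
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move == "left":
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
        else:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("COELACANTH", "PELICAN")
    print(total)
    print(top)
    print(bottom)
```

For COELACANTH and PELICAN with the +1/-1/-1 scheme the optimal score is 0. Which of the equally scoring alignments is printed depends on the tie-breaking rule.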

Algorithm: Matrix initialization

     C   O   E   L   A   C   A   N   T   H
P   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
E   -1  -1  +1  -1  -1  -1  -1  -1  -1  -1
L   -1  -1  -1  +1  -1  -1  -1  -1  -1  -1
I   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
C   +1  -1  -1  -1  -1  +1  -1  -1  -1  -1
A   -1  -1  -1  -1  +1  -1  +1  -1  -1  -1
N   -1  -1  -1  -1  -1  -1  -1  +1  -1  -1



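Algorithm: Computing sum of scores

Scores along the alignment path, following the letters of PELICAN ("-" marks a gap):

PELICAN letter   Step score   Running sum
P                    -1           -1
-                    -1           -2
E                    +1           -1
L                    +1            0
I                    -1           -1
C                    +1            0
A                    +1           +1
N                    +1           +2
-                    -1           +1
-                    -1            0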
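The running totals on this slide can be reproduced by walking along a finished alignment column by column. The sketch below assumes the +1/-1/-1 scheme from the example. The helper name `running_scores` is hypothetical.

```python
def running_scores(top, bottom, match=1, mismatch=-1, gap=-1):
    """Return (top_char, bottom_char, step_score, running_sum) for each column."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    rows = []
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            step = gap          # a letter paired with a gap
        elif x == y:
            step = match        # identical letters
        else:
            step = mismatch     # different letters
        total += step
        rows.append((x, y, step, total))
    return rows


if __name__ == "__main__":
    for x, y, step, total in running_scores("COELACANTH", "P-ELICAN--"):
        print(f"{x}/{y}  {step:+d}  {total:+d}")
```

For the alignment COELACANTH / P-ELICAN-- the running sums are -1, -2, -1, 0, -1, 0, +1, +2, +1, 0, so the final alignment score is 0.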

Algorithm: Finding alignment

Tracing the arrows back from the lower-right cell (final score 0) gives the alignment. Two alignments reach this score:

COELACANTH
P-ELICAN--

COELACANTH
-PELICAN--

Local Alignment

- Smith-Waterman algorithm
- Modification of Needleman-Wunsch
  - Edges of the matrix are initialized to 0
  - The score in a cell is never less than 0
  - No pointer unless the score is greater than 0
  - Trace-back starts at the highest score (rather than the lower right) and ends at 0
- How do these changes affect the algorithm?
- An example: align COELACANTH and PELICAN
- Scoring scheme: +1 if letters match, -1 for mismatches, -1 for gaps

Smith-Waterman

ELACAN

ELICAN


0    C   O   E   L   A   C   A   N   T   H
P   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
E   -1  -1  +1  -1  -1  -1  -1  -1  -1  -1
L   -1  -1  -1  +1  -1  -1  -1  -1  -1  -1
I   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
C   +1  -1  -1  -1  -1  +1  -1  -1  -1  -1
A   -1  -1  -1  -1  +1  -1  +1  -1  -1  -1
N   -1  -1  -1  -1  -1  -1  -1  +1  -1  -1



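Algorithm: Computing sum of scores

Scores along the alignment path, following the letters of PELICAN ("-" marks a gap). The edges start at 0 and the running sum never drops below 0:

PELICAN letter   Step score   Running sum
P                    -1            0
-                    -1            0
E                    +1           +1
L                    +1           +2
I                    -1           +1
C                    +1           +2
A                    +1           +3
N                    +1           +4
-                    -1           +3
-                    -1           +2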


Algorithm : Finding alignment

ELACAN

ELICAN
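A minimal Smith-Waterman sketch, written to mirror the modifications listed for local alignment: zero edges, scores floored at 0, no pointer for zero cells, and a trace-back that starts at the highest-scoring cell. The function name `smith_waterman` is hypothetical, and ties are broken in favor of the diagonal move and the first maximum found, which is one choice among several.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]        # edges stay at 0
    pointer = [[None] * cols for _ in range(rows)]   # None means "stop here"
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, "diag"),
                (score[i][j - 1] + gap, "left"),
                (score[i - 1][j] + gap, "up"),
            ]
            value, move = max(candidates, key=lambda c: c[0])
            if value > 0:  # otherwise the cell keeps score 0 and no pointer
                score[i][j], pointer[i][j] = value, move
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest score until a cell without a pointer.
    i, j = best_pos
    out_a, out_b = [], []
    while pointer[i][j] is not None:
        move = pointer[i][j]
        if move == "diag":
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move == "left":
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
        else:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("COELACANTH", "PELICAN"))  # (4, 'ELACAN', 'ELICAN')
```

Because the trace-back stops at the first zero cell, the unmatched ends (CO, TH and P) never enter the result, which is why the local alignment keeps only ELACAN / ELICAN.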

Multiple Sequence Alignment: Approaches

- Optimal Global Alignments: find the alignment that maximizes a score function
- Global Progressive Alignments: match closely related sequences first, using a guide tree
- Global Consistency-based Alignments: progressive approach that uses pairwise information to optimize the alignment
- Global Iterative Alignments: multiple re-building attempts to find the best alignment
- Structure, HMM based alignments
  - Praline-web, HMMalign

Optimal Global Alignments

- Usually use dynamic programming
- Generalization of Needleman-Wunsch
- Find the alignment that maximizes a score function
- 2 sequences => 2-dimensional matrix
- n sequences => n-dimensional matrix
- Computationally expensive: time grows as the product of the sequence lengths
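To get a feel for how quickly the n-dimensional matrix grows, the sketch below counts the cells of the dynamic-programming table and the number of predecessor moves each interior cell must consider. The helper name `dp_table_size` is hypothetical. It assumes one extra row or column per sequence for the leading-gap edge, and 2^n - 1 moves per cell because each sequence either advances or takes a gap, excluding the case where none advances.

```python
from math import prod


def dp_table_size(lengths):
    """Return (number_of_cells, moves_per_cell) for an n-dimensional alignment table."""
    n = len(lengths)
    cells = prod(length + 1 for length in lengths)  # +1 for the edge in each dimension
    moves_per_cell = 2 ** n - 1                     # every non-empty subset of sequences advances
    return cells, moves_per_cell


if __name__ == "__main__":
    print(dp_table_size([10, 7]))           # COELACANTH vs PELICAN
    print(dp_table_size([100, 100, 100]))   # three sequences of length 100
    print(dp_table_size([100] * 8))         # eight sequences of length 100
```

Two sequences of lengths 10 and 7 need 88 cells with 3 moves each, while three sequences of length 100 already need more than a million cells with 7 moves each.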

Optimal Global Alignments: Examples for pairwise alignment

- NWalign: http://zhanglab.ccmb.med.umich.edu/NW-align/
- FOGSAA (uses a branch and bound approach; faster; no web service yet): http://www.nature.com/srep/2013/130429/srep01746/full/srep01746.html

Progressive Multiple Alignment Method

- Compare all sequences pairwise.
- Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering.
- Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. For example, for a set of sequences A, B, C, D in which A has been aligned with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D, using averaged scores at each aligned position.
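One way to read "averaged scores at each aligned position" is to score a column of one sub-alignment against a column of the other by averaging over all letter pairs. The sketch below shows that idea under stated assumptions: the names `pair_score` and `column_score` are hypothetical, the +1/-1/-1 scheme from the pairwise example is reused, and a gap paired with a gap is assumed to score 0. Real programs use substitution matrices and sequence weights.

```python
def pair_score(x, y, match=1, mismatch=-1, gap=-1):
    """Score one letter against another; '-' denotes a gap."""
    if x == "-" and y == "-":
        return 0            # assumption: gap against gap is neutral
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def column_score(column_x, column_y):
    """Average pair score between two alignment columns.

    column_x holds the letters of one fixed sub-alignment at some position
    (for example A and C), column_y those of the other (for example B and D).
    """
    scores = [pair_score(x, y) for x in column_x for y in column_y]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(column_score("LL", "LL"))  # every pair matches
    print(column_score("AA", "AC"))  # half the pairs match
    print(column_score("A-", "AA"))  # one sequence has a gap here
```

The averaged value can then take the place of the single match or mismatch score when two fixed sub-alignments are aligned to each other with the same dynamic-programming recurrence used for two sequences.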

[Figure: Steps in alignment]

Progressive Multiple Alignment Method: Examples

- ClustalW: http://www.clustal.org/clustal2/ http://www.ebi.ac.uk/Tools/msa/clustalw2/
- Kalign (useful for large datasets): http://msa.sbc.su.se/cgi-bin/msa.cgi http://www.ebi.ac.uk/Tools/msa/kalign/

Global Consistency-based Alignments

- Use a progressive strategy
- During progression the existing alignment can change; the method does more than add sequences
- Can incorporate multiple sequence information when scoring pairwise alignments
- Can consider additional information about the pairwise alignments when constructing multiple alignments
- Can use probabilities

Global Consistency-based Alignments: Examples

- ProbCons (uses a scoring system that includes probabilities): http://probcons.stanford.edu/
- T-Coffee (slow; useful for small datasets): http://www.tcoffee.org/ http://www.ebi.ac.uk/Tools/msa/tcoffee/

Global Iterative Alignments

- Can build the distance matrix from k-mer distances (related sequences have more k-mers in common) and construct the guide tree using UPGMA
- Use a progressive alignment strategy
- Iterations refine the multiple alignment by cutting it into two parts and realigning the profiles created from both parts
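The k-mer distance and UPGMA steps can be sketched as follows. This is an illustration under assumptions, not the procedure of any particular program: the names `kmer_distance` and `upgma_order` are hypothetical, the distance is taken as one minus the fraction of shared distinct k-mers (relative to the smaller set), and every sequence is assumed to be at least k letters long.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Assumed distance: 1 - shared k-mers / size of the smaller k-mer set."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def upgma_order(seqs, k=3):
    """Return the UPGMA merges, in order, as nested tuples of sequence names."""
    sizes = {name: 1 for name in seqs}  # cluster -> number of sequences
    dist = {
        frozenset((a, b)): kmer_distance(seqs[a], seqs[b], k)
        for a, b in combinations(seqs, 2)
    }
    merges = []
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # UPGMA: distance to the new cluster is the size-weighted average.
        for other in sizes:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
        merges.append(merged)
    return merges


if __name__ == "__main__":
    example = {
        "seq1": "VTISCTGSSSNIGAGNHVKWYQQLPG",
        "seq2": "VTISCTGTSSNIGSITVNWYQQLPG",
        "seq3": "PEVTCVVVDVSHEDPQVKFNWYVDG",
    }
    for merge in upgma_order(example):
        print(merge)
```

The order of the merges is the guide tree: the first merge names the pair to align first, and each later merge tells the progressive step which alignments to combine next.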

Global Iterative Alignments: Examples

- MAFFT (useful from small to large datasets; can perform progressive, consistency or iterative alignments): http://mafft.cbrc.jp/alignment/server/ http://www.ebi.ac.uk/Tools/msa/mafft/
- Muscle: http://www.drive5.com/muscle/ http://www.ebi.ac.uk/Tools/msa/muscle/

Structure, HMM based alignments

- Based on a progressive strategy
- Use secondary structure information
- Use information about transmembrane regions
- Pre-profile processing using PSI-BLAST: homologous sequences help guide the alignment
- Praline-web: http://www.ibi.vu.nl/programs/pralinewww/

Quality evaluation

- MUMSA (compares different multiple alignments to evaluate quality): http://msa.sbc.su.se/cgi-bin/msa.cgi
- Core-COFFE (indicates reliable regions with colors): http://tcoffee.crg.cat/apps/tcoffee/do:core
- iRMSD-COFFE (evaluates a multiple sequence alignment using structural information): http://tcoffee.crg.cat/apps/tcoffee/do:irmsd
- Leon: http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?Leon+noid
- FACET (no web server available; read the paper for a review): http://facet.cs.arizona.edu/
