Basic Bioinformatics: Multiple Sequence Alignment
Rafael Dias Mesquita
Laboratório de Bioinformática
Departamento de Bioquímica Instituto de Química - UFRJ
Why do we do alignments?
- Correspondence: find out which parts “do the same thing”
  - Similar genes are conserved across widely divergent species, often performing similar functions
- Structure prediction
  - Use knowledge of the structure of one or more members of a protein MSA to predict the structure of other members
  - Structure is more conserved than sequence
- Create “profiles” for protein families
  - Allow us to search for other members of the family
- Genome assembly: automated reconstruction of “contig” maps of genomic fragments such as ESTs
- MSA is the starting point for phylogenetic analysis
An example of Multiple Alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
Global vs. Local
- Global: both sequences aligned along their entire lengths
- Local: best subsequence alignment found
- Global alignment of two genomic sequences may not align exons
- Local alignment would only pick out the maximum-scoring exon
Global Alignment
- Based on the Needleman-Wunsch algorithm
- An example: align COELACANTH and PELICAN
- Scoring scheme: +1 if letters match, -1 for mismatches, -1 for gaps
COELACANTH
P-ELICAN--
COELACANTH
-PELICAN--
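As a quick check of the scoring scheme, the two alignments above can be scored column by column. This is a minimal sketch; the function name `score_alignment` and its parameter names are illustrative, not part of the original slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum the column scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # a letter paired with a gap
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


print(score_alignment("COELACANTH", "P-ELICAN--"))  # 0
print(score_alignment("COELACANTH", "-PELICAN--"))  # 0
```

Both alignments have five matches, two mismatches and three gaps, so they receive the same score under this scheme.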
Needleman-Wunsch Details
- Two-dimensional matrix
- Diagonal move when two letters align
- Horizontal (or vertical) move when a letter is paired with a gap
[Figure: alignment path through the matrix, with COELACANTH across the top and PELICAN down the side]
Needleman-Wunsch
- In reality, each cell of the matrix contains a score and a pointer
- The score is derived from the scoring scheme (-1 or +1 in our example)
- The pointer is an arrow that points up, left, or diagonally
- After initializing the matrix, compute the score and arrow for each cell
Algorithm
- For each cell, compute
  - Match score: sum of the preceding diagonal cell and the score of aligning the two letters (+1 if match, -1 if no match)
  - Horizontal gap score: sum of the score to the left and the gap score (-1)
  - Vertical gap score: sum of the score above and the gap score (-1)
- Choose the highest score and point the arrow towards the maximum cell
- When you finish, trace the arrows back from the lower right to get the alignment
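The steps above can be sketched in Python as follows. This is an illustrative sketch, not the slides' own implementation. It assumes the usual convention that the first row and column hold cumulative gap penalties, and it breaks ties in favour of the diagonal move, so it returns only one of the equally scoring alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    # pointer[i][j] records which neighbouring cell gave that score.
    pointer = [[None] * cols for _ in range(rows)]

    # Assumed initialization: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0], pointer[i][0] = i * gap, "up"
    for j in range(1, cols):
        score[0][j], pointer[0][j] = j * gap, "left"

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, "diag"),  # match score
                (score[i][j - 1] + gap, "left"),       # horizontal gap score
                (score[i - 1][j] + gap, "up"),         # vertical gap score
            ]
            # max() keeps the first maximal candidate, so ties prefer "diag".
            score[i][j], pointer[i][j] = max(candidates, key=lambda c: c[0])

    # Trace the pointers back from the lower-right cell.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        move = pointer[i][j]
        if move == "diag":
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:  # "left"
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("COELACANTH", "PELICAN")
print(best)
print(top)
print(bottom)
```

For COELACANTH and PELICAN the best global score under this scheme is 0. Which of the tied alignments is printed depends on the tie-breaking rule.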
Algorithm: Matrix initialization
     C   O   E   L   A   C   A   N   T   H
P   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
E   -1  -1  +1  -1  -1  -1  -1  -1  -1  -1
L   -1  -1  -1  +1  -1  -1  -1  -1  -1  -1
I   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
C   +1  -1  -1  -1  -1  +1  -1  -1  -1  -1
A   -1  -1  -1  -1  +1  -1  +1  -1  -1  -1
N   -1  -1  -1  -1  -1  -1  -1  +1  -1  -1
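Algorithm: Computing sum of scores
(path shown for COELACANTH / P-ELICAN--)
Aligned pair   Step score   Running total
C / P             -1            -1
O / -             -1            -2
E / E             +1            -1
L / L             +1             0
A / I             -1            -1
C / C             +1             0
A / A             +1            +1
N / N             +1            +2
T / -             -1            +1
H / -             -1             0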
Algorithm : Finding alignment
Tracing the arrows back from the lower-right cell gives two alignments with the same score (0):
COELACANTH
P-ELICAN--
COELACANTH
-PELICAN--
Local Alignment
- Smith-Waterman algorithm
  - Modification of Needleman-Wunsch
  - Edges of the matrix are initialized to 0
  - The score in a cell is never less than 0
  - No pointer unless the score is greater than 0
  - Trace-back starts at the highest score (rather than the lower right) and ends at 0
- How do these changes affect the algorithm?
- An example: align COELACANTH and PELICAN
- Scoring scheme: +1 if letters match, -1 for mismatches, -1 for gaps
Smith-Waterman
ELACAN
ELICAN
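The local alignment above can be reproduced with a short Python sketch of the Smith-Waterman changes: zero edges, a floor of 0 in every cell, and a trace-back that starts at the highest score and stops at 0. The function name and the tie-breaking order (diagonal first) are assumptions made for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Locally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Edges stay at 0, which lets an alignment start anywhere.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # never drop below 0
                score[i - 1][j - 1] + pair,  # match or mismatch
                score[i][j - 1] + gap,       # horizontal gap
                score[i - 1][j] + gap,       # vertical gap
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("COELACANTH", "PELICAN"))  # (4, 'ELACAN', 'ELICAN')
```

The score of 4 comes from five matches (E, L, C, A, N) and one mismatch (A against I). The unmatched ends of both words are simply left out of the alignment.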
Algorithm: Matrix initialization
0    C   O   E   L   A   C   A   N   T   H
P   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
E   -1  -1  +1  -1  -1  -1  -1  -1  -1  -1
L   -1  -1  -1  +1  -1  -1  -1  -1  -1  -1
I   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
C   +1  -1  -1  -1  -1  +1  -1  -1  -1  -1
A   -1  -1  -1  -1  +1  -1  +1  -1  -1  -1
N   -1  -1  -1  -1  -1  -1  -1  +1  -1  -1
Algorithm: Computing sum of scores
Aligned pair   Step score   Running total (never below 0)
(start)                          0
C / P             -1             0
O / -             -1             0
E / E             +1            +1
L / L             +1            +2
A / I             -1            +1
C / C             +1            +2
A / A             +1            +3
N / N             +1            +4
T / -             -1            +3
H / -             -1            +2
Algorithm : Finding alignment
ELACAN
ELICAN
Multiple Sequence Alignment: Approaches
- Optimal Global Alignments: find the alignment that maximizes a score function
- Global Progressive Alignments: match closely related sequences first, using a guide tree
- Global Consistency-based Alignments: progressive approach that uses pairwise information to optimize the alignment
- Global Iterative Alignments: multiple rebuilding attempts to find the best alignment
- Structure, HMM based alignments
  - Praline-web, HMMalign
Optimal Global Alignments
- Usually use dynamic programming
- Generalization of Needleman-Wunsch
- Find the alignment that maximizes a score function
- 2 sequences => 2-dimensional matrix
- n sequences => n-dimensional matrix
- Computationally expensive: time grows as the product of the sequence lengths
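To get a feel for that growth, the number of cells in the full dynamic-programming matrix can be counted directly. This is a small illustrative sketch; `dp_cells` is a hypothetical helper, and it counts one extra row or column per sequence for the gap edge.

```python
from math import prod


def dp_cells(lengths):
    """Number of cells in the full n-dimensional DP matrix."""
    # Each dimension has one entry per letter plus the initial edge.
    return prod(n + 1 for n in lengths)


print(dp_cells([10, 7]))  # COELACANTH vs PELICAN: 11 * 8 = 88 cells

# Sequences of length 100: the matrix grows by a factor of 101 per sequence.
for n_seqs in (2, 3, 5, 10):
    print(n_seqs, dp_cells([100] * n_seqs))
```

Two sequences of length 100 need about ten thousand cells, while ten such sequences would need roughly 10^20. This is why the heuristic approaches in the following slides are used in practice.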
Optimal Global Alignments: Examples for pairwise alignment
- NWalign http://zhanglab.ccmb.med.umich.edu/NW-align/
- FOGSAA (uses a branch-and-bound approach; faster; no web service yet) http://www.nature.com/srep/2013/130429/srep01746/full/srep01746.html
Progressive Multiple Alignment Method
- Compare all sequences pairwise.
- Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering.
- Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. Thus, for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D, using averaged scores at each aligned position.
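The first two steps (pairwise comparison and clustering) can be sketched as below. This is a simplified illustration under stated assumptions: similarity is plain positional identity between equal-length toy sequences rather than a real pairwise alignment score, clusters are joined by average linkage, and the names `identity`, `merge_order` and `toy` are made up for this example. The final step, aligning the alignments themselves, is not shown.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions (crude stand-in for an alignment score)."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))


def merge_order(seqs):
    """Return the order in which clusters are joined, most similar first."""

    def avg_sim(c1, c2):
        # Average pairwise similarity between the members of two clusters.
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(identity(seqs[x], seqs[y]) for x, y in pairs) / len(pairs)

    clusters = [(name,) for name in seqs]
    order = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: avg_sim(*p))
        order.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return order


# Toy data: A resembles C, and B resembles D.
toy = {"A": "GATTACA", "B": "CTGGTCA", "C": "GATTACC", "D": "CTGGTTT"}
for left, right in merge_order(toy):
    print(left, "+", right)
```

With these toy sequences, A is joined with C first, then B with D, and finally the two pairs are joined, mirroring the A, B, C, D example in the text.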
Steps in alignment
Progressive Multiple Alignment Method
Progressive Multiple Alignment Method: Examples
- ClustalW http://www.clustal.org/clustal2/ http://www.ebi.ac.uk/Tools/msa/clustalw2/
- Kalign: useful for large datasets http://msa.sbc.su.se/cgi-bin/msa.cgi http://www.ebi.ac.uk/Tools/msa/kalign/
Global Consistency-based Alignments
- Uses a progressive strategy
- During progression the base alignment itself can change, rather than only gaining new sequences
- Can incorporate multiple sequence information when scoring pairwise alignments
- Can consider additional information regarding pairwise alignments when constructing multiple alignments
- Can use probability
Global Consistency-based Alignments: Examples
- ProbCons: uses a scoring system that includes probabilities http://probcons.stanford.edu/
- T-Coffee: slow; useful for small datasets http://www.tcoffee.org/ http://www.ebi.ac.uk/Tools/msa/tcoffee/
Global Iterative Alignments
- Can construct a distance matrix based on k-mer distance (related sequences have more k-mers in common) and build the guide tree using UPGMA
- Uses a progressive alignment strategy
- Iterations refine the multiple alignment by cutting it into two parts and realigning the profiles built from each part
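The k-mer idea can be illustrated with a short Python sketch. The distance used here, one minus the fraction of shared k-mers relative to the shorter sequence, is just one simple definition chosen for illustration; the tools listed in this deck may define their k-mer distance differently.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """1 - (shared k-mers / number of k-mers in the shorter sequence)."""
    n_kmers = min(len(a), len(b)) - k + 1
    if n_kmers <= 0:
        raise ValueError("sequences must be at least k letters long")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / n_kmers


print(kmer_distance("COELACANTH", "PELICAN", k=2))  # 0.5
print(kmer_distance("PELICAN", "PELICAN", k=2))     # 0.0
```

COELACANTH and PELICAN share three 2-mers (EL, CA, AN) out of the six in PELICAN, hence the distance of 0.5. No alignment is needed to compute this, which is what makes k-mer distances cheap enough for building a first guide tree.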
Global Iterative Alignments: Examples
- MAFFT: useful for small to large datasets (can perform progressive, consistency-based, or iterative alignments) http://mafft.cbrc.jp/alignment/server/ http://www.ebi.ac.uk/Tools/msa/mafft/
- Muscle: http://www.drive5.com/muscle/ http://www.ebi.ac.uk/Tools/msa/muscle/
Structure, HMM based alignments
- Based on a progressive strategy
- Uses secondary structure information
- Uses transmembrane region information
- Pre-profile processing using PSI-BLAST: homologous sequences help guide the alignment
- Praline-web http://www.ibi.vu.nl/programs/pralinewww/
Quality evaluation
- MUMSA (compares different multiple alignments to evaluate quality) http://msa.sbc.su.se/cgi-bin/msa.cgi
- Core-COFFE (indicates reliable regions with colors) http://tcoffee.crg.cat/apps/tcoffee/do:core
- iRMSD-COFFE (evaluates a multiple sequence alignment using structural information) http://tcoffee.crg.cat/apps/tcoffee/do:irmsd
- Leon http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?Leon+noid
- FACET (no web server available; read the paper for a review) http://facet.cs.arizona.edu/