Center for Biological Sequence Analysis
Database Searching
Using alignment algorithms for finding similar sequences
Why do we want to compare sequences?
Evolutionary relationships
• Phylogenetic trees can be constructed based on comparison of the sequences of a molecule (example: 16S rRNA) taken from different species
• Residues conserved during evolution play an important role
Prediction of protein structure and function
• Proteins which are very similar in sequence generally have similar 3D structure and function as well
• By searching a sequence of unknown structure against a database of known proteins, the structure and/or function can in many cases be predicted
Things to keep in mind when working with alignments
Pairwise alignment programs always find the optimal alignment of two sequences
• They do so even if it does not make any sense at all to align the two sequences
• "Optimal" means optimal according to the substitution matrix and gap penalties you choose, even if you choose the wrong ones
Generally the underlying assumptions are wrong
• The frequency of substitution is not the same at all positions
• Nor are the frequencies of insertions and deletions the same
• Affine gap penalties do not properly model indel events
Using sequence alignment to search databases
The most common usage of pairwise sequence alignment is searching databases for related sequences
Although the alignments themselves may be unreliable, the alignment scores give a lot of information about which sequences are related and which are not
Having a set of related sequences is a lot more informative than just one sequence - even if nothing is known about the related sequences
Requirements in addition to an alignment method
A very fast method to find potentially related sequences
• Systematically searching through the databases with the alignment methods takes too long, even though dynamic programming is fast
• Some method to initially identify possible matches is therefore needed to speed up the search
A method to evaluate which matches to trust
• Statistics on the alignment score distributions can be used to calculate the significance of an alignment
• This way we can not only rank which matches are better than others but also tell if any of them are good at all
Local or global alignment
Generally local alignment is used for performing database searches
• In most cases you would be interested in knowing if any part of your sequence looks like something else
• The protein sequence databases have not been split into domains
It is not always the optimal thing to do but …
• In the case where the complete sequence should match, the local alignment score will be almost identical to the global one
• If you really want a global alignment you can make it afterwards
Differences between global and local alignments
Extra constraint on scoring function: The expected score for a random alignment must be negative
Because you can start a new alignment anywhere, dynamic programming scores cannot become negative
The trace-back is started at the highest value in the matrix rather than at the lower right corner
The trace-back is stopped as soon as a zero is encountered
The Smith-Waterman algorithm (local alignment)
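The three differences from global alignment (scores floored at zero, trace-back from the highest cell, trace-back stopping at zero) can be sketched in a few lines of Python. This is a minimal illustration only: it assumes a linear gap penalty and a simple match/mismatch score in place of a real substitution matrix and affine gaps, and the function name and default scores are made up for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment sketch with a linear gap penalty (illustrative scores)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; the max(0, ...) lets a new alignment start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest cell and stop at the first zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("XXACGTXX", "YYACGTYY"))  # (8, 'ACGT', 'ACGT')
```

Note that the mismatch and gap scores are negative, so the expected score of a random alignment is negative, as the scoring constraint above requires.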
Alignment score distributions
The local similarity scores for ungapped alignment of random sequences can be shown to follow an extreme value distribution:
$P(S \geq x) = 1 - \exp(-K m n e^{-\lambda x})$, where $m$ and $n$ are the sequence lengths while $K$ and $\lambda$ are free parameters
This turns out to be a very good approximation for gapped alignment as long as reasonably large gap penalties are used
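As a rough illustration, the extreme value formula can be evaluated directly. The sketch below assumes values for K and lambda that are placeholders only; in practice they are fitted for a given substitution matrix and gap penalty combination.

```python
import math


def evd_p_value(x, m, n, K, lam):
    """P(S >= x) = 1 - exp(-K * m * n * exp(-lam * x)).

    math.expm1 is used so that very small probabilities keep their precision.
    """
    return -math.expm1(-K * m * n * math.exp(-lam * x))


# Placeholder parameters, not fitted to any real scoring system.
K, lam = 0.1, 0.3
for score in (20, 40, 60, 80):
    print(score, evd_p_value(score, m=300, n=300, K=K, lam=lam))
```

The probability drops roughly exponentially with the score once the score is large, which is why a modest increase in score can turn a random-looking match into a highly significant one.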
Database searching
Positive reporting: When searching in a database we report only the few good matches
The expected number of database hits with a score of at least x can be calculated as:
$E(S \geq x) = D \cdot P(S \geq x)$, where $D$ is the number of entries in the database
E-values are much better for evaluating alignments than raw alignment scores or ”percent identity”
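A small sketch of the E-value calculation, assuming the extreme value form of P given earlier and placeholder values for K and lambda (both are assumptions for illustration, not fitted parameters):

```python
import math


def expected_hits(x, m, n, D, K, lam):
    """E(S >= x) = D * P(S >= x), with P taken from the extreme value distribution."""
    p = -math.expm1(-K * m * n * math.exp(-lam * x))
    return D * p


# Same score, same sequence lengths, two hypothetical database sizes.
small = expected_hits(60, m=300, n=300, D=100_000, K=0.1, lam=0.3)
large = expected_hits(60, m=300, n=300, D=1_000_000, K=0.1, lam=0.3)
print(small, large)
```

With everything else fixed, a database ten times larger gives an E-value ten times larger for the same raw score, so the same alignment becomes less significant.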
A curse or a blessing?
Large databases are a blessing …
• They are more likely to contain something similar to the query
… and a curse
• Increasing the size of the database decreases the significance of the hits you get
• Searching huge databases requires fast computers
What requirements this puts on software development
• The programs must be sped up or database searches will take longer and longer
• The false positive rate must be reduced to not lose specificity
Heuristic search algorithms
FASTA (Pearson 1995)
Uses heuristics to avoid calculating the full dynamic programming matrix
Speeds up searches by an order of magnitude compared to full Smith-Waterman
The statistical side of FASTA is still stronger than that of BLAST
BLAST (Altschul 1990, 1997)
Uses rapid word lookup methods to completely skip most of the database entries
Extremely fast
• One order of magnitude faster than FASTA
• Two orders of magnitude faster than Smith-Waterman
Almost as sensitive as FASTA
Coffee break
Top 10 ways to tell you drink too much coffee
10 Juan Valdez names his donkey after you
9 You get a speeding ticket even when you're parked
8 You grind your coffee beans in your mouth
7 You sleep with your eyes open
6 You watch videos in fast-forward
5 You lick your coffeepot clean
4 Your eyes stay open when you sneeze
3 The nurse needs a scientific calculator to take your pulse
2 You can type sixty words a minute with your feet
1 You can jump-start your car without jumper cables.
How BLAST works
The search is sped up by indexing the sequence databases in a so-called suffix array
• Three letter subsequences are used as keys to the sequences
• Closely related substitutions are also included
• This gives ~150 index keys for each sequence
This is used in two ways
• To quickly discard sequences that are not similar at all before even beginning to align them
• To constrain the alignment and thereby speed up the alignment procedure itself
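The word-lookup idea can be illustrated with a toy index. This sketch uses a plain Python dictionary keyed on exact three-letter words; it is not BLAST's actual data structure, it leaves out the closely related (neighbourhood) words, and the function names are invented for the example.

```python
from collections import defaultdict


def index_words(seq, k=3):
    """Map every k-letter word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where the two share an exact word."""
    subject_index = index_words(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in subject_index.get(query[i:i + k], ()):
            hits.append((i, j))
    return hits


print(shared_word_hits("ACDEF", "XXCDEXX"))  # [(1, 2)]
```

A subject with no hits can be discarded without aligning it, and for the others the hit positions tell the alignment step which diagonals are worth extending.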
Variations on a theme
BLASTN
• Nucleotide query sequence
• Nucleotide database
BLASTP
• Protein query sequence
• Protein database
BLASTX
• Nucleotide query sequence
• Protein database
• Compares all six reading frames with the database
TBLASTN
• Protein query sequence
• Nucleotide database
• "On the fly" six frame translation of database
TBLASTX
• Nucleotide query sequence
• Nucleotide database
• Compares all reading frames of query with all reading frames of the database
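The six frame translation used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. This is an illustration only: it assumes the standard genetic code, writes stop codons as "*" and unknown codons as "X", and the function names are made up for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; anything not in the table becomes X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translation("ATGGCC"))
```

Each of the six protein strings is then compared at the protein level, which is why these programs cost more than a plain BLASTP search.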
BLAST at NCBI
http://www.ncbi.nlm.nih.gov/BLAST/
Very fast computer dedicated to running BLAST searches
Many databases that are always up to date
Nice simple web interface
But you still need knowledge about BLAST to use it properly
Performing a simple BLAST search
We will now do a small exercise together
The purpose of the exercise is simply to perform a simple BLAST search "hands on"
Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise1.html
The most common and effective way to ruin your database search
What you should never ever do: take the nucleotide sequence of a gene and compare it with a database at the nucleotide level
• Unfortunately this is a very intuitive thing to do
• On the NCBI BLAST homepage nucleotide search methods are listed before protein search, making it even more intuitive
What you should do instead
• Extract the coding part of the DNA sequence, translate it, and search with the resulting protein sequence
• Use a search method (such as BLASTX or TBLASTX) which compares the sequences at the protein level
The limits of sequence similarity
Expectation values in BLAST
BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores
• For this reason BLAST only allows certain combinations of substitution matrices and gap penalties
• This also means that the fit is based on a different data set than the one you are working on
A word of caution: BLAST tends to overestimate the significance of its matches
• E-values from BLAST are fine for identifying sure hits
• One should be careful using BLAST's E-values to judge if a marginal hit can be trusted
Evaluating BLAST results
We will now do a second exercise together
The main point of this exercise is careful interpretation of the BLAST output
Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise2.html
Pairwise alignment of hemoglobin alpha chain and myoglobin
24.7% identity; Global alignment score: 130

HBA_HU VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG---
MYG_PH VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED

HBA_HU ---HGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF-KLLSHCLLVTLAAHL
MYG_PH LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI-PIKYLEFISEAIIHVLHSRH

HBA_HU PAEFTPAVHASLDKFLASVSTVLTSKYR------
MYG_PH PGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
A multiple sequence alignment of globins
HBB_HUMAN  --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
HBB_HORSE  --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
HBA_HUMAN  ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
HBA_HORSE  ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
MYG_PHYCA  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
LGB2_LUPLU --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
*: : : * . : .: * : * : .

HBB_HUMAN  PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
HBB_HORSE  PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
HBA_HUMAN  ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
HBA_HORSE  ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
MYG_PHYCA  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
GLB5_PETMA ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
LGB2_LUPLU VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
. .:: *. : . : *. * . : .
Why multiple alignment is better
More sequences contain more information
Multiple sequence alignment allows us to compare all related proteins simultaneously
It allows us to identify features that are conserved among the sequences
Using a multiple sequence alignment (a profile) one can find more related sequences than by simple pairwise comparison
Coffee break
Coffee break quiz:
Why is the lasT gene in E. coli called lasT?
Did some researchers fail to get the joke?
An iterative scheme for using profiles in database searches
Search your sequence against a large database using a pairwise alignment method (often BLAST) to obtain a set of closely related sequences
Make a multiple sequence alignment (using ClustalW) and estimate a profile
Search the profile against the database in an attempt to find more distantly related sequences
Include these in the profile and redo the profile search
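The control flow of the iterative scheme can be sketched with stand-in functions. Everything here is a toy assumption: `toy_search` and `toy_profile` are invented placeholders for a real database search and a real profile estimate, and a "profile" is just a set of letters. Only the loop structure is the point.

```python
def iterative_profile_search(query, search, build_profile, max_rounds=5):
    """Repeat profile searches until no new sequences are found."""
    hits = set(search(query))                  # initial pairwise search
    for _ in range(max_rounds):
        profile = build_profile(query, hits)   # re-estimate the profile
        found = set(search(profile))
        if found <= hits:                      # nothing new: converged
            break
        hits |= found                          # include the new sequences
    return hits


# Toy stand-ins: an entry "matches" if it shares a letter with the probe.
TOY_DATABASE = ["AB", "BC", "CD", "XY"]


def toy_search(probe):
    letters = set(probe)
    return [entry for entry in TOY_DATABASE if letters & set(entry)]


def toy_profile(query, hits):
    letters = set(query)
    for entry in hits:
        letters |= set(entry)
    return letters


print(sorted(iterative_profile_search("A", toy_search, toy_profile)))
# ['AB', 'BC', 'CD']
```

In the toy run the query reaches "CD" only through the intermediate "BC", which mirrors how a profile can pick up distant relatives that a single pairwise search misses. The same mechanism means one wrong inclusion can pull in further unrelated sequences in later rounds.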
If only life was so simple …
In the databases one may find large clusters of almost identical sequences
• These will heavily bias the profile towards "their sequence"
• To avoid this a sequence weighting scheme must be used during construction of the profile
How should one estimate the frequencies of rare mutations that have not been observed?
• A more general problem: What to do when you have too few observations to make a reliable estimate of a frequency
• The solution is called regularization, which involves using prior knowledge on mutations (such as substitution matrices)
Regularization by pseudo counts
In addition to the real counts actually observed in the sequences some extra pseudo counts are added
The simplest approach is to add 1 to all counters before calculating frequencies
PSI-BLAST adds pseudo counts based on observations multiplied by a substitution matrix
• This means that pseudo counts are mainly added to the amino acids which are similar to the observed ones
• The number of pseudo counts is adjusted so that pseudo counts are mainly used when few real counts are available
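The simplest scheme, adding the same pseudo count to every amino acid, can be sketched for a single alignment column. This shows only the add-one approach, not PSI-BLAST's substitution matrix based pseudo counts, and the function name is invented for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(column, pseudocount=1.0):
    """Estimate amino acid frequencies for one alignment column.

    Gap characters and anything outside the 20 standard amino acids are ignored.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


freqs = column_frequencies("AAAV")
print(freqs["A"], freqs["V"], freqs["W"])
```

With only four observed residues the 20 pseudo counts dominate the estimate. As more sequences are added the real counts take over, which is the behaviour the last bullet above describes.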
An overview of PSI-BLAST
A fast heuristic method for doing profile searches which is almost as good as ”the real thing”
Outline of the algorithm
• First ordinary BLAST is used to find close homologs
• Rather than making a real multiple alignment the close homologs are all just aligned to the query sequence (a master-slave alignment)
• A profile is constructed using a very simple empirical weighting scheme combined with substitution matrix pseudo-counts
• Ignoring the positional variation of indels the profile is again searched against the database
PSI-BLAST’s E-values
BLAST generally tends to overestimate the significance of database hits
PSI-BLAST E-values are not the E-value of the query sequence matching the database sequence
Instead the E-values represent the expectation value of the profile matching the database sequence
The profile might be wrong due to spurious hits in earlier iterations!
Using PSI-BLAST
We will now use PSI-BLAST to find more homologs to our query sequence
Again the emphasis is on the interpretation of the results
Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise3.html
Conserved domain BLAST
PSI-BLAST attempts to build a profile for the query sequence and search it against a sequence database
CD-BLAST instead builds a database of profiles and searches the query sequence against this
• This means that CD-BLAST is not iterative and thus faster
• CD-BLAST works for sequences with no close homologs
• The profiles come from the PFAM database which is checked by experts to make sure that no unrelated sequences are included in the profiles
• However CD-BLAST can only identify conserved domains which are in the PFAM database
Is it really worth the trouble?
Yes!
Profile based search methods like PSI-BLAST have been shown to find ~3 times as many homologs without increasing the number of false positives
This essentially translates into three times higher chance of finding a homolog with known structure or function
Using profiles rather than single sequences improved secondary structure prediction by ~10%
Searching for conserved protein domains using CD-BLAST
We will now use CD-BLAST to see if the query sequence has matches to any known protein families
The results of this search should be compared to those found using PSI-BLAST
Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise4.html
Important things to remember when using alignment to search databases
When searching in databases, size does matter!
• Searching large databases takes a very long time
• The significance of matches drops when the database is expanded
Doing things differently can lead to different conclusions
• Nucleotide comparison vs. protein comparison
• CD-BLAST vs. PSI-BLAST
Think before and after you search
• The obvious thing to do is not always the right thing to do
• E-values should be interpreted with care
• Conclusions based on matches should be drawn with greater care