Center for Biological Sequence Analysis Database Searching Using alignment algorithms for finding similar sequences

Why do we want to compare sequences?

Evolutionary relationships
• Phylogenetic trees can be constructed based on comparison of the sequences of a molecule (example: 16S rRNA) taken from different species
• Residues conserved during evolution play an important role

Prediction of protein structure and function
• Proteins which are very similar in sequence generally have similar 3D structure and function as well
• By searching a sequence of unknown structure against a database of known proteins, the structure and/or function can in many cases be predicted

Things to keep in mind when working with alignments

Pairwise alignment programs always find the optimal alignment of two sequences
• They do so even if it does not make any sense at all to align the two sequences
• "Optimal" means optimal according to the substitution matrix and gap penalties you choose, even if you choose the wrong ones

Generally the underlying assumptions are wrong
• The frequency of substitution is not the same at all positions
• Nor are the frequencies of insertions and deletions the same
• Affine gap penalties do not properly model indel events

Using sequence alignment to search databases

The most common usage of pairwise sequence alignment is searching databases for related sequences

Although the alignments themselves may be unreliable, the alignment scores give a lot of information about which sequences are related and which are not

Having a set of related sequences is a lot more informative than just one sequence - even if nothing is known about the related sequences

Requirements in addition to an alignment method

A very fast method to find potentially related sequences
• Systematically searching through the databases with the alignment methods takes too long even though dynamic programming is fast
• Some method to initially identify possible matches is therefore needed to speed up the search

A method to evaluate which matches to trust
• Statistics on the alignment score distributions can be used to calculate the significance of an alignment
• This way we can not only rank which matches are better than others but also tell if any of them are good at all

Local or global alignment

Generally local alignment is used for performing database searches
• In most cases you would be interested in knowing if any part of your sequence looks like something else
• The protein sequence databases have not been split into domains

It is not always the optimal thing to do but …
• In the case where the complete sequence should match, the local alignment score will be almost identical to the global one
• If you really want a global alignment you can make it afterwards

Differences between global and local alignments

Extra constraint on scoring function: The expected score for a random alignment must be negative

Because you can start a new alignment anywhere, dynamic programming scores cannot become negative

The trace-back is started at the highest value rather than the lower right corner

The trace-back is stopped as soon as a zero is encountered

The Smith-Waterman algorithm (local alignment)
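
The differences from global alignment listed on the previous slide can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the match, mismatch and gap values are arbitrary assumptions (chosen so that the expected score of a random alignment is negative), and a simple linear gap penalty is used instead of an affine one.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for one optimal local alignment.

    The scoring values are illustrative assumptions, not recommended settings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 means a new alignment can start anywhere,
            # so scores never become negative.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest value (not the lower right corner)
    # and stop as soon as a zero is encountered.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `smith_waterman("AAAGATTACA", "CCGATTAC")` aligns only the shared region `GATTAC` and ignores the unrelated flanking residues, which is exactly the behaviour wanted in a database search.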

Alignment score distributions

The local similarity scores for ungapped alignment of random sequences can be shown to follow an extreme value distribution:

$P(S \ge x) = 1 - \exp(-K m n \, e^{-\lambda x})$, where m and n are the sequence lengths while K and λ are free parameters

This turns out to be a very good approximation for gapped alignment as long as reasonably large gap penalties are used
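
As a small worked example, the extreme value formula can be evaluated directly. The function name and the parameter values in the usage note are assumptions made for illustration; in practice K and λ are fitted for a given substitution matrix and gap penalty combination.

```python
import math

def p_value(score, m, n, K, lam):
    """P(S >= score) under the extreme value distribution.

    m, n : lengths of the two sequences
    K, lam : free parameters of the distribution (must be fitted;
             any values passed here are the caller's assumption)
    """
    # -expm1(-y) equals 1 - exp(-y) but stays accurate for very small y.
    return -math.expm1(-K * m * n * math.exp(-lam * score))
```

For example, `p_value(60, 300, 300, K=0.1, lam=0.3)` uses made-up parameters and only shows how quickly the probability falls as the score grows.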

Database searching

Positive reporting: When searching in a database we report only the few good matches

The expected number of database hits with a score of at least x can be calculated as:

$E(S \ge x) = D \cdot P(S \ge x)$, where D is the number of entries in the database

E-values are much better for evaluating alignments than raw alignment scores or ”percent identity”
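
Positive reporting with E-values can be sketched as follows. The function name, the input format (a dictionary of per-hit P-values) and the default threshold of 0.01 are assumptions chosen for the example, not values prescribed by any search program.

```python
def report_hits(p_values, database_size, max_e=0.01):
    """Return (name, E-value) pairs with E <= max_e, best hit first.

    p_values      : dict mapping a database entry name to P(S >= x) for its match
    database_size : D, the number of entries in the database
    max_e         : reporting threshold (an arbitrary illustrative default)
    """
    hits = [(name, database_size * p) for name, p in p_values.items()]
    return sorted((h for h in hits if h[1] <= max_e), key=lambda h: h[1])
```

With `p_values = {"a": 1e-9, "b": 1e-4}`, a database of 1,000 entries reports only `a`, while a database of 100,000,000 entries reports nothing: the same raw match becomes less significant as the database grows.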

A curse or a blessing?

Large databases are a blessing …
• They are more likely to contain something similar to the query

… and a curse
• Increasing the size of the database decreases the significance of the hits you get
• Searching huge databases requires fast computers

What requirements this puts on software development
• The programs must be sped up or database searches will take longer and longer
• The false positive rate must be reduced to not lose specificity

Heuristic search algorithms

FASTA (Pearson 1995)

Uses heuristics to avoid calculating the full dynamic programming matrix

Speeds up searches by an order of magnitude compared to full Smith-Waterman

The statistical side of FASTA is still stronger than that of BLAST

BLAST (Altschul 1990, 1997)

Uses rapid word lookup methods to completely skip most of the database entries

Extremely fast
• One order of magnitude faster than FASTA
• Two orders of magnitude faster than Smith-Waterman

Almost as sensitive as FASTA

Coffee break

Top 10 ways to tell you drink too much coffee
10 Juan Valdez names his donkey after you
9 You get a speeding ticket even when you're parked
8 You grind your coffee beans in your mouth
7 You sleep with your eyes open
6 You watch videos in fast-forward
5 You lick your coffeepot clean
4 Your eyes stay open when you sneeze
3 The nurse needs a scientific calculator to take your pulse
2 You can type sixty words a minute with your feet
1 You can jump-start your car without jumper cables.

How BLAST works

The search is sped up by indexing the sequence databases in a so-called suffix array
• Three-letter subsequences are used as keys to the sequences
• Closely related substitutions are also included
• This gives ~150 index keys for each sequence

This is used in two ways
• To quickly discard sequences that are not similar at all before even beginning to align them
• To constrain the alignment and thereby speed up the alignment procedure itself
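
The word lookup idea can be illustrated with a toy example. This sketch makes several simplifying assumptions: it uses a plain Python dictionary rather than a suffix array, and the `SIMILAR_GROUPS` table is a hypothetical stand-in for scoring word neighbours with a real substitution matrix. It only covers the first use above, discarding sequences that share no word with the query.

```python
from collections import defaultdict

# Hypothetical groups of interchangeable amino acids; a real implementation
# would score neighbouring words with a substitution matrix instead.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]
SIMILAR = {aa: set(group) for group in SIMILAR_GROUPS for aa in group}

def words(seq, k=3):
    """All overlapping k-letter subsequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def neighbors(word):
    """The word itself plus every word reachable by one similar substitution."""
    result = {word}
    for i, aa in enumerate(word):
        for alt in SIMILAR.get(aa, ()):
            result.add(word[:i] + alt + word[i + 1:])
    return result

def build_index(database, k=3):
    """Map each k-letter word to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for w in words(seq, k):
            index[w].add(name)
    return index

def candidates(query, index, k=3):
    """Names of database sequences sharing at least one (neighbouring) word."""
    hits = set()
    for w in words(query, k):
        for n in neighbors(w):
            hits |= index.get(n, set())
    return hits
```

Only the sequences returned by `candidates` would then be passed on to the much slower alignment step.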

Variations on a theme

BLASTN
• Nucleotide query sequence
• Nucleotide database

BLASTP
• Protein query sequence
• Protein database

BLASTX
• Nucleotide query sequence
• Protein database
• Compares all six reading frames with the database

TBLASTN
• Protein query sequence
• Nucleotide database
• "On the fly" six frame translation of database

TBLASTX
• Nucleotide query sequence
• Nucleotide database
• Compares all reading frames of query with all reading frames of the database
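
The six frame translation used by BLASTX, TBLASTN and TBLASTX can be sketched as below. This is an illustrative sketch that assumes the standard genetic code and an uppercase or lowercase DNA string; the function names are made up for the example, and unknown codons (for instance those containing N) are written as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Translations of the three forward frames followed by the three reverse frames."""
    dna = dna.upper()
    return [translate(strand[offset:])
            for strand in (dna, reverse_complement(dna))
            for offset in range(3)]
```

Each of the six resulting protein strings is then compared at the protein level, which is why these programs remain sensitive even when the reading frame of the nucleotide sequence is unknown.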

BLAST at NCBI
http://www.ncbi.nlm.nih.gov/BLAST/

Very fast computer dedicated to running BLAST searches

Many databases that are always up to date

Nice simple web interface

But you still need knowledge about BLAST to use it properly

Performing a simple BLAST search

We will now do a small exercise together

The purpose of the exercise is simply to perform a simple BLAST search "hands on"

Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise1.html

The most common and effective way to ruin your database search

What you should never ever do: take the nucleotide sequence of a gene and compare it with a database at the nucleotide level
• Unfortunately this is a very intuitive thing to do
• On the NCBI BLAST homepage nucleotide search methods are listed before protein search, making it even more intuitive

What you should do instead
• Extract the coding part of the DNA sequence, translate it, and search with the resulting protein sequence
• Use a search method (such as BLASTX or TBLASTX) which compares the sequences at the protein level

The limits of sequence similarity

Expectation values in BLAST

BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores
• For this reason BLAST only allows certain combinations of substitution matrices and gap penalties
• This also means that the fit is based on a different data set than the one you are working on

A word of caution: BLAST tends to overestimate the significance of its matches
• E-values from BLAST are fine for identifying sure hits
• One should be careful using BLAST's E-values to judge if a marginal hit can be trusted

Evaluating BLAST results

We will now do a second exercise together

The main point of this exercise is careful interpretation of the BLAST output


Pairwise alignment of hemoglobin alpha chain and myoglobin

24.7% identity; Global alignment score: 130

HBA_HU VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG---
MYG_PH VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED

HBA_HU ---HGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF-KLLSHCLLVTLAAHL
MYG_PH LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI-PIKYLEFISEAIIHVLHSRH

HBA_HU PAEFTPAVHASLDKFLASVSTVLTSKYR------
MYG_PH PGDFGADAQGAMNKALELFRKDIAAKYKELGYQG

A multiple sequence alignment of globins

HBB_HUMAN  --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
HBB_HORSE  --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
HBA_HUMAN  ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
HBA_HORSE  ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
MYG_PHYCA  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
LGB2_LUPLU --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

HBB_HUMAN  PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
HBB_HORSE  PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
HBA_HUMAN  ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
HBA_HORSE  ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
MYG_PHYCA  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
GLB5_PETMA ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
LGB2_LUPLU VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Why multiple alignment is better

More sequences contain more information

Multiple sequence alignment allows us to compare all related proteins simultaneously

It allows us to identify features that are conserved among the sequences

Using a multiple sequence alignment (a profile) one can find more related sequences than by simple pairwise comparison

Coffee break

Coffee break quiz:

Why is the lasT gene in E. coli called lasT?

Did some researchers fail to get the joke?

An iterative scheme for using profiles in database searches

Search your sequence against a large database using a pairwise alignment method (often BLAST) to obtain a set of closely related sequences

Make a multiple sequence alignment (using ClustalW) and estimate a profile

Search the profile against the database in an attempt to find more distantly related sequences

Include these in the profile and redo the profile search

If only life was so simple …

In the databases one may find large clusters of almost identical sequences
• These will heavily bias the profile towards "their sequence"
• To avoid this a sequence weighting scheme must be used during construction of the profile

How should one estimate the frequencies of rare mutations that have not been observed?
• A more general problem: What to do when you have too few observations to make a reliable estimate of a frequency
• The solution is called regularization, which involves using prior knowledge on mutations (such as substitution matrices)
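
The slide does not say which sequence weighting scheme to use. As one possible illustration, the sketch below implements position-based (Henikoff-style) weights, in which a sequence earns less weight in every column where many other sequences share its residue. The function name is made up for the example, and the code assumes a non-empty alignment of equal-length rows and treats gap characters like any other symbol.

```python
from collections import Counter

def position_based_weights(alignment):
    """Position-based sequence weights, normalised to sum to 1.

    alignment : non-empty list of aligned sequences of equal length
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)  # number of distinct symbols in this column
        for idx, residue in enumerate(column):
            # Each column distributes 1/k to every symbol type, shared
            # equally among the sequences carrying that symbol.
            weights[idx] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]
```

For the alignment `["AAA", "AAA", "CCC"]` the two identical sequences receive a weight of 0.25 each and the divergent one 0.5, so the cluster of duplicates no longer dominates the profile.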

Regularization by pseudo counts

In addition to the real counts actually observed in the sequences some extra pseudo counts are added

The simplest approach is to simply add 1 to all counters before calculating frequencies

PSI-BLAST adds pseudo counts based on observations multiplied by a substitution matrix
• This means that pseudo counts are mainly added to the amino acids which are similar to the observed ones
• The number of pseudo counts is adjusted so that pseudo counts are mainly used when few real counts are available
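
The simplest scheme mentioned above, adding 1 to every counter, can be sketched for a single alignment column. This is not the substitution-matrix scheme used by PSI-BLAST; the function name and the choice to skip gap characters are assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(column, pseudo=1.0):
    """Estimated amino acid frequencies for one alignment column.

    column : string with the residues observed at this position
    pseudo : pseudo count added to every amino acid (1.0 is the add-one rule)
    Characters that are not one of the 20 amino acids (e.g. gaps) are skipped.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudo * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudo) / total for aa in AMINO_ACIDS}
```

For the column `"LLLV"` this gives L a frequency of 4/24 and V 2/24, while every unobserved amino acid gets 1/24 instead of zero. With only four observations the pseudo counts dominate, which shows why more refined schemes scale them down as real counts accumulate.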

An overview of PSI-BLAST

A fast heuristic method for doing profile searches which is almost as good as ”the real thing”

Outline of the algorithm
• First ordinary BLAST is used to find close homologs
• Rather than making a real multiple alignment the close homologs are all just aligned to the query sequence (a master-slave alignment)
• A profile is constructed using a very simple empirical weighting scheme combined with substitution matrix pseudo-counts
• Ignoring the positional variation of indels the profile is again searched against the database

PSI-BLAST’s E-values

BLAST generally tends to overestimate the significance of database hits

PSI-BLAST E-values are not the E-value of the query sequence matching the database sequence

Instead the E-values represent the expectation value of the profile matching the database sequence

The profile might be wrong due to spurious hits in earlier iterations!

Using PSI-BLAST

We will now use PSI-BLAST to find more homologs to our query sequence

Again the emphasis is on the interpretation of the results


Conserved domain BLAST

PSI-BLAST attempts to build a profile for the query sequence and search it against a sequence database

CD-BLAST instead builds a database of profiles and searches the query sequence against this
• This means that CD-BLAST is not iterative and thus faster
• CD-BLAST works for sequences with no close homologs
• The profiles come from the PFAM database which is checked by experts to make sure that no unrelated sequences are included in the profiles
• However CD-BLAST can only identify conserved domains which are in the PFAM database

Is it really worth the trouble?

Yes!

Profile based search methods like PSI-BLAST have been shown to find ~3 times as many homologs without increasing the number of false positives

This essentially translates into a three times higher chance of finding a homolog with known structure or function

Using profiles rather than single sequences improved secondary structure prediction by ~10%

Searching for conserved protein domains using CD-BLAST

We will now use CD-BLAST to see if the query sequence has matches to any known protein families

The results of this search should be compared to those found using PSI-BLAST


Important things to remember when using alignment to search databases

When searching in databases, size does matter!
• Searching large databases takes a very long time
• The significance of matches drops when the database is expanded

Doing things differently can lead to different conclusions
• Nucleotide comparison vs. protein comparison
• CD-BLAST vs. PSI-BLAST

Think before and after you search
• The obvious thing to do is not always the right thing to do
• E-values should be interpreted with care
• Conclusions based on matches should be drawn with greater care
