
Bioinformatica t5-database searching

Description: BLAST/FASTA

Page 1: Bioinformatica t5-database searching
Page 2: Bioinformatica t5-database searching

FBW, 23-10-2012

Wim Van Criekinge

Page 3: Bioinformatica t5-database searching

Course Contents: Bioinformatics

NO CLASS

Page 4: Bioinformatica t5-database searching
Page 5: Bioinformatica t5-database searching

DataBase Searching

Dynamic Programming Reloaded

Database Searching
  Fasta
  Blast
  Statistics
  Practical Guide

Extensions
  PSI-Blast
  PHI-Blast
  Local Blast
  BLAT

Page 6: Bioinformatica t5-database searching

Needleman-Wunsch-edu.pl

The Score Matrix
----------------

Seq1(j)        1   2   3   4   5   6   7   8   9  10
Seq2       *   C   K   H   V   F   C   R   V   C   I
(i) *      0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
1   C     -1   1   0  -1  -2  -3  -4  -5  -6  -7  -8
2   K     -2   0   2   1   0  -1  -2  -3  -4  -5  -6
3   K     -3  -1   1   1   0  -1  -2  -3  -4  -5  -6
4   C     -4  -2   0   0   0  -1   0  -1  -2  -3  -4
5   F     -5  -3  -1  -1  -1   1   0  -1  -2  -3  -4
6   C     -6  -4  -2  -2  -2   0   2   1   0  -1  -2
7   K     -7  -5  -3  -3  -3  -1   1   1   0  -1  -2
8   C     -8  -6  -4  -4  -4  -2   0   0   0   1   0
9   V     -9  -7  -5  -5  -3  -3  -1  -1   1   0   0

Each cell matrix(i,j) is the maximum of three candidate scores:

A: diagonal_score = matrix(i-1,j-1) + MATCH if substr(seq1,j-1,1) eq substr(seq2,i-1,1), otherwise matrix(i-1,j-1) + MISMATCH

B: up_score = matrix(i-1,j) + GAP

C: left_score = matrix(i,j-1) + GAP
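The three candidate scores A, B and C can be sketched as a matrix fill. This is a minimal Python sketch, not the Needleman-Wunsch-edu.pl script itself. The scores MATCH = 1, MISMATCH = -1 and GAP = -1 are an assumption, chosen because they reproduce the first rows of the score matrix shown on this slide. The function name nw_matrix is hypothetical.

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scores, not taken from the original script


def nw_matrix(seq1, seq2):
    """Fill the global-alignment score matrix; rows follow seq2 (i), columns follow seq1 (j)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # First row and first column: a growing run of gaps.
    for j in range(1, cols):
        matrix[0][j] = matrix[0][j - 1] + GAP
    for i in range(1, rows):
        matrix[i][0] = matrix[i - 1][0] + GAP
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[j - 1] == seq2[i - 1]
            diagonal = matrix[i - 1][j - 1] + (MATCH if same else MISMATCH)  # A
            up = matrix[i - 1][j] + GAP                                      # B
            left = matrix[i][j - 1] + GAP                                    # C
            matrix[i][j] = max(diagonal, up, left)
    return matrix


if __name__ == "__main__":
    for row in nw_matrix("CKHVFCRVCI", "CKKCFCKCV"):
        print(" ".join(f"{value:3d}" for value in row))
```

The bottom-right cell holds the score of the best global alignment; a traceback over the same matrix (not shown) recovers the alignment itself.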

Page 7: Bioinformatica t5-database searching

• The most practical and widely used method for multiple sequence alignment is the hierarchical extension of pairwise alignment methods.

• The principle is that a multiple alignment is achieved by successive application of pairwise methods.
  – First do all pairwise alignments (not just one sequence with all others)
  – Then combine the pairwise alignments to generate the overall alignment

Multiple Alignment Method

Page 8: Bioinformatica t5-database searching

• Consider the task of searching SWISS-PROT with a query sequence:
  – say our query sequence is 362 amino acids long
  – SWISS-PROT release 38 contains 29,085,265 amino acids
  – finding local alignments via dynamic programming would entail O(10^10) matrix operations (362 × 29,085,265 ≈ 1.05 × 10^10)

• Given the size of the databases, more efficient methods are needed

Database Searching

Page 9: Bioinformatica t5-database searching

FASTA (Pearson 1995)

Uses heuristics to avoid calculating the full dynamic programming matrix

Speed up searches by an order of magnitude compared to full Smith-Waterman

The statistical side of FASTA is still stronger than BLAST

BLAST (Altschul 1990, 1997)

Uses rapid word lookup methods to completely skip most of the database entries

Extremely fast

One order of magnitude faster than FASTA

Two orders of magnitude faster than Smith-Waterman

Almost as sensitive as FASTA

Heuristic approaches to DP for database searching

Page 10: Bioinformatica t5-database searching

« Hit and extend heuristic »

• Problem: Too many calculations “wasted” by comparing regions that have nothing in common

• Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical

• Basic method: Look for similar regions only near short stretches that match exactly

FASTA

Page 11: Bioinformatica t5-database searching

FASTA-Stages

1. Find k-tups in the two sequences (k=1,2 for proteins, 4-6 for DNA sequences)

2. Score and select top 10 scoring “local diagonals”

3. Rescan top 10 regions, score with PAM250 (proteins) or DNA scoring matrix. Trim off the ends of the regions to achieve highest scores.

4. Try to join regions with gapped alignments. Join if similarity score is one standard deviation above average expected score

5. After finding the best initial region, FASTA performs a global alignment of a 32 residue wide region centered on the best initial region, and uses the score as the optimized score.
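Stages 1 and 2 can be illustrated with a small Python sketch that counts exact k-tuple matches per diagonal. It is a simplified illustration under stated assumptions, not the FASTA implementation. The function name diagonal_hits and the toy sequences are hypothetical, and the real program scores diagonals rather than just counting hits.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, target, k):
    """Count exact k-tuple matches per diagonal (diagonal = target position - query position)."""
    # Index every k-tuple of the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    # Scan the target and record which diagonal each shared k-tuple falls on.
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], []):
            hits[j - i] += 1
    return hits


if __name__ == "__main__":
    # Toy example: the query occurs in the target shifted by two positions.
    print(diagonal_hits("ACDEF", "XXACDEF", 2).most_common(10))
```

Diagonals with many hits are the candidate "local diagonals" that the later stages rescan with a substitution matrix.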

Page 12: Bioinformatica t5-database searching
Page 13: Bioinformatica t5-database searching
Page 14: Bioinformatica t5-database searching

• Sensitivity: the ability of a program to identify weak but biologically significant sequence similarity.

• Selectivity: the ability of a program to discriminate between true matches and matches occurring by chance alone.
  – A decrease in selectivity results in more false positives being reported.

FastA

Page 15: Bioinformatica t5-database searching

FastA (http://www.ebi.ac.uk/fasta33/)

Blosum50 default.
Lower PAM, higher BLOSUM to detect close sequences
Higher PAM and lower BLOSUM to detect distant sequences

Gap opening penalty -12, -16 by default for fasta with proteins and DNA, respectively

Gap extension penalty -2, -4 by default for fasta with proteins and DNA, respectively

The larger the word-length the less sensitive, but faster the search will be

Max number of scores and alignments is 100

Page 16: Bioinformatica t5-database searching

FastA Output

Database code hyperlinked to the SRS database at EBI

Accession number

Description Length

Initn, init1, opt, z-score calculated during run

E score - expectation value, how many hits are expected to be found by chance with such a score while comparing this query to this database.

E() does not represent the % similarity

Page 17: Bioinformatica t5-database searching

Query: DNA Protein

Database:DNA Protein

FastA, TFastA, FastX, FastY

FastA is a family of programs

Page 18: Bioinformatica t5-database searching

FASTA can miss significant similarity since
– For proteins, similar sequences do not have to share identical residues
  • Asp-Lys-Val is quite similar to Glu-Arg-Ile, yet it is missed even with a ktuple size of 1 since no amino acid matches
  • Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly, but there is no match with a ktuple size of 2

FASTA problems

Page 19: Bioinformatica t5-database searching

FASTA can miss significant similarity since
– For nucleic acids, due to codon “wobble”, DNA sequences may look like XXyXXyXXy where the X's are conserved and the y's are not
  • GGuUCuACgAAg and GGcUCcACaAAa both code for the same peptide sequence (Gly-Ser-Thr-Lys) but they don't match with a ktuple size of 3 or higher

FASTA problems
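The wobble example can be checked directly. The Python sketch below lists the k-tuples that two sequences share; the helper name shared_ktuples is hypothetical. For the two coding sequences from the slide, there are shared 2-tuples but no shared 3-tuple.

```python
def shared_ktuples(seq_a, seq_b, k):
    """Return the set of k-tuples that occur in both sequences."""
    def tuples(seq):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return tuples(seq_a) & tuples(seq_b)


if __name__ == "__main__":
    a = "GGUUCUACGAAG"  # Gly-Ser-Thr-Lys
    b = "GGCUCCACAAAA"  # Gly-Ser-Thr-Lys, different third codon positions
    for k in (2, 3):
        print(k, sorted(shared_ktuples(a, b, k)))
```

With no shared 3-tuple there is nothing to seed a diagonal, so a k-tuple search with k = 3 or higher never gets to compare the two sequences in detail.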

Page 20: Bioinformatica t5-database searching

DataBase Searching

Dynamic Programming Reloaded

Database Searching
  Fasta
  Blast
  Statistics
  Practical Guide

Extensions
  PSI-Blast
  PHI-Blast
  Local Blast
  BLAT

Page 21: Bioinformatica t5-database searching

BLAST - Basic Local Alignment Search Tool

Page 22: Bioinformatica t5-database searching

What does BLAST do?

• Search a large target set of sequences...

• …for hits to a query sequence...

• …and return the alignments and scores from those hits...

• Do it fast.

Show me those sequences that deserve a second look. BLAST programs were designed for fast database searching, with minimal sacrifice of sensitivity to distantly related sequences.

Page 23: Bioinformatica t5-database searching

The big red button

Do My Job

It is dangerous to hide too much of the underlying complexity from the scientists.

Page 24: Bioinformatica t5-database searching

• Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T

• Key concept “Neighborhood”: this seems similar to FASTA, but we are searching for words that score above T rather than words that match exactly

• Calculate the neighborhood (T) for substrings of the query (size w)

Overview

Page 25: Bioinformatica t5-database searching

Compile a list of words which give a score above T when paired with the query sequence.
– Example using PAM-120 for query sequence ACDE (w=4, T=17):

  A C D E
  A C D E = +3 +9 +5 +5 = 22

• try all possibilities:

  A A A A = +3 -3 0 0 = 0 no good
  A A A C = +3 -3 0 -7 = -7 no good

• ...too slow, try directed change

Overview

Page 26: Bioinformatica t5-database searching

A C D E
A C D E = +3 +9 +5 +5 = 22

• change 1st pos. to all acceptable substitutions
  g C D E = +1 +9 +5 +5 = 20 ok
  n C D E = +0 +9 +5 +5 = 19 ok
  i C D E = -1 +9 +5 +5 = 18 ok
  k C D E = -2 +9 +5 +5 = 17 ok

• change 2nd pos.: can't - all alternatives negative and the other three positions only add up to 13

• change 3rd pos. in combination with first position
  g C n E = +1 +9 +2 +5 = 17 ok

• continue - use recursion

• For "best" values of w and T there are typically about 50 words in the list for every residue in the query sequence

Overview

Page 27: Bioinformatica t5-database searching

Neighborhood.pl

# Calculate neighborhood
# Assumes @A (alphabet), %S (scoring matrix), @W (letters of the query word)
# and $T (threshold) are defined earlier in the script.
my %NH;
for (my $i = 0; $i < @A; $i++) {
    my $s1 = $S{$W[0]}{$A[$i]};
    for (my $j = 0; $j < @A; $j++) {
        my $s2 = $S{$W[1]}{$A[$j]};
        for (my $k = 0; $k < @A; $k++) {
            my $s3 = $S{$W[2]}{$A[$k]};
            my $score = $s1 + $s2 + $s3;
            my $word = "$A[$i]$A[$j]$A[$k]";
            next if $word =~ /[BZX\*]/;   # skip ambiguity codes and the stop symbol
            $NH{$word} = $score if $score >= $T;
        }
    }
}

# Output neighborhood, highest score first, ties broken alphabetically
foreach my $word (sort {$NH{$b} <=> $NH{$a} or $a cmp $b} keys %NH) {
    print "$word $NH{$word}\n";
}

Page 28: Bioinformatica t5-database searching

BLOSUM62 RGD 11

RGD 17
KGD 14
QGD 13
RGE 13
EGD 12
HGD 12
NGD 12
RGN 12
AGD 11
MGD 11
RAD 11
RGQ 11
RGS 11
RND 11
RSD 11
SGD 11
TGD 11

PAM200 RGD 13

RGD 18
RGE 17
RGN 16
KGD 15
RGQ 15
KGE 14
HGD 13
KGN 13
RAD 13
RGA 13
RGG 13
RGH 13
RGK 13
RGS 13
RGT 13
RSD 13
WGD 13

Page 29: Bioinformatica t5-database searching
Page 30: Bioinformatica t5-database searching

[Figure: extension of an indexed word hit. Axes: Length of extension, Score. Labels: S, Trim to max, indexed.]

* Two non-overlapping HSP’s on a diagonal within distance A

Page 31: Bioinformatica t5-database searching


Page 32: Bioinformatica t5-database searching

The BLAST algorithm

• Break the search sequence into words
  – W = 3 for proteins, W = 12 for DNA

• Include in the search all words that score above a certain value (T) for any search word

MCGPFILGTYC

MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC

MCG  CGP  …
MCT  MGP  …
MCN  CTP  …
…    …

This list can be computed in linear time
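Breaking the query into overlapping words is a single pass over the sequence, which is why the list can be built in linear time. A minimal Python sketch (the function name query_words is hypothetical):

```python
def query_words(sequence, w):
    """Return all overlapping words of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


if __name__ == "__main__":
    print(query_words("MCGPFILGTYC", 3))
```

A sequence of length L yields L - w + 1 words, here 11 - 3 + 1 = 9. Each of these words is then expanded into its neighborhood of words scoring at least T.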

Page 33: Bioinformatica t5-database searching

The Blast Algorithm (2)

• Search for the words in the database
  – Word locations can be precomputed and indexed
  – Searching for a short string in a long string

• HSP (High-scoring Segment Pair) = A match between a query word and the database

• Find a “hit”: Two non-overlapping HSP’s on a diagonal within distance A

• Extend the hit until the score falls below a threshold value, S
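The extension step can be sketched as follows. This is a simplified, ungapped, rightward-only illustration in Python, not BLAST's actual code. The stopping rule (stop once the running score has fallen a fixed amount below the best score seen, then trim back to that maximum) is an assumption standing in for the cutoff described on the slide, and the match/mismatch scores and the name extend_right are hypothetical. A real search would score with a substitution matrix and extend in both directions.

```python
def extend_right(query, target, q_start, t_start, match=1, mismatch=-1, drop=3):
    """Extend an ungapped hit to the right and return (best score, length at best score)."""
    score = best = 0
    length = best_length = 0
    while q_start + length < len(query) and t_start + length < len(target):
        same = query[q_start + length] == target[t_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score >= drop:
            break  # score has fallen too far below the maximum: stop extending
    return best, best_length  # trimmed back to the maximum


if __name__ == "__main__":
    print(extend_right("ACDEFGHIK", "ACDEFWWWW", 0, 0))
```

Trimming back to the maximum means the trailing mismatches that triggered the stop are not part of the reported segment.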

Page 34: Bioinformatica t5-database searching
Page 35: Bioinformatica t5-database searching

BLAST parameters

• Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set.

• Choosing a value for w
  – small w: many matches to expand
  – big w: many words to be generated
  – w=4 is a good compromise

• Lowering the segment extension cutoff (S) returns longer extensions for each hit.

• Changing the minimum E-value changes the threshold for reporting a hit.

Page 36: Bioinformatica t5-database searching

Critical parameters: T,W and scoring matrix

• The proper value of T depends on both the values in the scoring matrix and the balance between speed and sensitivity

• Higher values of T progressively remove more word hits and reduce the search space.

• A word size (W) of 1 will produce more hits than a word size of 10. In general, if T is scaled uniformly with W, smaller word sizes increase sensitivity and decrease speed.

• The interplay between W, T and the scoring matrix is critical, and choosing them wisely is the most effective way of controlling the speed and sensitivity of BLAST

Page 37: Bioinformatica t5-database searching

DataBase Searching

Dynamic Programming Reloaded

Database Searching
  Fasta
  Blast
  Statistics
  Practical Guide

Extensions
  PSI-Blast
  PHI-Blast
  Local Blast
  BLAT

Page 38: Bioinformatica t5-database searching

Database Searching

• How can we find a particular short sequence in a database of sequences (or one HUGE sequence)?

• Problem is identical to local sequence alignment, but on a much larger scale.

• We must also have some idea of the significance of a database hit.
  – Databases always return some kind of hit; how much attention should be paid to the result?

• How can we determine how “unusual” a particular alignment score is?

Page 39: Bioinformatica t5-database searching

Sentence 1:“These algorithms are trying to find the best way to match up two sequences”

Sentence 2:“This does not mean that they will find anything profound”

ALIGNMENT:

THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
:: :.. . .. ...: : ::::.. :: . : ...
THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------

12 exact matches
14 conservative substitutions

Is this a good alignment?

Significance

Page 40: Bioinformatica t5-database searching

• A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T

• This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability

Overview

Page 41: Bioinformatica t5-database searching

Mathematical Basis of BLAST

• Model matches as a sequence of coin tosses

• Let p be the probability of a “head”
  – For a “fair” coin, p = 0.5

• (Erdős-Rényi) If there are n throws, then the expected length R of the longest run of heads is

  $R = \log_{1/p}(n)$

• Example: Suppose n = 20 for a “fair” coin

  $R = \log_2(20) = 4.32$

• Trick is how to model DNA (or amino acid) sequence alignments as coin tosses.
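The coin-toss estimate is easy to evaluate numerically. A minimal Python sketch of the formula R = log base 1/p of n (the function name expected_longest_run is hypothetical):

```python
import math


def expected_longest_run(n, p):
    """Expected length of the longest run of heads in n tosses: log base 1/p of n."""
    return math.log(n) / math.log(1 / p)


if __name__ == "__main__":
    print(round(expected_longest_run(20, 0.5), 2))   # fair coin, 20 tosses
    print(round(expected_longest_run(20, 0.25), 2))  # DNA-like match probability of 1/4
```

The same function covers the comparison of two sequences of lengths m and n by passing their product m*n as the number of tosses.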

Page 42: Bioinformatica t5-database searching

Mathematical Basis of BLAST

• To model random sequence alignments, replace a match with a “head” and mismatch with a “tail”.

• For DNA, the probability of a “head” is 1/4
  – What is it for amino acid sequences?

AATCAT
ATTCAG
HTHHHT

Page 43: Bioinformatica t5-database searching

Mathematical Basis of BLAST

• So, for one particular alignment, the Erdös-Rényi property can be applied

• What about for all possible alignments?
  – Consider that sequences are being shifted back and forth, as in a dot matrix plot

• The expected length of the longest match is

  $R = \log_{1/p}(mn)$

where m and n are the lengths of the two sequences.

Page 44: Bioinformatica t5-database searching

Analytical derivation

Erdős-Rényi

Karlin-Altschul

Page 45: Bioinformatica t5-database searching

Karlin-Altschul Statistics

$E = k\,m\,n\,e^{-\lambda S}$

This equation states that the number of alignments expected by chance (E) during the sequence database search is a function of the size of the search space (m·n), the normalized score (λS) and a minor constant (k, mostly around 0.1).

The E-value grows linearly with the product of target and query sizes. Doubling the target set size and doubling the query length have the same effect on the E-value.
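The linear dependence on the search space can be checked with a short Python sketch of the Karlin-Altschul formula, E = k·m·n·exp(-λS). The default k = 0.1 follows the slide; the value of λ and the function name expected_hits are illustrative assumptions, not parameters of any particular scoring system.

```python
import math


def expected_hits(score, m, n, k=0.1, lam=0.3):
    """Expected number of chance alignments with at least this score."""
    return k * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    base = expected_hits(50, m=100, n=2_000_000)
    print(base)
    print(expected_hits(50, m=200, n=2_000_000) / base)  # doubled query length
    print(expected_hits(50, m=100, n=4_000_000) / base)  # doubled database size
```

Both ratios come out as 2, while each extra unit of score S reduces E by the constant factor exp(-λ).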

Page 46: Bioinformatica t5-database searching

Analytical derivation

Erdős-Rényi: $R = \log_{1/p}(mn)$

Karlin-Altschul: $E = k\,m\,n\,e^{-\lambda S}$

Page 47: Bioinformatica t5-database searching

Scoring alignments

• Score: S (~R)
  – $S = \sum_i M(q_i, t_i) - \sum \text{gaps}$

• Any alignment has a score

• Any two sequences have (at least) one optimal alignment

Page 48: Bioinformatica t5-database searching

• For a particular scoring matrix and its associated gap initiation and extension costs, one must calculate λ and k

• Unfortunately (for gapped alignments), you can’t do this analytically and the values must be estimated empirically
  – The procedure involves aligning random sequences (Monte Carlo approach) with a specific scoring scheme and observing the alignment properties (scores, target frequencies and lengths)

Page 49: Bioinformatica t5-database searching

“Monte Carlo” Approach:

• Compares the result to a randomized result, similar to results generated by a roulette wheel at Monte Carlo

• Typical procedure for alignments
  – Randomize sequence A
  – Align to sequence B
  – Repeat many times (hundreds)
  – Keep track of the optimal score

• Histogram of scores …

Significance
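The randomization procedure can be sketched in Python as below. It is an illustration under stated assumptions, not the course's Needleman-wunsch-Monte-Carlo.pl script: it uses a simple local (Smith-Waterman) score with assumed match, mismatch and gap values, and the function names local_score and shuffled_scores are hypothetical.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap cost (scores are assumed values)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def shuffled_scores(a, b, repeats=200, seed=0):
    """Score b against `repeats` random shuffles of a (same composition, random order)."""
    rng = random.Random(seed)
    letters = list(a)
    scores = []
    for _ in range(repeats):
        rng.shuffle(letters)
        scores.append(local_score("".join(letters), b))
    return scores


if __name__ == "__main__":
    a, b = "CKHVFCRVCI", "CKKCFCKCV"
    background = shuffled_scores(a, b)
    real = local_score(a, b)
    as_good = sum(score >= real for score in background)
    print(f"real score {real}; {as_good} of {len(background)} shuffles score at least as high")
```

A histogram of the shuffled scores gives the background distribution against which the real score is judged.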

Page 50: Bioinformatica t5-database searching

Assessing significance requires a distribution

• I have a pumpkin of diameter 1 m. Is that unusual?

[Figure: histogram of pumpkin diameters; x-axis: Diameter (m), y-axis: Frequency]

Page 51: Bioinformatica t5-database searching
Page 52: Bioinformatica t5-database searching
Page 53: Bioinformatica t5-database searching

• In seeking optimal Alignments between two sequences, one desires those that have the highest score - i.e. one is seeking a distribution of maxima

• In seeking optimal Matches between an Input Sequence and Sequence Entries in a Database, one again desires the matches that have the highest score, and these are obtained via examination of the distribution of such scores for the entries in the database - this is again a distribution of maxima.

“A Normal Distribution is a distribution of Sums of independent variables rather than of their Maxima.”

Normal Distribution does NOT Fit Alignment Scores !!

Significance

Page 54: Bioinformatica t5-database searching

Comparing distributions

Extreme Value: $f(x) = e^{-x}\,e^{-e^{-x}}$

Gaussian: $f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^{2}/2}$

Page 55: Bioinformatica t5-database searching

$P(x \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$

m, n: sequence lengths.

k, λ: free parameters.

This can be shown analytically for ungapped alignments and has been found empirically to also hold for gapped alignments under commonly used conditions.

Alignment of unrelated/random sequences result in scores following an extreme value distribution

Alignment scores follow extreme value distributions

[Figure: axes labelled E and x]

$P = 1 - e^{-E}$

$E = -\ln(1 - P)$
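The two conversions between P and E, P = 1 - exp(-E) and E = -ln(1 - P), are inverses of each other, and for small values P and E are nearly equal. A minimal Python sketch (the function names are hypothetical):

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E), computed with expm1 for accuracy at small E."""
    return -math.expm1(-e_value)


def e_from_p(p_value):
    """E = -ln(1 - P), computed with log1p for accuracy at small P."""
    return -math.log1p(-p_value)


if __name__ == "__main__":
    for e_value in (0.001, 0.1, 1.0, 10.0):
        print(e_value, p_from_e(e_value))
```

For large E the probability saturates at 1, which is why E-values rather than P-values are the more informative number for weak hits.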

Page 56: Bioinformatica t5-database searching

Alignment algorithms will always produce alignments, regardless of whether they are meaningful or not

=> important to have way of selecting significant alignments from large set of database hits.

Solution: fit distribution of scores from database search to extreme value distribution; determine p-value of hit from this fitted distribution.

Example: scores fitted to extreme value distribution.

99.9% of this distribution is located below score=112

=> hit with score = 112 has a p-value of 0.1%

Alignment scores follow extreme value distributions

Page 57: Bioinformatica t5-database searching

BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores

For this reason BLAST only allows certain combinations of substitution matrices and gap penalties

This also means that the fit is based on a different data set than the one you are working on

A word of caution: BLAST tends to overestimate the significance of its matches

E-values from BLAST are fine for identifying sure hits

One should be careful using BLAST’s E-values to judge if a marginal hit can be trusted (e.g., you may want to use E-values of 10^-4 to 10^-5).

Significance

Page 58: Bioinformatica t5-database searching

Determining P-values

• If we can estimate the two parameters of the extreme value distribution (its location and scale), then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database.

• For sequence matches, a scoring system and database can be parameterized by two parameters, k and λ, related to the location and scale of that distribution.
  – It would be nice if we could compare hit significance without regard to the scoring system used!

Page 59: Bioinformatica t5-database searching

Bit Scores

• The expected number of hits with score S is:

  $E = K\,m\,n\,e^{-\lambda S}$

  – Where m and n are the sequence lengths

• Normalize the raw score using:

  $S' = \dfrac{\lambda S - \ln K}{\ln 2}$

• Obtains a “bit score” S’, with a standard set of units.

• The new E-value is:

  $E = m\,n\,2^{-S'}$
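The bit-score normalization, S' = (λS - ln K) / ln 2, absorbs K and λ so that the E-value becomes m·n·2^(-S'). The Python sketch below checks that this gives the same E as the raw formula K·m·n·exp(-λS). The function names and the numeric values of K and λ are illustrative assumptions.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_from_bits(bits, m, n):
    """E-value from a bit score: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k, m, n, raw = 0.3, 0.1, 100, 1000, 50  # illustrative values only
    bits = bit_score(raw, lam, k)
    print(bits, e_from_bits(bits, m, n), k * m * n * math.exp(-lam * raw))
```

Because the scoring-system parameters are folded into S', bit scores from searches with different matrices can be compared directly.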

Page 60: Bioinformatica t5-database searching

[Figure: text histogram of alignment scores, bins from -74 to -49, bar length in asterisks]

Needleman-wunsch-Monte-Carlo.pl

(Average around -64 !)

Page 61: Bioinformatica t5-database searching

• The histogram: a graph of the frequency of observed scores

• The expected curve (asterisks) according to the extreme value distribution
  – the theoretical curve should be similar to the observed results

• Deviations indicate that the fitting parameters are wrong
  – gap penalties too weak
  – compositional biases

FastA Output

Page 62: Bioinformatica t5-database searching

< 20   222      0 :*
  22    30      0 :*
  24    18      1 :*
  26    18     15 :*
  28    46    159 :*
  30   207    963 :*
  32  1016   3724 := *
  34  4596  10099 :==== *
  36  9835  20741 :========= *
  38 23408  34278 :==================== *
  40 41534  47814 :=================================== *
  42 53471  58447 :============================================ *
  44 73080  64473 :====================================================*=======
  46 70283  65667 :=====================================================*====
  48 64918  62869 :===================================================*==
  50 65930  57368 :===============================================*=======
  52 47425  50436 :======================================= *
  54 36788  43081 :=============================== *
  56 33156  35986 :============================ *
  58 26422  29544 :====================== *
  60 21578  23932 :================== *
  62 19321  19187 :===============*
  64 15988  15259 :============*=
  66 14293  12060 :=========*==
  68 11679   9486 :=======*==
  70 10135   7434 :======*==

FastA Output

Page 63: Bioinformatica t5-database searching

  72  8957   5809 :====*===
  74  7728   4529 :===*===
  76  6176   3525 :==*===
  78  5363   2740 :==*==
  80  4434   2128 :=*==
  82  3823   1628 :=*==
  84  3231   1289 :=*=
  86  2474    998 :*==
  88  2197    772 :*=
  90  1716    597 :*=
  92  1430    462 :*=   :===============*========================
  94  1250    358 :*=   :============*===========================
  96   954    277 :*    :=========*=======================
  98   756    214 :*    :=======*===================
 100   678    166 :*    :=====*==================
 102   580    128 :*    :====*===============
 104   476     99 :*    :===*=============
 106   367     77 :*    :==*==========
 108   309     59 :*    :==*========
 110   287     46 :*    :=*========
 112   206     36 :*    :=*======
 114   161     28 :*    :*=====
 116   144     21 :*    :*====
 118   127     16 :*    :*====
>120   886     13 :*    :*==============================

Related

FastA Output

Page 64: Bioinformatica t5-database searching

• A summary of the statistics and of the program parameters follows the histogram.
  – An important number in this summary is the Kolmogorov-Smirnov statistic, which indicates how well the actual data fit the theoretical statistical distribution. The lower this value, the better the fit, and the more reliable the statistical estimates.
  – In general, a Kolmogorov-Smirnov statistic under 0.1 indicates a good fit with the theoretical model. If the statistic is higher than 0.2, the statistics may not be valid, and it is recommended to repeat the search, using more stringent (more negative) values for the gap penalty parameters.

FastA Output

Page 65: Bioinformatica t5-database searching

Statistics summary

• Optimal local alignment scores for pairs of random amino acid sequences of the same length follow an extreme-value distribution. For any score S, the probability of observing a score >= S is given by the Karlin-Altschul statistic: $P(\text{score} \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$

• k and λ are parameters related to the position of the maximum and the width of the distribution.

• Note the long tail at the right. This means that a score several standard deviations above the mean has a higher probability of arising by chance (that is, it is less significant) than if the scores followed a normal distribution.

Page 66: Bioinformatica t5-database searching

P-values

• Many programs report P = the probability that the alignment is no better than random. The relationship between Z and P depends on the distribution of the scores from the control population, which do NOT follow the normal distribution
  – P <= 10^-100 (exact match)
  – P in range 10^-100 to 10^-50 (sequences nearly identical, e.g. alleles or SNPs)
  – P in range 10^-50 to 10^-10 (closely related sequences, homology certain)
  – P in range 10^-5 to 10^-1 (usually distant relatives)
  – P > 10^-1 (match probably insignificant)

Page 67: Bioinformatica t5-database searching

E

• For database searches, most programs report E-values. The E-value of an alignment is the expected number of sequences that give the same Z-score or better if the database is probed with a random sequence. E is found by multiplying the value of P by the size of the database probed. Note that E, but not P, depends on the size of the database. Values of P are between 0 and 1. Values of E are between 0 and the number of sequences in the database searched:
  – E <= 0.02: sequences probably homologous
  – E between 0.02 and 1: homology cannot be ruled out
  – E > 1: you would have to expect this good a match just by chance

Page 68: Bioinformatica t5-database searching

DataBase Searching

Dynamic Programming Reloaded

Database Searching
  Fasta
  Blast
  Statistics
  Practical Guide

Extensions
  PSI-Blast
  PHI-Blast
  Local Blast
  BLAT

Page 69: Bioinformatica t5-database searching

BLAST is actually a family of programs:

• BLASTN - Nucleotide query searching a nucleotide database.

• BLASTP - Protein query searching a protein database.

• BLASTX - Translated nucleotide query sequence (6 frames) searching a protein database.

• TBLASTN - Protein query searching a translated nucleotide (6 frames) database.

• TBLASTX - Translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) database.

Blast

Page 70: Bioinformatica t5-database searching

Blast

Page 71: Bioinformatica t5-database searching

Blast

Page 72: Bioinformatica t5-database searching

Blast

Page 73: Bioinformatica t5-database searching

Blast

Page 74: Bioinformatica t5-database searching

Blast

Page 75: Bioinformatica t5-database searching

Blast

Page 76: Bioinformatica t5-database searching

Blast

Page 77: Bioinformatica t5-database searching
Page 78: Bioinformatica t5-database searching
Page 79: Bioinformatica t5-database searching
Page 80: Bioinformatica t5-database searching
Page 81: Bioinformatica t5-database searching
Page 82: Bioinformatica t5-database searching
Page 83: Bioinformatica t5-database searching
Page 84: Bioinformatica t5-database searching
Page 85: Bioinformatica t5-database searching

• Be aware of what options you have selected when using BLAST, or FASTA implementations.

• Treat BLAST searches as scientific experiments

• So you should try your searches with the filters on and off to see whether it makes any difference to the output

Tips

Page 86: Bioinformatica t5-database searching

Tips: Low-complexity and Gapped Blast Algorithm

• The common, Web-based ones often have default settings that will affect the outcome of your searches. By default all NCBI BLAST implementations filter out biased sequence composition from your query sequence (e.g. signal peptide and transmembrane sequences - beware!).

• The SEG program has been implemented as part of the blast routine in order to mask low-complexity regions

• Low-complexity regions are denoted by strings of Xs in the query sequence

Page 87: Bioinformatica t5-database searching

• The sequence databases contain a wealth of information. They also contain a lot of errors. Contaminants …

• Annotation errors, frameshifts that may result in erroneous conceptual translations.

• Hypothetical proteins ?

• In the words of Fox Mulder, "Trust no one."

Tips

Page 88: Bioinformatica t5-database searching

• Once you get a match to things in the databases, check whether the match is to the entire protein, or to a domain. Don't immediately assume that a match means that your protein carries out the same function (see above). Compare your protein and the match protein(s) along their entire lengths before making this assumption.

Tips

Page 89: Bioinformatica t5-database searching

• Domain matches can also cause problems by hiding other informative matches. For instance if your protein contains a common domain you'll get significant matches to every homologous sequence in the database. BLAST only reports back a limited number of matches, ordered by P value.

• If this list consists only of matches to the same domain, cut this bit out of your query sequence and do the BLAST search again with the edited sequence (e.g. NHR).

Tips

Page 90: Bioinformatica t5-database searching

• Do controls wherever possible, in particular when you use a particular search program for the first time.

• Suitable positive controls would be protein sequences known to have distant homologues in the databases to check how good the software is at detecting such matches.

• Negative controls can be employed to make sure the compositional bias of the sequence isn't giving you false positives. Shuffle your query sequence and see what difference this makes to the matches that are returned. A real match should be lost upon shuffling of your sequence.

Tips

Page 91: Bioinformatica t5-database searching

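• Perform Controls

#!/usr/bin/perl -w
use strict;

# Shuffle the residues of a single FASTA record read from STDIN or a file
my ($def, @seq) = <>;          # first line is the definition line
print $def;
chomp @seq;
@seq = split(//, join("", @seq));
my $count = 0;
while (@seq) {
    my $index = rand(@seq);    # splice truncates this to an integer index
    my $base = splice(@seq, $index, 1);
    print $base;
    print "\n" if ++$count % 60 == 0;
}
print "\n" unless $count % 60 == 0;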

Tips

Page 92: Bioinformatica t5-database searching

• Read the footer first

• View results graphically

• Parse Blasts with Bioperl

Tips

Page 93: Bioinformatica t5-database searching

• BLAST's major advantage is its speed.
  – 2-3 minutes for BLAST versus several hours for a sensitive FastA search of the whole of GenBank.

• When both programs use their default setting, BLAST is usually more sensitive than FastA for detecting protein sequence similarity.
  – Since it doesn't require a perfect sequence match in the first stage of the search.

FastA vs. Blast

Page 94: Bioinformatica t5-database searching

Weakness of BLAST:

– The long word size it uses in the initial stage of DNA sequence similarity searches was chosen for speed, and not sensitivity.

– For a thorough DNA similarity search, FastA is the program of choice, especially when run with a lowered KTup value.

– FastA is also better suited to the specialised task of detecting genomic DNA regions using a cDNA query sequence, because it allows the use of a gap extension penalty of 0. BLAST, which only creates ungapped alignments, will usually detect only the longest exon, or fail altogether.

• In general, a BLAST search using the default parameters should be the first step in a database similarity search strategy. In many cases, this is all that may be required to yield all the information needed, in a very short time.

FastA vs. Blast

Page 95: Bioinformatica t5-database searching

DataBase Searching

Dynamic Programming Reloaded

Database Searching
  Fasta
  Blast
  Statistics
  Practical Guide

Extensions
  PSI-Blast
  PHI-Blast
  Local Blast
  BLAT

Page 96: Bioinformatica t5-database searching

1. Old (ungapped) BLAST

2. New BLAST (allows gaps)

3. Profile -> PSI Blast - Position Specific Iterated

· Strategy:
    Multiple alignment of the hits
    Calculates a position-specific score matrix
    Searches with this matrix

· In many cases is much more sensitive to weak but biologically relevant sequence similarities

· PSSM !!!

PSI-Blast

Page 97: Bioinformatica t5-database searching

• Patterns of conservation from the alignment of related sequences can aid the recognition of distant similarities.
  – These patterns have been variously called motifs, profiles, position-specific score matrices, and Hidden Markov Models.

For each position in the derived pattern, every amino acid is assigned a score.

(1) Highly conserved residue at a position: that residue is assigned a high positive score, and others are assigned high negative scores.

(2) Weakly conserved positions: all residues receive scores near zero.

(3) Position-specific scores can also be assigned to potential insertions and deletions.

PSI-Blast
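How position-specific scores arise from an alignment can be illustrated with a small Python sketch. It is a simplified illustration, not PSI-BLAST's actual procedure: it assumes an ungapped alignment of equal-length sequences, a uniform background frequency for all 20 amino acids, and a fixed pseudocount, and it applies no sequence weighting. The function name build_pssm and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one dict per column mapping each amino acid to a log-odds score (base 2)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


if __name__ == "__main__":
    profile = build_pssm(["RGD", "RGD", "RGE", "KGD"])
    for position, scores in enumerate(profile, start=1):
        top = sorted(scores, key=scores.get, reverse=True)[:3]
        print(position, [(aa, round(scores[aa], 2)) for aa in top])
```

In the conserved middle column G receives a high positive score and every other residue a negative one, while residues seen only occasionally at a position score closer to zero.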

Page 98: Bioinformatica t5-database searching

Pattern

• a set of alternative sequences, using “regular expressions”

• Prosite (http://www.expasy.org/prosite/)

Page 99: Bioinformatica t5-database searching

PSSM (Position Specific Scoring Matrix)

Page 100: Bioinformatica t5-database searching

PSSM (Position Specific Scoring Matrix)

Page 101: Bioinformatica t5-database searching

PSSM (Position Specific Scoring Matrix)

Page 102: Bioinformatica t5-database searching

• The power of profile methods can be further enhanced through iteration of the search procedure.

– After a profile is run against a database, new similar sequences can be detected. A new multiple alignment, which includes these sequences, can be constructed, a new profile abstracted, and a new database search performed.

– The procedure can be iterated as often as desired or until convergence, when no new statistically significant sequences are detected.

PSI-Blast

Page 103: Bioinformatica t5-database searching

(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program.

(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found.

The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions.

(3) The profile is compared to the protein database, again seeking local alignments using the BLAST algorithm.

(4) PSI-BLAST estimates the statistical significance of the local alignments found.

Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.

(5) Finally, PSI-BLAST iterates, by returning to step (2), a specified number of times or until convergence.
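The control flow of steps (1) to (5) can be sketched as a generic loop in Python. This is only an illustration of "iterate until convergence", not PSI-BLAST: the toy search below replaces the profile by a plain list of sequences and counts identical positions instead of computing statistics, and all names (`iterate_search`, `toy_search`, `toy_build_model`, `TOY_DB`, `MIN_MATCHES`) are invented for this example.

```python
def iterate_search(query, search, build_model, max_iterations=5):
    """Repeat search -> rebuild model until no new hits appear.

    search(model) returns the names of significant hits for the model;
    build_model(query, hits) returns the model for the next round.
    Returns the set of included hits and the number of rounds run.
    """
    included = set()
    model = build_model(query, included)
    for iteration in range(1, max_iterations + 1):
        hits = set(search(model))
        if hits <= included:          # convergence: nothing new was found
            return included, iteration
        included |= hits
        model = build_model(query, included)
    return included, max_iterations

# Toy database of equal-length sequences (made up for illustration).
TOY_DB = {
    "a": "CKHVFCRVCI",
    "b": "CKHVFCRVAA",
    "c": "CKHVAARVAA",
    "d": "WWWWWWWWWW",
}
MIN_MATCHES = 8  # arbitrary threshold standing in for "significant"

def matches(s, t):
    """Number of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(s, t))

def toy_search(model):
    # A database entry is a hit if it is close to any sequence in the model.
    return {name for name, seq in TOY_DB.items()
            if max(matches(seq, member) for member in model) >= MIN_MATCHES}

def toy_build_model(query, hits):
    # Stand-in for "multiple alignment + profile": just collect the sequences.
    return [query] + [TOY_DB[name] for name in sorted(hits)]

included, rounds = iterate_search("CKHVFCRVCI", toy_search, toy_build_model)
```

Sequence `c` is too far from the query to be found in the first round, but it is picked up in the second round because it is close to `b`. The same mechanism is why a false positive admitted in one round can pull in further unrelated sequences in the next.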

PSI-Blast

Page 104: Bioinformatica t5-database searching

From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html

PSI-BLAST

PSSM

PSSM

Page 105: Bioinformatica t5-database searching

PSI-BLAST

Page 106: Bioinformatica t5-database searching

PSI-BLAST

Page 107: Bioinformatica t5-database searching

PSI-BLAST

Page 108: Bioinformatica t5-database searching

PSI-BLAST

Page 109: Bioinformatica t5-database searching

• Avoid sequences that are too closely related: they overfit the profile!

• False homologs can be included! Therefore check the matches carefully: include or exclude sequences based on biological knowledge.

• The E-value reflects the significance of the match to the previous training set, not to the original sequence!

• Choose your query sequence carefully.

• Try the reverse experiment to confirm the result.

PSI-BLAST pitfalls

Page 110: Bioinformatica t5-database searching

• A single sequence is selected from a set of blocks and enriched by replacing the conserved regions delineated by the blocks by consensus residues derived from the blocks.

• Embedding consensus residues improves performance

• S. Henikoff and J.G. Henikoff; Protein Science (1997) 6:698-705.

Reduce overfitting risk by Cobbler

Page 111: Bioinformatica t5-database searching

DataBase Searching

Dynamic Programming Reloaded

Database Searching: Fasta, Blast, Statistics, Practical Guide

Extensions: PSI-Blast, PHI-Blast, Local Blast, BLAT

Page 112: Bioinformatica t5-database searching

PHI-Blast Local Blast (Pattern-Hit Initiated BLAST)

Page 113: Bioinformatica t5-database searching

PHI-Blast Local Blast

From: http://bioweb.pasteur.fr/seqanal/blast/intro-uk.html

Page 114: Bioinformatica t5-database searching

PHI-Blast Local Blast

Page 115: Bioinformatica t5-database searching

PHI-Blast Local Blast

Page 116: Bioinformatica t5-database searching

PHI-Blast Local Blast

Page 117: Bioinformatica t5-database searching

DataBase Searching

Dynamic Programming Reloaded

Database Searching: Fasta, Blast, Statistics, Practical Guide

Extensions: PSI-Blast, PHI-Blast, Local Blast, BLAT

Page 118: Bioinformatica t5-database searching

Installing Blast Locally

• 2 flavors: NCBI/WuBlast

• Executables:

– ftp://ftp.ncbi.nih.gov/blast/executables/

• Database:

– ftp://ftp.ncbi.nih.gov/blast/db/

• Formatdb

– formatdb -i ecoli.nt -p F

– formatdb -i ecoli.protein -p T

• For options: blastall -

– blastall -p blastp -i query -d database -o output
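For scripted use, the commands above can be assembled and launched from Python. The sketch below only uses the flags shown on this slide (`-i`, `-p`, `-d`, `-o`); the helper names are made up for this example, and the commands are executed only if the legacy `formatdb` and `blastall` programs are actually installed. Newer NCBI BLAST+ releases ship differently named programs (`makeblastdb`, `blastp`, ...), so treat this as a sketch for the legacy toolkit.

```python
import shutil
import subprocess

def formatdb_command(fasta_path, is_protein):
    """Build the formatdb call: -p T for protein, -p F for nucleotide."""
    return ["formatdb", "-i", fasta_path, "-p", "T" if is_protein else "F"]

def blastall_command(program, query_path, database, output_path):
    """Build the blastall call using the four options from the slide."""
    return ["blastall", "-p", program, "-i", query_path,
            "-d", database, "-o", output_path]

def run_if_available(command):
    """Run the command only if its executable is on the PATH.

    Returns True if the command was run, False if it was skipped.
    """
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True

# Commands matching the examples on the slide (not executed here).
format_nt = formatdb_command("ecoli.nt", is_protein=False)
format_prot = formatdb_command("ecoli.protein", is_protein=True)
search = blastall_command("blastp", "query", "database", "output")
```

Passing the arguments as a list rather than one shell string avoids quoting problems with file names; a typical session would call `run_if_available(format_prot)` once and then `run_if_available(search)` for each query.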

Page 119: Bioinformatica t5-database searching

DataBase Searching

Dynamic Programming Reloaded

Database Searching: Fasta, Blast, Statistics, Practical Guide

Extensions: PSI-Blast, PHI-Blast, Local Blast, BLAT

Page 120: Bioinformatica t5-database searching

Main database: BLAT

• BLAT: BLAST-Like Alignment Tool

• Aligns the input sequence to the Human Genome

• Connected to several databases, like:

– mRNAs - GenScan

– ESTs - TwinScan

– RepeatMasker - UniGene

– RefSeq - CpG Islands

Page 121: Bioinformatica t5-database searching

BLAT Human Genome Browser

Page 122: Bioinformatica t5-database searching

BLAT method

• Align sequence with BLAT, get alignment info

• Per BLAT hit, pick up additional info from connected databases:

– mRNAs

– ESTs

– RepeatMasker

– CpG Islands

– RefSeq Genes

Page 123: Bioinformatica t5-database searching
Page 124: Bioinformatica t5-database searching

Weblems

W5.1: Submit the amino acid sequence of papaya papain to a BLAST search (gapped and ungapped) and to a PSI-BLAST search. What are the main differences in the results?

W5.2: Is there a relationship between Klebsiella aerogenes urease, Pseudomonas diminuta phosphotriesterase and mouse adenosine deaminase? Also use DALI, ClustalW and T-Coffee.

W5.3: Yeast two-hybrid typically yields DNA sequences. How would you find the corresponding protein?

W5.4: When and why would you use tblastn?

W5.5: How would you search a database if you want to restrict the search space to those entries having a secretion signal consisting of 4 consecutive (N-terminal) basic residues?