Page 1: BLAST

Lecture 3.1 1

BLAST

Page 2: BLAST

Lecture 3.1 2

BLAST

• Basic Local Alignment Search Tool

• Developed in 1990 and 1997 (S. Altschul)

• A heuristic method for performing local alignments through searches of high-scoring segment pairs (HSPs)

• First to use statistics to predict the significance of initial matches, which saves on false leads

• Offers both sensitivity and speed

Page 3: BLAST

Lecture 3.1 3

BLAST

• Looks for clusters of nearby or locally dense “similar or homologous” k-tuples

• Uses “look-up” tables to shorten search time

• Uses a larger “word size” than FASTA to accelerate the search process

• Performs local alignment only (ungapped in the 1990 version, gapped since 1997)

• Fastest and most frequently used sequence alignment tool: THE STANDARD
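The look-up table idea can be sketched in a few lines. This is a simplified illustration, not BLAST's actual implementation: it indexes only exact words, whereas BLAST also indexes "neighbourhood" words that score above the threshold T. The function names are hypothetical.

```python
from collections import defaultdict


def build_word_index(sequence, word_size=3):
    """Map every overlapping word (k-tuple) in `sequence` to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        index[sequence[start:start + word_size]].append(start)
    return index


def find_seed_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where the same word occurs in both."""
    index = build_word_index(query, word_size)
    hits = []
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        for q_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


def group_by_diagonal(hits):
    """Group hits by diagonal (subject_pos - query_pos); dense diagonals suggest an HSP."""
    diagonals = defaultdict(list)
    for q_pos, s_pos in hits:
        diagonals[s_pos - q_pos].append(q_pos)
    return dict(diagonals)


hits = find_seed_hits("KIQIYG", "AAKIQIYGAA")
print(group_by_diagonal(hits))
```

Several hits on the same diagonal are the "clusters of nearby k-tuples" that a real search would then try to extend into a high-scoring segment pair.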

Page 4: BLAST

Lecture 3.1 4

BLAST Access

• NCBI BLAST
  http://www.ncbi.nlm.nih.gov/BLAST/

• Canadian Bioinformatics Resource BLAST
  http://cbr-rbc.nrc-cnrc.gc.ca/blast/

• European Bioinformatics Institute BLAST
  http://www.ebi.ac.uk/blastall/
  http://www.ebi.ac.uk/blast2/

Page 8: BLAST

Lecture 3.1 8

Different Flavours of BLAST

• BLASTP - protein query against protein DB

• BLASTN - DNA/RNA query against GenBank (DNA)

• BLASTX - 6-frame translated DNA query against protein DB

• TBLASTN - protein query against 6-frame translated GenBank (DNA)

• TBLASTX - 6-frame translated DNA query against 6-frame translated GenBank (DNA)

• PSI-BLAST - protein ‘profile’ query against protein DB

• PHI-BLAST - protein pattern against protein DB
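The "6-frame translation" used by BLASTX, TBLASTN and TBLASTX can be illustrated with a short sketch. It assumes the standard genetic code and plain A/C/G/T input; the function names are hypothetical and this is not the code BLAST itself uses.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper().replace("U", "T")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA"))
```

Each of the six protein strings is then compared as if it were an ordinary protein query or database entry.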

Page 9: BLAST

Lecture 3.1 9

Other BLAST Services

• MEGABLAST - for comparison of large sets of long DNA sequences

• RPS-BLAST - Conserved Domain Detection

• BLAST 2 Sequences - for performing pairwise alignments for 2 chosen sequences

• Genomic BLAST - for alignments against select human, microbial or malarial genomes

• VecScreen - for detecting cloning vector contamination in sequenced data

Page 10: BLAST

Lecture 3.1 10

Running NCBI BLAST

Page 11: BLAST

Lecture 3.1 11

MT0895

• MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIMGRVASKEEIKKILS

Page 12: BLAST

Lecture 3.1 12

Running NCBI BLAST

• Paste in sequence (FASTA format, raw sequence or type in GI or accession number)

>Mysequence MT0895
KIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIDS

OR

>
KIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIDS

OR

KIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIDS

OR a GI or accession number
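The accepted input forms differ only in whether a ">" header line is present. A minimal sketch of a parser that handles both FASTA and raw input is given here; `parse_query` is a hypothetical helper, not part of any NCBI tool, and it does not handle GI or accession numbers.

```python
def parse_query(text):
    """Return (header, sequence) from FASTA-formatted or raw sequence text."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header = ""
    if lines and lines[0].startswith(">"):
        # Everything after '>' on the first line is the description.
        header = lines[0][1:].strip()
        lines = lines[1:]
    # Join the remaining lines, dropping internal whitespace.
    sequence = "".join("".join(line.split()) for line in lines).upper()
    return header, sequence


print(parse_query(">Mysequence MT0895\nKIQIYGTGCANCQ\nMLEKNAREAVKELG"))
print(parse_query("kiqiygtgcancq"))
```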

Page 13: BLAST

Lecture 3.1 13

Running NCBI BLAST

• Choose a range of interest in the sequence with “set subsequences” (not usually used)

• Select the database from the pull-down menu (usually choose nr = non-redundant)

• Keep the CD Search “check box” on

• Leave “Options” unchanged (use defaults)

• Go to the “Format” menu and adjust the number of descriptions and alignments as desired

Page 14: BLAST

Lecture 3.1 14

Running NCBI BLAST

Select Database

Page 15: BLAST

Lecture 3.1 15

Conserved Domain Database

• Contains a collection of pre-identified functional or structural domains

• Derived from the Pfam and SMART databases as well as other sources

• Uses Reverse Position Specific BLAST (RPS-BLAST) to perform the search

• Query sequence is compared to a PSSM derived from each of the aligned domains

Page 16: BLAST

Lecture 3.1 16

Running NCBI BLAST

Click BLAST!

Page 17: BLAST

Lecture 3.1 17

Formatting Results

Page 18: BLAST

Lecture 3.1 18

BLAST Format Options

Page 19: BLAST

Lecture 3.1 19

BLAST Output

Page 25: BLAST

Lecture 3.1 25

BLAST Parameters

• Identities - number and % of exact residue matches

• Positives - number and % of similar plus identical residue matches

• Gaps - number and % of gaps introduced

• Score - summed HSP score (S)

• Bit Score - a normalized score (S’)

• Expect (E) - expected number of HSP alignments arising by chance

• P - probability of getting a score > X by chance

• T - minimum word or k-tuple score (Threshold)
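How Score, Bit Score, Expect and P relate can be sketched with the standard Karlin-Altschul relations: S' = (λS - ln K) / ln 2, E = m·n·2^(-S') and P = 1 - e^(-E). The λ and K values below are the commonly quoted ones for gapped BLOSUM62 (11/1) and are used only for illustration; the query and database lengths are made up, and real BLAST also applies edge-effect corrections that are ignored here.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance HSPs with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e_value):
    """Probability of at least one chance HSP, given the E-value."""
    return -math.expm1(-e_value)  # numerically stable form of 1 - exp(-E)


# Illustrative numbers only (assumed, not taken from the slides).
LAMBDA, K = 0.267, 0.041
bits = bit_score(raw_score=150, lam=LAMBDA, k=K)
e = expect_value(bits, query_len=75, db_len=1_000_000_000)
print(f"bits={bits:.1f}  E={e:.2e}  P={p_value(e):.2e}")
```

For small E the two measures are nearly equal (P ≈ E), which is why BLAST output reports E alone.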

Page 26: BLAST

Lecture 3.1 26

BLAST - Rules of Thumb

• Expect (E-value) is equal to the number of BLAST alignments with a given Score that are expected to be seen simply due to chance

• Don’t trust a BLAST alignment with an Expect value > 0.01 (the grey zone is between 0.01 and 1)

• Expect and Score are related, but Expect contains more information. Note that % Identities is more useful than the bit Score

• Recall Doolittle’s Curve (%ID vs. length, next slide): %ID > 30 - numres/50

• If uncertain about a hit, perform a PSI-BLAST search
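The two numeric rules above can be written as a small check. The formula and the 0.01 and 1 cut-offs come from the slide; the function names and the returned labels are hypothetical, and reporting both checks together is an assumption made for illustration.

```python
def identity_threshold(num_residues):
    """Minimum %ID suggested by the slide's rule: 30 - numres/50."""
    return 30.0 - num_residues / 50.0


def assess_hit(percent_identity, num_residues, e_value):
    """Report how a hit fares against each rule of thumb separately."""
    if e_value <= 0.01:
        e_verdict = "trusted"
    elif e_value <= 1.0:
        e_verdict = "grey zone"
    else:
        e_verdict = "not trusted"
    return {
        "expect": e_verdict,
        "identity_ok": percent_identity > identity_threshold(num_residues),
    }


print(identity_threshold(100))
print(assess_hit(percent_identity=25.0, num_residues=100, e_value=0.5))
```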

Page 27: BLAST

Lecture 3.1 27

Doolittle’s Curve

[Figure: Evolutionary Distance vs. Percent Sequence Identity. x-axis: Number of Residues (0 to 400); y-axis: Sequence Identity (%) (0 to 120); the region labelled “Twilight Zone” is marked on the plot.]

Page 28: BLAST

Lecture 3.1 28

Getting the Most from BLAST

Page 30: BLAST

Lecture 3.1 30

BLAST Options

• Composition-based statistics (Yes)

• Sequence Complexity Filter (Yes)

• Expect (E) value (10)

• Word Size (3)

• Substitution or Scoring Matrix (BLOSUM62)

• Gap Insertion Penalty (11)

• Gap Extension Penalty (1)
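The slides describe the web form, but the same options can be expressed on the command line. The sketch below assumes the NCBI BLAST+ `blastp` program is installed and maps each option to what is believed to be the corresponding BLAST+ flag. The file names and database name are hypothetical, and the command is only built and printed, not run.

```python
import shlex


def build_blastp_command(query_path, database, output_path):
    """Build a blastp argument list mirroring the option values on the slide."""
    return [
        "blastp",
        "-query", query_path,
        "-db", database,
        "-out", output_path,
        "-comp_based_stats", "1",  # composition-based statistics: yes
        "-seg", "yes",             # low-complexity filter: yes
        "-evalue", "10",           # Expect threshold
        "-word_size", "3",         # word size
        "-matrix", "BLOSUM62",     # scoring matrix
        "-gapopen", "11",          # gap insertion (opening) penalty
        "-gapextend", "1",         # gap extension penalty
    ]


# Hypothetical file and database names.
command = build_blastp_command("mt0895.fasta", "nr", "mt0895_vs_nr.txt")
print(shlex.join(command))
```

Keeping the arguments as a list rather than one string makes it safe to hand to `subprocess.run` later without shell quoting problems.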

Page 31: BLAST

Lecture 3.1 31

Composition Statistics

• Recent addition to BLAST algorithm

• Permits calculated E (Expect) values to account for amino acid composition of queries and database hits

• Improves accuracy and reduces false positives

• Effectively conducts a different scoring procedure for each sequence in database

Page 32: BLAST

Lecture 3.1 32

LCRs (low-complexity regions)

• Watch out for…

– transmembrane or signal peptide regions

– coiled-coil regions

– short amino acid repeats (collagen, elastin)

– homopolymeric repeats

• BLAST uses SEG to mask amino acids

• BLAST uses DUST to mask bases
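To give a feel for what masking does, here is a simplified stand-in for SEG: a sliding window whose Shannon entropy falls below a cut-off is replaced with X. This is not the SEG algorithm itself, and the window length and entropy cut-off are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of `window`."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2, mask_char="X"):
    """Mask every window whose composition entropy is below `min_entropy`."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# A homopolymeric run followed by an ordinary stretch of sequence.
print(mask_low_complexity("QQQQQQQQQQQQQQKIQIYGTGCANCQMLEKNAREAV"))
```

Masked residues cannot seed word hits, so repeats of this kind no longer produce long lists of spurious matches.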

Page 33: BLAST

Lecture 3.1 33

Scoring Matrices

• BLOSUM Matrices

– Developed by Henikoff & Henikoff (1992)

– BLOcks SUbstitution Matrix

– Derived from the BLOCKS database

• PAM Matrices

– Developed by Schwartz and Dayhoff (1978)

– Point Accepted Mutation

– Derived from manual alignments of closely related proteins

Page 34: BLAST

Lecture 3.1 34

How to Make Your Own Matrix

Perform Alignment → Calculate Frequencies → Fill Substitution Matrix

ACDEFGH..
ACDEFGK..
AADEFGH..
GCDEFGH..
ACAEYGK..
ACAEFAH..

f(A,A) = #A_obs / #A_exp

f(C,A) = #C/A_obs / (#A_exp + #C_exp)

      A     C     D    ...
A    0.8    --    --
C    0.2   0.8    --
D    0.0   0.3   1.0
E    --    --    --
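The three steps can be sketched on the toy alignment from the slide. Note that this sketch substitutes the log-odds scoring used for BLOSUM-style matrices (observed pair frequency divided by the frequency expected from residue composition) for the slide's simpler frequency ratio, and it skips sequence weighting and pseudocounts. Pairs never observed in the alignment are simply absent from the result.

```python
import math
from collections import Counter
from itertools import combinations

# Toy alignment from the slide (trailing ".." dropped).
ALIGNMENT = ["ACDEFGH", "ACDEFGK", "AADEFGH", "GCDEFGH", "ACAEYGK", "ACAEFAH"]


def pair_counts(alignment):
    """Count unordered residue pairs within each alignment column."""
    counts = Counter()
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_matrix(alignment):
    """Return {(a, b): log2(observed / expected)} for every observed pair."""
    counts = pair_counts(alignment)
    total_pairs = sum(counts.values())
    observed = {pair: n / total_pairs for pair, n in counts.items()}
    # Residue frequencies implied by the pair frequencies.
    residue_freq = Counter()
    for (a, b), freq in observed.items():
        residue_freq[a] += freq / 2
        residue_freq[b] += freq / 2
    scores = {}
    for (a, b), freq in observed.items():
        expected = residue_freq[a] * residue_freq[b]
        if a != b:
            expected *= 2  # an unordered pair (a, b) can arise two ways
        scores[(a, b)] = math.log2(freq / expected)
    return scores


scores = log_odds_matrix(ALIGNMENT)
for pair in [("A", "A"), ("E", "E"), ("A", "C")]:
    print(pair, round(scores[pair], 2))
```

Positive scores mark pairs seen more often than composition alone predicts; negative scores mark pairs seen less often.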

Page 35: BLAST

Lecture 3.1 35

PAM versus BLOSUM

PAM

• First useful scoring matrix for protein

• Assumed a Markov model of evolution (i.e. all sites equally mutable and independent)

• Derived from small, closely related proteins with ~15% divergence

BLOSUM

• Much later entry to matrix “sweepstakes”

• No evolutionary model is assumed

• Built from PROSITE-derived sequence blocks

• Uses a much larger, more diverse set of protein sequences (30% - 90% ID)

Page 36: BLAST

Lecture 3.1 36

PAM versus BLOSUM

PAM

• Higher PAM numbers to detect more remote sequence similarities

• Lower PAM numbers to detect high similarities

• 1 PAM ~ 1 million years of divergence

• Errors in PAM 1 are scaled 250X in PAM 250

BLOSUM

• Lower BLOSUM numbers to detect more remote sequence similarities

• Higher BLOSUM numbers to detect high similarities

• Sensitive to structural and functional substitution

• Errors in BLOSUM arise from errors in alignment

Page 37: BLAST

Lecture 3.1 37

PAM Matrices

• PAM 40 - prepared by multiplying PAM 1 by itself a total of 40 times; best for short alignments with high similarity

• PAM 120 - prepared by multiplying PAM 1 by itself a total of 120 times; best for general alignment

• PAM 250 - prepared by multiplying PAM 1 by itself a total of 250 times; best for detecting distant sequence similarity
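"Multiplying PAM 1 by itself" means raising the one-step mutation probability matrix to a power. The sketch below uses a made-up 3-letter alphabet in which each residue stays unchanged with probability 0.99 per PAM unit. The real PAM 1 is a 20 x 20 matrix estimated from data, and the published scoring matrices are log-odds values derived from these probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_power(matrix, n):
    """Return matrix**n by repeated multiplication (n >= 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM 1": 1% of residues change per unit of divergence (assumed values).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

for n in (40, 120, 250):
    unchanged = mat_power(TOY_PAM1, n)[0][0]
    print(f"PAM {n}: P(residue unchanged) = {unchanged:.3f}")
```

The probability of a residue being unchanged falls as the PAM number rises, which is why higher-numbered matrices suit more distant relationships.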

Page 38: BLAST

Lecture 3.1 38

BLOSUM Matrices

• BLOSUM 90 - prepared from BLOCKS alignments after clustering sequences that share at least 90% identity; best for short alignments with high similarity

• BLOSUM 62 - prepared from BLOCKS alignments after clustering sequences that share at least 62% identity; best for general alignment (default)

• BLOSUM 30 - prepared from BLOCKS alignments after clustering sequences that share at least 30% identity; best for detecting weak local alignments

Page 39: BLAST

Lecture 3.1 39

Scraping the Bottom of the Barrel with PSI-BLAST

Page 40: BLAST

Lecture 3.1 40

PSI-BLAST Algorithm

1. Perform initial alignment with BLAST using the BLOSUM 62 substitution matrix

2. Construct a multiple alignment from matches

3. Prepare a position-specific scoring matrix (PSSM)

4. Use the PSSM profile as the scoring matrix for a second BLAST run against the database

5. Repeat steps 2-4 until convergence
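The PSSM step can be sketched as follows. This is a minimal illustration under simplifying assumptions (a uniform background frequency, a flat pseudocount, no sequence weighting and no gaps). PSI-BLAST itself uses more elaborate weighting, so the numbers here are not what PSI-BLAST would produce.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sequences, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*aligned_sequences):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for an ungapped segment as long as the PSSM."""
    return sum(column.get(res, 0.0) for column, res in zip(pssm, segment))


def best_window(pssm, sequence):
    """Return (start, score) of the best-scoring ungapped window."""
    width = len(pssm)
    starts = range(len(sequence) - width + 1)
    start = max(starts, key=lambda s: score_segment(pssm, sequence[s:s + width]))
    return start, score_segment(pssm, sequence[start:start + width])


# Toy multiple alignment standing in for the matches from a first BLAST pass.
pssm = build_pssm(["KIQIY", "KIQIY", "KIQLY"])
print(best_window(pssm, "AAAKIQLYAAA"))
```

Because each column has its own scores, a residue that is conserved at one position is rewarded there without being rewarded everywhere, which is the source of the added sensitivity.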

Page 41: BLAST

Lecture 3.1 41

PSI-BLAST

Page 42: BLAST

Lecture 3.1 42

PSI-BLAST

Press Iterate!

Page 45: BLAST

Lecture 3.1 45

PSI-BLAST

• For Protein Sequences ONLY

• Much more sensitive than BLAST

• Slower (iterative process)

• Often yields results that are as good as many common threading methods

• SHOULD BE YOUR FIRST CHOICE IN ANALYZING A NEW SEQUENCE

Page 46: BLAST

Lecture 3.1 46

BLAST against PDB

Page 47: BLAST

Lecture 3.1 47

Still Confused?

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

Page 48: BLAST

Lecture 3.1 48

Conclusions

• BLAST is the most important program in bioinformatics (maybe all of biology)

• BLAST is based on sound statistical principles (key to its speed and sensitivity)

• A basic understanding of its principles is key for using/interpreting BLAST output

• Use BLASTN or MEGABLAST for DNA

• Use PSI-BLAST for protein searches