Alignment to a database of sequences

Biology 162 Computational Genetics

Todd Vision, 7 Sep 2004

Preview

• How to speed up pairwise alignment

• How to statistically evaluate alignments and database matches

• How to use BLAST

• What database should you search?

Need for fast pairwise alignment

• Even O(mn) is too slow or too memory intensive for some applications
– Shotgun assembly (Phrap)
– Database search
– Alignment of long sequences

History

• FASTA (1983)
– First practical tool for fast pairwise alignment
– Restricts alignment to path-graph “hot-spots”

• BLAST (1990) (and WU-BLAST)
– Basic Local Alignment Search Tool
– Refinement of ideas for locating “hot-spots”
– Searching dB for “hot-spots” in sublinear time
– Probability of alignment scores

• Gapped BLAST and FASTA3 (1998, 2000)
– The two programs have largely converged
– Both can now produce gapped alignments with accurate statistics

Why is BLAST so fast?

• Exclusion of database sequences unlikely to contain good alignments

• Preprocessing of database to enable fast matching algorithms

• Filling out only a small part of the path graph

Query                      Subject                    Flavor
Nucleotide                 Nucleotide                 BLASTN
Protein                    Protein                    BLASTP
Nucleotide (translated)    Protein                    BLASTX
Protein                    Nucleotide (translated)    TBLASTN
Nucleotide (translated)    Nucleotide (translated)    TBLASTX

How BLAST works

• Segment pair - a pair of equal length substrings in query and dB

• Locally maximal segment pair (LMSP) - ungapped score would decrease by extending or shortening either end

• High scoring pair (HSP) - the LMSP of highest score between the query and a single DB sequence

• BLAST heuristically finds a gapped alignment for one or more HSPs that has a score above some threshold

How BLAST works

• Query broken into short overlapping words (default 11 nt or 3 aa)

• Up to 50 high scoring word neighbors (with minimum score T) determined for each word in query

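[Figure: the query LRQNMLAC is broken into overlapping words (LRQ, RQN, ...); word neighbors shown for LRQ: LRE, LRO, LRT, LRG, LRT]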
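The word and neighbor step can be sketched in a few lines. This is only an illustration: `toy_score` is a hypothetical stand-in for a real substitution matrix (+4 for identity, -1 otherwise), the threshold of 7 is arbitrary, and no cap of 50 neighbors is applied.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def toy_score(a, b):
    """Hypothetical substitution score (not BLOSUM62): +4 identity, -1 mismatch."""
    return 4 if a == b else -1

def words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def neighbors(word, threshold, score=toy_score):
    """All words whose ungapped score against `word` is at least `threshold`."""
    result = []
    for cand in product(ALPHABET, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, cand))
        if s >= threshold:
            result.append("".join(cand))
    return result

query = "LRQNMLAC"
print(words(query))                      # overlapping 3-letter words
print(len(neighbors("LRQ", threshold=7)))  # LRQ itself plus all single-mismatch words
```

With a real matrix such as BLOSUM62 the neighbor list would depend on which substitutions are conservative, not just on the number of mismatches.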

How BLAST works

• Preprocessed database searched for exact matches to all high scoring words
– Generation of words and search for matches are both linear in length of query

• A memory intensive data structure allows a fast algorithm
– We will see how this can be achieved with suffix trees later
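As a rough picture of the preprocessing idea, the sketch below indexes every word of a tiny database in a hash table and then looks up query words in it. This is a simplification chosen for brevity (a dictionary, not a suffix tree and not BLAST's actual data structure); the sequences and names are made up.

```python
from collections import defaultdict

def build_index(database, w=3):
    """Map every length-w word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index

def find_hits(query_words, index):
    """query_words: list of (query_pos, word). Returns (seq_id, query_pos, subject_pos)."""
    hits = []
    for qpos, word in query_words:
        for seq_id, spos in index.get(word, []):
            hits.append((seq_id, qpos, spos))
    return hits

# Hypothetical two-sequence database
database = {"s1": "MKLRQNAA", "s2": "GGGLRTGG"}
index = build_index(database)

# Words (and neighbors) from the query LRQNMLAC, tagged with their query position
hits = find_hits([(0, "LRQ"), (0, "LRT"), (1, "RQN")], index)
print(hits)
```

The index is built once per database and costs memory proportional to the database size; each lookup is then independent of how many sequences the database holds.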

How BLAST works

• Segment pairs on the same diagonal and within A cells of each other are extended in both directions
– Extension proceeds until the dropoff from the highest score is too great
– If the score is sufficiently high, and no other overlapping LMSP is higher, the highest scoring extension is considered to be an HSP
– Extension is the time consuming step of BLAST

• Requiring two close segment pairs on the same diagonal means that most dB sequences can be discarded before the extension step

• HSPs are joined by a gapped alignment
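The dropoff rule for ungapped extension can be sketched as follows. This is a minimal illustration, not BLAST's code: it extends a single word hit (the two-hit requirement is omitted), uses the nucleotide match/mismatch scores +1/-3, and the function names and the `x_drop` value are assumptions.

```python
def nuc_score(a, b):
    """Match reward +1, mismatch penalty -3."""
    return 1 if a == b else -3

def extend_right(query, subject, qpos, spos, score, x_drop):
    """Extend rightwards along the diagonal; return (best score, length giving it)."""
    best = running = 0
    best_len = 0
    i = 0
    while qpos + i < len(query) and spos + i < len(subject):
        running += score(query[qpos + i], subject[spos + i])
        i += 1
        if running > best:
            best, best_len = running, i
        elif best - running > x_drop:
            break  # dropped too far below the best score seen so far
    return best, best_len

def extend_hit(query, subject, qpos, spos, w, score, x_drop):
    """Ungapped extension of a length-w word hit in both directions.

    Returns (score, query start, subject start, segment length).
    """
    seed = sum(score(query[qpos + k], subject[spos + k]) for k in range(w))
    right, rlen = extend_right(query, subject, qpos + w, spos + w, score, x_drop)
    # Leftward extension is a rightward extension over the reversed prefixes.
    left, llen = extend_right(query[:qpos][::-1], subject[:spos][::-1], 0, 0, score, x_drop)
    return seed + left + right, qpos - llen, spos - llen, w + llen + rlen

q = "TTACGTACGTGG"
s = "CCACGTACGTAA"
print(extend_hit(q, s, 4, 4, 3, nuc_score, x_drop=5))
```

Here the seed GTA at position 4 grows into the segment ACGTACGT (positions 2 to 9 in both sequences), after which mismatches push the running score below the dropoff.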

Alignment probabilities

• Each HSP has an associated score (sij)

• We know the scores for every possible amino acid pair i, j

• We can compute the frequency of each i, j in a random alignment

• Why not simply compute the probability of obtaining a score at least as good as S?

ELVIS
||| |
ELVYS

BLAST statistics

• Scores are inflated over a naive probability model for three reasons
– We have optimized the alignment
– We are taking the MSP from that alignment and not a random pair of aligned segments
– We are only reporting the highest scoring alignments after searching a large database

• The expected number of matches that meet or surpass our observed score is given by the extreme value (or Gumbel) distribution

opt E()

< 20 173 0:==

22 0 0: one = represents 122 library sequences

24 2 0:=

26 1 1:*

28 8 16:*

30 37 95:*

32 170 367:== *

34 714 996:====== *

36 1809 2045:=============== *

38 3729 3379:===========================*===

40 5555 4713:======================================*=======

42 6991 5761:===============================================*==========

44 7264 6355:====================================================*=======

46 6786 6473:=====================================================*==

48 6059 6197:==================================================*

50 4920 5655:========================================= *

52 4204 4972:=================================== *

54 3410 4247:============================ *

56 2957 3547:========================= *

58 2469 2912:===================== *

60 2280 2359:===================*

62 1872 1891:===============*

64 1450 1504:============*

66 1176 1189:=========*

68 1010 935:=======*=

70 817 733:======*

72 668 573:====*=

74 501 446:===*=

76 431 347:==*=

78 342 270:==*

80 315 210:=*=

82 214 160:=*

84 161 127:=*

86 116 98:*

88 116 76:* inset = represents 2 library sequences

90 85 59:*

92 47 46:* :======================*=

94 55 35:* :=================*==========

96 32 27:* :=============*==

98 28 21:* :==========*===

100 22 16:* :=======*===

102 10 13:* :===== *

104 8 10:* :====*

106 10 8:* :===*=

108 6 6:* :==*

110 6 5:* :==*

112 4 4:* :=*

114 3 3:* :=*

116 2 2:* :*

118 1 2:* :*

>120 67 1:* :*=================================

Bit scores, E-values

• Raw score cannot be compared among searches

• Bit score is a function of raw score, scoring matrix and amino acid frequencies in database.
– Can be compared among searches
– Requires parameters λ and K

• Expect value (E) depends on length of query (m) and subject (n)
– Cannot be compared among searches of different dBs

$$S' = \frac{\lambda S - \ln(K)}{\ln(2)}, \qquad E = \frac{mn}{2^{S'}}$$
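A small numerical sketch of the two conversions, S' = (λS − ln K)/ln 2 and E = mn/2^S'. The values of λ and K below are illustrative assumptions only (real values depend on the scoring matrix and residue frequencies), as are the lengths m and n.

```python
import math

def bit_score(raw, lam, K):
    """Convert a raw score S to a bit score S' given lambda and K."""
    return (lam * raw - math.log(K)) / math.log(2)

def expect(bits, m, n):
    """Expected number of matches with bit score >= bits for lengths m and n."""
    return m * n / 2 ** bits

# Illustrative parameter values (assumed, not taken from any particular search)
lam, K = 0.318, 0.134
m, n = 250, 1_000_000

for raw in (40, 60, 80):
    bits = bit_score(raw, lam, K)
    print(raw, round(bits, 1), expect(bits, m, n))
```

Substituting the first formula into the second gives the equivalent form E = K·m·n·e^(−λS), which is why the bit score absorbs the matrix-specific parameters.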

Caution

• Preceding theory does not hold for global alignments

Where do λ and K come from?

• qi is the frequency of residue i

• pij is the frequency of aligned residues i and j with score sij

• λ is the unique positive solution to

$$p_{ij} = q_i q_j e^{\lambda s_{ij}}, \qquad \sum_{i,j} p_{ij} = 1$$

• K is related to the random walk of locally maximal scores for a given scoring matrix and amino acid frequency
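To make the condition on λ concrete, the sketch below solves Σ qi qj e^(λ sij) = 1 by bisection for a simple nucleotide scoring scheme (match +1, mismatch -3, uniform base frequencies). The function name and the bracketing strategy are assumptions of this sketch; it requires a negative expected score and at least one positive score, otherwise no positive solution exists.

```python
import math

def solve_lambda(q, score, tol=1e-12):
    """Positive root of sum_ij q_i q_j exp(lambda * s_ij) - 1 by bisection."""
    def f(lam):
        return sum(q[a] * q[b] * math.exp(lam * score(a, b)) for a in q for b in q) - 1.0

    # f(0) = 0, f < 0 just above 0 (negative expected score), f > 0 for large lambda.
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2
    lo = hi
    while f(lo) >= 0:
        lo /= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q = {base: 0.25 for base in "ACGT"}           # uniform base frequencies (assumed)
score = lambda a, b: 1 if a == b else -3      # match +1, mismatch -3

lam = solve_lambda(q, score)
print(round(lam, 3))
```

Note that λ = 0 always satisfies the equation trivially, which is why the search is restricted to positive values.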

How to interpret E

• E is the expected number of matches with S’ ≥ S’obs
– Given the lengths of the two sequences and assuming both are random
– The number of such matches is approximately Poisson under the null hypothesis

• E can be much smaller or much greater than 1
– When E << 1, it is approximately equal to the probability that any match would have S’ ≥ S’obs
– There may be many matches with E << 1 if there are multiple homologs in the dB
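If the number of chance matches is Poisson with mean E, the probability of seeing at least one is P = 1 − e^(−E). A short sketch (the function name is hypothetical) shows why P ≈ E only when E is small:

```python
import math

def p_at_least_one(E):
    """P(at least one chance match) when the match count is Poisson with mean E."""
    return -math.expm1(-E)  # 1 - exp(-E), computed accurately for small E

for E in (0.001, 0.05, 1.0, 10.0):
    print(E, p_at_least_one(E))
```

For E = 10 the probability is essentially 1, so E itself can no longer be read as a probability.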

Additional considerations

• Edge effects
– Length of expected random alignment must be subtracted from length of sequence and query

• Multiple comparisons
– When searching D such sequences for a given E threshold, the expected frequency of matches is E’ ~ DE
– BLAST actually reports E’

Gapped versus ungapped alignments

• For ungapped alignments
– λ and K can be theoretically computed

• For gapped alignments
– λ and K must be simulated for each matrix and dB

• E values are approximate for gapped alignment
– Assumes gaps are expensive/rare

• BLAST shows E value for the sum of HSP scores, rather than that of the gapped alignment itself

Calculating critical scores

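• Two sequences of length 100

• What is S’crit for E ≤ 0.05?

$$S'_{crit} = \log_2\left(\frac{nm}{E}\right) = \frac{\ln\left(\frac{nm}{E}\right)}{\ln(2)} = \frac{\ln\left(\frac{nm}{E}\right)}{0.69315}$$

• Answer: log2(100 × 100 / 0.05) ≈ 17.6 bits (for E ≤ 1 it would be log2(100 × 100) ≈ 13.3 bits)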
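The critical score follows from inverting E = mn/2^S', giving S'crit = log2(mn/E). A quick check of the arithmetic for two sequences of length 100 (the function name is an assumption of this sketch):

```python
import math

def critical_bits(m, n, E):
    """Smallest bit score whose expected number of chance matches is at most E."""
    return math.log2(m * n / E)

print(round(critical_bits(100, 100, 0.05), 1))  # threshold for E <= 0.05
print(round(critical_bits(100, 100, 1.0), 1))   # threshold for E <= 1
```

For these lengths the formula gives about 17.6 bits at E = 0.05 and about 13.3 bits at E = 1; every halving of E adds one bit to the required score.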

Masking low-complexity regions

• Alignment statistics require that symbols occur randomly in strings
– Long substrings of one or a few symbols violate this assumption
– Sequences are preprocessed to identify such low-complexity regions (LCR)

• LCR are masked
– Do not contribute to alignment or score
– Appear as X’s in BLAST output

• Tools include DUST (for DNA) and SEG (protein)

• Other types of repeats may also be masked by BLAST

Complexity

$$\text{Complexity} = \frac{1}{L} \log_N\left(\frac{L!}{\prod_{i} f_i!}\right)$$

where L = window length,

N = alphabet size (4 for DNA),

fi is the number of occurrences of symbol i in the window, i = 1..N

Example: GGGG

L = 4,

fG = 4, fA = fC = fT = 0,

4! = 4*3*2*1 = 24, 0! = 1

Complexity = (1/4) * log4(24 / (24*1*1*1)) = (1/4) * log4(1) = 0
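The complexity measure is easy to evaluate for any window. The sketch below takes the log base to be the alphabet size (4 for DNA, as in the GGGG example) and uses log-gamma to avoid large factorials; the function name is an assumption.

```python
import math
from collections import Counter

def complexity(window, alphabet_size=4):
    """(1/L) * log_N( L! / prod_i f_i! ) for one window of symbols."""
    L = len(window)
    counts = Counter(window)  # symbols that do not occur contribute 0! = 1
    log_multinomial = math.lgamma(L + 1) - sum(math.lgamma(c + 1) for c in counts.values())
    return log_multinomial / (L * math.log(alphabet_size))

for w in ("GGGG", "GGGA", "ACGT"):
    print(w, round(complexity(w), 3))
```

A homopolymer window scores 0, and the score rises as the symbols become more evenly mixed.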

Which matches should you care about?

• Beware of redundancy

• Search the right set of sequences and no more

• Choosing a significance threshold requires judgement

Are you searching the right dB?

• Protein
– nr (Non-redundant, sort of)
– Swissprot (Annotated)
– Pat (Patented)
– PDB (3D structures in Protein Data Bank)
– Yeast, E. coli, Drosophila, Human, etc.
– Month

• Nucleotide
– nt (nucleotide version of nr)
– EST (Expressed sequence tags)
– GSS (Genome survey sequence - low pass)
– HTGS (High Throughput Genomic Sequence)
– Chromosome (completed genomes, chromosomes)
– etc.

BLAST parameters

• -W wordsize [Integer] default = 11 for nucleotides, 3 for proteins

• -G Cost to open gap [Integer] default = 5 for nucleotides, 11 for proteins

• -E Cost to extend gap [Integer] default = 2 for nucleotides, 1 for proteins

• Dropoffs for blast extensions

• Scoring matrix for proteins

• -q Penalty for nucleotide mismatch [Integer] default = -3

• -r Reward for nucleotide match [Integer] default = 1

• Masking of low-complexity and repeat sequences

• -e expect value [Real] default = 10

Making your own BLAST dB

• Start with a MultiFASTA file

• Formatdb program from NCBI converts MultiFASTA file into BLAST dB
– Words are preprocessed
– K and λ are calculated for allowed scoring matrices

• MultiFASTA file can also be used as a query to do many searches against one database

• All-by-all search - same file as query and dB
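For readers unfamiliar with the MultiFASTA layout (a ">" header line followed by sequence lines, repeated for each record), here is a minimal reader. It is a sketch for illustration, not a replacement for formatdb; the function name and the example records are made up.

```python
def read_multifasta(text):
    """Parse MultiFASTA text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first token of the header as the identifier
            name = header[0] if header else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

example = """>seqA hypothetical protein
MKLRQN
MLAC
>seqB
GGGLRTGG
"""
seqs = read_multifasta(example)
print(seqs)

# An all-by-all search pairs every record with every record in the same file
pairs = [(a, b) for a in seqs for b in seqs]
print(len(pairs))
```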

Don’t use BLAST blindly

• When BLAST is not appropriate
– Perfect matches (faster methods available)
  • Primer sequences (also too short)
  • Assembly (more specialized tools exist)
– Really distant homology (pairwise alignment is not sufficiently sensitive)

• Pay attention to program choice/parameters when
– Aligning cDNA against a genome
– Aligning two very long sequences
– Aligning sequences with many repeats

Cautionary tale: BRCA1

• Initial BLAST search showed match to granins with E = 0.00175

• Granins are typically found in endocrine cell secretory vesicles
– Is a cancer protein excreted outside of the cell?!

• Now known to be a spurious match

Summary

• To attain high speed when searching a dB, BLAST
– Sacrifices some sensitivity by only extending HSPs
– Preprocesses the database and holds it in memory

• Meaningful statistics are key to BLAST’s widespread use

• Conversion of
– Raw score to bit score: depends on scoring matrix and amino acid frequencies
– Bit score to E value: depends on database size

• BLAST is easy to use and versatile, which makes it awfully tempting to misuse

Reading assignment

• Gibson & Muse, Chapter 2: Genome Sequencing and Annotation, pgs. 63-101
