48
Alignments BLAST, BLAT

Alignments BLAST, BLAT - VU ·  · 2005-10-19What is BLAST Basic Local Alignment Search Tool Bad news: it is only a heuristics – Heuristics: A rule of thumb that often helps in

Embed Size (px)

Citation preview

AlignmentsBLAST, BLAT

Genome vs. gene

Genome Gene

Built of... DNA DNA

Describes Organism Protein

Stored as... Part of genome

Circular/ linear Linear

“Life cycle” DNA-DNA-DNA-... DNA-RNA-protein

Size 100b ... 100000b

1 500 .. 50000

2% ... 100% ~30%

Single molecule, or a few of them

Both (depending on the species)

0.5Mb*) ... 3500Mb

Amount per cell

Information content

The amount of genetic information in organisms

Name # genes

0.5 4704.5 4400

12 5500

120 18000

97 22000Homo sapiens 3000 23457Zea mays 2500 50000

Genome size (Mb)

Mycoplasma genitaliumEscherichia coli Saccharomyces cerevisiaeDrosophila melanogasterCaenorhabtitis elegans

The amount of genetic information in organisms

http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html

Largest genome: amoeba Chaos chaos (200x human genome) http://www.lawrence.edu/dept/biology/animal/

Sequence searching - challenges Exponential growth of

databases

Sequence searching – definition

Task:

– Query: short, new sequence (~1000 letters)

– Database (searching space): very many sequences

– Goal: find seqs homologous to the query

Sequence searching – definition

We want:

– fast tool

– primarily a filter: most sequences will be unrelated to the query

– fine-tune the alignment later

Database Search Algorithms:Sensitivity, Selectivity

• True Positive (TP) – a homology detected (positive) correctly (true)

Signal Detected NameYes Yes True PositiveNo No True NegativeYes No False NegativeNo Yes False Positive

Database Search Algorithms:Sensitivity, Selectivity

• Sensitivity =TP/(TP+FN)

• Selectivity =TN/(TN+FP)

SensitivitySensitivity

SelectivitySelectivity

Courtesy of Gary Benson (ISSCB 2003)

What is BLAST Basic Local Alignment Search Tool Bad news: it is only a heuristics

– Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work

Basic idea:– High scoring segments have well

conserved (almost identical) part

– As well conserved part are identified, extend it to the real alignment

q

e

s

-

euqes-

What means well conserved for BLAST? BLAST works with k-words (words of length

k)

– k is a parameter

– different for DNA (>10) and proteins (2..4)

word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12)

Similar 3-wordsW1: R K PW2: R R P

Score: 9 –1 7 ∑ = 15

BLAST algorithm3 basic steps

1)Preprocess the query: extract all the k-words

2)Scan for T-similar matches in database

3)Extend them to alignments

1)Preprocess2)Scan3)Extend

BLAST, Step 1: Preprocess the query

Take the query (e.g. LVNRKPVVP) Chop it into overlapping k-words (k=3 in this

case) Query: LVNRKPVVPWord1: LVNWord2: VNRWord3: NRK…

For each word find all similar words (scoring at least T)

E.g. for RKP the following 3-words are similar:QKP KKP RQP REP RRP RKP

1)Preprocess2)Scan3)Extend

Finite state machine

AC*T|GGC

abstract machine constant amount of memory

(states) used in computation and languages recognizes regular expressions

– cp dmt*.pdf /home/john

1)Preprocess2)Scan3)Extend

BLAST, Step 2: Find ¨exact¨ matches with scanning Use all the T-similar k-words to build the Finite

State Machine Scan for exact matches

...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...

QKPKKPRQPREPRRPRKP...

movement

1)Preprocess2)Scan3)Extend

BLAST, Step 3: Extending ¨exact¨ matches Having the list of exact matches we extend

alignment in both directions

Query: L V N R K P V V P

T-similar: R R P

Subject: G V C R R P L K C

Score: -3 4 -3 5 2 7 1 -2 -3

…till the sum of scores drops below some level X (e.g. X=-100) from the best known- what with gaps?

1)Preprocess2)Scan3)Extend

Gapped BLAST (now standard)

gapped local alignments are computed: much, much, much slower

therefore: modified “Hit criteria”

1)Preprocess2)Scan3)Extend

Hit criteria Extends the alignment only if there are

close two hits on the same diagonal

– sensitivity would drop without lowering T

– reduces extensions (90% time is spend on extensions)

Gapped local alignments are computed

– increased sensitivity allows us raise T

– raising T speeds up the search

1)Preprocess2)Scan3)Extend

dbpos

querypos

close

hit, s

ame d

iag

Gapped BLAST v BLAST

We end up with

– same speed

– gapped alignments!

– much higher sensitivity

BLAST flavours blastp: protein query, protein db blastn: DNA query, DNA db blastx: DNA query, protein db

– in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.

tblastn: protein query, DNA db

– database dynamically translated in all reading frames.

tblastx: DNA query, DNA db

– all translations of query against all translations of db

PSI-BLAST

Position-Specific Iterated BLAST A profile is derived from the result

of the first search Database is searched against the

profile (instead of a sequence) Up to 3 iterations

Profile Profile is generalized form of

sequence probabilities instead of a letter

Score of the profile

scorep,i ,A =∑B p[i ,B]scoreblosum62 A ,B

ACD...WY

0.30.10...0.30.3

0.500...00.5

00.50.2...0.10.2

0.20.00.1...0.40.3

...

...

...

...

...

...

...

...

profile positionletter

Constructing a profile Take significant BLAST results Make an alignment Assign weights to sequences Construct the profile

ACD...WY

0.30.10...0.30.3

0.500...00.5

00.50.2...0.10.2

0.20.00.1...0.40.3

...

...

...

...

...

...

...

...

BLAT

The Blast-Like Alignment Tool Large-scale genome comparison:

– query can be large Preprocessing phase:

– BLAST: query

– BLAT: db

BLAT, Step 1: Preprocess the database Index the database with k-words

– k=8..16 for nucleotides

– k=3..5 for proteins For each k-word store in which

sequences it appears

k-word: RKP

Hashed DB:QKP: HUgn0151194, Gene14, IG0, ...KKP: haemoglobin, Gene134, IG_30, ...RQP: HSPHOSR1, GeneA22...RKP: galactosyltransferase, IG_1...REP: haemoglobin, Gene134, IG_30, ...RRP: Z17368,  Creatine kinase, ......

1)Preprocess2)Scan3)Extend

Hashing – associative arrays Indexing with the object Hash function:

– Objects should be “well spread”

hash:

x

possible objects - largesmall

(fits in memory)

1)Preprocess2)Scan3)Extend

Hashing - examples

T9 Predictive Text in mobile phones– “hello” in Multitap:

4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6

– “hello” in T9: 4, 3, 5, 5, 6

– Collisions: 4, 6:“in”, “go”

1)Preprocess2)Scan3)Extend

BLAT, Step 1: Index to find exact matches with hashing The database is preprocessed only

once! (independent from the query)

k-word: RKP

Hashed DB:QKP: HUgn0151194, Gene14, IG0, ...KKP: haemoglobin, Gene134, IG_30, ...RQP: HSPHOSR1, GeneA22...RKP: galactosyltransferase, IG_1...REP: haemoglobin, Gene134, IG_30, ...RRP: Z17368,  Creatine kinase, ......

1)Preprocess2)Scan3)Extend

BLAT, Step 2: Hit criteria

1)Preprocess2)Scan3)Extend

In a constant time we can get the sequences with a certain k-word

relaxing hit definition -> improve sensitivity– allow imperfect hits

• costly, huge hash grows a few times!➔ shorten k (would lead to FP), but

expect two hits (see BLAST)

BLAT, Step 3: Identifying homologous regions Exclude common k-words For all k-words from query

– find out the position in db For results (qpos, dbpos):

–split into buckets (64kbp)

–sort on the diagonal (diag=qpos-dbpos)

1)Preprocess2)Scan3)Extend

BLAT, Step 3: Identifying homologous regions continued...

– from diagonally close hits (gap limit) create “pre-clusters”

–sort each “pre-cluster” on dbpos

–create clusters from close hits

– run Local Alignment for each cluster

1)Preprocess2)Scan3)Extend

Seeds – improving sensitivity

More general form of k-word is a seed

The seed

gives “hits” with both sequences

CT.GT.AT.

...CTCGTTATA...

...CTAGTAATG...

How to detect homology?

Take the score of an maximal local alignment

can it be obtained by chance?– any score can be obtained from

comparing (long enough) random sequences

What is a “chance”?

Extracting local alignments from random sequences

P-value (e.g. =0.01)– The probability of obtaining the result

by pure chance

– An alignment giving lower P-value than set by user is considered a hit.

Best Local Alignmentsby chance

0

1000

2000

3000

4000

5000

6000

7000

25 30 35 40 45 50 55 60 65 70 75

Score

# A

lignm

ents

Create random seqs, each 1000aa long Find the max local align Repeat

The Statistics of local alignment

Subst matrix must guarantee• E(score(a,b)) < 0 for random a, b

Analytical solution– sum of i.i.d. variables -> normal

distribution

– max of i.i.d. -> extreme value distribution (EVD)

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Expected number of aligns

E-value: the expected number of alignments scoring >= S

2x size of seq -> 2x number aligns 2x S -> E drops exponentially

E=K m n e−S

E-value depends on n, m

Example– For comparing seqA with seqB: S=88

-> E = 0.001

– For comparing seqA with 1000 seqs: score 88 -> E=1

Important for db searching:– n – size of query, m - size of db

E=K m n e−S

Deriving K, L

The above eq is theoretical result for gapless (g=inf) alignment– K, L can be derived from the subst

table For gapped case

– it seems that the equation holds

– we can derive K, L from “experiments”

E=Kmne−S

0

1000

2000

3000

4000

5000

6000

7000

25 30 35 40 45 50 55 60 65 70 75

Score

# A

lignm

ents

Bit score S'

Score S depends on the substitution table

What if we want table-independent score?

E=Kmne−S

E=mn2−S' where S'=S−ln K

ln 2

Why does the BLAST work?Relevant riddle

Are there at least 2 people in Amsterdam with the same number of hairs?– At most 500 000 hairs on

each head

– 700 000 people living in Amsterdam

Why does the BLAST work?Pigeons pigeonhole principle: having 9 boxes and 10

pigeons, there is at least one box with more than 1 pigeon

– n=9, k=7 case:

http://en.wikipedia.org/wiki/Pigeonhole_principle

Why does the BLAST work?Average case

pigeonhole principle describes the worst case!

On average we'll expect two pigeons in the same box much earlier

Birthday paradox: among 23 people, probability that they have the same birthday is > 0.5

– note: 365 boxes and only 23 pigeons!

Birthday paradox

Why does the BLA[S]T work?

Forget the T-similar words, now use only identities 2 sequences, 100 nucleotides each:

– What's the minimal sequence identity for which there's a string of 3 consecutive identities?

?68st

0 identities, 100 mismatches:

67 identities, 33 mismatches:

but if seqs are 50% id, we'll detect it with prob. 99% 28% id -> we'll detect it with prob. 50%

– how is it calculated?

Expected sensitivity

We assume that letters are independent

I – identity between seqs, for human-mouse– 86% for DNA,

– 89% for proteins

pwordid=Ik

Expected sensitivity

Q - query size number of non-overlapping words

prob. of a “hit”

pdetect=1−1−pwordidR

R=Qk

Expected specificity

How many matches by chance (C)? G – genome size

For h-m, to get 99% sensitivity we have to set k=7, and for Q=1000

C ~= 25,000,000– 7h assuming 1/1000 per alignment

C=Q−k1∗Gk∗14

k