Upload
dinhmien
View
213
Download
1
Embed Size (px)
Citation preview
Genome vs. gene
Genome Gene
Built of... DNA DNA
Describes Organism Protein
Stored as... Part of genome
Circular/ linear Linear
“Life cycle” DNA-DNA-DNA-... DNA-RNA-protein
Size 100b ... 100000b
1 500 .. 50000
2% ... 100% ~30%
Single molecule, or a few of them
Both (depending on the species)
0.5Mb*) ... 3500Mb
Amount per cell
Information content
The amount of genetic information in organisms
Name # genes
0.5 4704.5 4400
12 5500
120 18000
97 22000Homo sapiens 3000 23457Zea mays 2500 50000
Genome size (Mb)
Mycoplasma genitaliumEscherichia coli Saccharomyces cerevisiaeDrosophila melanogasterCaenorhabtitis elegans
The amount of genetic information in organisms
http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html
Largest genome: amoeba Chaos chaos (200x human genome) http://www.lawrence.edu/dept/biology/animal/
Sequence searching – definition
Task:
– Query: short, new sequence (~1000 letters)
– Database (searching space): very many sequences
– Goal: find seqs homologous to the query
Sequence searching – definition
We want:
– fast tool
– primarily a filter: most sequences will be unrelated to the query
– fine-tune the alignment later
Database Search Algorithms:Sensitivity, Selectivity
• True Positive (TP) – a homology detected (positive) correctly (true)
Signal Detected NameYes Yes True PositiveNo No True NegativeYes No False NegativeNo Yes False Positive
Database Search Algorithms:Sensitivity, Selectivity
• Sensitivity =TP/(TP+FN)
• Selectivity =TN/(TN+FP)
–
SensitivitySensitivity
SelectivitySelectivity
Courtesy of Gary Benson (ISSCB 2003)
What is BLAST Basic Local Alignment Search Tool Bad news: it is only a heuristics
– Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work
Basic idea:– High scoring segments have well
conserved (almost identical) part
– As well conserved part are identified, extend it to the real alignment
q
e
s
-
euqes-
What means well conserved for BLAST? BLAST works with k-words (words of length
k)
– k is a parameter
– different for DNA (>10) and proteins (2..4)
word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12)
Similar 3-wordsW1: R K PW2: R R P
Score: 9 –1 7 ∑ = 15
BLAST algorithm3 basic steps
1)Preprocess the query: extract all the k-words
2)Scan for T-similar matches in database
3)Extend them to alignments
1)Preprocess2)Scan3)Extend
BLAST, Step 1: Preprocess the query
Take the query (e.g. LVNRKPVVP) Chop it into overlapping k-words (k=3 in this
case) Query: LVNRKPVVPWord1: LVNWord2: VNRWord3: NRK…
For each word find all similar words (scoring at least T)
E.g. for RKP the following 3-words are similar:QKP KKP RQP REP RRP RKP
1)Preprocess2)Scan3)Extend
Finite state machine
AC*T|GGC
abstract machine constant amount of memory
(states) used in computation and languages recognizes regular expressions
– cp dmt*.pdf /home/john
1)Preprocess2)Scan3)Extend
BLAST, Step 2: Find ¨exact¨ matches with scanning Use all the T-similar k-words to build the Finite
State Machine Scan for exact matches
...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
QKPKKPRQPREPRRPRKP...
movement
1)Preprocess2)Scan3)Extend
BLAST, Step 3: Extending ¨exact¨ matches Having the list of exact matches we extend
alignment in both directions
Query: L V N R K P V V P
T-similar: R R P
Subject: G V C R R P L K C
Score: -3 4 -3 5 2 7 1 -2 -3
…till the sum of scores drops below some level X (e.g. X=-100) from the best known- what with gaps?
1)Preprocess2)Scan3)Extend
Gapped BLAST (now standard)
gapped local alignments are computed: much, much, much slower
therefore: modified “Hit criteria”
1)Preprocess2)Scan3)Extend
Hit criteria Extends the alignment only if there are
close two hits on the same diagonal
– sensitivity would drop without lowering T
– reduces extensions (90% time is spend on extensions)
Gapped local alignments are computed
– increased sensitivity allows us raise T
– raising T speeds up the search
1)Preprocess2)Scan3)Extend
dbpos
querypos
close
hit, s
ame d
iag
BLAST flavours blastp: protein query, protein db blastn: DNA query, DNA db blastx: DNA query, protein db
– in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.
tblastn: protein query, DNA db
– database dynamically translated in all reading frames.
tblastx: DNA query, DNA db
– all translations of query against all translations of db
PSI-BLAST
Position-Specific Iterated BLAST A profile is derived from the result
of the first search Database is searched against the
profile (instead of a sequence) Up to 3 iterations
Profile Profile is generalized form of
sequence probabilities instead of a letter
Score of the profile
scorep,i ,A =∑B p[i ,B]scoreblosum62 A ,B
ACD...WY
0.30.10...0.30.3
0.500...00.5
00.50.2...0.10.2
0.20.00.1...0.40.3
...
...
...
...
...
...
...
...
profile positionletter
Constructing a profile Take significant BLAST results Make an alignment Assign weights to sequences Construct the profile
ACD...WY
0.30.10...0.30.3
0.500...00.5
00.50.2...0.10.2
0.20.00.1...0.40.3
...
...
...
...
...
...
...
...
BLAT
The Blast-Like Alignment Tool Large-scale genome comparison:
– query can be large Preprocessing phase:
– BLAST: query
– BLAT: db
BLAT, Step 1: Preprocess the database Index the database with k-words
– k=8..16 for nucleotides
– k=3..5 for proteins For each k-word store in which
sequences it appears
k-word: RKP
Hashed DB:QKP: HUgn0151194, Gene14, IG0, ...KKP: haemoglobin, Gene134, IG_30, ...RQP: HSPHOSR1, GeneA22...RKP: galactosyltransferase, IG_1...REP: haemoglobin, Gene134, IG_30, ...RRP: Z17368, Creatine kinase, ......
1)Preprocess2)Scan3)Extend
Hashing – associative arrays Indexing with the object Hash function:
– Objects should be “well spread”
hash:
x
possible objects - largesmall
(fits in memory)
1)Preprocess2)Scan3)Extend
Hashing - examples
T9 Predictive Text in mobile phones– “hello” in Multitap:
4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6
– “hello” in T9: 4, 3, 5, 5, 6
– Collisions: 4, 6:“in”, “go”
1)Preprocess2)Scan3)Extend
BLAT, Step 1: Index to find exact matches with hashing The database is preprocessed only
once! (independent from the query)
k-word: RKP
Hashed DB:QKP: HUgn0151194, Gene14, IG0, ...KKP: haemoglobin, Gene134, IG_30, ...RQP: HSPHOSR1, GeneA22...RKP: galactosyltransferase, IG_1...REP: haemoglobin, Gene134, IG_30, ...RRP: Z17368, Creatine kinase, ......
1)Preprocess2)Scan3)Extend
BLAT, Step 2: Hit criteria
1)Preprocess2)Scan3)Extend
In a constant time we can get the sequences with a certain k-word
relaxing hit definition -> improve sensitivity– allow imperfect hits
• costly, huge hash grows a few times!➔ shorten k (would lead to FP), but
expect two hits (see BLAST)
BLAT, Step 3: Identifying homologous regions Exclude common k-words For all k-words from query
– find out the position in db For results (qpos, dbpos):
–split into buckets (64kbp)
–sort on the diagonal (diag=qpos-dbpos)
1)Preprocess2)Scan3)Extend
BLAT, Step 3: Identifying homologous regions continued...
– from diagonally close hits (gap limit) create “pre-clusters”
–sort each “pre-cluster” on dbpos
–create clusters from close hits
– run Local Alignment for each cluster
1)Preprocess2)Scan3)Extend
Seeds – improving sensitivity
More general form of k-word is a seed
The seed
gives “hits” with both sequences
CT.GT.AT.
...CTCGTTATA...
...CTAGTAATG...
How to detect homology?
Take the score of an maximal local alignment
can it be obtained by chance?– any score can be obtained from
comparing (long enough) random sequences
What is a “chance”?
Extracting local alignments from random sequences
P-value (e.g. =0.01)– The probability of obtaining the result
by pure chance
– An alignment giving lower P-value than set by user is considered a hit.
Best Local Alignmentsby chance
0
1000
2000
3000
4000
5000
6000
7000
25 30 35 40 45 50 55 60 65 70 75
Score
# A
lignm
ents
Create random seqs, each 1000aa long Find the max local align Repeat
The Statistics of local alignment
Subst matrix must guarantee• E(score(a,b)) < 0 for random a, b
Analytical solution– sum of i.i.d. variables -> normal
distribution
– max of i.i.d. -> extreme value distribution (EVD)
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
Expected number of aligns
E-value: the expected number of alignments scoring >= S
2x size of seq -> 2x number aligns 2x S -> E drops exponentially
E=K m n e−S
E-value depends on n, m
Example– For comparing seqA with seqB: S=88
-> E = 0.001
– For comparing seqA with 1000 seqs: score 88 -> E=1
Important for db searching:– n – size of query, m - size of db
E=K m n e−S
Deriving K, L
The above eq is theoretical result for gapless (g=inf) alignment– K, L can be derived from the subst
table For gapped case
– it seems that the equation holds
– we can derive K, L from “experiments”
E=Kmne−S
0
1000
2000
3000
4000
5000
6000
7000
25 30 35 40 45 50 55 60 65 70 75
Score
# A
lignm
ents
Bit score S'
Score S depends on the substitution table
What if we want table-independent score?
E=Kmne−S
E=mn2−S' where S'=S−ln K
ln 2
Why does the BLAST work?Relevant riddle
Are there at least 2 people in Amsterdam with the same number of hairs?– At most 500 000 hairs on
each head
– 700 000 people living in Amsterdam
Why does the BLAST work?Pigeons pigeonhole principle: having 9 boxes and 10
pigeons, there is at least one box with more than 1 pigeon
– n=9, k=7 case:
http://en.wikipedia.org/wiki/Pigeonhole_principle
Why does the BLAST work?Average case
pigeonhole principle describes the worst case!
On average we'll expect two pigeons in the same box much earlier
Birthday paradox: among 23 people, probability that they have the same birthday is > 0.5
– note: 365 boxes and only 23 pigeons!
Why does the BLA[S]T work?
Forget the T-similar words, now use only identities 2 sequences, 100 nucleotides each:
– What's the minimal sequence identity for which there's a string of 3 consecutive identities?
?68st
0 identities, 100 mismatches:
67 identities, 33 mismatches:
but if seqs are 50% id, we'll detect it with prob. 99% 28% id -> we'll detect it with prob. 50%
– how is it calculated?
Expected sensitivity
We assume that letters are independent
I – identity between seqs, for human-mouse– 86% for DNA,
– 89% for proteins
pwordid=Ik
Expected sensitivity
Q - query size number of non-overlapping words
prob. of a “hit”
pdetect=1−1−pwordidR
R=Qk