Upload
airlia
View
47
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Efficient Probe Selection in Micro-array Design. Algorithmics Group, Dept. of Computer Science, University of Liverpool. Speaker: Cindy Y. Li Joint work with: Leszek G ą sieniec, Paul Sant, and Prudence Wong Special thanks go to: David Peleg. Talk Overview. - PowerPoint PPT Presentation
Citation preview
http://www.csc.liv.ac.uk/~cindy
1
Efficient Probe Selectionin Micro-array Design
Speaker: Cindy Y. Li Joint work with:
Leszek Gąsieniec, Paul Sant, and Prudence WongSpecial thanks go to: David Peleg
Algorithmics Group, Dept. of Computer Science, University of Liverpool
http://www.csc.liv.ac.uk/~cindy
2
Talk Overview
Background: Microarrays & Hybridization Problem Statement Our Approach Experimental Work Conclusion
http://www.csc.liv.ac.uk/~cindy
3
Hybridization Process
DNA 5’... TGTGCTTGACAACATAGTTG... 3’
|| | |
Short DNA Fragments 3’-CTACGGACCGAT-5’A single-stranded DNA probe (middle panel) is linked to an enzyme and allowed to base pair (hybridize) with the mRNA. After a series of washes, only fragments that are hybridized with the target mRNA remain.
http://www.csc.liv.ac.uk/~cindy
4
Tool: DNA Microarrays
Labeled DNA/RNA mixture flushed over array of short DNA fragments
Laser activation of fluorescent labels
http://www.csc.liv.ac.uk/~cindy
5
Talk Overview
Background: Microarrays & Hybridization Problem Statement Our Approach Experimental Work Conclusion
http://www.csc.liv.ac.uk/~cindy
6
Probe concept
A probe is a substring of a gene, which acts as its fingerprint (a.k.a., signature)
Probes are relatively short DNA sequences. Usually, a probe is ~ 20-25 base pairs long.
For example:
DNA ...TGTGCTTGGCAACATAGATAGATGC...
Probe TGCTTGGCAACATAGATAGA
http://www.csc.liv.ac.uk/~cindy
7
Finding unique probesP5P1 P2 P3 P4
G1
G3
G4
Probes
Genes G2
We are interested in finding a single (or a small group of) unique probe(s) for each gene
The search process should be both time and space efficient
http://www.csc.liv.ac.uk/~cindy
8
Finding unique probes Given a database S of gene sequences For each sequence g in S try to find a single
probe P which hybridizes only with g If P cross-hybridizes with some other
sequences in S (i.e., P has a close occurrence in S) then find a small set of probes that uniquely identifies g.
Sometimes multiple probes are required due to the error prone wet lab environment
http://www.csc.liv.ac.uk/~cindy
9
The use of probes
The uniqueness of probes allows us to identify the genes taking part in the experiment in the wet lab
I.e., seeing the trace (green color) of a number of probes on the microarray we can identify precisely which genes were involved in the experiment
http://www.csc.liv.ac.uk/~cindy
10
Finding Unique Probes - Performance Measure
Each gene in the database S should be uniquely identified by a smallest possible number of probes
The search for probes should be time/space efficient The time of the search for probes should be “fairly”
independent of the length of the probes All probes should be far (Hamming distance) from each other Probes should satisfy some extra (e.g., related to hybridization
process) conditions
Naive approach:Scans through the whole length-n genome for every length-m probe and determine if the Hamming distance is big enough, which takes O(mn2) time. For example, 72 hours for S. pombe genome of length 7.1 x 106 bps and thus impractical for large genome.
http://www.csc.liv.ac.uk/~cindy
11
Previous Work – Approaches based on
Suffix array and fast pattern matching [Li F. and Stormo G., 2001]
BLAST to avoid cross-hybridization [Rouillard J. M., Herbert C. J. and Zuker M., 2002]
Longest common substrings [Rahmann S. 2002]
Various filtering techniques [Lockhart DJ et al, 1996]
Methods based on pigeon hole principle [Lee W. H. and Sung W. K., 2003]
etc
http://www.csc.liv.ac.uk/~cindy
12
Previous Work – The probe selection criteria
No single base exceeds 50% of the probe size The length of any contiguous As and Ts or Cs and
Gs is less than 25% of the probe size (G+C)% is between 40% and 60% of the probe Sensitivity - No self-complementarity within the probe
sequence Homogeneity - Melting Temperature not being too
low or too high Specificity – probes are unique to each gene
http://www.csc.liv.ac.uk/~cindy
13
Previous Work – Test data
Test data
Genome Name
E. coli S. pombe Yeast
Genome
Length
4,752,411 13,149,651 8,783,280
Number of Genes
5,253 521 5,888
http://www.csc.liv.ac.uk/~cindy
14
Previous Work – Test data
Yeast
0
500
1000
1500
2000
2500
<1K [1K, 2K) [2K, 3K) [3K, 4K) [4K, 5K) [5K, 6K) [6K, 7K) >= 7K
gene length
# o
f g
enes
Total length 8,783,280
Total # of genes5,888
http://www.csc.liv.ac.uk/~cindy
15
Previous WorkLi and
Stormo BIBE,2000
Rouillard et al,Bioinfor-
matics,2002
Rahmann,WABI, 2002
Lee and Sung,CSB, 2003
E.Coli 23-bps
1.5 days
50-bps
31 minutes
Yeast 24-bps
4 days
50-bps
1 day
50-bps
49 minutes
Neurospora crassa
25-bps
4 hours
50-bps
3.5 hours
# of probes Top 10 All probes Top 50 All probes
http://www.csc.liv.ac.uk/~cindy
16
Talk Overview Background: Microarrays & Hybridization Problem Statement Our new alternative approach - main observations - the algorithm Experimental work Conclusion
http://www.csc.liv.ac.uk/~cindy
17
Main Observations
In general randomness help! 80% of “randomly” (based on our algorithm) chosen
candidates for probes satisfy the probe selection criteria related to hybridization process
[this suggests that random sequences hybridize properly more likely]
The expected Hamming distance between two randomly chosen sequences of a length n over 4 letter alphabet is ~ 3n/4.
[this suggests that randomly chosen probes will be far from each other]
http://www.csc.liv.ac.uk/~cindy
18
An interesting observation
In general, fragments of DNA sequences representing genes are more deterministic (contain more organized information) comparing to the rest of the sequence.
In contrary, the best probes (signatures) representing genes are very likely to be random or almost random!
http://www.csc.liv.ac.uk/~cindy
19
The Algorithm
(*) For every gene g in the database S:
a) generate a random base-pair sequence of length m
b) find the closest length-m substring P in gene g
c) check P for good probe criteria [80% pass this test] If P does not pass the criteria go to a)
d) cross-hybridization checking for P [98% pass this test] For every length-m substring Q in other sequences S-{g}:
If H(P,Q) > d, P is chosen as the probe for g, goto (*) Otherwise, P can possibly cross-hybridize and we must
generate another length-m random substring P', go to a)
http://www.csc.liv.ac.uk/~cindy
20
The algorithm
R
(*) For every gene g in the database S:
a) generate a random base-pair sequence of length m
gP
b) find the closest length-m substring P in gene g
c) Check P for good probe criteria, if P does not pass the criteria, go to a)
http://www.csc.liv.ac.uk/~cindy
21
The algorithm
gP
d) Check P for cross-hybridization checking For every length-m substring Q in other sequences (S - {g}):If H(P,Q) > d, P is chosen as the probe for g, goto (*);Otherwise, P can possibly cross-hybridize and we must generate another length-m random substring, go to a)
g1P is far from g1 √
H(P,Q)<d XgiQ
BackgroundSequences …g2 P is far from g2 √
Generate another length-m random substring
http://www.csc.liv.ac.uk/~cindy
22
Talk Overview
Background: Microarrays & Hybridization Problem Statement Algorithm Experimental Work Conclusion
http://www.csc.liv.ac.uk/~cindy
23
Experimental Work
For Yeast: 1.80% genes with no probes
Duplicated / very similar / too short apart from that
98.0% genes need only one probe 1.5% genes need two probes 0.5% genes need three probes
Similar result with genome E.coli
http://www.csc.liv.ac.uk/~cindy
24
Talk Overview
Background: Microarrays & Hybridization Problem Statement Algorithm Experimental Work Conclusion
http://www.csc.liv.ac.uk/~cindy
25
Conclusion
Almost all (98%) genes can be uniquely identified by a single probe; the others need at most three probes
Our method is: Suitable for large scale probe design Fairly independent from the length of probes Both time and space efficient Useful in design of fault-tolerant system of probes
http://www.csc.liv.ac.uk/~cindy
26
Ongoing Work
Distinguish multiple targets in a sample
P1
g1
P2’P1’g3
P2
g2
http://www.csc.liv.ac.uk/~cindy
27
Questions
???? ??
http://www.csc.liv.ac.uk/~cindy
28
Thank You!
Presented By Cindy Y. Li
http://www.csc.liv.ac.uk/~cindy
29
self-complementarity
Probe 5‘ TTTCAGTAATAAAAGATTTCTGT 3‘ |||| Probe 3‘ TGTCTTTAGAAAAATTAGACTTT 5‘
http://www.csc.liv.ac.uk/~cindy
30
Melting Temperature TM can be used as a parameter to evaluate probe hybridization
behavior TM is calculated for each probe as (SantaLucia et al., 1996)
is the sum of the nearest neighbor enthalpy changes
is the sum of the nearest neighbor entropy changes
R is the Gas Constant (1.987 cal deg-1 mol-1)
CT is the total molar concentration of strands ( )
http://www.csc.liv.ac.uk/~cindy
31
Melting Temperature thermodynamic stability / nearest neighbour/
A T C G
A -1.2
-0.9
-1.5
-1.5
T -0.9
-1.2
-1.5
-1.7
C -1.7
-1.5
-2.1
-2.8
G -1.5
-1.5
-2.3
-2.1
TTTCAGTAATTAAAAAGATTTCTGT
-1.2 -1.5-1.7 kcal/mol