Fine Tuning the Enhanced Suffix Arrays
Ayat A. Dawood
CIS, Nile University
Joint work with: Mohamed AbouelHoda
Table of Contents
Suffix array
The enhanced suffix array
Our accomplishment:
  Minimal perfect hashing function
  The exact pattern matching problem
  Improving the bucket table representation
Suffix array
An array of integers in the range 0 to n, specifying the lexicographic order of the n+1 suffixes of S$.
e.g., S$ = acaaacatat$ (n = 10); the sentinel $ sorts after all other characters.
i    suftab[i]   S(suftab[i])
0    2           aaacatat$
1    3           aacatat$
2    0           acaaacatat$
3    4           acatat$
4    6           atat$
5    8           at$
6    1           caaacatat$
7    5           catat$
8    7           tat$
9    9           t$
10   10          $
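The table above can be reproduced with a few lines of Python. This is only a minimal sketch: it sorts the suffixes directly, which is far slower than a dedicated construction algorithm, and the function name `suffix_array` is made up for illustration. It assumes, as in the table, that the sentinel `$` ranks after every other character.

```python
def suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text in sorted order.

    Assumption: '$' is the sentinel and ranks after every other character,
    as in the slide's example.
    """
    def rank(start: int):
        # (False, c) sorts before (True, '$'), which pushes '$' to the end.
        return [(c == "$", c) for c in text[start:]]

    return sorted(range(len(text)), key=rank)


suftab = suffix_array("acaaacatat$")
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
```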
Enhanced suffix array
Basically, it is the suffix array enhanced with a set of additional tables.
Using those tables, the best performance and time complexity are achieved.
lcptab[i] stores the length of the longest common prefix of the suffixes S(suftab[i]) and S(suftab[i-1]); lcptab[0] = 0.
i    suftab[i]   lcptab[i]   S(suftab[i])
0    2           0           aaacatat$
1    3           2           aacatat$
2    0           1           acaaacatat$
3    4           3           acatat$
4    6           1           atat$
5    8           2           at$
6    1           0           caaacatat$
7    5           2           catat$
8    7           0           tat$
9    9           1           t$
10   10          0           $
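As a hedged illustration of the lcptab definition, the sketch below compares each suffix with its predecessor character by character. The function name `lcp_table` is an assumption, and the code relies on the text ending with a unique sentinel `$`, so that two different suffixes always mismatch before either runs out. It is quadratic in the worst case; linear-time constructions exist but are not shown here.

```python
def lcp_table(text: str, suftab: list[int]) -> list[int]:
    """lcptab[i] = length of the longest common prefix of the suffixes
    starting at suftab[i-1] and suftab[i]; lcptab[0] = 0.

    Assumption: text ends with a unique sentinel, so the inner loop
    always stops at a mismatch before leaving the string.
    """
    lcptab = [0] * len(suftab)
    for i in range(1, len(suftab)):
        a, b = suftab[i - 1], suftab[i]
        length = 0
        while text[a + length] == text[b + length]:
            length += 1
        lcptab[i] = length
    return lcptab


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # suffix array from the table
lcptab = lcp_table(text, suftab)
print(lcptab)  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```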
Enhanced suffix array: l-interval
L-interval: an interval [i..j] of the suffix array whose suffixes all share the same prefix of length l, written l-[i..j].
e.g., 1-[0..5] holds the suffixes starting with a, and 2-[0..1] those starting with aa.
The l-intervals of S$ form a tree (edge labels give the next character):
0-[0..10]
  a: 1-[0..5]
    a: 2-[0..1]
    c: 3-[2..3]
    t: 2-[4..5]
  c: 2-[6..7]
  t: 1-[8..9]
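The l-intervals listed on this slide can be derived from the lcp-table alone with a single left-to-right pass and a stack. The sketch below is one way to do it under the definitions used here; the name `lcp_intervals` and the tuple layout `(l, i, j)` are illustrative choices, not taken from the slides.

```python
def lcp_intervals(lcptab: list[int]) -> list[tuple[int, int, int]]:
    """Return every l-interval as (l, left, right), children before parents."""
    intervals = []
    stack = [(0, 0)]  # (lcp value, left boundary); the root is never popped in the loop
    for i in range(1, len(lcptab)):
        left = i - 1
        # Close every open interval whose lcp value is larger than lcptab[i].
        while lcptab[i] < stack[-1][0]:
            lcp, left = stack.pop()
            intervals.append((lcp, left, i - 1))
        # Open a new interval if the shared prefix just got longer.
        if lcptab[i] > stack[-1][0]:
            stack.append((lcptab[i], left))
    # Whatever is still open ends at the last suffix.
    while stack:
        lcp, left = stack.pop()
        intervals.append((lcp, left, len(lcptab) - 1))
    return intervals


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # lcp-table of acaaacatat$
intervals = lcp_intervals(lcptab)
for l, i, j in intervals:
    print(f"{l}-[{i}..{j}]")
# 2-[0..1], 3-[2..3], 2-[4..5], 1-[0..5], 2-[6..7], 1-[8..9], 0-[0..10]
```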
Our accomplishment
Improvement (fine tuning):
  Alphabet-independent exact pattern matching.
  Improving the bucket table representation.
  Improving access to the lcp-table.
Improvements are achieved using minimal perfect hashing techniques.
Minimal perfect hashing function (MPHF)
Stores n static keys from a universe U in O(n) space with O(1) access time [Botelho et al.].
A plain lookup table requires O(|U|) space to achieve the same constant access time.
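To make the idea concrete, here is a toy minimal perfect hash in Python. It is not the construction of Botelho et al.; a simple hash-and-displace scheme is swapped in because it fits in a few lines. The names `fnv1a`, `build_mphf` and `mphf` and the example keys are assumptions for illustration. The structure stores one small integer per key (O(n) space) and answers a query with two hash evaluations.

```python
def fnv1a(key: str, seed: int) -> int:
    """Seeded 32-bit FNV-1a hash (deterministic, unlike Python's built-in hash)."""
    h = (0x811C9DC5 ^ seed) & 0xFFFFFFFF
    for byte in key.encode():
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def build_mphf(keys: list[str]) -> list[int]:
    """Return one seed per bucket so that the n keys map onto 0..n-1 without collisions."""
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[fnv1a(key, 0) % n].append(key)
    seeds = [0] * n
    taken = [False] * n
    # Place the largest buckets first, while most slots are still free.
    for b in sorted(range(n), key=lambda b: -len(buckets[b])):
        if not buckets[b]:
            break
        seed = 1
        while True:
            slots = [fnv1a(key, seed) % n for key in buckets[b]]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            taken[s] = True
    return seeds


def mphf(seeds: list[int], key: str) -> int:
    """Slot of key in 0..n-1; only meaningful for keys given to build_mphf."""
    n = len(seeds)
    return fnv1a(key, seeds[fnv1a(key, 0) % n]) % n


keys = ["aa", "ac", "at", "ca", "ta"]
seeds = build_mphf(keys)
print(sorted(mphf(seeds, k) for k in keys))  # [0, 1, 2, 3, 4]
```

Like any minimal perfect hash, this sketch does not recognise foreign keys: a string outside the key set is still mapped to some slot, so a membership check needs the key stored alongside the value.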
Exact pattern matching problem
e.g., pattern = aca
The search walks down the lcp-interval tree of S$: from the root 0-[0..10], follow the a-child 1-[0..5], then its c-child 3-[2..3].
The lcp value 3 covers the whole pattern, so aca occurs at positions suftab[2] = 0 and suftab[3] = 4.
Exact pattern matching problem cont’
Using the naive method, it takes O(nm) time for a text of length n and a pattern of length m.
Using the enhanced suffix array, it can be achieved in O(|∑|m) time [Abouelhoda et al.].
Other modifications to the enhanced suffix array allow it to be done in O(m log |∑|) time [Kim et al.], [Fischer et al.].
Exact pattern matching problem cont’
Our work: using a minimal perfect hashing technique, it can be achieved in O(m) time, removing the alphabet factor.
[Figure: the lcp-interval tree of S$ annotated with MPHF tables]
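The following sketch shows why a hashed child lookup removes the alphabet factor: each step down the tree costs one table lookup instead of a scan over up to |∑| children. It is an illustration under assumptions, not the authors' implementation: a Python dict stands in for the MPHF table, the child table is built by a simple recursive pass rather than from the lcp-table, and the names `build_child_table` and `find` are hypothetical.

```python
def build_child_table(text, suftab):
    """lcp_of[(lo, hi)] is the lcp value of interval [lo..hi];
    child[(lo, hi, c)] is the child interval reached with character c."""
    lcp_of, child = {}, {}

    def visit(lo, hi, depth):
        if lo == hi:  # singleton interval: a single suffix, nothing to split
            return
        first, last = suftab[lo], suftab[hi]
        while text[first + depth] == text[last + depth]:
            depth += 1  # extend to the full common prefix of the interval
        lcp_of[(lo, hi)] = depth
        start = lo
        for k in range(lo + 1, hi + 2):
            if k > hi or text[suftab[k] + depth] != text[suftab[start] + depth]:
                child[(lo, hi, text[suftab[start] + depth])] = (start, k - 1)
                visit(start, k - 1, depth + 1)
                start = k

    visit(0, len(suftab) - 1, 0)
    return lcp_of, child


def find(text, suftab, lcp_of, child, pattern):
    """Return the suffix-array interval (lo, hi) of pattern, or None."""
    lo, hi, matched, m = 0, len(suftab) - 1, 0, len(pattern)
    while True:
        # Characters shared by every suffix in [lo..hi]; a singleton shares its whole suffix.
        shared = lcp_of.get((lo, hi), len(text) - suftab[lo])
        end = min(shared, m)
        start = suftab[lo]
        if text[start + matched:start + end] != pattern[matched:end]:
            return None
        if end == m:
            return lo, hi
        nxt = child.get((lo, hi, pattern[end]))  # one hash lookup per step
        if nxt is None:
            return None
        (lo, hi), matched = nxt, end


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
lcp_of, child = build_child_table(text, suftab)
hit = find(text, suftab, lcp_of, child, "aca")
print(hit, [suftab[i] for i in range(hit[0], hit[1] + 1)])  # (2, 3) [0, 4]
```

Replacing the dict by a list of children that is scanned linearly would give the O(|∑|m) behaviour of the plain enhanced suffix array.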
Improving the bucket table representation
Bucket table: a table used to jump directly to a certain lcp-interval in the suffix array.
e.g., for S$ = acaaacatat$, prefix length d = 2 and ∑ = {a, c, g, t}:
prefix   first index in suftab
aa       0
ac       2
at       4
ca       6
ta       8
The other 11 of the 16 entries (ag, cc, cg, ct, ga, gc, gg, gt, tc, tg, tt) are empty.
Improving the bucket table representation cont’
Problem: the space consumption of the lookup table, |∑|^d entries, is prohibitive for large d and |∑|.
Solution: use minimal perfect hashing techniques to store the lookup table.
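A small sketch of the space argument, using the running example: the dense lookup table reserves a slot for every possible prefix of length d, while a hashed table only needs one per prefix that actually occurs. The function name `bucket_table` is hypothetical, a Python dict stands in for the MPHF, and prefixes containing the sentinel `$` are skipped by assumption.

```python
def bucket_table(text: str, suftab: list[int], d: int) -> dict[str, int]:
    """Map every length-d prefix occurring in text to its first suffix-array index.

    Assumption: prefixes that are too short or contain the sentinel '$' are skipped.
    """
    buckets = {}
    for i, start in enumerate(suftab):
        prefix = text[start:start + d]
        if len(prefix) == d and "$" not in prefix:
            buckets.setdefault(prefix, i)  # keep the first (smallest) index only
    return buckets


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
buckets = bucket_table(text, suftab, d=2)
print(buckets)  # {'aa': 0, 'ac': 2, 'at': 4, 'ca': 6, 'ta': 8}

dense_entries = len("acgt") ** 2   # a lookup table needs |alphabet|^d slots
print(dense_entries, len(buckets))  # 16 5
print(4 ** 12, 5 ** 12)             # 16777216 244140625
```

The last line shows how quickly the dense table grows for d = 12, which is the setting where a structure proportional to the number of occurring keys pays off.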
Improving the bucket table representation cont’
Results: for the bacterial E. coli genome (size = 5400 bp) and d = 12
Alphabet size      No. of keys   Lookup table size in bits   MPHF size in bits   Reduction compared to lookup table
4 (A,T,C,G)        3474814       16777216                    7231956.638         46% reduction
5 (A,T,C,G,N*)     8451811       244140625                   17590331.64         93% reduction
* N for an undefined nucleotide or dummy character
Conclusion
Exact pattern matching problem.
Improving the bucket table representation.
Improving access to the lcp-table.
Questions?
Improving access to the lcp-table
To reduce space, each lcp-table entry is stored in 1 byte.
If a common prefix is longer than 255, its length is stored in an extra table.
The extra table is accessed sequentially or using binary search.
Our enhancement: use an MPHF to store the extra table, so that it is accessed in constant time.
[Figure: the lcp-table (entries such as 0, 2, 3) next to the extra lcp-table (entries such as 257, 279, 300, 260)]
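The two-level layout can be sketched as follows. Several details are assumptions made for illustration only: the byte value 255 is used as the marker that redirects to the extra table (so values of 255 and above go there), a Python dict keyed by the suffix-array index stands in for the MPHF, and the sample lcp values are invented.

```python
MARKER = 255  # assumption: this byte value means "look in the extra table"


def compress_lcp(lcptab: list[int]):
    """Split an lcp-table into a 1-byte-per-entry table and an extra table."""
    small = bytearray(min(value, MARKER) for value in lcptab)
    # Hash table keyed by index: constant-time access, no binary search needed.
    extra = {i: value for i, value in enumerate(lcptab) if value >= MARKER}
    return small, extra


def lcp_at(small: bytearray, extra: dict[int, int], i: int) -> int:
    """Read lcptab[i] back from the compressed representation."""
    return extra[i] if small[i] == MARKER else small[i]


lcptab = [0, 2, 300, 3, 255, 1]  # invented values, two of them too large for one byte
small, extra = compress_lcp(lcptab)
print(list(small))  # [0, 2, 255, 3, 255, 1]
print(extra)        # {2: 300, 4: 255}
print([lcp_at(small, extra, i) for i in range(len(lcptab))])  # [0, 2, 300, 3, 255, 1]
```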