1
Work @ Fudan University
Chen, Yaoliang
2
ENGINEERING WORK
• TTS System: a Chinese text-to-speech system
• SafeDB: bug backlog
• SMemoHelper: a small tool that helps users learn English words
• Fraud Detecting: time-series techniques
3
RESEARCH WORK
• CGAP-align: A high-performance DNA short-read alignment tool
  ▫ Coauthored with BCM. Bioinformatics, in progress
  ▫ NDBC Demo
• On Encoding Shortest Paths in Large Graphs
  ▫ Coauthored with Jian Pei. VLDB, in progress
  ▫ Coauthored with Haixun Wang. SIGMOD, in progress
  ▫ NDBC
• Other Projects
4
CGAP-ALIGN: BACKGROUND
• Baylor College of Medicine
• Sequence alignment and its significance
  ▫ Reference & Reads: ACTAGCGATATAACCCTTTCCCTTTCCCTTT / CACGAT
• Given a threshold z, a reference X, and a read W, we want to find a substring W' = X[i, i+1, …, j] such that EditDistance(W, W') ≤ z.
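The problem statement above can be illustrated with a small dynamic program. This is only a sketch of the problem definition, not CGAP-align's or BWA's method: it scans the whole reference in O(|X|·|W|) time, and the function name `approximate_matches` is chosen here for illustration.

```python
def approximate_matches(reference: str, read: str, z: int):
    """Return (end, distance) pairs such that some substring of `reference`
    ending at position `end` (exclusive) has edit distance <= z to `read`."""
    m = len(read)
    # prev[i]: best edit distance between read[:i] and a substring of the
    # reference ending at the previous column. Initially the substring is empty.
    prev = list(range(m + 1))
    hits = []
    for end, ref_char in enumerate(reference, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start at any position
        for i in range(1, m + 1):
            cost = 0 if read[i - 1] == ref_char else 1
            cur[i] = min(prev[i - 1] + cost,  # match or mismatch
                         prev[i] + 1,         # extra character in the reference
                         cur[i - 1] + 1)      # extra character in the read
        if cur[m] <= z:
            hits.append((end, cur[m]))
        prev = cur
    return hits


print(approximate_matches("ACTAGCGATATAACCC", "GCGATA", 0))  # [(10, 0)]
```

With z = 0 this reduces to exact substring search; larger z admits mismatches and gaps at the cost of more candidate positions.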
5
CHALLENGES
[Figure: DNA sequences in GenBank]
• A human genome sequence
  ▫ 2000: € 1,000,000,000 in ~10 years
  ▫ 2008: € 50 - 100,000 in ~4 months
  ▫ 2010: € 5 - 10,000 in ~2 weeks
  ▫ ... 2015: € 1,000 in ~1 day
  ▫ ... 2020: € 10 in ~1 hour to minutes
6
PERFORMANCE OF BWA
• Burrows-Wheeler Alignment Tool (BWA)
  ▫ A popular tool for aligning gene fragments against a large reference sequence
• Optimization of BWA
  ▫ Code level
  ▫ Algorithm level
• BWA Performance: T = N × Taln
  ▫ N: the number of modified reads obtained by enumerating all mismatches and gaps of the read
  ▫ Taln: time to locate a modified read in the reference during the alignment stage
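To get a feel for why N dominates T = N × Taln, the sketch below counts the modified reads under a simplifying assumption: only mismatches are enumerated (no gaps) and nothing is pruned. The function name and the read length of 35 are illustrative choices, not figures from BWA.

```python
from math import comb


def mismatch_variants(read_length: int, max_mismatches: int,
                      alphabet_size: int = 4) -> int:
    """Number of distinct strings within Hamming distance `max_mismatches`
    of a read (assumption: mismatches only, gaps ignored, no pruning)."""
    return sum(comb(read_length, k) * (alphabet_size - 1) ** k
               for k in range(max_mismatches + 1))


print(mismatch_variants(35, 2))  # 5461
```

Even two mismatches on a short read give thousands of candidates, which is why pruning N matters as much as speeding up Taln.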
7
OPTIMIZATION
• Optimizing Taln: efficiency for matching
  ▫ Suffix Tarray
• Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
  ▫ Data-Conscious D-Array Calculation
8
SUFFIX TARRAY
• Suffix Tree
• Suffix Array Based on BWT (FM-index)
• Comparison
[Figure: a suffix tree (Root with children A, C, G, T; Leaf(b=2)) on top of FM-index ranges R(AA), R(TT), R(TC), for Ref = ATCTTCAAGA and Read = TAA]
9
BURROWS-WHEELER TRANSFORM
All rotations of T = mississippi#:
  mississippi#
  ississippi#m
  ssissippi#mi
  sissippi#mis
  issippi#miss
  ssippi#missi
  sippi#missis
  ippi#mississ
  ppi#mississi
  pi#mississip
  i#mississipp
  #mississippi
Sort the rows (F = first column, L = last column):
  F            L
  # mississipp i
  i #mississip p
  i ppi#missis s
  i ssippi#mis s
  i ssissippi# m
  m ississippi #
  p i#mississi p
  p pi#mississ i
  s ippi#missi s
  s issippi#mi s
  s sippi#miss i
  s sissippi#m i
From Yuval Rikover
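The construction above can be reproduced in a few lines. This is a naive sketch that materializes every rotation (quadratic memory), which is fine for a slide-sized example but not how a genome index is built. It assumes the terminator '#' sorts before every other character, which holds for ASCII letters.

```python
def bwt(text: str) -> tuple[str, str]:
    """Return the first column F and last column L of the sorted rotations."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


print(bwt("mississippi#"))  # ('#iiiimppssss', 'ipssm#pissii')
```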
10
BURROWS-WHEELER TRANSFORM
Reminder: Recovering T from L
  F = #iiiimppssss
  L = ipssm#pissii
1. Find F by sorting L
2. First char of T? m
3. Find m in L
4. L[i] precedes F[i] in T. Therefore we get mi
5. How do we choose the correct i in L?
  ▫ The i's are in the same order in L and F
  ▫ As are the rest of the chars
6. i is followed by s: mis
7. And so on…
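A minimal sketch of the recovery, assuming T ends with a unique terminator '#' that sorts before all other characters. It walks T from back to front, which is the mirror image of the forward walk listed in the steps, and relies on the same fact: equal characters appear in the same relative order in L and F. A stable sort of L captures exactly that.

```python
def inverse_bwt(last: str, terminator: str = "#") -> str:
    """Recover T from L = BWT(T)."""
    n = len(last)
    # Stable sort of the positions of L by character: the k-th entry is the
    # position in L of the character that sits in row k of F.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n  # lf[i]: row of F holding the same occurrence as L[i]
    for f_row, l_pos in enumerate(order):
        lf[l_pos] = f_row
    # Row 0 starts with the terminator, so L[0] is the last real char of T.
    chars = []
    row = 0
    for _ in range(n - 1):
        chars.append(last[row])
        row = lf[row]
    return "".join(reversed(chars)) + terminator


print(inverse_bwt("ipssm#pissii"))  # mississippi#
```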
11
NEXT: COUNT P IN T
• Backward-search algorithm
• Uses only L (output of BWT)
• Relies on 2 structures:
  ▫ C[1,…,|Σ|]: C[c] contains the total number of text chars in T which are alphabetically smaller than c (including repetitions of chars)
  ▫ Occ(c,q): number of occurrences of char c in prefix L[1,q]
Example
• C[ ] for T = mississippi#
    c:     i  m  p  s
    C[c]:  1  5  6  8
• occ(s, 5) = 2
• occ(s, 12) = 4
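The two structures can be sketched directly from their definitions. The `occ` below rescans L on every call, which is an assumption made for brevity; a real FM-index would answer it from sampled counts in constant time.

```python
from collections import Counter


def build_c(text: str) -> dict[str, int]:
    """C[c]: number of characters in the text that are smaller than c."""
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def occ(last: str, ch: str, q: int) -> int:
    """Occurrences of ch in L[1..q] (1-based, inclusive), computed naively."""
    return last[:q].count(ch)


print(build_c("mississippi#"))      # {'#': 0, 'i': 1, 'm': 5, 'p': 6, 's': 8}
print(occ("ipssm#pissii", "s", 5))  # 2
```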
12
SUBSTRING SEARCH IN T (COUNT THE PATTERN OCCURRENCES)
[Figure: sorted rotation matrix of T = mississippi# with last column L = ipssm#pissii; the middle columns are unknown. Available info: L and the C table. The rows prefixed by P are delimited by fr and lr.]
• P = si
• First step: rows prefixed by char "i"
• Inductive step: given fr, lr for P[j+1,p]
  ▫ Take c = P[j]
  ▫ Find the first c in L[fr, lr]
  ▫ Find the last c in L[fr, lr]
• occ = lr - fr + 1 = 2
• Occ() oracle is enough
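The inductive step can be sketched as follows, using the usual update fr = C[c] + Occ(c, fr - 1) + 1 and lr = C[c] + Occ(c, lr) with 1-based rows. Occ is again a naive rescan of L, which is an assumption of this sketch and not how an FM-index stores it.

```python
from collections import Counter


def backward_search(last: str, pattern: str) -> int:
    """Count occurrences of `pattern` in T, given only L = BWT(T)."""
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):  # C[c]: number of chars smaller than c
        c_table[ch] = total
        total += counts[ch]

    def occ(ch: str, q: int) -> int:  # occurrences of ch in L[1..q]
        return last[:q].count(ch)

    fr, lr = 1, len(last)  # start with every row of the matrix
    for ch in reversed(pattern):  # process P from its last char to its first
        if ch not in c_table:
            return 0
        fr = c_table[ch] + occ(ch, fr - 1) + 1
        lr = c_table[ch] + occ(ch, lr)
        if fr > lr:
            return 0
    return lr - fr + 1


print(backward_search("ipssm#pissii", "si"))  # 2
```

For P = si the first step narrows the range to rows 2..5 (prefixed by "i") and the second to rows 9..10, giving occ = 2.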
13
SUFFIX T-ARRAY
• Backward search
• Store "First" and "Last" (k and l) values
14
BACKWARD-SEARCH EXAMPLE
• P = CAA
  ▫ i = 3: c = 'A'
  ▫ i = 2: c = 'A', First = First(AA), Last = Last(AA)
  ▫ i = 1: c = 'C', First = C['T'] + Occ('C', First(AA)) + 1, Last = C['T'] + Occ('C', Last(AA))
[Figure: the path Root, A, A in the tree part leads into the FM-index]
15
OPTIMIZATION
• Optimizing Taln: efficiency for matching
  ▫ Suffix Tarray
• Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
  ▫ Data-Conscious D-Array Calculation
16
D-ARRAY: MOTIVATION
• e(W)
  ▫ The minimal number of edit operations needed to make W align exactly onto the reference X.
• D-array
  ▫ D[i]: lower bound of e(W[0…i])
17
D-ARRAY: MOTIVATION
• Given a string W and an arbitrary segmentation of W into substrings, W = w1w2…wk, we have e(W) ≥ e(w1) + e(w2) + … + e(wk).
• D array in BWA
  ▫ Split W into several small strings, W = w1w2…wk, with e(wi) = 1 for all i. The correctness of the algorithm depends on the inequality e(W) ≥ e(w1) + e(w2) + … + e(wk).
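A sketch of the greedy segmentation behind BWA's D array: grow the current segment until it no longer occurs in the reference, then count one unavoidable edit and start a new segment. BWA does the occurrence test with an FM-index; Python's `in` operator stands in for it here as a simplifying assumption, and the function name `calculate_d` is chosen for illustration.

```python
def calculate_d(read: str, reference: str) -> list[int]:
    """D[i]: lower bound on the edits needed to align read[0..i]."""
    d = []
    z = 0      # segments seen so far that do not occur in the reference
    start = 0  # start of the current segment
    for i in range(len(read)):
        if read[start:i + 1] not in reference:
            z += 1         # this segment needs at least one edit
            start = i + 1  # begin a new segment after it
        d.append(z)
    return d


print(calculate_d("ACGTTT", "AACGTATCGACG"))  # [0, 0, 0, 0, 1, 1]
```

Each closed segment has e(wi) ≥ 1, so the running count is a valid lower bound by the inequality above.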
18
D-ARRAY: MOTIVATION
• Example: Reference X = "AACGTATCGACG"
  ▫ W
  ▫ D
• A better segmentation: consider e(·) = 2
  ▫ W
  ▫ D
  ▫ Calculating e(·) costs exponential time
  ▫ Needs pre-computation
[Figure: W and D rows for each of the two segmentations]
19
SOLUTION - FREQUENT PATTERN
[Diagram: Train Reads, Frequent Patterns, Trie DFA]
• Fasta file F containing training reads
• Should be similar to the reads in practice
• Data-conscious
• Mining Frequent Patterns (FPs)
• State-of-the-art methods
• Our solution: a simple DFS on the FM-index
  ▫ Count = Last - First + 1
• Generate a prefix trie T for the FPs with e(w) = 2.
• Refine T to a DFA GT
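The DFS on the FM-index can be sketched as below. Each pattern is extended by one character to the left with a backward-search step, and its count is Last - First + 1. Extension stops as soon as the count drops below a threshold, since no longer pattern can be more frequent. Assumptions of this sketch: a single indexed text ending in '#', a naive Occ, and hypothetical parameters `min_count` and `max_len`. How the training reads are combined into one index is not covered here.

```python
from collections import Counter


def mine_frequent_patterns(text: str, min_count: int,
                           max_len: int) -> dict[str, int]:
    """Return every substring of length <= max_len occurring >= min_count times."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    last = "".join(row[-1] for row in rotations)  # L = BWT(text)
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    alphabet = [ch for ch in sorted(counts) if ch != "#"]

    def occ(ch: str, q: int) -> int:  # occurrences of ch in L[1..q]
        return last[:q].count(ch)

    found: dict[str, int] = {}

    def dfs(pattern: str, first: int, last_row: int) -> None:
        for ch in alphabet:
            new_first = c_table[ch] + occ(ch, first - 1) + 1
            new_last = c_table[ch] + occ(ch, last_row)
            count = new_last - new_first + 1  # Count = Last - First + 1
            if count >= min_count:
                extended = ch + pattern
                found[extended] = count
                if len(extended) < max_len:
                    dfs(extended, new_first, new_last)

    dfs("", 1, n)
    return found


print(mine_frequent_patterns("mississippi#", min_count=2, max_len=3))
```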
20
TRIE DETERMINISTIC FINITE AUTOMATON
• Why Trie DFA?
  ▫ During online alignment, we need to find all the FPs contained in a read
  ▫ This operation should be no more expensive than O(|W|)
21
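TRIE DETERMINISTIC FINITE AUTOMATON
Offline Index: Construction
• String Set (FP set)
  ▫ AA
  ▫ C
  ▫ G
  ▫ T
  ▫ AC
  ▫ AG
• The prefix trie is done. We start to construct the DFA.
[Figure: prefix trie rooted at R with edges A, C, G, T; node 1 (reached by A) has edges A, C, G; leaves LC, LG, LT, LAA, LAC, LAG; DFA states numbered 1 to 7]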
22
RE-ORDERING
• DFS order: minimize the average hop between each jump. (7% up)
[Figure: the trie nodes before and after renumbering in DFS order]
23
TRIE DETERMINISTIC FINITE AUTOMATON
Online Query
• String Set (FP set)
  ▫ AA
  ▫ AC
  ▫ AG
  ▫ C
  ▫ G
  ▫ T
• W = "CACAT"
[Figure: the query walks the automaton through the states R, LC, 1, LAC, 1, LT]
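The online query can be sketched with an Aho-Corasick automaton, which is one standard way to turn a prefix trie into a matcher that reads W once. Note the assumption: this sketch follows failure links at query time (amortized O(|W|) plus the number of matches) where the slides describe a DFA with precomputed transitions. The function names are illustrative.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]  # goto[state][char] -> next state (trie edges)
    fail = [0]   # failure link of each state
    out = [[]]   # patterns recognised on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit shorter matches
            queue.append(nxt)
    return goto, fail, out


def find_patterns(read, automaton):
    """Return (start, pattern) for every pattern occurrence in `read`."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(read):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((pos - len(p) + 1, p) for p in out[state])
    return hits


fp_set = ["AA", "AC", "AG", "C", "G", "T"]
print(find_patterns("CACAT", build_automaton(fp_set)))
# [(0, 'C'), (1, 'AC'), (2, 'C'), (4, 'T')]
```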
24
EXPERIMENT
• Optimizing Taln: efficiency for matching
  ▫ Suffix Tarray (20% up)
• Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
  ▫ Data-Conscious D-Array Calculation (0-200% up)
25
ON ENCODING SHORTEST PATHS IN LARGE GRAPHS
• Background
  ▫ Consider a graph G = (V,E), where V is a set of vertices and E ⊆ V × V is a set of edges.
• FH-Partition
26
EXAMPLES
7 -> 10: FH(7,10) = 9; FH(9,10) = 2; FH(2,10) = 10
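Assuming FH(u, v) denotes the first hop on a shortest path from u to v, the full path is recovered by following first hops until the target is reached. The table layout below, a dictionary keyed by (source, target), is an illustrative choice and not the encoding the work proposes.

```python
def shortest_path(first_hop, source, target):
    """Follow first hops from `source` until `target` is reached."""
    path = [source]
    while path[-1] != target:
        path.append(first_hop[(path[-1], target)])
    return path


fh = {(7, 10): 9, (9, 10): 2, (2, 10): 10}
print(shortest_path(fh, 7, 10))  # [7, 9, 2, 10]
```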
27
PROBLEM STATEMENT
•Numbering Function
28
MCN IS NP-HARD!!
29
WORKFLOW
[Diagram: Compute FH-Partitions, Get Numbering Function(s), Encoding FH-Partitions]
[Diagram: Get Numbering Function(s), Compute FH-Partitions, Encoding FH-Partitions]
• Compute a naïve numbering function
• Store the FH-partitions
• Reduce to TSP
• Region tree
• Multiple numbering functions
• Further compression
• Answering queries efficiently
30
EXPERIMENTS
31
Thank you!