
1

Work @ Fudan University

Chen, Yaoliang

2

ENGINEERING WORK

• TTS System
▫ A Chinese Text-To-Speech system
• SafeDB
▫ Bug backlog
• SMemoHelper
▫ A small tool that helps learn English words.
• Fraud Detecting
▫ Time series tech

3

•CGAP-align: A high performance DNA short read alignment tool
▫Coauthor with BCM. Bioinformatics in progress
▫NDBC Demo
•On Encoding Shortest Paths in Large Graphs
▫Coauthor with Jian Pei. VLDB in progress
▫Coauthor with Haixun Wang. Sigmod in progress
▫NDBC

•Other Projects

RESEARCH WORK

4

•Baylor College of Medicine
•Sequence alignment and its significance
▫Reference & Reads

ACTAGCGATATAACCCTTTCCCTTTCCCTTT CACGAT

•Given a threshold z, a reference X and a read W, we want to find a substring W’=X[i,i+1,…,j] such that EditDistance(W,W’)≤z.
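
The problem statement above can be illustrated with a small dynamic-programming sketch. This is not how BWA or CGAP-align search (they use an FM-index); it is a brute-force reference, and the function name `best_substring_within` is hypothetical. The first DP row is all zeros, so a match may start anywhere in X at no cost.

```python
def best_substring_within(X, W, z):
    """Return (distance, end) for the substring of X closest to W,
    or None if even the best substring needs more than z edits.
    `end` is the exclusive end position in X of one best substring."""
    # Row for the empty prefix of W: starting anywhere in X is free.
    prev = [0] * (len(X) + 1)
    for i in range(1, len(W) + 1):
        cur = [i] + [0] * len(X)  # aligning W[:i] to nothing costs i
        for j in range(1, len(X) + 1):
            cost = 0 if W[i - 1] == X[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # extra character in W
                         cur[j - 1] + 1)      # extra character in X
        prev = cur
    best = min(prev)
    return (best, prev.index(best)) if best <= z else None


if __name__ == "__main__":
    print(best_substring_within("AACGTATCGACG", "CGTA", 0))
    print(best_substring_within("AACGTATCGACG", "CGTT", 1))
```

The cost is O(|X|·|W|) per read, which is why index-based search is needed for genome-sized references.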

CGAP-ALIGN: BACKGROUND


5

DNA sequences in GenBank

CHALLENGES

•A human genome sequence
▫ 2000 € 1,000,000,000 in ~10 years
▫ 2008 € 50 - 100,000 in ~4 months
▫ 2010 € 5 - 10,000 in ~2 weeks
▫ ...2015 € 1,000 in ~1 day
▫ ...2020 € 10 in ~1 hour to minutes

6

•Burrows-Wheeler Alignment Tool (BWA)
▫A popular tool for aligning short gene reads against a large reference sequence
•Optimization of BWA
▫Code level
▫Algorithm level
•BWA Performance: T = N × Taln
▫N: number of modified reads produced by enumerating all mismatches and gaps of the read
▫Taln: time to locate the modified reads in the reference during the alignment stage

PERFORMANCE OF BWA

7

•Optimizing Taln: efficiency for matching
▫Suffix Tarray
•Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
▫Data-Conscious D-Array Calculating

OPTIMIZATION

8

•Suffix Tree
•Suffix Array Based on BWT (FM-index)
•Comparison

SUFFIX TARRAY

[Figure: suffix trie (Root with edges A, C, G, T) compared with FM-index intervals such as R(AA), R(TC) and R(TT); Ref = ATCTTCAAGA, Read = TAA]

9

BURROWS-WHEELER TRANSFORM

T = mississippi#

All rotations of T:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (F = first column, L = last column):

F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

From Yuval Rikover
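
The transform on this slide can be reproduced with a few lines of Python. This is a naive sketch that materialises every rotation, which is fine for a slide-sized example but not for a genome. The function names are illustrative, and `#` is assumed to sort before every other character, as it does in ASCII.

```python
def bwt(text):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def first_column(text):
    """First column F of the sorted rotation matrix (the sorted characters)."""
    return "".join(sorted(text))


if __name__ == "__main__":
    print(bwt("mississippi#"))           # L column
    print(first_column("mississippi#"))  # F column
```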

10

BURROWS-WHEELER TRANSFORM

Reminder: Recovering T from L

1. Find F by sorting L
2. First char of T? m
3. Find m in L
4. L[i] precedes F[i] in T. Therefore we get mi
5. How do we choose the correct i in L?
▫ The i’s are in the same order in L and F
▫ As are the rest of the char’s
6. i is followed by s: mis
7. And so on….

F = #iiiimppssss
L = ipssm#pissii
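
The recovery steps can be sketched as follows. The sketch relies on the property stated in step 5: the k-th occurrence of a character in F is the same text character as its k-th occurrence in L. The function name and the choice of `#` as end marker are assumptions made for illustration.

```python
from collections import defaultdict


def inverse_bwt(L, end="#"):
    """Rebuild T from its BWT column L, reading T left to right."""
    F = sorted(L)
    # Positions of each character in L, in order of appearance.
    positions = defaultdict(list)
    for i, c in enumerate(L):
        positions[c].append(i)
    # f_to_l[r]: row whose L entry is the same text character as F[r].
    seen = defaultdict(int)
    f_to_l = []
    for c in F:
        f_to_l.append(positions[c][seen[c]])
        seen[c] += 1
    # The row ending with the end marker is T itself, so its F entry is T[0].
    row = L.index(end)
    out = []
    for _ in range(len(L)):
        out.append(F[row])
        row = f_to_l[row]  # the character that follows is F of that row
    return "".join(out)


if __name__ == "__main__":
    print(inverse_bwt("ipssm#pissii"))
```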

11

NEXT: COUNT P IN T

• Backward-search algorithm
• Uses only L (output of BWT)
• Relies on 2 structures:
▫ C[1,…,|Σ|] : C[c] contains the total number of text chars in T which are alphabetically smaller than c (including repetitions of chars)
▫ Occ(c,q): number of occurrences of char c in prefix L[1,q]

Example

•C[ ] for T = mississippi#: C[i] = 1, C[m] = 5, C[p] = 6, C[s] = 8
•occ(s, 5) = 2
•occ(s,12) = 4
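
Both structures are easy to compute naively, which helps when checking the example values. The sketch below uses plain counting. A real FM-index would store sampled Occ values instead of rescanning L, and the helper names are hypothetical.

```python
def build_c(text):
    """C[c] = number of characters in text that are smaller than c."""
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return C


def occ(L, c, q):
    """Occurrences of c in the prefix L[1,q] (1-based, inclusive)."""
    return L[:q].count(c)


if __name__ == "__main__":
    T = "mississippi#"
    L = "ipssm#pissii"  # BWT of T
    print(build_c(T))
    print(occ(L, "s", 5), occ(L, "s", 12))
```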

12

SUBSTRING SEARCH IN T (COUNT THE PATTERN OCCURRENCES)

P = si

•First step: fr and lr delimit the rows prefixed by char “i”
•Inductive step: Given fr,lr for P[j+1,p], take c=P[j]
▫Find the first c in L[fr, lr]
▫Find the last c in L[fr, lr]
▫Occ() oracle is enough
•occ=2 [lr-fr+1]

[Figure: sorted rotation matrix of mississippi# with column L = ipssm#pissii (available info) and array C; the remaining columns are unknown]
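
A minimal sketch of the counting procedure, assuming 0-based half-open row ranges [lo, hi) instead of the slide's inclusive fr and lr, so the count is hi - lo. Occ is computed by rescanning L for clarity, and the function name is illustrative.

```python
def count_occurrences(L, P):
    """Count occurrences of P in T using only L = BWT(T)."""
    # C[c]: number of characters in T smaller than c.
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)

    def occ(c, q):
        # Occurrences of c in the first q characters of L.
        return L[:q].count(c)

    lo, hi = 0, len(L)  # all rows match the empty pattern
    for c in reversed(P):  # process P from its last character
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    print(count_occurrences("ipssm#pissii", "si"))
```

Each step costs two Occ lookups, so with a constant-time Occ oracle the whole count takes O(|P|) time regardless of |T|.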

13

•Backward search
•Store “First” and “Last” (k and l) values

SUFFIX T-ARRAY

14

BACKWARD-SEARCH EXAMPLE

•P = CAA
▫ i = 3: c = ‘A’
▫ i = 2: c = ‘A’, First = First(AA), Last = Last(AA)
▫ i = 1: c = ‘C’, First = C[‘T’] + Occ(‘C’,First(AA)) +1, Last = C[‘T’] + Occ(‘C’,Last(AA))

[Figure: FM-index trie walked from Root along A, A]

15

•Optimizing Taln: efficiency for matching
▫Suffix Tarray
•Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
▫Data-Conscious D-Array Calculating

OPTIMIZATION

16

• e(W)
▫The minimal number of edit operations needed to make W align exactly onto the reference X.
•D-array
▫D[i] : Lower bound of e(W[0…i])

D-ARRAY: MOTIVATION


17

•Given a string W and an arbitrary segmentation of W into consecutive substrings, W = w1w2…wk, we have e(W) ≥ e(w1) + e(w2) + … + e(wk)
•D array in BWA
▫Split W into several small strings, W = w1w2…wk, with e(wi) = 1 for all i. The correctness of the algorithm depends on the inequality e(W) ≥ e(w1) + e(w2) + … + e(wk).

D ARRAY: MOTIVATION
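
The BWA-style bound can be sketched as a greedy left-to-right scan: each time the current segment stops occurring in X, it needs at least one edit, so the counter goes up and a new segment starts. The sketch uses Python's substring test as a stand-in for the index lookup (an assumption; BWA queries an FM-index instead), and `d_array` is a hypothetical name.

```python
def d_array(W, X):
    """D[i] = number of disjoint segments of W[0..i] that do not occur in X.
    Each such segment needs at least one edit, so D[i] <= e(W[0..i])."""
    D = []
    z = 0      # segments found so far that are absent from X
    start = 0  # start of the current segment
    for i in range(len(W)):
        if W[start:i + 1] not in X:
            z += 1
            start = i + 1  # begin a fresh segment after position i
        D.append(z)
    return D


if __name__ == "__main__":
    # Illustrative read, not the one on the slides.
    print(d_array("AACT", "AACGTATCGACG"))
```

Every segment counted here has e(wi) = 1 at best, which is what motivates the longer, costlier segments considered next.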

18

•Example: Reference X = “AACGTATCGACG”
▫W
▫D
•A better segmentation: Consider e(·)= 2
▫W
▫D
▫Calculating e(·) costs exponential time
▫Needs pre-computation

D ARRAY: MOTIVATION

[Figure: read W = AGTCAA with its D-array values under the two segmentations]

19

• Fasta file F containing training reads

•Should be similar to the reads in practice

•Data-Conscious

SOLUTION - FREQUENT PATTERN

[Workflow: Train Reads, Frequent Patterns, Trie DFA]

•Mining Frequent Patterns (FPs)

•State-of-the-art methods

•Our solution: A simple DFS on FM-index
▫Count=Last-First+1

•Generate prefix trie T for the FPs with e(w)=2.

•Refine T to a DFA GT
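
One way to read "a simple DFS on FM-index" is sketched below. Patterns are grown one character at a time by backward search, and a branch is abandoned as soon as its count drops below a support threshold. The threshold `min_count`, the length cap `max_len`, the naive index construction, and all function names are assumptions for illustration, not details taken from the slides. Ranges are half-open, so the count is hi - lo rather than Last-First+1.

```python
def build_fm(text):
    """Naive FM-index parts for a text that ends with '#': (L, C, occ)."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    L = "".join(text[i - 1] for i in order)  # text[-1] is '#' when i == 0
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    # occ[c][q] = occurrences of c in L[:q]
    occ = {c: [0] * (len(L) + 1) for c in C}
    for q, ch in enumerate(L):
        for c in C:
            occ[c][q + 1] = occ[c][q] + (1 if ch == c else 0)
    return L, C, occ


def frequent_patterns(text, min_count, max_len, alphabet="ACGT"):
    """DFS over backward-search ranges; returns {pattern: count}."""
    L, C, occ = build_fm(text)
    found = {}
    stack = [("", 0, len(L))]
    while stack:
        pattern, lo, hi = stack.pop()
        for c in alphabet:
            if c not in C:
                continue
            new_lo = C[c] + occ[c][lo]
            new_hi = C[c] + occ[c][hi]
            count = new_hi - new_lo
            if count >= min_count:  # extending can only lower the count
                found[c + pattern] = count
                if len(pattern) + 1 < max_len:
                    stack.append((c + pattern, new_lo, new_hi))
    return found


if __name__ == "__main__":
    print(frequent_patterns("ACACAC#", min_count=2, max_len=2))
```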

20

•Why Trie DFA?
▫During online alignment, we need to find all the FPs contained in a read
▫This operation should be no more expensive than O(|W|)

TRIE DETERMINISTIC FINITE AUTOMATON

21

Offline Index: Construction

•String Set (FP set)
▫AA
▫C
▫G
▫T
▫AC
▫AG

•The prefix trie is done. We start to construct the DFA.

[Figure: prefix trie rooted at R with edges A, C, G, T and leaves LAA, LAC, LAG, LC, LG, LT, next to the DFA derived from it with states numbered 1 to 7]

22

•DFS order: number the states in DFS order to minimize the average hop between each jump. (7% up)

RE-ORDERING

[Figure: the same trie with its states renumbered 1 to 7 in DFS order]

23

Online Query

•String Set (FP set)
▫AA
▫AC
▫AG
▫C
▫G
▫T

•W=“CACAT”

[Figure: the DFA run on W = “CACAT”, visiting R, LC, 1, LAC, 1, LT]
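
The online query can be illustrated with a generic Aho-Corasick-style construction: build the prefix trie, then fill in the missing transitions so that every state has an outgoing edge for every character. This is a standard technique used here as a stand-in, not necessarily the exact DFA built in this work. The function names are hypothetical, and the read is assumed to contain only characters from the alphabet.

```python
from collections import deque


def build_dfa(patterns, alphabet="ACGT"):
    """Return (goto, out): full transition table and patterns ending per state."""
    goto, out = [{}], [[]]
    for p in patterns:  # 1) prefix trie
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    fail = [0] * len(goto)
    queue = deque()
    for ch in alphabet:  # 2) root transitions
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0
    while queue:  # 3) breadth-first completion
        s = queue.popleft()
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = goto[fail[s]][ch]
                out[t] = out[t] + out[fail[t]]  # inherit shorter matches
                queue.append(t)
            else:
                goto[s][ch] = goto[fail[s]][ch]
    return goto, out


def find_patterns(W, goto, out):
    """One pass over W; returns (start position, pattern) pairs."""
    state, hits = 0, []
    for i, ch in enumerate(W):
        state = goto[state][ch]
        hits.extend((i - len(p) + 1, p) for p in out[state])
    return hits


if __name__ == "__main__":
    goto, out = build_dfa(["AA", "AC", "AG", "C", "G", "T"])
    print(find_patterns("CACAT", goto, out))
```

The scan performs one table lookup per character of W, plus the work of reporting the matches.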

24

•Optimizing Taln: efficiency for matching
▫Suffix Tarray (20% up)
•Optimizing N: pruning ability to avoid enumerating unnecessary mismatches and gaps
▫Data-Conscious D-Array Calculating (0-200% up)

EXPERIMENT

25

• Background
•Consider a graph G = (V,E), where V is a set of vertices and E ⊆ V×V is a set of edges.

•FH-Partition

ON ENCODING SHORTEST PATHS IN LARGE GRAPHS

26

EXAMPLES

•7->10: FH(7,10) = 9; FH(9,10) = 2; FH(2,10) = 10
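
Assuming FH(u,v) denotes the first hop on a shortest path from u to v (the slides do not spell this out), the example above amounts to repeated lookups until the target is reached. The dictionary layout and the function name below are illustrative only.

```python
def shortest_path(fh, source, target):
    """Follow first-hop entries from source until target is reached."""
    path = [source]
    current = source
    while current != target:
        current = fh[(current, target)]  # next vertex on the way to target
        path.append(current)
        if len(path) > len(fh) + 1:
            raise ValueError("first-hop table contains a cycle")
    return path


if __name__ == "__main__":
    # First-hop entries taken from the example on this slide.
    fh = {(7, 10): 9, (9, 10): 2, (2, 10): 10}
    print(shortest_path(fh, 7, 10))
```

Storing one first hop per (vertex, target) pair takes space quadratic in |V|, which is presumably what the FH-partition encoding is meant to compress.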

27

PROBLEM STATEMENT

•Numbering Function

28

MCN IS NP-HARD!!

29

WORKFLOW

[Workflow boxes: Compute FH-Partitions, Get Numbering Function(s), Encoding FH-Partitions]

•Compute a naïve numbering function
•Store the FH-partitions
•Reduce to TSP
•Region tree
•Multi numbering functions
•Further Compression
•Answering query efficiently

30

EXPERIMENTS

31

Thank you!