Succinct de Bruijn graphs
Diego Díaz Domínguez
Department of Computer Science
University of Chile
de Bruijn graphs
Given a parameter k and a set S of strings, a de Bruijn graph
(dBG) is a directed graph that encodes the overlaps of length k-1
between all the substrings of length k (k-mers) of S.
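[Figure: dBG with k=5 of the string set {ATGCTATGCGT}. Nodes: ATGC, TGCT, GCTA, CTAT, TATG, TGCG, GCGT. Each edge is labeled with the last symbol of its k-mer; for example, the k-mer ATGCT is the edge from ATGC to TGCT, labeled T.]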
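As a rough illustration of the definition, the following Python sketch builds a dBG from a list of strings. The function name build_dbg and the dictionary layout (node -> set of outgoing edge labels) are assumptions made for this example, not something taken from the slides.

```python
from collections import defaultdict

def build_dbg(strings, k):
    """Map every (k-1)-mer node to the set of symbols labeling its outgoing edges."""
    graph = defaultdict(set)
    for s in strings:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            # The k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix,
            # labeled with its last symbol.
            graph[kmer[:-1]].add(kmer[-1])
            # Touch the target so that nodes without outgoing edges also appear.
            graph[kmer[1:]]
    return dict(graph)

dbg = build_dbg(["ATGCTATGCGT"], 5)
for node in sorted(dbg):
    print(node, sorted(dbg[node]))
```

A plain dictionary like this is easy to build but stores every node label explicitly, which is the cost the succinct encodings in the rest of the talk try to avoid.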
10/25/18 2Diego Díaz Domínguez
de Bruijn graphs
Different values for k yield different graph topologies
[Figure: dBGs of the string set {ATGCTA, CTATGC, ATGCGT} for two values of k. With k=4 the nodes are ATG, TGC, GCT, CTA, TAT, GCG, CGT; with k=5 the nodes are ATGC, TGCT, GCTA, CTAT, TATG, TGCG, GCGT.]
de Bruijn graphs
Advantages:
• They are easy to construct.
• They capture the underlying structure of the string set.
• They can be used to solve several problems in bioinformatics.
Disadvantages:
• Choosing the right value for k is always a major concern.
• The size of the graph grows with k.
• Keeping several dBGs of different orders for the same dataset is expensive.
Encoding a dBG
Things we have to do:
• Represent the nodes and the edges succinctly
• Traverse the graph efficiently
• Combine the information of several graph orders
• Colour the nodes
Operations in a dBG
Navigational operations we are interested in:
• outdegree(v): number of outgoing edges of node v
• indegree(v): number of incoming edges of node v
• outgoing(v,a): follow the outgoing edge of v labeled with a
• incoming(v,a): follow the incoming edge of v that comes from the node whose label starts with a
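The four operations can be sketched on a plain, non-succinct representation where the graph is simply the set of its k-mers. This is only a reference implementation under assumptions of my own: the helper name build_edges is hypothetical, the strings are not padded with $, and every query scans or probes the edge set directly.

```python
def build_edges(strings, k):
    """Set of k-mers; each k-mer is an edge between two (k-1)-mer nodes."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}

def outdegree(edges, v):
    """Number of edges leaving node v."""
    return sum(1 for e in edges if e[:-1] == v)

def indegree(edges, v):
    """Number of edges entering node v."""
    return sum(1 for e in edges if e[1:] == v)

def outgoing(edges, v, a):
    """Node reached from v through the edge labeled a, or None if absent."""
    return (v + a)[1:] if (v + a) in edges else None

def incoming(edges, v, a):
    """Predecessor of v whose label starts with a, or None if absent."""
    return (a + v)[:-1] if (a + v) in edges else None

edges = build_edges(["ATGCTA", "CTATGC", "ATGCGT"], 4)
print(outdegree(edges, "TGC"), indegree(edges, "ATG"))
print(outgoing(edges, "TGC", "T"), incoming(edges, "ATG", "T"))
```

Because the strings are not padded here, the degrees can differ from those of a $-padded graph, which also contains dummy nodes.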
Succinct dBG: BOSS
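[Figure: dBG with k=4 of the strings ATGCTA, CTATGC and ATGCGT, each padded with $$$ on both sides ($$$ATGCTA$$$, $$$CTATGC$$$, $$$ATGCGT$$$). Besides ATG, TGC, GCT, CTA, TAT, GCG and CGT, the graph contains the dummy nodes $$$, $$A, $AT, $$C, $CT, TA$, A$$, GC$, C$$, GT$ and T$$.]

BOSS matrix. Rows are sorted by the reversed node label, and NodeMarks is 1 on the last row of each node. A * flags an edge whose target node already has an earlier incoming edge. The subscript of an edge label is the rank of its target node among the nodes ending with that symbol.

Row  Node  EdgeBWT  NodeMarks
1    $$$   A1       0
2          C1       1
3    A$$   $        1
4    C$$   $        1
5    T$$   $        1
6    TA$   $        1
7    GC$   $        1
8    GT$   $        1
9    $$A   T1       1
10   CTA   $        0
11         T2       1
12   $$C   T3       1
13   TGC   $        0
14         G1       0
15         T4       1
16   GCG   T5       1
17   ATG   C2       1
18   $AT   G2       1
19   TAT   *G2      1
20   $CT   A2       1
21   GCT   *A2      1
22   CGT   $        1

C[$] = 0, C[A] = 7, C[C] = 9, C[G] = 11, C[T] = 13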
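BOSS: outdegree
The edges of node v occupy the rows of the BOSS matrix from select(NodeMarks,v-1)+1 to select(NodeMarks,v).
For node 11 (TGC):
first row = select(NodeMarks,10)+1 = 13
last row = select(NodeMarks,11) = 15
so outdegree(11) = 15 - 13 + 1 = 3 (edges $, G and T).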
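The two select calls can be checked with a small Python sketch. NODE_MARKS is transcribed from the NodeMarks column of the k=4 BOSS matrix of the example; select1 is a naive linear scan standing in for a constant-time select structure, and the names select1, edge_range and outdegree are chosen for this illustration.

```python
# NodeMarks column of the example BOSS matrix (k=4), rows 1..22.
NODE_MARKS = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

def select1(bits, j):
    """1-based row of the j-th 1 in bits (0 when j == 0)."""
    seen = 0
    for row, bit in enumerate(bits, start=1):
        seen += bit
        if bit and seen == j:
            return row
    return 0

def edge_range(v):
    """First and last row (1-based, inclusive) holding the edges of node v."""
    return select1(NODE_MARKS, v - 1) + 1, select1(NODE_MARKS, v)

def outdegree(v):
    """Number of rows in the range of node v; rows labeled $ are counted too."""
    first, last = edge_range(v)
    return last - first + 1

print(edge_range(11), outdegree(11))
```

Note that this count includes the row labeled $; whether that row should count as a real edge depends on how the dummy nodes are treated, which the slides do not spell out.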
BOSS: forward
C[$]=0, C[A]=7, C[C]=9, C[G]=11, C[T]=13
forward(11,T) = C[T] + rank(EdgeBWT,T,15) = 13 + 4 = 17
Row 15 holds the edge of node 11 (TGC) labeled T, and node 17 is GCT.
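The forward computation can be reproduced with naive rank and select in Python. The arrays below are transcribed from the k=4 BOSS matrix of the example; the function names and the linear-time rank/select are assumptions for illustration, where a real index would use a wavelet tree or a similar structure. Edges labeled $ are not followed in this sketch.

```python
# EdgeBWT and NodeMarks columns of the example BOSS matrix (k=4), rows 1..22.
EDGE_BWT = ["A", "C", "$", "$", "$", "$", "$", "$", "T", "$", "T", "T",
            "$", "G", "T", "T", "C", "G", "*G", "A", "*A", "$"]
NODE_MARKS = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
C = {"$": 0, "A": 7, "C": 9, "G": 11, "T": 13}

def select1(bits, j):
    """1-based row of the j-th 1 in bits (0 when j == 0)."""
    seen = 0
    for row, bit in enumerate(bits, start=1):
        seen += bit
        if bit and seen == j:
            return row
    return 0

def rank(seq, sym, i):
    """Occurrences of sym in rows 1..i (flagged symbols are distinct from sym)."""
    return seq[:i].count(sym)

def forward(v, a):
    """Node reached from node v through its edge labeled a (a != '$'), or None."""
    first, last = select1(NODE_MARKS, v - 1) + 1, select1(NODE_MARKS, v)
    for row in range(first, last + 1):
        if a != "$" and EDGE_BWT[row - 1].lstrip("*") == a:
            # Counting only unflagged occurrences also resolves flagged edges,
            # since a flagged edge shares its target with the previous unflagged one.
            return C[a] + rank(EDGE_BWT, a, row)
    return None

print(forward(11, "T"))
```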
BOSS: indegree
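a = select(EdgeBWT,G,2)
b = select(EdgeBWT,G,3)
indegree(13) = 1 + rank(EdgeBWT,*G,b) - rank(EdgeBWT,*G,a) = 2
Node 13 (ATG) is the second node ending with G, so its incoming edges are the second G of EdgeBWT plus the flagged *G symbols between a and b.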
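A small Python sketch of this computation follows. EDGE_BWT and C are transcribed from the k=4 BOSS matrix of the example. Two points are assumptions of this sketch rather than statements from the slides: when there is no (j+1)-th occurrence of the symbol, b is taken as one past the last row, and nodes ending with $ are rejected.

```python
# EdgeBWT column of the example BOSS matrix (k=4), rows 1..22.
EDGE_BWT = ["A", "C", "$", "$", "$", "$", "$", "$", "T", "$", "T", "T",
            "$", "G", "T", "T", "C", "G", "*G", "A", "*A", "$"]
C = {"$": 0, "A": 7, "C": 9, "G": 11, "T": 13}

def rank(seq, sym, i):
    """Occurrences of sym in rows 1..i."""
    return seq[:i].count(sym)

def select(seq, sym, j):
    """1-based row of the j-th sym, or len(seq) + 1 if there are fewer than j."""
    seen = 0
    for row, s in enumerate(seq, start=1):
        if s == sym:
            seen += 1
            if seen == j:
                return row
    return len(seq) + 1

def indegree(v):
    """Incoming edges of node v, for nodes whose label does not end with $."""
    sym = None
    for s in "ACGT":            # C is increasing, so the last match is the right one
        if C[s] < v:
            sym = s
    if sym is None:
        raise ValueError("nodes ending with $ are not handled in this sketch")
    j = v - C[sym]              # v is the j-th node ending with sym
    a = select(EDGE_BWT, sym, j)
    b = select(EDGE_BWT, sym, j + 1)
    flagged = "*" + sym
    return 1 + rank(EDGE_BWT, flagged, b) - rank(EDGE_BWT, flagged, a)

print(indegree(13))
```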
Bidirectional BOSS
incoming is more expensive than outgoing because we don't have access to the first symbol of the nodes. By doubling the size of the index, we can compute incoming fast.
[Figure: the example dBG (k=4) with its BOSS matrix next to the rBOSS matrix. The figure is annotated with the operations outgoing and select and with the result incoming(13,G) = 21.]

rBOSS matrix:

Row  Node  EdgeBWT  NodeMarks
1    $$$   $        1
2    A$$   *$       1
3    C$$   *$       1
4    TA$   $        1
5    TC$   $        1
6    $$A   T        1
7    GTA   $        0
8          T        1
9    $$C   G        1
10   TGC   G        1
11   ATC   $        0
12         G        1
13   $CG   T        1
14   GCG   *T       1
15   TCG   *T       1
16   $TG   C        1
17   $$T   G        1
18   $AT   C        1
19   TAT   *C       1
20   CGT   A        1
Variable-order BOSS
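By augmenting BOSS with the LCP array, we can represent all the dBGs up to order k.
[Figure: BOSS matrices for k=4 and k=2. A row i is marked together with the rows i' and j that bound its range in the lower-order graph, where LCP[i'] < (k-1) and LCP[j] < (k-1).]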
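One way to read the LCP conditions is sketched below in Python. It assumes that the LCP value of a node is the length of the longest common suffix it shares with the previous node in the sorted order, computed here directly from the node labels of the k=4 example. The function names common_suffix and shorter are hypothetical, and the details may differ from the construction actually used in the talk.

```python
# Node labels of the k=4 example, in BOSS order (sorted by reversed label).
NODES = ["$$$", "A$$", "C$$", "T$$", "TA$", "GC$", "GT$", "$$A", "CTA",
         "$$C", "TGC", "GCG", "ATG", "$AT", "TAT", "$CT", "GCT", "CGT"]

def common_suffix(x, y):
    """Length of the longest common suffix of x and y."""
    n = 0
    while n < len(x) and n < len(y) and x[-1 - n] == y[-1 - n]:
        n += 1
    return n

# LCP[t] compares node t+1 with node t (1-based node ids); LCP[0] is 0 by convention.
LCP = [0] + [common_suffix(NODES[t - 1], NODES[t]) for t in range(1, len(NODES))]

def shorter(i, m):
    """1-based range of nodes sharing their last m symbols with node i.

    For a graph of order k, m is k-1: the range stops at the first LCP value
    below m in each direction.
    """
    first = i
    while first > 1 and LCP[first - 1] >= m:
        first -= 1
    last = i
    while last < len(NODES) and LCP[last] >= m:
        last += 1
    return first, last

# Node 13 (ATG) in the order k=2 graph: all nodes ending with G.
print(shorter(13, 2 - 1))
```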
Encoding EdgeBWT: Wavelet trees
Encoding EdgeBWT: RL-BWT
Closing remarks
• We can compress several dBGs while still supporting fast operations.
• New algorithms that work on top of BOSS are still needed.
• Can we encode the patterns of the graph (bubbles, tips, cross-links)?
• Can we combine BOSS with other compression schemes (LZ, grammars) to improve compression even further?
Thanks!
Encoding EdgeBWT: RL-BWT
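[Figure: decomposition of EdgeBWT before run-length encoding.]
EdgeBWT = $ *$ *$ $ $ T $ T G G $ G T *T *T C G C *C A
D = 1 1 1 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 (marks the $ symbols of EdgeBWT)
E* = T T G G G T *T *T C G C *C A (EdgeBWT without the $ symbols)
M = 0 0 0 0 0 0 1 1 0 0 0 1 0 (marks the flagged symbols of E*)
RL-E: run-length encoding of E*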
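The decomposition suggested by the labels D, M, E* and RL-E can be sketched in Python as follows. The EdgeBWT sequence is the one shown in the figure. The sketch assumes that the flags are moved into M before run-length encoding, which the slide does not state explicitly, and the function names are made up for this example.

```python
from itertools import groupby

EDGE_BWT = "$ *$ *$ $ $ T $ T G G $ G T *T *T C G C *C A".split()

def split_edge_bwt(edge_bwt):
    """Split EdgeBWT into D ($ positions), M (flag positions) and the plain symbols."""
    d = [1 if s.endswith("$") else 0 for s in edge_bwt]
    e_star = [s for s in edge_bwt if not s.endswith("$")]
    m = [1 if s.startswith("*") else 0 for s in e_star]
    plain = [s.lstrip("*") for s in e_star]
    return d, m, plain

def run_length(symbols):
    """Collapse consecutive equal symbols into (symbol, run length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(symbols)]

D, M, PLAIN = split_edge_bwt(EDGE_BWT)
RUNS = run_length(PLAIN)
print(D)
print(M)
print(RUNS)
```

With the flags and $ symbols factored out into bitmaps, equal symbols are more likely to be adjacent, so the remaining sequence has fewer, longer runs.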