82
Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion Succinct Data Structures: From Theory to Practice Simon Gog Computing and Information Systems The University of Melbourne January 15th 2014

Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Embed Size (px)

Citation preview

Page 1: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Succinct Data Structures:From Theory to Practice

Simon Gog

Computing and Information SystemsThe University of Melbourne

January 15th 2014

Page 2: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Big data: Volume, Velocity,...

Page 3: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

...and Variety

Picture source IBM

Page 4: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

...and Variety

Picture source IBM

This talk addresses two Vs

Variety: Index unstructured data

Volume: Index larger data sets

Velocity is not directly addressed

Index: Construct once, query often

Page 5: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Outline

1 The Inverted Index: a success story

2 Suffix sorted based indexes: a tour de force

3 More virtues of the Wavelet Tree

4 Various Structures

5 SDSL: A toolbox of succinct structures

6 Conclusion

Page 6: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Index structures are the bases of information retrieval

Page 7: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Inverted Index

Inverted Indexes for a collection of documents Cused for web indexingpractical in domains with well-defined vocabularyspace is usually 5− 10% of C (without positions)space is about 50% if positions are includedcan not be used to reconstruct original text (parsing,stemming, stopping,...)original text has to be stored separately to allow snippetextraction

Page 8: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Inverted Index

Documents (already normalized)d0 : is big data really bigd1 : is it big in scienced2 : big data is big

Inverted Lists

big : {(0,1),(0,4),(1,2),(2,0),(2,3)}data : {(0,2),(2,1)}

in : {(1,3)}is : {(0,0),(1,0),(2,2)}

really : {(0,3)}science : {(1,4)}

Page 9: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Exact search using Inverted Index based systems (1)

No document containing „<line=break />” is returned.

Page 10: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Exact search using Inverted Index based systems (2)

but it exists...

Page 11: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Summary: Inverted Indexes

Advantages:Sequential locality in many operations (fast!)Index can be stored on diskGood compression for natural language text

Disadvantages:Decision, what is searchable, made at parsing timeNo fast calculation of phrase queriesNo alphabet⇒ not applicableAdapt data to index structure

Search quality seems to be enough for casual web search.Scientific tasks require high quality search.

Page 12: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Outline

1 The Inverted Index: a success story

2 Suffix sorted based indexes: a tour de force

3 More virtues of the Wavelet Tree

4 Various Structures

5 SDSL: A toolbox of succinct structures

6 Conclusion

Page 13: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Suffix sorted based indexes

Suffix sorted based indexes for a text Tsort all n suffixes of Tworks also in domains with no well-defined vocabularyused in Bioinformaticsspace

uncompressed: 5× to 50× size of Tcompressed: n × Hk (T ) ≤ bzip’d T

index structure adapts to data

can be used to reconstruct original text (Self-Index)Not cover in this talk: LZ compressed indexes (works well forhighly repetitive data)

Page 14: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Hk of Pizza&Chili corpus texts (200MB versions)

Hk (T ) in bits contexts/|T | in percent

k DBLP.XML DNA ENGLISH PROTEINS

0 5.26 0.0 1.97 0.0 4.53 0.0 4.20 0.01 3.48 0.0 1.93 0.0 3.62 0.0 4.18 0.02 2.17 0.0 1.92 0.0 2.95 0.0 4.16 0.03 1.43 0.1 1.92 0.0 2.42 0.0 4.07 0.04 1.05 0.4 1.91 0.0 2.06 0.3 3.83 0.15 0.82 1.3 1.90 0.0 1.84 1.0 3.16 1.76 0.70 2.7 1.88 0.0 1.67 2.7 1.50 17.4

Page 15: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Index size comparison (English text)si

ze

original text Suffix Tree Suffix Array Self-Index

Page 16: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Self-Index Operations

Let P be a character pattern of length m = |P|.

Operations:

count(P) How many times does P occur in T?locate(P) List all count(P) positions where P occurs.

extract(i,j) Extract text T [i ..j] from the index.

Operations are efficient (like O(m logσ) , O(m log n))

Page 17: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Suffix Tree (ST) of T=umulmundumulmum$

15

$

7

dumulmum$

11

m$

3

ndumulmum$

lmu

14

$

9

m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

um

6

ndumulmum$

10

m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m5

ndumulmum$

u n = 16Σ ={$,d,l,m,n,u}σ = 6

O(n) construction(Weiner,1973)

Problem:Structure size:≥ 20× |T |

Page 18: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Suffix Array (SA) of T=umulmundumulmum$

i

0123456789

101112131415

0123456789101112131415

T

umulmundumulmum$mulmundumulmum$ulmundumulmum$lmundumulmum$mundumulmum$undumulmum$ndumulmum$dumulmum$umulmum$mulmum$ulmum$lmum$mum$um$m$$

Page 19: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Suffix Array (SA) of T=umulmundumulmum$

i

0123456789

101112131415

SA

1571131491124610213805

T

$dumulmum$lmum$lmundumulmum$m$mulmum$mulmundumulmum$mum$mundumulmum$ndumulmum$ulmum$ulmundumulmum$um$umulmum$umulmundumulmum$undumulmum$

Size: ≥ 5× |T |Binary search to answercount(|P|)Match pattern from left toright: e.g. mum

Page 20: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Burrows-Wheeler Transform (BWT) of T

i

0123456789

101112131415

SA

1571131491124610213805

TBWT

mnuuuuullummmd$m

T

$dumulmum$lmum$lmundumulmum$m$mulmum$mulmundumulmum$mum$mundumulmum$ndumulmum$ulmum$ulmundumulmum$um$umulmum$umulmundumulmum$undumulmum$

Well-compressibleNew search algorithm(Ferragina & Manzini,2000): Backward searchDoes not require SA.

Page 21: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

Array C contains start of eachcharacter interval:

$ a b c d r r+10 1 9 13 14 15 19

rank(i , c) = # occurences ofc in TBWT[0, i).Search backwards for bar.

Page 22: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $abracadabrabarbara1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Initial interval:[sp0,ep0] = [0..n − 1]

Determine interval for r :sp1 = C[r ]+rank(sp0, r)ep1 = C[r ]+rank(ep0+1, r)−1

Page 23: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $abracadabrabarbara1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Initial interval:[sp0,ep0] = [0..n − 1]

Determine interval for r :sp1 = 15+rank(0, r)ep1 = 15+rank(19, r)

Page 24: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $abracadabrabarbara1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Initial interval:[sp0,ep0] = [0..n − 1]

Determine interval for r :sp1 = 15+0ep1 = 15+rank(19, r)− 1

Page 25: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $abracadabrabarbara1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Initial interval:[sp0,ep0] = [0..n − 1]

Determine interval for r :sp1 = 15+0 = 15ep1 = 15+4− 1 = 18

Page 26: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Interval: [sp1,ep1] = [15..18]

Determine interval for ar :sp2 = C[a]+rank(sp1,a)ep2 = C[a]+rank(ep1+1,a)−1

Page 27: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Interval: [sp1,ep1] = [15..18]

Determine interval for ar :sp2 = 1+rank(15,a)ep2 = 1+rank(ep1,a)

Page 28: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Interval: [sp1,ep1] = [15..18]

Determine interval for ar :sp2 = 1+rank(15,a)ep2 = 1+rank(ep1,a)

Page 29: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Interval: [sp1,ep1] = [15..18]

Determine interval for ar :sp2 = 1+6ep2 = 1+rank(19,a)− 1

Page 30: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Interval: [sp1,ep1] = [15..18]

Determine interval for ar :sp2 = 1+6 = 7ep2 = 1+8− 1 = 8

Page 31: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Interval: [sp2,ep2] = [7..8]

Determine interval for bar :sp3 = C[b]+rank(sp2,b)ep3 =C[b]+rank(ep2+1,b)−1

Page 32: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Interval: [sp2,ep2] = [7..8]

Determine interval for bar :sp3 = 9+rank(7,b)ep3 = 9+rank(ep1,b)

Page 33: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Interval: [sp2,ep2] = [7..8]

Determine interval for bar :sp3 = 9+rank(7,b)ep3 = 9+rank(ep1,b)

Page 34: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Interval: [sp2,ep2] = [7..8]

Determine interval for bar :sp3 = 9+0ep3 = 9+rank(9,b)−1

Page 35: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Backward Searchi TBWT T0 a $1 r a$2 r abarbara$3 d abrabarbara$4 $ abracadabrabarbara$5 r acadabrabarbara$6 c adabrabarbara$7 b ara$8 b arbara$9 r bara$

10 a barbara$11 a brabarbara$12 a bracadabrabarbara$13 a cadabrabarbara$14 a dabrabarbara$15 a ra$16 b rabarbara$17 b racadabrabarbara$18 a rbara$

C$ a b c d r0 1 9 13 14 15

Now search backwards forbar .Interval: [sp2,ep2] = [7..8]

Determine interval for bar :sp3 = 9+0 = 9ep3 = 9+2− 1 = 10

Page 36: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Conclusion Backward Search

String matching on T only requiresa structure which can answer rank queries on TBWT

the C arrayText T itself is not needed.

We search a data structurerepresents TBWT compressedanswers rank quickly

Page 37: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

The Wavelet Tree (WT)

Let X be a sequence of length n over alphabet Σ of size σ.

Efficient Wavelet Tree operations:

access(i) Recover X [i].rank(i,c) How many times does c occur in X [0, i)?

select(i,c) Location of the i-th count(P) positions where Poccurs.

Space: ≈ n logσ + o(n logσ) + σ log n

Page 38: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT operation rank

arrd$rcbbraaaaaabba

0111011001000000000

a$bbaaaaaabba

0011000000110

a$aaaaaaa

101111111

$

0

aaaaaaaa

10

bbbb

1

0

rrdrcr110101

dc10

c

0

d1

0

rrrr

1

1

Only the bitvectors(dark gray) is storedcode(a) = 001

Reduce WT rank onbitvector rank

rank(11,a,WT ) = rank(rank(rank(11,0,bε) = 5,0,b0) =3,1,b00) = 2

Page 39: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT operation rank

arrd$rcbbraaaaaabba

0111011001000000000

a$bbaaaaaabba

0011000000110

a$aaaaaaa

101111111

$

0

aaaaaaaa

10

bbbb

1

0

rrdrcr110101

dc10

c

0

d1

0

rrrr

1

1

Only the bitvectors(dark gray) is storedcode(a) = 001

Reduce WT rank onbitvector rank

rank(11,a,WT ) = rank(rank(rank(11,0,bε) = 5,0,b0) =3,1,b00) = 2

Page 40: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT operation rank

arrd$rcbbraaaaaabba

0111011001000000000

a$bbaaaaaabba

0011000000110

a$aaaaaaa

101111111

$

0

aaaaaaaa

10

bbbb

1

0

rrdrcr110101

dc10

c

0

d1

0

rrrr

1

1

Only the bitvectors(dark gray) is storedcode(a) = 001

Reduce WT rank onbitvector rank

rank(11,a,WT ) = rank(rank(rank(11,0,bε) = 5,0,b0) =3,1,b00) = 2

Page 41: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT operation rank

arrd$rcbbraaaaaabba

0111011001000000000

a$bbaaaaaabba

0011000000110

a$aaaaaaa

101111111

$

0

aaaaaaaa

10

bbbb

1

0

rrdrcr110101

dc10

c

0

d1

0

rrrr

1

1

Only the bitvectors(dark gray) is storedcode(a) = 001

Reduce WT rank onbitvector rank

rank(11,a,WT ) = rank(rank(rank(11,0,bε) = 5,0,b0) =3,1,b00) = 2

Page 42: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Rank on bitvectors

n + o(n) bit space and constant time solution

Jacobson: Space-efficient Static Trees and Graphs, FOCS 1989.Two level schema:

Divide bitvector in superblocks of size log2 n and blocks oflog n/2.Store absolute prefix sums for superblocks.Relative perfix sums for blocks.Four Russian-Trick (Lookup-Table counting set bits) for blocks.

Practice todaySince 2008 CPUs support popcount operation⇒ no need forLookup-Tables.

Page 43: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Improving WT space: Huffman shape

arrd$rcbbraaaaaabba

1000000000111111001

rrd$rcbbrbb

11001000100

d$cbbbb

0001111

d$c

100

$c

01

$

0

c

10

d

10

bbbb

10

rrrr

10

aaaaaaaa

1

Char c codeword(c)$ 00000a 1b 001c 00001d 0001r 01

Avg. depth: H0(BWT ). Total space: ≈ nH0 + 2σ log n

Page 44: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Other WT parametrizations

Shape: balanced, Huffman-shape, Hu-Tucker-shape, WaveletMatrixfor large alphabets (decreases O(σ log n) to O(logσ log n))run-length compressed versionsparameterized bitvector

Page 45: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Experimental results: operation count

26 S. GOG AND M.PETRI

0 20 40 60 80 100

0

2

4

6

8

Index size in (%)

Tim

epe

rcha

ract

erin

(µs)

XZGZIP

K =15

K =63

K =127

IndexFM-HF-R3KFM-RLMNFM-HF-VFM-HF-1L

instance = WEB-64MB

Index size in (%)

0 20 40 60 80 100

XZGZIP

K =15

K =63

K =127

FeatureTBLBLT+HP

instance = WEB-64GB

Figure 10. Count time and space of our index implementations on input instances of different size withcompression effectiveness baselines using standard compression utilities XZ and GZIP with option --best.

FM-RLMN profits from each feature. Especially replacing the lookup table version for popcntand select64 with more efficient BLT versions discussed above. Since FM-RLMN is a rather complexstructure the different rank and select sub data structures “compete” for cache. As expected the FM-index based on RANK-1L is slower than FM-HF-V when no optimization is used but slightly fasterwhen BLT and HP are activated. This reflects the outcome of the experiments of the basic rank datastructures used in both indexes shown in Figure 3.

Finally we discuss the time-space trade-offs of the different index types in Figure 10. As discussedabove, optimizations BLT and hugepages only affects the runtime performance on larger data sets(right). The uncompressed representations FM-HF-V and FM-HF-1L, and FM-RLMN are affectedthe most by the features. The FM-indexes based on H0 compressed bitvectors are also affected.Interestingly, the indexes for K = 63 and 128 achieve better compression then our gzip --best

baseline. The FM-index using RANK-R3K and K = 128 is only 5% larger than the xz compressedrepresentation of the test input. The index for K = 256 is even smaller but is not shown in Figure10 due to the increased runtime as shown in Table IX.

Overall our new data structures, improvements and optimized implementations enable FM-indexes that are faster than the state of the art (using RANK-1L) or smaller than state of the artindexes (using RANK-R3K with K = 128, 255).

7. CONCLUSION

We proposed a simple, cache-friendly rank data structure, a practical select data structure foruncompressed bitvectors and an improved implementation of compressed bitvector representations.We explore the behavior of rank and select data structures for binary and general sequencesfor varying data sizes, implementations, instruction sets and operating system features. We showthat using the built-in popcnt and our optimized select64 operation significantly improve theperformance of classic rank and select data structures. We further discuss the effect of larger pagesizes on the performance of succinct data structures. We demonstrate that these improvementspropagate directly to more complex succinct data structures such as FM-indexes. Using largerpage sizes has not yet been explored in literature in the context of succinct data structures and

Copyright c� 0000 John Wiley & Sons, Ltd. Softw. Pract. Exper. (0000)Prepared using speauth.cls DOI: 10.1002/spe

(G & Petri, 2014)

Page 46: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Experimental results: construction

Resource diagram for the construction of a Self-Index for 200MBof English text.

Page 47: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Outline

1 The Inverted Index: a success story

2 Suffix sorted based indexes: a tour de force

3 More virtues of the Wavelet Tree

4 Various Structures

5 SDSL: A toolbox of succinct structures

6 Conclusion

Page 48: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

More WT operations: Query point grids

σ

n

Operations:

count(x0,x1,y0,y1) # points in rectangle [x0, x1][y0, y1].report(x0,x1,y0,y1) Report points in rectangle [x0, x1][y0, y1].

Page 49: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT application: Top-k document retrieval

CSA

ω20ω10ω30ω30

#1ω10ω10ω40ω10

#1ω10ω40ω30ω10

#1ω50ω50

#1

T =b =

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

4

0

9

1

14

2

17

3

8

1

13

2

5

1

6

1

1

0

10

2

12

2

0

0

3

0

2

0

7

1

11

2

16

3

15

3D =

position=

ω 1

Page 50: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT application: Top-k document retrieval

Represent document array D as wavelet treeFirst explore nodes, which contain the maximal intervalExample: Search top-2 documents (frequency based ranking)

012312110220001233001101000110000111

01111000010111100001

00000

0

111111

0

2322223301000011

22222

0

333

1

1

Top documents containing ω1:

Page 51: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT application: Top-k document retrieval

Represent document array D as wavelet treeFirst explore nodes, which contain the maximal intervalExample: Search top-2 documents (frequency based ranking)

012312110220001233001101000110000111

01111000010111100001

00000

0

111111

0

2322223301000011

22222

0

333

1

1

Top documents containing ω1:

Page 52: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT application: Top-k document retrieval

Represent document array D as wavelet treeFirst explore nodes, which contain the maximal intervalExample: Search top-2 documents (frequency based ranking)

012312110220001233001101000110000111

01111000010111100001

00000

0

111111

0

2322223301000011

22222

0

333

1

1

Top documents containing ω1:

Page 53: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT application: Top-k document retrieval

Represent document array D as wavelet treeFirst explore nodes, which contain the maximal intervalExample: Search top-2 documents (frequency based ranking)

012312110220001233001101000110000111

01111000010111100001

00000

0

111111

0

2322223301000011

22222

0

333

1

1

Top documents containing ω1: 1 (3 times)

Page 54: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT application: Top-k document retrieval

Represent document array D as wavelet treeFirst explore nodes, which contain the maximal intervalExample: Search top-2 documents (frequency based ranking)

012312110220001233001101000110000111

01111000010111100001

00000

0

111111

0

2322223301000011

22222

0

333

1

1

Top documents containing ω1: 1 (3 times), 2 (2 times)

Page 55: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

WT application: Top-k document retrieval

Other results:2n + o(n) structure answers document frequency in constanttime (Sadakane 2007)Use RMQs for document listing (Sadakane 2007)Prestore selected intervals (Hon et al. 2009). Use skeletonSuffix Tree for mapping.

Page 56: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Outline

1 The Inverted Index: a success story

2 Suffix sorted based indexes: a tour de force

3 More virtues of the Wavelet Tree

4 Various Structures

5 SDSL: A toolbox of succinct structures

6 Conclusion

Page 57: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Compressed Bitvectors: SD-array (Elias-Fano coding)

ApplicationsGraphs (deBruijn subgraph, web graph,...)Lists (Inverted Lists,...)

monotone sequence X with 0 ≤ x0 ≤ x1 ≤ . . . ≤ xm−1 ≤ n.Lower ` = blog n

mc bits stored explicitly.Upper bits as sequence H of unary encoded gapsUses 2 + blog n

mc bits per elementConstant time access X by addding o(m) extra bits to H.

Page 58: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Compressed Bitvectors: SD-array (Elias-Fano coding)

X =

δ =

H =

L =

6

001101 2

01

2

11-0

8

010002 0

01

0

12-1

9

010012 1

1

1

02-2

17

100014 1

001

1

24-2

17

100014 1

1

1

04-4

31

111117 3

0001

3

37-4

X[3]=rank0(H, select1(H,4)) · 22 + L[3] = 4 · 4 + 1 = 17

Page 59: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Compressed Bitvectors: SD-array (Elias-Fano coding)

Interpret X as positions of set bits in a bitvector b.Constant time acccess, rank, select on b.Well suited for sparse bitvectors.

b =

H =

L =

00

01

02

03

04

05

16

07

18

19

00

01

02

03

04

05

06

17

08

09

00

01

02

03

04

05

06

07

08

09

00

01

01

2

01

0

1

1

001

1

1

1

0001

3

select(b,4) = rank0(H, select1(H,4)) · 22 + L[3] = 4 · 4 + 1 = 17

rank(b, i) and access(b, i): use select0 on H to select right upperpart + scan left for entry in L

Page 60: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Represent BPS as bitvector.

(()()(()())(()((()())()()))()((()())(()(()()))()))

find_open(i) Find matching closing paren of BPS[i].find_close(P) Find matching opening paren of BPS[i].

enclose(i) Find tightest paren pair which encloses pair(i ,find_close(i))

and more (double_enclose(i,j), rank(i), select(i),. . . )can be answered in constant time by just adding o(n) space.(Munro 1996, Geary et al. 2004 & 2006)

Page 61: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs=

tree uncompressedO(n log n) bits

compressed4n bits

Page 62: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= (BPSdfs=

0

tree uncompressedO(n log n) bits

compressed4n bits

Page 63: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= ((BPSdfs=

01

tree uncompressedO(n log n) bits

compressed4n bits

Page 64: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= (()BPSdfs=

01

tree uncompressedO(n log n) bits

compressed4n bits

Page 65: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= (()(BPSdfs=

01

3

tree uncompressedO(n log n) bits

compressed4n bits

Page 66: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= (()()BPSdfs=

01

3

tree uncompressedO(n log n) bits

compressed4n bits

Page 67: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= (()()(BPSdfs=

01

3 5

tree uncompressedO(n log n) bits

compressed4n bits

Page 68: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= (()()((BPSdfs=

01

3 5

6

tree uncompressedO(n log n) bits

compressed4n bits

Page 69: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= (()()(()BPSdfs=

01

3 5

6

tree uncompressedO(n log n) bits

compressed4n bits

Page 70: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= (()()(()(BPSdfs=

01

3 5

6

8tree uncompressedO(n log n) bits

compressed4n bits

Page 71: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= (()()(()()BPSdfs=

01

3 5

6

8tree uncompressedO(n log n) bits

compressed4n bits

Page 72: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Succinct representation of trees

15$

7dumulmum$

11m$

3

ndumulmum$

lmu

14$

9m$

1

ndumulmum$

lmu

12

m$

4

ndumulmum$

u

m

6

ndumulmum$

10m$

2

ndumulmum$

lmu

13$

8

m$

0

ndumulmum$

ulmu

m

5

ndumulmum$

u

(()()(()())(()((()())()()))()((()())(()(()()))()))BPSdfs= (()()(()())BPSdfs=

01

3 5

6

8

01

3 5

6

8

1112 14

15

16

18

2123

27

29

30

31

33

36

3739

40

42

46

tree uncompressedO(n log n) bits

compressed4n bits

Page 73: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Balanced Parentheses Sequences (BPS)

Application: Range Minimum Queries

Array X of n elements from a well-ordered set.rmq(i,j) returns position of minimum element in X [i , j].2n + o(n) bits and O(1) time solution (Fischer & Heun 2011)

X =10

41

62

23

994

105

56

137

28

749

3210

rmq(2,6)=3

Page 74: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Compressed Suffix Trees (CST)

15

$

7

dumulmum$

11

m$

3

ndumulm

um$

lmu

14

$

9

m$

1

ndumulm

um$

lmu

12

m$

4

ndumulm

um$

um

6

ndumulm

um$

10

m$

2

ndumulm

um$

lmu

13

$

8

m$

0ndum

ulmum

$ulm

um

5

ndumulm

um$

u

15 7 11 3 14 9 1 12 4 6 10 2 13 8 0 5

0 0 0 3 0 1 5 2 2 0 0 4 1 2 6 1

(()()(()())(()((()())()()))()((()())(()(()()))()))

Full functionality...

root()

is leaf (v)

parent(v)

degree(v)

child(v , c)

select child(v , i)sibling(v)

depth(v)

lca(v ,w)

sl(v), wl(v , c)

id(v), inv_id(i)Size: |CSA|+ |LCP|+ |BPS|, i.e. about original text size

Page 75: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Compressed Suffix Trees (CST) – Construction

Resource diagram for the construction of a CST for 200MB ofEnglish text.

Page 76: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Outline

1 The Inverted Index: a success story

2 Suffix sorted based indexes: a tour de force

3 More virtues of the Wavelet Tree

4 Various Structures

5 SDSL: A toolbox of succinct structures

6 Conclusion

Page 77: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

The Succinct Data Structure Library SDSL

https://github.com/simongog/sdsl-lite

Input from about 40 research papersFast and space-efficient constructionIndex structures available for byte and integer alphabetsWell-optimized (e.g. hardware POPCOUNT, hugepages,... )Easy configuration of myriads of CSAs, CSTs, WTsFast prototypingTests, benchmarks, tutorial, cheat sheet

Page 78: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Outline

1 The Inverted Index: a success story

2 Suffix sorted based indexes: a tour de force

3 More virtues of the Wavelet Tree

4 Various Structures

5 SDSL: A toolbox of succinct structures

6 Conclusion

Page 79: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Conclusion

Limitations/Challenges of Succinct Structures:Often only work for a static settingMemory access pattern often non-localImplementation from scratch difficultConstruction is often the bottleneckAlgorithms have to be adapted to work fast with succinctstructures

Page 80: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Conclusion

Advantages of Succinct Structures:Allow indexation of larger data sets (about 10-100 largercompared to classical indexes).Adapt to the data: E.g. capture repetitions.Often provide more functionality.More functionality often produces conceptional easiersolutions.Not limited to text indexing.

Page 81: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Conclusion

Many applications heavily depend on Indexes. Self-Indexes madeit possible to process Next Generation Sequencing in Genomicsand Life Sciences.

Why?

Only the speed of self indexes matches the amazing throughput ofsequencing technologies.

Page 82: Succinct Data Structures: From Theory to Practice · Inverted IndexesSuffix Sorted IndexesWavelet TreesVarious StructuresSDSL: A toolboxConclusion Succinct Data Structures: From

Inverted Indexes Suffix Sorted Indexes Wavelet Trees Various Structures SDSL: A toolbox Conclusion

Thank you for your attention.Thank MASTODONS