
Elegant XML Compression Presented by Minko Dudev 02.02.2006 IR Seminar WS06/07 Final Presentation


Page 1: Elegant XML Compression

Presented by Minko Dudev

02.02.2006

IR Seminar WS06/07, Final Presentation

Page 2: XML

1|Emma|J. Austin|1816|English|A. Bertrand
2|Jane Eyre|C. Bronte|1847|English|Smith Elder and Co

<biblio>
  <book id="1">
    <title>Emma</title>
    <author>J. Austin</author>
    <year>1816</year>
    <language>English</language>
    <publisher>A. Bertrand</publisher>
  </book>
  <book id="2">
    <title>Jane Eyre</title>
    <author>C. Bronte</author>
    <year>1847</year>
    <language>English</language>
    <publisher>Smith Elder and Co</publisher>
  </book>
</biblio>

Readable

Hierarchical

Simple to parse

Platform independent

BUT VERY VERBOSE

Page 3: Non-Queriable Compression

<book id="1">
  <title>Emma</title>
  <author>J. Austin</author>
  <year>1816</year>
  <language>English</language>
</book>
<book id="2">
  <title>Jane Eyre</title>
  <author>C. Bronte</author>
  <year>1847</year>
  <language>English</language>
</book>

Tags: T1 = book, T2 = @id, T3 = title, T4 = author, T5 = year, T6 = language

Containers:
C1: 1, 2
C2: Emma, Jane Eyre
C3: J. Austin, C. Bronte
C4: 1816, 1847
C5: English, English

Structure: T1 T2 C1 T3 C2 / T4 C3 / T5 C4 / T6 C5 / / T1 …

Very good compression, BUT

WHOLE DOCUMENT MUST BE DECOMPRESSED
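
The separation of structure and content on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `split`, the wrapping `<biblio>` root, and the use of `zlib` as the back-end compressor are not taken from the slides. Tags and attribute names get tokens T1, T2, … and every tag or attribute name gets its own value container C1, C2, …, numbered by first use.

```python
import xml.etree.ElementTree as ET
import zlib

# The two books from the slide, wrapped in a root so the text is well-formed.
DOC = (
    '<biblio>'
    '<book id="1"><title>Emma</title><author>J. Austin</author>'
    '<year>1816</year><language>English</language></book>'
    '<book id="2"><title>Jane Eyre</title><author>C. Bronte</author>'
    '<year>1847</year><language>English</language></book>'
    '</biblio>'
)


def split(root):
    """Return (structure stream, containers) for the children of `root`."""
    tags, containers, structure = {}, {}, []

    def tag_token(name):
        return "T%d" % tags.setdefault(name, len(tags) + 1)

    def store(name, value):
        # One container per tag or attribute name, numbered by first use.
        if name not in containers:
            containers[name] = ("C%d" % (len(containers) + 1), [])
        cid, values = containers[name]
        values.append(value)
        return cid

    def visit(elem):
        structure.append(tag_token(elem.tag))
        for name, value in elem.attrib.items():
            structure.append(tag_token("@" + name))
            structure.append(store("@" + name, value))
        if elem.text and elem.text.strip():
            structure.append(store(elem.tag, elem.text.strip()))
        for child in elem:
            visit(child)
        structure.append("/")  # closes the element

    for book in root:
        visit(book)
    return structure, {cid: values for cid, values in containers.values()}


structure, containers = split(ET.fromstring(DOC))

# Each container holds similar values, so it is compressed on its own.
packed = {cid: zlib.compress("\n".join(values).encode())
          for cid, values in containers.items()}
```

Reading a single value back requires decompressing the whole container it lives in, which is the drawback the slide points out.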

Page 4: Queriable Compression

<book id="1">
  <title>Emma</title>
  <author>J. Austin</author>
  <year>1816</year>
  <language>English</language>
</book>
<book id="2">
  <title>Jane Eyre</title>
  <author>C. Bronte</author>
  <year>1847</year>
  <language>English</language>
</book>

T1 T2 enc(1)
  T3 enc(Emma) /
  T4 enc(J. Austin) /
  T5 enc(1816) /
  T6 enc(English) /
/
T1 T2 enc(2)
  T3 enc(Jane Eyre) /
  T4 enc(C. Bronte) /
  T5 enc(1847) /
  T6 enc(English) /
/

Can be queried, BUT

HAS BAD COMPRESSION RATIO

Page 5: Goals

A new scheme that

Has very good compression properties

Can be queried

Has good performance

Page 6: XML as a Tree

<biblio>
  <book id="1">
    <title>Emma</title>
    <author>J. Austin</author>
    <year>1816</year>
    <language>English</language>
  </book>
  <book id="2">
    <title>Jane Eyre</title>
    <author>C. Bronte</author>
    <year>1847</year>
    <language>English</language>
  </book>
</biblio>

biblio
  book
    id
      1
    author
      J.Austin
    title
      Emma
  book
    id
      2
    author
      C.Bronte
    title
      Jane Eyre

XML document = labeled tree

Search operations

What are the children of some node
What is the parent of some node
What are the nodes that have a certain path prefix
How many paths with a certain prefix exist
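
The search operations listed here can be tried directly on the parsed tree before any compression is involved. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper names (`children`, `upward_path`, `count_with_prefix`) are made up for this illustration, and the document is a shortened version of the example.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<biblio>'
    '<book id="1"><title>Emma</title><author>J. Austin</author></book>'
    '<book id="2"><title>Jane Eyre</title><author>C. Bronte</author></book>'
    '</biblio>'
)

root = ET.fromstring(DOC)

# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def children(node):
    """Labels of the children of `node`, in document order."""
    return [child.tag for child in node]


def upward_path(node):
    """Labels from the parent of `node` up to the root."""
    labels = []
    while node in parent_of:
        node = parent_of[node]
        labels.append(node.tag)
    return labels


def count_with_prefix(prefix):
    """Number of nodes whose root-to-node label path starts with `prefix`."""
    def root_path(node):
        return upward_path(node)[::-1] + [node.tag]
    return sum(1 for node in root.iter()
               if root_path(node)[:len(prefix)] == prefix)


first_title = root.find("book/title")
print(children(root))                         # labels of the root's children
print(upward_path(first_title))               # parent first, root last
print(count_with_prefix(["biblio", "book"]))  # nodes at or below a book
```

Every one of these helpers walks the uncompressed tree; the rest of the talk is about answering the same questions on a compressed representation.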

Page 7: The XBW Transform

A
  B
    D
      a
    a
    E
      b
  C
    D
      c
    b
    D
      c
  B
    D
      b

Σ_N = {A, B, C, …} (internal node labels)

Σ_L = {a, b, c, …} (leaf labels)

Pre-order traversal:

 i  Slast  Slabel  Spath
 1  0      A       empty
 2  0      B       A
 3  0      D       BA
 4  1      a       DBA
 5  0      a       BA
 6  1      E       BA
 7  1      b       EBA
 8  0      C       A
 9  0      D       CA
10  1      c       DCA
11  0      b       CA
12  1      D       CA
13  1      c       DCA
14  1      B       A
15  1      D       BA
16  1      b       DBA

After a stable sort by Spath:

 i  Slast  Slabel  Spath
 1  0      A       empty
 2  0      B       A
 3  0      C       A
 4  1      B       A
 5  0      D       BA
 6  0      a       BA
 7  1      E       BA
 8  1      D       BA
 9  0      D       CA
10  0      b       CA
11  1      D       CA
12  1      a       DBA
13  1      b       DBA
14  1      c       DCA
15  1      c       DCA
16  1      b       EBA

Skew algorithm: O(N)
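
The construction on this slide can be reproduced with a short Python sketch. It follows the slide literally (pre-order traversal, then a stable sort on the upward path) and therefore runs in O(N log N) comparisons on explicit path strings; it does not implement the linear-time skew algorithm. The tuple-based tree encoding and the function name `xbw` are assumptions made for the example, and labels are assumed to be single characters so that paths can be built by string concatenation.

```python
# A node is (label, [children]); leaves have an empty child list.
TREE = ("A", [
    ("B", [("D", [("a", [])]), ("a", []), ("E", [("b", [])])]),
    ("C", [("D", [("c", [])]), ("b", []), ("D", [("c", [])])]),
    ("B", [("D", [("b", [])])]),
])


def xbw(tree):
    """Return (Slast, Slabel, Spath) for a tree with one-character labels."""
    rows = []  # (upward path, last flag, label), collected in pre-order

    def visit(node, path, last):
        label, kids = node
        rows.append((path, last, label))
        for i, kid in enumerate(kids):
            # The child's upward path starts with its parent's label.
            visit(kid, label + path, 1 if i == len(kids) - 1 else 0)

    visit(tree, "", 0)  # the slide marks the root with last = 0

    # list.sort is stable, so nodes with equal paths keep their pre-order.
    rows.sort(key=lambda row: row[0])

    s_last = "".join(str(row[1]) for row in rows)
    s_label = "".join(row[2] for row in rows)
    s_path = [row[0] for row in rows]
    return s_last, s_label, s_path


s_last, s_label, s_path = xbw(TREE)
print(s_last)
print(s_label)
print(s_path)
```

Only Slast and Slabel need to be stored; Spath is shown because it makes the sort order visible.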

Page 8: Compressibility

<biblio
  <book
    @id
      =
        §1
    <author
      =
        §J.Austin
    <title
      =
        §Emma

 i  Slast  Slabel      Spath
 1  1      <biblio     empty
 2  1      =           <author<book<biblio
 3  1      =           <author<book<biblio
 4  0      <book       <biblio
 5  1      <book       <biblio
 6  0      @id         <book<biblio
 7  0      <author     <book<biblio
 8  1      <title      <book<biblio
 9  0      @id         <book<biblio
10  0      <author     <book<biblio
11  1      <title      <book<biblio
12  1      =           <title<book<biblio
13  1      =           <title<book<biblio
14  1      =           @id<book<biblio
15  1      =           @id<book<biblio
16  1      §J. Austin  =<author<book<biblio
17  1      §C. Bronte  =<author<book<biblio
18  1      §Emma       =<title<book<biblio
19  1      §Jane Eyre  =<title<book<biblio
20  1      §1          =@id<book<biblio
21  1      §2          =@id<book<biblio

PCDATA =

Page 9: Some properties

Slast  = 0001001100111111
Slabel = ABCBDaEDDbDabccb
Spath  = empty, A, A, A, BA, BA, BA, BA, CA, CA, CA, DBA, DBA, DCA, DCA, EBA

Tree: A(B(D(a), a, E(b)), C(D(c), b, D(c)), B(D(b)))

Children lie contiguously

Relative order of parents and children is preserved

Page 10: Inverse XBW Transform

Uses only scans: O(N)

i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Slabel  A  B  C  B  D  a  E  D  D  b  D  a  b  c  c  b
Slast   0  0  0  1  0  0  1  1  0  0  1  1  1  1  1  1
J       2  5  9  8 12 -1 16 13 14 -1 15 -1 -1 -1 -1 -1

Spath = empty, A, A, A, BA, BA, BA, BA, CA, CA, CA, DBA, DBA, DCA, DCA, EBA

Tree: A(B(D(a), a, E(b)), C(D(c), b, D(c)), B(D(b)))

x     A  B  C  D  E
code  1  2  3  4  5
F     2  5  9 12 16
C     1  2  1  4  1

J[i] = jump to the first child of node i (-1 for a leaf); J[5] = 12

F[x] = first row whose Spath is prefixed by x; F[B=2] = 5

C[x] = number of occurrences of x in Slabel; C[B=2] = 2
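
The arrays F, C and J can be derived from Slast and Slabel alone, using nothing but left-to-right scans. The sketch below is one way to do it in Python; the function names are made up, row numbers are 1-based as on the slide, and it assumes (as in the example) that upper-case labels are internal nodes and lower-case labels are leaves.

```python
S_LAST = "0001001100111111"
S_LABEL = "ABCBDaEDDbDabccb"


def is_internal(label):
    # Assumption taken from the example: upper-case labels are internal nodes.
    return label.isupper()


def build_f(s_last, s_label):
    """Return (F, C): first row per path prefix, and label counts."""
    counts = {}
    for x in s_label:
        if is_internal(x):
            counts[x] = counts.get(x, 0) + 1
    f, pos = {}, 2  # row 1 is the root, whose path is empty
    for x in sorted(counts):
        f[x] = pos
        # Skip the child blocks of all nodes labelled x; each ends at a 1.
        for _ in range(counts[x]):
            while s_last[pos - 1] == "0":
                pos += 1
            pos += 1
    return f, counts


def build_j(s_last, s_label):
    """J[i] = row of the first child of row i, or -1 for a leaf."""
    f, _ = build_f(s_last, s_label)
    next_block = dict(f)  # next unused child block for each label
    j = []
    for x in s_label:
        if not is_internal(x):
            j.append(-1)
            continue
        pos = next_block[x]
        j.append(pos)
        while s_last[pos - 1] == "0":
            pos += 1
        next_block[x] = pos + 1
    return j


def preorder(s_last, s_label):
    """Rebuild the pre-order label sequence of the original tree."""
    j = build_j(s_last, s_label)
    out = []

    def visit(i):
        out.append(s_label[i - 1])
        k = j[i - 1]
        while k != -1:
            visit(k)
            # A 1 in Slast marks the last child of the block.
            k = -1 if s_last[k - 1] == "1" else k + 1

    visit(1)
    return "".join(out)


F, C = build_f(S_LAST, S_LABEL)
J = build_j(S_LAST, S_LABEL)
print(F, C)
print(J)
print(preorder(S_LAST, S_LABEL))
```

The recursion in `preorder` is used for brevity; for deep trees an explicit stack would avoid Python's recursion limit.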

Page 11: Subpath search

Find nodes with path P = ABD

i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Slabel  A  B  C  B  D  a  E  D  D  b  D  a  b  c  c  b
Slast   0  0  0  1  0  0  1  1  0  0  1  1  1  1  1  1

Spath = empty, A, A, A, BA, BA, BA, BA, CA, CA, CA, DBA, DBA, DCA, DCA, EBA

Tree: A(B(D(a), a, E(b)), C(D(c), b, D(c)), B(D(b)))

x     A  B  C  D  E
code  1  2  3  4  5
F     2  5  9 12 16

rank and select take O(1) time,

thus O(|P|) time for subpath search
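
A minimal Python sketch of the search for P = ABD follows. It is an illustration under assumptions: `rank` and `select` are written as plain linear scans standing in for the constant-time structures the slide refers to, F is copied from the slide, and the names `children` and `subpath_search` are invented for the example. Rows are 1-based. Because the example marks the root with Slast = 0, the first child block of a label is located through F directly.

```python
S_LAST = "0001001100111111"
S_LABEL = "ABCBDaEDDbDabccb"
F = {"A": 2, "B": 5, "C": 9, "D": 12, "E": 16}  # from the slide


def rank(s, c, i):
    """Occurrences of c among the first i symbols of s."""
    return s[:i].count(c)


def select(s, c, k):
    """1-based position of the k-th occurrence of c in s (0 if k == 0)."""
    pos = -1
    for _ in range(k):
        pos = s.index(c, pos + 1)
    return pos + 1


def children(i):
    """Row range (first, last) holding the children of internal row i."""
    y = S_LABEL[i - 1]
    r = rank(S_LABEL, y, i)          # row i is the r-th node labelled y
    z = rank(S_LAST, "1", F[y] - 1)  # child blocks that end before F[y]
    first = F[y] if r == 1 else select(S_LAST, "1", z + r - 1) + 1
    last = select(S_LAST, "1", z + r)
    return first, last


def subpath_search(p):
    """Rows of the nodes labelled p[-1] that are reached by the path p."""
    first, last = 1, len(S_LABEL)
    for depth, y in enumerate(p):
        lo = rank(S_LABEL, y, first - 1)  # occurrences before the range
        hi = rank(S_LABEL, y, last)       # occurrences up to its end
        if lo == hi:
            return []
        if depth == len(p) - 1:
            return [select(S_LABEL, y, k) for k in range(lo + 1, hi + 1)]
        if y not in F:
            return []  # a leaf label cannot be followed by another label
        # Children of all matching nodes lie contiguously in the array.
        first = children(select(S_LABEL, y, lo + 1))[0]
        last = children(select(S_LABEL, y, hi))[1]
    return []


print(subpath_search("ABD"))
```

Each step of the loop uses a constant number of rank and select calls, which is where the O(|P|) bound comes from once those calls run in O(1).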

Page 12: Compression & Search

All search operations boil down to counting

How many 1s are there up to …

How many labels = X up to …

Slast = 111010010011111

Block 1: gzip(11101001), 1s before the block: 0
Block 2: gzip(0011111), 1s before the block: 5

Slabel = <biblio = = <book <book @id <author <title @id <author <title = = = =

Block 1: gzip(<biblio = = <book <book @id <author)
  counts before the block: <author:0 <biblio:0 <book:0 <title:0 @id:0 =:0

Block 2: gzip(<title @id <author <title = = = =)
  counts before the block: <author:1 <biblio:1 <book:2 <title:0 @id:1 =:2

C=5

C~30B
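
The counting idea on this slide (compress fixed-size blocks, keep per-block counters, decompress only the block that contains the queried position) can be sketched as follows. This is an illustration, not the implementation behind the slides: the class name `BlockedRank` is invented, `zlib` stands in for gzip, and a single block size of 8 is used for both arrays for simplicity.

```python
import zlib


class BlockedRank:
    """Rank queries over a symbol sequence stored as compressed blocks."""

    def __init__(self, symbols, block_size):
        self.block_size = block_size
        self.blocks = []  # compressed blocks
        self.before = []  # per block: symbol counts before the block starts
        running = {}
        for start in range(0, len(symbols), block_size):
            block = symbols[start:start + block_size]
            self.before.append(dict(running))
            self.blocks.append(zlib.compress("\n".join(block).encode()))
            for s in block:
                running[s] = running.get(s, 0) + 1

    def rank(self, symbol, i):
        """Occurrences of `symbol` among the first i symbols."""
        if i <= 0:
            return 0
        b = min((i - 1) // self.block_size, len(self.blocks) - 1)
        # Only the block containing position i is decompressed.
        block = zlib.decompress(self.blocks[b]).decode().split("\n")
        offset = i - b * self.block_size
        return self.before[b].get(symbol, 0) + block[:offset].count(symbol)


s_last = BlockedRank("111010010011111", block_size=8)
s_label = BlockedRank(
    ["<biblio", "=", "=", "<book", "<book", "@id", "<author", "<title",
     "@id", "<author", "<title", "=", "=", "=", "="],
    block_size=8,
)

print(s_last.before[1])            # counters stored for the second block
print(s_last.rank("1", 10))        # how many 1s up to position 10
print(s_label.rank("<title", 11))  # how many <title labels up to row 11
```

Smaller blocks make each query cheaper but leave the compressor less context and add more counters, so the block size trades query time against space.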

Page 13: Compression & Search

Slast

gzip(11101001) 0
gzip(0011111) 5

Slabel

gzip(<biblio = = <book <book @id <author)
  <author:0 <biblio:0 <book:0 <title:0 @id:0 =:0

gzip(<title @id <author <title = = = =)
  <author:1 <biblio:1 <book:2 <title:0 @id:1 =:2

Spcdata = §J. Austin §C. Bronte §Emma §Jane Eyre §1 §2

Partitioned by path:

=<author<book<biblio: J. Austin, C. Bronte
=<title<book<biblio: Emma, Jane Eyre
=@id<book<biblio: 12420

FM-Index

Page 14: Summary

You have seen

Challenges of XML
  Verbose
  Non-queriable compression
  Queriable, but bad compression

The XBW transform
  How to construct and invert it in O(N) time
  How to navigate and search in it in O(1) and O(|P|) time
  Partitioning of the XBW transform for compression

FIN