Dictionary Compression

Reducing Symbolic Redundancy in RDF

Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto

23rd August 2017

3rd KEYSTONE Training School: Keyword search in Big Linked Data

Image: Alcázar (Segovia, Spain)

Agenda

Introduction

What is Dictionary Compression?

Compressed String Dictionaries

Some Experimental Numbers

RDF Dictionaries

Foundations

RDF Dictionary-based Compression

Dictionaries in Practice

Conclusions

Introduction

• What is Dictionary Compression?

• Compressed String Dictionaries

What is Dictionary Compression?

Dictionary compression is a simple but effective technique which replaces the occurrences of (long, variable-length) terms by (short) identifiers which are more compact to encode and easier and more efficient to handle.

Implementing this class of compression requires an efficient data structure configuration (dictionary) which provides, at least, two basic mapping operations:

locate(t) returns i if the term t is the i-th element in the dictionary.

extract(i) returns the i-th term (t) in the dictionary.

The dictionary organizes all different terms (vocabulary) in the dataset.

Dictionary compression has been traditionally applied for natural language processing purposes (e.g. information retrieval).

la tarara sí
la tarara no
la tarara niña
que la he visto yo

Dictionary (data structure):

ID  String
1   he
2   la
3   niña
4   no
5   que
6   sí
7   tarara
8   visto
9   yo
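The two basic operations on this table can be sketched as follows. This is a minimal illustration, not the implementation discussed in the talk: the class name `SortedDictionary` is hypothetical, the vocabulary is kept in a plain sorted list, and returning 0 for an unknown term is an assumption made here.

```python
from bisect import bisect_left


class SortedDictionary:
    """Toy string dictionary: IDs are 1-based positions in the sorted vocabulary."""

    def __init__(self, terms):
        # Keep each distinct term once, in lexicographic (code point) order.
        self.terms = sorted(set(terms))

    def locate(self, term):
        """Return the ID of `term`, or 0 if it is absent (assumed convention)."""
        pos = bisect_left(self.terms, term)
        if pos < len(self.terms) and self.terms[pos] == term:
            return pos + 1
        return 0

    def extract(self, i):
        """Return the i-th term (1-based)."""
        if not 1 <= i <= len(self.terms):
            raise IndexError(f"ID {i} out of range")
        return self.terms[i - 1]


words = "la tarara sí la tarara no la tarara niña que la he visto yo".split()
dictionary = SortedDictionary(words)
tarara_id = dictionary.locate("tarara")
second_term = dictionary.extract(2)
```

Binary search over a sorted array is only one possible design; it keeps the sketch short and makes IDs follow lexicographic order, as in the table.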

Each occurrence of a word is then replaced by its ID:

2 7 6
2 7 4
2 7 3
5 2 1 8 9

The original text takes 59 bytes (59 chars × 1 byte/char).

The dictionary-compressed text takes 7 bytes (14 IDs × ⌈log2(9)⌉ = 4 bits/ID, i.e. 56 bits), plus the cost of serializing the data structure.
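The arithmetic can be checked with a short sketch. It assumes one byte per character (Latin-1) and counts the three line breaks of the lyric as characters, which is how the 59-byte figure is reached; the variable names are illustrative only.

```python
import math

text = "la tarara sí\nla tarara no\nla tarara niña\nque la he visto yo"

# Build the vocabulary and replace every word by its 1-based ID.
vocabulary = sorted(set(text.split()))
ids = [vocabulary.index(word) + 1 for word in text.split()]

# Fixed-width encoding: every ID uses ceil(log2(|vocabulary|)) bits.
bits_per_id = math.ceil(math.log2(len(vocabulary)))
compressed_bits = len(ids) * bits_per_id
compressed_bytes = math.ceil(compressed_bits / 8)

# Assumption: 1 byte per character, line breaks included.
original_bytes = len(text.encode("latin-1"))
```

The 7 bytes cover only the ID sequence; the dictionary itself still has to be stored, which is why its size matters in the rest of the talk.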


Dictionary Compression is used for optimizing applications of…

Natural Language Processing (e.g. Information Retrieval or Machine Translation)

Web Graph Management.

Triplestores (e.g. RDF3X) and other semantic tools (e.g. HDT)

NoSQL databases.

Bioinformatics search engines.

Internet Routing.

Geographic Information Systems.

….


Data Structures

Dictionaries have been traditionally implemented using well-known data structures:

Hash tables or tries for resolving locate queries.

Arrays for resolving extract queries.

These solutions are efficient, but they require too much memory to be used in practical scenarios.

The Problem…

Datasets are increasingly bigger and more varied:

Vocabularies are also larger and comprise more heterogeneous terms.

The dictionary size is a bottleneck for applications running under main-memory restrictions.

The resulting dictionary data structures are very large and do not scale for efficient in-memory management:

Dictionary management is becoming a scalability issue by itself, and it must be optimized for Big Data scenarios.

Preconditions:

Dictionaries are static (they are rebuilt from scratch when the vocabulary changes).

Dictionaries are cached in main memory.

Compressed String Dictionaries

Compressed String Dictionaries are a particular class of compact data structures optimized for dealing with string vocabularies from different domains.

The Solutions…

Innovative compressed string dictionaries are proposed for managing big vocabularies in main memory:

Traditional dictionaries are revisited to optimize their memory footprint.

Existing compact data structures are tuned to perform as dictionaries.

New compact data structures have been designed as compressed string dictionaries.

All these techniques ensure efficient in-memory query resolution:

locate and extract are resolved at microsecond level.

New interesting queries are also supported by these techniques:

Prefix-based queries retrieve IDs / terms matching a given prefix.

Substring-based queries retrieve IDs / terms matching a given substring.

Queries

ID  String
1   he
2   la
3   niña
4   no
5   que
6   sí
7   tarara
8   visto
9   yo

locate(“tarara”) = 7

extract(2) = “la”

locatePrefix(“n”) = {3,4}

extractPrefix(“n”) = {“niña”,“no”}

locateSubstring(“a”) = {2,3,7}

extractSubstring(“a”) = {“la”,“niña”,“tarara”}
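The four extended queries can be reproduced with a naive scan over the example vocabulary. This sketch only illustrates the semantics of each query; it makes no attempt at the efficiency of the compressed structures, and the snake_case function names are illustrative stand-ins for the names on the slide.

```python
TERMS = ["he", "la", "niña", "no", "que", "sí", "tarara", "visto", "yo"]


def locate_prefix(prefix):
    """IDs (1-based) of all terms starting with `prefix`."""
    return [i for i, t in enumerate(TERMS, start=1) if t.startswith(prefix)]


def extract_prefix(prefix):
    """All terms starting with `prefix`."""
    return [t for t in TERMS if t.startswith(prefix)]


def locate_substring(pattern):
    """IDs (1-based) of all terms containing `pattern`."""
    return [i for i, t in enumerate(TERMS, start=1) if pattern in t]


def extract_substring(pattern):
    """All terms containing `pattern`."""
    return [t for t in TERMS if pattern in t]


n_ids = locate_prefix("n")
n_terms = extract_prefix("n")
a_ids = locate_substring("a")
a_terms = extract_substring("a")
```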

Techniques for Compressing Dictionaries

Compressed Hash:

The hash table is simulated using bitmaps.

Strings are stored in compressed form (Huffman/Re-Pair).

locate / extract operations are implemented using rank / select.

Differential Front-Coding Compression:

Front-Coding exploits the fact that consecutive strings (in the vocabulary) are likely to share a common prefix.

Plain Front-Coding dictionaries use byte-oriented compression.

Compressed Front-Coding dictionaries combine Hu-Tucker and Huffman/Re-Pair compression.

Primitive and prefix-based operations are implemented using binary search and efficient sequential decoding.

Self-Indexes:

The FM-Index is adapted to perform as a dictionary, and the XBW introduces a self-indexed trie.

All operations are implemented exploiting the BWT features.
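As a rough illustration of the Compressed Hash idea, the sketch below marks the occupied cells of a sparse hash table in a bitmap and stores only the occupied cells in a compact array; rank on the bitmap turns a cell position into an ID. It is one plausible reading of the description above, not the authors' structure: the class name, the CRC32 hash, linear probing, the load factor, the linear-time rank/select and the uncompressed strings are all assumptions made for brevity.

```python
import zlib


class BitmapHashDictionary:
    """Sparse hash table simulated by a bitmap plus a compact array of strings."""

    def __init__(self, terms, load_factor=0.5):
        terms = list(terms)  # assumed to be distinct
        self.m = max(1, int(len(terms) / load_factor))
        cells = [None] * self.m
        for term in terms:
            pos = self._hash(term)
            while cells[pos] is not None:  # linear probing
                pos = (pos + 1) % self.m
            cells[pos] = term
        # bitmap[pos] == 1 iff the cell is occupied; empty cells cost one bit.
        self.bitmap = [1 if c is not None else 0 for c in cells]
        self.strings = [c for c in cells if c is not None]

    def _hash(self, term):
        return zlib.crc32(term.encode("utf-8")) % self.m

    def rank1(self, pos):
        """Number of 1s in bitmap[0..pos], inclusive (naive, linear time)."""
        return sum(self.bitmap[: pos + 1])

    def select1(self, i):
        """Cell position of the i-th 1 (1-based), i.e. the cell of ID i."""
        count = 0
        for pos, bit in enumerate(self.bitmap):
            count += bit
            if count == i:
                return pos
        raise IndexError(i)

    def locate(self, term):
        """ID of `term` (its rank among occupied cells), or 0 if absent."""
        pos = self._hash(term)
        for _ in range(self.m):
            if not self.bitmap[pos]:
                return 0
            i = self.rank1(pos)
            if self.strings[i - 1] == term:
                return i
            pos = (pos + 1) % self.m
        return 0

    def extract(self, i):
        """Term with ID i (1-based)."""
        return self.strings[i - 1]


vocabulary = ["he", "la", "niña", "no", "que", "sí", "tarara", "visto", "yo"]
hashed = BitmapHashDictionary(vocabulary)
```

Note that IDs here follow cell order rather than lexicographic order, so this layout does not support prefix queries directly.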


Some Experimental Numbers

Compressed String Dictionaries answer queries at the level of microseconds, while compressing vocabularies up to 20 times.

Experimental Setup

We analyze compression effectiveness and retrieval speed:

locate, extract.

Prefix-based operations (URIs).

Substring-based operations (Literals).

In practice, extract is the most important query:

It is used many times as results are retrieved from the compressed dataset.

26,948,638 URIs from Uniprot:

Average length: 51.04 chars per URI.

Highly repetitive.

27,592,013 Literals from DBpedia:

Average length: 60.45 chars per Literal.

Locate / Extract Performance (URIs)

PFC is the fastest choice for locate/extract…

locate ≈ 1.6 μs/string.

extract ≈ 0.3-0.6 μs/ID.

…but it requires more space:

≈ 9-19 % of the original space.

HTFC (compressed Front-Coding) reports the most balanced space/time tradeoffs:

locate ≈ 2.2-3 μs/string.

extract ≈ 0.7-1.6 μs/ID.

≈ 5-13 % of the original space.

Locate / Extract Performance (Literals)

HTFC reports the best compression ratios, but its performance is less competitive:

locate ≈ 2-2.5 μs/string.

extract > 2.5 μs/ID.

≈ 12 % of the original space.

HashDAC-rp (compressed Hashing) reports the best tradeoffs:

locate ≈ 1.5 μs/string.

extract ≈ 1 μs/ID.

≈ 15 % of the original space.

Domain Entity Retrieval (URIs)

PFC is the best choice for prefix-based operations, although it uses more space than the other approaches.

Full-Text Search (Literals)

Self-index based dictionaries are the only ones providing full-text search:

FMI is the fastest solution (≈ 1 μs/result) when it uses more space than the original vocabulary.

XBW is the better choice for this scenario:

≈ 5-6 μs/result.

≈ 40 % of the original space.

RDF Dictionaries

• Foundations

• RDF Dictionary-based Compression

• Dictionaries in Practice

Foundations

RDF Dictionaries are a core component of any compression or indexing approach designed for semantic datasets.

Basics

An RDF dictionary comprises all different terms used in the dataset:

Terms are drawn from 3 disjoint vocabularies: URIs, Literals, and blank nodes.

URIs are medium-size strings which share long prefixes:

http://example.org/property/age
http://example.org/property/location
http://example.org/person/abe-simpson
http://example.org/person/bart-simpson

Literals tend to be large-size strings (with no predictable features), or numbers, or dates:

“742 Evergreen Terrace”
“Bart Simpson”
“Homer Simpson”
10

Blank node serialization is not standardized:

“Auto-incremental” strings are usually used → similar features to URIs.

Dictionary Queries

Primitive operations are used extensively:

locate operations are common when the dictionary is used for lookup purposes (e.g. RDF stores, semantic search engines, etc.).

extract operations are common when the dictionary is used for data access purposes (e.g. decompression, result retrieval, etc.).

Prefix-based operations are most relevant for URIs:

Finding all URIs in a given domain: e.g. retrieve all URIs from http://example.org/person/.

Substring-based operations are an open challenge for Literals:

REGEX SPARQL queries: e.g. look for all literals containing the substring “Simpson”.
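The difference between the two query types can be sketched on the example terms. In a lexicographically sorted vocabulary, all URIs sharing a prefix are contiguous, so a binary search finds where the run starts; a substring query has no such locality and, without a self-index, falls back to a scan. Function names are illustrative, and the small term lists are taken from the examples in this section.

```python
from bisect import bisect_left


def locate_prefix(sorted_terms, prefix):
    """1-based IDs of all terms starting with `prefix` (terms must be sorted)."""
    first = bisect_left(sorted_terms, prefix)
    last = first
    # Matching terms form one contiguous run starting at `first`.
    while last < len(sorted_terms) and sorted_terms[last].startswith(prefix):
        last += 1
    return list(range(first + 1, last + 1))


def extract_substring(terms, pattern):
    """All terms containing `pattern` (plain scan over the whole vocabulary)."""
    return [t for t in terms if pattern in t]


URIS = sorted([
    "http://example.org/property/age",
    "http://example.org/property/location",
    "http://example.org/person/abe-simpson",
    "http://example.org/person/bart-simpson",
])
LITERALS = ["742 Evergreen Terrace", "Bart Simpson", "Homer Simpson", "10"]

person_ids = locate_prefix(URIS, "http://example.org/person/")
simpsons = extract_substring(LITERALS, "Simpson")
```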

Decisions

URIs and Literals should be compressed and managed independently…

Their structure is very different and they are queried in a different way.

…but they should also be organized according to their role in the dataset:

Literals always play an object role.

URIs can be used as subject, predicate, and/or object.

RDF Dictionary-based Compression

RDF Dictionary-based compression uses several dictionaries to optimize the compression of URIs and Literals.

Dictionary Organization

A role-based partition is first performed:

Subjects are encoded in the range [1,|S|].

Predicates are encoded in the range [1,|P|].

Objects are encoded in the range [1,|O|].

URIs playing as both subject and object are encoded once:

IDs in [1,|SO|] encode terms playing as subjects and objects.

The remaining subjects are encoded in [|SO|+1,|S|].

Objects are encoded using two dictionaries:

IDs in [|SO|+1,|Ox|] encode URIs which only perform as objects.

IDs in [|Ox|+1,|O|] encode Literals.

Predicates are encoded in [1,|P|].
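The ID ranges can be read as a small decision procedure: given an ID and the role it appears in, the section boundaries tell which dictionary holds the term. The sketch below is illustrative only; the function names and the sizes used in the usage lines (|SO| = 4, |S| = 5, |Ox| = 6, |O| = 12) are made-up values, not figures from the talk.

```python
def subject_section(i, n_so, n_s):
    """Dictionary section of subject ID i, with n_so = |SO| and n_s = |S|."""
    if 1 <= i <= n_so:
        return "SO"
    if n_so < i <= n_s:
        return "S"
    raise ValueError(f"subject ID {i} outside [1, {n_s}]")


def object_section(i, n_so, n_ox, n_o):
    """Dictionary section of object ID i, with n_ox = |Ox| and n_o = |O|."""
    if 1 <= i <= n_so:
        return "SO"
    if n_so < i <= n_ox:
        return "O-URI"
    if n_ox < i <= n_o:
        return "Literal"
    raise ValueError(f"object ID {i} outside [1, {n_o}]")


# Hypothetical sizes: 4 shared terms, 1 subject-only URI,
# 2 object-only URIs and 6 literals.
shared = subject_section(3, n_so=4, n_s=5)
subject_only = subject_section(5, n_so=4, n_s=5)
object_uri = object_section(5, n_so=4, n_ox=6, n_o=12)
literal = object_section(7, n_so=4, n_ox=6, n_o=12)
```

Because shared terms get the same ID in both roles, an ID in [1,|SO|] means the same term whether it is read as a subject or as an object, which is what makes subject-object joins a matter of comparing integers.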

RDF Dictionaries in Practice

[Figure: RDF graph of the running example. The nodes person:abe-simpson, person:homer-simpson, person:marge-simpson, person:bart-simpson and location:springfield are connected to each other and to their literal values through the edges property:name, property:age, property:address, property:location, property:father and property:mother.]

Looking for Subject-Object (SO) terms…

<http://example.org/location/springfield> <http://example.org/property/name> "Springfield" .
<http://example.org/person/abe-simpson> <http://example.org/property/age> 83 .
<http://example.org/person/abe-simpson> <http://example.org/property/name> "Abe Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/age> 10 .
<http://example.org/person/bart-simpson> <http://example.org/property/name> "Bart Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/father> <http://example.org/person/homer-simpson> .
<http://example.org/person/bart-simpson> <http://example.org/property/mother> <http://example.org/person/marge-simpson> .
<http://example.org/person/homer-simpson> <http://example.org/property/address> "742 Evergreen Terrace" .
<http://example.org/person/homer-simpson> <http://example.org/property/name> "Homer Simpson" .
<http://example.org/person/homer-simpson> <http://example.org/property/location> <http://example.org/location/springfield> .
<http://example.org/person/homer-simpson> <http://example.org/property/father> <http://example.org/person/abe-simpson> .
<http://example.org/person/marge-simpson> <http://example.org/property/address> "742 Evergreen Terrace" .
<http://example.org/person/marge-simpson> <http://example.org/property/name> "Marge Simpson" .
<http://example.org/person/marge-simpson> <http://example.org/property/location> <http://example.org/location/springfield> .

Building SO Dictionary & Compressing terms

SO dictionary:

ID  RDF Term
1   http://example.org/location/springfield
2   http://example.org/person/abe-simpson
3   http://example.org/person/homer-simpson
4   http://example.org/person/marge-simpson

1 <http://example.org/property/name> "Springfield" .
2 <http://example.org/property/age> 83 .
2 <http://example.org/property/name> "Abe Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/age> 10 .
<http://example.org/person/bart-simpson> <http://example.org/property/name> "Bart Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/father> 3 .
<http://example.org/person/bart-simpson> <http://example.org/property/mother> 4 .
3 <http://example.org/property/address> "742 Evergreen Terrace" .
3 <http://example.org/property/name> "Homer Simpson" .
3 <http://example.org/property/location> 1 .
3 <http://example.org/property/father> 2 .
4 <http://example.org/property/address> "742 Evergreen Terrace" .
4 <http://example.org/property/name> "Marge Simpson" .
4 <http://example.org/property/location> 1 .

Looking for Subject (S) terms…

The only subject that never appears as an object is http://example.org/person/bart-simpson.

Building S Dictionary & Compressing terms

S dictionary:

ID  RDF Term
5   http://example.org/person/bart-simpson

1 <http://example.org/property/name> "Springfield" .
2 <http://example.org/property/age> 83 .
2 <http://example.org/property/name> "Abe Simpson" .
5 <http://example.org/property/age> 10 .
5 <http://example.org/property/name> "Bart Simpson" .
5 <http://example.org/property/father> 3 .
5 <http://example.org/property/mother> 4 .
3 <http://example.org/property/address> "742 Evergreen Terrace" .
3 <http://example.org/property/name> "Homer Simpson" .
3 <http://example.org/property/location> 1 .
3 <http://example.org/property/father> 2 .
4 <http://example.org/property/address> "742 Evergreen Terrace" .
4 <http://example.org/property/name> "Marge Simpson" .
4 <http://example.org/property/location> 1 .

Looking for Object (O) terms…

The remaining objects are all Literals; no URI plays an object-only role in this example.

Building O Dictionary & Compressing Terms

O dictionary:

ID  RDF Term
5   "742 Evergreen Terrace"
6   "Abe Simpson"
7   "Bart Simpson"
8   "Homer Simpson"
9   "Marge Simpson"
10  "Springfield"
11  10
12  83

1 <http://example.org/property/name> 10 .
2 <http://example.org/property/age> 12 .
2 <http://example.org/property/name> 6 .
5 <http://example.org/property/age> 11 .
5 <http://example.org/property/name> 7 .
5 <http://example.org/property/father> 3 .
5 <http://example.org/property/mother> 4 .
3 <http://example.org/property/address> 5 .
3 <http://example.org/property/name> 8 .
3 <http://example.org/property/location> 1 .
3 <http://example.org/property/father> 2 .
4 <http://example.org/property/address> 5 .
4 <http://example.org/property/name> 9 .
4 <http://example.org/property/location> 1 .

Looking for Predicate (P) terms…

Predicates are now the only terms still written as URIs.

Building P Dictionary & Compressing terms

P dictionary:

ID  RDF Term
1   http://example.org/property/address
2   http://example.org/property/age
3   http://example.org/property/father
4   http://example.org/property/location
5   http://example.org/property/mother
6   http://example.org/property/name

1 6 10 .
2 2 12 .
2 6 6 .
5 2 11 .
5 6 7 .
5 3 3 .
5 5 4 .
3 1 5 .
3 6 8 .
3 4 1 .
3 3 2 .
4 1 5 .
4 6 9 .
4 4 1 .
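The whole walk-through can be reproduced in a few lines. The sketch below is an illustration under simple assumptions rather than the procedure of any particular tool: terms are plain strings, a term counts as a URI if it starts with `http://`, literals keep their quotes and numbers are bare (which makes plain string sorting place quoted strings before numbers, as in the O dictionary), and all function names are hypothetical.

```python
EX = "http://example.org/"

# (subject, predicate, object) in the order used in the example.
TRIPLES = [
    (EX + "location/springfield", EX + "property/name", '"Springfield"'),
    (EX + "person/abe-simpson", EX + "property/age", "83"),
    (EX + "person/abe-simpson", EX + "property/name", '"Abe Simpson"'),
    (EX + "person/bart-simpson", EX + "property/age", "10"),
    (EX + "person/bart-simpson", EX + "property/name", '"Bart Simpson"'),
    (EX + "person/bart-simpson", EX + "property/father", EX + "person/homer-simpson"),
    (EX + "person/bart-simpson", EX + "property/mother", EX + "person/marge-simpson"),
    (EX + "person/homer-simpson", EX + "property/address", '"742 Evergreen Terrace"'),
    (EX + "person/homer-simpson", EX + "property/name", '"Homer Simpson"'),
    (EX + "person/homer-simpson", EX + "property/location", EX + "location/springfield"),
    (EX + "person/homer-simpson", EX + "property/father", EX + "person/abe-simpson"),
    (EX + "person/marge-simpson", EX + "property/address", '"742 Evergreen Terrace"'),
    (EX + "person/marge-simpson", EX + "property/name", '"Marge Simpson"'),
    (EX + "person/marge-simpson", EX + "property/location", EX + "location/springfield"),
]


def is_uri(term):
    # Assumption for this sketch: every URI starts with "http://".
    return term.startswith("http://")


def build_dictionaries(triples):
    """Return (subject IDs, predicate IDs, object IDs) as term -> ID maps."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}

    shared = sorted(subjects & objects)                    # SO section
    only_s = sorted(subjects - objects)                    # S section
    only_o = objects - subjects
    only_o_uris = sorted(t for t in only_o if is_uri(t))   # object-only URIs
    literals = sorted(t for t in only_o if not is_uri(t))  # Literals

    s_ids = {t: i for i, t in enumerate(shared + only_s, start=1)}
    o_ids = {t: i for i, t in enumerate(shared + only_o_uris + literals, start=1)}
    p_ids = {t: i for i, t in enumerate(sorted(predicates), start=1)}
    return s_ids, p_ids, o_ids


def encode(triples):
    """Replace every term by its ID in the dictionary of its role."""
    s_ids, p_ids, o_ids = build_dictionaries(triples)
    return [(s_ids[s], p_ids[p], o_ids[o]) for s, p, o in triples]


ENCODED = encode(TRIPLES)
```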

Page 40: Dictionary Compression - Software engineering · 2017. 8. 27. · Dictionary compression is a simple but effective technique which replaces the occurrences of (long, variable-length)

Dictionary Compression

ID RDF Term

1 http://example.org/property/address

2 http://example.org/property/age

3 http://example.org/property/father

4 http://example.org/property/location

5 http://example.org/property/mother

6 http://example.org/property/name

P

Dictionaries are now compressed.

Let’s see how the predicate dictionary is compressed using Plain Front Coding.

1. Terms are concatenated in lexicographic order:

http://example.org/property/address$http://example.org/property/age$http://example.org/property/father$http://example.org/property/location$http://example.org/property/mother$http://example.org/property/name$

2. Terms are then organized into buckets of b strings (e.g. b=3):

B1 = http://example.org/property/address$http://example.org/property/age$http://example.org/property/father$

B2 = http://example.org/property/location$http://example.org/property/mother$http://example.org/property/name$

3. Each bucket is independently compressed:

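The first term is preserved “as is”. Each subsequent string is differentially encoded with respect to its predecessor: it is replaced by the length of the prefix it shares with the previous string, followed by the remaining suffix.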


Dictionary Compression

4. The compressed result is stored in a simple byte array (Tpfc):

Prefix-numbers are encoded using VByte and suffixes are encoded byte-to-byte (ASCII).

5. An additional integer array (ptrs) is used to store the position of the first byte of each bucket.

B1 = http://example.org/property/address$http://example.org/property/age$http://example.org/property/father$

B2 = http://example.org/property/location$http://example.org/property/mother$http://example.org/property/name$

Bucket 1: http://example.org/property/address$ 29 ge$ 28 father$

Bucket 2: http://example.org/property/location$ 28 mother$ 28 name$

ptrs = 0 48
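Steps 1 to 5 can be sketched as follows. This is an illustrative sketch rather than the libCSD code: the function names are hypothetical, `$` is used as the one-byte string terminator, positions in `ptrs` are 0-based, and the VByte variant shown (7 data bits per byte, high bit set on the last byte) is one common convention that may differ from the one used by the authors.

```python
def vbyte(n):
    """Encode a non-negative integer; the high bit flags the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def shared_prefix_len(a, b):
    """Length of the longest common prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pfc_encode(sorted_terms, bucket_size):
    """Return (Tpfc, ptrs) for a lexicographically sorted list of terms."""
    tpfc, ptrs, prev = bytearray(), [], b""
    for i, term in enumerate(sorted_terms):
        raw = term.encode("ascii")
        if i % bucket_size == 0:
            ptrs.append(len(tpfc))          # first byte of a new bucket
            tpfc += raw + b"$"              # header term stored "as is"
        else:
            k = shared_prefix_len(prev, raw)
            tpfc += vbyte(k) + raw[k:] + b"$"
        prev = raw
    return bytes(tpfc), ptrs


def pfc_extract(tpfc, ptrs, bucket_size, term_id):
    """Decode the term with the given 1-based ID by scanning its bucket."""
    bucket, offset = divmod(term_id - 1, bucket_size)
    pos = ptrs[bucket]
    end = tpfc.index(b"$", pos)
    term = tpfc[pos:end]
    pos = end + 1
    for _ in range(offset):
        k, shift = 0, 0
        while True:                          # read one VByte number
            byte = tpfc[pos]
            pos += 1
            k |= (byte & 0x7F) << shift
            shift += 7
            if byte & 0x80:
                break
        end = tpfc.index(b"$", pos)
        term = term[:k] + tpfc[pos:end]      # shared prefix + stored suffix
        pos = end + 1
    return term.decode("ascii")


P_TERMS = ["http://example.org/property/" + name for name in
           ("address", "age", "father", "location", "mother", "name")]
TPFC, PTRS = pfc_encode(P_TERMS, bucket_size=3)
```

A larger bucket size shrinks `ptrs` and removes more full-length header terms, at the price of a longer sequential scan inside each bucket when extracting a term.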


RDF dictionaries are used for SPARQL resolution, but they also allow other interesting queries to be resolved efficiently in the Linked Data workflow.

Dictionaries in Practice


Normative SPARQL


1 6 10

2 2 12

2 6 6

5 2 11

5 6 7

5 3 3

5 5 4

3 1 5

3 6 8

3 4 1

3 3 2

4 1 5

4 6 9

4 4 1

Retrieve all people living in Springfield.

@prefix …

SELECT ?Who

WHERE { ?Who property:location location:springfield }

P.locate(http://example.org/property/location) → 4

SO.locate(http://example.org/location/springfield) → 1

Looking for (?Who 4 1) → ?Who = 3, ?Who = 4

SO.extract(3) → http://example.org/person/homer-simpson

SO.extract(4) → http://example.org/person/marge-simpson
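The locate and extract round trip above can be mimicked with ordinary dicts. The sketch below is only an assumption-laden illustration: `locate`, `extract` and `who_lives_in` are hypothetical helpers, the triple pattern is matched by a linear scan rather than by a compressed triples index, and the dictionaries are the uncompressed tables from the running example.

```python
SO = {
    1: "http://example.org/location/springfield",
    2: "http://example.org/person/abe-simpson",
    3: "http://example.org/person/homer-simpson",
    4: "http://example.org/person/marge-simpson",
}
S = {5: "http://example.org/person/bart-simpson"}
P = {
    1: "http://example.org/property/address",
    2: "http://example.org/property/age",
    3: "http://example.org/property/father",
    4: "http://example.org/property/location",
    5: "http://example.org/property/mother",
    6: "http://example.org/property/name",
}
TRIPLES = [
    (1, 6, 10), (2, 2, 12), (2, 6, 6), (5, 2, 11), (5, 6, 7),
    (5, 3, 3), (5, 5, 4), (3, 1, 5), (3, 6, 8), (3, 4, 1),
    (3, 3, 2), (4, 1, 5), (4, 6, 9), (4, 4, 1),
]


def locate(dictionary, term):
    """Term -> ID, or None when the term is not in the dictionary."""
    for term_id, value in dictionary.items():
        if value == term:
            return term_id
    return None


def extract(dictionary, term_id):
    """ID -> term."""
    return dictionary[term_id]


def who_lives_in(place):
    """Resolve the pattern (?Who, property:location, place)."""
    p = locate(P, "http://example.org/property/location")
    o = locate(SO, place)
    if p is None or o is None:
        return []                      # unknown term: no triple can match
    subjects = [s for s, pred, obj in TRIPLES if pred == p and obj == o]
    # Subject IDs up to |SO| live in SO, the rest in S.
    return [extract(SO if s in SO else S, s) for s in subjects]


RESIDENTS = who_lives_in("http://example.org/location/springfield")
```

Only the bound terms of the query and the final results touch the dictionaries; the pattern itself is matched purely on integer IDs.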


Domain Entity Retrieval


1 6 10

2 2 12

2 6 6

5 2 11

5 6 7

5 3 3

5 5 4

3 1 5

3 6 8

3 4 1

3 3 2

4 1 5

4 6 9

4 4 1

Retrieve all people in our domain: http://example.org/person/

SO.extractPrefix(http://example.org/person/) → http://example.org/person/abe-simpson, http://example.org/person/homer-simpson, http://example.org/person/marge-simpson

S.extractPrefix(http://example.org/person/) → http://example.org/person/bart-simpson

O.extractPrefix(http://example.org/person/) → (no results)
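Because each dictionary keeps its terms in lexicographic order, a prefix query boils down to finding the first candidate and reading forward while the prefix still matches. The following sketch assumes uncompressed sorted lists and a hypothetical `extract_prefix` helper; a compressed dictionary would instead binary-search the bucket headers and decode only the relevant buckets.

```python
from bisect import bisect_left

SO_TERMS = [
    "http://example.org/location/springfield",
    "http://example.org/person/abe-simpson",
    "http://example.org/person/homer-simpson",
    "http://example.org/person/marge-simpson",
]
S_TERMS = ["http://example.org/person/bart-simpson"]
O_TERMS = [
    '"742 Evergreen Terrace"', '"Abe Simpson"', '"Bart Simpson"',
    '"Homer Simpson"', '"Marge Simpson"', '"Springfield"', "10", "83",
]


def extract_prefix(sorted_terms, prefix):
    """Return every term that starts with prefix, in dictionary order."""
    start = bisect_left(sorted_terms, prefix)   # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break                               # sorted: no later match
        matches.append(term)
    return matches


DOMAIN = "http://example.org/person/"
PEOPLE = (extract_prefix(SO_TERMS, DOMAIN)
          + extract_prefix(S_TERMS, DOMAIN)
          + extract_prefix(O_TERMS, DOMAIN))
```

The query has to be issued against SO, S and O separately because a URI may sit in any of the three partitions depending on the roles it plays.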


Full-Text Search


1 6 10

2 2 12

2 6 6

5 2 11

5 6 7

5 3 3

5 5 4

3 1 5

3 6 8

3 4 1

3 3 2

4 1 5

4 6 9

4 4 1

Retrieve all terms which include “Simpson”:

O.extractSubstring("Simpson")

"Abe Simpson"

"Bart Simpson"

"Homer Simpson"

"Marge Simpson"
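The semantics of the substring query can be stated with a naive linear scan. This is only a reference sketch of what the operation returns, under the assumption that object-only IDs start at 5 as in the running example; `extract_substring` is a hypothetical name, and a compressed dictionary would not answer the query by scanning every term.

```python
O_TERMS = [
    '"742 Evergreen Terrace"', '"Abe Simpson"', '"Bart Simpson"',
    '"Homer Simpson"', '"Marge Simpson"', '"Springfield"', "10", "83",
]


def extract_substring(terms, pattern, first_id):
    """Return (ID, term) pairs for every term containing pattern."""
    return [(term_id, term)
            for term_id, term in enumerate(terms, start=first_id)
            if pattern in term]


# Object-only terms are numbered from |SO| + 1 = 5 in the example.
SIMPSONS = extract_substring(O_TERMS, "Simpson", first_id=5)
```

Returning IDs alongside the terms lets the matches be plugged directly into triple patterns over the ID-based triples.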


Conclusions

Dictionary Compression


RDF dictionaries are highly compressible:

URIs are very redundant and Literals also show non-negligible symbolic redundancy.

This redundancy can be detected and removed within specific data structures for dictionaries:

Structures for URIs use up to 20 times less space than the original dictionaries.

For Literals, the corresponding structures use 6 to 8 times less space than the original dictionaries.

All these structures report data retrieval performance at microsecond level:

This functionality includes both simple and advanced operations.


Compressed string dictionaries are available in the libCSD C++ library (beta): https://github.com/migumar2/libCSD

We are working on a new release including more techniques and more search functionality (e.g. top-K).


Triples Compression

Let the lecture continue…

Image:ROYAL MINT & ALCÁZAR (SEGOVIA, SPAIN)