Dictionary Compression
Reducing Symbolic Redundancy in RDF
Antonio Fariña, Javier D. Fernández and
Miguel A. Martinez-Prieto
23RD AUGUST 2017
3rd KEYSTONE Training School: Keyword search in Big Linked Data
Image:ALCÁZAR (SEGOVIA, SPAIN)
Agenda
Introduction
• What is Dictionary Compression?
• Compressed String Dictionaries
• Some Experimental Numbers
RDF Dictionaries
• Foundations
• RDF Dictionary-based Compression
• Dictionaries in Practice
Conclusions
• What is Dictionary Compression?
• Compressed String Dictionaries
Introduction
Dictionary Compression
Dictionary compression is a simple but effective technique which replaces the occurrences of terms by identifiers which are more compact to encode and easier and more efficient to handle.
What is Dictionary Compression?
DICTIONARY COMPRESSIONPAGE 4
Dictionary compression is a simple but effective technique which replaces the occurrences of (long, variable-length) terms by (short) identifiers which are more compact to encode and easier and more efficient to handle.
Implementing this class of compression requires an efficient data structure configuration (dictionary) which provides, at least, two basic mapping operations:
locate(t) returns i if the term t is the i-th element in the dictionary.
extract(i) returns the i-th term (t) in the dictionary.
The dictionary organizes all different terms (vocabulary) in the dataset.
Dictionary compression has been traditionally applied for natural language processing purposes (e.g. information retrieval).
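The two mapping operations can be sketched with a plain sorted array. This is only an illustrative baseline, not one of the compressed structures discussed later; the class name `ArrayDictionary` and the convention of returning 0 for a missing term are assumptions made for this example.

```python
from bisect import bisect_left


class ArrayDictionary:
    """Uncompressed baseline: a sorted array of the distinct terms."""

    def __init__(self, terms):
        # The vocabulary holds each different term once, in lexicographic order.
        self.terms = sorted(set(terms))

    def locate(self, term):
        """Return the 1-based ID of term, or 0 if it is absent (assumed convention)."""
        pos = bisect_left(self.terms, term)
        if pos < len(self.terms) and self.terms[pos] == term:
            return pos + 1
        return 0

    def extract(self, i):
        """Return the i-th term (1-based), or None if i is out of range."""
        if 1 <= i <= len(self.terms):
            return self.terms[i - 1]
        return None


words = "la tarara sí la tarara no la tarara niña que la he visto yo".split()
dictionary = ArrayDictionary(words)
print(dictionary.locate("tarara"), dictionary.extract(2))
```

Binary search keeps locate logarithmic in the vocabulary size, while extract is a direct array access.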
Dictionary Compression
…
la tarara sí
la tarara no
la tarara niña
que la he visto yo
…
ID String
1 he
2 la
3 niña
4 no
5 que
6 sí
7 tarara
8 visto
9 yo
data structure
…
2 7 6
2 7 4
2 7 3
5 2 1 8 9
…
Dictionary Compression
The original text takes 59 bytes: 59 chars * 1 byte/char.
The dictionary-compressed text takes 7 bytes: 14 IDs * ⌈log2(9)⌉ bits/ID = 14 * 4 bits = 56 bits,
plus the cost of serializing the data structure.
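The arithmetic above can be reproduced with a few lines of Python. This is a sketch under two assumptions: the text is stored with one byte per character (Latin-1), and every ID uses a fixed width of ⌈log2(9)⌉ bits. Variable names are illustrative.

```python
import math

text = "la tarara sí\nla tarara no\nla tarara niña\nque la he visto yo"

words = text.split()
vocabulary = sorted(set(words))                      # 9 different terms
ids = [vocabulary.index(w) + 1 for w in words]       # 1-based IDs, as in the slides

original_bytes = len(text.encode("latin-1"))         # 1 byte per character
bits_per_id = math.ceil(math.log2(len(vocabulary)))  # fixed-width ID encoding
compressed_bytes = math.ceil(len(ids) * bits_per_id / 8)

print(ids)
print(original_bytes, bits_per_id, compressed_bytes)
```

The 7 bytes only cover the ID sequence; the dictionary itself still has to be stored, which is what the rest of the lecture optimizes.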
Dictionary Compression is used for optimizing applications of…
Natural Language Processing (e.g. Information Retrieval or Machine Translation)
Web Graph Management.
Triplestores (e.g. RDF-3X) and other semantic tools (e.g. HDT).
NoSQL databases.
Bioinformatics search engines.
Internet Routing.
Geographic Information Systems.
….
Dictionary Compression
Dictionaries have been traditionally implemented using well-known data structures:
Hash tables or tries for resolving locate queries.
Arrays for resolving extract queries.
These solutions are fast, but they require too much memory to be used in practical scenarios.
Data Structures
Datasets are increasingly bigger and more varied:
Vocabularies are also larger and comprise more heterogeneous terms.
The dictionary size is a bottleneck for applications running under main-memory restrictions.
The resulting dictionary data structure is very large and does not scale for efficient in-memory management:
Dictionary management is becoming a scalability issue by itself and it must be optimized for Big Data scenarios.
Preconditions:
Dictionaries are static (they are rebuilt from scratch when the vocabulary changes).
Dictionaries are cached in main memory.
The Problem…
Compressed String Dictionaries are a particular class of compact data structures optimized for dealing with string vocabularies from different domains.
Compressed String Dictionaries
Innovative compressed string dictionaries are proposed for managing big vocabularies in main memory:
Traditional dictionaries are revisited for optimizing their memory footprint.
Existing compact data structures are tuned to perform as dictionaries.
New compact data structures have been designed as compressed string dictionaries.
All these techniques ensure efficient in-memory query resolution:
locate and extract are resolved at microsecond level.
New interesting queries are also supported by these techniques:
Prefix-based queries retrieve IDs / terms matching a given prefix.
Substring-based queries retrieve IDs / terms matching a given substring.
The Solutions…
Queries
ID String
1 he
2 la
3 niña
4 no
5 que
6 sí
7 tarara
8 visto
9 yo
locate(“tarara”) = 7
extract(2) = “la”
locatePrefix(“n”) = {3,4}
extractPrefix(“n”) = {“niña”,“no”}
locateSubstring(“a”) = {2,3,7}
extractSubstring(“a”) = {“la”,“niña”,“tarara”}
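The prefix- and substring-based queries can be sketched over the same sorted vocabulary. This is a naive, uncompressed illustration: the function names mirror the operations above, but their signatures are assumptions. The substring scan is linear, whereas the self-indexes presented in the lecture avoid scanning the whole vocabulary.

```python
from bisect import bisect_left

terms = ["he", "la", "niña", "no", "que", "sí", "tarara", "visto", "yo"]


def locate_prefix(terms, prefix):
    """IDs (1-based) of all terms starting with prefix; they form a contiguous range."""
    lo = bisect_left(terms, prefix)
    hi = lo
    while hi < len(terms) and terms[hi].startswith(prefix):
        hi += 1
    return list(range(lo + 1, hi + 1))


def extract_prefix(terms, prefix):
    """Terms starting with prefix."""
    return [terms[i - 1] for i in locate_prefix(terms, prefix)]


def locate_substring(terms, pattern):
    """IDs (1-based) of all terms containing pattern (naive linear scan)."""
    return [i for i, term in enumerate(terms, start=1) if pattern in term]


def extract_substring(terms, pattern):
    """Terms containing pattern."""
    return [terms[i - 1] for i in locate_substring(terms, pattern)]


print(locate_prefix(terms, "n"), extract_prefix(terms, "n"))
print(locate_substring(terms, "a"), extract_substring(terms, "a"))
```

Because the vocabulary is sorted, all terms sharing a prefix are adjacent, so a prefix query reduces to one binary search plus a short sequential scan.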
Compressed Hash:
The hash table is simulated using bitmaps.
Strings are stored in compressed form (Huffman/Re-Pair).
locate / extract operations are implemented using rank / select.
Differential Front-Coding Compression:
Front-Coding exploits that consecutive strings (in the vocabulary) are likely to share a common prefix.
Plain Front-Coding dictionaries use byte-oriented compression.
Compressed Front-Coding dictionaries combine Hu-Tucker and Huffman/Re-Pair compression.
Primitive and prefix-based operations are implemented using binary search and efficient sequential decoding.
Self-Indexes:
The FM-Index is adapted to perform as a dictionary and the XBW introduces a self-indexed trie.
All operations are implemented exploiting the BWT features.
Techniques for Compressing Dictionaries
More Details…
Compressed String Dictionaries answer queries at the level of microseconds, while compressing vocabularies up to 20 times.
Some Experimental Numbers
We analyze compression effectiveness and retrieval speed:
locate, extract.
Prefix-based operations (URIs)
Substring-based operations (Literals).
In practice, extract is the most important query:
It is used many times as results are retrieved from the compressed dataset.
26,948,638 URIs from Uniprot:
Average length: 51.04 chars per URI.
Highly repetitive.
27,592,013 Literals from DBpedia:
Average length: 60.45 chars per Literal.
Experimental Setup
Locate / Extract Performance (URIs)
PFC is the fastest choice for locate/extract…
locate ≈ 1.6 μs/string.
extract ≈ 0.3-0.6 μs/ID.
…but requires more space:
≈ 9 − 19 % of the original space.
HTFC (compressed Front-Coding) reports the most balanced space/time tradeoffs:
locate ≈ 2.2-3 μs/string.
extract ≈ 0.7-1.6 μs/ID.
≈ 5 − 13 % of the original space.
Locate / Extract Performance (Literals)
HTFC reports the best compression ratios, but its performance is less competitive:
locate ≈ 2-2.5 μs/string.
extract > 2.5 μs/ID.
≈ 12 % of the original space.
HashDAC-rp (compressed Hashing) reports the best tradeoffs:
locate ≈ 1.5 μs/string.
extract ≈ 1 μs/ID.
≈ 15 % of the original space.
Domain Entity Retrieval (URIs)
PFC is the best choice for prefix-based operations,
although it uses more space than the other approaches.
Full-Text Search (Literals)
Self-index based dictionaries are the only ones providing full-text search:
FMI is the fastest solution (≈ 1 μs/result), but it uses more space than the original vocabulary.
XBW is the better choice for this scenario:
≈ 5-6 μs/result.
≈ 40 % of the original space.
• Foundations
• RDF Dictionary-based Compression
• Dictionaries in Practice
RDF Dictionaries
Dictionary Compression
RDF Dictionaries are a core component of any compression or indexing approach designed for semantic datasets.
Foundations
An RDF dictionary comprises all different terms used in the dataset:
Terms are drawn from 3 disjoint vocabularies: URIs, Literals, and blank nodes.
URIs are medium-size strings which share long prefixes:
http://example.org/property/age
http://example.org/property/location
http://example.org/person/abe-simpson
http://example.org/person/bart-simpson
Literals tend to be large-size strings (with no predictable features), or numbers, or dates…:
“742 Evergreen Terrace”
“Bart Simpson”
“Homer Simpson”
10
Blank node serialization is not standardized:
“Auto-incremental” strings are usually used → similar features to URIs.
Basics
Primitive operations are used extensively:
locate operations are common when the dictionary is used for lookup purposes (e.g. RDF stores, semantic search engines, etc.).
extract operations are common when the dictionary is used for data access purposes (e.g. decompression, result retrieval, etc.).
Prefix-based operations are most relevant for URIs:
Finding all URIs in a given domain: e.g. retrieve all URIs from http://example.org/person/.
Substring-based operations are an open challenge for Literals:
REGEX SPARQL queries: e.g. look for all literals containing the substring “Simpson”.
Dictionary Queries
URIs and Literals should be compressed and managed independently…
Their structure is very different and they are queried in a different way.
…but they should also be organized according to their role in the dataset:
Literals always play an object role.
URIs can be used as subject, predicate, and/or object.
Decisions
RDF Dictionary-based compression uses several dictionaries to optimize the compression of URIs and Literals.
RDF Dictionary-based Compression
A role-based partition is first performed:
Subjects are encoded in the range [1,|S|].
Predicates are encoded in the range [1,|P|].
Objects are encoded in the range [1,|O|].
URIs playing both the subject and the object role are encoded once:
IDs in [1,|SO|] encode terms playing as both subject and object.
The remaining subjects are encoded in [|SO|+1,|S|].
The remaining objects are encoded using two dictionaries:
[|SO|+1,|Ox|] encodes URIs which only play the object role.
[|Ox|+1,|O|] encodes Literals.
Predicates are encoded in [1,|P|].
Dictionary Organization
RDF Dictionaries in Practice
Figure: RDF graph of the example dataset (the Simpson family and Springfield).
<http://example.org/location/springfield> <http://example.org/property/name> "Springfield" .
<http://example.org/person/abe-simpson> <http://example.org/property/age> 83 .
<http://example.org/person/abe-simpson> <http://example.org/property/name> "Abe Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/age> 10 .
<http://example.org/person/bart-simpson> <http://example.org/property/name> "Bart Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/father> <http://example.org/person/homer-simpson> .
<http://example.org/person/bart-simpson> <http://example.org/property/mother> <http://example.org/person/marge-simpson> .
<http://example.org/person/homer-simpson> <http://example.org/property/address> "742 Evergreen Terrace" .
<http://example.org/person/homer-simpson> <http://example.org/property/name> "Homer Simpson" .
<http://example.org/person/homer-simpson> <http://example.org/property/location> <http://example.org/location/springfield> .
<http://example.org/person/homer-simpson> <http://example.org/property/father> <http://example.org/person/abe-simpson> .
<http://example.org/person/marge-simpson> <http://example.org/property/address> "742 Evergreen Terrace" .
<http://example.org/person/marge-simpson> <http://example.org/property/name> "Marge Simpson" .
<http://example.org/person/marge-simpson> <http://example.org/property/location> <http://example.org/location/springfield> .
Looking for Subject-Object (SO) terms…
Four URIs play both the subject and the object role: location/springfield, person/abe-simpson, person/homer-simpson, and person/marge-simpson.
Building SO Dictionary & Compressing terms
SO
ID RDF Term
1 http://example.org/location/springfield
2 http://example.org/person/abe-simpson
3 http://example.org/person/homer-simpson
4 http://example.org/person/marge-simpson
1 <http://example.org/property/name> "Springfield" .
2 <http://example.org/property/age> 83 .
2 <http://example.org/property/name> "Abe Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/age> 10 .
<http://example.org/person/bart-simpson> <http://example.org/property/name> "Bart Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/father> 3 .
<http://example.org/person/bart-simpson> <http://example.org/property/mother> 4 .
3 <http://example.org/property/address> "742 Evergreen Terrace" .
3 <http://example.org/property/name> "Homer Simpson" .
3 <http://example.org/property/location> 1 .
3 <http://example.org/property/father> 2 .
4 <http://example.org/property/address> "742 Evergreen Terrace" .
4 <http://example.org/property/name> "Marge Simpson" .
4 <http://example.org/property/location> 1 .
Looking for Subject (S) terms…
Only http://example.org/person/bart-simpson plays the subject role exclusively; its ID follows the SO range.
Building S Dictionary & Compressing terms
S
ID RDF Term
5 http://example.org/person/bart-simpson
1 <http://example.org/property/name> "Springfield" .
2 <http://example.org/property/age> 83 .
2 <http://example.org/property/name> "Abe Simpson" .
5 <http://example.org/property/age> 10 .
5 <http://example.org/property/name> "Bart Simpson" .
5 <http://example.org/property/father> 3 .
5 <http://example.org/property/mother> 4 .
3 <http://example.org/property/address> "742 Evergreen Terrace" .
3 <http://example.org/property/name> "Homer Simpson" .
3 <http://example.org/property/location> 1 .
3 <http://example.org/property/father> 2 .
4 <http://example.org/property/address> "742 Evergreen Terrace" .
4 <http://example.org/property/name> "Marge Simpson" .
4 <http://example.org/property/location> 1 .
Looking for Object (O) terms…
The remaining objects are all Literals (no URI plays the object role exclusively), so their IDs start at |SO|+1 = 5.
Building O Dictionary & Compressing Terms
O
ID RDF Term
5 "742 Evergreen Terrace"
6 "Abe Simpson"
7 "Bart Simpson"
8 "Homer Simpson"
9 "Marge Simpson"
10 "Springfield"
11 10
12 83
1 <http://example.org/property/name> 10 .
2 <http://example.org/property/age> 12 .
2 <http://example.org/property/name> 6 .
5 <http://example.org/property/age> 11 .
5 <http://example.org/property/name> 7 .
5 <http://example.org/property/father> 3 .
5 <http://example.org/property/mother> 4 .
3 <http://example.org/property/address> 5 .
3 <http://example.org/property/name> 8 .
3 <http://example.org/property/location> 1 .
3 <http://example.org/property/father> 2 .
4 <http://example.org/property/address> 5 .
4 <http://example.org/property/name> 9 .
4 <http://example.org/property/location> 1 .
Looking for Predicate (P) terms…
Building P Dictionary & Compressing terms
P
ID RDF Term
1 http://example.org/property/address
2 http://example.org/property/age
3 http://example.org/property/father
4 http://example.org/property/location
5 http://example.org/property/mother
6 http://example.org/property/name
1 6 10 .
2 2 12 .
2 6 6 .
5 2 11 .
5 6 7 .
5 3 3 .
5 5 4 .
3 1 5 .
3 6 8 .
3 4 1 .
3 3 2 .
4 1 5 .
4 6 9 .
4 4 1 .
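The whole walk-through (SO, S, O and P dictionaries, then ID-triples) can be reproduced with a short script. This is a minimal sketch assuming that terms are plain strings, that URIs are recognised by the `http://` prefix, and that each dictionary is sorted lexicographically; the helper and variable names are illustrative only.

```python
PERSON = "http://example.org/person/"
PROP = "http://example.org/property/"
SPRINGFIELD = "http://example.org/location/springfield"

triples = [
    (SPRINGFIELD, PROP + "name", '"Springfield"'),
    (PERSON + "abe-simpson", PROP + "age", "83"),
    (PERSON + "abe-simpson", PROP + "name", '"Abe Simpson"'),
    (PERSON + "bart-simpson", PROP + "age", "10"),
    (PERSON + "bart-simpson", PROP + "name", '"Bart Simpson"'),
    (PERSON + "bart-simpson", PROP + "father", PERSON + "homer-simpson"),
    (PERSON + "bart-simpson", PROP + "mother", PERSON + "marge-simpson"),
    (PERSON + "homer-simpson", PROP + "address", '"742 Evergreen Terrace"'),
    (PERSON + "homer-simpson", PROP + "name", '"Homer Simpson"'),
    (PERSON + "homer-simpson", PROP + "location", SPRINGFIELD),
    (PERSON + "homer-simpson", PROP + "father", PERSON + "abe-simpson"),
    (PERSON + "marge-simpson", PROP + "address", '"742 Evergreen Terrace"'),
    (PERSON + "marge-simpson", PROP + "name", '"Marge Simpson"'),
    (PERSON + "marge-simpson", PROP + "location", SPRINGFIELD),
]


def is_uri(term):
    # Assumption for this example: every URI starts with "http://".
    return term.startswith("http://")


def number(terms, first_id):
    """Assign consecutive IDs, starting at first_id, to an ordered list of terms."""
    return {term: i for i, term in enumerate(terms, start=first_id)}


subjects = {s for s, _, _ in triples}
predicates = {p for _, p, _ in triples}
objects = {o for _, _, o in triples}

so_terms = sorted(subjects & objects)   # shared subject-object URIs
s_terms = sorted(subjects - objects)    # subject-only terms
o_only = objects - subjects             # object-only terms: URIs first, then Literals
o_terms = sorted(t for t in o_only if is_uri(t)) + sorted(t for t in o_only if not is_uri(t))
p_terms = sorted(predicates)

so_ids = number(so_terms, 1)
s_ids = {**so_ids, **number(s_terms, len(so_terms) + 1)}
o_ids = {**so_ids, **number(o_terms, len(so_terms) + 1)}
p_ids = number(p_terms, 1)

id_triples = [(s_ids[s], p_ids[p], o_ids[o]) for s, p, o in triples]
for triple in id_triples:
    print(*triple)
```

Sharing the [1,|SO|] range between subjects and objects means that a term appearing in both roles is stored once and keeps the same ID in either position.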
Dictionary Compression
Dictionaries are now compressed.
Let’s see how the predicate dictionary is compressed using Plain Front Coding.
1. Terms are concatenated in lexicographic order:
http://example.org/property/address$http://example.org/property/age$http://example.org/property/father$http://example.org/property/location$http://example.org/property/mother$http://example.org/property/name$
2. Terms are then organized into buckets of b strings (e.g. b=3):
B1 = http://example.org/property/address$http://example.org/property/age$http://example.org/property/father$
B2 = http://example.org/property/location$http://example.org/property/mother$http://example.org/property/name$
3. Each bucket is independently compressed:
The first term is preserved “as is”.
Each internal string is differentially encoded with respect to its predecessor: the length of the shared prefix, followed by the remaining suffix.
4. The compressed result is managed into a simple byte array (Tpfc):
Prefix lengths are encoded using VByte and suffixes are encoded byte-to-byte (ASCII).
Bucket 1: http://example.org/property/address$ 29 ge$ 28 father$
Bucket 2: http://example.org/property/location$ 28 mother$ 28 name$
5. An additional integer array (ptrs) is used to store the position of the first byte of each bucket:
ptrs = 0 48
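The five steps can be sketched as follows. This is an illustrative implementation, not the libCSD code: the function names are hypothetical, and the VByte variant used here (7 data bits per byte, high bit set on the last byte) is one common convention chosen as an assumption. Terms are assumed to be ASCII and free of the `$` terminator.

```python
def vbyte(n):
    """Encode a non-negative integer: 7 bits per byte, high bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def read_vbyte(data, pos):
    """Decode one VByte integer starting at pos; return (value, next position)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            return value, pos
        shift += 7


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pfc_encode(terms, bucket_size):
    """Front-code a sorted list of terms; return (byte array Tpfc, bucket pointers)."""
    data, ptrs = bytearray(), []
    for i, term in enumerate(terms):
        raw = term.encode("ascii")
        if i % bucket_size == 0:
            ptrs.append(len(data))          # first byte of a new bucket
            data += raw + b"$"              # bucket header is stored as is
        else:
            lcp = common_prefix_len(terms[i - 1].encode("ascii"), raw)
            data += vbyte(lcp) + raw[lcp:] + b"$"
    return bytes(data), ptrs


def pfc_extract(data, ptrs, bucket_size, i):
    """Return the i-th term (1-based) by decoding its bucket sequentially."""
    bucket, offset = divmod(i - 1, bucket_size)
    pos = ptrs[bucket]
    end = data.index(b"$", pos)
    term = data[pos:end]
    pos = end + 1
    for _ in range(offset):
        lcp, pos = read_vbyte(data, pos)
        end = data.index(b"$", pos)
        term = term[:lcp] + data[pos:end]   # shared prefix + stored suffix
        pos = end + 1
    return term.decode("ascii")


predicates = ["http://example.org/property/" + name
              for name in ("address", "age", "father", "location", "mother", "name")]
tpfc, ptrs = pfc_encode(predicates, 3)
print(ptrs, len(tpfc))
print(pfc_extract(tpfc, ptrs, 3, 3))
```

A locate operation would binary-search the bucket headers through ptrs and then decode a single bucket, which is why the bucket size b trades space for speed.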
RDF Dictionaries are used for SPARQL resolution, but they also allow other interesting queries to be efficiently resolved in the Linked Data workflow.
Dictionaries in Practice
Normative SPARQL
1 6 10
2 2 12
2 6 6
5 2 11
5 6 7
5 3 3
5 5 4
3 1 5
3 6 8
3 4 1
3 3 2
4 1 5
4 6 9
4 4 1
Retrieve all people living in Springfield.
@prefix …
SELECT ?Who
WHERE { ?Who property:location location:springfield }
P.locate(http://example.org/property/location) = 4
SO.locate(http://example.org/location/springfield) = 1
Looking for (?Who 4 1) → ?Who = 3, 4
SO.extract(3) = http://example.org/person/homer-simpson
SO.extract(4) = http://example.org/person/marge-simpson
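The resolution of this triple pattern can be sketched with plain Python lists standing in for the dictionaries and a linear scan standing in for the triple index. All names are illustrative assumptions; a real store would use a compressed dictionary and an indexed triple structure instead.

```python
SO = [
    "http://example.org/location/springfield",
    "http://example.org/person/abe-simpson",
    "http://example.org/person/homer-simpson",
    "http://example.org/person/marge-simpson",
]
S_ONLY = ["http://example.org/person/bart-simpson"]
P = ["http://example.org/property/" + name
     for name in ("address", "age", "father", "location", "mother", "name")]

ID_TRIPLES = [
    (1, 6, 10), (2, 2, 12), (2, 6, 6), (5, 2, 11), (5, 6, 7), (5, 3, 3), (5, 5, 4),
    (3, 1, 5), (3, 6, 8), (3, 4, 1), (3, 3, 2), (4, 1, 5), (4, 6, 9), (4, 4, 1),
]


def locate(terms, term):
    """1-based ID of term in a dictionary stored as a list."""
    return terms.index(term) + 1


def extract_subject(i):
    """Subject IDs up to |SO| belong to the SO dictionary, the rest to S."""
    return SO[i - 1] if i <= len(SO) else S_ONLY[i - len(SO) - 1]


# 1. Translate the bound terms of the pattern into IDs.
pred_id = locate(P, "http://example.org/property/location")
obj_id = locate(SO, "http://example.org/location/springfield")

# 2. Match the pattern (?Who pred_id obj_id) against the ID-triples.
who_ids = [s for s, p, o in ID_TRIPLES if p == pred_id and o == obj_id]

# 3. Translate the resulting IDs back into terms.
who = [extract_subject(i) for i in who_ids]
print(pred_id, obj_id, who_ids)
print(who)
```

Only the bound terms and the final results touch the dictionary; the pattern matching itself works entirely on integers.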
Domain Entity Retrieval
Retrieve all people in our domain: http://example.org/person/
SO.extractPrefix(http://example.org/person/) =
http://example.org/person/abe-simpson
http://example.org/person/homer-simpson
http://example.org/person/marge-simpson
S.extractPrefix(http://example.org/person/) =
http://example.org/person/bart-simpson
O.extractPrefix(http://example.org/person/) =
-
Full-Text Search
Retrieve all terms which include “Simpson”:
O.extractSubstring(“Simpson”) =
“Abe Simpson”
“Bart Simpson”
“Homer Simpson”
“Marge Simpson”
Conclusions
Dictionary Compression
RDF dictionaries are highly compressible:
URIs are very redundant and Literals also show non-negligible symbolic redundancy.
This redundancy can be detected and removed within specific data structures for dictionaries:
Structures for URIs use up to 20 times less space than the original dictionaries.
For Literals, the corresponding structures use 6 − 8 times less space than the original dictionaries.
All these structures report data retrieval performance at microsecond level:
This functionality includes both simple and advanced operations.
Compressed string dictionaries are available in the libCSD C++ library (beta):
We are working on a new release including more techniques and more search functionality (e.g. top K).
https://github.com/migumar2/libCSD
Triples Compression
Let’s continue the lecture…
Image:ROYAL MINT & ALCÁZAR (SEGOVIA, SPAIN)