Hexastore: Sextuple Indexing for Semantic Web Data Management
Presented by Cathrin Weiss, Panagiotis Karras, Abraham Bernstein
Department of Informatics, University of Zurich
Session: Indexing and Query Processing, VLDB 2008
2010-01-22
Summarized by Jaeseok Myung
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
Copyright 2010 by CEBT
Overview
Hexastore – Sextuple Indexing
A Triple (S, P, O) can be represented in six ways (3! = 6)
– SPO, SOP, PSO, POS, OSP, OPS
Every possible indexing scheme can be materialized
– Allows quick and scalable query processing
– Requires up to five times more index space
In this presentation,
Review conventional RDF storage structures
Introduction to Hexastore
Discussion
Center for E-Business Technology IDS Lab. Seminar – 2/20
Physical Designs for RDF Storage (1/4)
Giant Triples Table
SELECT ?title
WHERE {
  ?book <title> ?title .
  ?book <author> <Fox, Joe> .
  ?book <copyright> <2001>
}
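Problems with this query on the giant triples table
– Join! Join! (the three triple patterns need two self-joins)
– Entire Table Scan!
– Redundancy!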
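The cost of this query on a giant triples table can be sketched in a few lines of Python. This is only an illustration: the sample triples and the helper `scan` are assumptions, not taken from the slides, and a real system would run the joins inside a relational engine.

```python
# Toy giant triples table (subject, property, object); the data is made up.
triples = [
    ("book1", "title", "RDF Basics"),
    ("book1", "author", "Fox, Joe"),
    ("book1", "copyright", "2001"),
    ("book2", "title", "Graphs"),
    ("book2", "author", "Fox, Joe"),
    ("book2", "copyright", "1999"),
]

def scan(table, prop, obj=None):
    """Scan the whole table for one triple pattern (hypothetical helper)."""
    return [(s, o) for s, p, o in table if p == prop and (obj is None or o == obj)]

# One full scan per triple pattern ...
titles = scan(triples, "title")
by_author = {s for s, _ in scan(triples, "author", "Fox, Joe")}
by_year = {s for s, _ in scan(triples, "copyright", "2001")}

# ... and two joins on the subject to combine the three patterns.
result = [title for s, title in titles if s in by_author and s in by_year]
print(result)  # ['RDF Basics']
```

Every pattern touches all rows, which is the "entire table scan" cost noted on the slide.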
Physical Designs for RDF Storage (2/4)
Clustered Property Table
Contains clusters of properties that tend to be defined together
Physical Designs for RDF Storage (3/4)
Property-Class Table
Exploits the type property of subjects to cluster similar sets of subjects together in the same table
Unlike clustered property table, a property may exist in multiple property-class tables
[Figure label: Values of the type property]
Physical Designs for RDF Storage (4/4)
Vertically Partitioned Table
The giant table is rewritten into n two-column tables, where n is the number of unique properties in the data
We do not have to
– Maintain null values
– Choose a clustering algorithm
[Figure labels: subject, property, object]
Motivation
The problem of having non-property-bound queries
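A minimal sketch of why non-property-bound queries hurt a vertically partitioned store. The sample data and the function names (`vertical_partition`, `property_bound`, `non_property_bound`) are assumptions made for illustration only.

```python
from collections import defaultdict

# Toy triples (subject, property, object); all values are made up.
triples = [
    ("book1", "title", "RDF Basics"),
    ("book1", "author", "Fox, Joe"),
    ("book1", "copyright", "2001"),
    ("book2", "title", "Graphs"),
    ("book2", "author", "Fox, Joe"),
]

def vertical_partition(rows):
    """Split the triples into one sorted (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in rows:
        tables[p].append((s, o))
    for table in tables.values():
        table.sort()
    return dict(tables)

def property_bound(tables, prop, obj):
    """The property is known, so exactly one table is read."""
    return [s for s, o in tables.get(prop, []) if o == obj]

def non_property_bound(tables, subj, obj):
    """The property is unknown, so every property table must be probed."""
    hits, tables_read = [], 0
    for prop, table in tables.items():
        tables_read += 1
        if (subj, obj) in table:
            hits.append(prop)
    return hits, tables_read

tables = vertical_partition(triples)
authored = property_bound(tables, "author", "Fox, Joe")
props, tables_read = non_property_bound(tables, "book1", "2001")
print(authored, props, tables_read)  # ['book1', 'book2'] ['copyright'] 3
```

With 258 properties, as in the Barton data set used later, the second kind of query would have to visit 258 tables.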
Hexastore: Sextuple Indexing
[Figure: Hexastore index diagram over S, P, and O]
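The six indices can be sketched as nested dictionaries: a header key, a sorted vector of second-level keys, and a sorted terminal list. This is an illustrative assumption about the layout, not the authors' implementation; `build_hexastore` and the sample triples are hypothetical, and the sketch does not share terminal lists between paired indices.

```python
from collections import defaultdict
from itertools import permutations

def build_hexastore(triples):
    """Return {'spo': {s: {p: [o, ...]}}, 'sop': ..., ...} for all six orderings."""
    indices = {}
    for order in permutations("spo"):
        name = "".join(order)
        index = defaultdict(lambda: defaultdict(set))
        for s, p, o in triples:
            value = {"s": s, "p": p, "o": o}
            header, vector_key, list_item = (value[c] for c in order)
            index[header][vector_key].add(list_item)
        # Freeze into plain dicts with sorted vectors and sorted terminal lists.
        indices[name] = {
            h: {k: sorted(items) for k, items in sorted(vec.items())}
            for h, vec in index.items()
        }
    return indices

triples = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
    ("object7", "hasColor", "blue"),
]
hx = build_hexastore(triples)
print(sorted(hx))                      # ['ops', 'osp', 'pos', 'pso', 'sop', 'spo']
print(hx["pos"]["hasColor"]["blue"])   # ['object214', 'object7']
print(hx["spo"]["object214"])          # {'belongsTo': ['object352'], 'hasColor': ['blue']}
```

Whichever of S, P, and O a query binds, one of the six indices starts with the bound positions.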
Hexastore: Sextuple Indexing
Five-fold Increase in Index Space
Sharing The Same Terminal Lists
SPO-PSO, SOP-OSP, POS-OPS
The key of each of the three resources in a triple appears in two headers and two vectors, but only in one list
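The five-fold figure follows from counting key entries per triple. The count below is a worked sketch of the worst case, assuming that no header or vector entry is shared between triples; a terminal item is counted once because each pair of indices shares one list.

```latex
\begin{align*}
\text{triples table:} \quad & 3 \text{ key entries per triple} \\
\text{Hexastore:} \quad & 3 \times \bigl(\underbrace{2}_{\text{headers}} + \underbrace{2}_{\text{vectors}} + \underbrace{1}_{\text{list}}\bigr) = 15 \text{ key entries per triple} \\
\text{ratio:} \quad & 15 / 3 = 5
\end{align*}
```

In practice many triples share the same header and vector entries, so the actual increase can be smaller than five-fold.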
Mapping Dictionary
Replacing all literals by unique IDs using a mapping dictionary
Mapping dictionary compresses the triple store
– Reduced redundancy, saving a lot of physical space
We can concentrate on a logical index structure rather than the physical storage design
Triples:
S P O
object214 hasColor blue
object214 belongsTo object352
… … …

Encoded triples:
S P O
0 1 2
0 3 4
… … …

Mapping dictionary:
ID Value
0 object214
1 hasColor
… …
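The encoding shown in the tables can be sketched as follows. The function name `encode` and the ID assignment policy (IDs handed out in order of first appearance) are assumptions chosen so that the output matches the slide's example.

```python
def encode(triples):
    """Replace every distinct value with an integer ID (first appearance gets the next ID)."""
    ids = {}
    encoded = []
    for triple in triples:
        row = []
        for value in triple:
            if value not in ids:
                ids[value] = len(ids)
            row.append(ids[value])
        encoded.append(tuple(row))
    # Reverse mapping (ID -> value), used to decode query results.
    values = [None] * len(ids)
    for value, i in ids.items():
        values[i] = value
    return encoded, values

triples = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
]
encoded, values = encode(triples)
print(encoded)  # [(0, 1, 2), (0, 3, 4)]
print(values)   # ['object214', 'hasColor', 'blue', 'belongsTo', 'object352']
```

Each string is stored once in the dictionary, and the indices only hold fixed-size integers.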
Clustered B+-Tree (RDF-3X, VLDB 2008)
Store everything in a clustered B+-Tree
Triples are sorted in lexicographical order
– Allowing the conversion of SPARQL patterns into range scans
We don’t have to scan the entire table
[Figure: B+-tree over the sorted triples; node keys 002 … and 000 001 002 003]

S P O
0 1 2
0 3 4
… … …
Actually, we don’t need this table!

ID Value
0 object214
1 hasColor
… …
<Mapping Dictionary>
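How a triple pattern turns into a range scan can be sketched with a sorted list standing in for the leaf level of the clustered B+-tree. This is an assumption-laden illustration: the list `spo`, the helper `range_scan`, and the integer IDs are made up, and only one sort order is shown.

```python
from bisect import bisect_left

# ID-encoded triples kept in lexicographic (S, P, O) order. Values are made up.
spo = sorted([(0, 1, 2), (0, 3, 4), (0, 3, 7), (1, 1, 2), (2, 3, 4)])

def range_scan(sorted_triples, prefix):
    """Return all triples that start with `prefix` (a tuple of 1 to 3 integer IDs)."""
    lo = bisect_left(sorted_triples, prefix)
    # Smallest tuple that is greater than every tuple carrying this prefix.
    hi = bisect_left(sorted_triples, prefix[:-1] + (prefix[-1] + 1,))
    return sorted_triples[lo:hi]

# Pattern (S=0, P=3, ?o): two binary searches, then one contiguous read.
print(range_scan(spo, (0, 3)))  # [(0, 3, 4), (0, 3, 7)]
print(range_scan(spo, (1,)))    # [(1, 1, 2)]
```

Only the matching slice is read, so the cost depends on the result size rather than on the table size.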
Argumentation
Concise and Efficient Handling of Multi-valued Resources
Index can contain multiple items
cf. Multi-valued Property Table
Avoidance of NULLs
Only those RDF elements that are relevant to a particular other element need to be stored in a particular index
No ad-hoc Choices Needed
Most other RDF data storage schemes require several ad-hoc decisions about their data representation architecture
– e.g., Clustered Property Table (which properties should be stored together)
Argumentation
Reduced I/O cost
Other RDF storage schemes may need to access multiple tables which are irrelevant to a query
– Queries that are not bound by property
All First-step Pairwise Joins are Fast Merge-Joins
The keys of resources in all vectors and lists used in a Hexastore are sorted
Reduction of Unions and Joins
e.g., a list of subjects related to two particular objects through any property
– Hexastore can use the osp index
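The osp example can be sketched as a single merge join over two sorted subject vectors. The index contents and the names `merge_intersect` and `subjects_related_to_both` are hypothetical; the `sorted()` calls only compensate for using plain dictionaries, whereas the slide states that Hexastore vectors are already sorted.

```python
def merge_intersect(a, b):
    """Intersect two sorted lists in one linear pass (merge join)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical osp index: object -> {subject: [properties]}.
osp = {
    "blue": {"object214": ["hasColor"], "object7": ["likes"], "object9": ["hasColor"]},
    "metal": {"object214": ["madeOf"], "object9": ["madeOf"]},
}

def subjects_related_to_both(index, o1, o2):
    """Subjects linked to both objects through any property."""
    return merge_intersect(sorted(index.get(o1, {})), sorted(index.get(o2, {})))

print(subjects_related_to_both(osp, "blue", "metal"))  # ['object214', 'object9']
```

No union over property tables is needed, because the subject vector of each object already covers all properties.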
Treating the Path Expression Problem
SELECT B.subj
FROM triples AS A, triples AS B
WHERE A.prop = 'wasBorn'
  AND A.obj = '1860'
  AND A.subj = B.obj
  AND B.prop = 'Author'
A path expression requires (n-1) subject-object self-joins where n is the length of the path
Vertical Partitioning
– Materialized Path Expressions (A.author:wasBorn = ‘1860’)
– C(n-1, 2) = O(n^2) possible additional properties
Hexastore
– (n-1) merge-joins using the pso and pos indices
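The path query above can be sketched as one merge join for a path of length two. The index contents and the function `born_then_author` are hypothetical, and for brevity the sketch reads only a pos-style index (property, then object, then sorted subjects), although the slide names both pso and pos.

```python
# Hypothetical pos index: property -> {object: sorted [subjects]}.
pos = {
    "wasBorn": {"1860": ["person1", "person3"], "1900": ["person2"]},
    "Author": {"person1": ["bookA", "bookB"], "person2": ["bookC"], "person3": ["bookD"]},
}

def born_then_author(index, year):
    """Subjects B.subj with B.prop = Author, B.obj = A.subj, A.prop = wasBorn, A.obj = year."""
    born = index["wasBorn"].get(year, [])        # sorted A.subj values
    author_objects = sorted(index["Author"])     # sorted B.obj keys
    result = []
    i = j = 0
    # Merge join on A.subj = B.obj: both inputs are sorted, so one pass suffices.
    while i < len(born) and j < len(author_objects):
        if born[i] == author_objects[j]:
            result.extend(index["Author"][born[i]])
            i += 1
            j += 1
        elif born[i] < author_objects[j]:
            i += 1
        else:
            j += 1
    return result

print(born_then_author(pos, "1860"))  # ['bookA', 'bookB', 'bookD']
```

A longer path would repeat the same step, giving the (n-1) merge-joins mentioned on the slide.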
Experimental Evaluation
Setup
2.8GHz dual core, 16GB RAM
Competitors
Column-oriented Vertical Partitioning Approaches
– COVP1 – PSO Index
– COVP2 – PSO Index + POS Index (second copy)
Hexastore
– SPO, SOP, PSO, POS, OSP, OPS
Datasets
Barton, MIT library data, 61 mil. triples, 258 properties
LUBM, a synthetic benchmark data set (10 univ.), 6.8 mil. triples, 18 predicates
Performance (Barton Data)
Performance (LUBM, 10)
Memory Usage
In practice, Hexastore requires a four-fold increase in memory in comparison to COVP1, which is an affordable cost for the derived advantages
Conclusion
Hexastore: Sextuple-Indexing Scheme
Worst-case five-fold storage increase in comparison to a conventional triples table
Quick and scalable general-purpose query processing
– All pairwise joins in a Hexastore can be rendered as merge joins
My Question
Main-memory Indexing (Is it possible?)
– 7GB RAM for 6 mil. triples
Other Options?