Graph Analytics on Massive Collections of Small Graphs
Dritan Bleco Yannis Kotidis
Department of Informatics, Athens University of Economics and Business
EDBT 2014 - Athens
Outline
• Motivation
• Graph Records & Queries
• Storage of Graph Records and Indexing using a Column Store
• Graph View Materialization
• Selection of Graph Views
• Extensions
• Experiments
• Conclusions
Motivational Example
• Focus on small graphs that are generated continuously
  – Examples: data from CRM, WMS and SCM applications
• Difference between our targeted applications and other applications of graphs (e.g. social web, biology)
  – Not a single massive graph but a massive collection of smaller graphs
  – Nodes/edges are mapped to real-world entities
    • Thus, no need for isomorphism discovery
Framework Overview
• Our framework puts together three different techniques
  – A column-oriented relational backend to permit a flat description of the graph records
    • Alleviates recursion and costly joins for path calculations (required in a straightforward relational implementation)
  – A very efficient indexing mechanism using bitmap columns
    • Analogous to bitmap indexes frequently used in data warehouses
    • This model is generic and can accommodate specialized graph indexes (for example the gIndex)
  – A framework that permits the creation and reuse of materialized graph views of different types
    • These views improve query times, especially for aggregation queries
[Figure: example delivery network with nodes A to K grouped into Production Lines, Hubs and Customer Locations, spanning Region 1 and Region 2; each edge is marked as an Own Route or a Leased Route]
QUERIES
• Delivery Time for products shipped via [A, D, E, G, I] path
• Delivery Cost for products shipped using Leased Routes
• The longest delay for products shipped from Region 1 to Location I via Hubs of Region 2
Primitive Query Types
• Graph Queries
  – Find records that contain a given query graph Gq
  – The result is the record id with the respective measures of each matching record
  – For example, return delivery times along all hops in [A, D, E, G, I]
• Aggregate Graph Queries
  – A Graph Query Gq with the addition of a user-defined aggregate function f
  – The result is the aggregation of the measures along all maximal paths (paths connecting sink and terminal nodes in Gq)
  – E.g. total delivery time for all shipments via [A, D, E, G, I]
Graph Queries
Edge     AB AC CE AD DE EF FG
Edge Id  1  2  3  4  5  6  7
Graph records (edge id:measure)
Record 1 (nodes A, B, C, D, E): 1:3 2:4 3:2 4:1 5:2
Record 2 (nodes A, C, D, E, F, G): 2:1 3:2 4:2 5:3 6:4 7:1
Record 3 (nodes A, D, E, F, G): 4:5 5:4 6:3 7:1
Find records that follow path [ACEF]
Result: r2, AC:1, CE:2, EF:4 (record id, related measures)
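The graph query on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries `EDGE_ID` and `RECORDS` and the helper names are hypothetical, filled in with the edge ids and measures shown above, and the sketch ignores the column-store backend entirely.

```python
# Edge ids as listed in the slide's edge table.
EDGE_ID = {"AB": 1, "AC": 2, "CE": 3, "AD": 4, "DE": 5, "EF": 6, "FG": 7}

# Hypothetical in-memory form of the three example records: edge id -> measure.
RECORDS = {
    1: {1: 3, 2: 4, 3: 2, 4: 1, 5: 2},
    2: {2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 1},
    3: {4: 5, 5: 4, 6: 3, 7: 1},
}


def path_edges(path):
    """Split a node path such as 'ACEF' into its edges ['AC', 'CE', 'EF']."""
    return [path[i:i + 2] for i in range(len(path) - 1)]


def graph_query(path, records=RECORDS):
    """Return {record id: {edge: measure}} for records containing every edge of the path."""
    edges = path_edges(path)
    result = {}
    for rec_id, measures in records.items():
        if all(EDGE_ID[edge] in measures for edge in edges):
            result[rec_id] = {edge: measures[EDGE_ID[edge]] for edge in edges}
    return result


print(graph_query("ACEF"))  # {2: {'AC': 1, 'CE': 2, 'EF': 4}}
```

Only record 2 contains all of AC, CE and EF, which matches the result stated on the slide.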
Graph Aggregate Queries
[Figure: the same three graph records and edge table as on the "Graph Queries" slide]
Find records and the total (sum) cost for path [ADEF]
Result: r2, ADEF:9; r3, ADEF:12 (record id, aggregated measures)
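As a rough illustration of the aggregate variant, the sketch below applies a user-supplied aggregate function to the measures along a single path. The data structures and the function name `aggregate_graph_query` are assumptions made for this example; the measures are the ones from the three example records.

```python
# Edge ids as listed in the slide's edge table.
EDGE_ID = {"AB": 1, "AC": 2, "CE": 3, "AD": 4, "DE": 5, "EF": 6, "FG": 7}

# Hypothetical in-memory form of the three example records: edge id -> measure.
RECORDS = {
    1: {1: 3, 2: 4, 3: 2, 4: 1, 5: 2},
    2: {2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 1},
    3: {4: 5, 5: 4, 6: 3, 7: 1},
}


def aggregate_graph_query(path, agg=sum, records=RECORDS):
    """Return {record id: agg(measures along path)} for records containing the whole path."""
    edge_ids = [EDGE_ID[path[i:i + 2]] for i in range(len(path) - 1)]
    return {
        rec_id: agg(measures[e] for e in edge_ids)
        for rec_id, measures in records.items()
        if all(e in measures for e in edge_ids)
    }


print(aggregate_graph_query("ADEF"))           # {2: 9, 3: 12}
print(aggregate_graph_query("ADEF", agg=max))  # {2: 4, 3: 5}
```

Record 1 is excluded because it has no EF edge; the sums 2+3+4 and 5+4+3 reproduce the slide's result.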
Storage Model
[Figure: the same three graph records and edge table as on the "Graph Queries" slide]
rec Id m1 m2 m3 m4 m5 m6 m7
1 3 4 2 1 2 Null Null
2 Null 1 2 2 3 4 1
3 Null Null Null 5 4 3 1
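The flat layout above can be derived mechanically from the graph records: one measure column per edge id, with Null wherever a record lacks that edge. A minimal sketch, assuming a hypothetical dictionary representation of the records (`RECORDS`) and using Python's `None` in place of Null:

```python
# Hypothetical in-memory form of the three example records: edge id -> measure.
RECORDS = {
    1: {1: 3, 2: 4, 3: 2, 4: 1, 5: 2},
    2: {2: 1, 3: 2, 4: 2, 5: 3, 6: 4, 7: 1},
    3: {4: 5, 5: 4, 6: 3, 7: 1},
}
N_EDGES = 7  # edge ids 1..7 from the edge table


def to_flat_row(rec_id, measures, n_edges=N_EDGES):
    """Build (rec id, m1..m7); edges absent from the record become None (Null)."""
    return (rec_id,) + tuple(measures.get(edge_id) for edge_id in range(1, n_edges + 1))


rows = [to_flat_row(rec_id, measures) for rec_id, measures in sorted(RECORDS.items())]
for row in rows:
    print(row)
```

Because every record maps to exactly one row, a path lookup becomes a selection on that row rather than a chain of joins over an edge table.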
Bitmap Columns – a simple index
[Figure: the same three graph records and edge table as on the "Graph Queries" slide]
rec Id m1 m2 m3 m4 m5 m6 m7 b1 b2 b3 b4 b5 b6 b7
1 3 4 2 1 2 Null Null 1 1 1 1 1 0 0
2 Null 1 2 2 3 4 1 0 1 1 1 1 1 1
3 Null Null Null 5 4 3 1 0 0 0 1 1 1 1
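The bitmap columns in this table follow directly from the measure columns: bit bi is 1 exactly when mi is not Null. The sketch below shows that derivation on the three example rows; the list-of-tuples layout and the name `bitmap_columns` are assumptions for illustration, not the storage format of an actual column store.

```python
# Rows of the flat relation: (rec id, m1..m7), with None standing in for Null.
ROWS = [
    (1, 3, 4, 2, 1, 2, None, None),
    (2, None, 1, 2, 2, 3, 4, 1),
    (3, None, None, None, 5, 4, 3, 1),
]


def bitmap_columns(rows):
    """Return {edge id: [bit per row]}, where a bit is 1 iff the measure is present."""
    n_edges = len(rows[0]) - 1
    return {
        edge_id: [0 if row[edge_id] is None else 1 for row in rows]
        for edge_id in range(1, n_edges + 1)
    }


bitmaps = bitmap_columns(ROWS)
print(bitmaps[2])  # [1, 1, 0]  (edge AC appears in records 1 and 2)
print(bitmaps[6])  # [0, 1, 1]  (edge EF appears in records 2 and 3)
```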
Queries using Bitmap Columns
[Figure: graph over nodes A to G with the edges listed below]
Edge     AB AC CE AD DE EF FG
Edge Id  1  2  3  4  5  6  7
rec Id m1 m2 m3 m4 m5 m6 m7 b1 b2 b3 b4 b5 b6 b7
1 3 4 2 1 2 Null Null 1 1 1 1 1 0 0
2 Null 1 2 2 3 4 1 0 1 1 1 1 1 1
3 Null Null Null 5 4 3 1 0 0 0 1 1 1 1
Graph Query
Get the cost/delay of each edge on path [ACEF]
Select recid, m2, m3, m6 where b2=1 AND b3=1 AND b6=1
Graph Aggregate Query
Get the total cost/delay of path [ACEF]
Select recid, m2 + m3 + m6 where b2=1 AND b3=1 AND b6=1
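The two statements on this slide omit the relation name. To try them out, the sketch below loads the example table into an in-memory SQLite database and runs both queries. SQLite is a row store and serves only as a convenient stand-in for the column store used in the talk; the table name `graph_records` is hypothetical.

```python
import sqlite3

# Example rows: rec id, m1..m7, b1..b7 (None stands in for Null).
ROWS = [
    (1, 3, 4, 2, 1, 2, None, None, 1, 1, 1, 1, 1, 0, 0),
    (2, None, 1, 2, 2, 3, 4, 1, 0, 1, 1, 1, 1, 1, 1),
    (3, None, None, None, 5, 4, 3, 1, 0, 0, 0, 1, 1, 1, 1),
]

cols = ["recid"] + [f"m{i}" for i in range(1, 8)] + [f"b{i}" for i in range(1, 8)]
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE graph_records ({', '.join(cols)})")
placeholders = ", ".join("?" * len(cols))
conn.executemany(f"INSERT INTO graph_records VALUES ({placeholders})", ROWS)

# Graph query: per-edge measures of path [ACEF] (edge ids 2, 3, 6).
graph_rows = conn.execute(
    "SELECT recid, m2, m3, m6 FROM graph_records "
    "WHERE b2 = 1 AND b3 = 1 AND b6 = 1"
).fetchall()

# Aggregate graph query: total measure along the same path.
agg_rows = conn.execute(
    "SELECT recid, m2 + m3 + m6 FROM graph_records "
    "WHERE b2 = 1 AND b3 = 1 AND b6 = 1"
).fetchall()

print(graph_rows)  # [(2, 1, 2, 4)]
print(agg_rows)    # [(2, 7)]
conn.close()
```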
Graph View Materialization
• Materialized Graph Views
  – Used for Graph Queries / Aggregate Graph Queries
  – Implemented as bitmaps resulting from ANDing the bitmaps of the edges of a subgraph derived (by our techniques) from a set of graph queries
  – These bitmaps are added as new columns in the database
• Materialized Aggregate Graph Views
  – Used for Graph Queries / Aggregate Graph Queries
  – A bitmap (as in a Graph View) and pre-computed aggregates
    • The bitmap is the corresponding materialized Graph View
    • Aggregates are derived from the measures stored in graph records
Materialized Graph Views
[Figure: graph over nodes A to G with the edges listed below]
Edge     AB AC CE AD DE EF FG
Edge Id  1  2  3  4  5  6  7
rec Id m1 m2 m3 m4 m5 m6 m7 b1 b2 b3 b4 b5 b6 b7 bq1
1 3 4 2 1 2 Null Null 1 1 1 1 1 0 0 0
2 Null 1 2 2 3 4 1 0 1 1 1 1 1 1 1
3 Null Null Null 5 4 3 1 0 0 0 1 1 1 1 0
Query
Q1 = Get the cost/delay of [ACEF] path
Select recid, m2, m3, m6 where bq1=1 (replaces b2=1 AND b3=1 AND b6=1)
Materialized View for Q1: bq1 = b2 AND b3 AND b6
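Materializing the view for Q1 amounts to ANDing three bitmap columns once and storing the result as a new column. A minimal sketch, assuming bitmaps are kept as plain Python lists with one bit per record (a hypothetical layout chosen for readability; a real system would use compressed bitmaps):

```python
# Bitmap columns for the edges of path [ACEF], one bit per record (records 1, 2, 3).
B = {
    2: [1, 1, 0],  # b2: edge AC
    3: [1, 1, 0],  # b3: edge CE
    6: [0, 1, 1],  # b6: edge EF
}


def and_bitmaps(*columns):
    """Bitwise AND of equally long bitmap columns."""
    return [int(all(bits)) for bits in zip(*columns)]


# Materialized graph view for Q1.
bq1 = and_bitmaps(B[2], B[3], B[6])
matching_records = [i + 1 for i, bit in enumerate(bq1) if bit]

print(bq1)               # [0, 1, 0]
print(matching_records)  # [2]
```

Once `bq1` is stored, Q1 tests a single bitmap instead of three.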
Materialized Aggregate Views
[Figure: graph over nodes A to G with the edges listed below]
Edge     AB AC CE AD DE EF FG
Edge Id  1  2  3  4  5  6  7
rec Id m1 m2 m3 m4 m5 m6 m7 mq1 b1 b2 b3 b4 b5 b6 b7 bq1
1 3 4 2 1 2 Null Null Null 1 1 1 1 1 0 0 0
2 Null 1 2 2 3 4 1 7 0 1 1 1 1 1 1 1
3 Null Null Null 5 4 3 1 Null 0 0 0 1 1 1 1 0
Query
Q1 = Get the total cost of [ACEF] path
Select recid, mq1 where bq1=1 (replaces m2 + m3 + m6 and b2=1 AND b3=1 AND b6=1)
Path Aggregated Q1: bq1 = b2 AND b3 AND b6
mq1 = m2 + m3 + m6
[Figure: graph over nodes A to G with the edges listed below]
Edge     AB AC CE AD DE EF FG
Edge Id  1  2  3  4  5  6  7
rec Id m1 m2 m3 m4 m5 m6 m7 mq1 b1 b2 b3 b4 b5 b6 b7 bq1
1 3 4 2 1 2 Null Null Null 1 1 1 1 1 0 0 0
2 Null 1 2 2 3 4 1 7 0 1 1 1 1 1 1 1
3 Null Null Null 5 4 3 1 Null 0 0 0 1 1 1 1 0
Another query can use the materialization of Q1
Q2 = Get the total cost/delay of [ACEFG] path
Select recid, mq1 + m7 where bq1=1 AND b7=1 (replaces m2 + m3 + m6 + m7 and b2=1 AND b3=1 AND b6=1 AND b7=1)
Aggregated Q1: bq1 = b2 AND b3 AND b6
mq1 = m2 + m3 + m6
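The reuse of Q1's aggregate view by Q2 can be traced on the example data. The sketch below first materializes bq1 and mq1, then answers Q2 from them plus the m7 and b7 columns. Rows are represented as dictionaries holding only the columns involved; that layout is an assumption made for this illustration.

```python
# Only the columns needed for Q1 and Q2; None stands in for Null.
rows = [
    {"recid": 1, "m2": 4, "m3": 2, "m6": None, "m7": None, "b2": 1, "b3": 1, "b6": 0, "b7": 0},
    {"recid": 2, "m2": 1, "m3": 2, "m6": 4, "m7": 1, "b2": 1, "b3": 1, "b6": 1, "b7": 1},
    {"recid": 3, "m2": None, "m3": None, "m6": 3, "m7": 1, "b2": 0, "b3": 0, "b6": 1, "b7": 1},
]

# Materialize the aggregate view for Q1 ([ACEF]): a bitmap plus a pre-computed sum.
for row in rows:
    row["bq1"] = row["b2"] & row["b3"] & row["b6"]
    row["mq1"] = (row["m2"] + row["m3"] + row["m6"]) if row["bq1"] else None

# Q2 ([ACEFG]) reuses the view: one extra bitmap test and one extra addition.
q2 = [(row["recid"], row["mq1"] + row["m7"]) for row in rows if row["bq1"] and row["b7"]]

print([(row["recid"], row["mq1"], row["bq1"]) for row in rows])  # [(1, None, 0), (2, 7, 1), (3, None, 0)]
print(q2)                                                        # [(2, 8)]
```

Q2 touches two bitmaps and two measure columns instead of four of each, which is the saving the slide points at.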
Re-use of materialized graph views
• See our past work "Business Intelligence on Complex Graph Data", BEWEB, Berlin, Germany, March 2012
  – How to formulate complex graph expressions using a set of intuitive operators we define
• How to best answer a user query using materialized (Aggregate or not) Graph Views?
  – A simple cost model based on the number of bitmaps required for answering a query
  – Mapped to a set cover problem
  – Solved via a greedy algorithm
  – Details are in the paper
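To make the set cover formulation concrete, here is a sketch of the textbook greedy heuristic applied to bitmap selection: repeatedly pick the bitmap that covers the most still-uncovered query edges. It is not necessarily the exact algorithm from the paper. The function name, the view names and the rule that a view is usable only if all its edges belong to the query are assumptions of this example.

```python
def choose_bitmaps(query_edges, views):
    """Greedily pick bitmaps (views or single-edge bitmaps) that cover all query edges.

    views maps a hypothetical view name to the set of edges it ANDs together.
    """
    query_edges = set(query_edges)
    # Assumption: a view is usable only if it does not require edges outside the query.
    usable = {name: set(edges) for name, edges in views.items() if set(edges) <= query_edges}
    # A plain bitmap column exists for every single edge.
    for edge in query_edges:
        usable[f"b[{edge}]"] = {edge}

    uncovered, chosen = set(query_edges), []
    while uncovered:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(usable), key=lambda name: len(usable[name] & uncovered))
        chosen.append(best)
        uncovered -= usable[best]
    return chosen


views = {"bq1": {"AC", "CE", "EF"}, "bq_other": {"AD", "DE"}}
print(choose_bitmaps({"AC", "CE", "EF", "FG"}, views))  # ['bq1', 'b[FG]']
```

For path [ACEFG] the sketch selects two bitmaps (the view bq1 and the FG bitmap) instead of four single-edge bitmaps, which is the quantity the slide's cost model counts.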
What to materialize?
• Aggressive materialization: Materialize whole queries
  – Often not possible due to space limitations
• Our approach: Query Driven Graph View Selection
• First need to derive a set of candidate views
  – Naïve approach: Consider all subsets of the edges in the Union of all Query Graphs
    • Exponential number of candidates (thus not feasible)
    • Many redundant Views
  – Intuition: Prune candidates based on a monotonicity property
Candidate Generation
[Figure: union of the query graphs over nodes A, B, C, D, E, F, G, H, J]
Frequent Query Set: {[ACEFGHJ], [ADEFGHJ]}
Monotonicity Property: Graph View $G_v'$ supersedes Graph View $G_v$ iff $G_v \subseteq G_v'$ and $\forall G_q : G_v \subseteq G_q \Rightarrow G_v' \subseteq G_q$
Based on this property we only consider the following candidates:
1. Each query graph: {[ACEFGHJ], [ADEFGHJ]}
2. All the subgraphs that are the intersection of 2 query graphs: {[EFGHJ]}
3. All the subgraphs that are the intersection of 2 graphs of the previous step, repeated until no new views are created
View selection from the candidate set is mapped to a set cover problem
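The three candidate-generation steps amount to closing the set of query graphs under pairwise intersection. The sketch below does this with graphs represented as sets of edges. That representation, the helper names and the choice to skip empty intersections are assumptions of this example, which also ignores whether an intersection is connected.

```python
from itertools import combinations


def path_edges(path):
    """Edge set of a node path, e.g. 'EFG' -> {'EF', 'FG'}."""
    return frozenset(path[i:i + 2] for i in range(len(path) - 1))


def candidate_views(query_graphs):
    """Query graphs plus all non-empty intersections, repeated until nothing new appears."""
    candidates = set(query_graphs)
    while True:
        new = {g1 & g2 for g1, g2 in combinations(candidates, 2)}
        new -= candidates
        new.discard(frozenset())
        if not new:
            return candidates
        candidates |= new


queries = [path_edges("ACEFGHJ"), path_edges("ADEFGHJ")]
views = candidate_views(queries)
print(len(views))                     # 3
print(path_edges("EFGHJ") in views)   # True
```

For the slide's frequent query set this yields the two query graphs and their common suffix [EFGHJ], far fewer candidates than enumerating every subset of the union's edges.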
Extensions
All data can be stored in a single relation
rec Id m1 m2 m3 m4 m5 m6 m7 b1 b2 b3 b4 b5 b6 b7
1 3 4 2 1 2 Null Null 1 1 1 1 1 0 0
2 Null 1 2 2 3 4 1 0 1 1 1 1 1 1
3 Null Null Null 5 4 3 1 0 0 0 1 1 1 1
But it can obviously be partitioned into more than one relation
rec Id m1 m2 m3 b1 b2 b3
1 3 4 2 1 1 1
2 Null 1 2 0 1 1
3 Null Null Null 0 0 0
rec Id m4 m5 m6 m7 b4 b5 b6 b7
1 1 2 Null Null 1 1 0 0
2 2 3 4 1 1 1 1 1
3 5 4 3 1 1 1 1 1
Can easily incorporate Specialized Graph Indexes (for example the gIndex)
Experiments
• Graph records from two datasets
  1. NY*: Depicts New York roads
  2. Gnutella**: Describes connections among Gnutella hosts from August 2002
• Experimental evaluation among 4 systems
  – Commercial Row Store Relational DB
  – Column Store Relational DB
  – Neo4j
  – Commercial Native RDF DB
* http://www.dis.uniroma1.it/~challenge9/download.shtml
** http://snap.stanford.edu/data/p2p-Gnutella05.html
Comparison to alternative Systems (no views)
• Our System provides almost constant query times with increasing graph query size as fewer records are retrieved (even though more bitmaps are being used)
• The column store is not affected by increasing density (% of edges in a record)
Benefit of Using Graph Views
• Graph views provide savings of up to 32% in query times
  – There is a mandatory cost for fetching the records that is not affected by materialization
• Thus, more savings are seen in aggregate queries
  – Using 100 aggregate graph views reduces the execution time by 89%
• Larger gains when queries exhibit skew (graphs in the paper)
[Figures: Runtime for 100 uniform Graph Queries; Runtime for 100 uniform Aggregate Graph Queries]
Using Additional Indexes
• gIndex (record driven): trained the index using records that are part of the query result set
  – It took about 24 hours to process about 100,000 records
• Graph views (query driven) result in up to 6 times faster query processing
  – It ran in less than one second
[Figures: gIndex in 100 uniform Graph Queries; gIndex in 100 uniform Aggregate Graph Queries]
Conclusions
• Presented a framework where both data and queries are modeled as abstract graph structures
  – Abstracted two primitive query graphs
  – Introduced two types of Graph Views for expediting queries
  – Discussed an efficient mechanism for selecting a set of non-redundant views
  – Answering queries using Graph Views by solving an instance of a set cover problem
• Argued for a simple yet effective representation of graph records using a flat relational model implemented in a column store
  – Introduced bitmap indexes for efficient query processing
  – Graph Views are stored within the same relational schema
• Presented experimental results using datasets consisting of hundreds of millions of graph records
  – Experimental results show that our platform is orders of magnitude faster than
    • A straightforward relational implementation
    • Alternative systems that natively handle graph data
Thank you,
Dritan Bleco
Questions?