Optimization of Incremental Queries, CloudMDE 2015

Budapest University of Technology and Economics
Department of Measurement and Information Systems

Optimization of Incremental Queries in the Cloud

József Makai, Gábor Szárnyas, Ákos Horváth, István Ráth, Dániel Varró

Budapest University of Technology and Economics
Fault Tolerant Systems Research Group

INCQUERY-D: DISTRIBUTED INCREMENTAL MODEL QUERIES

Incremental Query Evaluation by RETE

AUTOSAR well-formedness validation rule

[Figure: instance model with a valid and an invalid model fragment; elements: Communication channel, Logical signal, Mapping, Physical signal]

o Fill the input nodes
o Fill the worker nodes
o Read the result set
o Modify the model
o Propagate the changes
o Read the changes in the result set (deltas)

[Figure: Rete network; input nodes (Communication channel, Logical signal, Mapping, Physical signal) feed worker nodes (join, join, antijoin), which produce the result set]
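The propagation steps above can be illustrated with a minimal sketch of two Rete worker nodes. This is not the IncQuery-D implementation: the class names, the tuple shapes, and the `run_demo` helper are assumptions made for illustration, and set semantics are assumed (each tuple is inserted at most once and deleted only if present).

```python
from collections import defaultdict


class JoinNode:
    """Joins left tuples (a, k) with right tuples (k, b) on the shared key k."""

    def __init__(self):
        self.left = defaultdict(set)   # k -> set of a
        self.right = defaultdict(set)  # k -> set of b

    def update_left(self, sign, a, k):
        # sign is +1 for an insertion and -1 for a deletion.
        if sign > 0:
            self.left[k].add(a)
        else:
            self.left[k].discard(a)
        return [(sign, (a, k, b)) for b in self.right[k]]

    def update_right(self, sign, k, b):
        if sign > 0:
            self.right[k].add(b)
        else:
            self.right[k].discard(b)
        return [(sign, (a, k, b)) for a in self.left[k]]


class AntijoinNode:
    """Keeps left tuples whose key has no matching tuple on the right."""

    def __init__(self, key):
        self.key = key                 # extracts the key from a left tuple
        self.left = defaultdict(set)   # key -> left tuples
        self.right = defaultdict(int)  # key -> number of blocking right tuples

    def update_left(self, sign, t):
        k = self.key(t)
        if sign > 0:
            self.left[k].add(t)
        else:
            self.left[k].discard(t)
        return [(sign, t)] if self.right[k] == 0 else []

    def update_right(self, sign, k):
        before = self.right[k]
        self.right[k] = before + sign
        if before == 0 and self.right[k] > 0:   # first blocker: matches vanish
            return [(-1, t) for t in self.left[k]]
        if before > 0 and self.right[k] == 0:   # last blocker gone: matches return
            return [(+1, t) for t in self.left[k]]
        return []


def run_demo():
    join, results, snapshots = JoinNode(), set(), []
    anti = AntijoinNode(key=lambda t: (t[0], t[2]))

    def to_result_set(deltas):
        for sign, t in deltas:
            if sign > 0:
                results.add(t)
            else:
                results.discard(t)
        snapshots.append(set(results))

    def to_antijoin(deltas):
        out = []
        for sign, t in deltas:
            out.extend(anti.update_left(sign, t))
        to_result_set(out)

    to_antijoin(join.update_left(+1, "a1", "k1"))       # fill the input nodes
    to_antijoin(join.update_right(+1, "k1", "b1"))      # a match appears
    to_result_set(anti.update_right(+1, ("a1", "b1")))  # model change blocks it
    to_result_set(anti.update_right(-1, ("a1", "b1")))  # change undone
    return snapshots
```

Each call returns only the deltas caused by one change, so the result set is maintained without re-evaluating the query.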

Goals of IncQuery-D

Objectives
o Distributed incremental pattern matching
o Adaptation of IncQuery tooling to graph DBs
o Executed over cloud infrastructure (COTS hardware)

Achieve scalability by avoiding the memory bottleneck
o Sharding separately
• Data
• Indexers
• Query network
o In memory:
• Index + Query

Assumptions
• All Rete nodes fit on a server node
• Indexers can be filled efficiently
• Modification size ≪ model size
• The application requires the complete result set of the query (as opposed to just one match)

INCQUERY-D Architecture

[Figure: Servers 0 to 3, each with a database shard (0 to 3); labels: Transaction, INCQUERY-D, Rete net, Indexer layer]

o Distributed query evaluation network
o Distributed indexer, model access adapter
o Distributed indexing, notification
o Distributed persistent storage

Distributed production network
• Each intermediate node can be allocated to a different host
• Remote internode communication

INCQUERY-D Architecture

[Figure: Servers 0 to 3, each with a database shard (0 to 3); labels: Transaction, in-memory EMF model, INCQUERY-D; four indexers in the indexer layer feed two join nodes and an antijoin node]

o Rete net: Akka
o Storage: triple store (4store), document DB (Mongo), RDF over column family (Cumulus)

RETE Deployment Process

o Query Language
o Query Predicates
o RETE Structure
o Platform Description
o Allocation / Mapping
o Deployment Descriptor

    pattern routeSensor(sensor: Sensor) = {
        TrackElement.sensor(switch, sensor);
        Switch(switch);
        SwitchPosition.switch(sp, switch);
        SwitchPosition(sp);
        Route.switchPosition(route, sp);
        Route(route);
        neg find head(route, sensor);
    }

    pattern head(R, Sen) = {
        Route.routeDefinition(R, Sen);
    }

[Figure: graph pattern with nodes route: Route, sp: SwitchPosition, switch: Switch, sensor: Sensor and edges switchPosition, switch, sensor, routeDefinition]

RETE Deployment Process

Construct language-independent constraints
Resolution of
o syntactic sugar
o type information

Variables: route, sp, switch
Parameter: sensor

Constraints
o Edge: SwitchPosition.switch
o Edge: TrackElement.sensor
o Edge: Route.switchPosition
o Negation: head
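The variables and constraints listed above can be written down as plain data. The sketch below is only an illustration: the `Edge` and `Negation` classes, the brute-force `matches` function, and the tiny model are hypothetical, and the evaluation is deliberately non-incremental. The `head` entry of the model stands for the pairs related by Route.routeDefinition.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Edge:
    label: str    # reference name, e.g. "Route.switchPosition"
    source: str   # variable bound to the source object
    target: str   # variable bound to the target object


@dataclass(frozen=True)
class Negation:
    pattern: str  # name of the negated pattern
    args: tuple   # variables passed to the negated pattern


PARAMETERS = ("sensor",)
VARIABLES = ("route", "sp", "switch")
CONSTRAINTS = (
    Edge("SwitchPosition.switch", "sp", "switch"),
    Edge("TrackElement.sensor", "switch", "sensor"),
    Edge("Route.switchPosition", "route", "sp"),
    Negation("head", ("route", "sensor")),
)


def holds(constraint, binding, model):
    """Check one constraint against a complete variable binding."""
    if isinstance(constraint, Edge):
        pair = (binding[constraint.source], binding[constraint.target])
        return pair in model.get(constraint.label, set())
    args = tuple(binding[a] for a in constraint.args)
    return args not in model.get(constraint.pattern, set())


def matches(model, objects):
    """Brute-force evaluation: try every binding, return parameter tuples."""
    names = PARAMETERS + VARIABLES
    found = set()
    for values in product(objects, repeat=len(names)):
        binding = dict(zip(names, values))
        if all(holds(c, binding, model) for c in CONSTRAINTS):
            found.add(tuple(binding[p] for p in PARAMETERS))
    return found


# Hypothetical model: one route, one switch position, one switch, one sensor.
OBJECTS = ["r1", "sp1", "sw1", "sen1"]
MODEL = {
    "SwitchPosition.switch": {("sp1", "sw1")},
    "TrackElement.sensor": {("sw1", "sen1")},
    "Route.switchPosition": {("r1", "sp1")},
    "head": set(),  # no Route.routeDefinition edge, so sen1 is a match
}
```

With this model `matches(MODEL, OBJECTS)` reports `sen1`; adding the pair `("r1", "sen1")` to `head` removes the match.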

RETE Deployment Process

Construct RETE structure (platform independently)
Optimizations:
o Model statistics
o Expected usage profile

[Figure: RETE structure with three join nodes]

RETE Deployment Process

Architecture model (cloud infrastructure)
o Virtual machines
• Memory limits
• CPU speed
• Storage capacity
o Communication channels
• Bandwidth

Specified by a textual DSL (Xtext)

[Figure: four machines, numbered 1 to 4]

RETE Deployment Process

Machine   Allocated nodes
1         In1, In2, Join2
2         In3
3         In4
4         Join1, Join3

[Figure: RETE structure (In1 to In4, Join1 to Join3) mapped onto machines 1 to 4]

Allocation can be optimized for query performance and other beneficial system characteristics!

RETE Deployment Process

Configuration scripts for
o Deployment
o Communication middleware

Derived by automated code generation
o Using Eclipse technology: EMF-IncQuery + Xtend
ALLOCATION OPTIMIZATION IN INCQUERY-D

Motivation for Allocation Optimization

Considering data-intensive systems
o Overuse of resources
o Cost of the system
o Overhead of network communication

[Figure: jobs with local job execution time t and data transmission time t’]
Data transmission time is a significant component of global execution time

[Figure: jobs connected by network links]
Network links can have different capacities

[Figure: 4000 MB machine capacity; processes of 2000 MB, 500 MB, and 2400 MB; cost shown as $$$]
Poor utilization leads to an expensive system

The Allocation Problem

Inputs
• Rete network for the query, organized into processes
• Resource consumption
• Available infrastructure with important resource parameters
Allocation constraints
Output: valid allocation
Optimization targets

[Figure: processes 1 to 4 (500 MB, 3200 MB, 2400 MB, 600 MB) built from input, worker, and production nodes, to be allocated to machines 1 and 2 (5000 MB and 6000 MB)]
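A minimal sketch of the memory constraint, using the figures from this slide. The process names `p1` to `p4`, the machine names `m1` and `m2`, and the assignment of memory sizes to names are assumptions, and exhaustive enumeration stands in for a real solver.

```python
from itertools import product

# Memory figures from the slide, in MB; the names are hypothetical.
PROCESS_MEM = {"p1": 500, "p2": 3200, "p3": 2400, "p4": 600}
MACHINE_MEM = {"m1": 5000, "m2": 6000}


def is_valid(allocation, process_mem=PROCESS_MEM, machine_mem=MACHINE_MEM):
    """An allocation maps every process to one machine within its memory limit."""
    if set(allocation) != set(process_mem):
        return False
    used = dict.fromkeys(machine_mem, 0)
    for proc, machine in allocation.items():
        used[machine] += process_mem[proc]
    return all(used[m] <= machine_mem[m] for m in machine_mem)


def valid_allocations(process_mem=PROCESS_MEM, machine_mem=MACHINE_MEM):
    """Enumerate all valid allocations (feasible only for tiny inputs)."""
    procs = list(process_mem)
    for choice in product(machine_mem, repeat=len(procs)):
        allocation = dict(zip(procs, choice))
        if is_valid(allocation, process_mem, machine_mem):
            yield allocation
```

For example, placing `p1` and `p2` on `m1` and `p3` and `p4` on `m2` uses 3700 MB and 3000 MB and is valid, while placing all four processes (6700 MB) on one machine is not. An optimization target then selects one allocation among the valid ones.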

Opt. Target: Communication Minimization

[Figure: processes 1 to 4 (500 MB, 3200 MB, 2400 MB, 600 MB) built from worker and production nodes, connected by channels with volumes 1,000,000, 200,000, and 200,000, allocated to machines 1 and 2 (5000 MB and 6000 MB)]

First allocation:
1 × 1,000,000 + 3 × 200,000 + 3 × 200,000
Communication = 2,200,000

Second allocation:
3 × 1,000,000 + 1 × 200,000 + 1 × 200,000
Communication = 3,400,000

Largest volume of data is sent through the faster local link
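The two totals can be reproduced with a small weighted sum. This is a sketch under assumptions: the weights 1 (local) and 3 (remote) and the channel volumes come from the slide, while the chain topology, the process names, and the machine names are hypothetical.

```python
LOCAL_WEIGHT = 1   # weight of a channel between processes on the same machine
REMOTE_WEIGHT = 3  # weight of a channel that crosses machines

# Hypothetical topology: a chain p1 - p2 - p3 - p4 with the slide's volumes.
CHANNELS = [
    ("p1", "p2", 200_000),
    ("p2", "p3", 1_000_000),
    ("p3", "p4", 200_000),
]


def communication_cost(allocation, channels=CHANNELS):
    """Sum of channel volumes, each weighted by the link it travels over."""
    total = 0
    for src, dst, volume in channels:
        same_machine = allocation[src] == allocation[dst]
        total += (LOCAL_WEIGHT if same_machine else REMOTE_WEIGHT) * volume
    return total


# Largest channel kept local, both small channels remote.
BIG_LOCAL = {"p1": "m1", "p2": "m2", "p3": "m2", "p4": "m1"}
# Largest channel remote, both small channels local.
BIG_REMOTE = {"p1": "m1", "p2": "m1", "p3": "m2", "p4": "m2"}
```

Under these assumptions `communication_cost(BIG_LOCAL)` evaluates to 2,200,000 and `communication_cost(BIG_REMOTE)` to 3,400,000, matching the two totals on the slide.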

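Opt. Target: Cost Minimization

[Figure: processes 1 to 4 (500 MB, 3200 MB, 2400 MB, 600 MB) built from worker and production nodes, allocated to machines 1 (4000 MB, $5), 2 (4000 MB, $5), and 3 (6500 MB, $7)]

o Allocation using the two $5 machines: Cost = 10
o Allocation using one $5 machine and the $7 machine: Cost = 12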
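The cost comparison can be checked by exhaustive search over this small instance. The sketch below is an illustration only: brute force replaces the CPLEX-based optimization mentioned in the conclusion, and the process and machine names are hypothetical. A machine is paid for once if at least one process is allocated to it.

```python
from itertools import product

PROCESS_MEM = {"p1": 500, "p2": 3200, "p3": 2400, "p4": 600}  # MB
# machine -> (memory in MB, price in $), figures from the slide
MACHINES = {"m1": (4000, 5), "m2": (4000, 5), "m3": (6500, 7)}


def cheapest_allocation(process_mem, machines):
    """Return (cost, allocation) with minimal total price, or None if none fits."""
    best = None
    procs = list(process_mem)
    for choice in product(machines, repeat=len(procs)):
        used = {}
        for proc, machine in zip(procs, choice):
            used[machine] = used.get(machine, 0) + process_mem[proc]
        if any(mem > machines[m][0] for m, mem in used.items()):
            continue  # violates a memory limit
        cost = sum(machines[m][1] for m in used)  # pay only for used machines
        if best is None or cost < best[0]:
            best = (cost, dict(zip(procs, choice)))
    return best
```

The processes need 6700 MB in total, so no single machine suffices. The search therefore settles on the two $5 machines (cost 10), and restricting it to `m1` and `m3` gives cost 12.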

Heuristics in Optimization

Memory consumption of Rete nodes and processes
o Memory usage of input nodes can be estimated from the number of model elements in the model database

Communication intensity of network communication channels

[Figure: Rete network of input, worker, and production nodes, annotated with numeric labels 1 to 4]

Performance Impact of Optimization

[Figure: first evaluation time of a complex query; x-axis: model size (number of elements): 61K, 213K, 867K, 3M, 13M; y-axis: time (sec), ticks from 28 to 739; series: naive optimization, communication optimization; marker: max. memory; labeled values: 739, 616, 194, 144]

2 minutes gain!

This approach doesn’t work for larger models!

Network Traffic Statistics

Network traffic in megabytes

              Remote   Local   Total
Unoptimized   1020     90      1110
Optimized     875      234     1109

Remote traffic per VM (MB)

              vm0   vm1   vm2   total
Unoptimized   300   349   371   1020
Optimized     248   280   347   875

[Figure: bar chart of remote and local traffic per VM, unoptimized vs. optimized; remaining bar labels: 142, 74, 24, 20, 190]
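The totals above can be checked with a few lines of arithmetic. The dictionary layout and function names are assumptions for illustration; the numbers are the ones reported on the slide.

```python
# Traffic in MB, as reported on the slide.
TRAFFIC_MB = {
    "unoptimized": {"remote": 1020, "local": 90},
    "optimized": {"remote": 875, "local": 234},
}


def total(run):
    """Total traffic of one run: remote plus local."""
    return sum(TRAFFIC_MB[run].values())


def remote_share(run):
    """Fraction of the run's traffic that crosses machine boundaries."""
    return TRAFFIC_MB[run]["remote"] / total(run)


def remote_saved_mb():
    """Remote traffic moved off the network by the optimized allocation."""
    return TRAFFIC_MB["unoptimized"]["remote"] - TRAFFIC_MB["optimized"]["remote"]
```

The total stays essentially constant (1110 MB vs. 1109 MB), while 145 MB moves from remote links to local ones.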

Conclusion and Future Work

Results
o Novel approach for application-specific resource allocation optimization for distributed Rete
o CPLEX-based implementation for IncQuery-D
o Preliminary evaluation results
• Significant improvements for local resource management
• Performance gains especially over slow / inhomogeneous networks
• Efficient optimization execution (supported by runtime cutoff in CPLEX)

Future work
o Hadoop / YARN support (new IncQuery-D developments)
• Support configuration optimization for other Hadoop-based cloud apps
o Static allocation → dynamic reallocation
• Take existing configuration as a starting constraint set
• Optimize for changed workload conditions

New INCQUERY-D Architecture

New INCQUERY-D: “Hadoop over Docker”

[Figure: Docker containers 0 to 3, each with a database shard (0 to 3); labels: Transaction, in-memory EMF model; four indexers in the indexer layer feed two join nodes and an antijoin node]

• YARN resource management
• ZooKeeper monitoring

Akka actors embedded into long-running Hadoop jobs