Parallelizing Multiple Group-by queries using MapReduce: optimization and cost estimation

Jie Pan § · Frédéric Magoulès § · Yann Le Biannic † · Christophe Favart †

§ Ecole Centrale Paris · † SAP Research

Telecommunication Systems 2013

B99705024 林劭軒 · B99705021 李奕德 · R00725051 郗昀彥

Outline

• MapReduce and Optimized MapReduce

• Cost Estimation

• Experiments and Evaluation

MapReduce

[Figure: the master node splits the data into fragments Di and sends one fragment to a mapper on each worker node.]

serialize :: structured objects → byte stream

de-serialize :: byte stream → structured objects
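
The two arrows above can be illustrated with a small round trip. This is only a sketch: the system in the slides runs on Java and GridGain, and Python's standard `pickle` module is used here as a stand-in. The record layout (group key, measure) is a hypothetical example, not taken from the slides.

```python
import pickle

# Hypothetical fragment: a list of (group key, measure) records.
fragment = [("FR", 10.0), ("DE", 4.5), ("FR", 2.5)]

# serialize :: structured objects -> byte stream
payload = pickle.dumps(fragment)

# de-serialize :: byte stream -> structured objects
restored = pickle.loads(payload)

# The byte stream is what travels between master and worker nodes;
# the receiver rebuilds an equal copy of the original objects.
print(type(payload).__name__, restored == fragment)
```

Both directions cost CPU time on the sender and the receiver, which is why serialization and de-serialization show up as separate terms in the cost estimation later in the deck.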

[Figure, continued: each mapper returns its intermediate result Ii to the master node, where the reducer merges all Ii into the final result.]
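
The data flow on these slides can be sketched in a few lines of single-process Python. This is an illustration, not the authors' GridGain implementation: it assumes that a multiple Group-by query shares one predicate across several Group-by columns, that mappers only filter, and that the reducer computes a SUM. The function names, the column names, and the sample records are all hypothetical.

```python
from collections import defaultdict

def split(records, fragment_size):
    """Master: cut the data set into fragments Di."""
    return [records[i:i + fragment_size]
            for i in range(0, len(records), fragment_size)]

def map_fragment(fragment, predicate):
    """Mapper: filter one fragment; the selected records form Ii."""
    return [r for r in fragment if predicate(r)]

def reduce_all(intermediates, group_columns, measure):
    """Reducer (on the master): one SUM table per Group-by column."""
    tables = {col: defaultdict(float) for col in group_columns}
    for ii in intermediates:
        for r in ii:
            for col in group_columns:
                tables[col][r[col]] += r[measure]
    return {col: dict(t) for col, t in tables.items()}

# Hypothetical sample data.
records = [
    {"country": "FR", "product": "A", "year": 2012, "revenue": 10.0},
    {"country": "DE", "product": "A", "year": 2012, "revenue": 4.0},
    {"country": "FR", "product": "B", "year": 2011, "revenue": 7.0},
    {"country": "FR", "product": "B", "year": 2012, "revenue": 2.0},
    {"country": "DE", "product": "B", "year": 2012, "revenue": 1.0},
]

intermediates = [map_fragment(d, lambda r: r["year"] == 2012)
                 for d in split(records, 2)]
result = reduce_all(intermediates, ["country", "product"], "revenue")
print(result)
```

In this sketch every selected record travels to the master, so the volume of intermediate data grows with the number of records that satisfy the predicate.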

Motivation

• Data Analysis (Business Intelligence)

• Task with Predicates

• High Selectivity => High Communication Cost

• Goal: Reduce the Volume of Intermediate Data

[Figure: the worker nodes send their intermediate results Ii to the master node.]

Selectivity = (#Data Satisfying Predicates) / (#Data)
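
The selectivity ratio can be computed directly from its definition. A minimal sketch, assuming records are dictionaries and the predicate is a Python function; the `year` column and the synthetic data are hypothetical.

```python
def selectivity(records, predicate):
    """Fraction of records that satisfy the predicate."""
    selected = sum(1 for r in records if predicate(r))
    return selected / len(records)

# Hypothetical data: 1000 records spread evenly over the years 2010..2019.
records = [{"year": 2010 + (i % 10)} for i in range(1000)]

s = selectivity(records, lambda r: r["year"] == 2012)
print(s)
```

A higher value means more records pass the filter, so more intermediate data has to leave the worker nodes.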

MapCombineReduce (1/2)

[Figure: the master node sends a fragment Di to each mapper on the worker nodes; each mapper produces its intermediate result Ii and returns only a signal to the master node.]

MapCombineReduce (2/2)

[Figure: a combiner on each worker node aggregates the local intermediate results Ii into Ai; only the Ai are sent to the master node, where the reducer produces the final result.]
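
The combiner step can be sketched as follows. This is a single-process illustration under assumptions, not the authors' GridGain code: the aggregate is assumed to be a SUM, and the worker names, column names, and sample records are hypothetical. Each worker pre-aggregates its own Ii into one partial table Ai, and the reducer only merges the Ai.

```python
from collections import defaultdict

def map_fragment(fragment, predicate):
    """Mapper: filter one fragment; the selected records form Ii."""
    return [r for r in fragment if predicate(r)]

def combine(local_intermediates, group_columns, measure):
    """Combiner (one per worker): pre-aggregate the local Ii into Ai."""
    partial = {col: defaultdict(float) for col in group_columns}
    for ii in local_intermediates:
        for r in ii:
            for col in group_columns:
                partial[col][r[col]] += r[measure]
    return {col: dict(t) for col, t in partial.items()}

def reduce_partials(partials):
    """Reducer (on the master): merge the partial aggregates Ai."""
    merged = defaultdict(lambda: defaultdict(float))
    for ai in partials:
        for col, table in ai.items():
            for key, value in table.items():
                merged[col][key] += value
    return {col: dict(t) for col, t in merged.items()}

# Hypothetical placement of fragments on two workers.
worker_fragments = {
    "worker-1": [
        [{"country": "FR", "product": "A", "year": 2012, "revenue": 10.0},
         {"country": "DE", "product": "A", "year": 2012, "revenue": 4.0}],
        [{"country": "FR", "product": "B", "year": 2011, "revenue": 7.0}],
    ],
    "worker-2": [
        [{"country": "FR", "product": "B", "year": 2012, "revenue": 2.0},
         {"country": "DE", "product": "B", "year": 2012, "revenue": 1.0}],
    ],
}

def in_2012(r):
    return r["year"] == 2012

partials = [
    combine([map_fragment(d, in_2012) for d in fragments],
            ["country", "product"], "revenue")
    for fragments in worker_fragments.values()
]
result = reduce_partials(partials)
print(result)
```

Only one small table per worker crosses the network here, instead of every selected record, which is the reduction in intermediate data that the Motivation slide asks for.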

Cost Estimation

Notations – general

Cost

min ∑ (Cst + Cw + Ccl + Ccmm)

(Cst: start-up, Cw: workers, Ccl: closure, Ccmm: communication)

Initial Build (1/4)

Creating a mapping

Serialize Data

For all mappers

Network Factor

Mapper’s Data Transfer Cost

Result Transfer Cost

Initial Build (2/4)

De-serialize Data

Serialize Result

Fragment

Load to Memory

Filter Cost

Initial Build (3/4)

De-serialize All Results

Selected Data

Aggregation Cost

Initial Build (4/4)

• sizem = 0

• Cmpg * nbm is constant

Optimized Build (1/6)

Nodes to be Combined

Size of Combiner’s Object

Does Not Change

Optimized Build (2/6)

Does Not Change

Does Not Serialize Result

Optimized Build (3/6)

Serialize Intermediate Result

Optimized Build (4/6)

De-serialize Intermediate Result

Optimized Build (5/6)

• Network Factor * (Start to Map + Worker to Combiner + Reduce Phase)
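
The shape of this communication term can be written out as a small function. This is only a sketch of the product shown in the bullet: the slides do not give the individual formulas, so the parameter names, the units (bytes and cost units per byte), and the numbers below are all assumptions chosen for illustration.

```python
def communication_cost(network_factor, start_to_map,
                       worker_to_combiner, reduce_phase):
    """Network factor times the data volume moved in the three transfers."""
    return network_factor * (start_to_map + worker_to_combiner + reduce_phase)

# Hypothetical volumes in bytes and a hypothetical factor of 2 cost units per byte.
cost = communication_cost(
    network_factor=2,
    start_to_map=1000,       # master -> mappers
    worker_to_combiner=200,  # workers -> combiners
    reduce_phase=50,         # combiners -> reducer
)
print(cost)
```

Because the network factor multiplies the whole sum, shrinking any of the three transfers lowers the communication cost proportionally.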

Optimized Build (6/6)

• sizem = 0

• sizec = 0

• Cmpg * nbm is constant

Compare

The factors have changed!

Experiments and Evaluation

Experiments Environment (1/2)

• Running the experiments on Grid'5000 [1]

• 9 sites geographically distributed in France

• featuring 5000 processors

• 1 cluster situated in the Sophia site

• IBM eServer 325

• Total number of nodes in this cluster: 49

[1] https://www.grid5000.fr/

Experiments Environment (2/2)

• Each node is composed of

• 2 CPUs of AMD Opteron 246

• 1 MB of cache, 2 GB of memory

• network: 2xGigabit Ethernet

• Java 1.6, GridGain 2.1.1

Dataset

• Dataset: 640000 records

• Each record contains 15 columns

• Partitioned with 5 different fragment sizes

• 1000, 2000, 4000, 8000 and 16000

• Queries with selectivity = 0.0106, 0.099 and 0.185
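
The figures on this slide fix how many fragments (and therefore map tasks) each configuration produces, and roughly how many records each query selects. The arithmetic below is a sketch that assumes one map task per fragment and applies the selectivity ratio to the whole data set; the variable names are illustrative.

```python
TOTAL_RECORDS = 640_000
FRAGMENT_SIZES = [1000, 2000, 4000, 8000, 16000]
SELECTIVITIES = [0.0106, 0.099, 0.185]

# Number of fragments for each fragment size (every size divides 640000 evenly).
fragments = {size: TOTAL_RECORDS // size for size in FRAGMENT_SIZES}

# Approximate number of records satisfying the predicate for each selectivity.
selected = {s: round(TOTAL_RECORDS * s) for s in SELECTIVITIES}

print(fragments)
print(selected)
```

The smallest fragment size yields 640 fragments and the largest only 40, while the number of selected records ranges from a few thousand to over a hundred thousand across the three queries.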

Experiments

• Run a sequential test on

• 1 machine

• Launch the parallel tests in GridGain on

• 5, 10, 15 and 20 machines
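
Comparing the sequential run with the parallel runs is usually done through the speed-up ratio, sequential time divided by parallel time. The sketch below uses that standard definition. The timings are hypothetical placeholders, NOT measurements from these experiments.

```python
def speed_up(sequential_time, parallel_time):
    """Standard speed-up: how many times faster the parallel run is."""
    return sequential_time / parallel_time

# Hypothetical timings in seconds (placeholders, not measured results).
sequential_time = 100.0
parallel_times = {5: 25.0, 10: 12.5, 15: 10.0, 20: 8.0}

speed_ups = {machines: speed_up(sequential_time, t)
             for machines, t in parallel_times.items()}
print(speed_ups)
```

A speed-up equal to the number of machines would be ideal; communication and serialization costs keep real values below that.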

Results - Query Selectivity 0.0106

Results - Query Selectivity 0.099

Results - Query Selectivity 0.185

Result

• When the selectivity is higher, the optimized version achieves a better speed-up than the initial version.

• When the query's selectivity is low, only a small amount of data needs to be transferred over the network.

• When the query's selectivity is high, the communication cost becomes dominant.

Scalability

• Use several datasets having the same columns

• Composed of 640000, 1280000, 1920000 and 2560000 records

• Fragment: 16000

• Run the queries with the same selectivity

Conclusion

• MapReduce Model

• MapCombineReduce Model

• The combiner: a pre-aggregator that aggregates intermediate results on each worker node

• Reduces the amount of intermediate data transferred over the network

• Cost estimation

• Experimental results

• Better speed-up and scalability for a reasonable selectivity