for "Parallelizing Multiple Group-by Queries using MapReduce"

Preview:

Citation preview

Parallelizing Multiple Group-by queries using MapReduce: optimization and cost estimation

Jie Pan§ · Frédéric Magoulès§ · Yann Le Biannic† · Christophe Favart†

§ Ecole Centrale Paris · † SAP Research

Presented by B99705024 林劭軒 · B99705021 李奕德 · R00725051 郗昀彥

Telecommunication Systems 2013

Outline

• MapReduce and Optimized MapReduce

• Cost Estimation

• Experiments and Evaluation

MapReduce

[Figure: the master node splits the input Data into fragments Di and dispatches one Map task per fragment to the worker nodes; each fragment is serialized before being sent.]

serialize :: structured objects → byte stream

de-serialize :: byte stream → structured objects
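A minimal Java sketch of these two operations using standard Java serialization (the experiments later run on Java 1.6 with GridGain, but this helper class and its names are hypothetical, for illustration only):

import java.io.*;

// Hypothetical helper matching the two signatures above.
final class Ser {

    // serialize :: structured objects -> byte stream
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(obj);
        out.close();
        return bytes.toByteArray();
    }

    // de-serialize :: byte stream -> structured objects
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data));
        try {
            return in.readObject();
        } finally {
            in.close();
        }
    }
}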

[Figure, continued: each worker de-serializes its fragment Di and the Map task produces an intermediate result Ii; the intermediate results are sent back to the master node, where the Reducer merges all Ii into the final Result.]
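A schematic Java sketch of this initial flow for one group-by query with a predicate (rows as String[] with hypothetical column indexes; this illustrates the flow in the figure, not GridGain's actual API): the mapper only filters its fragment Di and ships the selected rows as Ii, and all aggregation happens in the Reducer on the master node.

import java.util.*;

final class InitialGroupBy {

    // Map phase, on a worker: filter the fragment Di by the predicate -> intermediate Ii.
    static List<String[]> map(List<String[]> fragment, int predicateCol, String wanted) {
        List<String[]> selected = new ArrayList<String[]>();
        for (String[] row : fragment) {
            if (wanted.equals(row[predicateCol])) {
                selected.add(row);   // every selected row is serialized and sent to the master
            }
        }
        return selected;
    }

    // Reduce phase, on the master: de-serialize all Ii and aggregate, e.g. COUNT per group.
    static Map<String, Integer> reduce(List<List<String[]>> allSelected, int groupByCol) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (List<String[]> ii : allSelected) {
            for (String[] row : ii) {
                Integer c = counts.get(row[groupByCol]);
                counts.put(row[groupByCol], c == null ? 1 : c + 1);
            }
        }
        return counts;
    }
}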

Motivation

• Data Analysis (Business Intelligence)

• Task with Predicates

• High Selectivity => High Communication Cost

• Goal: Reduce the Volume of Intermediate Data


Selectivity = #Data Satisfying Predicates / #Data
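For example, on the 640000-record dataset used in the experiments below, selectivity 0.0106 means 640000 × 0.0106 = 6784 records satisfy the predicates, while selectivity 0.185 means 640000 × 0.185 = 118400 records do — roughly 17 times more intermediate data to ship to the reducer.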

MapCombineReduce (1/2)

[Figure: the same flow, with each Map task additionally emitting a signal along with its intermediate result Ii.]

MapCombineReduce (2/2)

[Figure: a Combiner runs on each worker node; it pre-aggregates the intermediate results Ii of the local mappers into a partial aggregate Ai, and the Reducer on the master node merges the Ai into the final Result.]
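Continuing the sketch above (same hypothetical types, still not GridGain's API): the combiner runs once per worker node and pre-aggregates the local Ii into a single partial aggregate Ai, so each node ships one small map instead of all its selected rows.

import java.util.*;

final class CombinedGroupBy {

    // Combiner, one per worker node: pre-aggregate the local mappers' Ii into Ai.
    static Map<String, Integer> combine(List<List<String[]>> localSelected, int groupByCol) {
        Map<String, Integer> ai = new HashMap<String, Integer>();
        for (List<String[]> ii : localSelected) {
            for (String[] row : ii) {
                Integer c = ai.get(row[groupByCol]);
                ai.put(row[groupByCol], c == null ? 1 : c + 1);
            }
        }
        return ai;   // only this small map crosses the network
    }

    // Reducer, on the master: merge the partial aggregates Ai into the final result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> result = new HashMap<String, Integer>();
        for (Map<String, Integer> ai : partials) {
            for (Map.Entry<String, Integer> e : ai.entrySet()) {
                Integer c = result.get(e.getKey());
                result.put(e.getKey(), c == null ? e.getValue() : c + e.getValue());
            }
        }
        return result;
    }
}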

Cost Estimation

Notations – general

Cost

min Σ (Cst + Cw + Ccl + Ccmm)

[Figure: the initial MapReduce flow shown earlier, repeated as the reference for the cost terms below.]

Initial Build (1/4)

• Creating a mapping; serializing the data

• For all mappers: network factor × (mapper's data transfer cost + result transfer cost)
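A hedged LaTeX reconstruction of this communication term, assembled from the labels above (f_net for the network factor, T_data and T_result for the two transfer costs, and nb_m for the number of mappers, as written nbm on slide (4/4); a sketch, not necessarily the paper's exact formula):

\[
C_{cmm} \;=\; f_{net} \cdot \sum_{i=1}^{nb_m} \left( T_{data}(i) + T_{result}(i) \right)
\]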

Initial Build (2/4)

• On each worker: de-serialize the data, load the fragment into memory

• Filter cost

• Serialize the result

Initial Build (3/4)

• On the master: de-serialize all results

• Aggregation cost over the selected data

Initial Build (4/4)

Simplifying assumptions:

• sizem = 0

• Cmpg * nbm is constant

Optimized Build (1/6)

• New factors: the nodes to be combined, and the size of the combiner's object

• The rest does not change

Optimized Build (2/6)

• Does not change, except that the mapper no longer serializes its result

Optimized Build (3/6)

• The combiner serializes the intermediate (pre-aggregated) result

Optimized Build (4/6)

• The reducer de-serializes the intermediate results

Optimized Build (5/6)

• Communication cost: network factor × (start-to-map + worker-to-combiner + reduce phase)
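Written out in LaTeX, with the T terms as my shorthand for the three named transfer phases (a sketch of the bullet above, not a formula taken from the paper):

\[
C'_{cmm} \;=\; f_{net} \cdot \left( T_{start \to map} + T_{worker \to combiner} + T_{reduce} \right)
\]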

Optimized Build (6/6)

Simplifying assumptions:

• sizem = 0

• sizec = 0

• Cmpg * nbm is constant

Compare

• The factors have changed!

Experiments and Evaluation

Experiments Environment (1/2)

• Running the experiments on Grid'5000 [1]

• 9 sites geographically distributed in France

• featuring 5000 processors

• 1 cluster situated at the Sophia site

• IBM eServer 325

• Total number of nodes in this cluster: 49

[1] https://www.grid5000.fr/

Experiments Environment (2/2)

• Each node is composed of

• 2 AMD Opteron 246 CPUs

• 1 MB of cache, 2 GB of memory

• network: 2× Gigabit Ethernet

• Java 1.6, GridGain 2.1.1

Dataset

• Dataset: 640000 records

• Each record contains 15 columns

• Partitioned with 5 different fragment sizes

• 1000, 2000, 4000, 8000 and 16000

• Queries run with selectivity = 0.0106, 0.099 and 0.185

Experiments

• Run a sequential test on

• 1 machine

• Launch the parallel tests in GridGain on

• 5, 10, 15 and 20 machines

Results - Query Selectivity 0.0106

Results - Query Selectivity 0.099

Results - Query Selectivity 0.185

Results

• When the selectivity is larger, the optimized version speeds up better than the initial version.

• When the query's selectivity is small, only a small amount of data needs to be transferred over the network.

• When the query's selectivity is large, the communication cost becomes dominant.

Scalability

• Use several datasets having the same columns

• composed of 640000, 1280000, 1920000 and 2560000 records

• Fragment size: 16000

• Run the queries with the same selectivity

Conclusion

• MapReduce model

• MapCombineReduce model

• The combiner: a pre-aggregator that aggregates on each worker node

• Reduces the amount of intermediate data transferred over the network

• Cost estimation

• Experimental results

• Better speed-up and scalability for a reasonable selectivity
