Distributed Query Processing, based on “The State of the Art in Distributed Query Processing” by Donald Kossmann (ACM Computing Surveys, 2000)


Motivation

• Cost and scalability: networks of off-the-shelf machines

• Integration of software from different vendors (each with its own DBMS)

• Integration of legacy systems

• Inherently distributed applications, such as workflow or collaborative design

• State-of-the-art distributed information technologies (e-business)


Part 1: Basics

• Query processing basics
  – centralized query processing
  – distributed query processing


Problem Statement

• Input: a query such as “Biological objects in study A referenced in the literature in journal Y”

• Output: answer

• Objectives
  – response time, throughput, first answers, little I/O, ...

• Centralized vs. distributed query processing
  – same basic problem
  – but more and different parameters (such as data sites or available machine power) and objectives


Steps in Query Processing

• Input: declarative query
  – SQL, XQuery, ...

• Step 1: Translate query into algebra
  – Tree of operators (query plan generation)

• Step 2: Optimize query
  – Tree of operators (logical); also select partitions of table
  – Tree of operators (physical); also site annotations
  – (Compilation)

• Step 3: Execution
  – Interpretation; query result generation


Algebra

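– relational algebra for SQL: very well understood
– algebra for XQuery: mostly understood

Example query:

    SELECT A.d
    FROM A, B
    WHERE A.a = B.b AND A.c = 35

Operator tree (top to bottom): projection on A.d; selection A.a = B.b AND A.c = 35; cross product (×) of A and B.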
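The operator tree for the example query can be evaluated literally, one algebra operator at a time. The following is a minimal sketch, assuming small in-memory relations; the sample rows and the helper names `cross`, `select` and `project` are illustrative and not part of the slides.

```python
from itertools import product

# Hypothetical sample data; column names follow the query on the slide.
A = [{"a": "CS", "c": 35, "d": "John"},
     {"a": "EE", "c": 35, "d": "Mary"},
     {"a": "CS", "c": 40, "d": "Ann"}]
B = [{"b": "CS"}, {"b": "AS"}]

def cross(left, right):
    """A x B: pair every tuple of A with every tuple of B."""
    for l, r in product(left, right):
        yield {"A": l, "B": r}

def select(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))

def project(rows, column):
    """Keep a single column of every row."""
    return [column(row) for row in rows]

# projection(A.d) over selection(A.a = B.b AND A.c = 35) over A x B
result = project(
    select(cross(A, B),
           lambda t: t["A"]["a"] == t["B"]["b"] and t["A"]["c"] == 35),
    lambda t: t["A"]["d"],
)
print(result)  # ['John']
```

This unoptimized plan materializes every pair of the cross product before filtering, which is exactly the cost that query optimization tries to avoid.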


Query Optimization

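– logical optimization, e.g., push down cheap predicates
– enumerate alternative plans, apply cost model
– use search heuristics to find the cheapest plan

Logical plan (top to bottom): projection on A.d; selection A.a = B.b AND A.c = 35; cross product (×) of A and B.

Physical plan (top to bottom): projection on A.d; hash join; the join inputs are an index lookup on A.c and a projection on B.b over a scan of B.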


Basic Query Optimization

• Classical dynamic programming algorithm
  – Performs join order optimization
  – Input: join query on n relations
  – Output: best join order


The Dynamic Programming Algorithm

    for i = 1 to n do {
        optPlan({Ri}) = accessPlans(Ri)
        prunePlans(optPlan({Ri}))
    }
    for i = 2 to n do
        for all S ⊆ {R1, R2, …, Rn} such that |S| = i do {
            optPlan(S) = ∅
            for all O ⊂ S do {
                optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S – O))
                prunePlans(optPlan(S))
            }
        }
    return optPlan({R1, R2, …, Rn})
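The enumeration above can be sketched as runnable code. This is a simplified illustration under stated assumptions, not the algorithm exactly as given: the cardinalities and selectivities are hypothetical, the cost of a plan is taken to be the sum of its intermediate result sizes, and pruning keeps only the single cheapest plan per relation set instead of a set of plans.

```python
from itertools import combinations

# Hypothetical cardinalities and join selectivities (not from the slides).
CARD = {"R1": 1000, "R2": 100, "R3": 10}
SEL = {frozenset(["R1", "R2"]): 0.01, frozenset(["R2", "R3"]): 0.1}

def result_size(rels):
    """Estimated size of joining `rels`: cardinalities times selectivities."""
    size = 1.0
    for r in rels:
        size *= CARD[r]
    for pair, s in SEL.items():
        if pair <= rels:          # both relations of the predicate are present
            size *= s
    return size

def best_join_order(relations):
    # opt[S] = (cost, plan) of the cheapest plan found for relation set S
    opt = {frozenset([r]): (0.0, r) for r in relations}      # access plans
    for i in range(2, len(relations) + 1):
        for subset in combinations(relations, i):
            s = frozenset(subset)
            for k in range(1, i):                            # all O that are proper subsets of S
                for left in combinations(subset, k):
                    o = frozenset(left)
                    (c1, p1), (c2, p2) = opt[o], opt[s - o]
                    cost = c1 + c2 + result_size(s)
                    if s not in opt or cost < opt[s][0]:     # prune: keep the cheapest
                        opt[s] = (cost, (p1, p2))
    return opt[frozenset(relations)]

cost, plan = best_join_order(["R1", "R2", "R3"])
print(cost, plan)
```

With these numbers the cheapest plan joins R2 and R3 first, because that intermediate result is the smallest. The nested loops visit every subset and every split of it, which is why the time and space requirements grow exponentially with the number of relations.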


Query Execution

– library of operators (hash join, merge join, ...)
– exploit indexes and clustering in the database
– pipelining (iterator model)

Example run of the physical plan, with the tuples each operator produces:

– index A.c: (John, 35, CS), (Mary, 35, EE)
– B: (Edinburgh, CS, 5.0), (Edinburgh, AS, 6.0)
– B.b: (CS), (AS)
– hashjoin: (John, 35, CS)
– A.d: John
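The iterator model can be illustrated with Python generators, where each operator pulls tuples from its inputs on demand. This is a minimal sketch using the tuples of the example; the column positions and the function names are assumptions made for illustration, and the index lookup is simulated by a filtered scan.

```python
# Sample tuples from the example; the column layout (d, c, a) for A and
# (city, b, score) for B is an assumption inferred from the results shown.
A = [("John", 35, "CS"), ("Mary", 35, "EE")]
B = [("Edinburgh", "CS", 5.0), ("Edinburgh", "AS", 6.0)]

def index_scan_a(c_value):
    """Stand-in for an index lookup on A.c (here: a filtered scan)."""
    for row in A:
        if row[1] == c_value:
            yield row

def scan_b():
    yield from B

def project(rows, columns):
    for row in rows:
        yield tuple(row[i] for i in columns)

def hash_join(probe_rows, build_rows, probe_key, build_key):
    """Build a hash table on one input, then stream the other input past it."""
    table = {}
    for row in build_rows:                       # build phase (blocking)
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                       # probe phase (pipelined)
        for _match in table.get(row[probe_key], []):
            yield row

plan = project(
    hash_join(index_scan_a(35), project(scan_b(), [1]),
              probe_key=2, build_key=0),
    [0],
)
result = list(plan)   # pulling from the root drives the whole plan
print(result)         # [('John',)]
```

No operator runs until the root is asked for a tuple; only the build phase of the hash join has to consume an entire input before the first result can flow.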


Summary : Centralized Queries

• Basic SQL (SPJG, nesting) well understood

• Very good extensibility
  – spatial joins, time series, UDFs, XQuery, etc.

• Current problems
  – Better statistics: cost model for optimization
  – Physical database design is expensive and complex

• Some trends
  – interactiveness during execution
  – approximate answers, top-k
  – self-tuning capabilities (adaptive, robust, etc.)


Distributed Query Processing: Basics

• Idea: extension of centralized query processing (System R* and others in the 1980s)

• What is different?
  – extended physical algebra: send and receive operators
  – other metrics: optimize for response time
  – resource vectors, network interconnect matrix
  – caching and replication
  – less predictability in the cost model (adaptive algorithms)
  – heterogeneity in data formats and data models


Issues in Distributed Databases

• Plan enumeration
  – The time and space complexity of the traditional dynamic programming algorithm is very large
  – Iterative Dynamic Programming (heuristic for large queries)

• Cost models
  – Classic cost model
  – Response time model
  – Economic models


Distributed Query Plan

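Figure: the physical plan extended with communication operators. The projection on A.d sits on top of the hash join; each join input arrives through a receive operator paired with a send operator, one above the index lookup on A.c and one above the projection on B.b over B.

Forms of parallelism?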


Cost : Resource Utilization

Figure: the distributed plan with each operator annotated with its cost (1, 8, 2, 5, 10, 1, 6, 1, 6).

Total cost = sum of the costs of all operators

Cost = 40


Another Metric : Response Time

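Figure: the distributed plan with each operator annotated with (time of first tuple, time of last tuple): (25, 33), (24, 32), (0, 12), (0, 5), (0, 10), (0, 7), (0, 24), (0, 6), (0, 18).

– Whole plan: total cost = 40, first tuple = 25, last tuple = 33
– Reading an annotation: (0, 10) means first tuple = 0, last tuple = 10
– Forms of parallelism marked in the figure: pipelined parallelism, independent parallelism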
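The difference between the two metrics can be sketched with a small plan tree. The model below is a deliberate simplification and an assumption of this example: sibling subtrees run fully in parallel at different sites, an operator starts only after its slowest input has finished (so pipelining is ignored), and the operator costs are hypothetical rather than the numbers in the figures.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    cost: float                       # hypothetical work units
    children: list = field(default_factory=list)

def total_cost(op):
    """Resource utilization: every operator's work is added up."""
    return op.cost + sum(total_cost(c) for c in op.children)

def response_time(op):
    """Simplified model: siblings run in parallel, and an operator starts
    only after its slowest input has finished (no pipelining)."""
    return op.cost + max((response_time(c) for c in op.children), default=0)

plan = Op("join", 8, [
    Op("receive", 1, [Op("send", 1, [Op("scan A", 5)])]),
    Op("receive", 1, [Op("send", 1, [Op("scan B", 10)])]),
])
print(total_cost(plan), response_time(plan))  # 27 20
```

Under this model only the slower branch contributes to response time, so a plan with higher total cost can still answer sooner if its work is spread over more sites.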


Query Execution Techniques for Distributed Databases

• Row Blocking

• Multi-cast optimization

• Multi-threaded execution

• Joins with horizontal partitioning

• Semi joins

• Top n queries


Query Execution Techniques for Distributed Databases

• Row blocking
  – SEND and RECEIVE operators in the query plan model communication
  – Implemented on top of TCP/IP, UDP, etc.
  – Ship tuples block-wise (in batches); this smooths out burstiness
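Row blocking can be sketched as a pair of send and receive operators that exchange batches instead of single tuples. This is a minimal illustration; the function names and the block size are assumptions, and the network transport itself is not modelled.

```python
from itertools import islice

def send_blocks(rows, block_size):
    """Sender side: group tuples into blocks so each message carries many rows."""
    it = iter(rows)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block            # one network message per block

def receive(blocks):
    """Receiver side: unpack blocks and hand tuples to the consumer one at a time."""
    for block in blocks:
        yield from block

rows = [(i, f"name{i}") for i in range(10)]
messages = list(send_blocks(rows, block_size=4))
print(len(messages))                    # 3 messages instead of 10
print(list(receive(messages)) == rows)  # True
```

A larger block size means fewer messages and lets the receiver keep working from its buffer when the sender stalls, at the price of a longer wait for the first tuple.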


Query Execution Techniques for Distributed Databases

• Multi-cast optimization
  – The location of the sending and receiving sites may affect communication costs; forwarding versus multi-casting

• Multi-threaded execution
  – Several threads for operators at the same site (intra-query parallelism)
  – May be useful to enable concurrent reads from diverse machines (while continuing query processing)
  – Must consider whether resources warrant concurrent operator execution (say, two sorts each needing all memory)


Query Execution Techniques for Distributed Databases

• Joins with (horizontal) data partitioning
  – Hash-based partitioning to conduct joins on independent partitions

• Semi joins
  – Reduce communication costs: send only the “join keys” instead of complete tuples to the remote site, which extracts the relevant join partners

• Double-pipelined hash joins
  – Non-blocking join operators deliver first results quickly, fully exploit pipelined parallelism, and reduce overall response time

• Top n queries
  – Isolate the top n tuples quickly and perform other expensive operations (like sort, join, etc.) only on those few (use “stop” operators)
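The semi-join idea can be sketched in three steps. This is an illustrative example with hypothetical relations at two sites; the site names, the sample rows and the column positions are assumptions, and "shipping" is simulated by passing Python values.

```python
# Hypothetical relations stored at two different sites.
SITE1_A = [("John", 35, "CS"), ("Mary", 35, "EE"), ("Ann", 40, "CS")]
SITE2_B = [("Edinburgh", "CS", 5.0), ("Edinburgh", "AS", 6.0),
           ("Glasgow", "EE", 4.0), ("Leeds", "PH", 3.0)]

def semi_join(a_rows, b_rows):
    # Step 1 (site 1 -> site 2): ship only the distinct join keys of A.
    keys = {a[2] for a in a_rows}
    # Step 2 (site 2): keep the B tuples that have a join partner in A.
    reduced_b = [b for b in b_rows if b[1] in keys]
    # Step 3 (site 2 -> site 1): ship the reduced B and finish the join locally.
    by_key = {}
    for b in reduced_b:
        by_key.setdefault(b[1], []).append(b)
    joined = [a + b for a in a_rows for b in by_key.get(a[2], [])]
    return keys, reduced_b, joined

keys, reduced_b, joined = semi_join(SITE1_A, SITE2_B)
print(sorted(keys))      # ['CS', 'EE']
print(len(reduced_b))    # 2 of the 4 B tuples cross the network
print(len(joined))       # 3
```

The technique pays off when the key set is much smaller than the tuples it replaces and when many tuples at the remote site have no join partner.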


Adaptive Algorithms

• Deal with unpredictable events at run time
  – delays in arrival of data, burstiness of network
  – autonomy of nodes, changes in policies

• Example: double-pipelined hash joins
  – build hash table for both input streams
  – read inputs in separate threads
  – good for bursty arrival of data

• Re-optimization at run time (LEO, etc.)
  – monitor execution of query
  – adjust estimates of cost model
  – re-optimize if delta is too large
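The double-pipelined hash join mentioned above can be sketched as follows. This is a single-threaded illustration under an explicit assumption: the two inputs, which a real system would read in separate threads, are represented by one interleaved list of arrival events. The function name and the sample rows are hypothetical.

```python
def double_pipelined_hash_join(arrivals, left_key, right_key):
    """`arrivals` is an interleaved stream of ("L", row) / ("R", row) events,
    standing in for two inputs that are read by separate threads."""
    left_table, right_table = {}, {}
    for side, row in arrivals:
        if side == "L":
            key = row[left_key]
            left_table.setdefault(key, []).append(row)   # insert into own table
            for other in right_table.get(key, []):       # probe the other table
                yield row + other
        else:
            key = row[right_key]
            right_table.setdefault(key, []).append(row)
            for other in left_table.get(key, []):
                yield other + row

arrivals = [
    ("R", ("Edinburgh", "CS", 5.0)),
    ("L", ("John", 35, "CS")),     # first result is produced here
    ("L", ("Mary", 35, "EE")),
    ("R", ("Edinburgh", "AS", 6.0)),
]
results = list(double_pipelined_hash_join(arrivals, left_key=2, right_key=1))
print(results)  # [('John', 35, 'CS', 'Edinburgh', 'CS', 5.0)]
```

Because every arriving tuple is inserted into its own table and probed against the other, a result is emitted as soon as both partners have arrived, regardless of which input is delayed. The cost is memory for two hash tables instead of one.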


Special Techniques for Client-Server Architectures

• Shipping techniques
  – Query shipping
  – Data shipping
  – Hybrid shipping

• Query optimization
  – Site selection
  – Where to optimize
  – Two-phase optimization


Special Techniques for Federated Database Systems

• Wrapper architecture

• Query optimization
  – Query capabilities
  – Cost estimation

• Calibration Approach

• Wrapper Cost Model

• Parameter Binding


Heterogeneity

• Use wrappers to “hide” heterogeneity

• Wrappers take care of data format, packaging

• Wrappers map from local to global schema

• Wrappers carry out caching
  – connections, cursors, data, ...

• Wrappers map queries into local dialect

• Wrappers participate in query planning!
  – define the subset of queries that can be handled
  – give cost information, statistics
  – “capability-based rewriting”
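How a wrapper's declared capabilities can influence planning is sketched below. The `Wrapper` class, its methods and the `middleware_select` function are hypothetical names invented for this illustration; they do not correspond to the interface of any particular product.

```python
def predicate(op, value):
    """Build a value test for the two comparison operators in this sketch."""
    return {"=": lambda x: x == value, "<": lambda x: x < value}[op]

class Wrapper:
    """Hypothetical wrapper interface; all names here are illustrative."""
    def __init__(self, rows, supported_ops):
        self.rows = rows
        self.supported_ops = supported_ops      # e.g. {"="} or {"=", "<"}

    def can_handle(self, op):
        return op in self.supported_ops

    def execute(self, column=None, op=None, value=None):
        """Run a scan, optionally with one pushed-down predicate."""
        if op is None:
            return list(self.rows)
        test = predicate(op, value)
        return [r for r in self.rows if test(r[column])]

def middleware_select(wrapper, column, op, value):
    """Capability-based rewriting: push the predicate down if the source
    supports it, otherwise fetch everything and filter in the middleware."""
    if wrapper.can_handle(op):
        return wrapper.execute(column, op, value), "pushed down"
    test = predicate(op, value)
    return [r for r in wrapper.execute() if test(r[column])], "filtered in middleware"

rows = [{"name": "John", "age": 35}, {"name": "Mary", "age": 28}]
full_sql = Wrapper(rows, {"=", "<"})
key_value_store = Wrapper(rows, {"="})
print(middleware_select(full_sql, "age", "<", 30))
print(middleware_select(key_value_store, "age", "<", 30))
```

Both calls return the same rows, but the second one transfers the whole source to the middleware first, which is why the optimizer needs to know each wrapper's capabilities and costs.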


Middleware

• Two kinds of middleware
  – data warehouses
  – virtual integration

• Data warehouses
  – good: query response times
  – good: materializes results of data cleaning
  – bad: high resource requirements in middleware
  – bad: staleness of data

• Virtual integration
  – the opposite
  – caching possible to improve response times


Virtual Integration

Figure: a query enters the middleware (query decomposition, result composition); the middleware sends one subquery through a wrapper to DB1 and another through a wrapper to DB2.


IBM Data Joiner

Figure: an SQL query enters Data Joiner, which sends subqueries through wrappers to two SQL databases (SQL DB1 and SQL DB2).


Adding XML

Figure: a query enters SQL middleware, which sends subqueries through wrappers to DB1 and DB2; an XML publishing component is added to this architecture.


XML Data Integration

Figure: an XML query enters XML middleware, which sends XML queries through wrappers to DB1 and DB2.


XML Data Integration

• Example: BEA Liquid Data

• Advantage
  – Availability of XML wrappers for all major databases

• Problems
  – XML-to-SQL mapping is very difficult
  – XML is not always the right language (e.g., decision-support-style queries)


Summary

• Middleware looks like a homogeneous centralized database
  – location transparency
  – data model transparency

• Middleware provides a global schema
  – data sources map local schemas to the global schema

• Various kinds of middleware (SQL, XML)

• “Stacks” of middleware possible

• Data cleaning requires special attention