Query Optimization In Compressed Database Systems
Zhiyuan Chen and Johannes Gehrke, Cornell University
Flip Korn, AT&T Labs

Why Compression?

CPU speed outpaces disk speed exponentially!
– x10 / decade (bandwidth), x100 / decade (latency)

Trade CPU for I/O: improve query performance
+ Save bandwidth for sequential I/O
+ Improve buffer pool hit ratio
- Pay decompression cost

Environment
– Decision support queries
– Lossless compression

Issues

Database compression methods

Efficient query processing

Database Compression Methods

General-purpose compression
– Only compression ratio matters
– Large decompression unit (whole file)

Database compression
– Both compression ratio and decompression cost matter
– Small decompression unit (attribute or tuple)

Our setting: a single attribute can be decompressed on its own

Efficient Query Processing

Compared to uncompressed DB
– When to decompress
– Assumption: no compression in query processing

Our story
– Different strategies of when to decompress
– None of them is always optimal
– Combined optimization problem: query plan + decompression placement
– Solutions
– Experiments

Different Decompression Strategies

[Diagram: three plans for the join of R and S on R.A = S.B, drawn across a Disk / Mem boundary]
– Eager: D(R) and D(S) are applied to the input relations. All attributes uncompressed.
– Lazy: D(R.A) and D(S.B) are applied before the join R.A = S.B. A and B uncompressed.
– Transient: the join evaluates d(R.A) = d(S.B) directly on R and S. All attributes compressed.

Which Strategy Is Optimal?

Lazy vs. eager
– Lazy is always better

Transient vs. lazy
– Transient: more I/O savings
– Lazy: lower decompression cost

In practice
– Numerical attributes: transient is always better
– String attributes: no clear winner
  • Expensive to decompress
  • High I/O savings if compressed

An Example With TPC-H Data

Select S_NAME, S_ADDRESS, C_NAME, C_PHONE
From Supplier, Customer
Where S_ADDRESS = C_ADDRESS
Order by S_NAME, C_NAME

Plan: join Supplier and Customer on S_A = C_A, then Sort(S_N, C_N)

Transient vs. Lazy

Three plans for the example query (BNL join, then sort):
– Lazy BNL (2s), lazy sort (7s): 1 attribute compressed
– Lazy BNL (2s), transient sort (3s): 3 attributes compressed
– Transient BNL (42s), transient sort (0.5s): all attributes compressed

An optimization problem!
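The timings on this slide can be totaled per plan to see why no single strategy wins. This is a small sketch that assumes the cost of a plan is simply the sum of its join and sort times; the slide reports only the per-operator figures, and the dictionary keys are labels chosen here.

```python
# (join seconds, sort seconds) per plan, copied from the slide.
plans = {
    "lazy BNL + lazy sort": (2.0, 7.0),
    "lazy BNL + transient sort": (2.0, 3.0),
    "transient BNL + transient sort": (42.0, 0.5),
}

# Assumption: total plan cost is the sum of the two operator times.
totals = {name: join + sort for name, (join, sort) in plans.items()}
best = min(totals, key=totals.get)
```

Under that assumption the totals are 9 s, 5 s and 42.5 s, so the mixed plan (lazy join, transient sort) is cheapest of the three.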

Interactions With Traditional Optimization

Algorithm: run System R, then decide when to decompress.

Optimal plan returned by System R is no longer optimal!
– Lazy BNL (2s), transient sort (3s): 3 attributes compressed
– Transient SM (2.5s), transient sort (0.5s): all attributes compressed (pruned by System R)
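A short calculation shows what the two-step approach loses on this example. It is a sketch under two assumptions: plan cost is the sum of the operator times shown, and the sort-merge plan is the one System R pruned, so the two-step search never sees it. The variable names are made up for illustration.

```python
# Operator timings in seconds, copied from the slide.
surviving = {"lazy BNL + transient sort": 2.0 + 3.0}
pruned = {"transient SM + transient sort": 2.5 + 0.5}

# Two-step: only plans that survived System R's pruning are candidates.
two_step_cost = min(surviving.values())

# Combined search: join method and decompression placement chosen together.
combined_cost = min({**surviving, **pruned}.values())

relative_cost = two_step_cost / combined_cost
```

Here the two-step plan costs 5 s against 3 s for the combined search, a relative cost of about 1.67.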

Compression Aware Optimization

Given a query and a compressed DB: find the optimal query plan

New operators
– Explicit decompression operators
– Transient versions of existing relational operators

Search space: O(n^m) factor over old search space
– n is the depth of the plan
– m is the number of attributes
– Each attribute explicitly decompressed at most once
– For each attribute, n places to decompress explicitly
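The size of the extended search space follows from a simple count: each of the m attributes independently gets one of n places for its explicit decompression operator. The sketch below enumerates those choices; the function name and the attribute list are hypothetical, and it counts exactly n places per attribute as the slide does.

```python
from itertools import product


def decompression_placements(attributes, depth):
    """Yield one dict per way of picking a plan level (0..depth-1) per attribute."""
    for levels in product(range(depth), repeat=len(attributes)):
        yield dict(zip(attributes, levels))


# Two join attributes in a plan of depth 3: 3 ** 2 = 9 placements.
placements = list(decompression_placements(["S_A", "C_A"], depth=3))
```

For a single attribute in a plan of depth 3 the same function yields 3 placements, and every further attribute multiplies the count by n.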

Dynamic Programming - OPT

Extend System R optimizer
– Bottom up, one minimal plan per interesting property
– What attributes remain compressed as a new property

Blowup reduced from n^m to 2^m

Example plans and their properties:
– Lazy BNL join of Customer and Supplier (2s). Property: S_A, C_A uncompressed
– Transient SM join of Customer and Supplier (2.5s). Property: all compressed
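The idea of keeping one cheapest plan per property can be sketched as follows. This is not the authors' optimizer: the table layout, the helper name keep_cheapest, and the use of a frozenset of still-compressed attributes as the property are assumptions for illustration. The timings are the ones quoted on the slides, and plan cost is assumed to be the sum of operator times.

```python
ALL = frozenset({"S_NAME", "S_ADDRESS", "C_NAME", "C_PHONE", "C_ADDRESS"})


def keep_cheapest(table, compressed, cost, plan):
    """Keep one cheapest plan per property (the set of still-compressed attributes)."""
    if compressed not in table or cost < table[compressed][0]:
        table[compressed] = (cost, plan)


# Join level: the two plans have different properties, so both are stored.
joins = {}
keep_cheapest(joins, ALL - {"S_ADDRESS", "C_ADDRESS"}, 2.0, "lazy BNL")
keep_cheapest(joins, ALL, 2.5, "transient SM")

# Transient sort timings from the slides, keyed by the join underneath.
SORT_COST = {"lazy BNL": 3.0, "transient SM": 0.5}

# Sort level: extend every stored join plan, again one plan per property.
final = {}
for compressed, (cost, plan) in joins.items():
    keep_cheapest(final, compressed, cost + SORT_COST[plan], plan + " + transient sort")

best_cost, best_plan = min(final.values())
```

Because the sort-merge plan survives at the join level under its own property, the final choice is the 3 s plan rather than the 5 s one. With m attributes there are at most 2^m such properties.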

Min-K Heuristic Algorithm

Store plans for k rather than 2^m properties
– The k properties whose plans are cheapest

Storage blowup reduced from 2^m to k
Time: still exponential blowup in the worst case

Example: join on S_A, C_A (inputs S_A,… and C_A,…). Stored plans:
– Lazy: S_A, C_A
– Transient: S_A, C_A
– Lazy: S_A, transient: C_A
– Transient: S_A, lazy: C_A
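The pruning step of Min-K can be sketched in a few lines. The function name min_k and the table layout (property mapped to a cost and plan pair) are assumptions, and the costs in the example are invented, since the slide lists the four stored plans without timings.

```python
import heapq


def min_k(table, k):
    """Keep only the k properties whose plans are cheapest.

    table maps a property label to a (cost, plan) pair.
    """
    kept = heapq.nsmallest(k, table.items(), key=lambda item: item[1][0])
    return dict(kept)


# The four property combinations from the slide, with hypothetical costs.
candidates = {
    "lazy: S_A, C_A": (2.0, "plan 1"),
    "transient: S_A, C_A": (2.5, "plan 2"),
    "lazy: S_A, transient: C_A": (6.0, "plan 3"),
    "transient: S_A, lazy: C_A": (7.0, "plan 4"),
}

stored = min_k(candidates, k=2)
```

With k = 2 only the two cheapest entries survive, so storage per subplan is bounded by k instead of 2^m.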

Min-K Heuristics (2)

If transient decompression is bad for one join attribute, often so for the other
– BNL join: both S_A and C_A decompressed N^2 times

Only consider two cases

Time blowup is 2k

Example: join on S_A, C_A (inputs S_A,… and C_A,…). Stored plans:
– Lazy: S_A, C_A
– Transient: S_A, C_A

Experiments

Setup
– Modify Predator query engine & optimizer
– Algorithms
  • Uncompressed, Eager, Lazy, Transient-Only, Two-Step, OPT, Min-1, Min-2
– 100 MB TPC-H data
– 50% compression ratio
– Pentium III 550 MHz, vary buffer pool size

Experimental Setup (2)

Randomly add join conditions on string attributes

Divide queries into workloads
– Number of string join conditions, number of join tables

Metrics: for algorithm X
– Average relative-cost: Average(cost of plan returned by X / cost of opt plan)
– Average blowup factor: Average(# plans searched by X / # plans by System R)
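Both metrics are per-query ratios averaged over a workload. A minimal sketch, with function and parameter names chosen here and example numbers that are made up rather than taken from the experiments:

```python
from statistics import mean


def average_relative_cost(plan_costs, opt_costs):
    """Mean over queries of (cost of X's plan / cost of the optimal plan)."""
    return mean(c / o for c, o in zip(plan_costs, opt_costs))


def average_blowup_factor(plans_searched, plans_system_r):
    """Mean over queries of (# plans X searched / # plans System R searched)."""
    return mean(x / s for x, s in zip(plans_searched, plans_system_r))


# Hypothetical two-query workload.
rel = average_relative_cost([5.0, 3.0], [2.5, 3.0])   # ratios 2.0 and 1.0
blowup = average_blowup_factor([40, 20], [10, 10])    # ratios 4.0 and 2.0
```

A relative cost of 1.0 means the algorithm always found the optimal plan, and a blowup factor of 1.0 means it searched no more plans than System R.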

Average Relative Cost

Queries with 3-4 join tables, buffer pool 10% of compressed DB

[Chart: relative cost (y-axis, 0 to 14) vs. number of join conditions on string attributes (x-axis, 0 to 3). Series: OPT, Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Uncompressed.]

Distribution of Query Performance

Percentage of good plans (cost within twice of OPT) for all queries

[Chart: percentage of good plans (y-axis, 0% to 100%) for Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Not Compressed.]

Optimization Cost

Queries with 3-4 join tables

[Chart: blowup factor (y-axis, 0 to 60) vs. number of join conditions on string attributes (x-axis, 0 to 3). Series: OPT, Min-2. One data label reads 4.]

Related Work

How to compress
– Roth&Horn93, Iyer&Wilhite94, Goldstein98

How to query
– Graefe&Shapiro91, Westmann00, Greer99

Query optimization
– Compressed MOLAP aggregates: Li99
– Compressed bitmap indices: Amer-Yahia&Johnson00
– Expensive predicates: Chaudhuri&Shim99, Hellerstein93

Conclusions & Future Work

Novel optimization problem
– Search for regular query plan + when to decompress
– Separate search is sub-optimal
– OPT and Min-K heuristic
– Up to an order of magnitude improvement in experiments

Future work
– Caching decompressed values
– Updates

Search Space

n^m blowup over old space
– n: depth of plan
– m: number of attributes

[Diagram: plan of depth 3 with input S_A, …, join S_A = C_A, and Sort(S_A)]
– 3 places to place D(S_A), giving 3 extended plans (3 is depth)
– Operators before D(S_A): convert to transient (transient join)
– Operators after D(S_A): as it is (regular sort)

Relative-Cost - Varying Buffer Pool Size

Queries with 3-4 join tables, 2 additional string joins

[Chart: relative cost (y-axis, 0 to 14) vs. buffer pool size as % of compressed DB (x-axis: 10%, 40%, 200%). Series: OPT, Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Uncompressed.]

Relative Performance (2)

Queries with more than 5 join tables

[Chart: relative cost (y-axis, 0 to 12) vs. number of join conditions on string attributes (x-axis, 0 to 3). Series: OPT, Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Uncompressed.]