1
Query Optimization in Compressed Database Systems
Zhiyuan Chen and Johannes Gehrke
Cornell University
Flip Korn
AT&T Labs
2
Why Compression?
CPU speed outpaces disk speed exponentially!
– x10 / decade (bandwidth), x100 / decade (latency)
Trade CPU for I/O: improve query performance
+ Save bandwidth for sequential I/O
+ Improve buffer pool hit ratio
- Pay decompression cost
Environment
– Decision support queries
– Lossless compression
3
Issues
Database compression methods
Efficient query processing
4
Database Compression Methods
General-purpose compression
– Only compression ratio matters
– Large decompression unit (whole file)
Database compression
– Both compression ratio and decompression cost matter
– Small decompression unit (attribute or tuple)
Our setting: a single attribute can be decompressed on its own
5
Efficient Query Processing
Compared to uncompressed DB
– When to decompress
– Assumption: no compression in query processing
Our story
– Different strategies of when to decompress
– None of them is always optimal
– Combined optimization problem: query plan + decompression placement
– Solutions
– Experiments
6
Different Decompression Strategies
Example: join R and S on R.A = S.B
Eager: D(R), D(S), then join on R.A = S.B (all uncompressed)
Lazy: D(R.A), D(S.B), then join on R.A = S.B (A, B uncompressed)
Transient: join on d(R.A) = d(S.B) (all compressed)
[Diagram: the three plans drawn across the Mem and Disk levels]
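The three strategies can be sketched on toy data. This is only an illustration under stated assumptions: the slides do not name a compression scheme, so `zlib` stands in as a hypothetical attribute-level codec, and the relations, attribute names X and Y, and function names are made up for the example.

```python
import zlib

def c(value):
    # Hypothetical attribute-level compressor (stand-in, not the authors' codec).
    return zlib.compress(value.encode())

def d(blob):
    # Matching decompressor.
    return zlib.decompress(blob).decode()

# Every attribute of R and S is stored compressed.
R = [{"A": c("ithaca"), "X": c("r1")}, {"A": c("nyc"), "X": c("r2")}]
S = [{"B": c("nyc"), "Y": c("s1")}, {"B": c("boston"), "Y": c("s2")}]

def eager_join(R, S):
    # D(R), D(S): decompress everything up front; output is all uncompressed.
    R2 = [{k: d(v) for k, v in t.items()} for t in R]
    S2 = [{k: d(v) for k, v in t.items()} for t in S]
    return [{**r, **s} for r in R2 for s in S2 if r["A"] == s["B"]]

def lazy_join(R, S):
    # D(R.A), D(S.B): decompress only the join attributes, which then
    # stay uncompressed; the other attributes remain compressed.
    R2 = [{**t, "A": d(t["A"])} for t in R]
    S2 = [{**t, "B": d(t["B"])} for t in S]
    return [{**r, **s} for r in R2 for s in S2 if r["A"] == s["B"]]

def transient_join(R, S):
    # d(R.A) = d(S.B): decompress temporarily inside the comparison;
    # the output tuples stay fully compressed.
    return [{**r, **s} for r in R for s in S if d(r["A"]) == d(s["B"])]
```

All three return the same single matching pair; they differ only in which attributes of the result are still compressed.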
7
Which Strategy Is Optimal?
Lazy vs. eager
– Lazy is always better
Transient vs. lazy
– Transient: more I/O savings
– Lazy: lower decompression cost
In practice
– Numerical attributes: transient is always better
– String attributes: no clear winner
• Expensive to decompress
• High I/O savings if compressed
8
An Example With TPC-H Data
Select S_NAME, S_ADDRESS, C_NAME, C_PHONE
From Supplier, Customer
Where S_ADDRESS = C_ADDRESS
Order by S_NAME, C_NAME
Supplier Customer
S_A = C_A
Sort(S_N, C_N)
9
Transient vs. Lazy
1 attribute compressed: Lazy BNL (2s), Lazy sort (7s)
3 attributes compressed: Lazy BNL (2s), Transient sort (3s)
All attributes compressed: Transient BNL (42s), Transient sort (0.5s)
An optimization problem!
10
Interactions With Traditional Optimization
The plan System R returns as optimal is no longer optimal!
Two-step algorithm: run System R, then decide when to decompress.
3 attributes compressed: Lazy BNL (2s), Transient sort (3s)
All attributes compressed: Transient SM (2.5s), Transient sort (0.5s) (pruned by System R)
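Adding up the operator times quoted on this slide and the previous one shows the gap. A small sketch, assuming a plan's cost is simply the sum of its join and sort times (the slides give only per-operator numbers) and that the two-step approach is committed to the BNL join because System R pruned the sort-merge alternative:

```python
# (join seconds, sort seconds), copied from the slides.
plans = {
    "Lazy BNL + Lazy sort": (2.0, 7.0),
    "Lazy BNL + Transient sort": (2.0, 3.0),
    "Transient BNL + Transient sort": (42.0, 0.5),
    "Transient SM + Transient sort": (2.5, 0.5),
}

# Assumption: total plan cost = join time + sort time.
totals = {name: sum(times) for name, times in plans.items()}

# Two-step: the join method (BNL) is already fixed by System R, so only
# the decompression placement can still be chosen.
two_step = min((n for n in totals if "BNL" in n), key=totals.get)

# Combined search: join method and decompression placement chosen together.
combined = min(totals, key=totals.get)

print(two_step, totals[two_step])   # Lazy BNL + Transient sort 5.0
print(combined, totals[combined])   # Transient SM + Transient sort 3.0
```

Under these assumptions the two-step plan takes 5s while the combined search finds a 3s plan.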
11
Compression Aware Optimization
Given a query and a compressed DB: find the optimal query plan
New operators
– Explicit decompression operators
– Transient versions of existing relational operators
Search space: O(n^m) factor over old search space
– n is the depth of the plan
– m is the number of attributes
– Each attribute explicitly decompressed at most once
– For each attribute, n places to decompress explicitly
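The size of the blowup follows from a simple count: each of the m attributes independently picks one of n plan levels for its explicit decompression, giving n^m combinations. A minimal sketch that enumerates them; the function name and the level numbering are illustrative assumptions, not part of the original system.

```python
from itertools import product

def decompression_placements(depth, attributes):
    """Yield one dict per extended plan: attribute -> plan level (1..depth)
    at which that attribute is explicitly decompressed."""
    for levels in product(range(1, depth + 1), repeat=len(attributes)):
        yield dict(zip(attributes, levels))

# Plan of depth n = 3 with m = 2 compressed attributes: 3^2 placements.
placements = list(decompression_placements(3, ["S_A", "C_A"]))
print(len(placements))  # 9
```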
12
Dynamic Programming - OPT
Extend the System R optimizer
– Bottom up, one minimal plan per interesting property
– Which attributes remain compressed becomes a new property
Blowup reduced from n^m to 2^m
Example plans for joining Customer and Supplier:
– Lazy BNL (2s). Property: S_A, C_A uncompressed
– Transient SM join (2.5s). Property: all compressed
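The pruning rule can be sketched as a table keyed by the set of attributes that are still compressed, keeping only the cheapest plan per key. With m attributes there are at most 2^m such keys. This is a simplified illustration, not the Predator implementation: the function name is hypothetical, other interesting properties such as sort order are ignored, and C_P is an assumed abbreviation for C_PHONE. The costs come from the slides.

```python
def keep_best_per_property(candidates):
    """candidates: iterable of (plan name, cost, set of compressed attributes).
    Returns {property: (plan name, cost)} with one cheapest plan per property."""
    best = {}
    for name, cost, compressed in candidates:
        key = frozenset(compressed)
        if key not in best or cost < best[key][1]:
            best[key] = (name, cost)
    return best

ALL = frozenset({"S_N", "S_A", "C_N", "C_P", "C_A"})

candidates = [
    ("Lazy BNL", 2.0, ALL - {"S_A", "C_A"}),   # S_A, C_A uncompressed
    ("Transient SM join", 2.5, ALL),           # all compressed
    ("Transient BNL", 42.0, ALL),              # all compressed, but costlier
]

best = keep_best_per_property(candidates)
for prop, (name, cost) in best.items():
    print(sorted(prop), name, cost)
```

Transient BNL is dropped because Transient SM join has the same property at lower cost. Lazy BNL survives even though its property differs, which is what lets the optimizer later pick the cheaper overall plan.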
13
Min-K Heuristic Algorithm
Store plans for k rather than 2^m properties
– The k properties whose plans are cheapest
Storage blowup reduced from 2^m to k
Time: still exponential blowup in the worst case
Example: join on S_A, C_A (inputs: S_A,… and C_A,…)
Stored plans:
– Lazy: S_A, C_A
– Transient: S_A, C_A
– Lazy: S_A, transient: C_A
– Transient: S_A, lazy: C_A
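The storage rule can be sketched as a top-k selection over the per-property table. The costs below are made-up placeholders for illustration only (the slide lists the four plans without costs), and `min_k` is a hypothetical helper name.

```python
import heapq

def min_k(best_per_property, k):
    """Keep only the k properties whose best plans are cheapest.
    best_per_property: {property: (plan name, cost)}."""
    kept = heapq.nsmallest(k, best_per_property.items(),
                           key=lambda item: item[1][1])
    return dict(kept)

# Property = set of join attributes still compressed after the join.
# Costs are hypothetical.
table = {
    frozenset(): ("Lazy: S_A, C_A", 2.0),
    frozenset({"S_A", "C_A"}): ("Transient: S_A, C_A", 2.5),
    frozenset({"C_A"}): ("Lazy: S_A, transient: C_A", 6.0),
    frozenset({"S_A"}): ("Transient: S_A, lazy: C_A", 7.0),
}

min_2 = min_k(table, 2)
print(sorted(name for name, _ in min_2.values()))
```

With k = 2 only the all-lazy and all-transient plans are stored in this example; the two mixed plans are discarded.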
14
Min-K Heuristics (2)
If transient decompression is bad for one join attribute, it is often bad for the other
– BNL join: both S_A and C_A decompressed N^2 times
Time blowup is 2k
Only consider two cases for the join on S_A, C_A (inputs: S_A,… and C_A,…)
Stored plans:
– Lazy: S_A, C_A
– Transient: S_A, C_A
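The N^2 figure can be checked by counting decompression calls. A sketch assuming a plain tuple-at-a-time nested-loop join over two inputs of N tuples each, with no caching of decompressed values; the counter and function name are illustrative only.

```python
def count_decompressions(n, strategy):
    """Count decompression calls for a nested-loop join of two n-tuple inputs."""
    calls = 0

    def d(value):
        # Stand-in for decompression: counts the call, returns the value.
        nonlocal calls
        calls += 1
        return value

    R = list(range(n))
    S = list(range(n))
    if strategy == "lazy":
        # Decompress each join value once, before the join.
        R = [d(r) for r in R]
        S = [d(s) for s in S]
        matches = sum(1 for r in R for s in S if r == s)
    elif strategy == "transient":
        # Decompress both sides again for every comparison.
        matches = sum(1 for r in R for s in S if d(r) == d(s))
    else:
        raise ValueError("unknown strategy: " + strategy)
    assert matches == n  # every value has exactly one partner
    return calls

print(count_decompressions(100, "lazy"))       # 200   (N per attribute)
print(count_decompressions(100, "transient"))  # 20000 (N^2 per attribute)
```

Both join attributes pay the quadratic cost together, which is why treating them as a pair (both lazy or both transient) loses little.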
15
Experiments
Setup
– Modify Predator query engine & optimizer
– Algorithms
• Uncompressed, Eager, Lazy, Transient-Only, Two-Step, OPT, Min-1, Min-2
– 100 MB TPC-H data
– 50% compression ratio
– Pentium III 550 MHz, vary buffer pool size
16
Experimental Setup (2)
Randomly add join conditions on string attributes
Divide queries into workloads
– Number of string join conditions, number of join tables
Metrics: for algorithm X
– Average relative-cost: Average(cost of plan returned by X / cost of opt plan)
– Average blowup factor: Average(# plans searched by X / # plans by System R)
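Both metrics are per-query ratios averaged over a workload. A minimal sketch with made-up numbers; the function names and the sample workload are assumptions for illustration, not measured results.

```python
from statistics import mean

def average_relative_cost(costs_x, costs_opt):
    """Mean over queries of (cost of plan returned by X / cost of optimal plan)."""
    return mean(x / opt for x, opt in zip(costs_x, costs_opt))

def average_blowup_factor(plans_x, plans_system_r):
    """Mean over queries of (# plans searched by X / # plans searched by System R)."""
    return mean(x / base for x, base in zip(plans_x, plans_system_r))

# Hypothetical two-query workload.
rel_cost = average_relative_cost([4.0, 6.0], [2.0, 3.0])
blowup = average_blowup_factor([30, 80], [10, 20])
print(rel_cost)  # 2.0
print(blowup)    # 3.5
```

Note that each metric averages the ratios, not the raw totals, so one expensive query does not dominate the result.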
17
Average Relative Cost
Queries with 3-4 join tables, buffer pool 10% of compressed DB
[Figure: relative cost (y-axis, 0 to 14) vs. number of join conditions on string attributes (x-axis, 0 to 3) for OPT, Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Uncompressed]
18
Distribution of Query Performance
Percentage of good plans (cost within twice that of OPT) for all queries
[Figure: percentage of good plans (y-axis, 0% to 100%) for Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Not Compressed]
19
Optimization Cost
Queries with 3-4 join tables
[Figure: blowup factor (y-axis, 0 to 60) vs. number of join conditions on string attributes (x-axis, 0 to 3) for OPT and Min-2]
20
Related Work
How to compress
– Roth&Horn93, Iyer&Wilhite94, Goldstein98
How to query
– Graefe&Shapiro91, Westmann00, Greer99
Query optimization
– Compressed MOLAP aggregates: Li99
– Compressed bitmap indices: Amer-Yahia&Johnson00
– Expensive predicates:
• Chaudhuri&Shim99, Hellerstein93
21
Conclusions & Future Work
Novel optimization problem
– Search for regular query plan + when to decompress
– Separate search is suboptimal
– OPT and Min-K heuristic
– Up to an order of magnitude improvement in experiments
Future work
– Caching decompressed values
– Updates
22
Search Space
Example plan (bottom up): S_A, …; join S_A = C_A; Sort(S_A)
3 places to place D(S_A), so 3 extended plans (3 is the depth)
– Operators before D(S_A): convert to transient (e.g. transient join)
– Operators after D(S_A): as they are (e.g. regular sort)
n^m blowup over old space
– n: depth of plan
– m: number of attributes
23
Relative-Cost - Varying Buffer Pool Size
Queries with 3-4 join tables, 2 additional string joins
[Figure: relative cost (y-axis, 0 to 14) vs. buffer pool size as % of compressed DB (x-axis: 10%, 40%, 200%) for OPT, Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Uncompressed]
24
Relative Performance (2)
Queries with more than 5 join tables
[Figure: relative cost (y-axis, 0 to 12) vs. number of join conditions on string attributes (x-axis, 0 to 3) for OPT, Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Uncompressed]