Administrivia
• Homework 2 due tonight
  – Remember you have 4 slip days for the course
• Homeworks 3 & 4 available later this week
  – 3 is a written assignment on optimization, due before Midterm 2
  – 4 is a programming assignment implementing query processing, due after Spring Break
• Midterm 2 is 3/22, two weeks from Thursday
Review: Query Processing
• Queries start out as SQL
• Database translates SQL to one or more relational algebra plans
• Plan is a tree of operations, with an access path for each
• Access path is how each operator gets tuples
  – If working directly on a table, can use a scan or an index
  – Some operators, like sort-merge join or group-by, need tuples sorted
  – Often, operators are pipelined, getting tuples that are output from earlier operators in the tree
• Database estimates the cost of various plans and chooses the least expensive
Today
• Review costs for:
  – Sorting
  – Selection
  – Projection
  – Joins
• Re-examine hashing for:
  – Projection
  – Joins
General External Merge Sort
• To sort a file with N pages using B buffer pages:
  – Pass 0: use all B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each.
  – Passes 1, 2, …: merge B-1 runs at a time.
[Figure: B main-memory buffers between input disk and output disk; during a merge, input buffers 1 through B-1 feed a single output buffer.]
• More than 3 buffer pages. How can we utilize them?
Cost of External Merge Sort
• Minimum amount of memory: 3 pages
  – Initial runs of 3 pages
  – Then 2-way merges of sorted runs (2 pages for input, one for output)
  – # of passes: 1 + ⌈log2(N/3)⌉
• With more memory, fewer passes
  – With B pages, # of passes: 1 + ⌈log_(B-1)⌈N/B⌉⌉
• I/O cost = 2N × (# of passes)
External Sort Example
• E.g., with 5 buffer pages, to sort a 108-page file:
  – Pass 0: 22 sorted runs of 5 pages each (last run is only 3 pages)
  – Now do 4-way (B-1) merges
  – Pass 1: 6 sorted runs of 20 pages each (last run is only 8 pages)
  – Pass 2: 2 sorted runs, 80 pages and 28 pages
  – Pass 3: sorted file of 108 pages
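The pass count and I/O cost above can be checked with a short sketch (Python used for illustration; the function names are my own):

```python
import math

def sort_passes(n_pages, b_buffers):
    """Pass 0 builds ceil(N/B) runs; each later pass merges B-1 runs at a time."""
    runs = math.ceil(n_pages / b_buffers)
    merges = math.ceil(math.log(runs, b_buffers - 1)) if runs > 1 else 0
    return 1 + merges

def sort_io_cost(n_pages, b_buffers):
    """Every pass reads and writes all N pages: 2N I/Os per pass."""
    return 2 * n_pages * sort_passes(n_pages, b_buffers)

# The 108-page example with B = 5: passes 0 through 3, i.e. 4 passes total.
print(sort_passes(108, 5))   # 4
print(sort_io_cost(108, 5))  # 864
```

With 5 buffers, 22 initial runs need three 4-way merge passes on top of pass 0, for 2 × 108 × 4 = 864 I/Os.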
Using B+ Trees for Sorting
• Scenario: table to be sorted has a B+ tree index on the sorting column(s)
• Idea: retrieve records in order by traversing the leaf pages
• Is this a good idea? Cases to consider:
  – B+ tree is clustered: good idea!
  – B+ tree is not clustered: could be a very bad idea!
• I/O cost
  – Clustered tree: ~1.5N
  – Unclustered tree: 1 I/O per tuple, worst case!
Selections: “age < 20”, “fname = Bob”, etc.
• No index
  – Do a sequential scan over all tuples
  – Cost: N I/Os
• Sorted data
  – Do a binary search
  – Cost: log2(N) I/Os
• Clustered B-tree
  – Cost: 2 or 3 I/Os to find the first record + 1 I/O for each qualifying page
• Unclustered B-tree
  – Cost: 2 or 3 I/Os to find the first RID + ~1 I/O for each qualifying tuple
• Clustered hash index
  – Cost: ~1.2 I/Os to find the bucket; all matching tuples are inside
• Unclustered hash index
  – Cost: ~1.2 I/Os to find the bucket + ~1 I/O for each matching tuple
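The formulas above can be collected into a rough cost sketch (illustrative only, not a real optimizer; the function name and labels are my own):

```python
import math

def selection_cost(path, n_pages, match_pages=0, match_tuples=0):
    """Approximate selection I/O costs from the slide formulas."""
    if path == "scan":
        return n_pages                       # read every page
    if path == "binary_search":              # data sorted on the column
        return math.ceil(math.log2(n_pages))
    if path == "btree_clustered":
        return 3 + match_pages               # ~3 I/Os to the leaf, then pages
    if path == "btree_unclustered":
        return 3 + match_tuples              # ~1 I/O per matching tuple
    if path == "hash_unclustered":
        return 1.2 + match_tuples            # ~1.2 I/Os to find the bucket
    raise ValueError(path)

# On a 100-page table, assuming ~500 matching tuples (this reproduces the
# 503-I/O figure in the exercise that follows):
print(selection_cost("btree_unclustered", 100, match_tuples=500))  # 503
```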
Selection Exercise
• Sal > 100
  – Sequential scan: 100 I/Os
  – B-tree on sal? Unclustered, 503 I/Os
  – B-tree on <age, sal>? Can’t use it
• Age = 25
  – Sequential scan: 100 I/Os
  – Hash on age: 1.2 I/Os to get to the bucket
    • 20 matching tuples, 1 I/O for each
  – B-tree on <age, sal>: ~3 I/Os to find the leaf + # matching pages
    • 20 matching tuples, clustered, ~2 I/Os
• Age > 20
  – Sequential scan: 100 I/Os
  – Hash on age? Not with a range query.
  – B-tree on <age, sal>: ~3 I/Os to find the leaf + # matching pages
    • (Age > 20) is 90% of pages, or ~90 × 1.5 = 135 I/Os
Selection Exercise (cont)
• Eid = 1000
  – Sequential scan: ~50 I/Os (avg)
  – Hash on eid: ~1.2 I/Os to find the bucket + 1 I/O to get the record
• Sal > 100 and age < 30
  – Sequential scan: 100 I/Os
  – B-tree on <age, sal>: ~3 I/Os to find the leaf; 30% of pages match, so 30 × 1.5 = 45 I/Os
Projection
• Expensive when eliminating duplicates
• Can do this via:
  – Sorting: cost no more than an external sort
    • Cheaper if you project columns in the initial pass, since more projected tuples fit in each page
  – Hashing: build a hash table; duplicates will end up in the same bucket
An Alternative to Sorting: Remove Duplicates with Hashing!
• Idea:
  – Many of the things we use sort for don’t exploit the order of the sorted data
  – e.g.: removing duplicates in DISTINCT
  – e.g.: finding matches in JOIN
• Often good enough to match all tuples with equal values
• Hashing does this!
  – And may be cheaper than sorting! (Hmmm…!)
  – But how to do it for data sets bigger than memory??
General Idea
• Two phases:
  – Partition: use a hash function h to split tuples into partitions on disk
    • Key property: all matches live in the same partition
  – ReHash: for each partition on disk, build a main-memory hash table using a second hash function h2
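The two phases can be sketched as an in-memory simulation, with Python lists standing in for the on-disk partitions (the function name is my own):

```python
def hash_distinct(tuples, num_partitions):
    """Two-phase duplicate elimination by hashing (in-memory simulation).

    Phase 1 (partition): a hash function h scatters tuples across
    partitions; all duplicates of a value land in the same partition.
    Phase 2 (rehash): per partition, a second hash table (a Python set
    standing in for h2) emits each distinct value once.
    """
    partitions = [[] for _ in range(num_partitions)]     # stand-ins for disk files
    for t in tuples:
        partitions[hash(t) % num_partitions].append(t)   # phase 1: h
    result = []
    for part in partitions:
        seen = set()                                     # phase 2: h2
        for t in part:
            if t not in seen:
                seen.add(t)
                result.append(t)
    return result

print(sorted(hash_distinct([1, 2, 2, 3, 1, 3, 4], 3)))  # [1, 2, 3, 4]
```

In the real external version, each partition is written to disk in phase 1 and only one partition’s hash table needs to fit in memory at a time in phase 2.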
Two Phases
[Figure: Partition phase: hash function h routes tuples of the original relation through B main-memory buffers (1 input, B-1 output) into B-1 partitions on disk. Rehash phase: each partition Ri (≤ B pages) is read back and hashed with h2 into an in-memory hash table to produce the result.]
Duplicate Elimination using Hashing
• Read one partition at a time
• For each group of identical tuples, output one
[Figure: same two-phase diagram: partition with h into B-1 partitions, then rehash each partition Ri (≤ B pages) with h2.]
Hashing in Detail
• Two phases, two hash functions
• First pass: partition into (B-1) buckets
• E.g., B = 5 pages, h(x) is the two low-order bits
[Figure: input file hashed through memory into output files, one per bucket.]
Memory and I/O Cost Requirements
• If we can hash in two passes → cost is 4N
• How big a table can we hash in two passes?
  – B-1 partitions result from Phase 0
  – Each should be no more than B pages in size
  – Answer: B(B-1)
  – Said differently: we can hash a table of N pages in about √N buffer pages of space
  – Note: assumes the hash function distributes records evenly!
• Have a bigger table? Recursive partitioning!
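The B(B-1) bound and its √N reading can be checked numerically (a small sketch; the function names are my own):

```python
import math

def max_two_pass_pages(b):
    """B-1 partitions from the first pass, each at most B pages: B(B-1)."""
    return b * (b - 1)

def min_buffers(n):
    """Smallest B with B(B-1) >= N, which is roughly sqrt(N) buffer pages."""
    b = math.isqrt(n)
    while b * (b - 1) < n:
        b += 1
    return b

print(max_two_pass_pages(5))  # 20
print(min_buffers(10_000))    # 101, i.e. about sqrt(10000) = 100
```

The same arithmetic applies verbatim to two-pass external sorting, which is the point of the comparison on the next slide.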
Memory Requirement for External Sorting
• How big a table can we sort in two passes?
  – Each “sorted run” after Phase 0 is of size B
  – Can merge up to B-1 sorted runs in Phase 1
  – Answer: B(B-1)
  – Said differently: we can sort a table of N pages in about √N buffer pages of space
• Have a bigger table? Additional merge passes!
So Which Is Better??
• Based on our simple analysis:
  – Same memory requirement for 2 passes
  – Same I/O cost
• Digging deeper…
• Sorting pros:
  – Great if input is already sorted (or almost sorted)
  – Great if the output needs to be sorted anyway
  – Not sensitive to “data skew” or “bad” hash functions
• Hashing pros:
  – Highly parallelizable (will discuss later in the semester)
  – Can exploit extra memory to reduce # of I/Os (stay tuned…)
Nested Loops Joins
• R, with M pages, joins S, with N pages
• Nested loops variants:
  – Simple (tuple-at-a-time) nested loops
    • Insanely inefficient: M + pR·M·N, where pR is tuples per page of R
  – Page nested loops (only 3 pages of memory)
    • M + M·N
  – Blocked nested loops (B pages of memory)
    • M + ⌈M/(B-2)⌉·N
    • If M fits in memory (B-2 pages), cost is only M + N
  – Index nested loops
    • M + pR·M·(index lookup cost)
    • Only good if M is very small
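The three scan-based variants can be compared with a quick calculation (a sketch; the numbers in the example are my own illustration):

```python
def simple_nlj(m, n, p_r):
    """Tuple-at-a-time: scan all of S once per tuple of R."""
    return m + p_r * m * n

def page_nlj(m, n):
    """Page-at-a-time: scan all of S once per page of R."""
    return m + m * n

def block_nlj(m, n, b):
    """Read R in chunks of B-2 pages; scan S once per chunk."""
    chunks = -(-m // (b - 2))        # ceil(M / (B-2))
    return m + chunks * n

# M = 1000 pages, N = 500 pages, 100 tuples/page of R, B = 102 buffers:
print(simple_nlj(1000, 500, 100))  # 50001000
print(page_nlj(1000, 500))         # 501000
print(block_nlj(1000, 500, 102))   # 6000
```

Even modest buffering collapses the cost by orders of magnitude, which is why the blocked variant is the practical one.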
Sort-Merge Join
• Simple case:
  – Sort both tables on the join column, then merge
  – Cost: external sort cost + merge cost
    • 2M(1 + ⌈log_(B-1)(M/B)⌉) + 2N(1 + ⌈log_(B-1)(N/B)⌉) + M + N
• Optimized case:
  – If we have enough memory, do the final merge and the join in the same pass. This avoids the final write pass of the sort and the read pass of the merge.
  – Can we merge on the 2nd pass? Only if # of runs from the 1st pass < B
  – # of runs for R is M/B; # of runs for S is N/B
    • Total # of runs ≈ (M+N)/B
  – Can merge on the 2nd pass if (M+N)/B < B, i.e., M+N < B²
  – Cost: 3(M+N)
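Both cases can be worked through numerically (a sketch under the slide’s formulas; the function names are my own):

```python
import math

def external_sort_cost(pages, b):
    """2 * pages * (1 + ceil(log_{B-1}(ceil(pages/B)))) I/Os."""
    runs = math.ceil(pages / b)
    passes = 1 + (math.ceil(math.log(runs, b - 1)) if runs > 1 else 0)
    return 2 * pages * passes

def sort_merge_join_cost(m, n, b):
    """Simple case: fully sort both inputs, then one merge pass."""
    return external_sort_cost(m, b) + external_sort_cost(n, b) + m + n

def optimized_smj_cost(m, n, b):
    """Optimized 2-pass cost 3(M+N); valid only when M + N < B^2."""
    assert m + n < b * b, "not enough memory for the 2-pass variant"
    return 3 * (m + n)

# M = 1000, N = 500, B = 100: both inputs sort in 2 passes.
print(sort_merge_join_cost(1000, 500, 100))  # 7500
print(optimized_smj_cost(1000, 500, 100))    # 4500
```

Merging and joining in one pass saves one full read and write of both inputs, here 3000 I/Os.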
Hash Join
[Figure: Partition phase: hash function h splits both R and S into B-1 partitions on disk, using B main-memory buffers. Matching phase: for each partition i, build an in-memory hash table on Ri (≤ B-2 pages) with h2, then probe it with tuples of Si read through an input buffer, writing join results through an output buffer.]
Cost of Hash Join
• Partitioning phase: read + write both relations → 2(|R|+|S|) I/Os
• Matching phase: read both relations → |R|+|S| I/Os
• Total cost of 2-pass hash join = 3(|R|+|S|)
• Q: what is the cost of a 2-pass sort-merge join?
• Q: how much memory is needed for a 2-pass sort join?
• Q: how much memory is needed for a 2-pass hash join?
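A rough way to compare the two memory questions above (approximate, assuming uniform hashing; the function names are my own):

```python
import math

def min_buffers_sort_join(m, n):
    """2-pass sort-merge join needs roughly M + N < B^2, so B ~ sqrt(M+N)."""
    return math.ceil(math.sqrt(m + n))

def min_buffers_hash_join(m, n):
    """2-pass hash join only needs partitions of the *smaller* relation
    to fit in memory, so B ~ sqrt(min(M, N))."""
    return math.ceil(math.sqrt(min(m, n)))

# With M = 1000 and N = 100 pages:
print(min_buffers_sort_join(1000, 100))  # 34
print(min_buffers_hash_join(1000, 100))  # 10
```

Both 2-pass joins cost 3(|R|+|S|) I/Os, but hashing’s memory need scales with the smaller input only, which is the point the summary slide makes.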
An Important Optimization to Hashing
• Have B memory buffers; want to hash a relation of N pages
[Figure: step plot of # of passes and cost as a function of N: 1 pass (cost N) while N ≤ B; 2 passes (cost 3N) while B < N ≤ B².]
• If B < N < B², we will have unused memory…
Hybrid Hashing
• Idea: keep one of the hash partitions entirely in memory!
[Figure: input relation hashed by h into B-k on-disk partitions (2 through B-k), while a second function h3 sends tuples of the first partition into a k-buffer in-memory hash table.]
• Q: how do we choose the value of k?
Summary: Hashing vs. Sorting
• Sorting pros:
  – Good if input is already sorted, or output needs to be sorted
  – Not sensitive to data skew or bad hash functions
• Hashing pros:
  – Often cheaper due to hybrid hashing
  – For joins, # of passes depends on the size of the smaller relation
  – Highly parallelizable
Summary
• Several alternative evaluation algorithms exist for each operator.
• A query is evaluated by converting it to a tree of operators and evaluating the operators in the tree.
• Must understand query optimization in order to fully understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
• Two parts to optimizing a query:
  – Consider a set of alternative plans.
    • Must prune the search space; typically, left-deep plans only.
  – Estimate the cost of each plan that is considered.
    • Must estimate the size of the result and the cost of each plan node.
    • Key issues: statistics, indexes, operator implementations.