CS 345: Topics in Data Warehousing Tuesday, October 19, 2004


Page 1: CS 345: Topics in Data Warehousing

CS 345:Topics in Data Warehousing

Tuesday, October 19, 2004

Page 2: CS 345: Topics in Data Warehousing

Review of Thursday’s Class

• Bridge tables
  – Hierarchies
  – Multi-Valued Dimensions
• Extraction-Transformation-Load
  – Data staging area vs. data warehouse
  – Assigning surrogate keys
  – Detecting changed rows
  – Detecting duplicate dimension rows

Page 3: CS 345: Topics in Data Warehousing

Outline of Today’s Class

• Database System Architecture
  – Memory management
  – Secondary storage (disk)
  – Query planning process
• Joins
  – Nested Loop Join
  – Merge Join
  – Hash Join
• Grouping
  – Sort vs. Hash

Page 4: CS 345: Topics in Data Warehousing

Memory Management

• Modern operating systems use virtual memory
  – Address space larger than physical main memory
  – Layer of indirection between virtual addresses and physical addresses
  – Physical memory serves as a cache for most recently accessed memory locations
  – Infrequently accessed memory locations written to disk when cache fills up
• Memory transferred in large blocks called pages
  – Disk access: latency is large relative to throughput
  – Benefit from locality of reference
  – Transfer overheads amortized over many requests

Page 5: CS 345: Topics in Data Warehousing

Memory Management

• Database management systems (DBMSs) use a similar model to virtual memory
  – Data is stored persistently on disk
  – Main memory serves as cache for frequently accessed data
• This memory cache is called the buffer pool
  – Data organized into blocks called pages
• DBMSs override built-in virtual memory
  – Database operations are very data intensive
  – Careful management of disk access essential for good performance
  – DBMS has specialized knowledge of its data access patterns
  – Specialized memory management includes:
    • Prefetching pages likely to be needed in the future
    • Essential / frequently accessed pages pinned in memory
    • Multiple buffer pools with different parameters for different needs
      – Parameters: page size, page replacement policy
      – Needs: relations, indexes, temporary results, sort space

Page 6: CS 345: Topics in Data Warehousing

Disk Access Time

• Disk access time = seek cost + rotational latency + transfer cost
• Seek cost
  – Move the disk head to the right place
  – Physical machinery
• Rotational latency
  – Wait for the disk to spin
  – Average wait = ½ rotation
• Transfer cost
  – Read the data while the disk spins

[Diagram: disk drive, illustrating seek cost and rotational latency]

Page 7: CS 345: Topics in Data Warehousing

Random vs. Sequential I/O

• Seek cost and rotational latency are overhead
  – Independent of the transfer size
  – ~20 ms on commodity disks
  – One large transfer is much cheaper than many small transfers
• Random vs. Sequential I/O
  – I/O = Input/Output
  – Random I/O: read/write an arbitrary page
  – Sequential I/O: read/write the next page
  – Sequential I/O is 10-20x faster than random I/O
• Avoiding random I/O will be a theme
  – Cost of sequential I/O is less
  – Prefetching easy when access patterns predictable
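To make the overhead concrete, here is a back-of-envelope sketch. The ~20 ms overhead figure comes from the slide; the page size and transfer rate are assumed purely for illustration:

```python
def io_time_ms(n_pages, page_kb=8.0, overhead_ms=20.0, mb_per_s=50.0, random_io=True):
    """Rough disk I/O time model (illustrative numbers):
    random I/O pays the seek + rotational overhead on every page,
    while one sequential transfer pays it only once."""
    transfer_ms = n_pages * page_kb / 1024.0 / mb_per_s * 1000.0
    overhead = overhead_ms * (n_pages if random_io else 1)
    return overhead + transfer_ms
```

With these assumed parameters, reading 1,000 pages takes roughly 20.2 s randomly but only about 0.18 s sequentially, which is why batching transfers pays off.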

Page 8: CS 345: Topics in Data Warehousing

Query Planning Process

• Many different query plans are possible
  – Logically equivalent ways of answering the same query
• Job of the query optimizer: choose the most efficient plan for a given query
  – Most efficient = minimize resource consumption
  – Resources = CPU, disk I/O
  – Consuming fewer resources → faster query responses
• Design of query optimizer
  – Rule-based
    • Apply heuristic transformations (e.g. push selections down)
  – Cost-based
    • Estimate cost (resource usage) of each possible query plan
    • Choose the plan with lowest estimated cost
    • Cost estimation uses statistics collected by the optimizer
  – In practice, optimizers use a combination of the two
    • Search space of possible plans is too large to explore exhaustively
    • Heuristics guide which types of plans are considered for cost-based optimization

Page 9: CS 345: Topics in Data Warehousing

[Diagram: query planning pipeline — SQL query → parse → parse tree → query rewriting → best logical query plan (informed by statistics) → physical plan generation → best physical query plan → execute → result]

Page 10: CS 345: Topics in Data Warehousing

Join Processing

• OLAP queries involve joins
  – Fact table joins to dimension tables
  – Primary-key / foreign-key joins
• Choice of join algorithm has large impact on query cost
• Naïve join processing algorithm, to join relations A and B:
  – Generate the “Cartesian product” of A and B
    • Each A tuple is combined with each B tuple
  – Apply selection conditions to each tuple in the Cartesian product
• Problem: the intermediate result can be huge!
  – Relation A has 10 million rows
  – Relation B has 100,000 rows
  – A x B has 1 trillion rows (many terabytes of data)
• Need a more efficient, logically equivalent procedure

Page 11: CS 345: Topics in Data Warehousing

Nested Loop Join (NLJ)

• NLJ (conceptually):

  for each r ∈ R1 do
    for each s ∈ R2 do
      if r.C = s.C then output (r, s) pair

  R1:          R2:
  B  C         C   D
  a  10        10  cat
  a  20        40  dog
  b  10        15  bat
  d  30        20  rat
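The pseudocode above maps directly to an in-memory sketch; the relation and column names follow the R1/R2 example on the slide:

```python
def nested_loop_join(r1, r2):
    """Nested loop join on column C: compare every (r, s) pair."""
    out = []
    for r in r1:          # outer loop over R1
        for s in r2:      # inner loop over R2
            if r["C"] == s["C"]:
                out.append((r["B"], r["C"], s["D"]))
    return out

R1 = [{"B": "a", "C": 10}, {"B": "a", "C": 20},
      {"B": "b", "C": 10}, {"B": "d", "C": 30}]
R2 = [{"C": 10, "D": "cat"}, {"C": 40, "D": "dog"},
      {"C": 15, "D": "bat"}, {"C": 20, "D": "rat"}]

# nested_loop_join(R1, R2)
# → [('a', 10, 'cat'), ('a', 20, 'rat'), ('b', 10, 'cat')]
```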

Page 12: CS 345: Topics in Data Warehousing

Nested Loop Join

• In theory, multi-way nested loop joins are possible
  – Joining n relations → n nested “for” loops
• In practice, only binary (2-way) join operators are implemented
  – Joining n relations → cascade of n-1 binary joins
• Nested-loop join terminology:
  – Outer input: scanned in outer loop of join
  – Inner input: scanned in inner loop of join

[Diagram: cascade of binary joins — (A ⋈ B) ⋈ C]

Page 13: CS 345: Topics in Data Warehousing

Analysis of Nested Loop Join

• Outer has m tuples, inner has n tuples
• CPU complexity = O(m*n)
  – For each of m outer tuples, compare with each of n inner tuples
  – Quadratic running time
• What about I/O complexity?
  – Depends on amount of memory available
  – We’ll assume B tuples per page
  – Consider four scenarios:
    • Use 2 pages of memory
    • Use (m/B + n/B) pages of memory
    • Use (n/B + 1) pages of memory
    • Use (k + 1) pages of memory, for k < m/B

Page 14: CS 345: Topics in Data Warehousing

Nested Loop Join (2 pages)

• One page for outer, one page for inner
• NLJ algorithm:
  – Load 1st outer block into memory
    • Load all n/B blocks of inner, one after another
  – Load 2nd outer block into memory
    • Load all n/B blocks of inner, one after another
  – …
  – Load (m/B)th outer block into memory
    • Load all n/B blocks of inner, one after another
• Total I/O cost = m/B * (n/B + 1)

Page 15: CS 345: Topics in Data Warehousing

Nested Loop Join (m/B+n/B pages)

• Load both inputs entirely into memory

• All computation takes place in memory

• No further paging necessary

• Total I/O cost = m/B + n/B

• Requires a lot of memory
  – Practical if both inputs are small

Page 16: CS 345: Topics in Data Warehousing

Nested Loop Join (n/B + 1 pages)

• Allocate 1 page for outer
• Load entire inner into memory (n/B pages)
• NLJ algorithm:
  – Load entire inner into memory
  – Load 1st outer block into memory
    • Inner is already in memory → no additional I/O needed
  – Load 2nd outer block into memory
    • Inner is already in memory → no additional I/O needed
  – …
  – Load (m/B)th outer block into memory
    • Inner is already in memory → no additional I/O needed
• Total I/O cost = m/B + n/B
  – Same I/O cost as previous slide
  – Practical if inner input is small

Page 17: CS 345: Topics in Data Warehousing

Nested Loop Join (k+1 pages)

• Allocate 1 page for inner
• Load k pages of outer into memory at a time
• NLJ algorithm:
  – Load 1st group of k outer blocks into memory
    • Load all n/B blocks of inner, one after another
  – Load 2nd group of k outer blocks into memory
    • Load all n/B blocks of inner, one after another
  – …
  – Load (m/kB)th group of k outer blocks into memory
    • Load all n/B blocks of inner, one after another
• Total I/O cost = m/kB * (n/B + k)
  – Reduction from the 2-page case by a factor of k
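The I/O formulas from the last few slides can be checked with a small cost model. This sketch counts page I/Os only and assumes m and n divide evenly by B and k:

```python
def nlj_io_cost(m, n, B, k):
    """Page I/Os for block nested-loop join with k+1 memory pages:
    each group of k outer pages is read once (k I/Os), and the whole
    inner (n/B pages) is rescanned for each group."""
    outer_pages, inner_pages = m // B, n // B
    groups = outer_pages // k          # m/(kB) groups of outer pages
    return groups * (inner_pages + k)  # = m/kB * (n/B + k)

# k = 1 recovers the 2-page case, m/B * (n/B + 1);
# k = m/B reads each input exactly once, m/B + n/B.
```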

Page 18: CS 345: Topics in Data Warehousing

Summary of NLJ Analysis

• CPU usage is O(m*n)
  – Quadratic in size of inputs
• I/O usage depends on memory allocated
  – If one input is small:
    • Make the small input the inner
    • I/O complexity is O(m/B + n/B)
    • Linear in size of inputs (the minimum possible)
  – Otherwise:
    • Make the small input the outer
      – Slightly better than the other way around
    • I/O complexity is O(m/kB * (n/B + k))
      – When k+1 pages of memory are available
      – Approximately mn/kB²

Page 19: CS 345: Topics in Data Warehousing

Improving on Nested Loops Join

• For each outer tuple, we scan the entire inner relation to find matching tuples

• Maybe we can do better by using data structures

• Two main ideas:
  – Sorting
  – Hashing

Page 20: CS 345: Topics in Data Warehousing

Sorting

• Using sorting to improve NLJ
  – First we sort the inner relation
  – Then, for each outer tuple, find matching inner tuple(s) using binary search over the sorted inner
• CPU usage improves
  – O(mn) → O(m log n) + sort cost
  – Sub-quadratic
• This strategy is indexed nested-loop join
  – To be covered in more detail on Thursday
• Implementation details
  – Use a B-Tree instead of a binary tree as the search structure
  – Larger branching factor → better I/O efficiency

Page 21: CS 345: Topics in Data Warehousing

Merge Join

• What if we sort both inputs instead of one?
  – No need to perform binary search for each outer tuple
  – Since outer tuples are also in sorted order, access to the inner relation is structured
    • Binary search accesses pages in arbitrary order → random I/O
    • Merge join uses sequential I/O
• CPU usage improves
  – O(mn) → O(m + n) + sort costs

[Diagram: indexed NLJ performs a random-I/O binary search into the sorted inner for each outer tuple; merge join scans both sorted inputs sequentially in a single pass]
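The merge step itself can be sketched as follows (both inputs are assumed already sorted on the join key; the sorting cost is separate):

```python
def merge_join(r1, r2, key):
    """Single sequential pass over two sorted inputs.
    For duplicate keys, every matching pair is emitted."""
    out, i, j = [], 0, 0
    while i < len(r1) and j < len(r2):
        a, b = r1[i][key], r2[j][key]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            jj = j                        # scan the run of equal keys in r2
            while jj < len(r2) and r2[jj][key] == a:
                out.append((r1[i], r2[jj]))
                jj += 1
            i += 1                        # next r1 tuple may match the same run
    return out
```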

Page 22: CS 345: Topics in Data Warehousing

External-Memory Sorting

• How to efficiently sort large data sets?
  – Consider I/O complexity
• Sort a relation with n tuples, B tuples per block
  – Case 1: Use n/B pages of memory
  – Case 2: Use k+1 pages of memory
• Case 1: Use n/B pages of memory
  – Read the relation into memory
    • n/B sequential disk reads
  – Sort in main memory using quicksort
    • No disk access required
  – Write the sorted output back to disk
    • n/B sequential disk writes
  – Total: 2n/B disk I/Os (all sequential)
• Requires lots of memory → only useful for small inputs

Page 23: CS 345: Topics in Data Warehousing

External-Memory Sorting

• Case 2: Use k pages of memory
  – k ≥ sqrt(n/B)
• Two phases (run generation + merge)
• Run generation
  – Repeat for each group of k blocks:
    • Read group of k blocks into memory
    • Sort using quicksort
    • Write sorted “run” back out to disk
• Merge
  – Read 1st page from each run into memory
  – Merge to produce globally sorted output
  – Write output pages to disk as they are filled
  – As a page is exhausted, replace it by the next page in its run
• Each block read + written once per phase
  – Total disk I/Os: 4n/B

[Diagram: memory buffers merging sorted runs from disk into the final output]
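The two phases can be sketched in a few lines, simulating disk-resident runs with in-memory lists; heapq.merge plays the role of the k-way merge that reads one page per run:

```python
import heapq

def external_sort(data, k):
    """Two-phase external sort sketch.
    Run generation: sort each group of k 'blocks' (here, k items).
    Merge: k-way merge of the sorted runs into the final output."""
    runs = [sorted(data[i:i + k]) for i in range(0, len(data), k)]
    return list(heapq.merge(*runs))
```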

Page 24: CS 345: Topics in Data Warehousing

Hashing

• Store inner input in a hash table
  – Tuples grouped together into buckets using a hash function
    • Apply hash function to the join attributes (called the key) to determine bucket
  – Idea: all records with the same key must be in the same bucket
• For each outer tuple
  – Hash key to determine which bucket it goes in
  – Scan that bucket for matching inner tuples
• CPU usage improves
  – O(mn) → O(m + n)
    • Assuming a good hash function with few collisions
    • Collision = tuples with different keys hash to same bucket
• Only useful for equality joins
  – Not for join conditions like WHERE A.foo > B.bar
  – 99% of all joins are equality joins
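An in-memory sketch of this description, building the table on the inner input and probing with the outer:

```python
from collections import defaultdict

def hash_join(outer, inner, key):
    """Build a hash table on the inner input, bucketed by join key,
    then probe it once per outer tuple: O(m + n) expected."""
    buckets = defaultdict(list)
    for s in inner:
        buckets[s[key]].append(s)          # build phase
    out = []
    for r in outer:
        for s in buckets.get(r[key], []):  # probe phase: only this bucket
            out.append((r, s))
    return out
```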

Page 25: CS 345: Topics in Data Warehousing

Hash Join Analysis

• Hash join terminology
  – Inner input is called the build input
  – Outer input is called the probe input
  – Build a hash table on one input, probe into that hash table using the other
• Probe has m tuples, build has n tuples
• Using (n/B + 1) pages of memory:
  – Allocate 1 page for probe input
  – Read build input into memory + create hash table
    • Uses n/B pages
  – Read 1st page of probe input
    • Hash each tuple and search its bucket for matching build tuples
    • Requires no additional I/O
  – Repeat for 2nd and subsequent pages of probe input
  – I/O complexity = O(n/B + m/B)
  – Suitable when one input is small

Page 26: CS 345: Topics in Data Warehousing

Partitioned Hash Join

• Two-phase algorithm (Partition + Join)
  – Uses k blocks of memory
• Phase 1: Partition
  – Performed separately for build and probe inputs
  – Read 1st page into memory
  – Assign each tuple to a partition 1..k using a hash function
  – Each partition is allocated 1 block of memory
    • Tuples assigned to the same partition are placed together
    • When the block fills up, write it to disk
    • Each partition stored in a contiguous region on disk
  – Repeat for 2nd and subsequent pages
• Phase 2: Join
  – Join Partition 1 of probe input with Partition 1 of build input using ordinary hash join
    • Use a secondary hash function
  – Repeat with Partition 2 of each input, etc.
• A form of recursion
  – Partitioning creates smaller problem instances
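Both phases can be sketched as follows. Partitions are simulated in memory (on disk each would be a contiguous spill file), and Python's dict stands in for the secondary hash function:

```python
def partition(tuples, key, k):
    """Phase 1: hash each tuple's join key into one of k partitions."""
    parts = [[] for _ in range(k)]
    for t in tuples:
        parts[hash(t[key]) % k].append(t)
    return parts

def partitioned_hash_join(probe, build, key, k=4):
    """Phase 2: join matching partitions with an ordinary hash join.
    Tuples in partition i of one input can only match partition i
    of the other, so each pass fits a small table in memory."""
    out = []
    for p_part, b_part in zip(partition(probe, key, k),
                              partition(build, key, k)):
        table = {}
        for s in b_part:
            table.setdefault(s[key], []).append(s)
        out.extend((r, s) for r in p_part for s in table.get(r[key], []))
    return out
```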

Page 27: CS 345: Topics in Data Warehousing

Partition Phase

[Diagram: unsorted input read from disk into memory, hashed into k partition buffers, each written back to disk as it fills. (Similar to the merge phase of merge join, only backwards)]

Page 28: CS 345: Topics in Data Warehousing

Summary of Join Methods

• When 1 relation fits entirely in memory, all join methods have linear I/O complexity
• When both relations are large:
  – NLJ has quadratic I/O complexity
  – Merge join and hash join have linear I/O complexity
• CPU complexity
  – Quadratic for NLJ
  – O(n log n) for merge join
  – Linear for hash join
• Rules of thumb:
  – Use nested loop join for very small inputs
  – Use merge join for large inputs that are already sorted
  – Use hash join for large, unsorted inputs

Page 29: CS 345: Topics in Data Warehousing

Query Plans with Multiple Joins

• Two kinds of execution:
  – Pipelined
    • No need to store intermediate results on disk
    • Available memory must be shared between joins
  – Materialized
    • No memory contention between joins
    • Spend I/Os writing intermediate results to disk
• Blocking operators
  – Sort, partition, build hash table
  – Pipelined execution is disallowed
• Interesting orders
  – Sorted input → sorted output
    • NLJ: output is sorted by same key as outer input
    • Merge join: output is sorted by same key as inputs
  – Partitioned input → partitioned output
    • Hash join: output is partitioned by same key as inputs
  – Take advantage of interesting orders
    • Eliminate sorting / partitioning phases of later joins

[Diagram: multi-join plan tree — (A ⋈ B) ⋈ C]

Page 30: CS 345: Topics in Data Warehousing

Grouping

• First partition tuples on grouping attributes
  – Tuples in same group placed together
  – Tuples in different groups separated
• Then scan tuples in each partition, compute aggregate expressions
• Two techniques for partitioning
  – Sorting
    • Sort by the grouping attributes
    • All tuples with same grouping attributes will appear together in the sorted list
  – Hashing
    • Hash by the grouping attributes
    • All tuples with same grouping attributes will hash to same bucket
    • Sort / re-hash within each bucket to resolve collisions
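Both partitioning techniques can be sketched for a SUM aggregate; the column names are illustrative:

```python
from itertools import groupby

def group_sum_sort(rows, gcol, acol):
    """Sort-based grouping: sort on the grouping column,
    then scan each run of equal keys to compute the aggregate."""
    rows = sorted(rows, key=lambda r: r[gcol])
    return {g: sum(r[acol] for r in run)
            for g, run in groupby(rows, key=lambda r: r[gcol])}

def group_sum_hash(rows, gcol, acol):
    """Hash-based grouping: one pass, one accumulator per bucket."""
    acc = {}
    for r in rows:
        acc[r[gcol]] = acc.get(r[gcol], 0) + r[acol]
    return acc
```

Both produce the same groups; the choice mirrors the merge join vs. hash join trade-off above.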