CS 345: Topics in Data Warehousing Tuesday, October 19, 2004


Page 1: CS 345: Topics in Data Warehousing

CS 345:Topics in Data Warehousing

Tuesday, October 19, 2004

Page 2: CS 345: Topics in Data Warehousing

Review of Thursday’s Class

• Bridge tables
  – Hierarchies
  – Multi-Valued Dimensions
• Extraction-Transformation-Load
  – Data staging area vs. data warehouse
  – Assigning surrogate keys
  – Detecting changed rows
  – Detecting duplicate dimension rows

Page 3: CS 345: Topics in Data Warehousing

Outline of Today’s Class

• Database System Architecture
  – Memory management
  – Secondary storage (disk)
  – Query planning process
• Joins
  – Nested Loop Join
  – Merge Join
  – Hash Join
• Grouping
  – Sort vs. Hash

Page 4: CS 345: Topics in Data Warehousing

Memory Management

• Modern operating systems use virtual memory
  – Address space larger than physical main memory
  – Layer of indirection between virtual addresses and physical addresses
  – Physical memory serves as a cache for most recently accessed memory locations
  – Infrequently accessed memory locations written to disk when cache fills up
• Memory transferred in large blocks called pages
  – Disk access: latency is large relative to throughput
  – Benefit from locality of reference
  – Transfer overheads amortized over many requests

Page 5: CS 345: Topics in Data Warehousing

Memory Management

• Database management systems (DBMSs) use a similar model to virtual memory
  – Data is stored persistently on disk
  – Main memory serves as cache for frequently accessed data
• This memory cache is called the buffer pool
  – Data organized into blocks called pages
• DBMSs override built-in virtual memory
  – Database operations are very data intensive
  – Careful management of disk access essential for good performance
  – DBMS has specialized knowledge of its data access patterns
  – Specialized memory management includes:
    • Prefetching pages likely to be needed in the future
    • Essential / frequently accessed pages pinned in memory
    • Multiple buffer pools with different parameters for different needs
      – Parameters: page size, page replacement policy
      – Needs: relations, indexes, temporary results, sort space

Page 6: CS 345: Topics in Data Warehousing

Disk Access Time

• Disk access time = seek cost + rotational latency + transfer cost
• Seek cost
  – Move the disk head to the right place
  – Physical machinery
• Rotational latency
  – Wait for the disk to spin
  – Average wait = ½ rotation
• Transfer cost
  – Read the data while the disk spins

[Diagram: disk drive, illustrating seek cost and rotational latency]

Page 7: CS 345: Topics in Data Warehousing

Random vs. Sequential I/O

• Seek cost and rotational latency are overhead
  – Independent of the transfer size
  – ~20 ms on commodity disks
  – One large transfer is much cheaper than many small transfers
• Random vs. Sequential I/O
  – I/O = Input/Output
  – Random I/O: read/write an arbitrary page
  – Sequential I/O: read/write the next page
  – Sequential I/O is 10-20x faster than random I/O
• Avoiding random I/O will be a theme
  – Cost of sequential I/O is less
  – Prefetching easy when access patterns predictable
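To make the overhead concrete, here is a back-of-envelope sketch. The ~20 ms overhead figure comes from the slide; the page size and transfer rate are assumed purely for illustration:

```python
def io_time_ms(n_pages, page_kb=8.0, overhead_ms=20.0, mb_per_s=50.0, random_io=True):
    """Rough disk I/O time model (illustrative numbers):
    random I/O pays the seek + rotational overhead on every page,
    while one sequential transfer pays it only once."""
    transfer_ms = n_pages * page_kb / 1024.0 / mb_per_s * 1000.0
    overhead = overhead_ms * (n_pages if random_io else 1)
    return overhead + transfer_ms
```

With these assumed parameters, reading 1,000 pages takes roughly 20.2 s randomly but only about 0.18 s sequentially, which is why batching transfers pays off.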

Page 8: CS 345: Topics in Data Warehousing

Query Planning Process

• Many different query plans are possible
  – Logically equivalent ways of answering the same query
• Job of the query optimizer: choose the most efficient plan for a given query
  – Most efficient = minimize resource consumption
  – Resources = CPU, disk I/O
  – Consuming fewer resources → faster query responses
• Design of query optimizer
  – Rule-based
    • Apply heuristic transformations (e.g. push selections down)
  – Cost-based
    • Estimate cost (resource usage) of each possible query plan
    • Choose the plan with lowest estimated cost
    • Cost estimation uses statistics collected by the optimizer
  – In practice, optimizers use a combination of the two
    • Search space of possible plans is too large to explore exhaustively
    • Heuristics guide which types of plans are considered for cost-based optimization

Page 9: CS 345: Topics in Data Warehousing

[Diagram: query planning pipeline — SQL query → parse → parse tree → query rewriting → best logical query plan (informed by statistics) → physical plan generation → best physical query plan → execute → result]

Page 10: CS 345: Topics in Data Warehousing

Join Processing

• OLAP queries involve joins
  – Fact table joins to dimension tables
  – Primary-key / foreign-key joins
• Choice of join algorithm has large impact on query cost
• Naïve join processing algorithm, to join relations A and B:
  – Generate the “Cartesian product” of A and B
    • Each A tuple is combined with each B tuple
  – Apply selection conditions to each tuple in the Cartesian product
• Problem: the intermediate result can be huge!
  – Relation A has 10 million rows
  – Relation B has 100,000 rows
  – A x B has 1 trillion rows (many terabytes of data)
• Need a more efficient, logically equivalent procedure

Page 11: CS 345: Topics in Data Warehousing

Nested Loop Join (NLJ)

• NLJ (conceptually):

  for each r ∈ R1 do
    for each s ∈ R2 do
      if r.C = s.C then output (r, s) pair

  R1:          R2:
  B  C         C   D
  a  10        10  cat
  a  20        40  dog
  b  10        15  bat
  d  30        20  rat
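The pseudocode above maps directly to an in-memory sketch; the relation and column names follow the R1/R2 example on the slide:

```python
def nested_loop_join(r1, r2):
    """Nested loop join on column C: compare every (r, s) pair."""
    out = []
    for r in r1:          # outer loop over R1
        for s in r2:      # inner loop over R2
            if r["C"] == s["C"]:
                out.append((r["B"], r["C"], s["D"]))
    return out

R1 = [{"B": "a", "C": 10}, {"B": "a", "C": 20},
      {"B": "b", "C": 10}, {"B": "d", "C": 30}]
R2 = [{"C": 10, "D": "cat"}, {"C": 40, "D": "dog"},
      {"C": 15, "D": "bat"}, {"C": 20, "D": "rat"}]

# nested_loop_join(R1, R2)
# → [('a', 10, 'cat'), ('a', 20, 'rat'), ('b', 10, 'cat')]
```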

Page 12: CS 345: Topics in Data Warehousing

Nested Loop Join

• In theory, multi-way nested loop joins are possible
  – Joining n relations → n nested “for” loops
• In practice, only binary (2-way) join operators are implemented
  – Joining n relations → cascade of n-1 binary joins
• Nested-loop join terminology:
  – Outer input: scanned in outer loop of join
  – Inner input: scanned in inner loop of join

[Diagram: cascade of binary joins — (A ⋈ B) ⋈ C]

Page 13: CS 345: Topics in Data Warehousing

Analysis of Nested Loop Join

• Outer has m tuples, inner has n tuples
• CPU complexity = O(m*n)
  – For each of m outer tuples, compare with each of n inner tuples
  – Quadratic running time
• What about I/O complexity?
  – Depends on amount of memory available
  – We’ll assume B tuples per page
  – Consider four scenarios:
    • Use 2 pages of memory
    • Use (m/B + n/B) pages of memory
    • Use (n/B + 1) pages of memory
    • Use (k + 1) pages of memory, for k < m/B

Page 14: CS 345: Topics in Data Warehousing

Nested Loop Join (2 pages)

• One page for outer, one page for inner
• NLJ algorithm:
  – Load 1st outer block into memory
    • Load all n/B blocks of inner, one after another
  – Load 2nd outer block into memory
    • Load all n/B blocks of inner, one after another
  – …
  – Load (m/B)th outer block into memory
    • Load all n/B blocks of inner, one after another
• Total I/O cost = m/B * (n/B + 1)

Page 15: CS 345: Topics in Data Warehousing

Nested Loop Join (m/B+n/B pages)

• Load both inputs entirely into memory

• All computation takes place in memory

• No further paging necessary

• Total I/O cost = m/B + n/B

• Requires a lot of memory
  – Practical if both inputs are small

Page 16: CS 345: Topics in Data Warehousing

Nested Loop Join (n/B + 1 pages)

• Allocate 1 page for outer
• Load entire inner into memory (n/B pages)
• NLJ algorithm:
  – Load entire inner into memory
  – Load 1st outer block into memory
    • Inner is already in memory → no additional I/O needed
  – Load 2nd outer block into memory
    • Inner is already in memory → no additional I/O needed
  – …
  – Load (m/B)th outer block into memory
    • Inner is already in memory → no additional I/O needed
• Total I/O cost = m/B + n/B
  – Same I/O cost as previous slide
  – Practical if inner input is small

Page 17: CS 345: Topics in Data Warehousing

Nested Loop Join (k+1 pages)

• Allocate 1 page for inner
• Load k pages of outer into memory at a time
• NLJ algorithm:
  – Load 1st group of k outer blocks into memory
    • Load all n/B blocks of inner, one after another
  – Load 2nd group of k outer blocks into memory
    • Load all n/B blocks of inner, one after another
  – …
  – Load (m/kB)th group of k outer blocks into memory
    • Load all n/B blocks of inner, one after another
• Total I/O cost = m/kB * (n/B + k)
  – Reduction from the 2-page case by a factor of k
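The I/O formulas from the last few slides can be checked with a small cost model. This sketch counts page I/Os only and assumes m and n divide evenly by B and k:

```python
def nlj_io_cost(m, n, B, k):
    """Page I/Os for block nested-loop join with k+1 memory pages:
    each group of k outer pages is read once (k I/Os), and the whole
    inner (n/B pages) is rescanned for each group."""
    outer_pages, inner_pages = m // B, n // B
    groups = outer_pages // k          # m/(kB) groups of outer pages
    return groups * (inner_pages + k)  # = m/kB * (n/B + k)

# k = 1 recovers the 2-page case, m/B * (n/B + 1);
# k = m/B reads each input exactly once, m/B + n/B.
```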

Page 18: CS 345: Topics in Data Warehousing

Summary of NLJ Analysis

• CPU usage is O(m*n)
  – Quadratic in size of inputs
• I/O usage depends on memory allocated
  – If one input is small:
    • Make the small input the inner
    • I/O complexity is O(m/B + n/B)
    • Linear in size of inputs (the minimum possible)
  – Otherwise:
    • Make the small input the outer
      – Slightly better than the other way around
    • I/O complexity is O(m/kB * (n/B + k))
      – When k+1 pages of memory are available
      – Approximately mn/kB²

Page 19: CS 345: Topics in Data Warehousing

Improving on Nested Loops Join

• For each outer tuple, we scan the entire inner relation to find matching tuples

• Maybe we can do better by using data structures

• Two main ideas:
  – Sorting
  – Hashing

Page 20: CS 345: Topics in Data Warehousing

Sorting

• Using sorting to improve NLJ
  – First we sort the inner relation
  – Then, for each outer tuple, find matching inner tuple(s) using binary search over the sorted inner
• CPU usage improves
  – O(mn) → O(m log n) + sort cost
  – Sub-quadratic
• This strategy is indexed nested-loop join
  – To be covered in more detail on Thursday
• Implementation details
  – Use a B-Tree instead of a binary tree as the search structure
  – Larger branching factor → better I/O efficiency

Page 21: CS 345: Topics in Data Warehousing

Merge Join

• What if we sort both inputs instead of one?
  – No need to perform binary search for each outer tuple
  – Since outer tuples are also in sorted order, access to the inner relation is structured
    • Binary search accesses pages in arbitrary order → random I/O
    • Merge join uses sequential I/O
• CPU usage improves
  – O(mn) → O(m + n) + sort costs

[Diagram: indexed NLJ performs a random-I/O binary search into the sorted inner for each outer tuple; merge join scans both sorted inputs sequentially in a single pass]
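The merge step itself can be sketched as follows (both inputs are assumed already sorted on the join key; the sorting cost is separate):

```python
def merge_join(r1, r2, key):
    """Single sequential pass over two sorted inputs.
    For duplicate keys, every matching pair is emitted."""
    out, i, j = [], 0, 0
    while i < len(r1) and j < len(r2):
        a, b = r1[i][key], r2[j][key]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            jj = j                        # scan the run of equal keys in r2
            while jj < len(r2) and r2[jj][key] == a:
                out.append((r1[i], r2[jj]))
                jj += 1
            i += 1                        # next r1 tuple may match the same run
    return out
```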

Page 22: CS 345: Topics in Data Warehousing

External-Memory Sorting

• How to efficiently sort large data sets?
  – Consider I/O complexity
• Sort a relation with n tuples, B tuples per block
  – Case 1: Use n/B pages of memory
  – Case 2: Use k+1 pages of memory
• Case 1: Use n/B pages of memory
  – Read the relation into memory
    • n/B sequential disk reads
  – Sort in main memory using quicksort
    • No disk access required
  – Write the sorted output back to disk
    • n/B sequential disk writes
  – Total: 2n/B disk I/Os (all sequential)
• Requires lots of memory → only useful for small inputs

Page 23: CS 345: Topics in Data Warehousing

External-Memory Sorting

• Case 2: Use k pages of memory
  – k ≥ sqrt(n/B)
• Two phases (run generation + merge)
• Run generation
  – Repeat for each group of k blocks:
    • Read group of k blocks into memory
    • Sort using quicksort
    • Write sorted “run” back out to disk
• Merge
  – Read 1st page from each run into memory
  – Merge to produce globally sorted output
  – Write output pages to disk as they are filled
  – As a page is exhausted, replace it by the next page in its run
• Each block read + written once per phase
  – Total disk I/Os: 4n/B

[Diagram: memory buffers merging sorted runs from disk into the final output]
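The two phases can be sketched in a few lines, simulating disk-resident runs with in-memory lists; heapq.merge plays the role of the k-way merge that reads one page per run:

```python
import heapq

def external_sort(data, k):
    """Two-phase external sort sketch.
    Run generation: sort each group of k 'blocks' (here, k items).
    Merge: k-way merge of the sorted runs into the final output."""
    runs = [sorted(data[i:i + k]) for i in range(0, len(data), k)]
    return list(heapq.merge(*runs))
```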

Page 24: CS 345: Topics in Data Warehousing

Hashing

• Store inner input in a hash table
  – Tuples grouped together into buckets using a hash function
    • Apply hash function to the join attributes (called the key) to determine bucket
  – Idea: all records with the same key must be in the same bucket
• For each outer tuple
  – Hash key to determine which bucket it goes in
  – Scan that bucket for matching inner tuples
• CPU usage improves
  – O(mn) → O(m + n)
    • Assuming a good hash function with few collisions
    • Collision = tuples with different keys hash to same bucket
• Only useful for equality joins
  – Not for join conditions like WHERE A.foo > B.bar
  – 99% of all joins are equality joins
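An in-memory sketch of this description, building the table on the inner input and probing with the outer:

```python
from collections import defaultdict

def hash_join(outer, inner, key):
    """Build a hash table on the inner input, bucketed by join key,
    then probe it once per outer tuple: O(m + n) expected."""
    buckets = defaultdict(list)
    for s in inner:
        buckets[s[key]].append(s)          # build phase
    out = []
    for r in outer:
        for s in buckets.get(r[key], []):  # probe phase: only this bucket
            out.append((r, s))
    return out
```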

Page 25: CS 345: Topics in Data Warehousing

Hash Join Analysis

• Hash join terminology
  – Inner input is called the build input
  – Outer input is called the probe input
  – Build a hash table on one input, probe into that hash table using the other
• Probe has m tuples, build has n tuples
• Using (n/B + 1) pages of memory:
  – Allocate 1 page for probe input
  – Read build input into memory + create hash table
    • Uses n/B pages
  – Read 1st page of probe input
    • Hash each tuple and search its bucket for matching build tuples
    • Requires no additional I/O
  – Repeat for 2nd and subsequent pages of probe input
  – I/O complexity = O(n/B + m/B)
  – Suitable when one input is small

Page 26: CS 345: Topics in Data Warehousing

Partitioned Hash Join

• Two-phase algorithm (Partition + Join)
  – Uses k blocks of memory
• Phase 1: Partition
  – Performed separately for build and probe inputs
  – Read 1st page into memory
  – Assign each tuple to a partition 1..k using a hash function
  – Each partition is allocated 1 block of memory
    • Tuples assigned to the same partition are placed together
    • When the block fills up, write it to disk
    • Each partition stored in a contiguous region on disk
  – Repeat for 2nd and subsequent pages
• Phase 2: Join
  – Join Partition 1 of probe input with Partition 1 of build input using ordinary hash join
    • Use a secondary hash function
  – Repeat with Partition 2 of each input, etc.
• A form of recursion
  – Partitioning creates smaller problem instances
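Both phases can be sketched as follows. Partitions are simulated in memory (on disk each would be a contiguous spill file), and Python's dict stands in for the secondary hash function:

```python
def partition(tuples, key, k):
    """Phase 1: hash each tuple's join key into one of k partitions."""
    parts = [[] for _ in range(k)]
    for t in tuples:
        parts[hash(t[key]) % k].append(t)
    return parts

def partitioned_hash_join(probe, build, key, k=4):
    """Phase 2: join matching partitions with an ordinary hash join.
    Tuples in partition i of one input can only match partition i
    of the other, so each pass fits a small table in memory."""
    out = []
    for p_part, b_part in zip(partition(probe, key, k),
                              partition(build, key, k)):
        table = {}
        for s in b_part:
            table.setdefault(s[key], []).append(s)
        out.extend((r, s) for r in p_part for s in table.get(r[key], []))
    return out
```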

Page 27: CS 345: Topics in Data Warehousing

Partition Phase

[Diagram: unsorted input read from disk into memory, hashed into k partition buffers, each written back to disk as it fills. (Similar to the merge phase of merge join, only backwards)]

Page 28: CS 345: Topics in Data Warehousing

Summary of Join Methods

• When 1 relation fits entirely in memory, all join methods have linear I/O complexity
• When both relations are large:
  – NLJ has quadratic I/O complexity
  – Merge join and hash join have linear I/O complexity
• CPU complexity
  – Quadratic for NLJ
  – O(n log n) for merge join
  – Linear for hash join
• Rules of thumb:
  – Use nested loop join for very small inputs
  – Use merge join for large inputs that are already sorted
  – Use hash join for large, unsorted inputs

Page 29: CS 345: Topics in Data Warehousing

Query Plans with Multiple Joins

• Two kinds of execution:
  – Pipelined
    • No need to store intermediate results on disk
    • Available memory must be shared between joins
  – Materialized
    • No memory contention between joins
    • Spend I/Os writing intermediate results to disk
• Blocking operators
  – Sort, partition, build hash table
  – Pipelined execution is disallowed
• Interesting orders
  – Sorted input → sorted output
    • NLJ: output is sorted by same key as outer input
    • Merge join: output is sorted by same key as inputs
  – Partitioned input → partitioned output
    • Hash join: output is partitioned by same key as inputs
  – Take advantage of interesting orders
    • Eliminate sorting / partitioning phases of later joins

[Diagram: multi-join plan tree — (A ⋈ B) ⋈ C]

Page 30: CS 345: Topics in Data Warehousing

Grouping

• First partition tuples on grouping attributes
  – Tuples in same group placed together
  – Tuples in different groups separated
• Then scan tuples in each partition, compute aggregate expressions
• Two techniques for partitioning
  – Sorting
    • Sort by the grouping attributes
    • All tuples with same grouping attributes will appear together in the sorted list
  – Hashing
    • Hash by the grouping attributes
    • All tuples with same grouping attributes will hash to same bucket
    • Sort / re-hash within each bucket to resolve collisions
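Both partitioning techniques can be sketched for a SUM aggregate; the column names are illustrative:

```python
from itertools import groupby

def group_sum_sort(rows, gcol, acol):
    """Sort-based grouping: sort on the grouping column,
    then scan each run of equal keys to compute the aggregate."""
    rows = sorted(rows, key=lambda r: r[gcol])
    return {g: sum(r[acol] for r in run)
            for g, run in groupby(rows, key=lambda r: r[gcol])}

def group_sum_hash(rows, gcol, acol):
    """Hash-based grouping: one pass, one accumulator per bucket."""
    acc = {}
    for r in rows:
        acc[r[gcol]] = acc.get(r[gcol], 0) + r[acol]
    return acc
```

Both produce the same groups; the choice mirrors the merge join vs. hash join trade-off above.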