claude-wilkins
Parallel Databases
Introduction
Basic idea: use multiple disks, memory, and/or processors to speed up querying.
Measures
– Throughput – how many tasks can be completed in some unit of time.
– Response time – how long does it take to complete one task?
Using parallelism to reduce response time is called speedup.
Using parallelism to increase throughput is called scale up.
Problems
Optimally, we would like linear scale up/speedup. This is not usually the case. Why?
– Start Up Costs
– Interference – different processors need the same resource.
– Communication Costs
– Some parts may not be able to be parallelized.
– Skew – the problem can rarely be broken into equal-sized parts.
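The point that some parts cannot be parallelized can be made concrete with Amdahl's law, a standard model that the slides do not name. The sketch below assumes a hypothetical workload in which 90% of the work is parallelizable; the function name and the 0.9 figure are illustrative only.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Best-case speedup when only `parallel_fraction` of the work can be spread out."""
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of processor count;
    # only the parallel part shrinks as processors are added.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Assumed workload: 90% parallelizable. Speedup flattens well below linear.
for n in (1, 2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

Even before start-up, interference, and communication costs are counted, the serial fraction alone caps the speedup in this model at 1 / (1 - 0.9) = 10.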
Skew Example
Suppose I have 8 processors to do a query. Ideally, I should be able to do it in 1/8 the time. Now suppose the data is distributed this way:
– P1: 5%
– P2: 10%
– P3: 10%
– P4: 5%
– P5: 10%
– P6: 10%
– P7: 25%
– P8: 25%
P7 and P8 each hold 25% of the data, so the query takes at least ¼ of the single-processor time, not 1/8.
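The arithmetic behind this example can be sketched as follows. The sketch assumes that each processor's run time is proportional to its share of the data and that all processors start together, so the query finishes when the most heavily loaded processor does. The function name is illustrative.

```python
# Data shares for P1..P8, taken from the slide.
shares = [0.05, 0.10, 0.10, 0.05, 0.10, 0.10, 0.25, 0.25]


def skewed_speedup(shares):
    """Speedup over one processor when the largest share determines the finish time."""
    return 1.0 / max(shares)


print(len(shares))             # ideal speedup with 8 processors: 8
print(skewed_speedup(shares))  # 4.0, limited by P7 and P8
```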
What Can Be Shared?
Shared Memory
– Advantages:
• dynamic partitioning (any process may be allocated all/some of memory available).
• Cheaper than each processor having its own memory.
• Lower communication cost between processors
– Disadvantages:
• Memory can become a bottleneck.
• Scalability is a problem.
Sharing Continued
Shared Disk
– Advantages:
• Data need not be replicated – no synchronization
• Better scalability
• Fault tolerance may be built into the system
– Disadvantages:
• Single point of failure
• Communication cost is greater
Sharing III
Shared Nothing – really a type of distributed DB
– Advantages:
• Complete parallel solution
• Fewer bottlenecks
• Multiple points of failure (no single failure stops the whole system)
• Scalability
– Disadvantages:
• Cost for the bean counters
• Communication costs are greater
• Multiple points of failure (more components that can fail)
Sharing IV
Hierarchical
– Advantages:
• Gain advantages of speed and scalability
– Disadvantages:
• How to partition?
Disk Partitioning
(Figure: standard RAID levels, from Wikipedia.)
Disk Partitioning for DB Usage
– Round Robin Partitioning – like RAID 5.
– Range Partitioning – all tuples with a column value within some range go to the same partition.
– Hash Partitioning – all tuples with a column value that hashes to the same value go to the same partition.
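The three schemes can be sketched in a few lines each. This is a minimal in-memory illustration, not how any particular DBMS implements them: partitions are plain Python lists, the function names are made up for this example, and `key` is an assumed callable that extracts the partitioning column from a tuple.

```python
from bisect import bisect_right


def round_robin_partition(tuples, n):
    """Tuple i goes to partition i mod n, regardless of its contents."""
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)
    return parts


def range_partition(tuples, key, cutoffs):
    """Sorted `cutoffs` define len(cutoffs) + 1 ranges; a key equal to a cutoff goes to the higher range."""
    parts = [[] for _ in range(len(cutoffs) + 1)]
    for t in tuples:
        parts[bisect_right(cutoffs, key(t))].append(t)
    return parts


def hash_partition(tuples, key, n):
    """Tuples whose key hashes to the same bucket land in the same partition."""
    parts = [[] for _ in range(n)]
    for t in tuples:
        # Python's built-in hash() is stable for ints; for strings it is
        # randomized per process, so a real system would use a fixed hash function.
        parts[hash(key(t)) % n].append(t)
    return parts


salaries = [5, 15, 25, 10]
print(round_robin_partition(salaries, 2))
print(range_partition(salaries, key=lambda s: s, cutoffs=[10, 20]))
print(hash_partition(salaries, key=lambda s: s, n=2))
```

Note that only range and hash partitioning look at the column value, which matters when deciding which partitions a given query has to touch.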
Usage
Which is best for:
– Simple selects – unique match
– Simple selects – non-unique match
– Range queries
– Print unsorted
– Print sorted
Skew In This Context
Attribute-Value Skew – many tuples with the same value for the partitioning column.
Partition Skew – some partitions end up with more tuples, even if they have different values.
– Change the ranges – use a histogram to better predict cut-offs.
Time-Value Skew – even a good partitioning acquires skew over time.
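The histogram remedy for partition skew can be sketched as choosing range cut-offs so that each range holds about the same number of tuples (an equi-depth histogram). The version below is an assumption-laden illustration: it sorts the full list of column values, where a real system would more likely work from a sample or from stored statistics, and the function name is made up.

```python
def equi_depth_cutoffs(values, n_partitions):
    """Pick n_partitions - 1 cut-offs so each range holds roughly the same number of values."""
    ordered = sorted(values)
    step = len(ordered) / n_partitions
    # The i-th cut-off is the value found i/n_partitions of the way through the sorted data.
    return [ordered[int(i * step)] for i in range(1, n_partitions)]


# Equal-width ranges would put almost everything in the first partition;
# equi-depth cut-offs follow the data instead.
column = [1, 1, 2, 2, 3, 50, 60, 70, 80, 90, 100, 1000]
print(equi_depth_cutoffs(column, 3))
```

This helps with partition skew only. With attribute-value skew, all tuples sharing one value still fall into the same range no matter where the cut-offs are placed.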
Parallel Joins
R ⨝(A=B) S
– Range Partition R on A and S on B, using the same ranges. Send matching ranges of R and S to the same partition.
– Hash Partition – would also work, using the same hash function on A and B.
R ⨝(A<B) S
– Partition R and replicate S.
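For the equi-join case, the hash-partitioned variant can be sketched as below. This is a single-process illustration under stated assumptions: relations are lists of tuples, `a` and `b` are the positions of the join columns in R and S, the per-partition joins run one after another where a parallel system would run each on its own processor, and all function names are hypothetical.

```python
from collections import defaultdict


def hash_partition(rows, key_index, n):
    """Send each row to the partition chosen by hashing its join column."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key_index]) % n].append(row)
    return parts


def local_hash_join(r_part, s_part, a, b):
    """Join one pair of partitions: build a hash index on S, probe it with R."""
    index = defaultdict(list)
    for s in s_part:
        index[s[b]].append(s)
    return [r + s for r in r_part for s in index.get(r[a], [])]


def parallel_equijoin(R, S, a, b, n):
    """R join S on R[a] = S[b], using the same hash function for both relations."""
    r_parts = hash_partition(R, a, n)
    s_parts = hash_partition(S, b, n)
    result = []
    # Matching tuples always share a partition number, so partition i of R
    # only ever needs to meet partition i of S.
    for r_part, s_part in zip(r_parts, s_parts):
        result.extend(local_hash_join(r_part, s_part, a, b))
    return result


R = [("ann", 1), ("bob", 2), ("cy", 3)]
S = [(1, "hr"), (2, "it"), (4, "ops")]
print(sorted(parallel_equijoin(R, S, a=1, b=0, n=2)))
```

The same trick does not work for A<B, because a tuple of R may match tuples of S in any partition; hence the partition-R-and-replicate-S approach.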
Example
Emp(Fn, Minit, LN, SSN, Bdate, Addr, Sex, Salary, SuperSSN, Dno)
– r = 100,000 records
– bf = 5 records/block
– b = 20,000 blocks
Dept(D#, Dname, MGRSSN, MgrStartDate)
– r = 1,250 records
– bf = 10 records/block
– b = 125 blocks
Example Query
I want to perform
Emp ⨝(DNO=D#) Dept
How can I parallelize this and how much can I save?
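One way to start on this question is a rough block-read estimate. The sketch below is not the slides' answer. It assumes each relation is read exactly once, that a processor's share of Dept fits in its memory, that the data is spread evenly (no skew), and it ignores the cost of the initial partitioning and of communication. The processor count of 8 and the function names are illustrative; only the block counts come from the example.

```python
import math

EMP_BLOCKS = 20_000   # b for Emp, from the example
DEPT_BLOCKS = 125     # b for Dept, from the example


def serial_cost():
    """One processor reads both relations once."""
    return EMP_BLOCKS + DEPT_BLOCKS


def partition_both_cost(n):
    """Blocks read per processor when Emp and Dept are both partitioned on the join column."""
    return math.ceil(EMP_BLOCKS / n) + math.ceil(DEPT_BLOCKS / n)


def replicate_dept_cost(n):
    """Blocks read per processor when Emp is partitioned and all of Dept is copied to each."""
    return math.ceil(EMP_BLOCKS / n) + DEPT_BLOCKS


n = 8  # assumed number of processors
print(serial_cost(), partition_both_cost(n), replicate_dept_cost(n))
```

Under these assumptions, replicating the small Dept relation costs each processor only slightly more than partitioning it, which is why replication is attractive when one relation is tiny.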