Parallel Databases

Introduction

Basic idea: use multiple disks, memory, and/or processors to speed up querying.

Measures:

– Throughput – how many tasks can be completed in some unit of time.

– Response time – how long does it take to complete one task?

Using parallelism to reduce response time is called speedup.

Using parallelism to increase throughput is called scale up.
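A minimal sketch of how the two measures are often computed. The formulas are the usual textbook ratios, which is an assumption beyond what the slide states, and the timings are hypothetical.

```python
def speedup(time_small_system: float, time_large_system: float) -> float:
    """Same task on both systems; a value above 1 means the larger system is faster."""
    return time_small_system / time_large_system


def scaleup(time_small_task_small_system: float,
            time_large_task_large_system: float) -> float:
    """Task and system grow by the same factor; 1.0 means linear scale up."""
    return time_small_task_small_system / time_large_task_large_system


# Hypothetical measurements in seconds (not from the slides).
print(speedup(80.0, 10.0))   # 8.0 -> linear speedup on 8 processors
print(scaleup(80.0, 100.0))  # 0.8 -> sub-linear scale up
```

With 8 processors, a speedup of 8.0 or a scale up of 1.0 would be linear.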

Problems

Ideally, we would like linear scale up/speedup.

This is not usually the case. Why?

– Start Up Costs

– Interference – different processors need the same resource.

– Communication Costs

– Some parts cannot be parallelized.

– Skew – the problem is unlikely to break into equal-sized parts.
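The "some parts cannot be parallelized" item can be quantified with Amdahl's law. The slides do not name it, so treat this as a sketch of a standard result. The 90% parallel fraction is a hypothetical value.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Best-case speedup when only parallel_fraction of the work can be spread out."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Hypothetical workload: 90% of the query can run in parallel.
for n in (2, 8, 64):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

Even with 64 processors the speedup stays below 10, because the serial 10% never shrinks.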

Skew Example

Suppose I have 8 processors to do a query. I should be able to do it in 1/8 the time. Now suppose the data is distributed this way:

– P1: 5%

– P2: 10%

– P3: 10%

– P4: 5%

– P5: 10%

– P6: 10%

– P7: 25%

– P8: 25%

P7 and P8 each hold 25% of the data, so the query takes at best ¼ of the serial time rather than 1/8.
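The arithmetic behind this example, as a small sketch. It assumes each processor's time is proportional to its share of the data and ignores every other cost. The function name is made up for illustration.

```python
# Data shares from the example above.
shares = {
    "P1": 0.05, "P2": 0.10, "P3": 0.10, "P4": 0.05,
    "P5": 0.10, "P6": 0.10, "P7": 0.25, "P8": 0.25,
}


def skewed_speedup(shares: dict) -> float:
    """The query finishes when the most heavily loaded processor finishes."""
    return 1.0 / max(shares.values())


print(skewed_speedup(shares))  # 4.0, not the hoped-for 8.0
```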

What Can Be Shared?

Share Memory

– Advantages:

• Dynamic partitioning (any process may be allocated all or some of the available memory).

• Cheaper than each processor having its own memory.

• Lower communication cost between processors

– Disadvantages:

• Memory can become a bottleneck.

• Scalability is a problem.

Sharing Continued

Share Disk

– Advantages:

• Data need not be replicated – no synchronization

• Better scalability

• Fault tolerance may be built into the system

– Disadvantages:

• Single point of failure

• Communication cost is greater

Sharing III

Share Nothing (really a type of distributed DB)

– Advantages:

• Complete parallel solution

• Fewer bottlenecks

• Multiple points of failure (no single failure brings down the whole system)

• Scalability

– Disadvantages:

• Cost for the bean counters

• Communication costs are greater

• Multiple points of failure (more components that can fail)

Sharing IV

Hierarchical

– Advantages:

• Gain advantages of speed and scalability

– Disadvantages:

• How to partition?

Disk Partitioning

Figure source: Wikipedia, "Standard RAID levels"

Disk Partitioning for DB Usage

Round Robin Partitioning – tuples are dealt out to the partitions in turn, like the striping in RAID 5.

Range Partitioning – all tuples with a column value within some range go to the same partition.

Hash Partitioning – all tuples with a column value that hashes to the same value go to the same partition.
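A minimal sketch of the three schemes, assuming in-memory lists stand in for disk partitions. The function names, the `key` callable, and the `cutoffs` list are illustrative choices, not part of the slides.

```python
from bisect import bisect_right


def round_robin_partition(rows, n):
    """Row i goes to partition i mod n, regardless of its contents."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def range_partition(rows, key, cutoffs):
    """cutoffs is sorted ascending; a row lands in the range its key falls into."""
    parts = [[] for _ in range(len(cutoffs) + 1)]
    for row in rows:
        parts[bisect_right(cutoffs, key(row))].append(row)
    return parts


def hash_partition(rows, key, n):
    """Rows whose keys hash to the same value share a partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


values = [1, 5, 10, 15, 20]
print(round_robin_partition(values, 2))
print(range_partition(values, key=lambda v: v, cutoffs=[5, 15]))
print(hash_partition(values, key=lambda v: v, n=2))
```

Python's built-in `hash` is stable for small integers but randomized per process for strings, so a real system would use a fixed hash function.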

Usage

Which is best for:

– Simple selects – unique match

– Simple selects – non-unique match

– Range queries

– Print unsorted

– Print sorted

Skew In This Context

Attribute-Value Skew – many tuples with the same value for the partitioning column.

Partition Skew – some partitions end up with more tuples, even if they have different values.

– Change the ranges – use a histogram to better predict cut-offs.

Time-Value Skew – even a good partitioning acquires skew over time.
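One way to "use a histogram" is to choose equi-depth cut-offs, so that each range holds about the same number of tuples. This is a sketch under that assumption. The sample values and function names are hypothetical.

```python
from bisect import bisect_right


def equi_depth_cutoffs(values, n_partitions):
    """Pick cut-offs so each range holds roughly len(values) / n_partitions values."""
    ordered = sorted(values)
    step = len(ordered) / n_partitions
    return [ordered[int(i * step)] for i in range(1, n_partitions)]


def partition_sizes(values, cutoffs):
    """Count how many values fall into each range defined by the cut-offs."""
    sizes = [0] * (len(cutoffs) + 1)
    for v in values:
        sizes[bisect_right(cutoffs, v)] += 1
    return sizes


sample = [1, 1, 2, 3, 50, 60, 70, 1000]           # hypothetical column values
print(partition_sizes(sample, [250, 500, 750]))   # equal-width ranges: [7, 0, 0, 1]
cutoffs = equi_depth_cutoffs(sample, 4)
print(cutoffs, partition_sizes(sample, cutoffs))  # [2, 50, 70] [2, 2, 2, 2]
```

This helps with partition skew only. With attribute-value skew, all tuples sharing one value still land in the same range.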

Parallel Joins

R ⨝(A=B) S

– Range Partition R on A and S on B. Pass the same ranges off to the same partition.

– Hash Partition – would also work.

R ⨝(A<B) S

– Partition R and replicate S.
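Both strategies can be sketched as follows. The sketch assumes tuples are Python tuples with the join column first, and that a loop over partitions stands in for the processors, so it runs sequentially. All function names are illustrative.

```python
def local_join(r_part, s_part, predicate):
    """Nested-loop join that a single processor would run on its fragments."""
    return [(r, s) for r in r_part for s in s_part if predicate(r[0], s[0])]


def partitioned_equi_join(r, s, n):
    """R join S on A = B: hash both inputs so matching values meet in one partition."""
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for row in r:
        r_parts[hash(row[0]) % n].append(row)
    for row in s:
        s_parts[hash(row[0]) % n].append(row)
    result = []
    for i in range(n):  # each iteration stands in for one processor
        result.extend(local_join(r_parts[i], s_parts[i], lambda a, b: a == b))
    return result


def replicated_join(r, s, n, predicate):
    """R join S on A < B (or any condition): split R, give every processor all of S."""
    result = []
    for i in range(n):
        r_fragment = r[i::n]  # round-robin fragment of R
        result.extend(local_join(r_fragment, s, predicate))
    return result


r = [(1, "r1"), (2, "r2"), (3, "r3")]
s = [(2, "s2"), (3, "s3"), (3, "s3b")]
print(sorted(partitioned_equi_join(r, s, 2)))
print(len(replicated_join(r, s, 2, lambda a, b: a < b)))  # 5
```

Partitioning both inputs works for the equality join because matching values always hash to the same partition. For A<B a tuple of R may match tuples anywhere in S, which is why S is replicated.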

Example

Emp(Fn, Minit, LN, SSN, Bdate, Addr, Sex, Salary, SuperSSN, Dno)

– r = 100,000 records

– bf = 5 records/block

– b = 20,000 blocks

Dept(D#, Dname, MGRSSN, MgrStartDate)

– r = 1250 records

– bf = 10 records/block

– b = 125 blocks

Example Query

I want to perform

Emp ⨝(Dno=D#) Dept

How can I parallelize this and how much can I save?
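One possible back-of-the-envelope answer, under assumptions that are not in the slides: cost is counted as block reads on the busiest processor, each processor's Dept fragment fits in memory so every block is read once, the data is already spread evenly, and partitioning and communication costs are ignored. The choice of 8 processors is also hypothetical.

```python
from math import ceil

EMP_BLOCKS = 20_000   # b for Emp, from the example
DEPT_BLOCKS = 125     # b for Dept, from the example


def serial_cost() -> int:
    """One processor reads both relations once."""
    return EMP_BLOCKS + DEPT_BLOCKS


def partitioned_cost(n: int) -> int:
    """Both relations partitioned on the join column across n processors."""
    return ceil(EMP_BLOCKS / n) + ceil(DEPT_BLOCKS / n)


def replicated_cost(n: int) -> int:
    """Emp partitioned across n processors, the small Dept relation copied to each."""
    return ceil(EMP_BLOCKS / n) + DEPT_BLOCKS


n = 8  # assumed number of processors
print(serial_cost())        # 20125
print(partitioned_cost(n))  # 2516
print(replicated_cost(n))   # 2625
```

Under these assumptions, partitioning both relations gives close to an 8-fold saving, and replicating Dept costs only slightly more because Dept is small. Skew in Dno would reduce the saving.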
