15
Parallel Databases 1

Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Embed Size (px)

Citation preview

Page 1: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Parallel Databases

1

Page 2: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Introduction Basic idea: use multiple disks, memory and/or

processors to speed up querying. Measures

– Throughput – how many tasks can be completed in some unit of time.

– Response time – how long does it take to complete one task?

Using parallelism to increase response time is called speedup.

Using parallelism to increase throughput is called scale up. 2

Page 3: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Problems Optimally, we would like linear scale up/speedup.

This is not usually the case. Why?

– Start Up Costs

– Interference – different processors need the same resource.

– Communication Costs

– Some parts may not be able to be parallelized.

– Skew – Not likely to be able to break problem into equal sized parts.

3

Page 4: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Skew Example Suppose I have 8 processors to do a query. I

should be able to do it in 1/8 the time. Now suppose data is distributed this way:

– P1: 5%

– P2: 10%

– P3: 10%

– P4: 5%

– P5: 10%

– P6: 10%

– P7: 25% -- these only allow ¼ of the time.

– P8: 25% 4

Page 5: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

What Can Be Shared? Share Memory

– Advantages:

• dynamic partitioning (any process may be allocated all/some of memory available).

• Cheaper than each processor having its own memory.

• Lower communication cost between processors

– Disadvantages:

• Memory can become a bottleneck.

• Scalability is a problem.5

Page 6: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Sharing Continued

Share Disk– Advantages:

• Data need not be replicated – no synchronization

• Better scalability

• Fault tolerance may be built into the system

– Disadvantages:

• Single point of failure

• Communication cost is greater

6

Page 7: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Sharing III

Share Nothing -- really a type of distributed DB– Advantages:

• Complete parallel solution

• Less bottlenecks

• Multiple points of failures

• Scalability

– Disadvantages:

• Cost for the bean counters

• Communication costs are greater

• Multiple points of failures7

Page 8: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Sharing IV

Hierarchical– Advantages:

• Gain advantages of speed and scalability

– Disadvantages:

• How to partition?

8

Page 9: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Disk Partitioning

wikipedia-Standard RAID levels

9

Page 10: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Disk Partitioning for DB Usage

Round Robin Partitioning – like RAID 5 Range Partitioning – all tuples with a column

value within some range go to the same partition. Hash Partition – all tuples with a column value

that hash to the same value go to the same partition.

10

Page 11: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Usage

Which is best for– Simple selects – unique match

– Simple selects – non-unique match

– Range queries

– Print unsorted

– Print sorted

11

Page 12: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Skew In This Context

Attribute-Value Skew – many tuples with the same value for the partitioning column.

Partition Skew – some partitions end up with more tuples, even if they have different values.– Change the ranges – use a histogram to better predict

cut-offs.

Time-Value Skew – a good partitioning algorithm acquires skew over time.

12

Page 13: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Parallel Joins

R ⨝(A=B) S– Range Partition R on A and S on B. Pass same ranges

off to the same partition.

– Hash Partition – would also work

R ⨝(A<B) S– Partition R and replicate S.

13

Page 14: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Example

Emp(Fn, Minit, LN, SSN, Bdate, Addr, Sex, Salary, SuperSSN, Dno)

– r = 100,000 records

– bf = 5 records/block

– b = 20,000 blocks

Dept(D#, Dname, MGRSSN, MgrStartDate)– r = 1250 records

– bf = 10 records/block

– b = 125 blocks

14

Page 15: Parallel Databases 77. Introduction 4 Basic idea: use multiple disks, memory and/or processors to speed up querying. 4 Measures –Throughput – how many

Example Query

I want to perform

Emp ⨝(DNO=D#) Dept

How can I parallelize this and how much can I save?

15