Parallel and Distributed Databases • CS263 Lecture 16


Page 1: Parallel and Distributed Databases CS263 Lecture 16

Parallel and Distributed Databases

• CS263 Lecture 16

Page 2: Parallel and Distributed Databases CS263 Lecture 16

LECTURE PLAN

Parallel DBMS - What and Why?

What is a Client/Server DBMS?

Why do we need Distributed DBMSs?

Date’s rules for a Distributed DBMS

Benefits of a Distributed DBMS

Issues associated with a Distributed DBMS

Disadvantages of a Distributed DBMS

Page 3: Parallel and Distributed Databases CS263 Lecture 16

PARALLEL DATABASE SYSTEM

Page 4: Parallel and Distributed Databases CS263 Lecture 16

PARALLEL DBMSs: WHY DO WE NEED THEM?

• More and More Data!

We have databases that hold very large amounts of data, in the order of 10^12 bytes:

1,000,000,000,000 bytes!

• Faster and Faster Access!

We have data applications that need to process data at very high speeds:

10,000s of transactions per second!

SINGLE-PROCESSOR DBMSs AREN’T UP TO THE JOB!

Page 5: Parallel and Distributed Databases CS263 Lecture 16

PARALLEL DBMSs: BENEFITS OF A PARALLEL DBMS

INTERQUERY PARALLELISM

It is possible to process a number of transactions in parallel with each other.

Improves Throughput.

INTRAQUERY PARALLELISM

It is possible to process ‘sub-tasks’ of a transaction in parallel with each other.

Improves Response Time.
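The difference between the two kinds of parallelism can be sketched in a few lines of Python. This is only an illustration, not how a parallel DBMS is implemented: the table `RECORDS`, the stand-in query `scan_sum` and the worker count are all assumptions made for the example, and CPython threads do not give true CPU parallelism, so the sketch shows the structure of the work rather than a real speed gain.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table of 10,000 numeric records (assumption for this sketch).
RECORDS = list(range(1, 10_001))


def scan_sum(rows):
    """Stand-in for a query: scan the given rows and total them."""
    return sum(rows)


def interquery(queries, workers=4):
    """Interquery: run several independent queries side by side.

    Each query still takes as long as before, but more queries finish
    per second, so throughput improves.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scan_sum, queries))


def intraquery(rows, workers=4):
    """Intraquery: split ONE query into sub-tasks and merge the partial results.

    The single query finishes sooner, so its response time improves.
    """
    size = max(1, -(-len(rows) // workers))  # ceiling division, at least 1
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_sum, chunks))
```

Calling `interquery([RECORDS[:10], RECORDS[:100]])` answers two separate queries, while `intraquery(RECORDS)` answers one query by combining four partial scans.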

Page 6: Parallel and Distributed Databases CS263 Lecture 16

PARALLEL DBMSs: HOW TO MEASURE THE BENEFITS

Speed-up.

As you multiply resources by a certain factor, the time taken to execute a transaction should be reduced by the same factor:

10 seconds to scan a DB of 10,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs

Scale-up.

As you multiply resources by a certain factor, the size of a task that can be executed in a given time should be increased by the same factor:

1 second to scan a DB of 1,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs
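The two measures can be written as simple ratios. The sketch below uses the commonly quoted definitions (elapsed time on the small system divided by elapsed time on the large system); the function names are made up for this example, and the figures are the ones from the slide.

```python
import math


def speed_up(elapsed_small_system, elapsed_large_system):
    """Same task, more resources: how many times faster did it run?"""
    return elapsed_small_system / elapsed_large_system


def scale_up(elapsed_small_task_small_system, elapsed_large_task_large_system):
    """Task and resources grown by the same factor: 1.0 means no slowdown."""
    return elapsed_small_task_small_system / elapsed_large_task_large_system


def is_linear_speed_up(resource_factor, elapsed_small_system, elapsed_large_system):
    """Linear (ideal) speed-up: N times the resources gives N times the speed."""
    return math.isclose(speed_up(elapsed_small_system, elapsed_large_system),
                        resource_factor)


# Slide example: 10,000 records, 10 s on 1 CPU, 1 s on 10 CPUs.
slide_speed_up = speed_up(10.0, 1.0)

# Slide example: 1,000 records on 1 CPU and 10,000 records on 10 CPUs, 1 s each.
slide_scale_up = scale_up(1.0, 1.0)
```

With the slide's numbers the speed-up is 10 for 10 times the resources (linear), and the scale-up is 1.0 (linear). A speed-up below the resource factor, or a scale-up below 1.0, is sub-linear.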

Page 7: Parallel and Distributed Databases CS263 Lecture 16

PARALLEL DBMSs: SPEED-UP

[Figure: number of transactions/second plotted against number of CPUs, with two curves: linear speed-up (ideal) and sub-linear speed-up. Axis labels shown: 5 CPUs, 10 CPUs, 16 CPUs; 1000/sec, 1600/sec, 2000/sec.]

Page 8: Parallel and Distributed Databases CS263 Lecture 16

PARALLEL DBMSs: SCALE-UP

[Figure: number of transactions/second plotted against number of CPUs and database size, with two curves: linear scale-up (ideal) and sub-linear scale-up. Labels shown: 5 CPUs with a 1 GB database, 10 CPUs with a 2 GB database; 1000/sec, 900/sec.]

Page 9: Parallel and Distributed Databases CS263 Lecture 16

Shared Memory – Parallel Database Architecture

[Diagram: six CPUs attached to a single shared MEMORY.]

Page 10: Parallel and Distributed Databases CS263 Lecture 16

Shared Disk – Parallel Database Architecture

[Diagram: six CPUs, each with its own memory (M).]

Page 11: Parallel and Distributed Databases CS263 Lecture 16

Shared Nothing – Parallel Database Architecture

[Diagram: five nodes, each a CPU with its own memory (M).]

Page 12: Parallel and Distributed Databases CS263 Lecture 16

MAINFRAME DATABASE SYSTEM

Page 13: Parallel and Distributed Databases CS263 Lecture 16

[Diagram: three DUMB TERMINALS linked to a MAINFRAME COMPUTER by a SPECIALISED NETWORK CONNECTION. Labels: PRESENTATION LOGIC, BUSINESS LOGIC, DATA LOGIC.]

Page 14: Parallel and Distributed Databases CS263 Lecture 16

CLIENT/SERVER DATABASE SYSTEM

Page 15: Parallel and Distributed Databases CS263 Lecture 16

CLIENT/SERVER DBMS: CLIENT PROCESS

Manages user interface

Accepts user data

Processes application/business logic

Generates database requests (SQL)

Transmits database requests to server

Receives results from server

Formats results according to application logic

Presents results to the user

Page 16: Parallel and Distributed Databases CS263 Lecture 16

CLIENT/SERVER DBMS: SERVER PROCESS

Accepts database requests

Processes database requests

Performs integrity checks

Handles concurrent access

Optimises queries

Performs security checks

Enacts recovery routines

Transmits result of database request to client

Page 17: Parallel and Distributed Databases CS263 Lecture 16

CLIENT/SERVER DBMS ARCHITECTURE (FAT CLIENT)

[Diagram: Client #1, Client #2 and Client #3 exchange data requests and data responses with a database server. Labels: PRESENTATION LOGIC, BUSINESS LOGIC, DATA LOGIC.]

Page 18: Parallel and Distributed Databases CS263 Lecture 16

CLIENT/SERVER DBMS ARCHITECTURE (THIN CLIENT)

[Diagram: Client #1, Client #2 and Client #3 exchange data requests and data responses with a database server. Labels: PRESENTATION LOGIC; BUSINESS LOGIC and DATA LOGIC; PL/SQL.]

Page 19: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED PROCESSING ARCHITECTURE

[Diagram: four sites (Leyton, Stratford, Barking, Leytonstone), each a LAN of clients, connected by a WIDE AREA NETWORK to a single DBMS.]

Page 20: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASE SYSTEM

Page 21: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: WHAT IS A DISTRIBUTED DATABASE?

A distributed database system is a collection of logically related databases that co-operate in a transparent manner.

Transparent implies that each user within the system may access all of the data within all of the databases as if they were a single database.

There should be ‘location independence’, i.e. because the user is unaware of where the data is located, it is possible to move the data from one physical location to another without affecting the user.
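Location independence can be illustrated with a toy model. In the sketch below everything is hypothetical (the class `DistributedDB`, its methods and the in-memory "sites"); the point is only that the user's query names a relation, never a site, so moving the data changes the catalog but not the query.

```python
class DistributedDB:
    """Toy model of transparent access to data held at several sites."""

    def __init__(self):
        self.sites = {}    # site name -> {relation name -> list of rows}
        self.catalog = {}  # relation name -> site that currently holds it

    def store(self, site, relation, rows):
        """Place a relation at a site and record its location in the catalog."""
        self.sites.setdefault(site, {})[relation] = list(rows)
        self.catalog[relation] = site

    def move(self, relation, new_site):
        """Relocate a relation; only the catalog entry changes for users."""
        old_site = self.catalog[relation]
        rows = self.sites[old_site].pop(relation)
        self.store(new_site, relation, rows)

    def query(self, relation):
        """The user names the relation only; the system finds the site."""
        site = self.catalog[relation]
        return self.sites[site][relation]


db = DistributedDB()
db.store("Stratford", "Account", [(200, "JONES", 1000.00)])
before = db.query("Account")
db.move("Account", "Barking")
after = db.query("Account")  # same call, same answer, different site
```

The user's call `db.query("Account")` is identical before and after the move, which is what location independence asks for.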

Page 22: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASE ARCHITECTURE

[Diagram: four sites (Leytonstone, Stratford, Barking, Leyton), each with its own clients and its own DBMS, connected by a WIDE AREA NETWORK.]

Page 23: Parallel and Distributed Databases CS263 Lecture 16

M:N CLIENT/SERVER DBMS ARCHITECTURE

[Diagram: Client #1, Client #2 and Client #3 connected to Database Server #1 and Database Server #2.]

NOT TRANSPARENT!

Page 24: Parallel and Distributed Databases CS263 Lecture 16

COMPONENTS OF A DDBMS

[Diagram: Site 1 and Site 2 connected by a computer network. Components shown: GSC, DDBMS, DC, LDBMS and DB.]

LDBMS = Local DBMS
DC = Data Communications
GSC = Global Systems Catalog
DDBMS = Distributed DBMS

Page 25: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: ADVANTAGES

• Reduced Communication Overhead

Most data access is local, which is less expensive and performs better than remote access.

• Improved Processing Power

Instead of one server handling the full database, we now have a collection of machines handling the same database.

• Removal of Reliance on a Central Site

If a server fails, then the only part of the system that is affected is the relevant local site. The rest of the system remains functional and available.

Page 26: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: ADVANTAGES

• Expandability

It is easier to accommodate increasing the size of the global (logical) database.

• Local autonomy

The database is brought nearer to its users. This can effect a cultural change as it allows potentially greater control over local data.

Page 27: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: DATE’S TWELVE RULES FOR A DDBMS

A distributed system looks exactly like a non-distributed system to the user!

1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction processing
9. Hardware independence
10. Operating system independence
11. Network independence
12. Database independence

Page 28: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: ISSUES

Data Allocation

Data Fragmentation

Distributed Catalogue Management

Distributed Transactions

Distributed Queries (see chapter 20)

Page 29: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: DATA ALLOCATION METRICS

1. Locality of reference: Is the data near to the sites that need it?

2. Reliability and availability: Does the strategy improve fault tolerance and accessibility?

3. Performance: Does the strategy result in bottlenecks or under-utilisation of resources?

4. Storage costs: How does the strategy affect the availability and cost of data storage?

5. Communication costs: How much network traffic will result from the strategy?

Page 30: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: DATA ALLOCATION STRATEGIES

CENTRALISED

Locality of Reference:     Lowest
Reliability/Availability:  Lowest
Storage Costs:             Lowest
Performance:               Unsatisfactory
Communication Costs:       Highest

Page 31: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: DATA ALLOCATION STRATEGIES

PARTITIONED/FRAGMENTED

Locality of Reference:     High
Reliability/Availability:  Low (item) – High (system)
Storage Costs:             Lowest
Performance:               Satisfactory
Communication Costs:       Low

Page 32: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: DATA ALLOCATION STRATEGIES

COMPLETE REPLICATION

Locality of Reference:     Highest
Reliability/Availability:  Highest
Storage Costs:             Highest
Performance:               High
Communication Costs:       High (update) – Low (read)

Page 33: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: DATA ALLOCATION STRATEGIES

SELECTIVE REPLICATION

Locality of Reference:     High
Reliability/Availability:  Low (item) – High (system)
Storage Costs:             Average
Performance:               Satisfactory
Communication Costs:       Low

Page 34: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: WHY FRAGMENT DATA?

Usage: Applications are usually interested in ‘views’, not whole relations.

Efficiency: It’s more efficient if data is close to where it is frequently used.

Parallelism: It is possible to run several ‘sub-queries’ in tandem.

Security: Data not required by local applications is not stored at the local site.

Page 35: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: HORIZONTAL DATA FRAGMENTATION

ACCOUNT  CUSTOMER  BRANCH     BALANCE
200      JONES     STRATFORD  1000.00
324      GRAY      BARKING     200.00
345      SMITH     STRATFORD    23.17
350      GREEN     BARKING     340.14
400      ONO       BARKING     500.00
456      KHAN      STRATFORD   333.00

Horizontal Fragmentation: Consists of a Restriction on a Relation.

e.g., σ branch = ‘Stratford’ (Account)

Page 36: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: HORIZONTAL DATA FRAGMENTATION

STRATFORD BRANCH

ACCT NO.  CUSTOMER  BRANCH     BALANCE
200       JONES     STRATFORD  1000.00
345       SMITH     STRATFORD    23.17
456       KHAN      STRATFORD   333.00

BARKING BRANCH

ACCT NO.  CUSTOMER  BRANCH     BALANCE
324       GRAY      BARKING     200.00
350       GREEN     BARKING     340.14
400       ONO       BARKING     500.00
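Horizontal fragmentation can be tried out directly on the Account relation from these slides. The sketch below is only an illustration: the dictionary keys and the helper name `horizontal_fragment` are invented for the example. Each fragment is a restriction (selection) on the relation, and the union of the fragments gives back the original relation.

```python
# The Account relation from the slide, one dictionary per row.
ACCOUNT = [
    {"acct_no": 200, "customer": "JONES", "branch": "STRATFORD", "balance": 1000.00},
    {"acct_no": 324, "customer": "GRAY",  "branch": "BARKING",   "balance": 200.00},
    {"acct_no": 345, "customer": "SMITH", "branch": "STRATFORD", "balance": 23.17},
    {"acct_no": 350, "customer": "GREEN", "branch": "BARKING",   "balance": 340.14},
    {"acct_no": 400, "customer": "ONO",   "branch": "BARKING",   "balance": 500.00},
    {"acct_no": 456, "customer": "KHAN",  "branch": "STRATFORD", "balance": 333.00},
]


def horizontal_fragment(relation, predicate):
    """Restriction: keep whole rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


stratford = horizontal_fragment(ACCOUNT, lambda r: r["branch"] == "STRATFORD")
barking = horizontal_fragment(ACCOUNT, lambda r: r["branch"] == "BARKING")

# Reconstruction: the union of the fragments is the original relation.
rebuilt = sorted(stratford + barking, key=lambda r: r["acct_no"])
```

Every row lands in exactly one fragment because each account belongs to exactly one branch, so nothing is lost or duplicated when the fragments are combined.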

Page 37: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: VERTICAL DATA FRAGMENTATION

S#   NAME   SITE       PHONE NO       LOGIN    PASSWORD
200  JONES  STRATFORD  0208-500-9000  JON200T  XXYY22
324  GRAY   BARKING    0208-545-7528  GRA324S  ZZEE56
456  KHAN   STRATFORD  0208-500-5821  KHA456T  KJTR78

Vertical Fragmentation: Consists of a Projection on a Relation.

e.g., Π S#, NAME, SITE, PHONE NO (Student)

Page 38: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: VERTICAL DATA FRAGMENTATION

STUDENT ADMINISTRATION

S#   NAME   SITE       PHONE NO.
200  JONES  STRATFORD  0208-500-9000
324  GRAY   BARKING    0208-545-7528
456  KHAN   STRATFORD  0208-500-5821

NETWORK ADMINISTRATION

S#   LOGIN-ID  PASSWORD
200  JON200T   XXYY22
324  GRA324S   ZZEE56
456  KHA456T   KJTR78
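Vertical fragmentation can be sketched the same way, using the Student relation from these slides. The dictionary keys and helper names below are invented for the example. Each fragment is a projection that keeps the key S#, and joining the fragments on that key rebuilds the original relation.

```python
# The Student relation from the slide, one dictionary per row.
STUDENT = [
    {"s_no": 200, "name": "JONES", "site": "STRATFORD",
     "phone": "0208-500-9000", "login": "JON200T", "password": "XXYY22"},
    {"s_no": 324, "name": "GRAY", "site": "BARKING",
     "phone": "0208-545-7528", "login": "GRA324S", "password": "ZZEE56"},
    {"s_no": 456, "name": "KHAN", "site": "STRATFORD",
     "phone": "0208-500-5821", "login": "KHA456T", "password": "KJTR78"},
]


def vertical_fragment(relation, attributes):
    """Projection: keep only the named columns of every row."""
    return [{a: row[a] for a in attributes} for row in relation]


def join_on_key(left, right, key):
    """Rebuild full rows by matching the two fragments on the shared key."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left]


# Both fragments keep the key s_no so that they can be joined again.
student_admin = vertical_fragment(STUDENT, ["s_no", "name", "site", "phone"])
network_admin = vertical_fragment(STUDENT, ["s_no", "login", "password"])

rebuilt = join_on_key(student_admin, network_admin, "s_no")
```

Student administration never stores the passwords and network administration never stores the phone numbers, which matches the security argument for fragmenting data.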

Page 39: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: DISTRIBUTED CATALOG MANAGEMENT

• Centralised Global Catalog

One site maintains the full global catalog. All changes to any local system catalog have to be propagated to the site maintaining the global catalog. This gives bad performance, creates a single point of failure and compromises site autonomy.

• Dispersed Catalog

There is no physical global catalog. Each time a remote data item is required, the catalogs from ALL other sites are examined for the item. This has severe performance penalties.

Page 40: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: DISTRIBUTED CATALOG MANAGEMENT

• Replicated Global Catalog

Each site maintains its own copy of the global catalog. Although this greatly speeds up remote data location, it is very inefficient to maintain. Details of every data item added, changed or deleted locally have to be propagated to ALL other sites.

• Local-Master Catalog

Each site maintains both its local system catalog and a catalog of all of its data items that are replicated at other sites. This avoids compromising site autonomy, is fairly efficient, and is not a single point of failure.

Page 41: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: DISTRIBUTED TRANSACTIONS

[Diagram: three Stratford clients, and the Stratford, Barking and Leyton DBMSs, each with its own DB. Parts (a), (b) and (c) of the global transaction are routed to the three sites as one ATOMIC DISTRIBUTED TRANSACTION.]

Global Transaction

(a) Debit Stratford A/C £500
(b) Credit Barking A/C £350
(c) Credit Leyton A/C £150

Page 42: Parallel and Distributed Databases CS263 Lecture 16

TWO-PHASE COMMIT (2PC) - OK

Page 43: Parallel and Distributed Databases CS263 Lecture 16

TWO-PHASE COMMIT (2PC) - ABORT

‘Global Abort’
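The idea behind two-phase commit can be sketched with the global transaction from the earlier slide (debit Stratford £500, credit Barking £350, credit Leyton £150). This is a simplified illustration only: the class `Participant`, its `vote_yes` flag and the function `two_phase_commit` are made up for the example, and a real protocol also needs logging, timeouts and recovery, none of which are shown.

```python
class Participant:
    """One site's DBMS taking part in a global transaction (hypothetical)."""

    def __init__(self, name, balance, vote_yes=True):
        self.name = name
        self.balance = balance
        self.vote_yes = vote_yes  # assumption: lets us simulate a site that cannot commit
        self.pending = 0

    def prepare(self, delta):
        """Phase 1: hold the change without applying it, then vote."""
        if not self.vote_yes:
            return False
        self.pending = delta
        return True

    def commit(self):
        """Phase 2 (OK): make the held change permanent."""
        self.balance += self.pending
        self.pending = 0

    def abort(self):
        """Phase 2 (abort): throw the held change away."""
        self.pending = 0


def two_phase_commit(work):
    """work is a list of (participant, amount) pairs forming one global transaction."""
    votes = [site.prepare(amount) for site, amount in work]  # phase 1: collect votes
    if all(votes):
        for site, _ in work:                                 # phase 2: everyone commits
            site.commit()
        return "GLOBAL COMMIT"
    for site, _ in work:                                     # phase 2: everyone aborts
        site.abort()
    return "GLOBAL ABORT"


stratford = Participant("Stratford", 1000)
barking = Participant("Barking", 0)
leyton = Participant("Leyton", 0)
ok_result = two_phase_commit([(stratford, -500), (barking, 350), (leyton, 150)])

leyton.vote_yes = False  # Leyton now votes no, so nothing may change anywhere
abort_result = two_phase_commit([(stratford, -500), (barking, 350), (leyton, 150)])
```

In the first run every site votes yes and all three balances change together. In the second run a single no vote produces a global abort and every balance is left as it was, which is what makes the distributed transaction atomic.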

Page 44: Parallel and Distributed Databases CS263 Lecture 16

DISTRIBUTED DATABASES: DISADVANTAGES OF DDBMSs

Architectural complexity.

Cost.

Security.

Integrity control more difficult.

Lack of standards.

Lack of experience.

Database design more complex.