Distributed DB 12. 1 CSE2132 Database Systems Week 12 Lecture Distributed Database



Evolution of Distributed DBMS

CENTRALISED systems - all system components located on a single computer

• data

• DBMS

• secondary storage (disks etc)

– Access via serially connected 'dumb terminals' - all processing occurs at the central site

1980s changes:

– business operations decentralised geographically, needing to become 'leaner and meaner, quick reacting, dispersed operations'

– technological change: low cost, powerful computing platforms

These changes led to DISTRIBUTED SYSTEMS


Distributed Data Base Management Systems

System components are distributed over multiple sites, interconnected via a communication system (network) and managed by a Distributed Data Base Management System (DDBMS)

DDBMS advantages:

– data located near the greatest demand site

– faster data access - desired data subset locally available

– faster data processing - system processing load spread out over multiple CPUs

– growth facilitation - easy to add new sites to network

– less danger of single point failure

DDBMS disadvantages:

– complexity of management and control

– security - weaker due to distribution (thus more people involved) and network traffic

– lack of standards - many communication protocols exist, e.g. TCP/IP, NetBIOS, DECnet etc.


Distributed Processing

– the database's logical processing is shared among multiple CPUs

– the actual database resides on a single computer


Distributed Database

– the actual database is stored over two or more independent CPUs - each part is known as a database fragment


Distribution Options

Two important components of a DDBMS are:

– Transaction Processor (TP) - receives and processes the application's data requests

– Data Processor (DP) - stores and retrieves data located at the site

Data And Processing Distribution Options:

– Single site processing, single site data

• all processing done on a single CPU (host computer)

• all data stored on the host computer's local disk/s

• e.g. traditional mainframe / minicomputer DBMS with dumb terminals, or single user microcomputer DBMS


Distribution Options

– Multi site processing, single site data

• multiple processes run on different computers

• all data stored on a single computer's local disk/s

• e.g. LAN file server

TP acts as a redirector - routes all network data requests to the file server

file server appears to the end user as a hard disk, e.g. F:

all data selection, search and update occur at the workstation

the entire file must be transported across the network

inefficient, costly (communication)


Distribution Options (1)

Client-server is a term used very loosely - difficult to clearly define

Client-server is similar to file server, except that database processing occurs at the server site - known as a database server. Client-server is about a split in processing rather than a split in data. A 3 tier architecture is often employed. This requires communications middleware to resolve translation issues by the use of an agreed protocol (e.g. TCP/IP, IPX/SPX or NetBEUI). Client-server approaches overlap with, and are used in conjunction with, distributed databases.
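The difference between a file server and a database server can be sketched by counting how many rows cross the network. This is an illustrative sketch only: the table contents and the function names `file_server_query` and `database_server_query` are hypothetical, and a Python list stands in for the stored file.

```python
# Hypothetical customer file held at the server: 1000 rows, 10 of them in Sydney.
ROWS = [{"id": i, "city": "Sydney" if i % 100 == 0 else "Melbourne"}
        for i in range(1000)]


def file_server_query(predicate):
    """File server: the whole file is shipped, selection happens at the workstation."""
    shipped = list(ROWS)                      # entire file crosses the network
    result = [r for r in shipped if predicate(r)]
    return result, len(shipped)


def database_server_query(predicate):
    """Database server: selection happens at the server, only matches are shipped."""
    result = [r for r in ROWS if predicate(r)]
    return result, len(result)


def in_sydney(row):
    return row["city"] == "Sydney"


fs_result, fs_rows_shipped = file_server_query(in_sydney)
db_result, db_rows_shipped = database_server_query(in_sydney)
print(fs_rows_shipped, db_rows_shipped)
```

Both approaches return the same answer; only the amount of data transported differs.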

Multiple site processing, multiple site data

This describes a fully distributed DBMS. Such systems can be classified as homogeneous or heterogeneous.


Distribution Options (2)

• homogeneous DDBMS

The same DBMS runs at each site. While there is much complexity to deal with, the task of accessing data over many sites is simpler than in the heterogeneous case.

• heterogeneous DDBMS

Supports different DBMSs and even different DBMS models (relational, network, hierarchical) at each site. Such implementations of a DDBMS operate under certain restrictions (e.g. access to other databases is read only, or only other relational databases can be accessed).


Date's Distributed Database Rules

A DISTRIBUTED DATABASE SHOULD LOOK EXACTLY LIKE A NON DISTRIBUTED 'LOCAL' DATABASE

LOCAL ACCESS TO LOCAL DATA

Many vendors claim to be delivering distributed database applications - criteria are needed to judge them:

DATE'S 12 RULES

– 0 - TO USER A DISTRIBUTED DATABASE SHOULD LOOK NO DIFFERENT TO A NON DDB - Distribution transparency, database treated as single logical database

– 1 - LOCAL AUTONOMY, local data owned and managed locally

– 2 - NO RELIANCE ON CENTRAL SITE

– 3 - CONTINUOUS OPERATION, failure transparency - site independence, even in the event of a node failure the system continues to operate

– 4 - LOCATION INDEPENDENCE, location transparency

– 5 - FRAGMENTATION INDEPENDENCE, fragmentation transparency


Date's rules continued

– 6 - REPLICATION INDEPENDENCE

– 7 - DISTRIBUTED QUERY PROCESSING, multiple site queries

– 8 - DISTRIBUTED TRANSACTION MANAGEMENT, transaction transparency - multiple site updates

– 9 - HARDWARE INDEPENDENCE

– 10 - OPERATING SYSTEM INDEPENDENCE

– 11 - COMMUNICATION NETWORK INDEPENDENCE

– 12 - DATABASE INDEPENDENCE


Levels of Distribution Transparency

The level of transparency supported by a DDBMS can be used as a method of classification. At the highest level:

– FRAGMENTATION transparency

• No need to specify fragment names or locations

• e.g.

    select *
    from employee
    where dob < '01/01/40';

– LOCATION transparency

• Specify fragment names but not locations

    select *
    from e1
    where dob < '01/01/40'
    UNION
    select *
    from e2
    where dob < '01/01/40'
    UNION
    select *
    from e3
    where dob < '01/01/40';


Levels of Distribution Transparency

– local MAPPING transparency (lowest level)

• Need to specify both fragment and location (using pseudo-SQL)

    select * from e1 node melbourne
    where dob < '01/01/40'
    UNION
    select * from e2 node sydney
    where dob < '01/01/40'
    UNION
    select * from e3 node adelaide
    where dob < '01/01/40';
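One way to picture fragmentation transparency is a view that hides the fragments behind a single logical table. The sketch below uses Python's built-in sqlite3 module. It is an illustration under assumptions, not how any particular DDBMS does it: a single in-memory database stands in for the three sites, and the column names and sample rows are invented. Dates are stored as ISO strings so that string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: one in-memory database stands in for three separate sites.
for fragment in ("e1", "e2", "e3"):
    conn.execute(
        f"CREATE TABLE {fragment} (emp_id INTEGER PRIMARY KEY, name TEXT, dob TEXT)"
    )

# Invented sample rows, one fragment per 'site'.
conn.executemany("INSERT INTO e1 VALUES (?, ?, ?)",
                 [(1, "Ann", "1938-05-02"), (2, "Bob", "1961-11-30")])
conn.executemany("INSERT INTO e2 VALUES (?, ?, ?)",
                 [(3, "Cho", "1939-12-31")])
conn.executemany("INSERT INTO e3 VALUES (?, ?, ?)",
                 [(4, "Dee", "1975-01-15")])

# The view plays the role of the single logical 'employee' table.
conn.execute("""
    CREATE VIEW employee AS
        SELECT * FROM e1
        UNION ALL SELECT * FROM e2
        UNION ALL SELECT * FROM e3
""")

# The user names neither fragments nor locations.
rows = conn.execute(
    "SELECT emp_id, name FROM employee WHERE dob < '1940-01-01' ORDER BY emp_id"
).fetchall()
print(rows)
```

With only location transparency the user would write the three-way UNION themselves; with local mapping transparency they would also have to name the node holding each fragment.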


DDBMS Operations

Join operation (most vendors supply)

– Easier to achieve

– Query optimisation critical

Update operation

– e.g. debit / credit of two accounts at different sites

– more difficult to manage

– Need sophisticated transaction management - most popular strategy: TWO PHASE COMMIT

Two Phase Commit requires three operations:

– DO - performs operation & records before and after image in transaction log

– UNDO - undoes an operation using log entries created in DO

– REDO - redoes an operation using log entries created in DO
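The three operations can be sketched with a log of before and after images. This is a minimal illustration, assuming a Python dict as the database; the class name `LoggedStore` and its methods are hypothetical.

```python
class LoggedStore:
    """Toy key/value store that logs a before and after image for every change."""

    def __init__(self):
        self.data = {}
        self.log = []  # each entry: (key, before_image, after_image)

    def do(self, key, new_value):
        # DO: record the before and after images, then perform the operation.
        self.log.append((key, self.data.get(key), new_value))
        self.data[key] = new_value
        return len(self.log) - 1  # position of the log entry

    def undo(self, index):
        # UNDO: restore the before image from the log entry created by DO.
        key, before, _after = self.log[index]
        if before is None:
            self.data.pop(key, None)  # the key did not exist before the DO
        else:
            self.data[key] = before

    def redo(self, index):
        # REDO: reapply the after image from the log entry created by DO.
        key, _before, after = self.log[index]
        self.data[key] = after


store = LoggedStore()
store.do("account_a", 100)
entry = store.do("account_a", 70)   # debit of 30
store.undo(entry)
balance_after_undo = store.data["account_a"]
store.redo(entry)
balance_after_redo = store.data["account_a"]
print(balance_after_undo, balance_after_redo)
```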


Two Phase Commit

The site originating the transaction (coordinator) sends a request to the other sites (subordinates); each site processes its sub transaction but does not commit.

Phase 1 - Preparation:

– 1. coordinator sends prepare to commit to all subordinates

– 2. subordinates receive message, write log entries and reply to coordinator - READY to COMMIT or NOT READY

– 3. coordinator checks that all nodes are ready to commit - if not, it broadcasts an ABORT; if all are ready, it proceeds to Phase 2:

Phase 2 - Final COMMIT

– 1. coordinator broadcasts a commit message to all subordinates and awaits a reply

– 2. subordinate receives commit and updates database

– 3. subordinates reply with COMMITTED or NOT COMMITTED to the coordinator

– If any subordinates did not commit, coordinator sends ABORT forcing an UNDO
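The two phases above can be sketched as a small single-process simulation of the debit / credit example. This is only a sketch under assumptions: there is no real network or failure handling, the rule that a site votes NOT READY when an account would be overdrawn is invented for the demonstration, and the names `Subordinate` and `two_phase_commit` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Subordinate:
    name: str
    balance: int
    pending: Optional[int] = None
    log: List[tuple] = field(default_factory=list)

    def do(self, delta: int) -> None:
        # Process the sub transaction without committing; log both images.
        self.pending = self.balance + delta
        self.log.append(("DO", self.balance, self.pending))

    def prepare(self) -> str:
        # Phase 1 reply. Assumed rule: refuse if the account would be overdrawn.
        ready = self.pending is not None and self.pending >= 0
        return "READY" if ready else "NOT READY"

    def commit(self) -> str:
        # Phase 2: update the database with the after image.
        self.balance = self.pending
        self.pending = None
        return "COMMITTED"

    def undo(self) -> None:
        # ABORT: restore the before image logged by the most recent DO.
        for entry in reversed(self.log):
            if entry[0] == "DO":
                self.balance = entry[1]
                break
        self.pending = None


def two_phase_commit(work: List[Tuple[Subordinate, int]]) -> str:
    subs = [sub for sub, _ in work]
    for sub, delta in work:
        sub.do(delta)
    # Phase 1 - preparation: collect a vote from every subordinate.
    votes = [sub.prepare() for sub in subs]
    if any(vote != "READY" for vote in votes):
        for sub in subs:
            sub.undo()
        return "ABORTED"
    # Phase 2 - final commit: every subordinate updates and replies.
    replies = [sub.commit() for sub in subs]
    if any(reply != "COMMITTED" for reply in replies):
        for sub in subs:
            sub.undo()
        return "ABORTED"
    return "COMMITTED"


melbourne = Subordinate("melbourne", 100)
sydney = Subordinate("sydney", 50)
first = two_phase_commit([(melbourne, -30), (sydney, +30)])
second = two_phase_commit([(melbourne, -500), (sydney, +500)])
print(first, second, melbourne.balance, sydney.balance)
```

In this toy version `commit` always succeeds, so the second abort branch is never taken; it is included only to mirror the last step listed above.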


DISTRIBUTION STRATEGIES

The overall data model - the company view

Distribution principles

– Examine geography / frequency of access

– Guiding principle - minimise network traffic and communication costs


The sales team

– Situated in Sydney

– Need most access to customer/order/order-line/product

The supply branch

– Situated in Melbourne

– Need most access to warehouse/inventory/product

Data model partitioning

What do we do with RELATIONS at the BOUNDARY? FRAGMENTATION vs REPLICATION


Possible starting points:

– Which site accesses it most?

• Storage at a single site

• Minimises update complexity

– Is there a case for fragmentation?

• Minimises local access time

• Minimises network traffic

– Is there a case for replication?

• Minimises local access time

• Minimises network traffic

Fragmentation

– Horizontal fragmentation

• Based on SELECTION, e.g. fragment customer table on customer_city


– Vertical fragmentation

• Based on PROJECTION, e.g. fragment product table on the attributes needed by each group

SALES : P_CODE, DESC, UNIT_PRICE

SUPPLY : P_CODE, PACK_SIZE

– Hybrid Fragmentation

• Assume 3 warehouses

Footscray - p_code < 100

Collingwood - p_code = 100

Dandenong - p_code > 100

• further subdivide the vertical SUPPLY fragment horizontally, by warehouse

SELECT P_CODE < 100 --> FOOTSCRAY

SELECT P_CODE = 100 --> COLLINGWOOD

SELECT P_CODE > 100 --> DANDENONG
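The fragmentation schemes above can be sketched with plain Python lists, using selection for horizontal fragments and projection for vertical ones. The sample product rows and the helper names `select` and `project` are invented for illustration; the attribute names and the p_code boundaries follow the slide.

```python
# Invented sample rows for the product table.
products = [
    {"p_code": 50, "desc": "Widget", "unit_price": 2.50, "pack_size": 10},
    {"p_code": 100, "desc": "Gadget", "unit_price": 9.00, "pack_size": 4},
    {"p_code": 150, "desc": "Sprocket", "unit_price": 1.25, "pack_size": 100},
]


def select(rows, predicate):
    """Horizontal fragmentation: keep whole rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, attributes):
    """Vertical fragmentation: keep only the named attributes of every row."""
    return [{a: row[a] for a in attributes} for row in rows]


# Vertical fragments; both keep the key P_CODE so rows can be rejoined.
sales = project(products, ["p_code", "desc", "unit_price"])
supply = project(products, ["p_code", "pack_size"])

# Hybrid: the SUPPLY fragment is split again, horizontally, by warehouse.
footscray = select(supply, lambda r: r["p_code"] < 100)
collingwood = select(supply, lambda r: r["p_code"] == 100)
dandenong = select(supply, lambda r: r["p_code"] > 100)

# Reconstruction: union the horizontal pieces, then join on the key.
supply_by_code = {r["p_code"]: r for r in footscray + collingwood + dandenong}
rebuilt = [{**s, **supply_by_code[s["p_code"]]} for s in sales]
print(rebuilt == products)
```

Keeping the key in every vertical fragment is what makes the original relation recoverable.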


Replication

Efficient retrievals vs multi site updating

– Multi site updating is costly and accident prone

Updating techniques

– Conservative - don't commit until all sites accept

– Primary node - one site accepts updates and broadcasts them to the others

– Majority voting

– Snapshots, etc
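Three of the updating techniques can be contrasted in a few lines. This is a rough sketch with hypothetical names: each replica's willingness to accept an update is reduced to a boolean vote, and the primary node broadcast is modelled as a loop over in-memory copies.

```python
from typing import List


def conservative_accepts(votes: List[bool]) -> bool:
    """Conservative: the update commits only if every site accepts."""
    return all(votes)


def majority_accepts(votes: List[bool]) -> bool:
    """Majority voting: the update commits if more than half the sites accept."""
    return sum(votes) > len(votes) / 2


class PrimaryCopy:
    """Primary node: one site takes the update and broadcasts it to the copies."""

    def __init__(self, n_copies: int):
        self.primary = {}
        self.copies = [{} for _ in range(n_copies)]

    def update(self, key, value) -> None:
        self.primary[key] = value
        for copy in self.copies:   # broadcast to every replica
            copy[key] = value


votes = [True, True, False]        # one of three sites refuses or is unreachable
print(conservative_accepts(votes), majority_accepts(votes))

replicated = PrimaryCopy(n_copies=2)
replicated.update("p_code_100_price", 9.00)
```

With one site out of three refusing, the conservative rule blocks the update while majority voting lets it through.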


Query Optimization

Query - list the supplier numbers for Cleveland suppliers of red parts

– SUPPLIER(SUPPLIER#, CITY) - 10,000 rows, stored at DETROIT

– PART(PART#, COLOUR) - 100,000 rows, stored at CHICAGO

– SHIPMENT(SUPPLIER#, PART#) - 1,000,000 rows, stored at DETROIT

    SELECT S.SUPPLIER#
    FROM SUPPLIER S, PART P, SHIPMENT H
    WHERE S.SUPPLIER# = H.SUPPLIER# AND
          H.PART# = P.PART# AND
          S.CITY = 'CLEVELAND' AND
          P.COLOUR = 'RED';

Time for the query varies from 1 second to 2.3 days depending on the Query Plan selected
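The slide does not state the assumptions behind the two figures. The sketch below shows how such a spread can arise from communication cost alone. Every parameter in it is an illustrative assumption, not taken from the lecture: a 1 second delay per message, a 10,000 bits per second link, 100-bit rows, 10 red parts, and 100,000 shipments by Cleveland suppliers.

```python
# All figures below are assumptions chosen for illustration.
ACCESS_DELAY_S = 1.0           # fixed delay per message
DATA_RATE_BPS = 10_000         # link speed in bits per second
ROW_BITS = 100                 # size of one row
RED_PARTS = 10                 # rows of PART with COLOUR = 'RED'
CLEVELAND_SHIPMENTS = 100_000  # SHIPMENT rows for Cleveland suppliers


def communication_time(messages: int, rows_shipped: int) -> float:
    """Total time = per-message delay plus transfer time for the rows shipped."""
    return messages * ACCESS_DELAY_S + rows_shipped * ROW_BITS / DATA_RATE_BPS


# Plan A: restrict PART to red parts at CHICAGO, ship them to DETROIT in one message.
plan_a_seconds = communication_time(messages=1, rows_shipped=RED_PARTS)

# Plan B: join at DETROIT, then ask CHICAGO about each shipment's part
# (one request and one reply per shipment; row transfer time ignored).
plan_b_seconds = communication_time(messages=2 * CLEVELAND_SHIPMENTS, rows_shipped=0)
plan_b_days = plan_b_seconds / 86_400

print(plan_a_seconds, round(plan_b_days, 1))
```

Under these assumed figures Plan A takes about one second and Plan B a little over two days, the same order of spread as quoted above. The per-message delay, not the volume of data, dominates the slow plan.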
