Distributed Database Concept


  • 7/21/2019 Distributed Database Concept


    COMP 302 Valentina Tamma

    Distributed Databases

    Connolly & Begg. Chapter 22. Third edition


    Distributed Databases Basic Concepts

    Concepts.

    Advantages and disadvantages of distributed databases.

    Functions and architecture for a DDBMS.

    Distributed database design.

    Levels of transparency.

    Comparison criteria for Distributed DBMSs.


    Why distributed databases?

    Some initial motivations:

    The development of computer networks promotes decentralization.

    In a company, the database organization might reflect the organizational structure, which is distributed into units. Each unit maintains its own database.

    Sharing of data can be achieved by developing a distributed database system which:

    makes data accessible by all units

    stores data close to where it is most frequently used.


    Concepts

    Distributed Database

    A logically interrelated collection of shared data (and a description of this data), physically distributed over a computer network.

    Distributed DBMS (DDBMS)

    Software system that permits the management of the distributed database and makes the distribution transparent to users.


    An example of DDBMS


    DDBMS - characteristics

    Collection of logically-related shared data.

    Data split into fragments.

    Fragments may be replicated.

    Fragments/replicas allocated to sites.

    Sites linked by a communications network.

    Data at each site is under control of a DBMS.

    DBMSs handle local applications autonomously.

    Each DBMS participates in at least one global application.


    These are not DDBMSs

    Distributed Processing: a centralized database that can be accessed over a computer network.


    These are not DDBMSs

    Parallel DBMS

    A DBMS running across multiple processors and disks, designed to execute operations in parallel, whenever possible, to improve performance.

    Based on the premise that single-processor systems can no longer meet requirements for cost-effective scalability, reliability, and performance.

    Parallel DBMSs link multiple, smaller machines to achieve the same throughput as a single, larger machine, with greater scalability and reliability.


    Parallel DBMS

    Main architectures for parallel DBMSs are:

    Shared memory,

    Shared disk,

    Shared nothing.

    Figure: parallel DBMS architectures: (a) shared memory, (b) shared disk, (c) shared nothing.


    Advantages of DDBMSs

    Reflects Organizational Structure

    Improved Sharing and Local Autonomy

    Improved Availability: a failure does not make the entire system inoperable.

    Improved Reliability: data may be replicated.

    Improved Performance: data are local to the site of greatest demand.

    Economics: many small computers cost less than a big one!

    Modular Growth: easy to add new modules.


    Disadvantages of DDBMSs

    Complexity

    Cost: especially in system management.

    Security: the network must be made secure.

    Integrity Control More Difficult

    Lack of Standards

    Lack of Experience

    Database Design More Complex: due to fragmentation, allocation of fragments to specific sites, etc.


    Types of DDBMS

    Homogeneous DDBMS

    All sites use the same DBMS product (e.g. Oracle).

    Fairly easy to design and manage.

    Heterogeneous DDBMS

    Sites may run different DBMS products (e.g. Oracle and Ingres).

    Possibly different underlying data models (e.g. relational DB and OO database).

    Occurs when sites have implemented their own databases and integration is considered later.

    We won't consider heterogeneous DDBMSs here.


    Multidatabase System (MDBS)

    DDBMS in which each site maintains complete autonomy.

    DBMS that resides transparently on top of existing database and file systems and presents a single database to its users.

    Allows users to access and share data without requiring physical database integration.

    Unfederated MDBS (no local users) and federated MDBS.


    Overview of Networking

    Network: interconnected collection of autonomous computers, capable of exchanging information.

    Local Area Network (LAN): intended for connecting computers at the same site.

    Wide Area Network (WAN): used when computers or LANs need to be connected over long distances.

    WANs are relatively slow and less reliable than LANs. A DDBMS using a LAN provides much faster response time than one using a WAN.


    Functions of a DDBMS

    Expect a DDBMS to have at least the functionality of a DBMS (see Connolly & Begg, Chapter 2, third edition).

    Also to have the following functionality:

    Extended communication services.

    Extended Data Dictionary.

    Distributed query processing.

    Extended concurrency control.

    Extended recovery services.

    Extended security control.


    Reference Architecture for DDBMS

    Due to diversity, there is no accepted architecture equivalent to the ANSI/SPARC 3-level architecture for DBMSs.

    A possible reference architecture consists of:

    Set of global external schemas.

    Global conceptual schema (GCS).

    Fragmentation schema and allocation schema.

    Set of schemas for each local DBMS conforming to the 3-level ANSI/SPARC architecture.

    Some levels may be missing, depending on the levels of transparency supported.


    Reference Architecture for DDBMS

    Global Conceptual Schema is the logical description of the DB as if it were not distributed. It contains definitions of entities, relationships, constraints, security, and integrity information.

    Fragmentation and Allocation Schemas describe how data are logically partitioned, and where they are located, taking replication into account.

    Local Schemas are the logical descriptions of the local DBs.


    Components of a DDBMS


    Distributed Databases

    Issues in Distributed Database Design


    Issues in Distributed Database Design

    Three key issues we have to consider:

    Data Allocation: where are data placed? Data should be stored at sites with "optimal" distribution.

    Fragmentation: a relation may be divided into a number of sub-relations (called fragments), which are stored at different sites.

    Replication: a copy of a fragment may be maintained at several sites.


    Issues in Distributed Database Design

    Definition and allocation of fragments are carried out strategically to achieve:

    Locality of Reference

    Improved Reliability and Availability

    Improved Performance

    Balanced Storage Capacities and Costs

    Minimal Communication Costs.

    Involves analysing the most important transactions, based on quantitative/qualitative information.


    Fragmentation

    Quantitative information may include:

    frequency with which a transaction is run;

    site from which a transaction is run;

    performance criteria for transactions.

    Qualitative information about the transactions that are executed may include:

    type of access (read or write);

    predicates of read operations.


    Data Allocation

    Four strategies regarding placement of data:

    Centralized

    Partitioned (or Fragmented)

    Complete Replication

    Selective Replication


    Data Allocation

    Centralized: consists of a single database stored at one site, with users distributed across the network. (This is not a DDB but distributed processing!)

    Partitioned: database partitioned into disjoint fragments, each fragment assigned to one site.

    Complete Replication: consists of maintaining a complete copy of the database at each site.

    Selective Replication: combination of partitioning, replication, and centralization.


    Fragmentation

    A relation R is divided into fragments $r_1, r_2, \ldots, r_n$, which contain enough information to allow reconstruction of R.

    Example: we have a relation Sells(pub, address, price, type).

    Type is bitter or lager.

    We can split Sells into two different fragments:

    $SellsBitter = \sigma_{type = 'bitter'}(Sells)$

    $SellsLager = \sigma_{type = 'lager'}(Sells)$
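
    The split of Sells by type can be sketched in a few lines of Python. This is only an illustration: the sample tuples and the `select` helper are assumptions made for the example, not part of the slides.

```python
# Hypothetical sample tuples for Sells(pub, address, price, type).
sells = [
    {"pub": "The Ship", "address": "1 Dock Rd", "price": 3.20, "type": "bitter"},
    {"pub": "The Crown", "address": "9 High St", "price": 3.50, "type": "lager"},
    {"pub": "The Swan", "address": "4 Mill Ln", "price": 3.10, "type": "bitter"},
]


def select(relation, predicate):
    """Relational selection: keep the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


# One horizontal fragment per value of the type attribute.
sells_bitter = select(sells, lambda row: row["type"] == "bitter")
sells_lager = select(sells, lambda row: row["type"] == "lager")

# Every tuple lands in exactly one fragment, so concatenating the
# fragments (their union) gives back the tuples of the original relation.
reconstructed = sells_bitter + sells_lager
```

    Because the two predicates are mutually exclusive and cover every value of type, no tuple is lost or duplicated by the split.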


    Vertical Fragmentation

    Each fragment consists of a subset of the attributes of a relation R.

    Defined using the projection operation of relational algebra:

    $\Pi_{a_1, \ldots, a_n}(R)$

    Determined by establishing the affinity of one attribute to another.

    Example:

    Relation: Bars(name, address, licence, employees, owner)

    Fragments:

    $\Pi_{name, address, licence}(Bars)$

    $\Pi_{name, address, employees, owner}(Bars)$
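
    As a rough sketch, the two Bars fragments and their recombination could look like this in Python. The sample rows and the `project` and `natural_join` helpers are assumptions for illustration only; `project` does not remove duplicates, which is harmless here because every fragment keeps the key attribute name.

```python
# Hypothetical sample tuples for Bars(name, address, licence, employees, owner).
bars = [
    {"name": "The Ship", "address": "1 Dock Rd", "licence": "L-01",
     "employees": 5, "owner": "Ann"},
    {"name": "The Swan", "address": "4 Mill Ln", "licence": "L-02",
     "employees": 3, "owner": "Bob"},
]


def project(relation, attributes):
    """Relational projection: keep only the named attributes of each tuple."""
    return [{a: row[a] for a in attributes} for row in relation]


def natural_join(left, right):
    """Join tuples that agree on every attribute the two relations share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            if all(l_row[a] == r_row[a] for a in shared):
                joined.append({**l_row, **r_row})
    return joined


fragment_1 = project(bars, ["name", "address", "licence"])
fragment_2 = project(bars, ["name", "address", "employees", "owner"])

# Joining the vertical fragments on their shared attributes rebuilds Bars.
rebuilt = natural_join(fragment_1, fragment_2)
```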


    Mixed Fragmentation

    We can also mix horizontal and vertical fragmentation.

    We obtain a fragment that consists of a horizontal fragment that is vertically fragmented, or a vertical fragment that is horizontally fragmented.

    Defined using the Selection and Projection operations of relational algebra:

    $\sigma_p(\Pi_{a_1, \ldots, a_n}(R))$

    $\Pi_{a_1, \ldots, a_n}(\sigma_p(R))$


    Example - Mixed Fragmentation

    $S_1 = \Pi_{staffNo, position, sex, DOB, salary}(Staff)$

    $S_2 = \Pi_{staffNo, fName, lName, branchNo}(Staff)$

    $S_{21} = \sigma_{branchNo = 'B003'}(S_2)$

    $S_{22} = \sigma_{branchNo = 'B005'}(S_2)$

    $S_{23} = \sigma_{branchNo = 'B007'}(S_2)$


    Derived Horizontal Fragmentation

    A horizontal fragment that is based on the horizontal fragmentation of a parent relation.

    Ensures that fragments that are frequently joined together are at the same site.

    Defined using the Semijoin operation of relational algebra:

    $R_i = R \ltimes_F S_i, \quad 1 \le i \le w$
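
    A minimal sketch of derived horizontal fragmentation, assuming a hypothetical parent relation Pubs that is already fragmented by city, a child relation Sells, and a join condition F that is simple equality on the pub attribute. None of these sample relations come from the slides.

```python
def semijoin(r, s, attribute):
    """Tuples of r that join with at least one tuple of s on `attribute`."""
    keys = {row[attribute] for row in s}
    return [row for row in r if row[attribute] in keys]


# Hypothetical parent fragments S_1 and S_2 (Pubs split by city).
pubs_liverpool = [{"pub": "The Ship", "city": "Liverpool"}]
pubs_leeds = [{"pub": "The Crown", "city": "Leeds"}]

# Hypothetical child relation R.
sells = [
    {"pub": "The Ship", "beer": "Old Tom", "type": "bitter"},
    {"pub": "The Crown", "beer": "Pils", "type": "lager"},
    {"pub": "The Ship", "beer": "Helles", "type": "lager"},
]

# R_i = R semijoin S_i: each child fragment follows its parent fragment,
# so the two can be stored, and joined, at the same site.
sells_liverpool = semijoin(sells, pubs_liverpool, "pub")
sells_leeds = semijoin(sells, pubs_leeds, "pub")
```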


    Correctness of Fragmentation

    Reconstruction: we must be able to reconstruct the entire relation R from its fragments.

    For horizontal fragmentation, reconstruction is the union operation:

    $R = r_1 \cup r_2 \cup \ldots \cup r_n$

    For vertical fragmentation, reconstruction is the natural join operation:

    $R = r_1 \bowtie r_2 \bowtie \ldots \bowtie r_n$

    Disjointness: the two fragments are disjoint, except for the primary key, name, which is necessary for reconstruction.


    Distributed Databases

    Transparency in Distributed databases


    Transparencies in a DDBMS

    Distribution Transparency

    Transaction Transparency

    Performance Transparency

    DBMS Transparency


    Distribution Transparency

    The user has to perceive the DDB as a single, logical entity.

    Fragmentation Transparency: the user does not need to know that data is fragmented.

    Location Transparency: the user does not need to know the location of data items.

    Replication Transparency: the user is unaware of the replication of data.

    Naming Transparency: items in a database must have a unique name, but users don't need to worry about it.


    Naming Transparency

    Each item in a DDB must have a unique name.

    DDBMS must ensure that no two sites create a database object with the same name.

    Solution 1: create a central name server.

    Disadvantages:

    loss of some local autonomy;

    central site may become a bottleneck;

    low availability: if the central site fails, the remaining sites cannot create any new objects.


    Naming Transparency

    Solution 2: prefix each object with the identifier of the site that created it.

    Example: Beer created at site S1 might be named S1.Beer.

    Disadvantage: loss of distribution transparency.


    Naming Transparency

    Solution 3: use aliases for each database object.

    Example: S1.Beer might be known as local_Beer by a user at site S1.

    The DDBMS has the task of mapping an alias to the appropriate database object.
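
    The alias mapping can be pictured as a small lookup table kept by the DDBMS. The sketch below is a hypothetical illustration: the `AliasCatalog` class and its methods are invented for the example and do not describe any particular product.

```python
class AliasCatalog:
    """Maps (site, alias) pairs to globally unique object names."""

    def __init__(self):
        self._aliases = {}

    def define(self, site, alias, global_name):
        """Register an alias that is valid for users at one site."""
        self._aliases[(site, alias)] = global_name

    def resolve(self, site, name):
        """Return the global name for an alias, or the name itself if no alias exists."""
        return self._aliases.get((site, name), name)


catalog = AliasCatalog()
catalog.define("S1", "local_Beer", "S1.Beer")

resolved_alias = catalog.resolve("S1", "local_Beer")   # user at S1 uses the alias
resolved_global = catalog.resolve("S2", "S1.Beer")     # user at S2 uses the full name
```

    Users at S1 never see the site prefix, while the underlying object name stays globally unique.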


    Transaction Transparency

    Ensures that all distributed transactions maintain the distributed database's integrity and consistency.

    A distributed transaction accesses data stored at more than one location.

    Each transaction is divided into a number of sub-transactions, one for each site that has to be accessed.

    DDBMS must ensure the indivisibility of both the global transaction and each sub-transaction.

    Must ensure both concurrency transparency and failure transparency.


    Example - Distributed Transaction

    Relation: Sells(pub, beer, price, type)

    Fragments:

    $SellsBitter = \sigma_{type = 'bitter'}(Sells)$

    $SellsLager = \sigma_{type = 'lager'}(Sells)$

    The two fragments are at two different sites.

    Transaction T prints out the names of all pubs in the relation Sells. This transaction is split into two sub-transactions, one for each fragment.
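
    The split of T into per-fragment sub-transactions might be sketched as follows. The fragment contents are hypothetical, and each call to `sub_transaction` stands in for work that would really run at the site holding that fragment.

```python
# Hypothetical contents of the two fragments, each held at a different site.
sells_bitter = [
    {"pub": "The Ship", "beer": "Old Tom", "price": 3.20, "type": "bitter"},
    {"pub": "The Swan", "beer": "Best", "price": 3.10, "type": "bitter"},
]
sells_lager = [
    {"pub": "The Crown", "beer": "Pils", "price": 3.50, "type": "lager"},
    {"pub": "The Ship", "beer": "Helles", "price": 3.40, "type": "lager"},
]


def sub_transaction(fragment):
    """Runs at one site: collect the pub names found in the local fragment."""
    return {row["pub"] for row in fragment}


def transaction_t(fragments):
    """Global transaction: run one sub-transaction per fragment, merge the results."""
    names = set()
    for fragment in fragments:
        names |= sub_transaction(fragment)
    return sorted(names)


pub_names = transaction_t([sells_bitter, sells_lager])
```

    A pub that sells both bitter and lager appears in both fragments, so the global transaction has to merge the partial results rather than simply concatenate them.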


    Example - Distributed Transaction

    T prints out the names of all staff, using the schema defined above as S1, S2, S21, S22, and S23. Define three sub-transactions TS3, TS5, and TS7 to represent agents at sites 3, 5, and 7.


    Concurrency Transparency

    All transactions must execute independently and be logically consistent with the results obtained if the transactions were executed one at a time, in some arbitrary serial order.

    Same fundamental principles as for a centralised DBMS.

    DDBMS must ensure that global and local transactions do not interfere with each other.

    Similarly, DDBMS must ensure consistency of all sub-transactions of a global transaction.

    The techniques for concurrency control are usually different from the ones for a centralised DBMS.


    Concurrency Transparency

    Replication makes concurrency more complex.

    If a copy of a replicated data item is updated, the update must be propagated to all copies.

    Could propagate changes as part of the original transaction, making it an atomic operation.

    However, if one site holding a copy is not reachable, then the transaction is delayed until the site is reachable.


    Concurrency Transparency

    Could limit update propagation to only those sites currently available. Remaining sites are updated when they become available again.

    Could allow updates to copies to happen asynchronously, sometime after the original update. The delay in regaining consistency may range from a few seconds to several hours.
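
    The option of updating only the available sites and catching up the rest later can be sketched as below. This is a toy model under stated assumptions: `Replica`, `ReplicatedItem`, and the pending-update queue are hypothetical names, and a real system would also have to order and log these updates durably.

```python
class Replica:
    """One copy of a data item held at a site."""

    def __init__(self, name):
        self.name = name
        self.available = True
        self.value = None


class ReplicatedItem:
    """Propagates updates to available replicas and queues them for the rest."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.pending = {replica.name: [] for replica in replicas}

    def update(self, value):
        for replica in self.replicas:
            if replica.available:
                replica.value = value
            else:
                # Site unreachable: remember the update for later.
                self.pending[replica.name].append(value)

    def recover(self, replica):
        """Replay queued updates, in order, when a site becomes available again."""
        replica.available = True
        for value in self.pending[replica.name]:
            replica.value = value
        self.pending[replica.name].clear()


london, glasgow = Replica("London"), Replica("Glasgow")
item = ReplicatedItem([london, glasgow])

glasgow.available = False
item.update(10)
item.update(20)
stale_value = glasgow.value   # Glasgow has missed both updates so far
item.recover(glasgow)
```

    Until `recover` runs, the copies disagree, which is exactly the window of inconsistency the slide describes.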


    Failure Transparency

    DDBMS must ensure atomicity and durability of the global transaction.

    Means ensuring that the sub-transactions of a global transaction either all commit or all abort.

    Thus, DDBMS must synchronize the global transaction to ensure that all sub-transactions have completed successfully before recording a final COMMIT for the global transaction.

    Must do this in the presence of site and network failures.
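
    The all-commit-or-all-abort rule can be sketched with a simple vote-then-decide scheme, in the spirit of two-phase commit. The slides do not name a protocol at this point, so the choice of scheme, the `Site` class, and the `can_commit` flag are assumptions; the sketch also ignores the failure handling (timeouts, logging, recovery) that a real protocol needs.

```python
class Site:
    """A participant that runs one sub-transaction of the global transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # hypothetical flag: did the local work succeed?
        self.state = "active"

    def prepare(self):
        """Vote on the outcome: True means ready to commit."""
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_global_transaction(sites):
    """Record COMMIT only if every sub-transaction voted to commit."""
    votes = [site.prepare() for site in sites]
    if all(votes):
        for site in sites:
            site.commit()
        return "COMMIT"
    for site in sites:
        site.abort()
    return "ABORT"


ok_sites = [Site("S3"), Site("S5"), Site("S7")]
ok_outcome = run_global_transaction(ok_sites)

bad_sites = [Site("S3"), Site("S5", can_commit=False), Site("S7")]
bad_outcome = run_global_transaction(bad_sites)
```

    A single failed sub-transaction is enough to abort the work at every site, so no site is left holding a partial result.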


    Performance Transparency

    DDBMS must perform as if it were a centralized DBMS:

    DDBMS should not suffer any performance degradation due to the distributed architecture.

    DDBMS should determine the most cost-effective strategy to execute a request.


    Performance Transparency

    Distributed Query Processor (DQP) maps a data request into an ordered sequence of operations on local databases.

    It must consider the fragmentation, replication, and allocation schemas.

    DQP has to decide:

    which fragment to access;

    which copy of a fragment to use;

    which location to use.


    Performance Transparency

    DQP produces an execution strategy optimised with respect to some cost function.

    Typically, the costs associated with a distributed request include:

    I/O cost;

    CPU cost;

    communication cost.


    Performance Transparency - Example

    Property(Pno, City): 10,000 records, stored in London

    Renter(Rno, Max_Price): 100,000 records, stored in Glasgow

    Viewing(Pno, Rno): 1,000,000 records, stored in London

    SELECT p.pno
    FROM property p INNER JOIN
         (renter r INNER JOIN viewing v ON r.rno = v.rno)
         ON p.pno = v.pno
    WHERE p.city = 'Aberdeen' AND r.max_price > 200000;


    Performance Transparency - Example

    Assume:

    Each tuple in each relation is 100 characters long.

    10 renters with maximum price greater than 200,000.

    100,000 viewings for properties in Aberdeen.

    Computation time is negligible compared to communication time.
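
    With computation time ignored, comparing strategies comes down to how many characters each one ships across the network. The sketch below uses the record counts and tuple size stated in the example; the three strategies are illustrative choices made for this sketch, and it deliberately leaves out transmission rate and per-message delay, which the slides do not give here.

```python
TUPLE_SIZE = 100            # characters per tuple, as assumed in the example
PROPERTY_RECORDS = 10_000   # stored in London
RENTER_RECORDS = 100_000    # stored in Glasgow
VIEWING_RECORDS = 1_000_000 # stored in London
MATCHING_RENTERS = 10       # renters with max_price > 200,000

# Characters shipped between sites under each illustrative strategy.
strategies = {
    "ship all of Renter to London":
        RENTER_RECORDS * TUPLE_SIZE,
    "ship Property and Viewing to Glasgow":
        (PROPERTY_RECORDS + VIEWING_RECORDS) * TUPLE_SIZE,
    "select renters at Glasgow, ship only matches to London":
        MATCHING_RENTERS * TUPLE_SIZE,
}

cheapest = min(strategies, key=strategies.get)

for name, characters in strategies.items():
    print(f"{name}: {characters:,} characters")
```

    Applying the selection before shipping reduces the transfer from millions of characters to a thousand, which is why a query processor tries to push selections to the site that holds the data.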


    Date's 12 Rules for a DDBMS

    0. Fundamental Principle

    To the user, a distributed system should look exactly like a non-distributed system.

    1. Local Autonomy

    2. No Reliance on a Central Site

    3. Continuous Operation

    4. Location Independence

    5. Fragmentation Independence

    6. Replication Independence


    7. Distributed Query Processing

    8. Distributed Transaction Processing

    9. Hardware Independence

    10. Operating System Independence

    11. Network Independence

    12. Database Independence

    Note: last four rules are ideal!


    Distributed Transaction Management

    DDBMS must ensure:

    synchronization of sub-transactions with other local transactions executing concurrently at a site;

    synchronization of sub-transactions with global transactions running simultaneously at the same or different sites.

    Global transaction manager (transaction coordinator) at each site, to coordinate global and local transactions initiated at that site.


    Distributed Concurrency Control

    Techniques for distributed concurrency control must ensure distributed serializability.

    Locking protocols (extensions of the 2PL protocol).

    Distributed deadlock management.

    Timestamping methods (extend the definition of timestamp so that it includes a site identifier).
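
    One common way to build a timestamp that includes a site identifier is to pair a local counter with the site's ID, so that two sites can never issue the same timestamp. The sketch below assumes this (counter, site) form; the `TimestampGenerator` class and its `observe` method are hypothetical names used for illustration.

```python
class TimestampGenerator:
    """Issues globally unique timestamps of the form (local counter, site id)."""

    def __init__(self, site_id):
        self.site_id = site_id
        self._counter = 0

    def next(self):
        """Return a new timestamp that is larger than any issued by this site."""
        self._counter += 1
        return (self._counter, self.site_id)

    def observe(self, timestamp):
        """On seeing a remote timestamp, move the local counter forward if it lags."""
        self._counter = max(self._counter, timestamp[0])


site_1 = TimestampGenerator(site_id=1)
site_2 = TimestampGenerator(site_id=2)

first = site_1.next()    # (1, 1)
second = site_2.next()   # (1, 2): same counter value, different site

# Tuples compare counter first, then site id, which gives a total order.
ordered = sorted([second, first])

site_1.observe((5, 2))
later = site_1.next()    # (6, 1)
```

    The site identifier only breaks ties, so timestamps from a busy site and an idle site stay comparable as long as counters are advanced when remote timestamps are seen.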