
CLIENT/SERVER DATABASE AND DISTRIBUTED DATABASE


CHAPTER 13

Photo Courtesy of Hasbro

CHAPTER OBJECTIVES

After learning the material in this chapter, you will be able to:

✔ Describe the concepts and advantages of the client/server database approach.
✔ Describe the concepts and advantages of the distributed database approach.
✔ Explain how data can be distributed and replicated in a distributed database.
✔ Describe the problem of concurrency control in distributed database.
✔ Describe the distributed join process.
✔ Describe data partitioning in a distributed database.
✔ Describe distributed directory management.

HASBRO

Hasbro is a worldwide leader in children’s and family leisure time entertainment products and services, including the design, manufacture, and marketing of games and toys ranging from traditional to high tech. Headquartered in Pawtucket, RI, Hasbro was founded in 1923 by the Hassenfeld Brothers, hence the eventual company name. Over the years, the Hasbro family has expanded through internal growth plus acquisitions that include Milton Bradley (founded in 1860), Parker Brothers (founded in 1883), Tonka, Kenner, and Playskool. Included among its famous toys are MR. POTATO HEAD®, G.I. Joe®, Tonka Trucks®, Play Doh®, Easy Bake Oven®, Transformers®, Furby®, Tinkertoy®, and the games Monopoly® (the world’s all-time, best-selling game), Scrabble®, Chutes and Ladders, Candy Land®, The Game of Life®, Risk®, Clue®, Sorry®, and Yahtzee®.

Hasbro keeps track of this wide variety of toys and games with a database application called PRIDE (Product Rights Information Database), which was implemented in 2001. PRIDE’s function is to track the complete life cycle of Hasbro’s contract to produce or market each of its products. This includes the payment of royalties to the product’s inventor or owner, Hasbro’s territorial rights to sell the product by country or area of the world, distribution rights by marketing channel, various payment guarantees and advances, and contract expiration and renewal criteria. A variety of Hasbro departments use PRIDE, including accounting for royalty payments, marketing for worldwide marketing plans, merchandising, product development, and legal departments throughout the world.

PRIDE utilizes the Sybase DBMS and runs on an IBM RS-6000 Unix platform. Actual scanned images of the contracts are stored in the database. The system is designed to store amendments to the contracts, including keeping track of which amendments were in effect at any point in time. It is also designed to incorporate data corrections and to search the scanned contracts for particular text. The main database table is the Contract Master table, which has 7000 records and a variety of sub-tables containing detailed data about royalties, territories, marketing channels, agents, and licensors. These tables produce a variety of customizable reports and queries. The data can also be exported to MS Excel for further processing in spreadsheets.

Printed by permission of Hasbro

Simply put, the question in this chapter is, “Where is the database located?” In many situations, the obvious answer is, “It’s in the computer itself!” That is, it is located on one of the computer’s disk drives. If the computer in question is a stand-alone personal computer, of course the database is stored on the PC’s hard drive or on one of the PC’s removable disks. (Where else could it be?!) The same can be and often is true of much larger computer systems. A company can certainly choose to have its databases stored in its mainframe computer, while providing access to the computer and its databases on a broad, even worldwide scale.

Over the years, two arrangements for locating data other than “in the computer itself” have been developed. Both arrangements involve computers connected to each other on networks. One, known as client/server database, is for personal computers connected together on a local area network. The other, known as distributed database, is for larger, geographically dispersed computers located on a wide area network. The development of these networked data schemes has been driven by a variety of technical and managerial advantages, although, as is so often the case, there are some disadvantages to be considered, as well.

CLIENT/SERVER DATABASE

A local area network (LAN) is an arrangement of personal computers connected together by communications lines (Figure 13.1). It is local in the sense that the PCs must be located fairly close to each other, say within a building or within several nearby buildings. Additional components of the LAN that can be utilized or shared by the PCs can be other, often more powerful server computers and peripheral devices such as printers. The PCs on a LAN can certainly operate independently, but they can also communicate with each other. If, as is often the case, a LAN is set up to support a department in a company, the members of the department can communicate with each other, send data to each other, and share such devices as a high-speed printer. Finally, a gateway computer on the LAN can link the LAN and its PCs to other LANs, to one or more mainframe computers, or to the Internet.

If one of the LAN’s main advantages is the ability to share resources, then certainly one type of resource to share is data contained in databases. For example, the personnel specialists in a company’s personnel department might all have to have access to the company’s personnel database. But then, what are the options for locating and processing shared databases on a LAN? In terms of location, the basic concept is to store a shared database on a LAN server so that all of the PCs (also known as clients) on the LAN can access it. In terms of processing, there are a few possibilities in this two-tiered client/server arrangement.

The simplest tactic is known as the file server approach. When a client computer on the LAN needs to query, update, or otherwise use a file on the server, the entire file (yes, that’s right, the entire file) must be sent from the server to that client. All of the querying, updating, or other processing is then performed in the client computer. If changes were made to the file, the entire file is then shipped back to the server. Clearly, for files of even moderate size, shipping entire files back and forth across the LAN with any frequency will be very costly. In terms of concurrency control, obviously the entire file must be locked while one of the clients is updating even one record in it. Other than providing a rudimentary file-sharing capability, this arrangement’s drawbacks render it not very practical or useful.

➤ Figure 13.1 Local area network (LAN)

A much better arrangement is variously known as the database server or DBMS server approach. Again, the database is located at the server, but this time, the processing is split between the client and the server, and there is much less data traffic on the network. Say that someone at a client computer wants to query the database at the server. The query is entered at the client, and the client computer performs the initial keyboard and screen interaction processing, as well as initial syntax checking of the query. The system then ships the query over the LAN to the server where the query is actually run against the database. Only the results are shipped back to the client. Certainly, this is a much better arrangement than the file server approach! The network data traffic is reduced to a tolerable level, even for frequently queried databases. Also, security and concurrency control can be handled at the server in a much more contained way. The only real drawback to this approach is that the company must invest in a sufficiently powerful server to keep up with all of the activity concentrated there.
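To make the difference in network traffic concrete, here is a minimal sketch that compares the two approaches. The record size, file size, and query size are hypothetical figures chosen only for illustration, and the two function names are not from any particular product.

```python
# Hypothetical figures, chosen only to illustrate the comparison.
RECORD_BYTES = 200        # assumed size of one record
FILE_RECORDS = 100_000    # assumed number of records in the shared file
QUERY_BYTES = 150         # assumed size of the query text


def file_server_traffic(updated: bool) -> int:
    """Bytes moved when the entire file goes to the client (and back if changed)."""
    one_way = RECORD_BYTES * FILE_RECORDS
    return one_way * 2 if updated else one_way


def database_server_traffic(matching_records: int) -> int:
    """Bytes moved when only the query and its result cross the LAN."""
    return QUERY_BYTES + RECORD_BYTES * matching_records


if __name__ == "__main__":
    print("file server, read only:", file_server_traffic(updated=False))
    print("file server, one update:", file_server_traffic(updated=True))
    print("database server, 25 rows:", database_server_traffic(25))
```

Under these assumed numbers, a query that returns 25 records moves a few thousand bytes with a database server, while the file server approach moves the whole file regardless of how many records are actually needed.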

Another issue involving the data on a LAN is the fact that some databases can be stored on a client PC’s own hard drive while other databases that the client might access are stored on the LAN’s server. This is also known as a two-tier approach (Figure 13.2). Software has been developed that makes the location of the data transparent to the user at the client. In this mode of operation, the user issues a query at the client, and the software first checks to see if the required data is on the PC’s own hard drive. If it is, the data is retrieved from it, and that is the end of the story. If it is not there, then the software automatically looks for it on the server. In an even more sophisticated three-tier approach (Figure 13.3), if the software doesn’t find the data on the client PC’s hard drive or on the LAN server, it can leave the LAN through a gateway computer and look for the data on, for example, a large, mainframe computer that may be reachable from many LANs.

➤ Figure 13.2 Two-tier client/server database
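The search order just described (client hard drive first, then the LAN server, then a mainframe reached through the gateway) can be sketched as a simple ordered lookup. The tier contents and the `locate` function below are hypothetical; real software would consult a directory of data locations rather than a hard-coded list.

```python
from typing import Optional

# Hypothetical contents of each tier, listed in the order they are searched.
TIERS = [
    ("client hard drive", {"my_contacts"}),
    ("LAN server", {"personnel"}),
    ("mainframe", {"general_ledger"}),
]


def locate(table: str, tiers: list[tuple[str, set[str]]]) -> Optional[str]:
    """Return the first tier that holds the table, or None if no tier has it."""
    for tier_name, tables in tiers:
        if table in tables:
            return tier_name
    return None


if __name__ == "__main__":
    for name in ("my_contacts", "personnel", "general_ledger", "unknown"):
        print(name, "->", locate(name, TIERS))
```

Dropping the last entry from `TIERS` gives the two-tier behavior; the user’s query is the same either way, which is the point of making the location transparent.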

In another use of the term three-tier approach, the three tiers are the client PCs, servers known as application servers, and other servers known as database servers (Figure 13.4). In this arrangement, local screen and keyboard interaction is still handled by the clients, but they can now request a variety of applications to be performed at and by the application servers. The application servers, in turn, rely on the database servers and their databases to supply the data needed by the applications. Though certainly well beyond the scope of LANs, an example of this kind of arrangement is the World Wide Web on the Internet. The local processing on the clients is limited to the data input and data display capabilities of browsers such as Netscape’s Communicator and Microsoft’s Internet Explorer. The application servers are the computers at company Web sites that conduct the companies’ business with the “visitors” working through their browsers. The company application servers in turn rely on the companies’ database servers to provide the necessary data to complete the transactions. For example, when a bank’s customer visits his bank’s Web site, he can initiate lots of different transactions, ranging from checking his account balances to transferring money between accounts to paying his credit card bills. The bank’s Web application server handles all of these transactions. It, in turn, sends requests to the bank’s database server and databases to retrieve the current account balances, add money to one account while deducting money from another in a funds transfer, and so forth.

➤ Figure 13.3 Three-tier client/server database

DISTRIBUTED DATABASE

The Distributed Database Concept

In today’s world of universal dependence on information systems, all sorts of people need access to companies’ databases. In addition to a company’s own employees, these include the company’s customers, potential customers, suppliers, and vendors of all types. It is possible for a company to have all of its databases concentrated at one mainframe computer site with worldwide access to this site provided by telecommunications networks, including the Internet. Although the management of such a centralized system and its databases can be controlled in a well-contained manner and this can be advantageous, it poses some problems as well. For example, if the single site goes down, then everyone is blocked from accessing the databases until the site comes back up again. Also the communications costs from the many far-flung PCs and terminals to the central site can be expensive. One solution to such problems, and an alternative design to the centralized database concept, is known as distributed database.

The idea is that instead of having one, centralized database, we are going to spread the data out among the cities on the distributed network, each of which has its own computer and data storage facilities. All of this distributed data is still considered to be a single logical database. When a person or process anywhere on the distributed network queries the database, it is not necessary to know where on the network the data being sought is located. The user just issues the query, and the result is returned. This feature is known as location transparency. This can become rather complex very quickly, and it must be managed by sophisticated software known as a distributed database management system or distributed DBMS.

➤ Figure 13.4 Another type of three-tier client/server approach

Distributing the Data

Consider a large multinational company with major sites in Los Angeles, Memphis, New York (which is corporate headquarters), Paris, and Tokyo. Let’s say that the company has a very important transactional relational database that is actively used at all five sites. The database consists of six large tables, named A, B, C, D, E, and F, and response time regarding queries made to the database is an important factor. If the database was centralized, the arrangement would look like Figure 13.5, with all six tables located in New York.

The first and simplest idea in distributing the data would be to disperse the six tables among the five sites. If particular tables are used at some sites more frequently than at other sites, it would make sense to locate the tables at the sites at which they are most frequently used. Figure 13.6 shows that we have kept Tables A and B at New York, while moving Table C to Memphis, Tables D and E to Tokyo, and Table F to Paris. Say that we moved Table F to Paris because it is used most frequently there. With Table F in Paris, the people there can use it as much as they want to without running up any telecommunications costs, as opposed to when the table used to be in New York. Furthermore, the Paris employees can exercise local autonomy over the data, taking responsibility for its security, backup and recovery, and concurrency control.

Unfortunately, distributing the database in this way has not relieved some of the problems that we had with the centralized database, and it has introduced a couple of new ones. The main problem that is carried over from the centralized approach is availability. In the centralized approach of Figure 13.5, if the New York site went down, no other site on the network could access Table F (or any of the other tables). In the dispersed approach of Figure 13.6, if the Paris site goes down, Table F is equally unavailable to the other sites. A new problem that crops up in Figure 13.6 has to do with joins. When the database was centralized at New York, a query issued at any of the sites that required a join of two or more of the tables could be handled in the standard way by the computer at New York. The result would then be sent to the site that issued the query. In the dispersed approach, a join might require tables located at different sites! Though not an insurmountable problem, this would obviously add some major complexity (and we will discuss this further later in this chapter). Furthermore, although we could (and did) make the argument that local autonomy is good for issues like security control, an argument can also be made that security for the overall database can better be handled at a single, central location. Clearly, the simple dispersal of database tables as shown in Figure 13.6 is of limited benefit.

➤ Figure 13.5 Centralized database

Let’s introduce a new option into the mix. Suppose that we allow database tables to be duplicated (the term used with distributed database is replicated) at two or more sites on the network. This idea has both advantages and, unfortunately, disadvantages. On the plus side, the first advantage is availability. If a table is replicated at two or more sites and one of those sites goes down, everyone everywhere else on the network can still access the table at the other site(s). Also, if more than one site requires frequent access to a particular table, the table can be replicated at each of those sites, again minimizing telecommunications costs during data access. And copies of a table can be located at sites that have tables with which it may have to be joined, allowing the joins to take place at those sites without having the complexity of having to join tables across multiple sites. On the down side, if a table is replicated at several sites, it becomes more of a security risk. But the biggest problem that data replication introduces is that of concurrency control. As we have already seen, concurrency control is an issue even without replicated tables; with replicated tables, it becomes even more complex. How do you keep data consistent when it is replicated in tables on three continents? More about this issue later.

➤ Figure 13.6 Distributed database with no data replication

Assuming, then, that data replication has some advantages and that we are willing to deal with the disadvantages, what are the options for where to place the replicated tables? Figure 13.7 shows the maximum approach of replicating every table at every site. It’s great for availability and for joins, but it’s the absolute worst arrangement regarding concurrency control. Every change to every table has to be reflected at every site. It’s also a security nightmare, and, by the way, it takes up a lot of disk space.

The concept in Figure 13.8 is to have a copy of the entire database at headquarters in New York and to have each table replicated exactly once at one of the other sites. Again, this improves availability, at least to the extent that each table is now at two sites. Because each table is at only two sites, the security and concurrency exposures are limited. Any join that has to be executed can be handled at New York. So, this arrangement sounds pretty good, but it is limiting. What if a particular table is used heavily at both Tokyo and Los Angeles? We would like to place copies of it at both of those sites, but we can’t because the premise is to have one copy in New York and only one other copy elsewhere. Also, New York would tend to become a bottleneck, with all of the joins and many of the other accesses being sent there. Still, the design of Figure 13.8 appears to be an improvement over the design of Figure 13.7. Can we do better still?

➤ Figure 13.7 Distributed database with maximum data replication

The principle behind making this concept work is flexibility in placing replicated tables where they will do the most good. We want to:

• Place copies of tables at the sites that use them most heavily in order to minimize telecommunications costs.

• Ensure that there are at least two copies of important or frequently used tables to realize the gains in availability.

• Limit the number of copies of any one table to control the security and concurrency issues.

• Avoid any one site becoming a bottleneck.

Figure 13.9 shows an arrangement of replicated tables based on these principles. There are two copies each of Tables A, B, E, and F, and three copies of Table D. Apparently, Table C is relatively unimportant or infrequently used, and it is located solely at Los Angeles.
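The availability consequences of a placement like this can be checked mechanically. The sketch below encodes the Figure 13.9 arrangement as a dictionary; the sites for Tables B, C, D, E, and F follow the chapter’s discussion of that figure, while the two sites shown for Table A (New York and Tokyo) are read from the figure and should be treated as an assumption. The function names are hypothetical.

```python
# Figure 13.9 placement; Table A's sites are an assumption read from the figure.
PLACEMENT = {
    "New York": {"A", "B", "F"},
    "Los Angeles": {"C", "D"},
    "Memphis": {"D", "E"},
    "Paris": {"B", "D", "E"},
    "Tokyo": {"A", "F"},
}


def copies(placement: dict[str, set[str]]) -> dict[str, int]:
    """Count how many sites hold a copy of each table."""
    counts: dict[str, int] = {}
    for tables in placement.values():
        for table in tables:
            counts[table] = counts.get(table, 0) + 1
    return counts


def unavailable_if_down(placement: dict[str, set[str]], site: str) -> set[str]:
    """Tables that no surviving site can supply when the given site goes down."""
    surviving = set().union(
        *(tables for name, tables in placement.items() if name != site)
    )
    return placement[site] - surviving


if __name__ == "__main__":
    print(sorted(copies(PLACEMENT).items()))
    for city in PLACEMENT:
        print(city, "down ->", sorted(unavailable_if_down(PLACEMENT, city)))
```

With this placement, only a Los Angeles outage makes a table (Table C) unreachable, which matches the observation that Table C is the one table without a second copy.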

➤ Figure 13.8 Distributed database with one complete copy in one city

Concurrency Control in Distributed Database

Earlier in the book we discussed concurrency control in terms of the problems involved in multiple people or processes trying to update a record at the same time. When we allow replicated tables to be dispersed all over the country or the world in a distributed database, the problems of concurrent update expand, too. The original possibility of the “lost update” is still there. If two people attempt to update a particular record of Table B in New York at the same time, everything that we said about the problem of concurrent update earlier in the book remains true. But now, in addition, look at what happens when geographically dispersed, replicated files are involved. In Figure 13.9, if one person updates a particular value in a record of Table B in New York at the same time that someone else updates the very same value in the very same record of Table B in Paris, the results are going to be wrong. Or if one person updates a particular record of Table B in New York and then right after that a second person reads the same record of Table B in Paris, that second person is not going to get the latest, most up-to-date data. The protections that we discussed earlier that can be put into place to handle the problem of concurrent update in a single table are not adequate to handle the new, expanded problem.

If the nature of the data and of the applications that use it can tolerate retrieved data not necessarily being up-to-the-minute accurate, then several “asynchronous” approaches to updating replicated data can be used. For example, the site at which the data was updated, New York in the above example involving Table B, can simply send a message to the other sites that contain a copy of the same table, in this case Paris which also has a copy of Table B, in the hopes that the update will reach Paris reasonably quickly and that the computer in Paris will update that record in Table B right away. In another asynchronous scheme, one of the sites can be chosen to accumulate all of the updates to all of the tables. That site can then transmit the changes to all of the other sites on a regularly scheduled basis. Or each table can have one of the sites be declared the “dominant” site for that table. All of the updates for a particular table can be sent to the copy of the table at its dominant site, which can then transmit the updates to the other copies of the table on some timed or other basis.

➤ Figure 13.9 Distributed database with targeted data replication (tables by site: New York A, B, F; Los Angeles C, D; Memphis D, E; Paris B, D, E; Tokyo A, F)
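The “dominant” site scheme can be sketched in a few lines. This is a minimal illustration, not how any particular distributed DBMS implements it: the `ReplicatedTable` class, its method names, and the choice of New York as the dominant site for Table B are all assumptions made for the example.

```python
class ReplicatedTable:
    """One table with a dominant copy and lagging replicas (illustrative only)."""

    def __init__(self, dominant: str, replicas: list[str]) -> None:
        self.dominant = dominant
        # Each site's copy is modeled as a dict of record key -> value.
        self.copies: dict[str, dict] = {site: {} for site in [dominant, *replicas]}
        self.pending: list[tuple] = []  # updates not yet sent to the replicas

    def update(self, key, value) -> None:
        """Apply the update at the dominant site and queue it for the replicas."""
        self.copies[self.dominant][key] = value
        self.pending.append((key, value))

    def propagate(self) -> None:
        """Send queued updates to every other copy, e.g. on a timed basis."""
        for key, value in self.pending:
            for site, rows in self.copies.items():
                if site != self.dominant:
                    rows[key] = value
        self.pending.clear()


if __name__ == "__main__":
    table_b = ReplicatedTable(dominant="New York", replicas=["Paris"])
    table_b.update(17, "new value")
    print("Paris before propagation:", table_b.copies["Paris"].get(17))
    table_b.propagate()
    print("Paris after propagation:", table_b.copies["Paris"].get(17))
```

Between `update` and `propagate`, a reader in Paris sees the old data. That window is exactly the staleness an application must be able to tolerate before an asynchronous approach is acceptable.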

But if the nature of the data and of the applications that use it require that all of the data in the replicated tables worldwide always be consistent, accurate, and up-to-date, then a more complex synchronous procedure must be put into place. Although there are variations on this theme, the basic process for accomplishing this is known as the two-phase commit. This process works as follows. Each computer on the network has a special log file in addition to its database tables. So, in Figure 13.9, each of the five cities has one of these special log files. Now, when an update is to be made at one site, the distributed DBMS has to do several things. It has to freeze all of the replicated copies of the table involved, send the update out to all the sites with the table copies, and then be sure that all the copies were updated. After all of that happens, all of the replicated copies of the table will have been updated and processing can resume. Remember that for this to work properly, either all of the replicated files have to be updated or none of them must be updated. What we don’t want is for the update to take place at some of the sites and not at the others, which would obviously leave inconsistent results. Let’s look at an example using Table D in Figure 13.9. Copies of Table D are located in Los Angeles, Memphis, and Paris.

Say that someone issues an update request to a record in Table D in Memphis. In the first or “prepare” phase of the two-phase commit, the computer in Memphis sends the updated data to Los Angeles and Paris. The computers in all three cities write the update to their logs (but not to their actual copies of Table D, at this point). The computers in Los Angeles and Paris attempt to lock their copies of Table D to get ready for the update. If another process is using their copy of Table D, then they will not be able to do this. Los Angeles and Paris then report back to Memphis whether or not they are in good operating shape and whether or not they were able to lock Table D. The computer in Memphis takes in all of this information and then makes a decision of whether to go ahead with the update or to abort it. If Los Angeles and Paris report back that they are up and running and were able to lock Table D, then the computer in Memphis will decide to go ahead with the update. If the news from Los Angeles and Paris was bad, Memphis will decide not to go ahead with the update. So, in the second or “commit” phase of the two-phase commit, Memphis sends its decision to Los Angeles and Paris. If it decides to complete the update, then all three cities transfer the updated data from their logs to their copy of Table D. If Memphis decides to abort the update then none of the sites transfers the updated data from their logs to their copy of Table D. All three copies of Table D remain as they were, and Memphis can start the process all over again.
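The Table D walkthrough above can be sketched as a small in-memory simulation. It is a simplified illustration under stated assumptions: the `Site` class, its attributes, and `two_phase_commit` are hypothetical names, each site’s log holds at most one pending update, and network failures between the two phases are ignored.

```python
class Site:
    """One city's copy of Table D plus its log (illustrative only)."""

    def __init__(self, name: str, up: bool = True, busy: bool = False) -> None:
        self.name = name
        self.up = up              # is the site in good operating shape?
        self.busy = busy          # is another process using its copy of Table D?
        self.log = None           # update written to the log, not yet to the table
        self.holds_lock = False   # lock taken on behalf of this update
        self.table_d: dict = {}

    def prepare(self, key, value) -> bool:
        """Phase 1: log the update, try to lock Table D, and report back."""
        if not self.up:
            return False
        self.log = (key, value)
        if self.busy:
            return False
        self.holds_lock = True
        return True

    def finish(self, commit: bool) -> None:
        """Phase 2: on commit, move the logged update into Table D; else discard."""
        if commit and self.log is not None:
            key, value = self.log
            self.table_d[key] = value
        self.log = None
        self.holds_lock = False


def two_phase_commit(coordinator: Site, others: list[Site], key, value) -> bool:
    sites = [coordinator, *others]
    votes = [site.prepare(key, value) for site in sites]  # "prepare" phase
    decision = all(votes)         # go ahead only if every site reported good news
    for site in sites:            # "commit" phase: send the decision to everyone
        site.finish(decision)
    return decision


if __name__ == "__main__":
    memphis, la, paris = Site("Memphis"), Site("Los Angeles"), Site("Paris")
    print(two_phase_commit(memphis, [la, paris], 42, "new value"))
    paris.busy = True             # another process is now using Paris's copy
    print(two_phase_commit(memphis, [la, paris], 42, "newer value"))
    print(memphis.table_d, la.table_d, paris.table_d)
```

The second call is aborted because Paris cannot lock its copy, so all three copies keep the value from the first update. That all-or-nothing outcome is the property the two-phase commit exists to guarantee.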

The two-phase commit is certainly a complex, costly, and time-consuming process. It should be clear that the more volatile the data in the database is, the less attractive is this type of synchronous procedure for updating replicated tables in the distributed database.

Distributed Joins

Let’s take a look at the issue of distributed joins, which came up earlier. In a distributed database in which no single computer (no single city) in the network contains the entire database, there is the possibility that a query will be run from one of the computers that requires a join of two or more tables that are not all at the same computer. Consider the distributed database design in Figure 13.9. Let’s say that a query issued at Los Angeles requires the join of Tables E and F. First of all, neither of the two tables is located at Los Angeles, the site that issued the query. Then, notice that none of the other four cities contains a copy of both Tables E and F. That means that there is no one city to which the query can be sent for complete processing, including the join.

In order to handle this type of distributed join situation, the distributed DBMS must have a sophisticated capability to move data from one city to another to accomplish the join. Earlier in the book, we described the idea of the relational DBMS’s relational query optimizer as an expert system that figures out an efficient way to respond to and satisfy a relational query. Similarly, the distributed DBMS must have its own built-in expert system that is capable of figuring out an efficient way to handle a request for a distributed join. This distributed DBMS expert system will work hand-in-hand with the relational query optimizer, which will still be needed to determine which records of a particular table are needed to satisfy the join, among other things. In the example of the query issued from Los Angeles that requires a join of Tables E and F, there are several options:

• Figure out which records of Table E are involved in the join and send copies of them from either Memphis or Paris (each of which has a copy of Table E) to either New York or Tokyo (each of which has a copy of the other table involved in the join, Table F). Then, execute the join in whichever of New York or Tokyo was chosen to receive the records from Table E and send the result back to Los Angeles.

• Figure out which records of Table F are involved in the join and send copies of them from either New York or Tokyo (each of which has a copy of Table F) to either Memphis or Paris (each of which has a copy of the other table involved in the join, Table E). Then, execute the join in whichever of Memphis or Paris was chosen to receive the records from Table F and send the result back to Los Angeles.

• Figure out which records of Table E are involved in the join and send copies of them from either Memphis or Paris (each of which has a copy of Table E) to Los Angeles, the city that initiated the join request. Figure out which records of Table F are involved in the join and send copies of them from either New York or Tokyo (each of which has a copy of Table F) to Los Angeles. Then, execute the join in Los Angeles, the site that issued the query.


How does the distributed DBMS decide among these options? It must consider:

• The number and size of the records from each table involved in the join.

• The distances and costs of transmitting the records from one city to another to execute the join.

• The distance and cost of shipping the result of the join back to the city that issued the query in the first place.

For example, if only 20 records of Table E are involved in the join while all of Table F is needed, then it would make sense to send copies of the 20 Table E records to a city that has a copy of Table F. The join can then be executed at the Table F city and the result sent back to Los Angeles. The arrangement of tables in Figure 13.9 suggests that one solution would be to send the 20 records from Table E in Memphis to New York, one of the cities with Table F. The query could then be executed in New York and the result sent to Los Angeles, which issued the query. Why Memphis and New York rather than Paris and Tokyo, the other cities that have copies of Tables E and F, respectively? Because the distance (and probably the cost) between Memphis and New York is much less than the distances involving Paris and Tokyo. Finally, what about the option of shipping the data needed from both tables to Los Angeles, the city that issued the query, for execution? Remember, the entirety of Table F is needed for the join in this example. Shipping all of Table F to Los Angeles to execute the join there would probably be much more expensive than the New York option.
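The reasoning in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how any particular distributed DBMS makes the decision. The per-record link costs, the size of Table F, the size of the join result, and all of the names below are assumptions invented for the sketch; only the table locations and the 20 Table E records come from the example. The sketch also assumes that all records are the same size.

```python
from itertools import product

# Sites holding a copy of each table, as in the example in the text.
SITES = {"E": ["Memphis", "Paris"], "F": ["New York", "Tokyo"]}
QUERY_SITE = "Los Angeles"

# Hypothetical cost of sending ONE record between two cities (symmetric).
# These numbers are invented for illustration only.
LINK_COST = {
    frozenset(["Memphis", "New York"]): 1,
    frozenset(["Memphis", "Los Angeles"]): 2,
    frozenset(["New York", "Los Angeles"]): 3,
    frozenset(["New York", "Paris"]): 6,
    frozenset(["Memphis", "Paris"]): 7,
    frozenset(["Los Angeles", "Tokyo"]): 8,
    frozenset(["Los Angeles", "Paris"]): 9,
    frozenset(["Memphis", "Tokyo"]): 10,
    frozenset(["Paris", "Tokyo"]): 10,
    frozenset(["New York", "Tokyo"]): 11,
}


def link_cost(a, b):
    """Per-record cost between two cities; moving data within a city is free."""
    return 0 if a == b else LINK_COST[frozenset((a, b))]


def candidate_plans(n_e, n_f, n_result):
    """List (cost, description) for each of the three options in the text."""
    plans = []
    for e_site, f_site in product(SITES["E"], SITES["F"]):
        # Option 1: ship the Table E records to a Table F site and join there.
        cost = n_e * link_cost(e_site, f_site) + n_result * link_cost(f_site, QUERY_SITE)
        plans.append((cost, f"ship E from {e_site} to {f_site}, join in {f_site}"))
        # Option 2: ship the Table F records to a Table E site and join there.
        cost = n_f * link_cost(f_site, e_site) + n_result * link_cost(e_site, QUERY_SITE)
        plans.append((cost, f"ship F from {f_site} to {e_site}, join in {e_site}"))
        # Option 3: ship both sets of records to the site that issued the query.
        cost = n_e * link_cost(e_site, QUERY_SITE) + n_f * link_cost(f_site, QUERY_SITE)
        plans.append((cost, f"ship E from {e_site} and F from {f_site}, join in {QUERY_SITE}"))
    return plans


def best_plan(n_e, n_f, n_result):
    """Pick the cheapest candidate plan under the assumed cost model."""
    return min(candidate_plans(n_e, n_f, n_result))


# 20 Table E records (from the text); 10,000 Table F records and a
# 20-record result are assumed values.
print(best_plan(n_e=20, n_f=10_000, n_result=20))
```

Under these assumed costs the cheapest plan is the one the text arrives at: send the 20 Table E records from Memphis to New York, join there, and return the result to Los Angeles. With different record counts or link costs, one of the other options could win instead.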

Partitioning or Fragmentation

Another option in the distributed database bag of tricks is known as partitioning or fragmentation. This is actually a variation on the theme of file partitioning that we discussed in the context of physical database design, earlier in the book.

In horizontal partitioning, a relational table can be split up so that some records are located at one site, other records are located at another site, and so on. Figure 13.10 shows the same five-city network that we have been using as an example, with another table, Table G, added. The figure shows that subset G1 of the records of Table G is located in Memphis, subset G2 is located in Los Angeles, and so on. A simple example of this would be the company’s employee table: the records of the employees who work in a given city are stored in that city’s computer. Thus G1 is the subset of records of Table G consisting of the records of the employees who work in Memphis, G2 is the subset consisting of the employees who work in Los Angeles, and so forth. This makes sense when one considers that most of the query and access activity on a particular employee’s record will take place at that employee’s work location. The drawback is that when one of the sites, say the New York headquarters location, occasionally needs to run an application that requires accessing the employee records of everyone in the company, it must collect them from every one of the five sites.
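The employee-table example can be illustrated with a short Python sketch. The sample records, the field layout, and the function names are all hypothetical; the point is only to show whole records being assigned to sites by work city, a local lookup touching one partition, and a company-wide request having to visit every partition.

```python
# Hypothetical employee records: (employee_number, name, work_city).
EMPLOYEES = [
    (101, "Ada", "Memphis"),
    (102, "Ben", "Los Angeles"),
    (103, "Cy", "New York"),
    (104, "Dee", "Memphis"),
]


def partition_horizontally(records, city_of):
    """Assign each whole record to the site named by city_of(record)."""
    partitions = {}
    for rec in records:
        partitions.setdefault(city_of(rec), []).append(rec)
    return partitions


def local_lookup(partitions, city, emp_no):
    """The common case: search only the partition stored at one site."""
    return next((r for r in partitions.get(city, []) if r[0] == emp_no), None)


def company_wide(partitions):
    """The costly case: collect records from every site's partition."""
    return sorted(rec for part in partitions.values() for rec in part)


partitions = partition_horizontally(EMPLOYEES, city_of=lambda rec: rec[2])
print(local_lookup(partitions, "Memphis", 104))
print(len(company_wide(partitions)))
```

In a real network each partition would sit on a different computer, so the loop inside `company_wide` stands in for one remote access per site.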

In vertical partitioning, the columns of a table are divided up among several cities on the network. Each such partition must include the primary key attribute(s) of the table. This arrangement can make sense when different sites are responsible for processing different functions involving an entity. For example, the salary attributes of a personnel table might be stored in one city while the skills attributes of the table might be stored in another city. Both partitions would include the employee number, the primary key of the full table. Note that to bring the different pieces of data about a particular employee back together again in a query would require a multi-site join of the two fragments of that employee’s record.
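A minimal Python sketch of the personnel example follows. The sample rows, the site names `payroll_site` and `training_site`, and the function names are assumptions made up for the illustration. It shows each fragment carrying the primary key, and the fragments being brought back together by matching on that key.

```python
# Hypothetical personnel table; employee_number is the primary key.
PERSONNEL = [
    {"employee_number": 101, "salary": 52000, "skills": "welding"},
    {"employee_number": 102, "salary": 61000, "skills": "accounting"},
]


def partition_vertically(rows, key, column_groups):
    """Split columns into per-site fragments; every fragment keeps the key."""
    return {
        site: [{key: row[key], **{col: row[col] for col in cols}} for row in rows]
        for site, cols in column_groups.items()
    }


def rejoin(fragments, key):
    """Rebuild full rows by matching fragments on the primary key."""
    merged = {}
    for rows in fragments.values():
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


fragments = partition_vertically(
    PERSONNEL,
    "employee_number",
    {"payroll_site": ["salary"], "training_site": ["skills"]},
)
print(fragments["payroll_site"])
print(rejoin(fragments, "employee_number"))
```

Without the primary key in every fragment, `rejoin` would have no way to tell which salary belongs with which set of skills, which is why each vertical partition must include it.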

Can a table be partitioned both horizontally and vertically? Yes, in principle! Can horizontal and vertical partitions be replicated? Yes, again, in principle! But bear in mind that the more exotic such arrangements become, the more complexity there is for the software and the IT personnel to deal with.

Distributed Directory Management

In the discussion on distributed database up to this point we’ve been taking the notion of location transparency for granted. That is, we’ve been assuming that when a query is issued at any city on the network, the system simply “knows” where to find the data it needs to satisfy the query. But that knowledge has to come from somewhere, and it comes in the form of a directory. A distributed DBMS must include a directory that keeps track of where the database tables, the replicated copies of database tables (if any), and the table partitions (if any) are located. Then, when a query is presented at any city on the network, the distributed DBMS can automatically use the directory to find out where the required data is located and maintain location transparency. That is, the person or process that initiated the query does not have to know where the data is, whether or not it is replicated, or whether or not it is partitioned.
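The directory idea can be pictured as a simple lookup table. The following Python sketch is hypothetical: the `DIRECTORY` structure and the `locate` function are invented for illustration, and the rule of taking the first listed site when there is no local copy is a simplification (a real system would weigh transmission costs). The locations of Tables E and F and of partitions G1 and G2 follow the running example.

```python
# Hypothetical directory: table or partition name -> sites that hold it.
DIRECTORY = {
    "E": ["Memphis", "Paris"],      # replicated table
    "F": ["New York", "Tokyo"],     # replicated table
    "G1": ["Memphis"],              # horizontal partition of Table G
    "G2": ["Los Angeles"],          # horizontal partition of Table G
}


def locate(name, query_site, directory=DIRECTORY):
    """Return a site holding `name`, preferring a copy at the querying site."""
    sites = directory.get(name)
    if not sites:
        raise KeyError(f"{name!r} is not in the directory")
    # Simplification: fall back to the first listed site, ignoring cost.
    return query_site if query_site in sites else sites[0]


print(locate("E", "Paris"))
print(locate("E", "Los Angeles"))
```

The caller names only the table it wants. Where the data lives, and whether it is replicated or partitioned, stays hidden behind `locate`, which is the essence of location transparency.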

➤ Figure 13.10 Distributed database with data partitioning/fragmentation (map of the five-city network with Table G split into partitions G1 through G5)

Which brings up an interesting question. Where should the directory itself be stored? As with the matter of how to distribute the database tables themselves, there are a number of possibilities, some relatively simple and others more complex, with many of the same kinds of advantages and disadvantages that we’ve already discussed. The entire directory could be stored at only one site, copies of the directory could be stored at several of the sites, or a copy of the directory could be stored at every site. Actually, since the directory must be referenced for every query issued at every site and since the directory data will only change when new database tables are added to the database, database tables are moved, or new replicated copies or partitions are set up (all of which would be fairly rare occurrences), the best solution generally is to have a copy of the directory at every site.

Distributed DBMS Advantages and Disadvantages

At this point, it will be helpful to review and summarize the advantages and disadvantages of the distributed database concept and its various options. Figure 13.11 provides this summary, which includes the advantages and disadvantages of a centralized database, for comparison purposes.

Centralized Database—Like Figure 13.5

Advantages:

• Single site provides high degree of security, concurrency, and backup and recovery control.

• No need for a distributed directory since all of the data is in one place.

• No need for distributed joins since all of the data is in one place.

Disadvantages:

• All data accesses from other than the site with the database incur communications costs.

• The site with the database can become a bottleneck.

• Possible availability problem: if the site with the database goes down, there can be no data access.

Dispersing Tables on the Network (without replication or partitioning)—Like Figure 13.6

Advantages:

• Local autonomy.

• Reduced communications costs because each table can be located at the site that most heavily uses it.

• Improved availability because portions of the database are available even if one or some of the sites are down.

Disadvantages:

• Several sites have to be concerned with security, concurrency, and backup and recovery.

• Requires a distributed directory and the software to support location transparency.

• Requires distributed joins.

(Continued)

➤ Figure 13.11 Advantages and disadvantages of centralized and distributed database approaches

Targeted Data Replication—Like Figure 13.9

Advantages in addition to the advantages of dispersed tables:

• Greatly reduced communications costs for read-only data access because copies of tables can be located at multiple sites that most heavily use them.

• Greatly improved availability because if a site with a database table goes down, there may be another site with a copy of that table.

Disadvantages in addition to the disadvantages of dispersed tables:

• Multi-site concurrency control when data in replicated tables is updated.

Partitioned Tables—Like Figure 13.10

Advantages:

• Greatest local autonomy because data at the record or column level can be stored at the site(s) that most heavily use it.

• Greatly reduced communications costs because data at the record or column level can be stored at the site(s) that most heavily use it.

Disadvantages:

• Retrieving all or a large portion of a table may require multi-site accesses.

➤ Figure 13.11 (Continued) Advantages and disadvantages of centralized and distributed database approaches

STATE OF TENNESSEE—DEPARTMENT OF SAFETY

Tennessee, with 5.7 million people and an area of over 42,000 square miles, is the 16th largest U.S. state in population and the 36th largest in size. Tennessee became the 16th state of the U.S. in 1796. Its principal cities are Memphis, Nashville, the capital, Knoxville, and Chattanooga. Tennessee’s leading industries include printing, publishing, chemicals, fabricated metals, and automobile manufacturing. Almost one-half of the state’s land is dedicated to 80,000 farms with the major products being cattle, hardwood lumber, dairy products, and cotton. Centrally located in the U.S., the state is also known as a major distribution center. As with all states, the Tennessee state government is responsible for a wide variety of public services, including the collection and management of state taxes, the management and maintenance of state parks, and the management of various social services for its citizens. The state’s Department of Safety is responsible for services such as the licensing of motor vehicles and drivers, and the enforcement of laws covering the operation of motor vehicles.

The Department of Safety maintains a Driver’s License System database application that tracks the state’s driver’s licenses. Implemented in 1978, the database stores basic name and address data as well as data that specifies the type of license and any restrictions such as required corrective lenses. In 1996, an extension to the application was implemented that captures and stores both a photograph of the driver and the driver’s signature in a digital format or “image.” All of this data, including the photo and signature, is incorporated into the actual physical driver’s license. The images are captured at each driver’s licensing location and transmitted online to the database for storage. All of the data, including the images, can be queried and retrieved online using canned queries.

Running on an IBM OS/390 mainframe computer located in the capital, Nashville, the database application is an interesting hybrid of two different types of databases and DBMSs. The original application that stores the name and address and license type data, and which dates from 1978, is implemented in IBM’s IMS DBMS. The 1996 extension that stores the photos and signatures is implemented in IBM’s DB2 relational DBMS. The relational database currently holds approximately 7 million photo and signature images, including driver photos taken for previous license renewals.

Printed by permission of State of Tennessee—Department of Safety

KEY TERMS

Application server
Client
Client/server database
Database server
Database server approach
Distributed data
Distributed database
Distributed database management
Distributed join
Distributed directory management
File server approach
Fragmentation
Gateway computer
Local area network (LAN)
Local autonomy
Location transparency
Partitioning
Replicated data
Server
Three-tiered client/server approach
Two-phase commit
Two-tiered client/server approach

QUESTIONS

1. What is a client/server database system?

2. Explain the database server approach to client/server database.

3. What are the advantages of the database server approach to client/server database compared to the file server approach?

4. What is data transparency in client/server database? Why is it important?

5. Compare the two-tier arrangement of client/server database to the three-tier arrangement.

6. What is a distributed database? What is a distributed database management system?

7. Why would a company be interested in moving from the centralized to the distributed database approach?

8. What are the advantages of locating a portion of a database in the city in which it is most frequently used?

9. What are the advantages and disadvantages of data replication in a distributed database?

10. Describe the concept of asynchronous updating of replicated data. For what kinds of applications would it work or not work?

11. Describe the two-phase commit approach to updating replicated data.

12. Describe the factors used in deciding how to accomplish a particular distributed join.

13. Describe horizontal and vertical partitioning in a distributed database.

14. What are the advantages and disadvantages of horizontal partitioning in a distributed database?

15. What are the advantages and disadvantages of vertical partitioning in a distributed database?

16. What is the purpose of a directory in a distributed database? Where should the directory be located?

17. Discuss the problem of directory management for distributed database. Do you think that as an issue, it is more critical, less critical, or about the same as the distribution of the data itself? Explain.

EXERCISES

1. Australian Boomerang, Ltd., wants to design a distributed relational database. The company is headquartered in Perth and has major operations in Sydney, Melbourne, and Darwin. The database involved consists of five tables, labeled A, B, C, D, and E, with the following characteristics:

Table A consists of 500,000 records and is heavily used in Perth and Sydney.

Table B consists of 100,000 records and is frequently required in all four cities.

Table C consists of 800 records and is frequently required in all four cities.

Table D consists of 75,000 records. Records 1–30,000 are most frequently used in Sydney. Records 30,001–75,000 are most frequently used in Melbourne.

Table E consists of 20,000 records and is used almost exclusively in Perth.

Design a distributed relational database for Australian Boomerang. Justify your placement, replication, and partitioning of the tables.

2. Canadian Maple Trees, Inc., has a distributed relational database with tables in computers in Halifax, Montreal, Ottawa, Toronto, and Vancouver. The database consists of twelve tables, some of which are replicated in multiple cities. Among them are tables A, B, and C, with the following characteristics.

Table A consists of 800,000 records and is located in Halifax, Montreal, and Vancouver.

Table B consists of 100,000 records and is located in both Halifax and Toronto.

Table C consists of 20,000 records and is located in Ottawa and Vancouver.

Telecommunications costs among Montreal, Ottawa, and Toronto are relatively low, while telecommunications costs between those three cities and Halifax and Vancouver are relatively high.

A query is issued from Montreal that requires a join of tables A, B, and C. The query involves a single record from table A, 20 records from table B, and an undetermined number of records from table C.

Develop and justify a plan for solving this query.

MINICASES

1. Consider the Happy Cruise Lines relational database of Minicase 1 in Chapter 5. The company has decided to reconfigure this database as a distributed database among its major locations: New York, which is its headquarters, and its other major U.S. ports: Miami, Los Angeles, and Houston. Distributed and replicated among these four locations, the tables have the following characteristics:

SHIP consists of 20 records and is used in all four cities.

CRUISE consists of 4000 records. CRUISE records are used most heavily in the cities from which the cruise described in the record began.

PORT consists of 42 records. The records that describe Atlantic Ocean ports are used most heavily in New York and Miami. The records that describe Caribbean Sea ports are used most heavily in Houston and Miami. The records that describe Pacific Ocean ports are used most heavily in Los Angeles.


VISIT consists of 15,000 records and is used primarily in New York and Los Angeles.

PASSENGER consists of 230,000 records and is used primarily in New York and Los Angeles.

VOYAGE consists of 720,000 records and is used in all four cities.

Design a distributed relational database for Happy Cruise Lines. Justify your placement, replication, and partitioning of the tables.

2. Consider the Super Baseball League relational database of Minicase 2 in Chapter 5. The league has decided to organize its database as a distributed database with replicated tables. The nodes on the distributed database will be Chicago (the league’s headquarters), Atlanta, San Francisco (where the league personnel office is located), and Dallas. The tables have the following characteristics:

TEAM consists of 20 records and is located in Chicago and Atlanta.

COACH consists of 85 records and is located in San Francisco and Dallas.

WORKEXP consists of 20,000 records and is located in San Francisco and Dallas.

BATS consists of 800,000 records and is located in Chicago and Atlanta.

PLAYER consists of 100,000 records and is located in San Francisco and Atlanta.

AFFILIATION consists of 20,000 records and is located in Chicago and San Francisco.

STADIUM consists of 20 records and is located only in Chicago.

Assume that telecommunications costs among the cities are all about the same.

Develop and justify a plan for solving the following queries:

a. A query is issued from Chicago to get a list of all the work experience of all the coaches on the Dodgers.

b. A query is issued from Atlanta to get a list of the names of the coaches who work for the team based at Smith Memorial Stadium.

c. A query is issued from Dallas to find the names of all the players who have compiled a batting average of at least .300 while playing on the Dodgers.