Chapter 2 - Background
2.1 Introduction
This and the next chapter provide an overview of parallel relational database
systems. The concepts are introduced in this chapter and current research areas are
identified in the next chapter.
Exploiting the parallelism inherent in database applications by running them on suitable parallel platforms to enhance their performance has become of significant commercial interest. Initial parallel systems were purpose-built database machines, combining custom database software and hardware appropriate for carrying out relational-type operations in parallel [Dewitt, 90]. Since then, several different parallel machines have been built that support standard commercial database systems adapted to a parallel environment.
The first section in this chapter provides a brief review of parallel database
architectures and discusses their advantages and disadvantages. Section 2.3 describes the
GoldRush MegaServer parallel platform from ICL. This is the architecture which has
been used throughout this thesis. Data placement is introduced in section 2.4, which describes some of the main methods and the differences between them.
Section 2.5 describes the two database performance benchmarks that have been used in
this thesis. Some of the terminology concerning parallel database systems that will be
used throughout the rest of the thesis is described in section 2.6. The chapter is rounded
off with some summarising comments.
2.2 Review of parallel database architectures
This section presents an overview of parallel relational database systems.
Familiarity with conventional Relational DBMS (RDBMS) [Codd, 70, 94], [Elmasri, 99]
is assumed.
Traditional parallel relational database systems were usually classified as either
shared-everything, shared-disk or shared-nothing, although the distinctions between these
have become blurred. As with non-parallel architectures, the majority of parallel
architectures which host database systems comprise processors, main memory modules,
and secondary storage (disks) as their major types of components. The hardware vendors
make design choices about the way the various components are connected and this serves
as the basis for a classification of parallel database systems into the categories: shared
memory, shared disk, shared nothing and hierarchical.
2.2.1 Shared memory systems
In this type of system the processors and disks have access to a common memory, typically via a bus or through an interconnection network. This makes communication between the processors extremely efficient: any processor can access the data in shared memory without it having to be moved by software. The architecture may not be scalable beyond 32 or 64 processors, since the bus or the interconnection network becomes too great a bottleneck. Such systems are widely used for lower degrees of parallelism (4 to 8).
Examples of research systems developed for the shared memory model include DBS3 [Bergsten, 91], [Zait, 96], and XPRS [Hong, 93], while examples of commercial systems include Sybase [Sybase, 04], Oracle [Oracle, 96], [Oracle, 96B], [Oracle, 04], DB2 [IBM, 04], Informix [Informix, 94], [IBM, 03] and SQL Server [Microsoft, 04].
In a shared memory system there is a single copy of the operating system running
and a single instance of the database engine. As every processor has access to all of the
available memory, communication is straightforward. Message exchange and data
sharing are through the shared memory. It is also relatively easy to implement
synchronisation by using low-level mechanisms (such as semaphores). Shared memory
systems also lend themselves to load balancing, usually taken care of by the operating
system. These systems perform best when running a large number of small tasks such as
multiple threads.
The shared bus allows a memory module to be accessed by any processor, but contention for it increases the memory-access latency. To counteract this, processors usually have fast private cache memories for speeding up access to shared memory [Lu, 94]. This does, however, mean that mechanisms are required to ensure cache coherency. An example of a cache coherency algorithm is the “snooping bus write invalidate” method [Shatdal, 96], where all of the cache controllers listen on the shared bus and a cached value is invalidated when a memory write on a cached address occurs; when a subsequent read occurs, the value is read back into the cache from memory. Cache coherency introduces additional overhead, which increases as the number of processors or memory modules increases.
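As an illustration, the following Python sketch simulates the write-invalidate idea; the Bus and Cache classes, their method names and the addresses are assumptions made purely for illustration, not part of any real coherency hardware.

# A minimal sketch of snooping write-invalidate, assuming one shared bus
# that every cache controller observes.
class Bus:
    def __init__(self):
        self.caches = []
    def attach(self, cache):
        self.caches.append(cache)
    def broadcast_write(self, writer, addr):
        # every other cache snoops the write and invalidates its copy
        for c in self.caches:
            if c is not writer:
                c.snoop(addr)

class Cache:
    def __init__(self, bus):
        self.lines = {}            # address -> cached value
        self.bus = bus
        bus.attach(self)
    def read(self, memory, addr):
        if addr not in self.lines: # miss: value is read in from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write(self, memory, addr, value):
        memory[addr] = value
        self.lines[addr] = value
        self.bus.broadcast_write(self, addr)
    def snoop(self, addr):
        self.lines.pop(addr, None) # drop the now-stale line

memory = {100: 1}
bus = Bus()
a, b = Cache(bus), Cache(bus)
print(b.read(memory, 100))   # 1, now cached in b
a.write(memory, 100, 2)      # b's cached copy is invalidated
print(b.read(memory, 100))   # 2, re-read from memory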
2.2.2 Shared disk systems
In this type of system, all of the processors can directly access all of the disks via
an interconnection network, but each processor has its own main memory. This
architecture provides a degree of fault-tolerance, in that, if a processor fails, the other
processors can take over its tasks since the database is resident on the disks and these are
accessible from all processors. The downside is that the bottleneck now occurs at the
interconnection to the disk subsystem. Shared disk systems can scale to a large number of
processors, but communication between processors is slower.
Examples of commercial shared disk products include Oracle Parallel Server [Oracle,
96B], [Oracle, 04], DB2 [IBM, 04] and DB2 for IBM System/390 [White, 99].
Figure 2.1 Parallel database architectures
Shared disk systems can have their disk controllers and input/output controllers
connected to the interconnecting network rather than directly to nodes via their bus. On
the other hand some vendors implement a single global file store for their systems, which
provides a virtual shared disk architecture. Although certain disks in the system may be
directly connected to certain nodes, the DBMS has no concept of a node ‘owning’ a
certain set of disks. Rather, all disks and data are accessible by all nodes, an example
being Oracle [Oracle, 96B], [Oracle, 04].
Global locking and protocols are required to avoid conflicting accesses to the
same pages and to ensure cache coherency [Valduriez, 93]. To ensure the consistency of data, a distributed lock manager is employed, which can generate heavy lock-protocol message traffic on the interconnection network.
2.2.3 Shared nothing systems
In this type of system, a node consists of a processor, memory, and a disk system.
A processor at one node communicates with another processor at another node using
some form of interconnection network. A node functions as the server for the data on the disk or disks it owns. Accesses to local disks (and local memory) do not pass through the interconnection network, thereby minimizing the interference of resource sharing, so shared nothing multiprocessors can be scaled up to thousands of processors. The main drawback is the cost of communication and of non-local disk access, as sending data involves software interaction at both ends.
Examples of research systems developed for the shared nothing model are Bubba [Boral, 88], [Boral, 90] and EDS [Watson, 91], while commercial products include Informix XPS [Informix, 98], [IBM, 03], Sybase MPP [Knoop, 99], [Sybase, 04C], DB2 Parallel Edition [Baru, 95], [Infinio, 03] and NonStop SQL/MP [HP, 04].
The database is partitioned and stored on the disks of the different nodes. Each
node is granted exclusive control over a part of the database. A node has its own set of
local disks where its data is stored. It is not possible for a processor to directly access the
disks and memory of another. Communication between processors is achieved by sending
messages through the interconnection network.
In more recent architectures, each node itself may be a separate system. Such
architectures are made up of a combination of shared memory at the node level and
distributed memory at the global system level. In addition, the disk controllers of a
particular node may also be connected to the bus of a second node in order to allow fault-
tolerance. Thus the term ‘shared nothing’, originally referring to architectures where a
disk connects to a single node with a single processor, is sometimes also used to describe
hybrid systems with some limited form of memory and disk sharing [Delis, 98].
When more nodes are added to the system, the aggregate bandwidth requirement scales up in proportion to the number of processors added, although most current interconnection networks are designed to satisfy this requirement. Thus the major
advantage of these systems is that they can scale up to handle hundreds of nodes.
It is difficult to co-ordinate large numbers of nodes. This difficulty manifests
itself in load balancing. Ideally the workload should be spread equally among the
participating nodes. However, when data skew [Hua, 91] is present, a parallelised
operation such as join may not bring good returns unless load-balancing issues are
considered explicitly as part of the design of the join algorithm.
2.2.4 Hierarchical
A hierarchical or hybrid architecture combines characteristics of shared-memory,
shared-disk, and shared-nothing architectures. The top level is a shared-nothing
architecture, where the nodes are connected by an interconnection network, and do not
share disks or memory with each other.
Each node of the system could be a shared-memory system with a few processors; alternatively, each node could be a shared-disk system in which each of the subsystems sharing a set of disks is itself a shared-memory system. Commercial hybrid NUMA (non-uniform memory
access) systems have been developed by vendors such as Data General, IBM, ICL, NCR,
among others [Garth, 96], [Rudin, 96]. Another example of a commercial hybrid system
is the Compaq TruCluster running Oracle [Gartner, 02].
2.2.5 Comparative analysis
A study of the benefits of the different architectures [Valduriez, 93] points to
shared disk architectures having similar advantages and disadvantages to shared nothing
systems. However, shared disk architectures are seen to provide a better opportunity for
load balancing and ease of migration. Moving from a centralised to a shared disk system is relatively straightforward as the data on disk does not need to be reorganised. Access to the shared disk subsystem is highlighted as a potential bottleneck; this is not an issue with shared nothing systems, as the nodes do not have to co-ordinate data sharing, each one being responsible for its own data. Shared disk systems are considered more
flexible in the sense that any node may be selected to access any piece of data, while in
shared nothing systems data can be accessed only through the node that owns it. The
shared disk ensures that a failed node will not affect the availability of the system,
although part of the data may not be accessible.
Shared memory’s main advantages are reported as the simplicity and ease of load
balancing, while limited extensibility and low availability are its disadvantages. Shared
memory architectures may suffer from poor availability, in the sense that a memory fault
may affect most processors and bring the system down.
Extensibility and availability are the main virtues of shared nothing systems. On
the other hand, high complexity and poor load balancing are shared nothing’s main
disadvantages. The difficulty with load balancing is due to the amount and location of
processing being determined by the database partitioning. This can especially be a
problem when a new processor or disk is to be added: a decision must be made as to how
to redistribute the data among the processors and disks, which is a costly operation. The high complexity of a shared nothing system stems from the need to implement distributed database techniques, such as 2-phase commit, to ensure correct functioning of the system.
There is no single database architecture that is better than others in all respects.
The choice of an optimal platform ultimately depends on the specific user application and
workload, although the current trend is towards some form of hybrid system that tries to capture the best parts of the pure architectures.
The architectural differences between the types of systems are becoming increasingly blurred. In the hybrid NUMA systems mentioned above [Garth, 96], [Rudin, 96], a node has access to the memory of a different node,
thus a common global memory is established. Each processor accesses the memory over
a two level interconnecting mechanism. Access to local memory (memory residing on the
same board as the processors of the node) is through a high-speed local bus and has a low
access time. Access to remote memory is through a different system bus and has higher
access time. A NUMA system runs a single instance of the operating system, thus
(potentially) providing a shared memory programming model. The effectiveness of this
architecture depends on maintaining a high level of data locality in each node. Cray/SGI
Origin2000, the HP-Convex Exemplar and the Sequent NUMA-Q Series are examples of
this hybrid architecture.
Another architecture that was used for handling large database systems was
Symmetric Multiprocessing (SMP). SMP machines are characterised by a tightly-coupled set of identical processors, usually operating on a single shared bank of memory. In essence, a timesharing, multi-tasking operating system has more than one processor to choose from when scheduling programs to run. The architectural design consists of separate caches or a shared cache as well as shared main memory. A commercial example of an SMP machine is the Sun Enterprise 10000 Server [Sun, 99].
There is also an architecture combining both the SMP and NUMA architectures where
another level of cache is introduced. A commercial example of this hybrid architecture is
the Unisys ES7000 [Unisys, 04].
A currently popular architecture for large database systems is the cluster of clusters. A cluster is a group of separate computers connected together and used as a single computing entity to provide a service or run an application, for purposes of scalability, load balancing, and task distribution. Clusters exchange information via messages or memory addressing. Clusters are usually made up of SMP machines, and may also consist of a single SMP machine. A commercial example of a cluster machine is the IBM pSeries 690 server [IBM, 04B].
Another current trend in computer manufacture is toward slim, hot-swappable blade servers [Vasudeva, 04], which fit in a single chassis like books on a bookshelf. Each is an independent server with its own processors, memory, storage, network controllers, operating system and applications. The blade server simply slides into a bay in the chassis and plugs into a mid- or backplane, sharing power, fans, floppy drives, switches, and ports with other blade servers. The benefits of the blade approach include reduced cabling and, with switches and power units shared, the freeing up of precious space. Initial performance figures for these blade servers [Horwitz, 04] suggest that they may well dominate the market place of the future.
Parallel technology grew out of distributed technology and the current trend
appears to be moving back to distributed technology mainly in the area of Grid
technology [Foster, 99]. Many of the theories and tools developed for parallel database
systems can be utilised in the area of distributed systems.
2.3 Parallel platform
This section describes the GoldRush MegaServer parallel platform from ICL. At
the beginning of 1996 GoldRush/Oracle and GoldRush/Informix were the state-of-the-art
shared disk and shared nothing high performance database solutions offered by ICL. GoldRush is the architecture that was used in the Mercury Project [Mercury, 95].
2.3.1 ICL GoldRush MegaServer
The ICL GoldRush MegaServer uses parallel processing technology to support
very large databases in an open client/server environment. The architecture of the system
is presented in detail in [Watson, 95], [Watson, 95B], [Watson, 95C].
The basic GoldRush hardware architecture, shown in Figure 2-2, consists of a set
of up to 64 Processing Elements (PEs) and Communication Elements (CEs), and a single
Management Element (ME), connected together by a high performance network
(DeltaNet).
Figure 2.2 The GoldRush architecture
The PE is the basic building block of the hardware. It runs a UNIX operating
system and database back-end servers. The unit comprises a main PCB with a subsidiary
PCB and a twin HyperSPARC module mounted on it. The internal communication bus of
the PE is a 40 MHz Mbus. The main internal unit is a dual processing unit which consists
of a Processing Unit (PU) and a System Support Unit (SSU). Both of these units include
a 90MHz HyperSPARC chip set. The PU is the main processing engine and the SSU
provides support for message passing between elements. Up to 256 Mbytes of RAM store
is provided in the PE, primarily for use as database cache. Each PE also has two fast and
wide SCSI-2 connections, allowing up to 30 disks to be connected.
The CE is identical to a PE except that it provides external links for GoldRush to
clients via LANs. To do this it has one SCSI-2 interface and two Sbus connectors, into
which two standard FDDI Sbus modules are plugged. The CEs do not run database
servers, but are dedicated to relaying messages transparently between the PEs and FDDI
couplers. It is typical to have 1-2 CEs per 16 elements. The ME is a conventional mid-
range UNIX processor (an ICL DRS6000) which runs GoldRush management software.
The DeltaNet high speed network is the medium whereby all the elements of a
GoldRush system communicate with each other. The basic component of the DeltaNet is
a network switching element. This is a packet switching 8×8-crossbar, which
dynamically provides unidirectional channels from each of its 8 inputs to any of its 8
output ports. The network switching elements are connected in stages allowing up to 64
connections. Packets are 128 bytes long. If two or more packets arrive at the inputs of a
switch simultaneously, and each needs to be routed through the same output, contention
results. The switch has 4 buffers per input, so one of the packets is retained until the
output is free. Each channel data interface is capable of 20 Mbytes per second. The total
DeltaNet throughput is up to 1.2 Gbytes per second.
GoldRush relies entirely on message passing for inter-PE communication. This is
done efficiently through a lightweight communications protocol. It is available through
UNIX interfaces so that software can exploit it. For example, the Informix and Oracle
software use it for communicating when exploiting intra-query parallelism.
For database systems supporting a shared disk architecture, GoldRush provides a
Global Coherent Filestore (GCF). Even though physically each disk is attached to one
element only, each element creates its own portion of the GCF on its own local disks and
then cross-mounts the global file store from all other elements, thus making it accessible
to the database server running on it. The GCF is implemented in the OS kernel and uses
the lightweight communications protocol.
A Distributed Lock Manager (DLM) is implemented to provide
coherency/concurrency control for database servers running within a shared disk system.
It is distributed across all PEs, with an instance running on each one. Each global lock is
managed by one instance of the DLM and all requests for it are sent to that instance by
the PE generating the request. All communication with the DLM is by the lightweight
communications protocol.
GoldRush can be configured as either a shared disk or a shared nothing system. In
a shared disk configuration each PE runs its own instance of the database. The tables are
placed across disks attached to a number of PEs, with the GCF allowing shared access.
Each PE maintains a local database cache in memory and the DLM is used to ensure
coherency. In a shared nothing configuration, again, each PE runs a database server and
the tables are partitioned across the PEs. Each PE only accesses the table fragments
stored locally. Queries are decomposed into ‘fragments’ and directed to PEs which own
relevant data. There is no need for the services of the DLM, as no global coherency needs
to be maintained.
2.4 Data placement
The way in which data is distributed across the processing elements of a parallel
shared-nothing architecture can have a significant effect on the performance of a parallel
DBMS. Data placement strategies provide a mechanical approach to determining a data
distribution which will provide good performance. However, there is considerable
variation in the results produced by different strategies and no simple way of determining
which strategy will provide the best results for any particular database application. A
poor data distribution can lead to higher loads and/or load imbalance and hence higher
cost when processing transactions. This section considers some of the main types of data
placement strategies and describes some of the problems associated with the placement of
data, with particular attention to placing data on shared nothing hardware, such as the
GoldRush MegaServer.
Deciding how to distribute the data is a complex issue. A given data distribution
may be ideal for one type of query but may produce a load imbalance with consequential
reduction in performance for another. The process of deciding how to distribute data
among the different nodes in the system is known as data placement. This involves
breaking up each relation into a number of fragments and allocating each fragment to a
node of the system (or to a particular disk of the node). Various strategies for data
placement have been developed which attempt to achieve improved performance in
different ways. Some focus on the complexity of operations such as joins, others are
based on the accesses made to each fragment of a relation. However, there is no obvious choice among the different approaches that would provide the best results in all cases.
The purpose of this section is to provide an understanding of some of the different data
placement strategies and their relative merits.
Various data placement strategies have been developed by researchers to provide
a mechanical approach to the process which will produce data distributions capable of
realising much of the performance potential of shared-nothing database systems. Since the problem is NP-complete [Sacca, 83], there is no guarantee that even a feasible solution, let alone an optimal one, can be found in a reasonable amount of time.
Generally, a static data placement strategy can be divided into three phases. They
are:
• Declustering phase in which relations are partitioned into a number of fragments
which will be allocated to the disks of a parallel machine in the placement phase;
• Placement phase in which the fragments obtained from the declustering phase are
allocated to the disks of a parallel machine;
• Re-distribution phase in which data is re-distributed to restore good system
performance after the load has become unbalanced and overall system
performance degraded.
Dynamic data placement is concerned with dynamically re-organising relations
during execution in order to increase the degree of parallelism, reduce response time and
improve throughput. It usually takes the form of generating temporary relations (i.e.
intermediate relations). The placement of base relations resulting from the initial data
placement scheme is not changed.
There is no simple way of determining which data placement strategy would
provide the best results for any particular database.
2.4.1 Overview of data placement strategies
As previously stated, a static data placement strategy can generally be divided into the three phases mentioned above. These phases are not completely independent, as the approach chosen for placement sometimes limits the choice of the declustering and re-distribution methods.
2.4.1.1 Declustering and Re-distribution
The declustering phase of data placement is concerned with the problem of
partitioning an entire data space into subspaces, which may be overlapping or non-
overlapping. In the case of relational databases, this partitioning can be either horizontal
or vertical. A fragment in a horizontal partitioning consists of a subset of tuples from a
relation, while a fragment in a vertical partitioning consists of a projection of a relation
over a set of attributes. These two types of partitioning can also be combined, which
results in a mixed approach. A discussion on the complexity of operations over a mixed-
partitioned relation is given in [Thanos, 91].
For horizontal partitioning, there are three fundamental types: range, hash, and the self-explanatory round-robin (card dealing) approach.
The range declustering method partitions a relation into fragments by dividing the
domains of some attributes into sections and then allots the tuples to the fragment whose
range value fits the attribute values of the tuple. The hash declustering method partitions
a relation into fragments according to the hash value obtained by applying a hash
function to the chosen attributes. These can be performed on both single and multiple
attributes. Other partitioning methods include using bit-wise operators [Kim, 88] and
space filling curves [Faloutsos, 93]. However, since hash-join algorithms have been
widely adopted in parallel database systems, the hash method is the most popular as it fits
nicely with the requirements of a hash-join algorithm.
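As a concrete illustration of the three schemes, the following Python sketch partitions tuples on a single integer attribute; the function names, the tuple representation and the range boundaries are assumptions made purely for illustration.

# Minimal sketches of range, hash and round-robin declustering over tuples
# represented as dictionaries.
def range_partition(tuples, attr, boundaries):
    # fragment i receives tuples whose attribute falls in the i-th range
    fragments = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        i = sum(1 for b in boundaries if t[attr] >= b)
        fragments[i].append(t)
    return fragments

def hash_partition(tuples, attr, n):
    # fragment chosen by hashing the chosen attribute
    fragments = [[] for _ in range(n)]
    for t in tuples:
        fragments[hash(t[attr]) % n].append(t)
    return fragments

def round_robin_partition(tuples, n):
    # tuples are dealt out to fragments like cards
    fragments = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        fragments[i % n].append(t)
    return fragments

rows = [{"key": k} for k in range(10)]
print(range_partition(rows, "key", [3, 7]))  # ranges [0,3), [3,7), [7,...)
print(hash_partition(rows, "key", 4))        # placement follows the hash
print(round_robin_partition(rows, 4))        # even sizes, no value locality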
As insertions and deletions are performed on the local data associated with each
PE, the size of the local partition of data may change so that gradually a non-uniform
distribution of data will appear. This causes an unbalanced load on PEs which in turn
degrades overall system performance. In this situation, a re-distribution of data is
necessary to restore good system performance. Although the re-distribution of data can
always be performed with the original declustering and placement strategies, it may
require the movement of large amounts of data in complex ways resulting in a cost which
is even higher than that of processing data under the skewed circumstances. While this
data movement could be done in the background, throughput would be severely degraded
in a heavily-loaded system. In such cases re-distribution should be infrequent and should
involve as little data movement as possible.
Various placement strategies have been developed for parallel database systems.
These can be classified into three categories according to the criteria used in reducing
costs incurred on resources such as network bandwidth, CPUs and disks, namely:
• size based [Hua, 90], [Hua, 92]
• access frequency based (although some methods can be considered as combined
strategies [Copeland, 88])
• network traffic based [Apers, 88]
The main idea behind these approaches is to achieve the minimal load (e.g.
network traffic) or a balance of load (e.g. size, I/O access) across the system.
2.4.1.2 Size based strategies
Size based strategies were developed for systems utilising intra-operator
parallelism, where the CPU load of each PE depends on the amount of data being
processed. Since the declustering methods are built around the ideas of hashing and
sorting of relations, relation operations applied to the partitioning attribute can be
performed very efficiently. However, queries that involve non-partitioning attributes are
naturally inefficient.
Hua et al [Hua, 90] proposed a size based data placement scheme which balances
the loads of nodes during query processing. Each relation is partitioned into a large
number of cells using a grid file structure. These cells are then assigned to the PEs using
a greedy algorithm so as to balance the data load for each element. This algorithm is
executed by putting all cells into a list, ordered by their size, with the largest cell first,
and then allocating the first cell in the list to the PE which currently has the largest disk
space available. The cell is then removed from the list. This process is repeated until the
cell list becomes empty. This initial placement scheme was also used in [Hua, 92], where
the concept of a cell was generalised to any logical unit of data partitioning and
movement, i.e., all records belonging to the same cell are stored on the same PE.
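A minimal Python sketch of this greedy assignment is given below; the representation of cells as a name-to-size mapping and the per-PE capacity model are assumptions for illustration, not details from [Hua, 90].

# Greedy size based placement: largest cell first, each cell going to the
# PE with the most disk space currently available.
def greedy_size_placement(cell_sizes, n_pes, pe_capacity):
    free = [pe_capacity] * n_pes               # remaining space per PE
    assignment = {}                            # cell -> PE index
    for cell, size in sorted(cell_sizes.items(), key=lambda kv: -kv[1]):
        pe = max(range(n_pes), key=lambda p: free[p])
        assignment[cell] = pe
        free[pe] -= size
    return assignment

cells = {"c1": 40, "c2": 35, "c3": 20, "c4": 20, "c5": 5}
print(greedy_size_placement(cells, n_pes=2, pe_capacity=100))
# {'c1': 0, 'c2': 1, 'c3': 1, 'c4': 0, 'c5': 1}: data loads of 60 and 60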
2.4.1.3 Access frequency based strategies
Access frequency based strategies aim to balance the load of disks across PEs
since disks are most likely to be the bottlenecks or hotspots of the system. Sometimes, the
relation access frequency is taken into account together with the relation size due to the
fact that some parts of the relations may be RAM resident and do not require disk access.
This has resulted in some size and access frequency based strategies.
The Bubba [Copeland, 88] method defines frequency of access to a relation as the
heat of the relation and the quotient of a relation's heat divided by its size as the
temperature of the relation. Heat is used in determining the number of nodes across
which to spread the relations. Temperature is used in determining whether a relation
should be RAM resident. The strategy then assigns fragments of relations to nodes taking
account of each node's temperature and heat. It puts relations in a list ordered by their
temperatures and assigns the fragments of the relations in the list to PEs using a greedy
algorithm to ensure that the heat of each PE is roughly equivalent to the others. The heat
of each relation can be estimated for initial placement and tracked for re-distribution.
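The following Python sketch captures this heat-balancing step; the fragment representation and the tie-breaking details are illustrative assumptions rather than the actual Bubba implementation.

# Bubba-style placement: order fragments by temperature (heat / size),
# then greedily assign each to the PE with the least accumulated heat.
def bubba_placement(fragments, n_pes):
    # fragments: list of (name, size, heat) tuples
    ordered = sorted(fragments, key=lambda f: -(f[2] / f[1]))
    pe_heat = [0.0] * n_pes
    assignment = {}
    for name, size, heat in ordered:
        pe = min(range(n_pes), key=lambda p: pe_heat[p])  # coolest PE
        assignment[name] = pe
        pe_heat[pe] += heat
    return assignment, pe_heat

frags = [("r1.f1", 100, 50), ("r1.f2", 100, 40),
         ("r2.f1", 10, 30), ("r2.f2", 50, 20)]
assignment, heats = bubba_placement(frags, n_pes=2)
print(assignment, heats)   # the heat of each PE ends up roughly equal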
There are several disk allocation algorithms developed for multi-disk systems
[Scheuermann, 94], which can be classified as access frequency based. The disk cooling
method [Scheuermann, 94] uses a greedy procedure which involves both placement and
re-distribution phases. It tries to achieve an I/O load balance while minimizing the
amount of data that is moved across disks. As the disk cooling procedure was implemented as a background daemon, invoked at fixed intervals of time to restore the balance of I/O load across the different disks, the heat of each data block can be tracked dynamically, based on a moving average of the inter-arrival time of requests to the same block. Alternatively, heat can be tracked by keeping a count of the number of requests to a block within a fixed timeframe. The problem with these heat tracking methods is that, for application sizing, it would be very difficult to track block heat dynamically for initial data placement, as there is no system yet running on which to carry out the tracking exercise.
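A sketch of the moving-average style of heat tracking is shown below; the exponential smoothing constant and the class structure are assumptions for illustration, not taken from [Scheuermann, 94].

# Heat of a block approximated as the reciprocal of a moving average of
# the inter-arrival time of requests to that block.
class BlockHeat:
    def __init__(self, alpha=0.2):
        self.alpha = alpha         # weight given to the newest gap
        self.last_arrival = None
        self.avg_gap = None
    def record_request(self, now):
        if self.last_arrival is not None:
            gap = now - self.last_arrival
            if self.avg_gap is None:
                self.avg_gap = gap
            else:                  # exponential moving average of the gap
                self.avg_gap = self.alpha * gap + (1 - self.alpha) * self.avg_gap
        self.last_arrival = now
    def heat(self):
        return 0.0 if not self.avg_gap else 1.0 / self.avg_gap

b = BlockHeat()
for t in [0.0, 1.0, 2.0, 2.1, 2.2, 2.3]:   # requests arriving faster
    b.record_request(t)
print(round(b.heat(), 2))                   # heat rises as the gaps shrink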
2.4.1.4 Network traffic based strategies
Network traffic based strategies aim to minimise the delay incurred by network
communication in transferring data between two nodes. They were originally developed
in a distributed database environment where the PEs are remote from each other and
running independently but connected together by a network.
Apers [Apers, 88] presented an algorithm based on a heuristic approach coupled
with a greedy algorithm. In this algorithm, a network is constructed with the fragments of
a relation as its nodes. Each edge represents the cost of data transmission between two
nodes. Each pair of nodes is then evaluated on the basis of the transmission cost of the
allocation. The pair of nodes with the highest cost is selected and united. This process is
repeated until the network is unified with the actual network of machines. This algorithm
was later extended by Sacca and Wiederhold [Sacca, 83] for shared-nothing parallel
database systems, in which a first-fit bin-packing approach was used to allocate selected
fragments.
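The following Python sketch conveys the greedy pair-uniting step; the graph representation and cost function are illustrative assumptions, and the first-fit bin-packing extension of [Sacca, 83] is not shown.

# Greedy grouping of fragments: repeatedly unite the pair of groups with
# the highest mutual transmission cost until one group remains per machine.
def apers_grouping(fragments, cost, n_machines):
    # cost maps frozenset({f1, f2}) -> transmission cost between fragments
    groups = [{f} for f in fragments]
    def group_cost(g1, g2):
        return sum(cost.get(frozenset({a, b}), 0) for a in g1 for b in g2)
    while len(groups) > n_machines:
        i, j = max(((i, j) for i in range(len(groups))
                    for j in range(i + 1, len(groups))),
                   key=lambda ij: group_cost(groups[ij[0]], groups[ij[1]]))
        groups[i] |= groups.pop(j)
    return groups

costs = {frozenset({"A", "B"}): 10, frozenset({"B", "C"}): 7,
         frozenset({"A", "C"}): 1, frozenset({"C", "D"}): 5}
print(apers_grouping(["A", "B", "C", "D"], costs, n_machines=2))
# heavily-communicating fragments end up co-located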
Ibiza-Espiga and Williams [Ibiza, 92] developed a variant of Apers' placement
method. It pre-assigns the clusters of relations (each cluster contains relations which need
to be co-located due to hash join operations) to groups of PEs, taking into account the
available space of the PEs, so that each cluster is mapped to a set of PEs. The idea is to
find an appropriate balance among the work load of the PEs and an equitable distribution
of the space. Then the units within each cluster are assigned to the set of PEs associated
with the cluster using Apers' method.
In [Williams, 98] a study of a number of placement strategies was carried out on
the original version of STEADY. To keep the study manageable it was only carried out at
the PE level. The distribution of relation fragments amongst multiple disks attached to
each PE was assumed to be round-robin. One of the unsurprising outcomes of the study
was that the way in which data is distributed across the PEs of a parallel shared-nothing
architecture can have a significant effect on the performance of a parallel DBMS. Data
placement strategies provide a mechanical approach to distributing data amongst the PEs
although there is considerable variation in the results produced by different strategies. In
the study, five data placement strategies were considered in detail. These were
representative of the three general categories, size based, access frequency based and
network traffic based. The five different strategies were applied to the transaction
processing benchmark TPC-C [Gray, 93] where the number of processing elements was
varied, and the database size was varied. One conclusion of the study was that the access
frequency based and size based strategies used, provided comparable performance for
28
TPC-C. The traditional network traffic based strategy provided almost identical
performance to that of the size based strategies, which was not a surprise as it utilises size
as a measure for network traffic. In middle to heavy weight read-write oriented
applications, the disks generally became the system bottlenecks. The cost difference
between a remote disk access and a local disk access was mainly related to network
communication. Hence improvements made to other factors such as reduction of network
traffic and CPU cost did not enhance the performance significantly. Another network
traffic based strategy, the distributed-database strategy, scaled up well when the number
of units in relation clusters was larger than the number of processing elements
participating in database activities. The scalability of this strategy was sensitive to the relationship between the degree of fragmentation of relations and the number of processing elements. As the number of PEs was increased or the database size
decreased, the performance obtained from data distributions generated using the size
based and access frequency based strategies generally scaled up. However, there were
peaks and drops at particular numbers of PEs, depending on the maximum number of
most heavyweight fragments assigned to a PE.
Although this study was carried out on placement at the PE level, data placement
at the disk level is also important to system performance, as an individual disk on a disk
stack attached to a PE could become a bottleneck due to poor placement of data at this
level. The design of multi-disk structures provides room for parallel I/O accesses. This
complicates still further the process of determining the best data distribution for a
particular database application, and reinforces the need for tools to experiment with data
placement strategies.
2.5 Database performance benchmarks
Database performance benchmarks [Gray, 93] are usually classified into two
types. Transaction processing benchmarks are designed to test OLTP (on-line transaction
processing) operations, typical of applications whose database operations are pre-defined
and grouped in transactions. Decision support benchmarks are designed to test ad hoc
querying on specially constructed databases. The two main benchmarks used in the
calibration and validation of STEADY were AS3AP and TPC-C.
AS3AP is a scalable, portable ANSI SQL relational database benchmark. It provides a comprehensive set of tests of database processing power; it has built-in scalability and portability, so that it can test a broad range of systems; it minimizes human effort in implementing and running benchmark tests; and it provides a uniform metric for a straightforward interpretation of benchmark results.
TPC-C is an online transaction processing (OLTP) benchmark. It involves a mix of five concurrent transactions of different types. The database comprises nine types of tables having a wide range of record and population sizes. Performance on this benchmark is measured in transactions per minute.
2.5.1 AS3AP benchmark
The AS3AP benchmark was designed to:
• provide a comprehensive but tractable set of tests for database processing
power.
• have built in scalability and portability, so that it can be used to test a
broad range of systems.
• minimize human effort in implementing and running the benchmark tests.
• provide a uniform metric, the equivalent database ratio, for a
straightforward and non-ambiguous interpretation of the benchmark
results.
For a particular DBMS, the AS3AP benchmark determines an equivalent database
size, which is the maximum size of the AS3AP database for which the system is able to
perform the designated AS3AP set of tests in under 12 hours. The equivalent database
size is an absolute performance metric by itself. It can also provide a basis for comparing
cost and performance of systems, as follows: the cost per megabyte of a DBMS is the
total cost of the DBMS divided by the equivalent database size. The equivalent database
ratio for two systems is the ratio of their equivalent database sizes. Both the cost per
megabyte and the equivalent database ratio provide global comparison metrics.
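A small worked example of these metrics follows; the costs and equivalent database sizes are invented purely for illustration.

# Two hypothetical systems characterised by total cost and equivalent
# database size (in megabytes).
cost_a, size_a = 200_000.0, 4_000.0
cost_b, size_b = 150_000.0, 2_000.0

cost_per_mb_a = cost_a / size_a        # 50.0: cost per megabyte of system A
cost_per_mb_b = cost_b / size_b        # 75.0: cost per megabyte of system B
equiv_db_ratio = size_a / size_b       # 2.0: A handles twice the database

print(cost_per_mb_a, cost_per_mb_b, equiv_db_ratio)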
The AS3AP database contains five relations. One, the tiny relation, is a one tuple,
one column relation, used only to measure overhead. The other four relations are named
as follows:
• uniques. A relation where all attributes have unique values.
• hundred. A relation where most of the attributes have exactly 100 unique
values, and are correlated. This relation provides absolute selectivities of
100, and projections producing exactly 100 multi-attribute tuples.
• tenpct. A relation where most of the attributes have 10% unique values.
This relation provides relative selectivities of 10%.
• updates. A relation customized for updates. Different distributions are
used and three types of indices are built on this relation.
The four relations have the same ten attributes (columns) with the same names
and types. The attributes cover the range of commonly used data types: signed and
unsigned integer, floating point, exact decimal, alphanumeric, fixed and variable length
strings, and an 8 character date type. As an example, the uniques relation consists of:
key      INTEGER(4)
int      INTEGER(4)
signed   INTEGER(4)
float    REAL(4)
double   DOUBLE(8)
decim    NUMERIC(18,2)
date     DATETIME(8)
code     CHAR(10)
name     CHAR(20)
address  VARCHAR(20)
2.5.2 TPC-C benchmark
Approved in July of 1992, the TPC Benchmark C is an on-line transaction
processing (OLTP) benchmark. TPC-C is more complex than previous OLTP
benchmarks such as TPC-A because of its multiple transaction types, more complex
database and overall execution structure.
TPC-C simulates a complete computing environment where a population of users
executes transactions against a database. The benchmark is centred on the principal
activities (transactions) of an order-entry environment. These transactions include
entering and delivering orders, recording payments, checking the status of orders, and
monitoring the level of stock at the warehouses. While the benchmark portrays the
activity of a wholesale supplier, TPC-C is not limited to the activity of any particular
business segment, but represents any industry that must manage, sell, or distribute a
product or service.
In the business model, a wholesale parts supplier operates out of a number of
warehouses and their associated sales districts. The TPC benchmark is designed to scale
just as the Company expands and new warehouses are created. However, certain
consistent requirements must be maintained as the benchmark is scaled. Each warehouse
in the TPC-C model must supply ten sales districts, and each district serves three
thousand customers. At any time an operator from a sales district can select any one of
the five operations or transactions offered by the Company's order-entry system.
The most frequent transaction consists of entering a new order which, on average,
consists of ten different items. Each warehouse tries to maintain stock for the 100,000
items in the Company's catalogue and fill orders from that stock. However, in reality, one
warehouse will probably not have all the parts required to fill every order. Therefore,
TPC-C requires that close to ten percent of all orders must be supplied by another
warehouse of the Company. Another frequent transaction consists of recording a payment
received from a customer. Less frequently, operators will request the status of a
previously placed order, process a batch of ten orders for delivery, or query the system
for potential supply shortages by examining the level of stock at the local warehouse. A
total of five types of transactions are used to model this business activity. The
performance metric reported by TPC-C measures the number of orders that can be fully
processed per minute and is expressed in tpm-C.
Each warehouse has ten terminals and all five transactions are available at each
terminal. A remote terminal emulator (RTE) is used to maintain the required mix of
transactions over the performance measurement period. This mix represents the complete
business processing of an order as it is entered, paid for, checked, and delivered. More
specifically, the required mix is defined to produce an equal number of New-Order and
Payment transactions and to produce one Delivery transaction, one Order-Status
transaction, and one Stock-Level transaction for every ten New-Order transactions. The
tpm-C metric is the number of New-Order transactions executed per minute.
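The proportions implied by this mix can be derived as follows; the percentages below are computed from the ratios stated above rather than quoted from the TPC-C specification.

# Transaction counts per ten New-Order transactions, as described above.
mix = {
    "New-Order": 10,
    "Payment": 10,        # equal in number to New-Order
    "Delivery": 1,
    "Order-Status": 1,
    "Stock-Level": 1,
}
total = sum(mix.values())              # 23 transactions in all
for name, count in mix.items():
    print(f"{name}: {100 * count / total:.1f}%")
# New-Order and Payment each about 43.5%; the other three about 4.3% each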
The nine relations involved are:
• Warehouse
• District
• Item
• Stock
• Customer
• Order
• New-Order
• Order-Line
• History
The tables cover a range of cardinalities, from a few rows up to millions of rows.
Relationships between table sizes must be maintained according to the scalability requirements of the benchmark, which are governed by the number of warehouses.
2.6 Parallel database system terminology
This section introduces some issues and terminology pertaining to parallel
database systems that will be used throughout the rest of the thesis.
In 1970, Codd [Codd, 70] introduced the relational database model, which has
since seen widespread use in database machines and software systems. A relation can be
thought of as a table of data with rows called tuples and columns called attributes.
Attribute values are members of sets of values called domains. The set of attributes over
which all tuples of a relation are uniquely defined is called the primary key.
One of the industry standard interfaces for a relational database management
system (RDBMS) is the Structured Query Language (SQL), a non-procedural language
designed specifically for relational databases.
2.6.1 Query operations
The selection operation is the most basic relational database query operation:
given a relation, it retrieves those tuples with attributes that match certain conditions. For
example, a selection could retrieve all of those tuples from a relation where one of its
attributes is equal to a given value, e.g., in SQL:
SELECT * FROM Relation WHERE Attribute = Value
Selection operations are critical to database machine performance, as most
relational database operations involve one or more selections. Selection implementation
methods include full relation search (usually in parallel when in hardware), indexing
methods, and hashing. It is important to note that the method a database machine uses to
implement selection operations dramatically affects the implementation and performance
of other operations (such as join).
The projection operation is similar to a selection but only retrieves a subset of
attributes from matching tuples. Because the subset of attributes may not be unique over
all tuples, it is sometimes necessary to remove duplicate results.
The join operation combines related tuples from two relations into single tuples,
on a specific set of attributes. The join operation represents a large part of the usefulness
of the relational model, and is consequently very important.
2.6.2 Complex query operations
Relational queries are often complicated and may involve multiple selections,
joins and projections. The series of operations involved is known as a query plan, and is
usually expressed in tree form and optimized by a relational database management
system prior to execution.
2.6.3 Parallel query execution
Within a relational database engine, a query is represented as an operator tree,
which is derived from the query execution plan. Examples of operators are scan (part of
select operation), build and probe (parts of a join operation implemented using a hash-
join algorithm), and merge (part of sort-merge-based join operation). Each operator
works on one or more streams of tuples and produces a new stream. So, operators can be
arranged in parallel dataflow graphs.
With dataflow graphs of this type, there are two main types of parallelism, namely
inter-query parallelism and intra-query parallelism. In inter-query parallelism, multiple
server processes are used simultaneously to work on the operator trees of different
queries. Intra-query parallelism has three types:
• Independent parallelism - if neither of two operators use data produced by the
other, both can run in parallel on separate processors.
• Inter-operator or pipeline parallelism - refers to the running of a number of
operators in a pipeline, with the output stream of tuples from one operator
forming the input of the next one of the pipeline. Operators such as aggregate
(sum, count, etc.) and sort cannot be pipelined as these operators do not emit
their first output until they have consumed the whole of their input.
• Intra-operator or partitioned parallelism - several processes are assigned to
work together on a single operator of the tree. This partitioned execution can
be achieved either by designing and implementing parallel algorithms for each
operator, or by parallelising the data instead [Dewitt, 90], which is simpler,
since the use of existing routines for executing operators is preserved, and is
achieved by inserting merge and split operators in suitable places within the
dataflow graph.
The merge operator combines several data streams into a single sequential stream
to be fed into the next operator. In this way, an operator working on a stream of tuples
can be executed as a number of sub-processes executing on independent processors, each
working on a subset of the tuples from the stream. The output of each operator is fed into
a merge operator, where the streams are combined and passed on.
The split operator does the opposite: it partitions or replicates a single data stream
into several independent streams, so that multiple processes may operate on the stream in
parallel. The partition of the data can be round robin or hash, or can be performed by an
arbitrary program [Graefe, 90], [Graefe, 93].
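A minimal sketch of split and merge over tuple streams is given below, using Python generators to stand in for dataflow operators; the hash-based split and the simple filtering operator are illustrative assumptions.

# split partitions one stream into several; merge combines several streams
# into one sequential stream, as described above.
import itertools

def split(stream, n, key):
    buckets = [[] for _ in range(n)]
    for t in stream:                   # hash partitioning of the stream
        buckets[hash(t[key]) % n].append(t)
    return [iter(b) for b in buckets]

def merge(streams):
    return itertools.chain(*streams)

def my_filter(stream, predicate):
    # an unchanged sequential operator, applied per partition
    return (t for t in stream if predicate(t))

tuples = [{"key": k, "val": k * k} for k in range(8)]
parts = split(iter(tuples), n=3, key="key")
results = [my_filter(p, lambda t: t["val"] > 10) for p in parts]
print(list(merge(results)))            # the qualifying tuples, recombined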
2.6.4 Query optimisation and scheduling
Query optimisation refers to the process of selecting an optimal execution plan for
a query from a set of feasible plans, based on some objective such as minimising the
query response time. One part of this process is to determine a suitable order in which to
perform operations. For example, in a query consisting of a number of join operations, it
may be beneficial to perform highly selective joins first, thus eliminating a large number
of tuples early and reducing the size of intermediate relations. Another part of the
optimisation process is concerned with choosing what algorithms to employ for the
relational operations within the plan. A join may be performed using a hash-based, nested-loop based, or sort-merge based method, amongst others.
2.7 Summary
This chapter has reviewed the field of parallel relational database systems and
introduced some of the concepts and terminology which are used throughout the rest of
the thesis. The common hardware architectures underlying modern parallel database
systems were described and compared. In this thesis the parallel machine that is modelled is a GoldRush MegaServer, which has a shared nothing architecture. Although no single database architecture is better than the others in all respects, the shared nothing architecture is a reasonable choice for hosting parallel database management systems, particularly for evenly distributed, ‘delightful’, systems, as Stonebraker [Stonebraker, 86] states: ‘In scalable, tunable, nearly delightful data bases, shared nothing systems will have no apparent disadvantages compared to the other alternatives. Hence the shared nothing architecture adequately addresses the common case.’
The notions of data partitioning and placement were reviewed. Two standard
benchmarks for database systems that were used throughout the work in this thesis were
also described.