
Univac ® 1108 multiprocessor system

by D. C. STANGA

Univac Division, Sperry Rand Corp. Roseville, Minnesota

INTRODUCTION

Two prime objectives existed during initial conception of the 1108 Multiprocessor System. These were:

• Increased system performance, and
• Increased system availability.

Increased performance in a "central system" was achieved via multiple processing units, multiple input/output units, and multiple access paths to critical peripheral devices. To accommodate this increase in computer capability the main memory was increased in size to provide storage for a large number of resident programs.

Increased availability was achieved by providing multiple access paths to all system components, and by providing for removal of system components for offline testing and maintenance.

In order to accomplish the prime objectives, several self-imposed restraints were placed on the design considerations. These were: modifications to the 1108 Processor would be minimal, the two processors (1108 and 1108 II) would be functionally identical, and all peripheral subsystems and existing memory modules must operate in the new configuration. The self-imposed restraints were intended to minimize development time and ensure that the existing upward compatibility between the 1107 and 1108 Processors would be continued through the 1108 II Processor.

System configuration (hardware)

Figure 1 shows a maximum 1108 Multiprocessor System configuration, with the exception of the Peripheral Subsystems, which are essentially unlimited with respect to the rest of the system. A minimum configuration (unit processor) of the same system would consist of a processor and two memory modules (65K).

The 1108 Executive system views the minimum configuration as a subset of the more general multisystem configuration. To better understand the multiprocessor system, it is helpful to think of it as composed of many distinct modules or system components which are functionally, logically, and electrically independent. Any system component can be removed from the system, resulting only in a degradation of system performance, and not total system destruction. The removal of the system component is accomplished by a combination of software and hardware actions. (This will be discussed further under System Availability.)

The following system components can be interconnected into a variety of pre-specified system configurations.

1108 II processor

This processor is functionally identical to the 1108 Processor except that an additional mode of main memory addressing was introduced, and another instruction was added to the Unprivileged Set.

Additional mode of main memory addressing. - To the programmed overlapped addressing of the 1108 Processor was added the interleaved capability, with an extension of this addressing form to include 262K 36-bit main memory words. This interleaving is accomplished after the basing and indexing operation by decoding the upper two bits and the lower bit of the absolute memory address.

| Logical Bank 0-3 | Address Field | Even/Odd |

The upper two bits specify the logical bank and the least significant bit of the address field specifies the even or odd module within the logical bank. Each logical bank contains 65K of 36-bit words. The interleaving feature permits optimum simultaneous execution of a common area of procedures by two (or more) processors. The degree of interleave is a trade-off between simultaneous access capability and hardware functional capability during times of main memory hardware malfunction.
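A minimal C sketch of this decode, assuming the field positions implied by the 262K (2^18 word) address space and the 32K module size; the exact bit numbering is an assumption for illustration, not taken from the 1108 documentation.

#include <stdio.h>

/* Split an 18-bit absolute address into the fields described above:
   bits 17-16 = logical bank (0-3), bit 0 = even/odd module,
   remaining 15 bits = word within the 32K module (assumed layout). */
static void decode(unsigned addr)
{
    unsigned bank   = (addr >> 16) & 03;       /* upper two bits       */
    unsigned odd    = addr & 01;               /* even (0) or odd (1)  */
    unsigned in_mod = (addr >> 1) & 077777;    /* 15-bit word index    */

    printf("addr %06o -> bank %u, %s module, word %05o\n",
           addr, bank, odd ? "odd" : "even", in_mod);
}

int main(void)
{
    decode(0000000);    /* bank 0, even module                      */
    decode(0200001);    /* hypothetical example: bank 1, odd module */
    return 0;
}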

Addition of the test and set instruction to the unprivileged set. - The function of this instruction is to permit queued access to common procedure areas that cannot be executed simultaneously by two or more processors. This function is accomplished by testing the specified main memory cell and setting it during the same extended memory cycle.

If bit 2^30 of the specified memory cell is a 1, an interrupt to location 164 is taken; if it is a 0, the bit is set and the next instruction (NI) is taken.

Consequently, this common area of procedures is protected from simultaneous entrance by virtue of the fact that all references to it must test the cell associated with that area. If simultaneous access is attempted, an internal interrupt to the Executive System is effected. These restricted entrances into procedure areas will occur most frequently in the Executive area during assignment of, or changes in, priority of system tasks. It is beyond the scope of this paper to delve into discussions on re-entrant or pure procedure, and conversational mode compilers. They are, however, planned as standard software packages for 1108 Multiprocessor Systems.
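As an illustration of this protection, the following C sketch models test-and-set semantics with a C11 atomic flag. The helper names are hypothetical, and the Executive interrupt is reduced to a refused entry; on the 1108 the whole test, interrupt, and set occur in one extended memory cycle.

#include <stdatomic.h>
#include <stdio.h>

static atomic_flag common_area_lock = ATOMIC_FLAG_INIT;

/* Returns 1 if entry is granted, 0 if the area was already claimed
   (the hardware case would be an interrupt to the Executive).      */
static int enter_common_area(void)
{
    return !atomic_flag_test_and_set(&common_area_lock);
}

static void leave_common_area(void)
{
    atomic_flag_clear(&common_area_lock);
}

int main(void)
{
    if (enter_common_area()) {
        puts("entry granted");
        leave_common_area();
    }
    return 0;
}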

Main memory

Main memory is provided in 65K word increments, expandable from a 65K minimum to a 262K maximum configuration. The main memory cycle time is 750 nanoseconds. Eight distinct paths are provided for each Processor and IOC, thus enabling each of these components to address the maximum memory configuration. Main Memory is composed of eight separate 32K modules, thus eight simultaneous references could occur in a system configuration containing the maximum memory, processor, and IOC components.

Multiple module access unit (MMA)

This system component provides multiple access to individual memory modules. A pair of MMA's provides access to one logical bank of 65K words. A maximum of five paths to each module exists. Priority resolution between simultaneous memory requests is accomplished in this unit. The Processor-IOC-Memory interface is request-acknowledge logic. Therefore, at a time when queues develop at a module interface a Processor or IOC can be kept waiting while memory requests are serviced on a predetermined priority basis. This priority is so arranged that IOC's are serviced first due to the inability of most peripheral devices to "wait" for any extended period. Requests are serviced in the following manner. Priority is sequenced from left to right in each class with Class 1 having preemptive priority over Class 2.

Class 1: IOC0, IOC1
Class 2: P0, P1, P2
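A small C sketch of this style of fixed-priority arbitration; the requester order follows the priority classes above (the Class 2 membership is taken from the primary data flow figure), while the bitmask encoding and the scan loop are purely illustrative.

#include <stdio.h>

/* Enum order encodes the priority: IOCs (Class 1) ahead of processors (Class 2),
   left to right within each class. */
enum { IOC0, IOC1, P0, P1, P2, NREQ };

static const char *req_name[NREQ] = { "IOC0", "IOC1", "P0", "P1", "P2" };

/* Scan pending requests in priority order and return the one to acknowledge. */
static int grant(unsigned pending)          /* bit i set => requester i waiting */
{
    for (int i = 0; i < NREQ; i++)
        if (pending & (1u << i))
            return i;
    return -1;                              /* nothing pending */
}

int main(void)
{
    unsigned pending = (1u << P1) | (1u << IOC1);   /* P1 and IOC1 both waiting */
    int g = grant(pending);
    if (g >= 0)
        printf("acknowledge %s first\n", req_name[g]);   /* -> IOC1 */
    return 0;
}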

Input output controller

The main function of this unit in the system is to provide the capability for simultaneous compute and input/output data transfer operations. It also enhances the real-time capabilities of the system by providing high speed Externally Specified Index operations. Other primary functions include system channel expansion capability and data buffer chaining. The IOC is functionally similar to the input/output section of the 1108 II Processor. It does, however, exhibit the following advantages over the processor as a memory to peripheral subsystem data transfer device.

(1) Requires no interruption of processor computational capability to execute data transfers.
(2) Possesses data buffer chaining capability.
(3) Performs ESI data transfers to main memory utilizing only one main memory reference versus three main memory references for a processor executed ESI transfer.

The operations of the IOC are initiated via control paths from each of the three processors in the system. A channel from each processor is required as a control path for the issued commands and the resulting termination interrupts. The commands specify where function and data buffer control words are located in main memory; it is then up to the IOC to procure and issue functions as well as initiate, control, and monitor the resulting data transfers. Upon completion of an operation, a pre-specified processor is interrupted and notified that the channel has been terminated. The status associated with this termination interrupt is automatically stored at a pre-specified location in main memory by the IOC.

The control memory of the IOC is 256 words (expandable to 512 words). Its cycle time is 250 nanoseconds.

This high speed internal memory enables the IOC to hold up to 448 ESI buffer control words internally. During an ESI transfer the IOC need only reference this memory for its control words and therefore requires only one main memory reference at the time that data is to be transferred. In contrast the processor must reference main memory three times for the ESI buffer control word (as shown below):


IOC:       Read BCW (250 ns), Write BCW (250 ns), Store character in main memory (750 ns, overlapped with last transfer)

Processor: Read BCW (750 ns), Write BCW (750 ns), Store character in main memory (750 ns); no overlap possible

The maximum aggregate data transfer thruput of an IOC is 1.33 million words per second.
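A back-of-the-envelope comparison in C, using the cycle times quoted above; the overlap of the IOC's main-memory store with the previous transfer is ignored here, so the per-character IOC figure is pessimistic.

#include <stdio.h>

int main(void)
{
    double main_cycle = 750.0;   /* main memory cycle, nanoseconds        */
    double ctl_cycle  = 250.0;   /* IOC control memory cycle, nanoseconds */

    double ioc  = 2.0 * ctl_cycle + main_cycle;   /* read BCW, write BCW, store */
    double proc = 3.0 * main_cycle;               /* all three in main memory   */

    printf("IOC:       %.0f ns per character\n", ioc);    /* 1250 ns */
    printf("Processor: %.0f ns per character\n", proc);   /* 2250 ns */

    /* One main-memory reference per word gives the quoted aggregate rate. */
    printf("Max IOC throughput: %.2f million words/s\n", 1000.0 / main_cycle);  /* 1.33 */
    return 0;
}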

Multiple processor adapter

This functional unit provides multiple access capability to a peripheral subsystem. It honors requests for access on a first come-first served basis. The unit has two basic modes of operation, one for Internally Specified Index (ISI) operation and the other for Externally Specified Index (ESI) operation. Up to four input/output channels from any combination of Processors and IOC's can be connected via the MPA to a peripheral subsystem.

In the ISI mode the MPA will "lock" onto the first channel that issues a function and will continue servicing that channel until one of two contingencies occurs. They are:

(1) An External Interrupt is issued by the Peripheral Subsystem, or
(2) A Release Control function is issued by a Processor or IOC.

In the ESI mode the MPA will remain locked onto a Communications Subsystem until another channel issues an External Function. In the normal ESI mode the MPA remains continuously dedicated to the initiating input/output channel. In the case of a channel malfunction, control of the Communications Subsystem can be switched to another channel connected to that MPA. The switching is accomplished by a function being issued by the channel wishing to take control.

Dual access devices

In addition to providing multiple access to individual subsystem control units, the capability also exists for dual access to certain devices by two control units. The devices provided this capability are Tapes, FH Drums, and FASTRAND Drums. These devices have the capability for resolving simultaneous requests for service from two independent control units. If one of these devices is busy, a busy status is presented to the other control unit. Upon completion of the current service request, the "waiting" control unit is granted access. Further discussion of simultaneous service requests to a device is included under the System Control section of this paper.

System control (software)

It is obvious from the preceding hardware discussion that many hardware configurations exist and also that many alternate paths between system components are available. However, software and system performance considerations make it advantageous to pre-specify legal system configurations and the primary data flow paths, as shown in Figure 2.

System control is maintained by the Executive (EXEC) software system. The EXEC can be executed by any processor in the system. Each, in turn, acting as the EXEC processor will inspect the list of current activities and select a task to be done. As previously stated in the hardware discussion, the EXEC processor may interlock an area of critical common procedures or data during task assignment or facilities assignment operations. Input/output operations are normally initiated using the primary data flow path shown in Figure 2. To increase system performance the EXEC will also assign another path to a string of dual access devices such as the FH Drums. This secondary path may also serve as the primary path in the case of malfunction. Secondary paths to single access device subsystems (such as a card subsystem) are also assigned during system generation so that they may function as primary paths during IOC or Processor malfunction. The EXEC routines interfacing with the resident worker programs are re-entrant in design. For minor tasks requested of the EXEC, many of the routines are totally re-entrant.

Others, when in the multiprocessing environment, will queue the worker program requests where serial processing in a particular area of the EXEC is required. At the lowest level, the EXEC must queue interrupts and issue input/output service requests on the pre-designated processor. Above this basic level, any of the available processors can perform a given task with selection based on the priority of the task currently being executed by the processors.

System performance

Several methods have been used in an attempt to quantitatively express the additional increase in computational performance afforded by the addition of one or more processors to the basic system.

Most of these methods consist of complex mathematical expressions and simulation models, or a combination of the two, in order to depict all of the factors of the configuration and their complex interrelationships.

The simple expression presented below is only intended to show the primary effect of the factors.

N = P x 10^6 / (C + Q + D + E)   instructions per second   (1)

where
P = number of processors
C = cycle time of memory (the memory itself)
Q = delay due to queues at memories
D = delays due to hardware (MMA, etc.)
E = time added due to extended sequence instructions
and C, Q, D, and E are in microseconds.

C and D are straightforward factors; E is best gotten from the Gibson Mix or the like, and Q is the most difficult number to arrive at.

As an extreme case, consider the I-processor con­figuration with a 750 nanosecond memory, no queues at memory and no hardware delays. The maximum rate of instruction execution with no extended se­quence instructions would be,

lXl~ .. N = ~ = 1.33xI06 InstructIons per second

(theoretical maximum-one processor)

The maximum rate of instruction execution with extended sequence instructions would be,

lXI06 ., N = .75+.300 = 0.95xI06 InstructIons per second

(for one processor)

As an example of a 2-processor case, set C = .750 and D = .125. An estimate for E would be about .300. Through the medium of various simulations of the 2-processor case, a value of Q = .050 may be arrived at.

Hence,

N = 2 x 10^6 / (0.75 + 0.05 + 0.125 + 0.300) = 1.63 x 10^6 instructions per second   (for a 2-processor multiprocessor system)

Therefore,

Gain for the 2nd processor = (1.63 - 0.95) / 0.95 = 0.71

The value of E may be adjusted through a small range to reflect the type of problem mix. Q is influenced most strongly by the configuration, the I/O load, and the intelligence of the assignment portion of the Executive. Q is also somewhat interdependent with E.
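Equation (1) and the worked examples above can be reproduced with a few lines of C; the parameter values are those used in the text.

#include <stdio.h>

/* N = P * 10^6 / (C + Q + D + E) instructions per second,
   with C, Q, D, E in microseconds (equation 1). */
static double rate(double P, double C, double Q, double D, double E)
{
    return P * 1.0e6 / (C + Q + D + E);
}

int main(void)
{
    double one_cpu_max = rate(1, 0.750, 0.0,   0.0,   0.0);     /* 1.33e6 */
    double one_cpu     = rate(1, 0.750, 0.0,   0.0,   0.300);   /* 0.95e6 */
    double two_cpu     = rate(2, 0.750, 0.050, 0.125, 0.300);   /* 1.63e6 */

    printf("1 processor (theoretical max): %.2f x10^6 instr/s\n", one_cpu_max / 1e6);
    printf("1 processor (with E):          %.2f x10^6 instr/s\n", one_cpu / 1e6);
    printf("2 processors:                  %.2f x10^6 instr/s\n", two_cpu / 1e6);
    printf("gain for 2nd processor:        %.2f\n", (two_cpu - one_cpu) / one_cpu);
    return 0;
}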

System availability

Central System availability is assumed to be required on a continuing basis. To achieve this goal modular independency is stressed throughout the system. Each system component is electrically and logically independent. The same provision is made regarding cooling and power supplies. Provision is also made for disconnecting any malfunctioning components without disrupting system operation.

This disconnection is initiated either by the Executive System or the operator. The component in either case is first "program disconnected" by the Executive System and then the operator is informed via the system console. System operation will then degrade to a level which is dependent upon the malfunctioning unit's normal contribution to system performance.

The Availability Control Unit (ACU) and the Availability Control Panel (ACP) shown in Figure 3 provide the capability for isolating components from the remainder of the operational system for purposes of test and maintenance. Within this framework two other important features are included. They are: Partitioned System capability, and an Automatic Recovery System (ARS).

The ARS is basically a system timer; it is dependent upon the Executive System to reset it periodically (millisecond to several second intervals). If it is not reset, a catastrophic malfunction of the Executive System is assumed, and an initial load operation will be initiated by the hardware using a pre-specified portion of the total system. An example of a pre-specified split in the system configuration is shown in Figure 4.

For the system shown in Figure 4 the left-hand portion would try to initial-load the Executive System first. If the malfunction still exists (the system timer would trigger again) then the right-hand portion of the configuration would attempt the initial load operation. Assuming one malfunction, one of the two configurations will be operational.

This then is the Partitioned System recovery concept implementation. The system can also be partitioned manually by selection. The partitioning is pre-specified, and the initiation only serves to activate the pre-selected partitions.
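A toy C sketch of the recovery idea just outlined: a counter stands in for the system timer, a healthy Executive resets it, and a missed reset triggers an initial load on the alternate pre-specified partition. The interval, the tick loop, and the partition bookkeeping are illustrative assumptions only.

#include <stdio.h>

enum { LIMIT = 5 };               /* assumed timeout, in ticks          */

static int timer;
static int partition = 0;         /* 0 = left half, 1 = right half      */

static void exec_reset_timer(void) { timer = 0; }

static void tick(void)
{
    if (++timer >= LIMIT) {       /* Executive failed to reset in time  */
        printf("ARS: initial load attempted on partition %d\n", partition);
        partition ^= 1;           /* a further failure tries the other half */
        timer = 0;
    }
}

int main(void)
{
    for (int t = 0; t < 12; t++) {
        if (t < 4) exec_reset_timer();    /* Executive healthy at first */
        tick();
    }
    return 0;
}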

SUMMARY

The 1108 Multiprocessor System is a large scale central data processing system which functions equally well in a real-time, demand processing application as in a scientific batch processing environment. The symmetrical system configuration permits a high degree of parallel processing of tasks by all system processors. Modularity of system components provides for total system availability as well as ease in expansion capability for the future.

Figure 1 - Univac 1108 multiprocessor system, maximum configuration (four 65K memory banks of paired 32K modules behind MMA pairs; three 1108A processors; two input/output controllers; the availability control unit and ACP; MPAs, control units, dual-access devices, a communications subsystem, and consoles; includes a table of recommended configurations of processors, IOC's, and minimum memory)

Figure 2 - Primary data flow (MMA priority: IOC0, IOC1, then P0, P1, P2; the MPA locks on a function and releases on an External Interrupt)

Figure 3 - Availability control unit (interfaces to the CPUs, the IOCs, the MMA pairs, the MPAs, and the ACP on the CPU console)

Figure 4 - Partitioned system (two halves, each with a processor, MMA pairs, and 65K memory banks)


Considerations in block-oriented systems design

by D. H. GIBSON

Systems Development Division, International Business Machines Corporation, Poughkeepsie, New York

SUMMARY

The feasibility of transmitting blocks of words between memory and CPU is the subject of this study. The question is pertinent to the design of very fast computing systems where the nanoseconds to traverse a few feet become significant. There is intuitive advantage to transmitting blocks of words, rather than a word at a time. The initial access time due to physical distance, effective address mapping, and priority is a few hundred nanoseconds. If this time could be prorated against several words, then the effective access time could be reduced to a few tens of nanoseconds. The question is, of course, can the extra words be useful to the CPU?

This question was explored in a simulation model driven by customer-based IBM 7000 series data. The simulation results indicate that blocks of 4, 8, or 16 words, transmitted to a local storage of 2K to 4K words, will adequately prorate memory access time. With this configuration, block transfer is seen to be an efficient memory access method which can provide high performance, superior to single-word access.

Study goals

The classical CPU reference to memory is for one word, usually for one instruction or one piece of data. Block transfer provides more words than a CPU asks for. A CPU designed for block transfer will save the extra words locally and if it should refer to one of these words, it will not be necessary to go back to memory. On the other hand, if the next CPU reference is for a word in any other block, and the CPU can only save one block, then it will be necessary to go back to memory and waste the transmission time needed to bring over the extra words.

The purpose of this study is to investigate the usefulness of these extra words to the CPU. Specific questions to be asked are:

(1) How many extra words in a block are useful?
(2) How many blocks should the CPU save locally?
(3) When the CPU must replace one of the blocks it has saved locally, how should that block be chosen?
(4) How does the type of program running in the CPU affect the usefulness of block transfer?

The block transfer system is shown in Figure 1.

Figure 1 - Block transfer system (a very large memory, or backing store, feeds a local storage and the CPU; the open questions are how wide the block, how big the local storage, which block to replace, and what the effect of the program is)

Methodology

The pattern of CPU references to memory will determine the usefulness to the CPU of extra words in a block. Conceptually, this pattern could range from random to sequential. In this study, we will first consider the ends of the range, the pure patterns of randomness and of sequential references. Pencil-and-paper calculations suffice to answer the questions posed. The study will then use customer-based IBM 7000 series data to examine addressing patterns on the scale between random and sequential. A computer program aids in the analysis of these other patterns.


Analysis of a particular addressing pattern consists of simulating a block size, a local storage size, and a particular algorithm for choosing the block to be replaced when the local store is full (see Figure 2).

Figure 2 - Analysis of address pattern (choose a block width, a local storage size, a replacement algorithm, and an addressing pattern ranging from sequential to random, including the customer-based 7000 series program address patterns)

The basic number resulting from an analysis is the number of references not found in the local storage; i.e., the number of times a block is transferred to the local store.
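The following C sketch shows the shape of one such analysis: an address trace is run against a small set of block frames and the references not found there are counted. The FIFO replacement choice and the synthetic random trace are stand-ins for the study's real algorithms and 7000 series traces, and the sizes are example values.

#include <stdio.h>
#include <stdlib.h>

#define BLOCK    16           /* words per block               */
#define NBLOCKS  128          /* block frames in local storage */

static long frame[NBLOCKS];   /* block number held in each frame, -1 = empty */
static int  next_victim;      /* FIFO replacement pointer                     */
static long misses;           /* references not found in local storage        */

static void reference(long addr)
{
    long blk = addr / BLOCK;
    for (int i = 0; i < NBLOCKS; i++)
        if (frame[i] == blk)
            return;                        /* found in local storage  */
    misses++;                              /* block must be fetched   */
    frame[next_victim] = blk;
    next_victim = (next_victim + 1) % NBLOCKS;
}

int main(void)
{
    for (int i = 0; i < NBLOCKS; i++) frame[i] = -1;

    for (long t = 0; t < 200000; t++)      /* one 200,000-reference slice        */
        reference(rand() % 32768);         /* synthetic pattern, not trace data  */

    printf("references not found in local store: %ld\n", misses);
    return 0;
}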

Scope of the study

Approximately 600 analyses were run during the course of the study. The block sizes were varied in steps of powers of two, from 4 words per block to 4096 words per block; the local storage sizes were varied from 32 words to 8192 words; and the replacement algorithm was varied over about 15 basic algorithms with several subvariations.

The available addressing patterns were derived from twenty 7000 series customer programs obtained from the field. Each program ran approximately three million address references, which were sliced into 200,000-reference sequences. Thus some 300 address patterns were available for study (see Figure 3).

Address patterns

A random addressing pattern is a sequence of CPU references to memory in which any address is equally likely to occur at any point in the sequence. For such a pattern, the probability of not finding the desired word in local storage exactly equals one minus the ratio of "size of local store" to "size of backing store." Since any word is equally likely to be referenced by the CPU, variations of block size and replacement algorithm have no effect on the number of word references not found in the local store.

A sequential addressing pattern is a sequence of CPU references to memory in which any address is exactly one word away from the preceding address at any point in the sequence. For such a pattern, the probability of not finding the desired word in local storage is exactly the inverse of the block size. Variations of local store size and replacement algorithms have no effect on the number of interest here.
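The two limiting cases reduce to one-line calculations; the sizes below are example values only.

#include <stdio.h>

int main(void)
{
    double local = 2048, backing = 32768, block = 16;        /* words */

    printf("random pattern miss probability:     %.3f\n", 1.0 - local / backing);  /* 0.938 */
    printf("sequential pattern miss probability: %.3f\n", 1.0 / block);             /* 0.063 */
    return 0;
}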

Figure 3 - Scope of study (block widths vary from 4 words to 4096 words; local storage size varies from 32 words to 8192 words; the replacement algorithm varies from congruent mapping to reference activity; addressing patterns vary from sequential to random, drawn from approximately 20 different customer jobs of all sizes)

A set of customer-based 7000 series addressing patterns is neither random nor sequential. The range of results shows two distinct types of patterns. The probability of not finding the desired word in local storage is most often about 0.015, although for a given local store size and block size there are a significant number of addressing patterns for which the probability is approximately 0.075. Variations of block size, local store size, and replacement algorithm affect this probability, as illustrated by the following discussion.

From Figure 4, 76% of the addressing patterns examined have a probability of 0.0275 or less that the desired word is not in a local store of 2048 words, for a block size of 16 words.

Extra words in a block

The usefulness to the CPU of extra words in a block decreases as the block size increases. All other things being held constant, the smaller the block the better. Obviously, however, all other things are not equal, since, for a given technology, there exist a fixed time per access and a fixed time per word transferred.

Figure 4 - Range of results (a frequency distribution, over analyses differing only in the address pattern, of the percent of word references not found in the local storage; the result of one analysis is the number of word references not found in the local storage)

As may be seen in Figure 5, the customer-based 7000 series data indicates that doubling the block size typically doubles the number of words transferred. On the other hand, for a given replacement algorithm and a given local store size, changing the block size does not significantly change the number of accesses outside the local store. Indeed, in the case shown, where the local store is significantly less than what is required to contain the total program, a large block size will produce more traffic between the local store and the backing store than between the CPU and the local store.

No data are available for block sizes smaller than four words. It would be expected that the "number of accesses outside local store" would increase for the smaller block size, while the "number of words transferred to local store" would continue to decrease. In the special case of a one word block, these numbers would be equal to one another, and would equal the number of unique words required by the CPU (about 6000 for this addressing pattern).

Blocks to be saved locally

The typical program has a minimum of four centers of activity; viz., the instruction area, two operand areas for source data, and one operand area for sink data. Theoretically, then, it is desirable to save at least four blocks locally to the CPU.

The customer-based IBM 7000 series data indicates that many more than four blocks should be saved for efficient operation. The different addressing patterns seem to have in common that 128 or more blocks of 16 words each should be saved in local store to assure finding most of the referenced words already available.

Figure 5 - Block sizes (number of words transferred to local storage, number of word references required by the CPU, and number of accesses outside local store, plotted against block sizes from 4 to 2,048 words)

As the block size increases, the local store size should increase, although the number of blocks that must be saved decreases. The product of block size and number of blocks equals the size of the local store in words. This product is smallest when the block size is smallest. The four addressing patterns shown in Figure 6 are unequal in numbers of blocks referenced, yet converge to a low percentage of outside references for 128 or more blocks. This would indicate that a local storage of 2K words would be suitable for a block transfer system design.

Choice of block to be replaced

The local store contains blocks of information which must be replaced when a desired reference is not found therein. Suppose the processor, upon making reference to the local store for a word, finds that the word is not there. The processor then chooses a space in the local store to hold the block containing the desired word. That space is then examined to determine whether or not it is occupied and appropriate action is initiated. Finally, the block containing the desired word is moved from backing store to the chosen space locally. The algorithm used to choose the space in local store is called the replacement algorithm.

The data used in this study indicates that the replacement algorithm is of second-order importance to the addressing pattern of the running program in determining the number of references not found in local store. Further, there seems to be no one "best" algorithm for all addressing patterns. Rather, a given algorithm is best for one pattern, and not best for some other pattern.

Figure 6 - Local store sizes (percent of references not found in local store versus the number of 16-word blocks in local storage, for four addressing patterns; one program fits completely within 541 blocks, another within 812 blocks)

The replacement algorithms divide into three classes. The congruent mapping algorithm is probably cheapest to implement but makes least efficient use of local store. A modified version of this algorithm will cost more but can improve efficiency. This class of algorithm uses a binary decode of the address to choose the space in local store. The other two classes use an associative search to choose a space. The "first in, first out" algorithm searches the available space to find the space first filled (i.e., longest ago filled). This class of algorithm is based upon only the blocks in local store and does not deal with block history in the backing store. On the other hand, the third class of algorithm deals with reference activity information stored in extra cells in the backing store; e.g., the total number of times a block was referenced for all stays in local store might be stored in backing store, and used by the replacement algorithm whenever the block is in local store.

Figure 7 indicates that the choice of replacement algorithm for a given system should be dictated at least as much by available technology as by theoretical properties of the contending algorithms.
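The first two classes can be sketched in a few lines of C to show the structural difference: congruent mapping decodes address bits to pick the frame directly, while the first-in, first-out class searches for the frame filled longest ago. The frame count and example block number are illustrative values only.

#include <stdio.h>

#define NFRAMES 128

/* Class 1: congruent mapping - the frame is a binary decode of the block number. */
static int congruent_frame(long block_number)
{
    return (int)(block_number % NFRAMES);
}

/* Class 2: first in, first out - replace the frame that was filled longest ago. */
static long fill_time[NFRAMES];

static int fifo_frame(void)
{
    int oldest = 0;
    for (int i = 1; i < NFRAMES; i++)
        if (fill_time[i] < fill_time[oldest])
            oldest = i;
    return oldest;
}

int main(void)
{
    printf("block 5321 maps to frame %d\n", congruent_frame(5321));
    printf("FIFO would replace frame %d\n", fifo_frame());
    return 0;
}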

Figure 7 - Replacement algorithm (percent of references not found in local storage for different addressing patterns A through G, comparing congruent mapping, modified congruency, first in-first out, random choice, pushdown, and reference activity algorithms)

Program size vs. block transfer effectiveness


The usefulness to the CPU of block transfer is clearly a function of the program running in the CPU, if by "program" we mean "address pattern." No other meaningful description of "program" can be correlated to the swapping activity in a block transfer system. The trace of a single FORTRAN compilation illustrates the inadequacy of program size as a determining factor of swapping activity.

Figure 8 emphasizes the changing address pattern within a single "job." Time has been sliced into units of 200,000 references. The upper chart indicates the changing size of storage used by the compiler, varying from a low of 6.4K to a maximum of 32K in the tenth time slice. This chart is detailed to indicate relative storage required by data and by instructions. The lower chart indicates the percentage of references not found in local store during the various slices. Note particularly that the greatest incidence of going outside the local store occurs when the used storage size is smallest (during the fifth time slice, when unique words used is nearly minimum).

From this and many similar programs, it is concluded that the program size is of second-order importance to the addressing pattern in determining the effectiveness of block transfer.

Figure 8 - FORTRAN compiler compiling 600-card program (upper chart: dynamic program size as percent of the 32K total storage used, split into data and instructions; lower chart: dynamic swapping activity, the percent of references not found in local store, both plotted against the number of address references)

Traffic in a block transfer system

One intuitively feels that the number of bits transferred per second is highest between the CPU and the local store, with a significantly lower bit rate between local store and backing store, and a lower rate still between backing store and auxiliary storage. This type of bit traffic is certainly desirable when one considers the bit rates of the various devices which constitute the typical storage hierarchy.

The 7000 series data shows that to achieve this desirable traffic rate, block size and local store size must be carefully chosen. While Figure 9 illustrates a local store/backing store hierarchy, the numbers have equally valid interpretation for the classical core/drum hierarchy. Thus, if a drum is accessed for 1024 words per block, and if the program has only 4096 words allotted for use in core, then 1.9 words will be transferred from drum to core for each word transferred from core to CPU.

Figure 9 - Transfer channels. Ratio NT/NP (number of bits transferred into local storage, NT, per number of bits processed by the CPU, NP) for the CPU / local storage / backing storage hierarchy, tabulated by block size and local storage size in words:

Block              Local storage size (words)
size      32     64     128    256    512    1024   2048   4096   8192
  16      6.5    1.9    1.4    1.0    .70    .17    .10    .045   .033
  32             12.1   3.0    2.0    1.2    .37    .16    .072   .039
  64                    23.2   4.5    2.5    1.2    .27    .14    .043
 128                           44.6   7.0    3.1    .45    .22    .073
 256                                  84.1   11.3   2.5    .35    .09
 512                                         157.   6.9    .73    .20
1024                                                24.6   1.9    .42
2048                                                       7.9    1.2
4096                                                              3.7

Traffic density is most often favorable for small block sizes. If a large block must be used, then a correspondingly very large local store must be chosen.
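The ratio in Figure 9 can be computed directly from the output of an analysis: NT is the miss count times the block size, NP is the number of CPU references. The miss count below is an assumed value chosen only to illustrate the arithmetic, not a figure from the study.

#include <stdio.h>

int main(void)
{
    long references = 200000;     /* NP: words requested by the CPU          */
    long misses     = 371;        /* assumed result of one analysis          */
    long block      = 1024;       /* words moved into local store per miss   */

    double nt_np = (double)(misses * block) / (double)references;
    printf("NT/NP = %.2f words moved per word used\n", nt_np);   /* about 1.9 */
    return 0;
}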

Block transfer system performance

The "number of references not found in local store" can be translated to approximate system performance for given hardware implementations. This is most readily done by assuming that the CPU generates a word request for every potential local store cycle. The limit to the system performance is then set by the local store cycle time, and would be reached if the "number of references not found in local store" were zero.

As an example, suppose we compute the system performance for "number of references not found in local store" equal to 275 out of 10,000. We will assume a local store cycle of 50 nanoseconds, and we will assume that the distance DA = DC (as shown in Figure 10) corresponds to 150 nanoseconds. (Note that this figure includes time for priority determination at the backing store, and for dynamic relocation.) We will assume a backing store cycle of 800 nanoseconds with a block readout of 16 words. We compute as follows:


Time for one block reference = 150 + 800 = 950 nanoseconds
Time for 275 block references = 275 x 950 = 261,250 nanoseconds
Time for 10,000 word references = 50 x 10,000 = 500,000 nanoseconds
Total time for references = 761,250 nanoseconds
Average time per reference = 761,250 / 10,000 = 76.125 nanoseconds

Figure 10 - Block transfer system performance (three configurations, A, B, and C; for nanosecond speeds where physical distance is significant, if the time for block references is small, then System B can be the fastest)

This is for System B as shown. For System A, the average time per reference is 150 + 800 = 950 nanoseconds. For System C the average time per reference is 150 + 50 = 200 nanoseconds.

The above calculation ignores many important parameters, and may not be used to precisely compare system options. Such comparison is, in fact, the subject of an entirely new study. The author has participated in preliminary comparative studies which show that systems of Type B and Type C are more closely equal in performance than shown here. Nevertheless, the calculation indicates the potential performance advantage offered by block transfer when the percentage of references not found in local store is quite small.
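The calculation above can be parameterized as a small C program using the assumptions stated in the text (50 ns local store, 150 ns distance/priority/relocation, 800 ns backing store, 275 misses per 10,000 references).

#include <stdio.h>

int main(void)
{
    double local_cycle   = 50.0;      /* nanoseconds */
    double distance      = 150.0;
    double backing_cycle = 800.0;
    double refs          = 10000.0;
    double misses        = 275.0;

    double block_ref = distance + backing_cycle;                 /* 950 ns  */
    double total     = misses * block_ref + refs * local_cycle;  /* all refs */

    double avg_b = total / refs;                 /* System B, local store hit path */
    double avg_a = distance + backing_cycle;     /* System A                       */
    double avg_c = distance + local_cycle;       /* System C                       */

    printf("System A: %.1f ns per reference\n", avg_a);     /* 950.0  */
    printf("System B: %.3f ns per reference\n", avg_b);     /* 76.125 */
    printf("System C: %.1f ns per reference\n", avg_c);     /* 200.0  */
    return 0;
}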

CONCLUSION

For local store/backing store systems:
(A) Small blocks preferred, with 4 word, 8 word, and 16 word blocks proving quite suitable.
(B) Size of local store suitable at 2K to 4K words.
(C) Replacement algorithms can sometimes improve speed.
(D) The swapping activity cannot be correlated to the size of the program.
(E) The backing store/local store/transfer channel traffic and bit rates must relate properly.
(F) The address patterns of the various programs produce the most significant variation in swapping activity.

REFERENCES

1 L A BELADY A study of replacement algorithms for a virtual storage computer IBM Systems Journal Vol 5 No 2 1966

2 T KILBURN Data transfer control device U S Patent No 3,218,611 November 16 1965

3 T KILBURN Electronic digital computing machines U S Patent No 3,248,702 April 26 1965

4 F F LEE Look aside memory implementation Project MAC-M-99 MIT Cambridge Mass August 19 1963

5 G G SCARROTT The efficient use of multilevel storage Proc IFIPS congress Spartan Books 1965

6 M V WILKES Slave memories and dynamic storage allocation Project MAC-M-164 MIT Cambridge Mass June 22 1964
