
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-31, NO. 5, MAY 1982

Technology and Design Tradeoffs in the Creation of a Modern Supercomputer

NEIL R. LINCOLN

Abstract-Supercomputers, which derive their title from the relative power they possess in any given period of computer life cycle, possess certain qualities which make their creation unique in the general milieu of computational engines. The interaction of technology, architecture, manufacturing, and user demands gives rise to compromises and design decisions which challenge the supercomputer developer. The nature of some of these challenges and their resolution, in the case of the production of the Control Data CYBER 205, is discussed in this paper in an attempt to elevate supercomputer development from the mystique of being an art to the level of a science of synergistic combination of programming, technology, structure, and packaging.

Index Terms-Architecture, parallelism, pipeline, supercomputer, technology, vector processor.

INTRODUCTION

A CLASS VI computer has been informally defined as a processor possessing scalar processing speeds in excess of twenty million floating point operations per second (20 Mflops) and vector (or parallel) execution in excess of 100 Mflops. The number of potential customers for a Class VI computer was conservatively estimated to be between 50 and 100 separate installations in 1981. At that time, two machines were in existence in that category, the CRAY-1 and the Control Data CYBER 205. Even as those estimates were announced, computer users in the aerospace and petroleum industries were claiming that performance at those levels was inadequate to meet the needs of key applications in their particular disciplines. The term "supercomputer," as applied to the most powerful computing machines in existence, thence became defined as that computer which is only one generation behind the computational needs of certain key industries. Obviously, there are many "key industries" whose demands for computational power are escalating faster than the performance capabilities of our computing systems.

The rationale for higher and higher performance in one class

of computer is essentially as follows.

1) The cost/performance for a supercomputer applied to certain classes of problems is superior to that of more modest systems. This is particularly true when the applications require a great deal of on-line storage, I/O bandwidth, central memory, arithmetically intensive computations, or some combination of these.

2) The need for machines to solve three-dimensional models of oil fields, reactors, fluid flows, and structures in production environments implies central memory systems on the order of 8 to 32 million words of storage and auxiliary storage (other than disk) ranging from 32 to 256 million words, where a word is 64 bits in length. Algorithms have now been refined and informational needs have progressed to the point where improvement of three-dimensional modeling is becoming a necessity in many disciplines, rather than a luxury or research curiosity. The scaling of memory requirements when one moves from two- to three-dimensional solutions is a dramatic example of the separation of supercomputer systems from standard or medium scale systems. A two-dimensional system having dimensions of 100 points by 100 points, with each point represented by six independent variables, requires only 60 000 words of memory to hold the initial database. When this system is extended into the third dimension, memory requirements become 6 000 000 words for just the initial database!

Manuscript received July 5, 1981; revised December 17, 1981. The author is with the Control Data Corporation, Arden Hills, MN 55112.

3) Three-dimensional computations also require a massive increase in the amount of arithmetic which must be performed to arrive at a solution. If a two-dimensional model has been yielding adequate production results after one hour of CPU time, the three-dimensional version could require 100 h to achieve comparable results. The computational capability must then be increased in a supersystem to match the memory capacities for three-dimensional models.
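The memory scaling in item 2) can be checked with a few lines of arithmetic. The following sketch (purely illustrative; the function name is ours, not the paper's) reproduces the 60 000 and 6 000 000 word figures for the 100-point-per-dimension, six-variable example:

```python
# Illustration of the 2-D to 3-D scaling argument above. Grid size and
# variable count are taken from the text; the 100x runtime figure quoted
# for item 3) is the paper's own estimate and is not derived here.

def initial_database_words(points_per_dim, dims, variables_per_point=6):
    """Memory (in 64-bit words) to hold the initial database."""
    return points_per_dim ** dims * variables_per_point

two_d = initial_database_words(100, 2)    # 100 x 100 grid
three_d = initial_database_words(100, 3)  # 100 x 100 x 100 grid

print(two_d)              # 60000
print(three_d)            # 6000000
print(three_d // two_d)   # 100: memory grows by the added dimension
```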

4) Many applications on supercomputers possess huge amounts of input and result data which, despite massive central and auxiliary storage systems, must be moved through the machine at high rates in order to exploit the computational power of the machine. Thus, a supercomputer must provide extreme I/O bandwidths and connectivity to match the attached peripherals and the demands of the central processor.

The motivation for a user to procure a supercomputer is rarely item 1) above, but rather the need to fit an increased model size or computing speed requirement. The motivation for a manufacturer to produce a supercomputer is to realize sufficient sales and profit to justify launching an expensive and risky development effort.

Once the decision has been made to commence a large-scale

supercomputer project, the architecture, technology, design approach, and production methodology choices must be made. These elements of the project are interrelated in many subtle ways with the final outcome being a necessary compromise between the constraints of each discipline and the opportunities offered by advances in the state of the art of each arena. The purpose of this paper is to discuss major decisions made in the Control Data CYBER 200 project in an attempt to illuminate the processes involved for future developers.

0018-9340/82/0500-0349$00.75 © 1982 IEEE

THE GENERAL CYBER 200 ARCHITECTURE

In 1975, when the CYBER 200 project commenced, the overall concept of this supercomputer was taken from the STAR-100 architecture which had its origin in 1964. In that year, at the urging of the Atomic Energy Commission, various manufacturers began research into new and somewhat radical machine structures, leading to the Burroughs ILLIAC IV array processor and the Texas Instruments Advanced Scientific Computer (ASC) and the Control Data STAR-100 vector processors. The major features of the STAR-100 which are considered its basic architecture are as follows.

1) Massive, High-Speed Central Memory: In particular, any generation of the machine was expected to possess the maximum amount of memory permitted by extant technology and engineering judgment. For the first STAR-100, a high-density core memory with a maximum of one million words was considered the limit of that technology. The speed of this memory was matched to the overall system bandwidth, established at 12.8 billion bits/s.

2) High-Speed Memory-to-Memory Vector Operations: In current supercomputer terminology, a vector is a contiguously stored string of data, unrelated to the mathematical significance of such an order. Given a Fortran declaration

DIMENSION A(4,4), B(4,4,4)

and the normal Fortran memory allocation system, the entire array A could be considered a vector of length 16, and B could be a vector of length 64 if all the data were to be processed as a unit with all elements independent of one another. If every other plane of the B array were to be processed, one could view the data as being in four vectors, each of length 16 elements (see Fig. 1). The requirement for contiguous storage of data in memory arises from the engineering solution to achieving high bandwidth (Fig. 2), but the conceptual notion of dealing with problem solutions in vector form has value in its own right, and is being exploited by mathematicians in the development of new algorithms [1]-[7].
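The vector views described above can be sketched in a few lines. Fortran stores arrays contiguously in column-major order, so A(4,4) occupies 16 consecutive words and B(4,4,4) occupies 64; each 4 x 4 plane of B is itself contiguous. This Python illustration (the helper name is ours) numbers the elements in storage order and slices out the plane views:

```python
# A sketch of the vector views described in the text. Fortran allocates
# arrays contiguously, so DIMENSION A(4,4), B(4,4,4) occupy 16 and 64
# consecutive words respectively.

def flatten_fortran(shape):
    """Linear (storage-order) element numbering for an array of the given shape."""
    total = 1
    for d in shape:
        total *= d
    return list(range(1, total + 1))

A = flatten_fortran((4, 4))      # one vector of length 16
B = flatten_fortran((4, 4, 4))   # one vector of length 64

# Each 4x4 plane of B is contiguous, so B can also be viewed as four
# vectors of length 16 (e.g., when processing every other plane).
planes = [B[i * 16:(i + 1) * 16] for i in range(4)]

print(len(A), len(B), [len(p) for p in planes])  # 16 64 [16, 16, 16, 16]
```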

In the case of the STAR-100, the data in vector form are read from central memory and sent to the arithmetic units, with the results returned to memory upon completion of the operation; hence the term "memory-to-memory" when applied to this architecture. This system also appeared in the TI ASC which, while quite similar in functionality to the STAR-100, could create vectors out of noncontiguously stored data at a consequent degradation in machine performance. The remaining vector machine of note, the CRAY-1, provides arithmetic functions similar to those of the STAR-100 and ASC, but its vectors are read from a set of high-speed registers, which themselves must be loaded from and unloaded to central memory at different points in the processing.

In all three instances the arithmetic units were constructed as pipelines (see Fig. 3). Since the data arriving at these units are usually contiguous, every segment in a pipeline will be performing useful work except at the very beginning and end of a vector operation (as the pipeline fills and empties). From an engineering point of view this makes very efficient use of the circuitry employed, at a cost of some time to "get the operation rolling" (first result returned to memory or registers) due to the length (in number of clock cycles) of the pipeline.

Fig. 1. Fortran arrays as vectors: array A viewed as a vector of length 16; array B in Fortran order, viewed both as a single vector of length 64 and as four vectors each of length 16.

Fig. 2. CYBER 200 memory organization: interleaved memory banks servicing one memory request.
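The fill-and-empty cost described above can be captured in a simple timing model (illustrative only, not the actual STAR-100 or CYBER 205 timing): an s-segment pipeline accepting one element per clock delivers its first result after s cycles and one result per cycle thereafter.

```python
# An illustrative model of pipeline fill/drain cost: with one element
# entering per clock, an s-segment pipeline produces its first result
# after s cycles, then one result per cycle. The segment count and
# vector lengths below are arbitrary examples.

def pipeline_cycles(n_elements, segments):
    """Clock cycles to stream n_elements through an s-segment pipeline."""
    if n_elements == 0:
        return 0
    return segments + (n_elements - 1)

def efficiency(n_elements, segments):
    """Fraction of the ideal one-result-per-cycle rate actually achieved."""
    return n_elements / pipeline_cycles(n_elements, segments)

# Startup dominates short vectors; long vectors approach the peak rate.
print(round(efficiency(10, 8), 2))    # 0.59
print(round(efficiency(1000, 8), 2))  # 0.99
```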



Fig. 3. Vector pipeline concept: input buffers feeding the pipeline segments, with an output buffer returning results to memory.

3) Vast Virtual Memory Space: Experience has shown that few supercomputers totally operate on a single massive problem in a dedicated manner 24 hours a day, seven days a week. The normal installation finds the waking hours being consumed by algorithm development, program debugging, and interactive execution of research programs and even large production programs. The evening and midnight hours are usually more structured, with one or more major programs monopolizing the machine's resources and with, perhaps, a modest amount of interactive debugging being pursued in a background mode. These different forms of operation present problems to operating system designers, operations personnel, and even applications programmers.

In a fixed-memory allocation system the user must be aware of the resources being consumed by a particular program at all times. For example, a programmer, preparing a set of run parameters for a job to be run at night, may need to verify several passes of that process before committing the entire ensemble to the hands of the midnight crew. If this job truly consumes the total resource at midnight with that set of parameters, then it will demand that same resource during the prime time daylight hours even though the effort is primarily debugging. In a fixed memory scheme this means that all other users are denied access while the major job is being tested, or at the very least, while jobs are swapped to and from the CPU with consequent loss of turnaround efficiency.

A virtual memory system has a place in a supercomputer as a tool to do the following.

a) Assist in Memory Management: The operating system can commit and decommit arbitrary blocks of real memory without having to ensure the physical contiguity of a user's workspace. This reduces overhead for actions such as accumulation of unused space, which could be quite costly in large memory systems (on the order of eight million words).

b) Provide Identically Appearing Execution of All Jobs (Ignoring the Turnaround Time, Of Course) Regardless of What Resources are Required or Available: This means that a program's DIMENSION statements and input parameters can remain unchanged whether a four hour run is being contemplated or a simple debugging run of a particular phase is intended.

c) Eliminate Working Space Constraints from Algorithm Development: Mathematicians/programmers can begin developing an algorithm as if they had available an infinite workspace in which to put data and temporary results. Once the algorithm is developed, "ADVICE" from the programmer on how to handle paging of the information can be introduced into the code to optimize the performance of the system, and to eliminate the oft-feared thrashing that can occur in virtual memory machines moving data to and from real central memory.

The STAR-100 was given a monolithic virtual space of 32 trillion bytes, which could be considered essentially infinite for most applications. This scheme of virtual storage was a "workspace" scheme rather than a "segmented memory" scheme. The latter was employed on the IBM series computers, wherein the virtual memory was used to link and structure data files. Fig. 4 illustrates the STAR-100 virtual memory scheme.

4) Concurrent Input-Output and Arithmetic Processing:
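A workspace-style paged virtual memory of this kind can be sketched minimally using the page sizes given later in this paper (the 4096-byte small page and 524 288-byte large page). The address split below is illustrative only; it is not the actual STAR-100 or CYBER 205 page-table format.

```python
# A minimal sketch of paged address translation using the page sizes
# quoted in the text. The decomposition into (page number, offset) is
# illustrative, not the real STAR-100/CYBER 205 mapping hardware.

SMALL_PAGE = 4096            # bytes
LARGE_PAGE = 524288          # bytes
VIRTUAL_SPACE = 2 ** 45      # ~32 trillion bytes of virtual space

def page_and_offset(virtual_addr, page_size):
    """Split a virtual byte address into (page number, byte offset)."""
    assert 0 <= virtual_addr < VIRTUAL_SPACE
    return virtual_addr // page_size, virtual_addr % page_size

print(page_and_offset(10_000, SMALL_PAGE))  # (2, 1808)
print(page_and_offset(10_000, LARGE_PAGE))  # (0, 10000)
```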

If a supercomputer must stop its high performance processing while input-output (I/O) operations are being performed, then for many applications the effective speed of the machine can be reduced to less than a quarter of its theoretical capacity. This fact is particularly evident in areas such as seismic processing, where few calculations are performed for every data element but in which the amount of data flowing through the machine is enormous. The STAR-100 was architected to ensure that the peak pipeline rates could be sustained regardless of the amount of I/O activity. This activity could reach peaks of 600 Mbits/s total for 12 channels.

5) Employment of "Iverson" or APL Operations as an Adjunct to the Standard Arithmetic Operations of ADD, SUBTRACT, MULTIPLY, DIVIDE, and SQUARE ROOT: It becomes quite clear that the ability to effectively vectorize a problem and make optimal use of a vector structure, such as the STAR-100, is dependent on the availability of means for data "mapping" or restructuring to produce vectors from otherwise nonvector constructs. Iverson, in A Programming Language [8], provided a conceptual basis and notational scheme for these mapping functions. The functions of COMPRESS, MASK, MERGE, SCATTER, and GATHER were derived from this book and incorporated into the STAR-100; in addition, many of the reduction operations SUM, PRODUCT, and INNER PRODUCT were implemented in STAR hardware.

Fig. 4. STAR-100 virtual memory scheme.

A most significant contribution of the APL concept was the

notion that strings of binary bits (called bit vectors) could be used to carry information about vectors and applied to those vectors to perform some key functions (see Fig. 5). Since the bit strings became key to the vector restructuring concept, a means had to be provided to manipulate bit vectors as well as numeric vectors; thus was added the "string" functional unit to the hardware ensemble, and the name String Array (or STAR) processor was born.
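The semantics of the bit-vector functions can be illustrated in a few lines, following one plausible reading of the Fig. 5 definitions (COMPRESS selects on "1"s; MASK chooses elementwise between two vectors; MERGE consumes two source streams under bit control). This is an editorial sketch, not STAR-100 microcode:

```python
# A sketch of the Iverson-style bit-vector operations described above,
# following the Fig. 5 definitions. Orderings and conventions here are
# one plausible reading, for illustration only.

def compress(a, bits):
    """COMPRESS: keep a[i] where bits[i] == 1."""
    return [x for x, b in zip(a, bits) if b]

def mask(a, b, bits):
    """MASK: elementwise, take a[i] where bits[i] == 1, else b[i]."""
    return [x if c else y for x, y, c in zip(a, b, bits)]

def merge(a, b, bits):
    """MERGE: consume the A stream on 1-bits and the B stream on 0-bits."""
    ia, ib = iter(a), iter(b)
    return [next(ia) if c else next(ib) for c in bits]

print(compress([10, 20, 30, 40], [1, 0, 1, 1]))  # [10, 30, 40]
print(mask([1, 2, 3], [9, 8, 7], [1, 0, 1]))     # [1, 8, 3]
print(merge([1, 2], [9, 8, 7], [0, 1, 0, 1, 0])) # [9, 1, 8, 2, 7]
```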

It is interesting to note that all other vector and array processors of the past few years (including the Burroughs Scientific Processor) adopted forms of the Iverson functions.

SPECIFIC ARCHITECTURE OF THE CYBER 200/205

Architects learn from the construction experience of each of their creations. The same is true of supercomputer engineers, for whom experience during the construction and usage of a current generation computer acts as guidance for the next generation. The lessons learned from the STAR-100 were not, in the main, architectural in nature but were of an engineering character. For example, the emphasis placed on the performance of the various vector instructions for the STAR-100 was quite literally based on instinctive estimates of the value of such functions. The processing rates for various instructions are given in Table I. Note that the throughput rates for the Iverson functions, including the SCATTER/GATHER operation, are from 6 to 50 times slower than the arithmetic functions of ADD, SUBTRACT, and MULTIPLY. These ratios were the accidental consequences of decisions related to economics and schedules during the design of the STAR-100, and were not inherent in the architecture of the machine itself. As production programs were fully developed for the STAR-100, it became obvious that the nonarithmetic functions had to proceed at arithmetic speeds to prevent them from becoming a bottleneck to machine performance. The CYBER 205 column of Table I illustrates the design decisions made for this new machine as a result of the STAR experience.

The design of a new supercomputer is influenced by the experiential base provided by its predecessor. This base is first translated into a set of "design objectives" which serve as goals for the designer. In a sense, these represent architectural decisions because they guide the overall direction of the CPU engineering; however, they do not affect the structure of the machine in ways that are visible to the user and which must be dealt with by the operating system and compilers. Three major changes were made in the architecture of the STAR-100 to yield the CYBER 205.

1) In the STAR-100, the operation of the scalar unit was coupled with the vector unit such that only one type of operation could be performed at a time. This feature became built into the architecture in that the register file, which could house 256 64-bit operands for processing by the scalar unit, was also made a part of the virtual memory over which vector operations could function (see Fig. 6). This meant that the register file could behave as the source and destination for vector arithmetic operations, and at other times could provide operands for scalar arithmetic and vector setup. The difficulties inherent in checking for potential conflicts between scalar usage and vector usage of the register file made it necessary to interlock these two units, allowing only one function at a time.


Fig. 5. Bit vectors for Iverson functions. (a) Example of COMPRESS: compress source word vector A by bit vector B, giving result word vector R (select on "1"s in B). (b) Example of MASK: mask word vector A and word vector B under control of bit vector C to give result vector R (select the A stream on "1"s of the bit vector, the B stream on "0"s). (c) Example of MERGE: merge source word vectors A and B under control of bit vector C to give result word vector R (select the A stream on "1"s in C, the B stream on "0"s in C).

The ability to run both vector and scalar units in parallel became highly desirable as a vast array of user programs were being run through the STAR machine during its development. The result was a decision, made very late in the STAR-100 design effort, to eliminate the register file from the virtual memory space so that future models could, in fact, provide parallelism of the various functional units. Although this decision came too late for the STAR-100 to utilize the inherent parallelism, it was made with the intention that a later but compatible machine could benefit. The impact of this decision was rather momentous. First, many memory-to-memory string operations were originally used to provide register logical operations such as AND, OR, EXCLUSIVE-OR, and SHIFT. Once the register file was removed from virtual memory, a complete set of register-to-register logical and shift operations had to be added to the command repertoire. Second, the compiler and operating system had to undergo substantial modifications to incorporate the newly added instructions and to eliminate a number of philosophical constructs related to the placement of the registers in virtual memory.

TABLE I
VECTOR PROCESSING RATES (KEY INSTRUCTIONS)

Operation            STAR-100 Peak Rate   Operand Size   CYBER 205 Peak Rate
ADD/SUB              50 MFLOPS            64-bit         200 MFLOPS
                     100 MFLOPS           32-bit         400 MFLOPS
MULTIPLY             25 MFLOPS            64-bit         200 MFLOPS
                     100 MFLOPS           32-bit         400 MFLOPS
DIVIDE/SQUARE ROOT   12.5 MFLOPS          64-bit         64 MFLOPS
                     25 MFLOPS            32-bit         122.4 MFLOPS
COMPRESS/MERGE       4.2 MOPS             64-bit         200 MOPS
                     4.2 MOPS             32-bit         400 MOPS
GATHER/SCATTER       0.78 MOPS            64-bit         33 MOPS
                     0.78 MOPS            32-bit         33 MOPS

Fig. 6. STAR register file as virtual memory: register numbers mapped into low virtual addresses, with vectors permitted to begin in the register file and continue through memory.


With the roadblock of the register file allocation eliminated, the designers of the CYBER 205 could concentrate on the functional parallelism of the scalar and vector units. The technology of the STAR-100, which could be termed small scale integration (SSI), made it necessary to share a good deal of logic between the vector and scalar functional elements. The arithmetic portions of the machine were completely shared in this manner. Parallel operation of these units became an engineering possibility with the advent of high-speed large scale integration (LSI). Thus, the architecture of the CYBER 205 became inextricably linked with the technology to be used for its implementation. The increased density and reduced parts count provided by this LSI made possible a substantial enlargement of functional components, among which were dual arithmetic systems, one scalar and one vector.

2) The virtual memory system provided two units of physical memory allocation (called pages). The smaller unit contains 4096 bytes and is called the small page. The size of this page was chosen because it appeared to be a convenient size for interactive users. The larger unit, called the large page, contains 524 288 bytes. This large page was intended for major production jobs that would consume most of the machine resources. Since 1974, 40 Mbit/s disk storage units and data trunks have been available for paging of virtual memory to and from the central memory. The time required to access a single page on rotating mass storage includes system software and disk "seek" time. The movement of 4096 byte pages in single units thus becomes quite inefficient, since the transfer time of 800 µs becomes swamped by the 16 000 µs positioning time of the disk. In addition to the above factors, it became evident early in the life of the STAR-100 that basic program modules and data blocks would be larger than 4096 bytes; hence, the large page was included.

To fit a variety of operating environments, the CYBER 200 family was provided with a range of small page sizes beginning with 4096 bytes and ending with 65 536 bytes. The large page size of the STAR-100 was retained since it appeared to be optimum for large production programs. The reason for offering more than one small page was to maintain compatibility at the outset with the STAR-100 and its 4096 byte page, while providing more efficient transfer units, depending upon the needs of a particular environment. As in the decision to provide dual arithmetic units in the CYBER 205, the availability of LSI as a circuit element made providing several different page sizes in the associative register system an engineering reality.
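The inefficiency of single small-page transfers can be quantified directly from the figures above: a 40 Mbit/s trunk moves a 4096-byte page in roughly 800 µs, which is dwarfed by the 16 000 µs positioning time. A short sketch (illustrative; it ignores the system software overhead the text also mentions):

```python
# Transfer efficiency of single-page disk moves, using the figures in
# the text: a 40 Mbit/s trunk and roughly 16 000 us of disk positioning
# time per access. System software overhead is ignored here.

SEEK_US = 16_000          # disk positioning time, microseconds
TRUNK_BITS_PER_US = 40    # 40 Mbit/s = 40 bits per microsecond

def transfer_efficiency(page_bytes):
    """Fraction of each access spent actually moving data."""
    transfer_us = page_bytes * 8 / TRUNK_BITS_PER_US
    return transfer_us / (transfer_us + SEEK_US)

small = transfer_efficiency(4096)     # ~819 us of data vs 16 000 us of seek
large = transfer_efficiency(524288)   # ~105 ms of data dwarfs the seek

print(round(small, 2))  # 0.05
print(round(large, 2))  # 0.87
```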

3) The input-output system employed in the STAR-100 was of the "star network" type, with node-to-node communication between the CPU and attached peripherals. Management of the peripherals was accomplished by a high-speed, 16 bit miniprocessor called a buffer controller. The STAR computer possessed twelve 40 Mbit/s channels, which for large applications was barely sufficient. Fig. 7 illustrates the star network connection of the STAR-100 and attached peripherals, and contrasts this with the CYBER 205. Note that the connectivity of the CYBER 205 is potentially much greater than that of its predecessor. In addition, data transfers between elements of the CYBER 205 system can bypass the front end elements; in the STAR-100, data rates between permanent file storage and the CPU are limited by the capacity of the front end processor, which typically is on the order of 1-4 Mbits/s.

The change from a point-to-point interconnection to a network form of I/O, which is called the Loosely Coupled Network (LCN), was a major switch for hardware and software alike on the CYBER 205. Transmission of data is accomplished on a high-speed, bit-serial trunk to which 2 to 16 system elements can be coupled. The method of establishing a link is based on "addresses" in the serial message which can be recognized by the hardware trunk coupler, called the Network Access Device (NAD) (see Fig. 8). The most significant aspect of this decision has been the philosophical departure from dedicated peripherals to shared peripherals which are accessed on a "party line" basis. Once this change has been incorporated in the system software, the actual transmission medium and hardware form of NAD is invisible to the user, thus permitting the employment of light pipe and microwave transmission as cost and practicality permit, without change to system software or user programs.

The availability of a technological base was again responsible for the choice of the LCN as the I/O subsystem of the CYBER 205. Key elements here were: 1) the existence of highly reliable 50 Mbit MODEM's, 2) high-speed TTL logic circuits permitting cycle times in the NAD of 100 ns, and 3) memory capacities of 65 536 16-bit words with a bandwidth of 150 Mbits/s. System software, an often neglected technology, has had to mature to the state where a vast network and its operations, maintenance, and performance could become a reality. In effect, network schemes have been under hardware development since 1970, but the software development, to the point of a mature, stable, and deliverable commodity, has required even greater time and manpower.

TECHNOLOGY, THE FUNDAMENTAL ELEMENT IN SUPERCOMPUTERS

The STAR-100 was constructed with the most advanced semiconductor technology available in 1966-1970, called transistor current switch (TCS). This family preceded the introduction of the successful Motorola MECL 10K family of high-performance, moderate power, small scale integration circuits (containing from three to ten simple gates), and was essential to meeting the performance levels required of the STAR. The beginning point for a supercomputer development is the acquisition of the fastest, densest technology available in a given generation (about six to ten years) of circuitry. The clock cycle of a fast computer is then determined by the speed of the basic gate and the time delay of all the interconnections between circuits (see Fig. 9) constituting two consecutive ranks of data latches (flip-flops). The designer's desire is obviously highest speed and capacity;

these are limited by a factor called the "speed-power product" of a single gate. In order to achieve faster switching times, thereby permitting use of a higher clock rate for a given silicon technology gate, greater power is required, resulting in higher heat dissipation. As these gates are aggregated into large ensembles on a piece of silicon about 0.1 inch square, the heat buildup can cause the device to become inoperable or to have a seriously shortened lifetime. Each new generation of technology incorporates improvements in lithography and silicon processing which reduce the speed-power product in dramatic stages. Unfortunately, this reduction is matched by equally dramatic increases in demands for density to meet larger gate counts for supercomputers. The result is a necessary compromise between the designer's desires and the practicalities of powering and cooling the requisite high-performance circuits.

Fig. 7. Input-output configurations. (a) STAR-100 network. (b) CYBER 200 network.

Fig. 8. Network access device (NAD) block diagram.

Fig. 9. Distribution of propagation time (multi-board module: interconnect 40%, circuitry 60%; single module: interconnect 25%, circuitry 75%).

Fig. 10. CYBER 205 register file.
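The speed-power compromise described in this section can be illustrated with a toy calculation. Treating the speed-power product K as a constant for a given technology, halving the gate delay doubles the power to be dissipated. The value of K below is hypothetical, chosen only to be roughly consistent with figures quoted later in the paper (about 5 W for a 168-gate ECL array with 600-800 ps gates); real circuit families follow the product rule only approximately.

```python
K = 21e-12             # speed-power product in joules (hypothetical value)
GATES_PER_CHIP = 168   # ECL gates per CYBER 205 LSI array (from the text)

for delay_ns in (2.8, 1.4, 0.7):
    p_gate = K / (delay_ns * 1e-9)     # watts per gate at this switching time
    p_chip = p_gate * GATES_PER_CHIP   # watts to be cooled per array
    print(f"{delay_ns:3.1f} ns/gate -> "
          f"{p_gate * 1e3:4.1f} mW/gate, {p_chip:5.2f} W/chip")
```

Each halving of gate delay doubles the per-chip dissipation, which is why each process generation's reduction in K is immediately consumed by the demand for faster clocks and higher density.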

MEMORIES, ONE SEGMENT OF TECHNOLOGY

The CYBER 205 possesses a memory hierarchy which was necessitated by the range of cost, performance, power, and density characteristics of available technology, and the need for speed on one hand and large capacity on the other. The most crucial memory, in terms of dictating the clock rate of the CPU, is the register file; until all input operands are fetched and a result is stored away in the file, no new instruction can be issued to the CYBER 205 functional units. This issue rate was targeted at one instruction per clock cycle, thus imposing high demands on the register file technology. The first key technology search for the CYBER 205 project was for a "chip" containing as many bits of data as possible, with the ability to read two operands and write one operand in a clock cycle (whose target was 13.3 ns). No such circuit existed in the inventory or laboratories of any semiconductor vendor in 1975. The chip nearest to possessing these qualities was a 4 bit wide by 16 bit deep chip with separate read and write address registers; this chip could read one operand and write one operand in about 8-10 ns.

Since this compromise chip was not capable of supporting two operand reads and one operand write in a clock cycle, it was decided to build a register file subsystem with the 4 × 16 bit chip, but duplicating the entire file to provide for the two simultaneous operand reads needed to meet CYBER 205 goals (see Fig. 10). A special development could have been initiated to create a totally new chip to meet the project goals, but the 18 month minimum lead time and the risk of not meeting performance goals dictated the use of standard products from the semiconductor catalog. It should be noted that in 1975 the 4 × 16 bit chip was an entry in a catalog, which means almost two years had already transpired for it to achieve production status; in addition, in 1981, it still possessed some undesirable design attributes.

The second most crucial memory technology was that needed for a high-speed, dense random access memory (RAM) device which could be used for the microcode control that was to be distributed throughout the CYBER 205. Since the clock cycle of the CYBER 205 was targeted at 13.3 ns, such a microcode memory had to have an access time on the order of 8-10 ns so that the resulting control signals could be transmitted to key functional elements in the CPU. A 1 bit by 256 bit chip was chosen for this function.

A number of other functional elements in the CYBER 205, like I/O buffers and vector unit buffers, employ the same 256 bit chip and associated auxiliary board package. In addition to the required memory chips, these auxiliary boards contain a number of other SSI circuits for memory support circuitry, such as address decode, fanout, and data registers. Aside from a specific use in controlling the 1 × 256 bit chip as an associative instruction stack and the support logic role, no other small scale integration logic was used in the mainframe of the CYBER 205.

The most important but least risky memory logic was that needed for the mainframe memory, consisting of 1 bit by 1024 bit, high-speed RAM chips with a maximum access time of 37 ns. These chips dissipate between 0.5 and 0.75 W; this necessitated a radical cooling system approach, since large assemblages of memory, on the order of one or two million 64 bit words, were to be housed in a compact chassis close to the high-speed scalar unit.
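The scale that forced this "radical cooling system approach" can be checked with back-of-envelope arithmetic (a sketch: it assumes exactly one data bit per chip, with no spare or error-checking bits).

```python
WORD_BITS = 64
CHIP_BITS = 1024                         # 1 bit x 1024 bit RAM chip
words = 1_048_576                        # one million 64 bit words
chips = words * WORD_BITS // CHIP_BITS   # chips needed for the data bits alone
watts_low = chips * 0.5                  # 0.5-0.75 W dissipated per chip
watts_high = chips * 0.75

print(chips)                   # 65536 chips
print(watts_low, watts_high)   # 32768.0 49152.0 -- tens of kilowatts
```

Even the one-million-word configuration demands on the order of 65 000 chips and tens of kilowatts of heat removal from a single compact chassis.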

LSI, THE KEY TECHNOLOGY

The electronic element which is the keystone of performance in the CYBER 205 is a newly developed, high-powered, large scale integration (LSI) circuit. This device provides 168 emitter coupled logic (ECL) gates which have an average switching time of 600-800 ps. It resulted from a research effort begun by Control Data's Advanced Design Laboratory in 1972 in conjunction with several key semiconductor vendors. The technology choices which were made in creating this circuit were as follows.

1) LSI was chosen over the available SSI because of the need to minimize parts counts (for reliability reasons) while increasing the number of gates in the CYBER 200 to almost double that of the STAR-100. A secondary consideration was the improved performance possible by reducing interconnect distances between gates on the same chip. Table II gives the relative transmission speeds of various interconnect media, and highlights the desirability of a high degree of integration. Finally, LSI was chosen as the technology basis for the CYBER 200 because such is the direction of the semiconductor industry. Although high-powered LSI was in its developmental infancy in 1974-1976, it seemed clear that design and manufacturing techniques would mature quickly and a substantial flow of reliable parts would be available from several vendors by the time the CYBER 200 was to reach its peak manufacturing volume.

2) ECL technology was chosen since it offered the highest performance available in 1974-1976, and exhibited extremely stable switching characteristics. These qualities are achieved at the cost of relatively large power dissipation per gate, which limits the total number of gates that can be integrated on a single silicon die and then be effectively cooled when packaged. A lower powered, and consequently much slower, technology such as metal oxide semiconductor (MOS) can achieve densities of tens of thousands of gates, as in microprocessor chips such as the Z80, 8080, and 6800. When the term LSI is applied to ECL family circuits and evaluated in light of these factors, 168 ECL gates qualify as LSI in the 1980's. Power requirements of even this level of integration make extraordinary demands on creative packaging and cooling of the parts, individually and collectively.

TABLE II
PROPAGATION DELAY VERSUS MEDIUM

Medium                     Delay (ps/micron)   Typical Length (microns)   Typical Delay (ps)
Interconnect (On Chip)     2.4 × 10⁻²          12.5 × 10²                 30
PC Board (Teflon Glass)    1.2 × 10⁻²          1500 × 10²                 1800
Coax (Foam)                0.4 × 10⁻²          9000 × 10²                 3600
Light                      0.33 × 10⁻²         —                          —
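The "typical delay" column of Table II is simply the per-micron propagation delay multiplied by the typical run length, as a few lines of arithmetic confirm:

```python
# medium: (delay in ps per micron, typical run length in microns)
media = {
    "interconnect (on chip)": (2.4e-2, 12.5e2),
    "PC board (teflon glass)": (1.2e-2, 1500e2),
    "coax (foam)": (0.4e-2, 9000e2),
}

delays_ps = {name: d * length for name, (d, length) in media.items()}
for name, ps in delays_ps.items():
    print(f"{name}: {ps:.0f} ps")   # 30, 1800, and 3600 ps respectively
```

An on-chip run is two orders of magnitude shorter than a board run, which is the quantitative case for integration made in point 1).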

Fig. 11. Gate array concept.

3) The gate array concept (Fig. 11) became the technological base for the circuitry. The alternative methodology would have been to create totally different custom arrays (see Fig. 12), where each diffusion layer would differ between parts. The advantages of the gate array technique are as follows.

a) Lower Start-Up Cost on a Per-Part Basis: The most important feature of the gate array approach is that all layers imposed upon the silicon, except for two layers of metal and a layer of "via," are common. This means that a large number of fundamental type chips can be mass produced and stored, or held, until a unique metalization is laid down at a later date.

Fig. 12. Fully custom LSI.

b) Shorter Turnaround Time in Producing New or Modified Parts: The ability to store the basic part until a later date, and then introduce changed metal masks, reduces the development and processing time for a newly developed type or a type modification.

c) Vendor Willingness to Participate in the Project: Another property of the gate array scheme is the utilization of standard wafer processing techniques at the semiconductor manufacturer's plant. The disruption of a commercial silicon facility's operation must be drastically minimized; otherwise, semiconductor business decisions will weigh in favor of the standard (guaranteed profit making) product lines at the expense of more challenging and economically risky efforts.

The disadvantages of this approach are as follows.

i) Performance and Density Are Not Optimal for the Same Silicon Area and Technology: The use of the gate-matrix structure means that the densest gate structure per square millimeter of silicon cannot be attained. The generality of the matrix structure ensures that not all of the silicon surface can be utilized productively.

ii) The Long Term Unit Cost per Circuit for Custom Designed Devices Is Lower: Once the development cost and time have been amortized, a custom device, because it utilizes less silicon for the same job, will cost less than a gate array device, if the volume of production is high enough. In the case of supercomputers, where the final volumes of parts are predictably small, this disadvantage is minimal.

iii) The Ability of Competitors to Duplicate Gate Array Parts Is Enhanced, Since the "Personality" of a Given Part Is Determined by So Little Metalization.

In any event, the decision to utilize an LSI family designed specifically for the CYBER 205 created some additional problems, and gave rise to some engineering challenges. The STAR-100 was built from SSI parts mounted 60 to a two-sided circuit board. Each of these boards could exist as a unique type, with its personality determined by the types, quantities, and placement of chips on the board, as well as the routing of the foil paths between chips and to the board's connector pins. The engineers on the STAR-100 were reasonably free to design as many types of boards (modules) as they needed, each one requiring only a different layout, documentation package, and separate exposure plates for the metal paths. The utilization of this scheme resulted in over 300 different types of boards being created in about three years' design time.

The amount of logic that could be organized onto a CYBER 200 gate array could, in many cases, be roughly equivalent to that included on a STAR module. When faced with the possibility of producing 300-400 different chip types (of low volume) for the CYBER 205, semiconductor vendors reacted with horror and outright resistance to such a project. The reasons for this are manifold but include the following.

1) A major bottleneck in a semiconductor vendor's operation is the mask shop, wherein the precision masks are generated for each type of component being produced by the vendor. The constant development of new parts and the improvement of existing parts for the "bread and butter" standard products of a vendor create a large demand for the resources of a mask shop. The introduction of "special request" masks from projects such as the CYBER 205 can cause great perturbations in product schedules and possibly jeopardize critical product developments. Any new part types demanded by the CYBER 205 project would have to be processed within the excess capacity of a typical mask shop. This might limit engineers to perhaps 8-20 types.

2) The process of testing arrays at the conclusion of the manufacturing cycle requires fixtures, tester firmware, and documentation for each unique entity. Management of this operation for a vast number of types becomes quite arduous and costly.

3) If a large number of part types is permitted, most will be highly specialized to a particular machine design and, thus, unavailable as general purpose devices that could be made part of a standard product line. This means that no part type will ever be manufactured in great volume, and even though manufacturing costs are passed on to the consumer, the financial return to the semiconductor vendor for low volume elements may not justify the resource commitment needed to manage the development and production facilities.

From the consuming project's point of view, a great variety of special parts is not necessarily desirable either. If a complete kit of parts to produce a CYBER 205 were to contain 300 types of chips, then a failure in any one of the 300 production lines could blockade the manufacturing and checkout of machines on the assembly line.

As a consequence of these considerations, the CYBER 205 engineering group had to limit the number of different types of LSI chips created to 14 for the scalar unit; no more than an additional 11 types could be used for the vector and I/O units of the machine. This meant that additional engineering time was consumed in the aggregation of certain designs into a single part type. The design activity then demanded considerable interaction within the design team to achieve common part types which could be used in different ways and in different areas of the machine. Only three chip types became so unique that their use was impossible in some universal capacity: the multiply chip, the associative register chip, and the result address register chip (which is used to control instruction issue in the scalar unit).

A significant result of the CYBER 205 design project was that despite the compromises required by the chip type constraints listed, the overall performance of the machine was not degraded from what might have been possible had a larger number of types been available. This conclusion was reached after the machine was completed and design alternatives were reviewed in detail.

PACKAGING AND COOLING, THE UNHERALDED TECHNOLOGY

The circuit family chosen for the CYBER 205 possessed many pleasant attributes for design engineers but exhibited one challenging characteristic: high power dissipation. The average power per chip was about 5 W. To maintain a desirable junction temperature of 55°C at this dissipation level demanded some creativity from the packaging and cooling engineers. The solution finally obtained was to bring a liquid coolant (Freon) into contact with each and every chip. Air cooling was discarded early as an alternative, since each printed circuit board was designed to hold 150 arrays and thus would be emitting 750 W in less than a square foot of chassis area. Other conductive techniques, using metal heat transfer paths, also yielded thermal resistances too high to manage. The reason for this is that any liquid coolant system cannot create exposed surfaces cooler than the nominal dew point of today's standard machine room. If this were not so, then a very low temperature medium could be used as a cooling agent, overcoming deficiencies in the thermal pathing.

Fig. 13 shows the basic circuit package of the CYBER 205, with the Freon tube running underneath the chip, which has a heat spreader mounted on its base. The basic LSI package shown here is the culmination of the tradeoffs between density and cooling capability, and between gate count and interconnectability. The connector interfaces 52 gold-plated contacts to each LSI array, with the bottom contact pressure being provided by the spring clip which covers the assemblage. The spring clip also provides force to ensure good mating between the heat spreader and the liquid coolant tube. While more contact pins are desired by designers, board technology and tester complexity act as constraints. While it is desirable to pack more chips on a board, printed circuit technology (hole and foil densities) and power and cooling considerations limit this area of the technology.
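The figures in this section imply a very tight thermal budget per package. The calculation below is a rough sketch: the Freon coolant temperature is an assumed illustrative value (the text gives none), while the junction target of about 55°C, the 5 W per chip, and the 150 arrays per board come from the text.

```python
T_JUNCTION = 55.0        # desired junction temperature, deg C (from the text)
T_COOLANT = 25.0         # assumed Freon temperature at the cold tube, deg C
P_CHIP = 5.0             # average dissipation per LSI array, W
ARRAYS_PER_BOARD = 150

# Largest junction-to-coolant thermal resistance that still meets the target:
r_max = (T_JUNCTION - T_COOLANT) / P_CHIP
print(f"{r_max:.1f} deg C per watt")     # a few deg C/W at most

# Board-level load that ruled out air cooling:
print(ARRAYS_PER_BOARD * P_CHIP)         # 750.0 W per board
```

A budget of only a few degrees Celsius per watt from junction to coolant is why metal conduction paths fell short and the Freon tube had to contact every chip directly.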

THE CYBER 205 MAINFRAME

The functional and performance objectives of a machine like the CYBER 205 are arrived at through a process of matching demands with technological and manufacturing reality. A first step in this process consists of estimating the probable minimum clock period possible for the architecture and technology combination. The CYBER 205 was designed to be implemented in three stages: 1) an enhancement to the memory and scalar units of the STAR-100, 2) a replacement of the STAR-100 I/O system, and 3) an evolution to a totally LSI-based mainframe. This approach dictated that the LSI scalar unit would have to operate at an integral subdivision of the STAR-100 clock cycle of 40 ns. The choices are limited. At one-half the clock cycle (20 ns), the overall performance of a CYBER 200 machine would be a marginal improvement for the investment made, if the STAR-100 structure were retained intact. At one-third the clock cycle of the STAR-100 (13.33 ns), the speed of the machine would be easily comparable to the nearest competitor (the CRAY-1) at 12.5 ns. At one-fourth the clock cycle (10 ns), the technology would strain to achieve the performance, let alone possess any reasonable margin for manufacturing variants. The clock cycle target was then set at 13.33... ns as a possible goal, to be proven by the viability of a final design.

A set of design ground rules containing average gate delays and the mean transmission times for circuits was then used in the design of what were considered to be the "hard spots" (long paths) of a STAR type architecture. The major hard spots included the following.

1) The instruction issue path begins with the time to translate an operation code, proceeds with the reading of two operands from the register file (if available), and ends with the loading of a new instruction into the primary instruction register.

2) The memory fetch cycle consists of translating the instruction, computing a memory address, transforming the address from virtual to real in the associative registers, sending the request to the central memory system, and returning the requested data to the register file.

3) The time required to set up memory addresses for vector instructions, resolve any conflicts between all memory requests, and move input and output vector data to and from central memory is critical to the start-up and throughput potential of the high-speed arithmetic units.

With the initial (and as yet unproven with actual parts in 1976) ground rules, the 13.33... ns clock cycle became a marginal possibility. The decision was made to design the major subsystems to operate at the aggressive 13.3 ns cycle. In particular, the register file and the central memory were designed, built, and tested at that speed. The engineering staff then endeavored to design all other components to operate at that clock rate. Given the possibility that the next slower clock cycle (20 ns) would become the CYBER 200 fundamental clock, it became necessary to seek structural changes which would increase the overall machine performance beyond the automatic factor of two yielded by the clock improvement. The goals then set for the mainframe were established, assuming a 20 ns clock cycle, and created to provide competitive performance characteristics at that speed. These goals by functional unit were based on relative performance targets set for the CYBER 205 computer. The design implications of these goals were much more complicated in the following areas.

1) The STAR-100 was capable of issuing instructions at a maximum rate of one every two (40 ns) clock cycles. The limiting factor of this rate was, originally, the ability to read only one operand per cycle out of the register file of the STAR-100, while two operands were required for most instructions. As stated previously, the Advanced Design Laboratory technologists obviated this problem, thus challenging the engineers to utilize the new register file. The consequence of this effort was the discovery that the final designs of the STAR-100 scalar arithmetic elements assumed the slower rate and thus were inadequate to the new challenge. The effect of this revelation was an increase of engineering investment over that planned on the basis of estimates which included translation of existing designs into a new technology. A shortstop, or loop back, path, which returns pipeline results to the inputs of the arithmetic units before they arrive at the register file, improved the issue potential of the scalar processor.

Fig. 13. CYBER 200 packaging and cooling technology.

2) Start-up time of vector operations includes the instruction translation time, and the delay imposed by fetching and aligning the input streams and by aligning and moving output streams to memory. Fig. 14 illustrates the impact of start-up time on the effective performance of the CYBER 205 architecture. A major improvement in the overall performance of this type of memory-to-memory vector architecture requires careful attention to start-up time. In addition to the raw improvement in clock speed, the designers were directed to other methods for reducing the delay in vector initiation. Given the STAR-100 as a basis, the actions of address setup and management of the vector arithmetic control lines required substantial speedup. In addition, providing separate and independent functional units in each of the scalar and vector processors permits execution of those functions in parallel. Hence, the "apparent" start-up time becomes smaller than what the hardware provides. This feature, in the best cases, results in parallel execution of both scalar and vector floating-point operations with a consequent increase in overall performance.

Fig. 14. Effect of start-up time on CYBER 205 vector performance. (a) Multiply/add performance. (b) Linked triad performance.

3) The STAR-100 guaranteed the attainment of peak vector rates by "slotting" input-output requests around the memory demands of the vector streams. From this demand arose an extremely complex design which attempted to cope with the 32-minor-cycle (1.28 µs) bank busy time and the random arrivals of bulk (524 288 byte) I/O transfers. CYBER 205 designers chose to give input-output free entry to the central memory, and found that it was easier (from an engineering point of view) to schedule the vector memory requests around the I/O requests rather than vice versa.

4) The CYBER 205 peak throughput rate for simple operations, such as vector ADD, SUBTRACT, and MULTIPLY, was specified to be 200 million floating point operations per second (200 Mflops). A clock cycle speedup from 40 to 20 ns would only bring the peak performance to 100 Mflops in 64 bit mode. This meant that the number of pipelines would have to be doubled in the maximum configuration. The increase in hardware components to accomplish this is offset for the most part by the utilization of the LSI gate array, making such a large assemblage practical to build.

5) Once the overall design was established, the CYBER 205 engineers realized that independent add, subtract, multiply, and divide units could be produced at little additional cost, whereas in the STAR-100 engineering economies forced a sharing of hardware (back-end and front-end shift and alignment networks) among arithmetic units. Not only did this permit a more straightforward partitioning of the design, but a simple interunit linkage could be established which would permit the execution of two separate arithmetic operations on a single vector during one pass through the pipelines. This design attribute would then, for some cases, double the arithmetic result rate of the high-speed pipelines from 200 Mflops to 400 Mflops for linked 64 bit ADD-MULTIPLY/MULTIPLY-ADD and some other operations such as SHIFT, LOGICAL, and VECTOR COMPARE. Again, this design approach was made possible only through the employment of large-scale circuit integration, which reduces parts counts to acceptable manufacturing levels.
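The interaction of start-up time and peak pipeline rate running through points 2), 4), and 5) can be captured in a one-line model: a vector of length n costs a fixed start-up delay plus n results at the peak rate. The 1 µs start-up value below is purely illustrative (the text quotes none); the 200 and 400 Mflops peaks are those given above.

```python
def effective_mflops(n, peak_mflops, startup_us=1.0):
    """Delivered rate for an n-element vector, given start-up and peak rate."""
    seconds = startup_us * 1e-6 + n / (peak_mflops * 1e6)
    return n / seconds / 1e6

for n in (100, 1_000, 65_000):
    simple = effective_mflops(n, 200.0)   # simple 64 bit vector operations
    triad = effective_mflops(n, 400.0)    # linked triads
    print(f"n = {n:6d}: {simple:6.1f} Mflops simple, {triad:6.1f} Mflops triad")
```

Short vectors are dominated by start-up (under these assumptions, 100 elements deliver only about a third of peak), while very long vectors approach the asymptotic rate; this is the behavior plotted in Fig. 14.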

CONCLUSIONS

The process of creating the CYBER 205 has involved the blending of technological and architectural evolution with state-of-the-art power, packaging and cooling, and manufacturing techniques to yield a practical and reproducible computer system. The requirements and desires evident in large-scale computational environments are tempered with the reality of engineering and manufacturing practices. If this were not so, the world would be seeing massive processors with trillions of words of memory and processing power on the order of trillions of floating point operations per second. Parts counts, interchassis wiring, and power and cooling considerations dominate the production of real, usable supercomputers, with technology and architecture becoming necessary, but secondary, considerations. A high performance memory system of eight million 64 bit words, for example, depends on the availability of circuits containing at least 16 000 bits each, with access times in the range of 35-45 ns. The central processor architecture must provide the address busing for such a large assemblage of memory. The entire ensemble must be manufacturable and maintainable, or the architectural and technological achievements are wasted.

It is the realization that these sobering pragmatics set limits on achievable computational power which gives rise to the cautionary "There ain't no magic" poster which hangs in the CYBER 205 development laboratory. If, in fact, there is no "magic bullet" which will yield new and extraordinary supercomputer power, what can be expected in the future?

In the next five years (1982-1986), a new generation of supercomputer will emerge, driven by customer demands and other competitive pressures. If the current supercomputers (CRAY-1 and CYBER 205) are termed "Class VI computers," then one might call their successors "Class VII" machines. Memory capacities will be two to four times greater than now possible, and processing speeds will be improved from 2.5 to 20 times current rates. These goals will be achieved by employing another generation of silicon technology, with emphasis on larger scales of integration rather than exotic technologies such as Josephson junction or gallium arsenide. Larger aggregates of ECL gates on a silicon die will mean higher power, and thus will demand innovative packaging and cooling systems with continued reliance on liquid coolant techniques. The consequent reductions in size and parts counts in an LSI system make it possible to increase redundancy and maintenance features, as well as the number of functional units. The result will be higher powered, more efficient, lower cost, and more reliable supercomputers in the coming generation.

The major element of this next generation will be a continued evolution of architectures involving parallelism. Vector processors, multiprocessors, and similar parallel structures will be required to keep pace with the demand for computational power. The exploitation of these architectures, through new algorithms, operating systems, and compiler technology, is the key to achieving the goals of the next generation consumer. The practicality of such software development is another challenge... and another story.



Neil R. Lincoln, photograph and biography not available at the time of pub-lication.