
SIMP (Single Instruction stream/Multiple instruction Pipelining): A Novel High-Speed Single-Processor Architecture

Kazuaki Murakami, Naohiko Irie, Morihiro Kuga, and Shinji Tomita
Department of Information Systems
Interdisciplinary Graduate School of Engineering Sciences
Kyushu University
Fukuoka, 816 JAPAN

Abstract

SIMP is a novel multiple instruction-pipeline parallel architecture. It is targeted at enhancing the performance of SISD processors drastically by exploiting both temporal and spatial parallelisms, and at keeping program compatibility as well. The degree of performance enhancement achieved by SIMP depends on: i) how to supply multiple instructions continuously, and ii) how to resolve data and control dependencies effectively. We have devised outstanding techniques for instruction fetch and dependency resolution. The instruction fetch mechanism employs unique schemes of: i) prefetching multiple instructions with the help of branch prediction, ii) squashing instructions selectively, and iii) providing multiple conditional modes as a result. The dependency resolution mechanism permits out-of-order execution of a sequential instruction stream. Our out-of-order execution model is based on Tomasulo's algorithm, which has been used in single instruction-pipeline processors. However, it is greatly extended and accommodated to multiple instruction pipelining by: i) detecting and identifying multiple dependencies simultaneously, ii) alleviating the effects of control dependencies with both eager execution and advance execution, and iii) ensuring a precise machine state against branches and interrupts. By taking advantage of these techniques, SIMP is one of the most promising architectures toward the coming generation of high-speed single processors.

1. Introduction

The demand for high-speed single processors forces more sophisticated instruction pipelines to be implemented in SISD (Single Instruction stream/Single Data stream) processors over a wide range from microprocessors to supercomputers. These conventional pipelined SISD processors exploit temporal parallelism in the process of instruction execution; i.e., the process is segmented into consecutive subprocesses (stages of a pipeline). The performance of these processors can be expressed as the program execution time E = N × C × T, where N is the number of instructions that must be executed, C is the average number of cycles per instruction, and T is the cycle time.


N and C depend on processor architectures; e.g., CISC architectures decrease N but increase C by improving the functionality of instructions, while RISC architectures reduce C to nearly 1 but increase N by simplifying the instruction set. Although processor architectures may influence T to some degree, semiconductor technology mostly determines T. Since T has been decreasing constantly with advances in VLSI technology, conventional pipelined SISD processors have enjoyed speedups regardless of their architectures, CISC or RISC. However, a physical lower limit on T obviously exists in any semiconductor technology, and therefore some architectural changes must be considered for SISD processors.

Some innovative approaches to such challenges are VLIW (Very Long Instruction Word) architectures [Fisher83], which are derivatives of SISD and exploit spatial parallelism (low-level parallelism). A VLIW architecture decreases N by specifying two or more independent operations in an instruction, without increasing C, by having each operation be a RISC-style instruction. We have already developed two VLIW processors: the QA-series (QA-1 and QA-2) [Hagiwara80; Tomita83, 86]. To exploit spatial parallelism, the QA-series provide multiple functional units such as quadruple ALUs, quadruple memory access units, and a sequencer. QA-1 and QA-2 employ very long instruction formats of 160 bits and 256 bits respectively, for controlling every functional unit independently. However, VLIW architectures have a serious drawback: it is difficult to keep program compatibility, because their hardware architectures are exposed to compilers.

Other, somewhat older, approaches are found in single instruction-pipeline processors with multiple functional units (called MFU processors), such as the CDC 6600 and the IBM 360/91. MFU processors have the potential of reducing C to nearly 1 by keeping multiple pipelined functional units busy, and keep N comparable to CISC processors. Unlike VLIW, program compatibility is preserved very easily. Nevertheless, a limit exists to performance enhancement because the instruction issue rate (= 1/C) cannot exceed one instruction per cycle.

All these approaches attempt to exploit low-level, fine-grained parallelism by utilizing multiple functional units. Key problems in the exploitation of low-level parallelism are: i) to detect data dependencies and control dependencies, ii) to resolve these hazards, and iii) to schedule the order of instruction execution. VLIW architectures rely solely on clever compilers which solve the problems by means of static code scheduling, such as trace scheduling [Fisher81] and software pipelining [Lam88]. MFU processors instead solve the problems at run time by implementing dynamic code scheduling in hardware. Static and dynamic code scheduling methods differ in their domain; i.e., static code scheduling is done with a broad overview of program codes, but dynamic code scheduling is done with a peephole. However, these code scheduling methods are not mutually exclusive.
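The E = N × C × T model above is easy to experiment with. The following Python sketch plugs in purely hypothetical N and C values for a CISC-style and a RISC-style machine at the same cycle time; the numbers are illustrative only and are not taken from the paper.

```python
def execution_time_ns(n: int, c: float, t_ns: float) -> float:
    """Program execution time E = N x C x T, in nanoseconds."""
    return n * c * t_ns

# Hypothetical workloads at the same cycle time: a CISC-style machine runs
# fewer but multi-cycle instructions; a RISC-style machine runs more
# instructions at close to one cycle each.
cisc = execution_time_ns(n=700_000, c=4.0, t_ns=60)
risc = execution_time_ns(n=1_000_000, c=1.1, t_ns=60)
print(f"CISC: {cisc / 1e6:.1f} ms, RISC: {risc / 1e6:.1f} ms")
```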


After evaluating the QA-series, we have studied the feasibility of enhancing the performance of SISD processors drastically by combining both temporal and spatial parallelisms, while preserving program compatibility as well. As a result of this study, we have introduced the multiple instruction-pipeline parallel architecture SIMP [Murakami88]. Given that P instruction pipelines are provided, SIMP processors ideally reduce C to 1/P by fetching P instructions per cycle, and keep N comparable to conventional single-pipelined SISD processors as well. As the first implementation of SIMP, we are now developing the SIMP processor prototype, whose Japanese name is pronounced [ʃimpu:] and implies "new streamline processor."

The paper is organized in 6 sections. The following two sections present the rationale of SIMP architecture (section 2) and give some background regarding instruction-pipelining techniques (section 3). In section 4, we introduce the SIMP processor prototype and discuss its pipeline flow and instruction-set architecture. In section 5, we describe the instruction fetch mechanism and the out-of-order execution model, both of which are devised for the SIMP prototype. Section 6 offers a few concluding remarks.

2. Principles of SIMP Pipelined Instruction Execution

There are various models of pipelined instruction execution. They apply temporal parallelism (i.e., pipelining) and/or spatial parallelism (i.e., low-level parallelism) to a single instruction stream, as schematically depicted in Figure 1. We clarify SIMP architecture by comparing its instruction pipelining model with those of its counterparts: linear-pipeline, MFU, and pipelined VLIW processors. In the discussion here, we assume that the process of instruction execution is decomposed into 5 stages: IF (instruction fetch), D (instruction decode), OF (operand fetch), E (execute), and W (result write).

2.1 Linear-Pipelined Processor

Most conventional pipelined SISD processors employ the single-pipeline structure shown in Figure 1-a. Instructions are sequentially processed one by one from preceding stages to succeeding stages. Instructions should enter and leave each stage in order with respect to the compiled code sequence. This pipeline structure is referred to as a linear pipeline.

Linear-pipelined processors exploit only temporal parallelism. Pipeline interlock logic is usually placed between critical stages to detect and resolve data and control dependencies; otherwise, the interlock burden is imposed on a compiler.

The maximum throughput of instruction execution is at most one instruction per cycle; i.e., C (the average number of cycles per instruction) cannot be less than 1. Thus, the ideal program execution time of linear-pipelined processors results in E = N × 1 × T.

2.2 MFU Processor

Some pipelined SISD processors, such as the CDC 6600, the IBM 360/91 floating-point unit, and the CRAY-1 scalar unit, provide multiple functional units (MFUs) in E-stage (Figure 1-b). Each functional unit may or may not be pipelined. Although not shown in Figure 1-b, W-stage can be multiplied by the number of MFUs.

MFU processors can exploit both temporal and spatial parallelisms. To increase MFU utilization, most of the pipeline interlock logic is localized to the stages prior to E-stage; it is referred to as instruction issue logic [Weiss84]. Simple instruction issue logic issues instructions to MFUs sequentially. Complex issue logic such as Tomasulo's algorithm [Tomasulo67] is capable of dynamic code scheduling by allowing instructions to begin and/or complete execution nonsequentially.

Even if any instruction issue logic is employed, however, at most one instruction can be issued per cycle. Thus, the maximum throughput of instruction execution is the same as in linear-pipelined SISD processors; i.e., C cannot be less than 1. The ideal program execution time of MFU processors again results in E = N × 1 × T.

More complex issue logic capable of issuing multiple instructions per cycle has been studied so as to reduce C below 1 [Tjaden70; Acosta86; Pleszkun88]. If F functional units are provided and the same number of instructions can be issued each cycle, the ideal program execution time becomes E = N × (1/F) × T.

[Figure 1. Pipelined Instruction Execution Models: (a) linear-pipelined processor; (b) MFU processor (case of 4 functional units); (c) pipelined VLIW processor (case of 4 operations/VLIW); (d) SIMP processor (case of 4 instruction pipelines). Each model passes instructions through the stages IF, D, OF, E, and W.]

2.3 Pipelined VLIW Processor

VLIW processors also provide multiple functional units. Unlike MFU processors, however, VLIW processors attempt to utilize functional units by specifying multiple independent operations in a single very long instruction. The original, non-pipelined VLIW processors such as the QA-series exploited only spatial parallelism. Since each operation can be pipelined, pipelined VLIW processors such as the Multiflow TRACE [Colwell87] and the Cydrome Cydra 5 [Rau89] can now exploit both temporal and spatial parallelisms (Figure 1-c).

A very long instruction is fetched each cycle, each operation field of the instruction is decoded in parallel, and each decoded operation is processed simultaneously and independently at the corresponding functional unit. A clever compiler is well aware of the architecture (e.g., the number of functional units, the number of pipeline stages, etc.), and is responsible for scheduling the order of operations strictly to prevent any hazard from causing incorrect results at run time. Therefore, no pipeline interlock logic or complex instruction issue logic need be implemented in hardware.


Given that F functional units are provided (e.g., F = 4 in Figure 1-c), pipelined VLIW processors can reduce the number of very long instructions to be executed to N/F, where N is the number of short instructions to be executed substantially, by compacting F operations into one very long instruction. The ideal program execution time of pipelined VLIW processors results in E = (N/F) × 1 × T.

2.4 SIMP Processor

SIMP processors employ the multiple instruction-pipeline structure shown in Figure 1-d, and exploit both temporal and spatial parallelisms. All pipelines should be identical. Each pipeline may be either the linear pipeline shown in Figure 1-a or the MFU-processor-like pipeline shown in Figure 1-b.

Regardless of the number of instruction pipelines provided, a single program counter (PC) controls the flow of instruction execution; i.e., a single instruction stream. Given that P instruction pipelines are provided (e.g., P = 4 in Figure 1-d), instructions can be processed in blocks of P. We refer to this block as an instruction block, which consists of P successive instructions in an object program. An instruction block starting with the instruction specified by c(PC) and ending with the instruction specified by c(PC) + P - 1 is fetched each cycle, where c(PC) indicates the contents of PC. Each instruction of an instruction block is decoded and processed simultaneously, but dependently, at the corresponding instruction pipeline.

SIMP processors would appreciate the advantages of static code scheduling, but they should exploit spatial parallelism by scheduling instructions at run time. This is because, unlike VLIWs, SIMP processors should not expose their hardware architectures (e.g., the number of instruction pipelines) to such clever compilers as are used for VLIW processors. In such a case, even if an ordinary compiler could increase the dependency distance, some dependencies may still remain in the compiled code until run time. As linear-pipelined and MFU processors do, SIMP processors must provide a way to resolve these remaining dependencies at run time.

From the standpoint of program compatibility, the above is a major difference between SIMP and VLIW. For VLIW, it is not impossible but difficult to make compiled codes portable among VLIW processors with various numbers of functional units, because the instruction size reflects the number of functional units directly. In SIMP, on the other hand, one instruction corresponds to one instruction pipeline, and therefore the number of instruction pipelines installed is transparent to its ISP (Instruction-Set Processor) architecture.

The maximum throughput of instruction execution is at most one instruction block of P instructions per cycle; i.e., C can be reduced to 1/P. Thus, the ideal program execution time of SIMP processors results in E = N × (1/P) × T.
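The four ideal execution times derived in this section can be compared side by side. The Python sketch below evaluates them for hypothetical values of N, T, F, and P (none of which come from the paper); note that with F = P, pipelined VLIW and SIMP reach the same ideal bound by different routes: fewer instructions versus more instructions per cycle.

```python
N = 1_000_000   # dynamic count of (short) instructions -- illustrative
T = 60e-9       # cycle time in seconds -- illustrative
F = 4           # operations per very long instruction (VLIW)
P = 4           # instruction pipelines (SIMP)

ideal_times = {
    "linear pipeline":    N * 1 * T,
    "MFU, single issue":  N * 1 * T,
    "pipelined VLIW":     (N / F) * 1 * T,
    "SIMP":               N * (1 / P) * T,
}
for model, e in ideal_times.items():
    print(f"{model:18s} E = {e * 1e3:5.1f} ms")
```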

3. Critical Issues Regarding SIMP

There are several critical issues to be resolved before implementing SIMP architecture, issues which are also common to most pipelined processors. We identify them below while giving some background regarding instruction-pipelining techniques.

3.1 Branch Problem

The detrimental effects of branch instructions are severe, because fetching the next instruction is postponed until a branch decision (i.e., taken or not-taken) is resolved and a branch target address is generated. As a result, many bubbles are forced into the pipeline, and the pipeline stalls. There are some techniques to reduce the branch penalty: delayed branch, branch prediction, multiple prefetching, and so on [Lee84]. A popular technique is branch prediction. By means of branch prediction, instructions fetched from a predicted branch path can be executed in a conditional mode. If the predicted path is incorrect, however, some prediction-miss handling must be done to nullify or squash the conditionally executed instructions.
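As a concrete illustration of prediction with a BTB, the following minimal sketch keeps a direct-mapped table of resolved branches. The table organization and the single taken/not-taken bit per entry are assumptions made for illustration; the paper does not detail its BTB internals here.

```python
class BTB:
    """Direct-mapped branch target buffer; the organization is assumed."""
    def __init__(self, size: int = 256):
        self.size = size
        self.entries = {}              # index -> (pc, target, predict_taken)

    def predict(self, pc: int):
        """Return (predict_taken, fetch_address) for the branch at pc."""
        entry = self.entries.get(pc % self.size)
        if entry is not None and entry[0] == pc:
            return entry[2], entry[1]
        return False, pc + 4           # miss: predict fall-through (not taken)

    def resolve(self, pc: int, taken: bool, target: int):
        """Record the resolved decision to guide future predictions."""
        self.entries[pc % self.size] = (pc, target, taken)
```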

SIMP processors can also adopt branch prediction with prediction-miss handling. However, branch prediction involves other critical issues unique to SIMP: how to determine predicted paths inside an instruction block, which substantially consists of multiple instructions to be fetched simultaneously, and how to recover if one of the paths is incorrect.

3.2 Data Dependency

There are 3 types of data dependencies: flow (RAW: Read-After-Write), anti (WAR: Write-After-Read), and output (WAW: Write-After-Write). Two popular hardware solutions exist to the data dependency problem: pipeline interlock logic in linear-pipelined processors, and instruction issue logic in MFU processors.

Pipeline interlock logic is so simple and straightforward that, if it detects data dependencies, it just interlocks the pipeline until the dependencies are resolved. The detrimental effects of data dependencies, however, cannot be alleviated.

Simple instruction issue logic can issue an instruction to an MFU only if its data dependencies are resolved; otherwise, the instruction and subsequent instructions are blocked from issuing. Complex issue logic such as Tomasulo's algorithm permits an instruction to be issued even when its dependencies are not resolved. Flow dependencies can be resolved by waiting for and monitoring instruction results, and anti/output dependencies can be eliminated by renaming registers with tags. In addition, Tomasulo's algorithm allows subsequent instructions to bypass a previously issued instruction while it waits until its dependencies have been resolved. Thus instructions can begin and/or complete their execution out of order with respect to the compiled code sequence. This scheme of instruction execution is referred to as the out-of-order execution model [Weiss84; Patt85], and can minimize the effects of flow dependencies.

To utilize multiple instruction pipelines as well as multiple functional units, SIMP processors can also employ the out-of-order execution model. In such a case, there are critical issues unique to SIMP: how to detect data dependencies among multiple instructions simultaneously, and how to represent them.
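The three dependency types can be stated compactly as set intersections over the registers each instruction reads and writes. The sketch below is a generic illustration of this classification, not the prototype's detection hardware.

```python
def dependencies(earlier_reads, earlier_writes, later_reads, later_writes):
    """Dependencies carried from an earlier to a later instruction."""
    deps = []
    if earlier_writes & later_reads:
        deps.append("flow (RAW)")
    if earlier_reads & later_writes:
        deps.append("anti (WAR)")
    if earlier_writes & later_writes:
        deps.append("output (WAW)")
    return deps

# r3 <- r1 op r2 followed by r4 <- r3 op r1: a flow dependency through r3.
print(dependencies({"r1", "r2"}, {"r3"}, {"r3", "r1"}, {"r4"}))
```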

3.3 Control Dependency

Even if the above-mentioned techniques such as conditional-mode execution and out-of-order execution are employed, control dependencies still impede the execution of instructions. This is because the domain of out-of-order execution may be limited by the size of a basic block (in which only one control dependency occurs, at the bottom). There are two schemes of the out-of-order execution model: lazy (or normal) execution and eager execution.

The lazy execution scheme disallows out-of-order execution to proceed beyond any branch instruction; i.e., the domain of out-of-order execution is limited to the basic block whose execution is certainly needed. Although the scheme is simple, the effect of out-of-order execution is diminished by small basic blocks.

On the other hand, the eager execution scheme causes instructions to be executed out of order regardless of the basic blocks they belong to. The scheme applies out-of-order execution to instructions fetched from a predicted path, and it therefore requires more sophisticated prediction-miss handling mechanisms, such as the reorder buffer of the RUU (Register Update Unit) [Sohi87] and the checkpoint repair mechanism [Hwu87], in order to avoid an imprecise machine state (described in section 3.4).

3.4 Imprecise Machine State

When instructions may finish out of order, an imprecise machine state can be introduced by branches and instruction-generated traps (e.g., an exception or a page fault). This is because, when a branch or trap occurs, out-of-order execution may have modified the machine state (such as the register file and memory) inconsistently with the sequential architectural specification. The problem caused by traps is well known as the imprecise interrupt problem, and there are many solutions to it [Smith85].


There are also common solutions to both the imprecise interrupt problem and the problem caused by branches [Hwu87; Sohi87].

4. SIMP Processor Prototype

The SIMP processor prototype, the first implementation of SIMP, is a quadruple instruction-pipeline processor composed of 4 identical instruction pipelines. It is an ideal SIMP processor, in the sense that it is equipped with rich hardware mechanisms to resolve the issues discussed in section 3; i.e., branch prediction, conditional-mode execution, out-of-order execution, and a reorder buffer. Note that its design never limits other implementations of SIMP architecture, especially in the number of instruction pipelines.

4.1 Processor Organization

A block diagram of the SIMP processor prototype is shown in Figure 2. Each instruction pipeline comprises 5 stages: IF (Instruction-block Fetch), D (Decode), I (register-read and Issue), E (Execute), and R (register-write and Retire). The prototype will be implemented in off-the-shelf TTL/CMOS chips, with a machine cycle time of 60 ns, or a pipeline cycle time of 120 ns (i.e., 2 machine cycles per pipeline cycle). Hereafter, cycles refer to pipeline cycles. The SIMP processor prototype consists of the following main components.

(a) MBIC (Multiple-Bank Instruction Cache): The MBIC is an instruction cache consisting of 4 independent banks of RAMs, where a bank width equals an instruction size (4 bytes). A cache line contains 16 contiguous instructions from a static instruction stream (i.e., object program) in memory. The MBIC also includes a BTB (Branch Target Buffer) for the purpose of branch prediction.

(b) IBSU (Instruction-Block Supply Unit): The IBSU is common to all instruction pipelines, and it plays the role of IF-stage. At each cycle, the IBSU fetches 4 successive instructions from the MBIC with the help of branch prediction and distributes them to all the IPUs.

(c) IPUs (Instruction Pipeline Units): Four identical IPUs are installed. For an instruction received from the IBSU, each IPU performs a pipeline sequence from D-stage to R-stage individually. E-stage is equipped with 5 pipelined functional units: IALU (Integer ALU), IMUL (Integer MULtiplier), FALU (Floating-point ALU), FMUL (Floating-point MULtiplier), and DCAR (Data Cache Access Requester). They are constructed with 32-bit building blocks such as the Advanced Micro Devices Am29323/32 and the Weitek WTL2264/65. To allow out-of-order execution and to ensure a precise machine state, E-stage provides a queue of 4 entries, called the WRB (Waiting and Reorder Buffer), which comprises a pair of a WB (Waiting Buffer) and an RB (Reorder Buffer). The WB corresponds to the reservation stations in the IBM 360/91 [Tomasulo67], and the RB corresponds to the reorder buffer proposed in [Smith85; Sohi87].

(d) DHRF (Dependency-Handling Register File): The DHRF is a register file with 8 read-ports and 4 write-ports. The DHRF is shared by all the IPUs, and is accessed three times (i.e., two source-register accesses at I-stage and one destination-register access at R-stage) per cycle by every IPU. The DHRF maintains a WRT (Write Reservation Table), each entry of which is associated with a register, to detect flow dependencies. It also maintains a CDT (Control Dependency Table) to detect control dependencies. The DHRF provides 4 BBs (Bypass Buffers), each of which contains a copy of the RB in each IPU.

(e) MPDC (Multiple-Port Data Cache): The MPDC is a data cache with 4 load-ports and 4 store-ports. The MPDC is shared by all the IPUs, and is accessed twice (i.e., one LOAD access at E-stage and one STORE access at R-stage) per cycle by every IPU, if executing a LOAD/STORE instruction.

(f) IPCN (IPU Chaining Network): The IPCN is an interconnection network among all the WRBs.

[Figure 2. Block Diagram of the SIMP Processor Prototype: the MBIC holds cache lines of 16 instructions fetched from the static instruction stream; the IBSU supplies the instruction-block stream to the 4 IPUs, which share the DHRF and the MPDC and are chained by the IPCN.]

It consists of 4 broadcast buses, where every IPU is the bus master of the corresponding bus. When one IPU places an execution result on its own bus, all the IPUs (including the sender itself) can catch the result from the bus. Thus the IPCN works like the common data bus used in the IBM 360/91 [Tomasulo67].
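A rough software analogue of this broadcast behavior is sketched below; the callback interface is an illustrative simplification of the four-bus IPCN, where a value placed on one bus is observed by every IPU, including the sender.

```python
class IPCN:
    """Software analogue of the 4-bus IPCN; callbacks stand in for snooping."""
    def __init__(self):
        self.listeners = []            # one callback per IPU

    def attach(self, on_result):
        # Each IPU registers to observe all four buses.
        self.listeners.append(on_result)

    def broadcast(self, sender_ipu: int, instr_id: int, value: int):
        # Placing a result on the sender's own bus delivers it to every
        # IPU, including the sender, which clears waiting dependencies.
        for on_result in self.listeners:
            on_result(sender_ipu, instr_id, value)
```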

4.2 Pipeline Flow

During IF-stage, the IBSU fetches an instruction block of 4 successive instructions from the static instruction stream cached in the MBIC, and then supplies it to all the IPUs, one instruction per IPU. The instruction stream being supplied by the IBSU is called the supplying instruction-block stream. Hereafter, the instruction block is a unit of management until it retires from the supplying instruction-block stream. At most 7 instruction blocks can reside in the stream: 1 in D-stage, 1 in I-stage, 4 in E-stage, and 1 in R-stage.

Once each IPU takes the corresponding instruction in an instruction block, it can accomplish pipelined processing for the instruction individually. All 4 instructions in the same instruction block, however, should flow through D, I, and R-stages on a lockstep basis, and in order with respect to the instruction-block sequence. The only slack exists in E-stage, where instructions can wait for requisite operands and begin execution


when the operands become available. Instructions can begin and complete execution out of order with respect to both the instruction sequence in the instruction block and the instruction-block sequence. This out-of-order execution model is also referred to as local dataflow execution, because instructions are executed in a dataflow fashion and the scope is localized in E-stage.

A general pipeline flow following IF-stage is summarized below. A brief description of out-of-order execution, which involves I, E, and R-stages, follows in section 5.

(a) D (Decode) stage: In each IPU, the decoder accepts an instruction from the IBSU and decodes the operation field of the instruction. It then forwards identifiers of requisite registers (up to 3: 2 sources and 1 destination) and an operation type (i.e., BRANCH or not) of the decoded instruction to the DHRF.

(b) I (register-read and Issue) stage: The DHRF receives the register identifiers and the operation type from every IPU. The DHRF updates the entries in the WRT, each of which is indexed by a destination-register identifier. It also updates the CDT, according to the operation types. The DHRF then transmits the current contents of the source registers, which are fetched from the register file or bypassed from the BBs, to the IPU requesting the source operands. It appends some control information representing flow and control dependencies. Each IPU accepts them and issues the instruction by forwarding them to the tail entry of the WRB, if the WRB is not full.

(c) E (Execute) stage: At each machine cycle, every instruction in the WRB is checked to see if it can be executed. If an instruction has no probable flow dependency (defined in section 5.3), then the IPU can fire the instruction by dispatching it to one of the functional units. Otherwise, the instruction must wait in the WRB until its probable flow dependencies have been resolved. An instruction can resolve its dependencies by monitoring the IPCN. When the firing instruction completes its execution, the result is stored in the WRB. At the same time, if no control dependency exists for the instruction which has completed its execution, the IPU can commit the instruction by broadcasting the execution result to all the IPUs via the IPCN. Otherwise, the instruction must wait in the WRB until its control dependencies have been resolved.

(d) R (register-write and Retire) stage: Each IPU forwards a committed result, if any, from the WRB to the associated BB in the DHRF. The DHRF updates the WRT and CDT according to the committed results. For the instruction block at the head of the WRBs, if all 4 instructions are committed, then the instruction block can be retired from the IPUs. The retirement has the DHRF update the destination registers with the contents of the head entries of the BBs.
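The E- and R-stage protocol just described (wait, fire, commit, retire) can be condensed into the following sketch. The field and method names are illustrative, not the prototype's.

```python
from dataclasses import dataclass, field

@dataclass
class WRBEntry:
    """One WRB slot; the layout is an illustrative assumption."""
    flow_deps: set = field(default_factory=set)   # unresolved probable flow deps
    ctrl_deps: set = field(default_factory=set)   # unresolved control deps
    fired: bool = False
    result: int | None = None
    committed: bool = False

    def try_fire(self) -> bool:
        # E-stage: dispatch to a functional unit once all probable
        # flow dependencies are resolved.
        if not self.fired and not self.flow_deps:
            self.fired = True
        return self.fired

    def try_commit(self) -> bool:
        # Commit: broadcast the result via the IPCN once execution is
        # complete and no control dependency remains.
        if self.result is not None and not self.ctrl_deps:
            self.committed = True
        return self.committed

def block_can_retire(block: list[WRBEntry]) -> bool:
    # R-stage: a block retires only when all 4 instructions have committed.
    return all(entry.committed for entry in block)
```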

4.3 Balanced Instruction Set Computer

The SIMP architecture itself is not an ISP architecture but a vehicle useful for implementing several different target ISP architectures. We believe, however, that there is a class of ISPs most suitable for SIMP in the wide spectrum of ISPs from RISC to CISC. We refer to it as BISC (Balanced Instruction Set Computer), and apply it to the SIMP processor prototype. In order to utilize the strengths of SIMP, we place the following constraints on the BISC architecture:

(1) Single-sized instructions: Single-sized instructions allow fetching and decoding multiple instructions simultaneously.

(2) LOAD/STORE architecture: Register-register operations relieve some of the burden of detecting and resolving data dependencies.

(3) Fixed-cycle operations: Unlike RISCs, which emphasize single-cycle operations, operations need not be completed in one cycle, but in some fixed number of cycles. If E-stage is pipelined by

means of multiple arithmetic pipelines, it is no longer necessary to limit all instructions to single-cycle operations. Also, from the viewpoint of code scheduling, it is sufficient to fix the number of operation cycles for every instruction. Thus BISC can include some complex instructions requiring fixed multiple cycles, such as integer MULTIPLY/DIVIDE and floating-point instructions, which are usually excluded from RISCs. However, other complex instructions, whose operations spend a variable number of cycles, are still excluded from BISC.

(4) Balanced execution time: Though fixed-cycle operations loosen the RISC constraint of single-cycle operations, imbalance in the execution times of instructions introduces problems of large pipeline latency and low pipeline throughput. The pipeline stall due to such imbalance is severe in SIMP, because all instructions in an instruction block had better flow through the pipelines on a lockstep basis so as to avoid the imprecise machine state problem. This means that the instruction with the longest execution time in an instruction block determines the pipeline elapsed time for the whole block. Therefore the execution time, especially the number of operation cycles in E-stage, of every instruction should be balanced within limits.

The ISP architecture for the SIMP processor prototype is representative of BISC; i.e., 32-bit single-sized instructions, LOAD/STORE architecture, register-register operations specified by 3-operand formats, and balanced fixed-cycle operations ranging

from 2 to 6 machine cycles in E-stage. From the standpoint of balanced execution time, DIVIDE instructions (which would require over 13 machine cycles) are not provided, but MULTIPLY instructions, for both integer and floating-point (which require 2 and 6 machine cycles respectively), are provided.
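The lockstep constraint behind the balanced-execution-time requirement is simple to quantify: the slowest instruction in a block sets the block's E-stage occupancy, which is why a 13-plus-cycle DIVIDE would be so damaging. A small sketch:

```python
def block_e_stage_cycles(instr_cycles: list[int]) -> int:
    """E-stage occupancy of a lockstep instruction block (machine cycles)."""
    return max(instr_cycles)

# A block of 2-, 2-, 2-, and 6-cycle operations is held for 6 cycles; one
# hypothetical 13-cycle DIVIDE would more than double the block's time.
print(block_e_stage_cycles([2, 2, 2, 6]))    # -> 6
print(block_e_stage_cycles([2, 2, 2, 13]))   # -> 13
```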

5. Resolution Algorithms

To resolve the critical issues discussed in section 3, we have already developed algorithms for branch problem resolution, data dependency resolution [Kuga89], control dependency resolution, and so on. Details of all the algorithms cannot be presented here due to insufficient space, and they will be reported elsewhere. We describe the outline of these algorithms below.

5.1 Branch Problem Resolution

During IF (Instruction-block Fetch) stage, the IBSU (Instruction-Block Supply Unit) supplies an instruction block of 4 successive instructions. Because of the presence of branch instructions, the supplying instruction-block stream is somewhat different from the dynamic instruction stream, which is the sequential execution order of instructions in a static instruction stream. The supplying instruction-block stream has the following characteristics:

(1) To fetch and distribute 4 instructions simultaneously, the IBSU simply makes an instruction block of 4 successive instructions in the static instruction stream. Thus the instruction block may include some instructions not to be executed (i.e., to be excluded from the dynamic instruction stream).

(2) The instruction block to be prefetched is determined with the help of branch prediction using a BTB (Branch Target Buffer). Thus the supplying instruction-block stream may contain some instruction blocks fetched from incorrectly predicted branch paths.

Conventional prediction-miss handling techniques nullify all the instructions executed in a conditional mode (i.e., flush the pipeline) if (and only if) a branch prediction is wrong. Such a pipeline flushing scheme is, however, inadequate for the supplying instruction-block stream for the following reasons:

(1) Even if a prediction is accurate, some instructions must be nullified due to characteristic (1) above.

(2) Although a prediction is wrong, it is not efficient to flush the pipeline regardless of whether or not the instruction to be refetched already exists in the pipeline. The inefficiency is


more severe in the SIMP processor prototype, because as many as 23 instructions can be executed in a conditional mode.

Hence, instead of the pipeline flushing scheme, we have introduced the selective instruction-squashing scheme, which squashes only the instructions to be excluded from the dynamic instruction stream whenever a branch decision is resolved [Murakami88]. The selective instruction-squashing scheme thus allows the instructions fetched from multiple branch paths to be executed in multiple conditional modes.

The selective instruction-squashing scheme requires the IBSU to select the instructions to be squashed. For that purpose, the IBSU maintains a queue, called the IBAT (Instruction-Block Address Table), each of whose entries is associated with an instruction block under pipelined execution. There are the following 4 patterns of squashing instructions selectively, as shown in Figure 3 and sketched below:

(1) Predict-Not-taken/Not-taken (Figure 3-a): When a prediction is Not-taken and found to be accurate, no squashing occurs.

(2) Predict-Taken/Taken (Figure 3-b) and Predict-Not-taken/Taken (Figure 3-c): When a branch is taken, the IBAT is associatively searched to see if the branch target instruction has already been supplied. If the target is present, all the instructions between the branch and the target must be squashed. In such a case, it is not necessary to refetch the target; otherwise, the pipeline must be flushed and the target refetched.

(3) Predict-Taken/Not-taken (Figure 3-d): When a prediction is Taken and found to be wrong, the IBAT is associatively searched to see if the sequential target instruction, which is incidentally fetched from the sequential stream, has already been supplied. The succeeding process is the same as that stated above.

[Figure 3. Selective Instruction-Squashing Scheme: panels (a) Predict-Not-Taken/Not-Taken, (b) Predict-Taken/Taken, (c) Predict-Not-Taken/Taken, and (d) Predict-Taken/Not-Taken show which instructions of blocks [n], [n+1], and [n+2] are squashed. Legend: B = branch instruction, BT = branch target instruction, ST = sequential target instruction, P = predicted point.]
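The following Python sketch illustrates the core of the squash decision for a taken branch, corresponding to patterns (2) and (3): search the IBAT for the (branch or sequential) target, and squash only the instructions between the branch and the target if the target is already supplied. The address-range data layout is an assumption for illustration.

```python
def select_squash(ibat, branch_addr: int, target_addr: int):
    """Return the addresses to squash, or None if a flush/refetch is needed.

    ibat: list of address ranges, one per in-flight instruction block."""
    if not any(target_addr in block for block in ibat):
        return None                    # target not yet supplied: flush, refetch
    return [addr for block in ibat for addr in block
            if branch_addr < addr < target_addr]

# Two supplied blocks of four 4-byte instructions; a branch at 104 resolves
# taken to 116: only 108 and 112 are squashed, and 116 need not be refetched.
ibat = [range(96, 112, 4), range(112, 128, 4)]
print(select_squash(ibat, branch_addr=104, target_addr=116))   # [108, 112]
```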

5.2 Data Dependency Resolution

Our data-dependency resolution algorithm is an extension of Tomasulo's algorithm [Tomasulo67]. We describe the outline of our algorithm by comparing it with the original Tomasulo's algorithm and with other extensions.

(a) Flow Dependency Detection and Representation:

(1) Tomasulo's algorithm: Instructions are sequentially issued one by one. Each register is assigned a busy bit which indicates whether it is the destination register of an instruction in execution. Each register is also assigned a tag which identifies the instruction that must write the result into the register. When an instruction is issued, its source registers are checked to see if they are busy. An instruction whose source registers are busy obtains tags for the busy source registers.

(2) Our algorithm: We adopt a bitmap scheme, called the multiple-dependency representation scheme, rather than Tomasulo's tag scheme, because tagging substantially needs to be carried out sequentially. Four instructions are issued simultaneously. Each register is assigned an entry of the WRT (Write Reservation Table); the entry is a bitmap and identifies all the instructions that must write results into the register. When an instruction is issued, it always obtains the bitmaps for its source registers. Thus every instruction is capable of recognizing its multiple flow dependencies caused by up to 3 preceding instructions in the same instruction block and by up to 3 preceding instruction blocks (i.e., up to 15 dependencies per source register). This is another reason for the bitmap scheme, detailed in section 5.3.

(b) Flow Dependency Resolution:

(1) Tomasulo's algorithm: An instruction whose source registers are busy is forwarded to a reservation station (RS). The instruction must wait in the RS until its flow dependencies have been resolved. It can resolve its dependencies by monitoring the common data bus (CDB) and by performing the


tag-matching process. If a matching tag is found, the data on the CDB is the requisite source operand.

(2) Our algorithm: Every instruction is forwarded to the tail entry of the WRB (Waiting and Reorder Buffer) in the corresponding IPU (Instruction Pipeline Unit). The WB (Waiting Buffer) part of the WRB corresponds to the reservation stations. If an instruction has any probable flow dependency (defined in section 5.3), then it must wait in the WRB until its probable flow dependencies have been resolved. Flow dependencies can be resolved by monitoring the IPCN (IPU Chaining Network) and by clearing bits in the bitmaps representing dependencies. No tag-matching logic is needed. If a bit associated with a probable flow dependency is cleared, the data on the IPCN is the requisite source operand.

(c) Anti/Output Dependency Resolution:

(1) Tomasulo's algorithm: If the destination register of an issuing instruction is busy, the instruction renames its destination register by updating the tag of the destination register with a new tag. Every register, as well as the RS, monitors the CDB, and updates its contents and clears its busy bit if a matching tag is found. This direct register renaming scheme introduces imprecise interrupts, because the instruction that completes its execution is allowed to write the result into the destination


register directly if a tag match occurs. In other words, the result disappears if no tag match occurs.

(2) Patt's and Sohi's algorithms: The registers are not renamed directly, but indirectly through a buffer such as the result buffer [Patt85] or the reorder buffer of the RUU (Register Update Unit) [Sohi87]. That is, an instruction result is not written into the destination register straightaway, but is stored in the buffer temporarily. The indirect register renaming scheme allows recovery from wrong branch predictions and traps, and therefore implements precise interrupts.

(3) Our algorithm: Basically the same as Patt's and Sohi's algorithms. The scheme of interest to us is the RUU, which acts as a queue and merges the RSs with the reorder buffer [Sohi87]. The WRB (Waiting and Reorder Buffer) in our algorithm corresponds to the RUU. Unlike the RUU, a separate WRB is located in every IPU, and the set of all the WRBs acts as a queue.
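A minimal sketch of the multiple-dependency (bitmap) representation follows. The 16-bit sizing follows the text's "up to 15 dependencies per source register" plus the instruction's own slot; the class interface itself is illustrative, not the prototype's hardware.

```python
IN_FLIGHT = 16   # 4 instruction blocks x 4 instructions

class WRT:
    """Write Reservation Table sketch: one pending-writer bitmap per register."""
    def __init__(self, num_regs: int = 32):
        self.pending = [0] * num_regs

    def issue_write(self, reg: int, instr_slot: int):
        # At issue, a destination register gains the issuing instruction's bit.
        self.pending[reg] |= 1 << instr_slot

    def source_deps(self, reg: int) -> int:
        # An issuing instruction copies the whole bitmap: all in-flight
        # writers it may depend on (up to 15), with no sequential tagging.
        return self.pending[reg]

    def resolve(self, instr_slot: int):
        # When a writer commits (or is squashed), clear its bit everywhere;
        # waiting instructions fire once their copied bitmaps reach zero.
        mask = ~(1 << instr_slot)
        self.pending = [bits & mask for bits in self.pending]
```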

5.3 Control Dependency Resolution

As stated briefly in section 3.3, there have been two hardware solutions to the basic block problem caused by control dependencies: lazy execution and eager execution. In addition to the eager execution scheme, we attempt a more eager execution scheme, called the advance execution scheme.

(a) Lazy Execution: When a branch instruction is present, subsequent instructions are blocked from firing (or dispatching) their executions until the control dependency due to the branch has been resolved [Tomasulo67].

(b) Eager Execution: In the above case, subsequent instructions are allowed to fire their executions as soon as their flow dependencies are resolved, even before the control dependency has been resolved. The scheme does not commit the instructions (i.e., update the machine state) until the control dependencies have been resolved [Sohi87].

(c) Advance Execution: To take advantage of the multiple conditional modes derived from the selective instruction-squashing scheme, the advance execution scheme allows an instruction to fire its execution (called pre-execution) even before its flow dependencies have been definitely resolved because of the presence of control dependencies. It also allows the instruction to re-execute (or backtrack) when the pre-execution result is found invalid.

We first introduce two types of flow dependencies: PFD (Probable Flow Dependency) and UFD (Uncertain Flow Dependency). As shown in Figure 4, a PFD occurs when instruction A is going to write to register Rn, from which instruction D is going to read, and no control dependency covers A. On the other hand, a UFD occurs when instruction C is going to write to Rn and some control dependency covers C. From the viewpoint of instruction D, at most 1 PFD and more than 1 UFD may be present per source operand at a time. To identify all the PFDs and UFDs, we employ the multiple-dependency representation scheme mentioned in section 5.2.

For this example, Tomasulo's tag scheme represents that D is flow-dependent on C, with a single tag identifying C, which will supply D with the latest version of Rn. However, whether C is executed or not depends on the conditional branch instruction B. If the branch is taken and the branch target falls between C and D, then C is never executed, and therefore the tag identifying C has no meaning. Instruction D should access the correct version of Rn, which was generated by A and may already be written into the register file. In such a case, conventional pipeline flushing schemes recover from the incorrect sequence by squashing all the instructions following B, and by refetching instructions starting from the branch target. Then instruction D is refetched, and now can obtain the correct version of Rn from the register file.

Since the selective instruction-squashing scheme need not refetch instruction D, however, D cannot obtain the correct source operand with the tag. Consequently, the tag scheme is inadequate for the selective instruction-squashing scheme; the multiple-dependency representation scheme is a direct consequence of the selective instruction-squashing scheme.

Now an instruction can fire its pre-execution as soon as its PFDs are resolved, regardless of whether or not its control dependencies have been resolved. Resolution of its control dependencies may uncover a new PFD (which was one of the UFDs), even after the instruction has completed its pre-execution. In such a case, the instruction must backtrack and wait in a WRB until its PFDs have been resolved again. There is no limit to the number of backtracks. The WRB commits the instruction when its PFDs and control dependencies have been resolved. The committed result is then forwarded, via the IPCN, to all the instructions waiting for it.

[Figure 4. Probable and Uncertain Flow Dependencies: in issuing order, instruction A (Rn ← Ri op S2) precedes conditional branch B, which precedes instruction C (Rn ← Rk op S2), which precedes instruction D reading Rn. A gives D a PFD (Probable Flow Dependency); C, covered by B, gives D a UFD (Uncertain Flow Dependency).]
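The PFD/UFD distinction of Figure 4 can be expressed as a small classification over the in-flight writers of one register. The sketch below is an illustration under simplified assumptions (writers listed in program order; a writer is "control dependent" if any unresolved branch covers it), not the prototype's CDT logic.

```python
def classify_writers(writers):
    """writers: (slot, control_dependent) pairs, oldest first, all of which
    precede the reading instruction and write its source register."""
    pfd, ufds = None, []
    for slot, control_dependent in writers:
        if control_dependent:
            ufds.append(slot)          # may or may not execute: uncertain
        else:
            pfd, ufds = slot, []       # certain write supersedes older writers
    return pfd, ufds

# Figure 4: A (slot 0) writes Rn unconditionally; C (slot 2) writes Rn under
# branch B. D may pre-execute with A's value once the PFD resolves; if B's
# resolution makes C's write real, D must backtrack and re-execute.
print(classify_writers([(0, False), (2, True)]))   # -> (0, [2])
```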

6. Conclusions

In this paper, we have introduced a novel multiple instruction-pipeline parallel architecture: SIMP. Our goal is to enhance the performance of SISD processors drastically by exploiting both temporal and spatial parallelisms, and to keep program compatibility as well. Key techniques for achieving high performance are the mechanisms of fetching multiple instructions simultaneously and continuously, and of resolving instruction dependencies effectively at run time. We have devised these mechanisms for the SIMP processor prototype, which is a quadruple instruction-pipeline processor under development.

The dependency resolution mechanism permits out-of-order execution of a sequential instruction stream. Our out-of-order execution model is based on Tomasulo's algorithm, but is greatly extended and accommodated to multiple instruction pipelining. The instruction fetch mechanism cooperates with the out-of-order execution model by supplying it with multiple instructions continuously with the help of branch prediction. In this paper, we have introduced several unique schemes such as selective instruction squashing, multiple conditional modes, multiple-dependency representation, advance execution, and so on.


Although SIMP does not limit the target ISP architectures, we have suggested a class of ISP architectures most suitable for SIMP: BISC. BISC is a LOAD/STORE architecture with single-sized instructions, but it loosens RISC's constraint of single-cycle operations. We apply BISC to the SIMP processor prototype.

Currently we have several projects in progress regarding SIMP: the development of the SIMP processor prototype, the design of an optimizing C compiler capable of static code scheduling for the prototype, and the evaluation of SIMP architecture via software simulations.

Since we have not yet completed a software simulator of the SIMP processor prototype, we cannot report performance estimates for the prototype here. The measurements obtained by a simple trace-driven simulator (see Table 1), however, show that a quadruple instruction-pipeline processor can achieve a performance of about 200-270% of a single instruction-pipeline processor [Irie88].

[Table 1. Relative Speedups over Single Instruction Pipeline]

We would like to emphasize that these results are very preliminary and the benchmarks are not optimized by static code scheduling. The results therefore encourage us to expect more performance enhancement in the SIMP processor prototype.

As future work, we are planning a VLSI implementation of SIMP architecture. It will be done by implementing one or more instruction pipelines on a single-chip VLSI, and by constructing a SIMP processor using the chips as building blocks. We refer to the single-chip VLSI as a pipeline-slice microprocessor, on the analogy of the bit-slice microprocessor. A processor organization using pipeline-slice microprocessors will support scalability toward more instruction pipelines. As bit-slice microprocessors can achieve a wide range of data widths, pipeline-slice microprocessors will be able to achieve a wide range of performance.

We believe that SIMP architecture is one of the most promising architectures toward the coming generation of high-speed single processors.

Acknowledgements

We would like to thank the following students who have contributed or are now contributing to the project: T. Goto (currently with Toshiba Corp.), O.-B. Gwun, T. Hara, and T. Hironaka. We would also like to acknowledge the considerable contributions of A. Fukuda and T. Sueyoshi.

References

[Acosta86] R.D. Acosta, J. Kjelstrup, and H.C. Torng, "An Instruction Issuing Approach to Enhancing Performance in Multiple Functional Unit Processors," IEEE Trans. Comput., vol. C-35, no. 9, pp. 815-828, Sept. 1986.

[Colwell87] R.P. Colwell, R.P. Nix, J.J. O'Donnell, D.B. Papworth, and P.K. Rodman, "A VLIW Architecture for a Trace Scheduling Compiler," Proc. 2nd Int. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS II), pp. 180-192, Oct. 1987.

[Fisher81] J.A. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction," IEEE Trans. Comput., vol. C-30, no. 7, pp. 478-490, July 1981.

[Fisher83] J.A. Fisher, "Very Long Instruction Word Architectures and the ELI-512," Proc. 10th Ann. Int. Symp. Computer Architecture, pp. 140-150, June 1983.

[Hagiwara80] H. Hagiwara, S. Tomita, S. Oyanagi, and K. Shibayama, "A Dynamically Microprogrammable Computer with Low-Level Parallelism," IEEE Trans. Comput., vol. C-29, no. 7, pp. 577-595, July 1980.

[Hwu87] W.W. Hwu and Y.N. Patt, "Checkpoint Repair for Out-of-order Execution Machines," Proc. 14th Ann. Int. Symp. Computer Architecture, pp. 18-26, June 1987; also IEEE Trans. Comput., vol. C-36, no. 12, pp. 1496-1514, Dec. 1987.

[Irie88] N. Irie, M. Kuga, K. Murakami, and S. Tomita, "Speedup Mechanisms and Performance Estimate for the SIMP Processor Prototype" (in Japanese), IPSJ WGARC report 73-11, Nov. 1988.

[Kuga89] M. Kuga, K. Murakami, and S. Tomita, "Low-level Parallel Processing Algorithms for the SIMP Processor Prototype" (in Japanese), Proc. IPSJ Joint Symp. Parallel Processing '89, pp. 163-170, Feb. 1989.

[Lam88] M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines," Proc. SIGPLAN '88 Conf. Programming Language Design and Implementation, pp. 318-328, June 1988.

[Lee84] J.K.F. Lee and A.J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, vol. 17, no. 1, pp. 6-22, Jan. 1984.

[Murakami88] K. Murakami, A. Fukuda, T. Sueyoshi, and S. Tomita, "SIMP: Single Instruction stream/Multiple instruction Pipelining" (in Japanese), IPSJ WGARC report 69-4, Jan. 1988.

[Patt85] Y.N. Patt, W-M. Hwu, and M. Shebanow, "HPS, A New Microarchitecture: Rationale and Introduction," Proc. 18th Ann. Workshop on Microprogramming, pp. 103-108, Dec. 1985.

[Pleszkun88] A.R. Pleszkun and G.S. Sohi, "The Performance Potential of Multiple Functional Unit Processors," Proc. 15th Ann. Int. Symp. Computer Architecture, pp. 37-44, May 1988.

[Rau89] B.R. Rau, D.W.L. Yen, W. Yen, and R.A. Towle, "The Cydra 5 Departmental Supercomputer," IEEE Computer, vol. 22, no. 1, Jan. 1989.

[Smith85] J.E. Smith and A.R. Pleszkun, "Implementation of Precise Interrupts in Pipelined Processors," Proc. 12th Ann. Int. Symp. Computer Architecture, pp. 36-44, June 1985; also IEEE Trans. Comput., vol. C-37, no. 5, pp. 562-573, May 1988.

[Sohi87] G.S. Sohi and S. Vajapeyam, "Instruction Issue Logic for High-Performance, Interruptable Pipelined Processors," Proc. 14th Ann. Int. Symp. Computer Architecture, pp. 27-34, June 1987.

[Tjaden70] G.S. Tjaden and M.J. Flynn, "Detection and Parallel Execution of Independent Instructions," IEEE Trans. Comput., vol. C-19, no. 10, pp. 889-895, Oct. 1970.

[Tomasulo67] R.M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM J. Res. Develop., vol. 11, pp. 25-33, Jan. 1967.

[Tomita83] S. Tomita, K. Shibayama, T. Kitamura, T. Nakata, and H. Hagiwara, "A User-Microprogrammable, Local Host Computer with Low-Level Parallelism," Proc. 10th Ann. Int. Symp. Computer Architecture, pp. 151-157, June 1983.

[Tomita86] S. Tomita, K. Shibayama, T. Nakata, S. Yuasa, and H. Hagiwara, "A Computer with Low-Level Parallelism QA-2 - Its Applications to 3-D Graphics and Prolog/Lisp Machines," Proc. 13th Ann. Int. Symp. Computer Architecture, pp. 280-289, June 1986.

[Weiss84] S. Weiss and J.E. Smith, "Instruction Issue Logic for Pipelined Supercomputers," Proc. 11th Ann. Int. Symp. Computer Architecture, pp. 110-118, June 1984; also IEEE Trans. Comput., vol. C-33, no. 11, pp. 1013-1022, Nov. 1984.

    [Patt851 Y.N.Patt, W-M.Hwu, and M.Shebanow,HPS, A NewMicroarchitecture : Rationale and Introduc tion, Proc. 18thAnn. Workshop on Microprogramming, pp.103-108, Dec. 1985.[Pleszkun881 A.R.Pleszkun and G.S.Sohi, The PerformancePotential of Multiple Functiona l Unit Processors, Proc. 15thAnn. lnt. Symp. Computer A rchitecture, pp.37-44, May 1988.[Rau89JB.R.Rau,D.W.L.Yen, W.YenandR.A.Towle,TheCydra5 Departmental Supercomputer, ZEEE Computer, ~01.22, no.J,Jan. 1989.[Smith851 J.E.Smith and A.R.Pleszkun, Implementation ofPrecise Interrupts in Pipelined Processors, Proc. 12th Ann. Int.Symp. Computer A rchitecture, pp.36-44, June 1985; also IEEETrans. Cornput., vol.C-37, no.5, pp.562-573, May 1988.[Sohi87] GSSohi and S.Vajapeyam, Instruction Issue Logic forHigh-Performance, Interruptable Pipelined Processors, Proc.14th Ann. Int. Symp. Computer Architecture, pp.27-34, June1987.[TjadenlO] GSTjaden and M.J.Flynn, Detection and ParallelExecution of independent instructions, IEEE Trans. Cornput.,vol.C-19,no.l0, pp.889-895, Oct. 1970.[Tomasulo67] R .M.Tomasulo, An Efficient Algorithm forExploiting Multiple Arithmetic Units. IBM J. Res. Develop.,vol.ll,pp.25-33, Jan. 1967.[Tomita83] S.Tomita, KShibayama, T.Kitamura, T.Nakata, andH.Hagiwara, A User-Microprogrammable, Local HostComputer with Low-Level Parallelism, Proc. 10th Ann. Int.Symp. Computer Architecture, pp.151-157, June 1983.[Tomita86] STomita, K.Shibayama, T.Nakata, S.Yuasa, andH.Hagiwara, A Computer with Low-Level Parallelism QA-2 -Its Applications to 3-D Graphics and Prolog/Lisp Machines -,Proc. 13th Ann. Int. Symp. Computer Architecture, pp.280-289,June 1986.[Weiss841 SWeiss and J.E.Smith, Instruction Issue Logic forPipelined Supercomputers, Proc. 11th Ann. Znt. Symp.Computer Architecture, pp.llO-118, June 1984; also IEEETrans. Comput., vol.C-33, no.ll.pp.1013-1022, Nov. 1984.