
[IEEE 2008 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools - Parma, Italy, September 3-5, 2008]


Maximizing Resource Utilization by Slicing of Superscalar Architecture

Shruti Patil and Venkatesan Muthukumar
Dept. of Electrical and Computer Engineering
University of Nevada Las Vegas
4505 Maryland Parkway, Las Vegas, NV 89154, USA

Abstract

Superscalar architectural techniques increase instruction throughput by increasing resources and using complex control units that perform various functions to minimize stalls and to ensure a continuous feed of instructions to the execution units. This work proposes a dynamic scheme to increase efficiency of execution (throughput) by a methodology called block slicing. This takes advantage of instruction level parallelism (ILP) available in programs without increasing the number of execution units. Implementation of this concept in a wide, superscalar pipelined architecture introduces nominal additional hardware and delay, while offering power and area advantages. We present the design of the hardware required for the implementation of the proposed scheme and evaluate it for the metrics of speed-up, throughput and efficiency.

1 Introduction

Current technology demands have spurred exceptional development in the way computers are designed. Advanced computer architectures take advantage of out-of-order superscalar execution, aggressive speculative techniques, high-bandwidth caches and distributed processor architectures. Improvements in the integration density of components on a chip using VLSI techniques, and the corresponding lower costs, have enabled integrating complete processors with some memory on a single chip for improved performance.

Embedded systems and special purpose architectures can follow techniques similar to those used by general-purpose computers for performance enhancement. However, they are also constrained by area and power considerations, and design decisions can be vastly different for them. Current trends show that technologically advanced products are moving towards multi-functional systems; classic examples include cellphones and smart devices. Thus, it may be desired to pack greater functionality into embedded systems while decreasing area and power requirements and improving performance. When applications have vastly different characteristics, the task of designing systems under such constraints can be daunting. One such design decision is to determine the number of functional units of each type that should be incorporated in the arithmetic and logic unit (ALU). To cater to the varied nature of applications, we propose a dynamic scheme called block slicing that leads to efficient utilization of available resources and offers enhanced performance while introducing minimal additional hardware. This is a general scheme that can be applied to any processor to ensure that applications that need fewer resources than those available gain in performance and power.

2 Relation to prior work

Superscalar architectures were built to extract parallelism from data and instructions. Multiple instructions and data are fetched simultaneously, and out-of-order execution is enabled to reduce stalls. The main architectural challenge is to issue multiple instructions per cycle and to do so efficiently. Instruction level parallelism (ILP) can be exploited through efficient scheduling, more execution units and high scheduling bandwidth. SIMD processors, such as vector processors, process multiple sets of data in parallel to achieve high throughput; they rely on the nature of the task to achieve this parallel performance. Intel MMX is a SIMD instruction set that processes multiple datasets within a single execution cycle. When the processor encounters an MMX instruction, it interprets the data registers as a collection of data and performs the same operation on all operands. This increases the throughput of the processor for tasks that operate on small bit sizes.

There have been other special architectures proposed that add a certain degree of flexibility to a processor. Wirthlin et al. [6] proposed the Dynamic Instruction Set Computer (DISC) architecture, which can support dynamic modification of its instruction set based on the demand of the incoming instruction. The DISC architecture had two significant features: partial FPGA reconfiguration, which provided the ability to reconfigure a subsection of an FPGA while letting the remaining logic operate unaffected; and relocatable hardware, which gave the flexibility to relocate or make placement decisions for partial configurations at run time in order to enhance run-time hardware utilization. A. Gloria described a Variable Instruction Set Architecture (VISA) [2], an instruction coding technique that reduces the width of the instructions through dynamic instruction coding performed by the compiler. The compiler selects the instruction set and parallel hardware functions based on the number of bits required for the instructions. Chandra Shekhar et al. sought to realize the benefits of both software-based general-purpose architectures and dedicated hardware architectures through Application Specific Instruction Set Processor (ASIP) [5] architectures. ASIPs are suitable for embedded applications as they permit an alteration of the hardware-software boundary to meet the speed and energy constraints of a specific application.

                                          Average % of integer operations in
Instruction types                         integer benchmarks   floating-point benchmarks
Load-store                                       38%                   22%
Add, sub, compare, shift, and, or, xor           45%                   31%
Branch, conditional move, jump, call, return     16%                    4%

Table 1. Average of MIPS dynamic instruction mix in the SPECint2000 and SPECfp2000 benchmark suites

11th EUROMICRO CONFERENCE on DIGITAL SYSTEM DESIGN Architectures, Methods and Tools
978-0-7695-3277-6/08 $25.00 © 2008 IEEE
DOI 10.1109/DSD.2008.125

All processors described above have static datapaths: the hardware is incapable of adapting to input tasks at run time. Hardware is usually designed with sufficient resources for all possible types of applications expected to run on it. However, tasks that require minimal resources and tasks that require maximum resources pass through the same datapath, which reduces the overall utilization of the hardware. This issue is addressed in this work. The processor can operate on smaller data sets independently, without the need for any special instructions. It identifies and tries to schedule instructions in any task that would otherwise stall due to lack of resources. A small amount of hardware detects the presence of instructions that do not need the full word size of the execution units and schedules them on only a part of the unit, leaving the remaining part to operate on other instructions. Thus the ALU can perform different operations simultaneously.

3 Sliced Processor Architecture

Table 1 [3] lists the average of the MIPS dynamic instruction mix of five SPECint2000 programs (gap, gcc, gzip, mcf, perl) and five SPECfp2000 programs (applu, art, equake, lucas, swim).

In a superscalar processor, execution units are provided to service all instruction types in the instruction set that need a computation unit. A generic instruction set consists of four basic types of instructions: ALU, branch, load and store. The number of units allotted to each type of instruction affects the space requirements, power usage and additional logic necessary for the smooth functioning of these units in parallel. The optimum number of execution units of each type included in the stage is typically based on the applications served by the processor and the types of tasks expected to run on it. The instruction mix in the SPECint2000 programs consists of a large percentage of the integer operations add, subtract, compare, shift, and, or and exclusive-or. To cater to this high percentage of ALU instructions, it may be necessary to include more than one ALU unit. However, the number of ALU instructions can vary widely from one benchmark suite to another, and an unchecked addition of ALU units can result in idle units in the execution stage. Conversely, in ALU-intensive tasks the ALU reservation units get flooded while other units remain idle. Hence, a flexible scheme is proposed in this work, based on the observation that the operands of ALU operations are not always as large as the word length of the machine accommodates. In such cases, only part of the execution unit performs useful computation while the rest operates on zero operands, so the true utilization is below 100%. This work adds run-time flexibility to hardware modules in order to accommodate as many instructions as possible in the execution unit. The exact extra hardware and logic required to do this is designed and implemented, while the general concepts associated with the addition of flexibility are described below. These can be applied in any form to any application.

3.1 Block Slicing

'Block slicing' is the process of splitting a block into multiple modules. A logic circuit that operates on 1-bit operands is called a unit. N units are interconnected to form N-bit modules that operate on N-bit operands. In conventional implementations, N is known or pre-set, and the interconnection network between the units that form the modules is static in nature. When operands of varying lengths are encountered, the value of N needs to be dynamic. In order to allow N to be a dynamic value determined at run time, it is necessary to make the interconnection network flexible. The network may be built to be completely flexible, but it is impractical to reprogram it before the execution of every instruction. Instead, a degree of flexibility can be allotted to it. For this, m units are connected together statically to form m-bit functional units. We refer to each m-bit unit as a slice, capable of operating on m-bit operands. If a contemporary processor has N-bit functional modules, there will be N/m slices in the corresponding sliced architecture. The interconnection network between slices can be designed to be completely flexible, so that each slice can operate independently, or connect itself to more slices and operate concurrently with them. When two m-bit slices operate independently, they can execute two instructions simultaneously, provided the operands are m-bit. When two slices connect together, they form a 2m-bit functional module and can operate on one instruction with 2m-bit operands. Since there are N/m slices, when all slices are connected to each other, they can operate on N-bit operands as before.

Figure 1. Functions in a sliced ALU: (a) the sliced ALU pipeline (Instruction Fetch, Instruction Decode, Dispatch, Resource Mapping, Execution, Reorder, Retirement); (b) the sliced ALU operation (generate enable signals from the ALU functions and Resource Allocation Vectors of instructions A and B, load operands according to the RAVs, compute the result, append zeros, forward the output).

In an execution unit sliced into m-bit slices, slices are allocated to each instruction based on the resource requirements of the ready instructions. Two functions are associated with the allocation process before the instructions are ready to execute: directing the operands into the correct operand register slices, and directing the result correctly into an N-bit output register. These functions can be performed by decoders at the input and output of the execution unit. A truth table for each decoder can be developed easily and implemented as its internal circuitry. Different execution units need different decoding functions, as can be seen from the architecture explained in the next section.

3.2 Sliced ALU Implementation

The pipeline stages in a sliced ALU are shown in Fig. 1(a). Resource mapping is done by a unit called the Resource Mapper. Its latency is equivalent to a few logic gate delays, so it can be included in the dispatch pipeline stage instead of occupying a separate stage.

3.2.1 Resource Mapper

The Resource Mapper determines the number of slices required by an incoming instruction and allocates slices for all incoming instructions. To determine the number of slices required by an instruction, the Resource Mapper performs a function called 'zero-checking': it determines the length of the significant bits in both operands and returns the maximum of the two lengths as the number of slices required by the instruction. This can be achieved simply with AND gates. The zero-checking function is slightly different for the shift operation, which requires not only the number of significant bits of the first operand but also the value of the second operand. Using these values and a simple logic circuit, the number of slices required by a shift instruction can be determined.
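As a behavioral illustration of the zero-checking step (the hardware realizes this with AND gates; the function names below are ours, not the paper's):

```python
def significant_bits(value: int) -> int:
    # Length of the significant (non-zero) part of an operand.
    return value.bit_length()

def slices_required(op1: int, op2: int, m: int = 8) -> int:
    """Number of m-bit slices an instruction needs: driven by the wider of
    its two operands. Every instruction uses at least one slice, even when
    both operands are zero."""
    widest = max(significant_bits(op1), significant_bits(op2))
    return max(1, -(-widest // m))  # ceiling division

# Two 8-bit operands fit in a single slice; a 16-bit operand needs two.
```

The same calculation would need the extra shift-amount input for shift instructions, as noted above.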

Associated with each reservation unit is a register called the Resource Allocation Vector (RAV), which keeps track of the slices allotted to the instruction stored in that reservation unit. In addition, the Resource Mapper uses a global register called the Resource Vector (RV). If the execution unit is divided into N/m slices, then the RAV and RV are N/m bits wide. Each bit in the RAV and RV indicates the allocated/not-allocated status of one slice of the execution unit. When a slice is allotted to an instruction, the bit in the corresponding location is set to 1; when the instruction finishes using the slice, the bit is reset to 0. The Resource Mapper can issue an instruction to one or more slices of the functional units, setting one or more bits at a time in the instruction's RAV and the global RV respectively.

If the execution units are all known to finish an instruction in one clock cycle, then the global Resource Vector can be assumed to be all-zero at the beginning of every clock cycle, and is redundant. In this case, allocation is done by examining all ready instructions waiting for a resource and determining the number of slices required by each. When the ready instructions need more slices than are available, the instructions are prioritized by program order and the remaining instructions are stalled; de-allocation is not necessary here. The Resource Vector is only needed if some instructions take longer than a clock cycle to finish. Though unused in this work, the Resource Vector has been proposed in view of future work, one instance of which is rearranging integer slices into a floating-point pipeline with a latency of more than one clock cycle.
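A minimal sketch of this per-cycle allocation, assuming (as the interconnection scheme of Section 3.1 suggests, though the paper does not state it explicitly) that an instruction's slices must be adjacent:

```python
def allocate(rv: int, needed: int, total_slices: int = 4):
    """Try to claim `needed` contiguous free slices against the global
    Resource Vector `rv` (bit i set = slice i busy). Returns the pair
    (rav, new_rv), or None if the instruction must stall."""
    mask = (1 << needed) - 1
    for shift in range(total_slices - needed + 1):
        rav = mask << shift
        if rv & rav == 0:           # all requested slices are free
            return rav, rv | rav    # mark them allocated in the global RV
    return None                     # not enough adjacent slices: stall
```

With single-cycle execution units, the caller simply starts each cycle with `rv = 0` and allocates ready instructions in program order.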

Fig. 2 shows the block diagram of a sliced ALU, while the flowchart in Fig. 1(b) shows the basic steps by which a sliced ALU functions. The Enable signals in Fig. 2 are fed to D flip-flops so that only the appropriate part of the ALU operates, while the other parts retain their values. This leads to lower power consumption. The architecture is explained in detail in the next section.

3.3 Architecture of integer execution units

The architecture of a sliced integer unit with 8-bit slices is proposed in this section. The integer unit comprises an adder/subtracter unit, a shifter, a logical unit and a comparison unit. The decision to use 8-bit slices in this architecture was based on the trade-off between inter-slice circuitry overhead and performance gain; experimentation with different slice sizes may be performed before design.


Figure 2. Block diagram of the sliced ALU: adder/subtracter, comparator, logical-operations and shifter units, each with per-slice operand enables (op1En, op2En) driven by input decoders; compare logic and output decoders steer the two results, Result-A and Result-B, under the enables opA1En, opA2En, opB1En and opB2En.

3.3.1 Adder/Subtracter Unit

Fig. 3 shows the design of an adder/subtracter module built from two 4-bit adder/subtracter slices. The inputs required by each 4-bit adder/subtracter are two 4-bit operands and a 1-bit operation signal add/sub ('0' for addition, '1' for subtraction). The module is designed by interconnecting signals between the two slices and using multiplexers to let it operate at variable data lengths. It is capable of performing an addition and/or subtraction operation on the operand sets {X3, . . . , X0} and {X7, . . . , X4}. Control signals add/sub1 and add/sub2 select the operation. The control input sel indicates whether the two slices are to operate independently or concurrently. Multiplexer Mux1 determines the propagation of the add/sub signal to the second slice, while multiplexer Mux2 controls the cascading of the carry-out signal from unit U3 to unit U4. Mux3 generates the overflow exception bits v1 and v2. After the output is produced, it is sign-extended in order to be passed on to the result register and subsequently stored. When only one slice (say, slice-0) is used, the signals {S4, . . . , S7}, v2 and Cout7 are ignored, and vice versa. When both slices are used for one operation, the appropriate signals are routed to the output.
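The multiplexer behavior described above can be sketched as a behavioral Python model. Our assumed semantics (not the paper's RTL): with sel = 1 the high slice receives the low slice's carry-out and add/sub signal, acting as one 8-bit unit; with sel = 0 each slice restarts its own carry for an independent 4-bit operation.

```python
MASK4 = 0xF

def slice_addsub(a, b, cin, sub):
    # One 4-bit slice: subtraction is addition of the inverted operand
    # plus a carry-in of 1 (two's complement).
    if sub:
        b = ~b & MASK4
    total = a + b + cin
    return total & MASK4, total >> 4       # (sum, carry out)

def two_slice_addsub(x, y, sel, addsub1, addsub2):
    """x, y hold either one 8-bit operand pair (sel=1) or two independent
    4-bit operand pairs (sel=0)."""
    s0, c0 = slice_addsub(x & MASK4, y & MASK4, addsub1, addsub1)
    op_hi = addsub1 if sel else addsub2    # Mux1: which add/sub controls slice 1
    cin_hi = c0 if sel else op_hi          # Mux2: cascade carry, or restart
    s1, c1 = slice_addsub(x >> 4, y >> 4, cin_hi, op_hi)
    return (s1 << 4) | s0, c1
```

For example, 0x3C + 0x25 in combined mode gives 0x61, while in independent mode the two nibble-pairs add separately (each slice's carry-out stays local).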

This design is extended in Fig. 4 to four slices of the adder/subtracter unit, each capable of operating on two 8-bit operands, resulting in a 32-bit sliced ALU.

As explained before, there are two functions associated with slice allocation: directing the operands into the correct operand register slices, and directing the result correctly into an N-bit output register.

Figure 4. Architecture of the flexible adder/subtracter unit: four 8-bit slices (SLICE-0 to SLICE-3), each with 8-bit operand inputs, carry-in Cin, carry-out Cout, sum output S and its own add/sub control; per-slice sel multiplexers (sel-0 to sel-3) choose between cascading the previous slice's carry-out and injecting a fresh carry-in.

Figure 5. Four 8-bit compare slices for signed or unsigned comparison: each slice compares its operand bytes opA and opB, produces the outputs A<B, A=B and A>B, and takes a signed/unsigned control signal.

The input operands are initially present in N-bit operand registers. If an ALU instruction whose two input operand registers contain 8-bit values is ready for execution and is allotted Slice-1, the operands have to be loaded into locations [15:8] of registers op1 and op2. Similarly, the 8-bit result generated by Slice-1 has to be directed to locations [7:0] of the output register. This direction of inputs into the appropriate register slices, and of the output into a result register, is done by decoders. There are two decoders, one for each instruction; each is fed that instruction's Resource Allocation Vector (RAV) and outputs the respective sign-extended result.
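A behavioral sketch of this operand and result steering, under our assumption that an instruction's slices are adjacent and its result is right-aligned into the output register (helper names are illustrative, not the paper's):

```python
SLICE = 8  # slice width in bits

def low_slice(rav: int) -> int:
    # Index of the lowest allocated slice in the RAV.
    return (rav & -rav).bit_length() - 1

def place_operand(value: int, rav: int) -> int:
    """Input decoder: shift an operand into the register bits covered by
    the instruction's lowest allocated slice."""
    return value << (low_slice(rav) * SLICE)

def extract_result(reg: int, rav: int) -> int:
    """Output decoder: pull the slice result back out and right-align it
    for the N-bit output register (sign extension omitted for brevity)."""
    width = bin(rav).count("1") * SLICE
    return (reg >> (low_slice(rav) * SLICE)) & ((1 << width) - 1)
```

For the example in the text, an instruction with RAV 0010 has its 8-bit operand placed at bits [15:8], and Slice-1's result is routed back to bits [7:0].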

The adder/subtracter units, along with the input and output decoders, constitute the complete flexible adder/subtracter. An area analysis of this module is presented in Section 3.4.

3.3.2 Compare Unit

The compare operation must be performed on both signed and unsigned operands, which require slightly different treatment. The comparator can be designed as minimal-delay circuitry, or under a minimal-area constraint, depending on the constraints imposed by the system. Fig. 5 shows the use of such comparison units in a sliced comparator design. The 1-bit control signal takes the value 0 for unsigned comparison and 1 for signed comparison.

Once the sliced comparison is performed, the final result of the compare operation is determined by separate logic circuitry that takes into account the respective outputs of each compare slice. The control signals for the compare operations and the resource allocation vector for each instruction are made available to this circuitry. The final one-bit output of the compare unit is concatenated with (N-1) leading zeros and returned as the output of the comparator unit.
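One way this combining logic can work (our reading of the design, not a circuit given in the paper) is a most-significant-slice-first scan: the first allocated slice whose bytes differ decides the comparison, and only if all slices report equality are the operands equal.

```python
def combine_compare(slice_results):
    """slice_results[0] is the most significant allocated slice; each entry
    is 'lt', 'eq' or 'gt' for that slice of the two operands."""
    for r in slice_results:
        if r != "eq":   # first unequal slice decides the result
            return r
    return "eq"         # every allocated slice matched
```

With the signed/unsigned control of Fig. 5, only the most significant allocated slice would apply a signed comparison; the lower slices compare their bytes as unsigned magnitudes.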


Figure 3. Two interconnected 4-bit adder/subtracter units forming one 2-slice adder/subtracter: eight 1-bit adder cells (inputs X0-X7 and Y0-Y7, outputs S0-S7, carries Cout0-Cout7), with mux1 selecting the add/sub signal propagated to the second slice, mux2 selecting that slice's carry-in, and mux3 selecting the overflow bit (v1 or v2), all under the control input sel.

Unit                                 2:1 MUX   4:1 MUX   Logic gates
Adder                                   4         -          -
Adder Result Decoder                    -         -         23
Comparator                              8         -          -
Comparator Result Decoder               -         -         44
Shifter                                14         -          -
Logical Operations                      -         -          -
Shifter and Logical Result Decoder      -         -         23
RAV Decoder                             -        16          -
Load Operand Decoder                    -         -         16
Total                                  26        16        106

Table 2. Additional hardware used for slicing of the ALU

3.4 Area analysis

The sliced ALU design requires additional hardware for decoders, multiplexers and added signals. The 2:1 multiplexers used extensively in the design can be implemented with transmission gates (pass-transistor logic), built from NMOS and PMOS transistors in a configuration that results in no static power consumption; each such multiplexer adds three NMOS and three PMOS transistors to the hardware. To estimate the hardware used by the decoders that direct input and output signals into the correct register slices, the average cost of the decoders was computed in terms of logic-gate equivalents. Table 2 lists the additional hardware used by the various units in a sliced ALU; the additional hardware introduced to implement slicing is minimal. A delay analysis shows that the maximum delay path of the decoders is equivalent to three gate propagation delays, so each decoder adds only a nominal delay to the execution datapath.

4 Architecture Implementation

In order to evaluate the block slicing concept in a processor, it was implemented in a DLX pipeline using VHDL (VHSIC Hardware Description Language). The DLX architecture was designed by Hennessy and Patterson as a representative architecture of practical processors. This section explains the architectural design of the DLX machine and the architectural implementation of the scheme.

4.1 DLX Architecture

The DLX is a simple 32-bit load-store architecture described in [3]. The operations supported by the DLX are classified into four major types: ALU, branch, load-store and floating-point operations. The control instructions are jumps and branches; branches are conditional and must be evaluated before the branch is resolved. The floating-point unit of the DLX handles all floating-point operations as well as the integer multiply and divide operations. The scalar, pipelined implementation of the DLX consists of five stages: instruction fetch; instruction decode and register fetch; execute and effective address calculation; memory access; and write-back. It can be extended to a superscalar pipelined version using general superscalar concepts, with the number of pipeline stages and their functions remaining similar.

4.2 Superscalar, Pipelined DLX Implementation in VHDL

We implemented the superscalar VHDL version of the DLX as a two-wide, five-stage pipelined, 32-bit architecture. It is capable of executing integer arithmetic and logical operations, compare, shift, jump and branch instructions; it does not contain a floating-point unit. The architecture uses an instruction cache to store instructions loaded from memory. Fig. 6(a) shows the pipeline stages in this implementation.

Figure 6. DLX architecture and simulation dataflow: (a) the simulated DLX architecture (Instruction Fetch of Instr-A and Instr-B from the instruction cache; Decode and Dispatch with the register file; Load/Store, Integer and Multiply/Divide units; Completion; Retirement); (b) the simulation dataflow (.asm file, dlxasm, .out file, testbench and simulation engine, waveform viewer).

Each stage can process two instructions simultaneously. Fig. 7 shows the block diagram of the integer unit, implemented as a 32-bit functional unit. The dispatch and execute stages differ from the original implementation. There are at most two instructions in each stage of the pipeline at any given time, except in the reorder unit. Once valid operands are fetched in the dispatch stage and an instruction is ready to begin execution, the number of slices required by the instruction is computed from the values of its operands by a zero-checking unit. The Resource Mapper then allocates execution unit slices to the instruction; it also sets the control signals that slice an execution unit appropriately. The resource vector is a bit vector that indicates the slices allocated to an instruction. For example, if instruction A is allocated slice number 1, its 4-bit resource vector will be 0001; for instruction B with allocated slices 2 and 3, the resource vector is 0110. The global resource vector during that clock cycle is thus 0111, indicating that only three slices of the execution units will operate while the fourth slice consumes idle power. Data is loaded into the operand registers at the rising edge of the clock. Due to block slicing, the resource-mapping control signals slice the execution unit, and the ALU produces at most two outputs (ALU Output A and ALU Output B) by executing two instructions simultaneously. These results are stored into their respective reorder buffer entries and forwarded, if necessary, in the next clock cycle. The fetch stage is set to fetch the next instruction as soon as an instruction is issued to an execution unit. Thus, when parallelism due to slicing exists in a program, the fetch stage is also sped up and the total execution time of the program decreases. In the absence of any additional instruction-level parallelism, the execution time remains the same as in a non-sliced processor, since instructions are prioritized by program order.
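The worked example above can be sketched in a few lines: the per-instruction resource vectors simply OR together into the cycle's global resource vector.

```python
def global_resource_vector(ravs):
    # OR the per-instruction resource vectors into the global vector
    # for the current clock cycle.
    rv = 0
    for rav in ravs:
        rv |= rav
    return rv

# Instruction A holds slice 1 (0001); instruction B holds slices 2-3 (0110).
```

With these inputs the global vector is 0111: three slices active, one idle.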

Figure 7. Block diagram of the integer unit in the VHDL implementation of the superscalar DLX: a reservation unit feeds the adder/subtracter, comparator, logical-operations and shifter units through per-slice operand enables (op1En, op2En and opA1En, opA2En, opB1En, opB2En), with a multiplexer selecting the operation and D flip-flops latching the result.

Benchmark Program    Time of execution (us)    Speed-up   Gain (%)
                     non-sliced      sliced
ALUinstructions-1         93.5        47.5       1.968     49.198
ALUinstructions-2        137.5        95.5       1.440     30.545
DLX                       61.5        58.5       1.051      4.878
LoadStore                119.5       119.5       1.000      0.000
PrimeNumber             6595.5      6471.5       1.019      1.880
supscal                   63.5        39.5       1.608     37.795
MDUinstructions          198.5       191.5       1.037      3.526
BranchJump                85.5        67.5       1.267     21.053
NtoK                      76.5        62.5       1.224     18.301

Table 4. Results of evaluation of Time of Execution and Speed-up

5 Evaluation

The usage of an integer ALU unit was studied by running several benchmarks on a VHDL implementation of the DLX superscalar processor. Table 3 shows the results obtained.

The VHDL program takes a text file containing machine codes as input. It can be simulated using Active-HDL 7.1. Benchmark programs are usually present as assembly-level programs. Such benchmark programs for DLX cannot be directly used as input to the VHDL program. Fig. 6(b) shows the data flow diagram while using the VHDL DLX processor emulator code. Benchmark programs with extension .asm are first converted to a text file with extension .out using a freely available DLX assembler program called dlxasm [1]. The dlxasm assembler converts DLX instructions into their respective DLX machine codes. Each machine code is indexed by the 32-bit memory address at which the instruction would be stored in a true hardware system. The format of the .asm and converted .out files is given in the Appendix. The .out file is used as input to the simulator


[IEEE 2008 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools - Parma, Italy (2008.09.3-2008.09.5)]

Benchmark Program      # of     Time of          # of ALU   # of ALU instns with operands of
                       instns   simulation (us)  instns     8 bits   16 bits   24 bits   32 bits
ALUinstructionsPart1     22        93.5            22         10        4         8         0
ALUinstructionsPart2     33       137.5            33         14       10         4         5
BranchJump             85.5        85.5            10          0        0         3         7
BubbleSort             6477     12191.5          2741        658        1       708      1374
Dlx                      18        61.5             9          0        0         5         4
LoadStore                30       119.5             5          1        1         2         1
MDUinstructions          39       198.5            27          8        9         7         4
PrimeNumber            1321      6595.5           718          1       22       360       335

Table 3. Initial usage of ALU units in benchmarks

engine that contains the VHDL code. The simulation engine produces waveforms for signals that propagate in the processor. These are in the form of a Value Change Dump (.VCD) file and can be easily viewed using a waveform viewer.

6 Results

The performance criteria used for evaluating the concept of slicing are speed-up, throughput, utilization and power. These criteria are widely used for comparison of different architectures. To evaluate the performance of the block slicing concept with respect to these factors, hardware code for the DLX processor was developed using VHDL and tested with benchmark programs. Benchmark programs were obtained from various internet resources. These were assembly-level programs written for the DLX machine. The .asm files containing benchmark programs were converted to .out files using the package dlxasm [1] and then run on the VHDL code of the sliced processor. Instead of developing the code from scratch, the freely available VHDL package dlx-vhdl [4] was used as base code and was suitably modified for the proposed architecture. Throughput is given by the number of instructions completed per unit time. It can also be related to the number of Instructions Per Cycle (IPC), where the unit of time is a clock cycle. Considering that a new instruction is fetched every clock cycle, the number of fetch cycles indicates the input stream to the architecture, and the number of instructions committed per fetch cycle indicates the output stream of the processor. The throughput is then given as:

IPfC = Total No. of Instructions Committed / Total No. of Fetch Cycles    (1)

The speed-up is computed with respect to the DLX architecture without the processor modifications for block slicing. Thus, speed-up is given as:

Speed-Up = Time of Execution on Non-Sliced DLX Arch. / Time of Execution on Sliced DLX Arch.    (2)
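As a quick check of Equations 1 and 2, both metrics can be computed directly from measured values; the figures below are taken from the ALUinstructions-1 rows of Tables 4 and 6.

```python
def ipfc(instructions_committed, fetch_cycles):
    """Equation 1: instructions committed per fetch cycle."""
    return instructions_committed / fetch_cycles

def speed_up(time_non_sliced, time_sliced):
    """Equation 2: ratio of execution times, non-sliced over sliced."""
    return time_non_sliced / time_sliced

# ALUinstructions-1: 22 instructions; 23 fetch cycles non-sliced, 14 sliced;
# execution times 93.5 us (non-sliced) and 47.5 us (sliced).
ipfc_ns = round(ipfc(22, 23), 3)        # 0.957 (non-sliced)
ipfc_s = round(ipfc(22, 14), 3)         # 1.571 (sliced)
su = round(speed_up(93.5, 47.5), 3)     # 1.968
```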

Resource utilization at the bit level is given by the percentage of the resource used during the time of execution. Resource utilization can be given in terms of the ratio of the number of times the resource slices were completely used to the total number of times the resource was accessed. Power consumed during execution of two sequential operations is evaluated using the Xilinx XPower tool that is included with Xilinx ISE. The power-delay product is then used to compare the non-sliced and the sliced architectures.

Benchmark Program    NSALU   SALU    p    NSSlice   SSlice    n     εS     Gain (%)
ALUinstructions-1      22     13     9       44       26     22   0.846    69.23
ALUinstructions-2      33     31     2       66       62     33   0.532     6.45
DLX                     9      7     2       18       14      9   0.643    28.57
LoadStore               5      5     0       10       10      5   0.500     0.00
PrimeNumber           718    681    37     1436     1362    718   0.527     5.43
supscal                14      8     6       28       16     14   0.875    75.00
MDUinstructions        27     26     1       54       52     27   0.519     3.85
BranchJump             21     14     7       42       28     21   0.750    50.00
NtoK                   16     15     1       32       30     16   0.533     6.67

Table 5. Efficiency

Table 4 presents the speed-up obtained for the benchmark programs by listing the time of execution of each benchmark on a non-sliced and a sliced processor and using Equation 2. Table 5 presents the efficiency of use of ALU slices. In a non-sliced implementation, each time the ALU is accessed, both potential slices are accessed. In a sliced ALU, each time two instructions are executed in parallel, they are assumed to use two slices each, resulting in the entire length of the ALU being used. Let,

NSALU   = # of times ALU is accessed in non-sliced implementation
SALU    = # of times ALU is accessed in sliced implementation
NSSlice = # of times potential ALU slices are accessed in non-sliced implementation
SSlice  = # of ALU slices accessed in sliced implementation
p       = # of times ALU instructions executed in parallel in sliced implementation
n       = Total # of ALU instructions

NSSlice and SSlice are given as:

NSSlice = 2 × NSALU and SSlice = 2 × SALU


Benchmark Program    # of fetch cycles         # of             IPfC               Gain in
                     non-sliced   sliced    instructions   non-sliced   sliced    IPfC (%)
ALUinstructions-1        23          14          22           0.957      1.571     64.286
ALUinstructions-2        34          32          33           0.971      1.031      6.250
DLX                      27          25          18           0.667      0.720      8.000
LoadStore                30          30          30           1.000      1.000      0.000
PrimeNumber            1693        1660        1321           0.780      0.796      1.988
supscal                  17          11          16           0.941      1.455     54.545
MDUinstructions          71          70          39           0.549      0.557      1.429
BranchJump               34          29          28           0.824      0.966     17.241
NtoK                     34          33          24           0.706      0.727      3.030

Table 6. Throughput in terms of Instructions Per Fetch Cycle

                  Power (mW)   Delay (ns)    PDP
Non-Sliced ALU       431           20        8620
Sliced ALU           604           10        6040

Table 7. Power-delay product for execution of two worst-case operations with 16-bit worst-case operands

Thus, efficiency is given by ε:

εNS = n / NSSlice and εS = n / SSlice
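The efficiency figures and gains in Table 5 follow directly from these definitions; the values below reproduce the ALUinstructions-1 row (n = 22, NSALU = 22, SALU = 13).

```python
def efficiency(n_alu_instructions, slices_accessed):
    """ε: ALU instructions executed per slice access."""
    return n_alu_instructions / slices_accessed

# ALUinstructions-1 row of Table 5.
n, ns_alu, s_alu = 22, 22, 13
ns_slice, s_slice = 2 * ns_alu, 2 * s_alu   # both slices count per ALU access
eps_ns = efficiency(n, ns_slice)            # 0.500
eps_s = efficiency(n, s_slice)              # ~0.846
gain = 100 * (eps_s - eps_ns) / eps_ns      # ~69.23 %
```

The 13 sliced accesses arise because the 9 parallel pairs cover 18 instructions in 9 accesses and the remaining 4 instructions take one access each.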

Table 6 shows the throughput of both implementations in terms of instructions per fetch cycle. For estimation of power consumption, the Xilinx XPower tool was used with synthesizable designs of the sliced ALU and the non-sliced ALU. The ALU is capable of performing addition/subtraction, shift, compare and logical operations. Every combination of two different operations was selected and simulated with worst-case 16-bit operands. The operations of addition and comparison were found to consume the most power. The ALU designs were then analyzed for power consumption during execution of the operations of addition and comparison of 16-bit operands, sequentially on a non-sliced ALU and in parallel on a sliced ALU. Table 7 shows the power-delay product (PDP) during this analysis.
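The comparison in Table 7 amounts to one multiplication per design: although the sliced ALU draws more power while both operations run concurrently, it finishes them in half the time, so its PDP is lower.

```python
def power_delay_product(power_mw, delay_ns):
    """PDP in mW·ns, i.e. picojoules of energy per operation pair."""
    return power_mw * delay_ns

# Values from Table 7.
pdp_non_sliced = power_delay_product(431, 20)  # 8620
pdp_sliced = power_delay_product(604, 10)      # 6040
```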

7 Conclusion

The concept of resource slicing was implemented in the DLX processor using VHDL. Sliced resources process a greater number of instructions without the need to add extra hardware resources. The sliced resource implementation was evaluated with respect to speed-up, throughput, power and utilization of the integer unit. From the results thus obtained, it can be observed that by the addition of one low-latency stage, the Resource Mapping stage, and minimal hardware, it is possible to obtain a speed-up and higher efficiency of execution. The number of functional units required to be pipelined in a superscalar pipeline can also be reduced if the task running on the

processor allows it. For a generic processor that runs a variety of different applications, each requiring a different number of functional units, this can provide a flexible scheme for efficient execution. It is necessary to evaluate the performance enhancement obtained at varying superscalar widths on more benchmarks than used here. This will help in determining the optimal number of slices required for different applications. This number can then be used to design sliced processors for most efficiency. Block slicing is a general concept that can be applied in a variety of forms to modules other than functional units. It may be applied to registers and caches. Suitable hardware is required to address, identify and access sliced data stored in sliced registers and caches. A complete sliced processor will be obtained once work is performed on slicing these modules.

References

[1] Ashenden. Compiler for DLX instruction set, dlxasm package. Website. http://www.ashenden.com.au/designers-guide/DG-DLX-material.html.

[2] A. Gloria. VISA: A variable instruction set architecture. ACM SIGARCH Computer Architecture News, 18(2):76–84, June 1990.

[3] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2002.

[4] Darmstadt University of Technology. VHDL-DLX package, available freely at http://www.rs.e-technik.tu-darmstadt.de/TUD/res/dlxdocu/SuperscalarDLX.html.

[5] C. Shekhar, R. Singh, A. S. Mandal, S. C. Bose, R. Saini, and P. Tanwar. Application specific instruction set processors: Redefining hardware-software boundary. Proceedings of the 17th International Conference on VLSI Design, IEEE, pages 915–918, 2004.

[6] M. J. Wirthlin and B. L. Hutchings. A dynamic instruction set computer. IEEE Symposium on FPGAs for Custom Computing Machines, pages 99–107, April 1995.
