[IEEE 2006 International Symposium on System-on-Chip - Tampere Hall, Tampere, Finland (2006.11.13-2006.11.16)] 2006 International Symposium on System-on-Chip - SIxD: A Configurable

SIxD: A Configurable Application-Specific SISD/SIMD Microprocessor Soft-Core

Nehir Sonmez
Department of Computer Engineering
Bogazici University
Istanbul / Turkey

[email protected]

Abstract: In this study, an FPGA-based, configurable, application-specific, SIMD-capable, non-pipelined RISC soft processor core that operates on variable-width fixed-point data is presented. The flexible SISD core with general-purpose instructions fits on small FPGAs, but can be easily configured to incorporate application-specific instructions and to operate in SIMD mode when higher performance is required.

I. INTRODUCTION

In recent years, advances in chip design technology have reduced the unit price of FPGAs; hence, soft CPU cores that can be used in FPGAs have started to appear [1,2]. A "soft" processor is one that is implemented using the FPGA logic, rather than being a fixed part of the chip circuitry, providing the flexibility to implement and potentially customize it as required. The architecture of these processors is usually SISD (Single Instruction-Single Data). Xtensa [3] is a SISD architecture whose instruction set can be enhanced by including user-defined application-specific instructions, as well as by including a SIMD engine for DSP applications. Many modern commercial microprocessors also have SIMD capability, which has become an integral part of the Instruction Set Architecture.

This study presents the early results of a soft CPU core, the SIxD, a non-pipelined RISC system (load/store architecture) with a fixed 16-bit instruction word and variable data space. Depending on the size of the FPGA, the user can select the operation mode (SISD or SIMD) and shrink, extend or modify the instruction set at compile time to include application-specific instructions and exclude redundant ones. The core is designed in VHDL for Xilinx FPGAs, but it can easily be modified to fit FPGAs of other vendors. In the literature, there are processor architectures that incorporate both SISD and SIMD in a single core [4,5] using partitioned data words; however, no completely parametrizable SIMD core has been reported. The SIxD is therefore a novel soft core that can fit in an FPGA as small as 40K system gates, or offer high-performance vector processing on bigger FPGAs. The SIxD architecture is briefly explained in two parts, the scalar part and the array/vector part, in the following section. An illustrative implementation concerning the MPEG-7 Motion Activity Descriptor is explained in Section III, with conclusions and future work in Section IV.

Arda Yurdakul
Department of Computer Engineering
Bogazici University
Istanbul / Turkey
[email protected]

II. THE SIxD ARCHITECTURE

The SIxD is a configurable, application-specific system with a fixed 16-bit instruction word and variable data space, fusing a non-pipelined RISC uni-processor (SISD) with a SIMD multiprocessor. The SIxD supports application-specific modules, as well as general-purpose processor functionality. A configuration file determines the ALU instructions, the branching instructions, the data space, and, most importantly, whether the SIMD mode of the core is activated with a user-defined number of Processing Nodes.

A. The bare SISD core

The SISD model is a non-pipelined RISC system with 8 general-purpose registers, as well as dedicated special registers such as the Program Counter, the Instruction Register, Memory Address Registers and Memory Data Registers. To make the most of the FPGA, dedicated Block RAM resources are chosen to serve as the Instruction Memory (IM) and Data Memory (DM) of the SIxD (Fig. 1). The Register File (RF), with a single-port input and a double-port output, has been implemented as a SelectRAM in the FPGA. The primitive Hardware Multiplier, which is standard in Virtex and Spartan 3 FPGAs, is the obvious choice for the multiplier, requiring no LUT space on the chip. If no Hardware Multiplier is present on the target FPGA, the user must either remove the multiplication instruction or provide a custom generic multiplier.

The hardwired Control Unit (CU) is a Finite State Machine, implementing a sequential circuit based on different states in the machine, issuing a set of control signals at each state. Depending on the instruction word input, the Control FSM completely describes the Instruction Set Architecture of the SIxD, by setting up the necessary bits for corresponding units to perform the correct operation.

This work has been fully supported by the Turkish Foundation of Science and Technology, under Kariyer Research Project Program, pr. no: 104E038.

1-4244-0622-6/06/$20.00 ©2006 IEEE.

The instruction formats supported by the SIxD are:

* Register-register (direct)
* Register-constant (immediate)
* Register-memory (indirect)
* Branch format
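As an illustration of how the 16-bit instruction word maps onto the opcode groups of Table I, the following minimal Python sketch classifies a word by its 4-bit opcode. The placement of the opcode in the upper four bits of the word is an assumption made for this example; the paper does not give the bit layout of the individual formats, and the name `decode_opcode` is hypothetical.

```python
# Opcode groups transcribed from Table I (4-bit opcodes).
OPCODES = {
    0b0000: "NOOP / enable-disable interrupts / RET",
    0b0001: "A/L register (direct)",
    0b0010: "A/L register (immediate)",
    0b0011: "A/L register (indirect)",
    0b0111: "Shift right/left",
    0b1000: "Multiply",
    0b1001: "SIMD execute",
    0b1010: "Load/store memory",
    0b1011: "Load/store memory",
    0b1100: "Branch (slot 1)",
    0b1101: "Branch (slot 2)",
    0b1110: "SIMD load/store/route/set_iter",
    0b1111: "Immediate/indirect load/store (memory/port)",
}

# Opcodes 0100-0110 are reserved for custom instruction extensions.
RESERVED = {0b0100, 0b0101, 0b0110}


def decode_opcode(word: int) -> str:
    """Classify a 16-bit instruction word by its opcode group."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction word must fit in 16 bits")
    # ASSUMPTION: the opcode occupies the top four bits of the word.
    opcode = word >> 12
    if opcode in RESERVED:
        return "Reserved (custom instruction slot)"
    return OPCODES[opcode]
```

On the bare core, a word that falls in a reserved slot would correspond to the invalid-opcode exception mentioned in the text; once a custom unit is attached, the same slot becomes a valid instruction.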

A custom unit can be a user-defined instruction or a custom-designed hardware module. The bare SISD core has an ALU, a multiplier and no custom unit. In order to create a custom instruction on the SISD, the custom unit must be placed with inputs from the register file and output(s) to the RF input multiplexer. A reserved opcode in the ISA must be associated with the control of the new unit, and the corresponding states must be revised so that the unit is enabled properly and the select signal of the RF input multiplexer is set correctly to propagate the outputs. The "reserved" places are used for custom instruction extensions to the ISA. More custom units could be added to the ISA by having selection bits distinguish different instructions, as is done for the shift instructions, where two instructions share a single opcode.

Up to four interrupts are implemented, as well as an interrupt enable/disable instruction. The SIxD raises an exception when an invalid opcode is read. The interrupt/exception addresses are at the start of the Instruction Memory. There is no stack or accumulator on the SIxD, resulting in a single level of interrupt handling.

The non-pipelined implementation is useful for the ease of design and lower area occupation.

B. The Configuration File

Although Table I shows the full instruction set, it should be noted that these instructions can be modified during the synthesis of the core with the aid of a configuration file written in VHDL.

The following types of modifications are supported by the configuration file:

1. Enable SIMD with a number of Processing Nodes: Depending on the size of the chip, the user can enable SIMD instructions, choosing to have any number of Processing Nodes (PN) that is a multiple of two.

2. Determine the data space (width and length): Depending on the size of data memory needed and the Block RAM available on the FPGA, the user can specify the data width (as a multiple of 8) and the data length (at least 2K words).

3. Select the most appropriate branch instructions: In the default instruction set, BL and BEQ are used. However, if greater-than operations are used more frequently than BL, the user can replace BL with BG. There are six possible branch instructions for the two branch instruction slots: BEQ, BNEQ, BL, BLEQ, BG, and BGEQ.

4. Remove unused instructions from the instruction set: As an example, removing the XOR instruction would reduce the area of the core by 11 slices (Table II). This way, the SISD core can fit in a very small FPGA.

TABLE I. THE SIxD INSTRUCTION SET

Opcode      Explanation                                     Clock cycles (b)
0000        NOOP, enable/disable interrupts, RET            2
0001        A/L (a) register (direct)                       4
0010        A/L (a) register (immediate)                    4
0011        A/L (a) register (indirect)                     5
0100-0110   Reserved
0111        Shift right/left                                4
1000        Multiply                                        4/5
1001        SIMD execute instruction                        (c)
1010-1011   Load/store to/from memory                       3
1100-1101   Branch if less than/branch if equal to          5
1110        SIMD load/store/route/set iter                  -
1111        Immed./indirect load/store from/to mem./port    2/3/4

a. A/L: Arithmetic/Logic (ADD/SUB/AND/OR/XOR).
b. Clock cycles.
c. See Table III.

TABLE II. SISD SLICE INFORMATION

                Data width
Unit            8-bit   16-bit  24-bit  32-bit
SISD (all)      279     426     573     689
ALU             68      148     231     310
Register File   22      40      60      80
Reg. MUX        22      44      66      88
Control Unit    186     186     186     186

Figure 1: The SISD datapath.
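The constraints that the configuration file places on the core can be summarized in a small validation sketch. The actual configuration file is written in VHDL and its contents are not reproduced in this paper, so the Python class below is only a stand-in: the field names are hypothetical, and "2K words" is assumed to mean 2048 words.

```python
from dataclasses import dataclass

# The six branch instructions available for the two branch slots.
BRANCHES = {"BEQ", "BNEQ", "BL", "BLEQ", "BG", "BGEQ"}


@dataclass
class SixdConfig:
    """Hypothetical mirror of the options exposed by the configuration file."""
    simd_enable: bool = False
    num_pn: int = 0                  # number of Processing Nodes (SIMD only)
    data_width: int = 16             # bits, must be a multiple of 8
    data_length: int = 2048          # words, ASSUMPTION: 2K = 2048
    branches: tuple = ("BL", "BEQ")  # default branch slots

    def validate(self) -> None:
        """Raise ValueError if an option violates the stated constraints."""
        if self.simd_enable and (self.num_pn < 2 or self.num_pn % 2 != 0):
            raise ValueError("number of PNs must be a multiple of two")
        if self.data_width <= 0 or self.data_width % 8 != 0:
            raise ValueError("data width must be a multiple of 8")
        if self.data_length < 2048:
            raise ValueError("data length must be at least 2K words")
        if len(self.branches) != 2 or not set(self.branches) <= BRANCHES:
            raise ValueError("exactly two branch slots from the six options")
```

For example, `SixdConfig(simd_enable=True, num_pn=16, branches=("BG", "BEQ")).validate()` passes, whereas a configuration with three Processing Nodes or a 12-bit data width is rejected.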

C. The SIMD functionality

The SIMD functionality of the SIxD depicted in Fig. 2 is initiated within the configuration file by setting the SIMDenable signal high. This generates a user-defined number of Processing Nodes. A SIMD Processing Node is made up of a Local Memory (a dual-ported Block RAM) and a Processing Element, thus providing the most elementary units for computation. The Local Memory (SIMD registers) also needs to be able to take inputs from the main Data Memory, which calls for a multiplexer that completes the PN.

The instructions that emerge with the activation of the SIMD are shown in Table III. For SIMD instructions, only register indirect addressing can be used, since otherwise there would be no way of issuing a SIMD instruction inside a 16-bit instruction word. In these instructions, the register contents act as address locations for the DM and the LMs. The set_iter instruction sets an iterations register, iter, with a constant, determining the number of times to iterate the given SIMD instruction. Suppose that a load vector instruction is issued on the SIMD with 4 processing elements, and iter is set to 8. In this case, 32 values in the DM are distributed in pairs (via the dual-port Block RAMs) to 8 consecutive locations in each of the 4 PEs. In this way, the iter register speeds up data transfers between the DM and the LMs. A large data multiplexer is placed at the output to select which Local Memory to read from when writing back to the main Data Memory.
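The load vector example above (4 processing elements, iter set to 8, 32 values) can be modelled in a few lines of Python. The paper does not state which DM address ends up in which Local Memory, so the interleaved mapping used here is an assumption, as are the function and parameter names; the sketch only shows how the number of PEs and iter together determine the amount of data moved.

```python
def load_vector(dm, base, num_pe, iters):
    """Model a load vector: move num_pe * iters values from DM into the LMs.

    Returns one list per PE holding `iters` consecutive local locations.
    """
    count = num_pe * iters
    block = dm[base:base + count]
    if len(block) != count:
        raise ValueError("not enough data in DM for this transfer")
    # ASSUMPTION: element k of the block goes to PE (k % num_pe),
    # local location (k // num_pe). The real mapping may differ.
    return [block[pe::num_pe] for pe in range(num_pe)]


# The case from the text: 4 PEs, iter = 8, hence 32 values are distributed.
local_memories = load_vector(list(range(100)), base=0, num_pe=4, iters=8)
```

Each of the four lists in `local_memories` then holds 8 values, matching the "8 consecutive locations in the 4 PEs" of the example.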

For the SIMD part, the customization process differs in the sense that the new functional unit is added inside the Processing Element, with a certain bit of the op_sel signal enabling it. For a single-cycle custom operation, no other modification of the core is necessary. If the custom unit outputs two results, setting the double-output bit high in the configuration file lets the results be written to two consecutive memory address locations.

TABLE III. SIMD INSTRUCTIONS OF THE SIxD

SIMD instruction   Explanation                                            Clock cycles (b)
Load vector        Loads a vector from DM to the LMs.                     5+3i (a)
Store vector       Stores a vector from LMs to the DM.                    5+3i (a)
Exec. vector       Execute two vectors based on the op_sel signal,        6+2i (a)
                   save to a vector in LMs.
Route              Route a scalar from one LM to another.                 9
Set_iter           Set the number of iterations to perform.               2

a. i: iterations.
b. Clock cycles.
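The cycle counts of Table III are linear in the number of iterations i, which makes the cost of a SIMD instruction easy to estimate. The following Python sketch evaluates these formulas; the dictionary keys and the function name are chosen for this example only.

```python
# (fixed cycles, cycles per iteration), transcribed from Table III.
SIMD_CYCLES = {
    "load_vector": (5, 3),
    "store_vector": (5, 3),
    "exec_vector": (6, 2),
    "route": (9, 0),
    "set_iter": (2, 0),
}


def simd_cycles(instruction: str, iters: int = 1) -> int:
    """Clock cycles taken by one SIMD instruction for a given iter value."""
    if iters < 1:
        raise ValueError("iter must be at least 1")
    fixed, per_iteration = SIMD_CYCLES[instruction]
    return fixed + per_iteration * iters
```

With iter set to 8, a load vector takes 5 + 3*8 = 29 cycles and an exec. vector takes 6 + 2*8 = 22 cycles, independently of the number of Processing Nodes.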

D. Intra-PE Communications

It is easy to implement a SIMD with communication autonomy [6] by having the LM input multiplexers (Fig. 2) take their inputs from all outputs of all PEs. Of course, this results in very large multiplexers and increased area usage. If an on-chip network is formed with routers in between the LMs, these multiplexers can be removed. The slow but small method implemented in the SIxD employs neither of these, but uses a pre-defined look-up table that contains encoded direction information for the movement of data between the LMs. The table consists of encoded bits of source and destination PEs, requiring no extra hardware and only a few slices. For example, if the user defines "101" as a transfer from LM1 to LM4 at compile time, using this bit sequence in the "route" instruction, along with the source LM address and the destination LM address, results in a data movement between these two Processing Nodes.
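The look-up-table routing described above can be sketched as follows. Only the "101" entry (LM1 to LM4) comes from the text; the container types and the names `ROUTE_TABLE` and `route` are assumptions made for this illustration, and the Local Memories are modelled as plain Python lists.

```python
# Direction codes fixed at compile time. Only this entry is given in the
# text; a real table would hold one entry per permitted transfer.
ROUTE_TABLE = {0b101: (1, 4)}  # code -> (source LM, destination LM)


def route(lms, code, src_addr, dst_addr, table=ROUTE_TABLE):
    """Copy one scalar between two Local Memories selected by `code`."""
    src_lm, dst_lm = table[code]
    lms[dst_lm][dst_addr] = lms[src_lm][src_addr]


# Four Local Memories of four words each, numbered LM1..LM4 as in the text.
lms = {n: [0] * 4 for n in range(1, 5)}
lms[1][2] = 7
route(lms, 0b101, src_addr=2, dst_addr=0)
```

After the call, location 0 of LM4 holds the value read from location 2 of LM1. Because the table is fixed at compile time, only the transfers it lists are available to the program, which is the price paid for avoiding the large multiplexers.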

The size of the SIMD is completely dependent on the application, since the number and the contents of the Processing Elements determine the size of the SIMD mode of the SIxD core. Section III gives better insight into the utilization and performance of the core with an implementation of MPEG-7 Motion Activity Descriptors on the SIxD.

III. IMPLEMENTATION EXAMPLE: THE MPEG-7 MOTION ACTIVITY DESCRIPTORS

A detailed explanation can be found in [7]; the MPEG-7 Motion Activity Descriptors (MAD) considered here are for motion intensity and spatial distribution of motion activity. The algorithm can be summarized as follows: In a set of motion vectors, first the hypotenuses of 8-bit input pairs are calculated, and the average of the results is found. Then, the lower-than-average values of the vectors are set to zero, resulting in the spatial activity matrix, and another averaging of this matrix gives the intensity of motion for the frame. The descriptor implementation, which works with pre-calculated motion vectors, has been applied to 16x16 motion vectors, requiring two sets of 256 elements each. Although the inputs are 8 bits wide, a 16-bit CPU was used because of the multiplication and averaging involved. For the sake of processing comparison, the data are loaded from the input ports beforehand and are available in Data Memory.

Figure 2: The SIMD unit.

Fig. 3 shows how the PE operation bits were used to manage the hypotenuse unit, the compare unit, and the accumulator. In the algorithm, first the hypotenuses are calculated in parallel, while the outputs are collected and added by the accumulator. Then, the final accumulator results are gathered by the SISD, added and shifted to compute the average, and sent back to the PNs of the SIMD. Afterwards, the compare instruction comes into play, comparing the data elements with the average, outputting zero if less than the average, or the input data itself if greater than or equal to the average, forming the spatial activity matrix. It also accumulates its results, which are finally saved, added and scaled together for the final result, the intensity of motion for the frame.
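The two-pass computation described above can be written as a short scalar reference model. This is a sketch and not the code that runs on the SIxD: it assumes an integer (floor) square root for the hypotenuse and floor division for both averages, which corresponds to the shift-based averaging in the text when the number of vectors is a power of two. The function name is hypothetical.

```python
from math import isqrt


def motion_activity(dx, dy):
    """Return (spatial activity matrix, intensity) for paired vector components."""
    if len(dx) != len(dy) or not dx:
        raise ValueError("dx and dy must be non-empty and of equal length")
    n = len(dx)
    # Pass 1: hypotenuse of each input pair, accumulated for the average.
    # ASSUMPTION: fixed-point arithmetic modelled with floor operations.
    magnitudes = [isqrt(x * x + y * y) for x, y in zip(dx, dy)]
    average = sum(magnitudes) // n
    # Pass 2: zero the values below the average, keep the others unchanged.
    spatial = [m if m >= average else 0 for m in magnitudes]
    # Averaging the spatial activity matrix gives the intensity of motion.
    intensity = sum(spatial) // n
    return spatial, intensity


spatial, intensity = motion_activity([3, 1, 6, 0], [4, 0, 8, 0])
```

For this four-element input the magnitudes are 5, 1, 10 and 0 with an average of 4, so the spatial activity values are 5, 0, 10 and 0 and the intensity is 3. In the SIMD version, each Processing Node would run the two passes on its own share of the 256 elements, with the SISD combining the partial sums between the passes.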

There are many ways to implement this application, which is dominated by hypotenuse (square root) and accumulate operations. On the SISD, an 8-bit square root calculation takes about 280 clock cycles. A custom unit designed for this calculation and placed in opcode "0100" takes 5 cycles and 55 slices to compute the square root. With this unit, the whole algorithm runs for a total of 21,286 clock cycles. Another method is to use a specially designed hypotenuse unit, taking 76 slices and 5 clock cycles to execute. Using this unit, two MUL, one ADD, and one SQRT operation are combined into one HYP instruction, with the total MAD runtime taking 19,413 clock cycles.

Finally, enabling the SIMD and using vector instructions on specially designed Processing Elements executing in two clock cycles (Fig. 3) and occupying 111 slices, the results are shown in Table IV.

TABLE IV. THE MPEG-7 MOTION ACTIVITY DESCRIPTOR ON THE SIxD

No. of PNs   Iter   Total slices   SIMD slices   Freq. (MHz)   Clock cycles   Power cons. (mW)
2            128    963            475           109           1833           521
4            64     1205           717           94            1196           548
8            32     1647           1163          94            882            542
16           16     2523           2049          89            734            566
32           8      4266           3800          94            678            606
64           4      7754           7288          83            686            647

When the chosen number of Processing Nodes of the SIxD is excessive for the algorithm, the design not only takes up too much space, but is also slower and inefficient: experiments done with 128 and 256 PEs resulted in 14,780 and 28,720 total slices and 762 and 944 clock cycles to execute, respectively. This is due to the fact that the data transfer starts taking more time than the processing. The ideal number of PNs is 16 or 32 for the Motion Activity Descriptor on a 16x16 motion vector frame, depending on whether area or time is the constraint. Furthermore, the 256-PE SIMD does not fit even the biggest Xilinx Virtex-II chip (8M gates), because it occupies 259 Block RAMs and 513 MUL units.
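The trade-off between area and time can be made concrete by deriving speedup and execution time from the figures of Table IV. The Python sketch below takes the 19,413-cycle run with the HYP instruction on the SISD as the baseline; the choice of baseline and the function names are assumptions of this example, and only cycle counts are compared because the clock frequency of the SISD-only configuration is not reported here.

```python
HYP_SISD_CYCLES = 19_413  # MAD runtime on the SISD with the HYP instruction

# No. of PNs -> (frequency in MHz, clock cycles), transcribed from Table IV.
TABLE_IV = {
    2: (109, 1833),
    4: (94, 1196),
    8: (94, 882),
    16: (89, 734),
    32: (94, 678),
    64: (83, 686),
}


def speedup(num_pn: int) -> float:
    """Cycle-count speedup over the HYP-based SISD run."""
    return HYP_SISD_CYCLES / TABLE_IV[num_pn][1]


def runtime_us(num_pn: int) -> float:
    """Execution time in microseconds: cycles divided by frequency in MHz."""
    freq_mhz, cycles = TABLE_IV[num_pn]
    return cycles / freq_mhz


# Configuration with the fewest clock cycles.
best = min(TABLE_IV, key=lambda pn: TABLE_IV[pn][1])
```

Under these assumptions, the 32-PN configuration gives the fewest cycles, with a speedup of about 28.6 over the baseline, while going from 32 to 64 PNs increases both the cycle count and the execution time.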

IV. CONCLUSIONS

In this study, we have implemented a flexible application-specific soft processor that can be configured to run in scalar mode and fit in small FPGAs, or run with vector processing (SIMD) capability for higher performance on larger chips. We have demonstrated an implementation of the core with the MPEG-7 Motion Activity Descriptors, showing how the core might be configured for up to 28-fold speedup in algorithm run time using SIMD customizations. In the future, the core's flexibility is to be further tested with more complex algorithms.

REFERENCES

[1] MicroBlaze Processor Reference Guide, Xilinx Inc. http://www.xilinx.com/ise/embedded/mb_ref_guide.pdf.

[2] Nios II Embedded Processor, Altera Inc. http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf.

[3] R. E. Gonzalez, "Xtensa: A Configurable and Extensible Processor," IEEE Micro, vol. 20, no. 2, pp. 60-70, Mar. 2000.

[4] Freedom CPU, http://f-cpu.seul.org/.

[5] D. Etiemble and L. Lacassagne, "Introducing image processing and SIMD computations with FPGA soft-cores and customized instructions," in 1st International Workshop on Reconfigurable Computing Education, Karlsruhe, Germany, March 2006.

[6] P. J. Narayanan, "Processor Autonomy on SIMD Architectures," in Proc. the ACM International Conference on Supercomputing, pp. 127-136, July 1993.

[7] A. Savakis, P. Sniatala, and R. Rudnicki, "Real-time video annotation using MPEG-7 motion activity descriptors," Conference MIXDES 2003, Lodz, June 26-28, 2003.

Figure 3: The Processing Node for the MPEG-7 Motion Activity Descriptor and the PE operation table.

PE operation table: 001 Hypotenuse; 010 Output accumulator; 011 Hypotenuse + accumulate; 110 Compare + accumulate.
