16
CSX PROCESSOR ARCHITECTURE www.clearspeed.com Abstract This paper describes the architecture of the CSX family of processors based on ClearSpeed’s multi-threaded array processor; a high performance, power-efficient, scalable, data-parallel processor. The CSX processors have a simple programming model and are supported by a suite of software development tools, built around a C compiler, graphical debugger and visual profiling tools. High performance standard math libraries are also available. Developed for applications with demand- ing performance requirements, the processor can be used as an application accelerator for high performance computing (HPC ) and also as an embedded processor for Digital Signal Processing (DSP ). The use of the add-in CSX600-based Advance TM Board as an application coprocessor is also described. This PCI form factor board provides over 50 GFLOPS DGEMM sustained performance while typically consuming only 25 watts. CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTURE

CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

CSX PROCESSOR ARCHITECTURE

www.clearspeed.com

AbstractThis paper describes the architecture of theCSX family of processors based onClearSpeed’s multi-threaded array processor;a high performance, power-efficient, scalable,data-parallel processor.

The CSX processors have a simple programming model and are supported bya suite of software development tools, built around a C compiler, graphicaldebugger and visual profiling tools. High performance standard math librariesare also available.

Developed for applications with demand-ing performance requirements, theprocessor can be used as an applicationaccelerator for high performance computing (HPC ) and also as an embedded processor for Digital SignalProcessing (DSP ).

The use of the add-in CSX600-basedAdvanceTM Board as an applicationcoprocessor is also described. This PCIform factor board provides over 50GFLOPS DGEMM sustained performancewhile typically consuming only 25 watts.

CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTURE

Page 2: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

IntroductionThis whitepaper describes ClearSpeed’sCSX family of coprocessors which arebased around a multi-threaded arrayprocessor (MTAP) core. This processingarchitecture has been developed toaddress a number of problems in high performance, high data rate processing.

CSX processors can be used as applicationaccelerators, alongside a standard processorsuch as those from Intel or AMD.

The architecture can be utilized with anyapplication with significant data parallelism:

nn Fine Grained e.g. vector operationsnn Medium Grained e.g. unrolled

independent loopsnn Coarse Grained e.g. multiple

simultaneous data channels

The CSX600-based ClearSpeed Advanceboard is used as an application acceleratorfor high performance computing (HPC ). Itworks with standard processors (from Intelor AMD) to share the compute intensiveparts of an application. By acceleratingstandard software libraries used by a numberof applications, the ClearSpeed Advanceboard allows these applications to beaccelerated “out of the box.”

2

FIGURE 2: PROCESSOR EVOLUTION

FIGURE 1: CLEARSPEED ADVANCE BOARD

The ClearSpeed Advance acceleratorboard can be used to accelerate a singleworkstation, server or an entire cluster;multiple boards may be used in one computer.

CSX FamilyThe CSX processors are designed as application accelerators and embeddedprocessors. They provide a completeSystem on Chip (SoC ) solution: integratingan MTAP processor with all the necessaryinterfaces and support logic.

The first product in the CSX family is the CSX600. This is optimized for high performance, floating point intensive applications. It includes an MTAP processor,DDR2 DRAM interface, on-chip SRAM andhigh speed inter-processor I/O ports.

Page 3: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

MTAP ArchitectureThe MTAP architecture defines a family ofembedded processors with parallel dataprocessing capability. Figure 1 compares astandard processor and the multi-threadedarray processor. As can be seen, the MTAPprocessor has a standard, RISC-like, control unit with instruction fetch, cachesand I/O mechanisms. This is coupled to ahighly parallel execution unit which provides the performance and scalabilityof the architecture.

To simplify the integration of the processorinto a variety of systems, the processorcan be configured at boot time to be big orlittle-endian.

Control UnitThe control unit fetches, decodes and dispatches instructions to the executionunits. The processor executes a fairly standard, three operand instruction set.

The control unit also provides hardwaresupport for multi-threaded execution,allowing fast swapping between multiplethreads. The threads are prioritized andare intended primarily to support efficientoverlap of I/O and compute. This can beused to hide the latency of external data accesses.

The processor also includes instructionand data caches to minimize the latency ofexternal memory accesses.

The control unit also includes a controlport which is used for initializing anddebugging the processor; it supportsbreakpoints, single stepping and theexamination of internal state.

Execution UnitsThe execution unit uses a number of polyexecution (PE) cores. This allows it toprocess data elements in parallel. Each PEcore consists of at least one ALU, registers,memory and I/O. The CSX600 alsoincludes an integer MAC and a dual floatingpoint unit (FPU) on every PE core.

The execution unit can be thought of as two, largely independent, parts (see Figure 4)

One part forms the mono execution unit;this is dedicated to processing mono (i.e. scalar or non-parallel) data. The monoexecution unit also handles program flowcontrol such as branching and threadswitching. The rest of the PE cores formthe poly execution unit which processesparallel (poly) data.

The poly execution unit may consist oftens, hundreds or even thousands of PEcores. This array of PE cores operates in asynchronous manner, similar to a SingleInstruction, Multiple Data (SIMD) processor,where every PE core executes the sameinstruction on its piece of data. Each PEcore also has its own independent localmemory; this provides fast access to thedata being processed. For example, onePE core at 250 MHz has a memory band-width of 1 Gbytes/s. An array with 96 suchPE cores has an aggregate bandwidth ofapproximately 100 Gbytes/s with singlecycle latency.

3

CSX PROCESSOR ARCHITECTURE

FIGURE 4: POLY EXECUTION UNIT

FIGURE 3: CLEARSPEED CSX600 COPROCESSOR

ON THE ADVANCE BOARD

Page 4: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

Programming ModelThe control unit fetches, decodes and dispatches instructions to the executionunits. The processor executes a fairly standard, three operand instruction set.

FIGURE 5: INSTRUCTION EXECUTION

From a programmer’s perspective, theprocessor appears as a single processorrunning a single C program. This is verydifferent from some other parallel processingmodels where the programmer has toexplicitly program multiple independentprocessor cores, or can only access theprocessor via function calls or some otherindirect mechanism.

The processor executes a single instructionstream; each instruction is sent to one ofthe functional units: this may be in themono or poly execution unit, or one of the I/O controllers. The processor can dispatch an instruction on every cycle.

For multi-cycle instructions, the operationof the functional units can be overlapped.So, for example, an I/O operation can bestarted with one instruction and on the nextcycle the mono execution unit could start amultiply instruction (which requires severalcycles to complete). While these operationsare proceeding, the poly execution unitcan continue to execute instructions.

The main change from programming a standard processor is the concept ofoperating on parallel data. Data to beprocessed is assigned to variables whichhave an instance on every PE core and areoperated on in parallel: we call this polydata. This can be thought of as a data vectorwhich is distributed across the PE core array.

Variables which only need to have a singleinstance (e.g. loop control variables) areknown as mono variables, they behaveexactly like normal variables on a sequential processor.

ClearSpeed provides a compiler whichuses a simple extension to standard C toidentify data which is to be processed inparallel. The new keyword poly is used in adeclaration to define data which exists, andis processed, independently on every PEcore in the poly execution unit.

Example

To give a feel for the way the processor isprogrammed, we use a simple example ofevaluating the sine function.

#include <stdio.h>#include <math.h>

#define PI 3.1415926535897932384f#define SAMPLES 96

int main() {double sine, angle;int i;

for (i = 0; i < SAMPLES; i++) {// convert to an angle in

range 0 to Piangle = i * PI / SAMPLES;// calculate sine of anglesine = sin(angle);}

}

FIGURE 6: STANDARD C

4

Page 5: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

#include <lib_ext.h>

#define PI 3.1415926535897932384f#define SAMPLES 96

int main() {poly double sine, angle;poly int i;

// get PE number: 0...n-1i = get_penum();// convert to an angle in range 0

to Piangle = i * PI / SAMPLES;// calculate all sine values

simultaneouslysine = sinp(angle);

}

FIGURE 7: PARALLEL C

Figure 6 shows a loop in standard C whichiterates to calculate a number of values ofthe sine function.

Figure 7 shows an equivalent piece ofcode which simultaneously calculates 96different values of the sine function acrossthe PE core array.

This code uses the library functionget_penum() to get a unique value in thevariable pe on each PE core. This is scaledto give a range of values for anglebetween 0 and pi across the PE cores.

Finally, the library function sinp() is calledto calculate the sine of these values on allPE cores simultaneously. The sinpfunction is the poly equivalent of the standard sine function; it takes a poly argument and returns the appropriatevalue on every PE core.

The result of running this program on a 96-PE core processor is shown graphicallyin Figure 8.

5

CSX PROCESSOR ARCHITECTURE

FIGURE 8: RESULTS FROM EXAMPLE PROGRAM

Page 6: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

InterfacesThe processor has a number of interfacesto the ClearConnect NoC. There are threecategories of interface:

Mono data & instructions: This interface isused for mono loads and stores, and forinstruction fetching.

Poly data: There are one or more interfacesfor I/O and data. The number of physicalinterfaces corresponds to the number ofI/O channels implemented.

Control: The control interface is used forinitialization and debug, and to allow theprocessor to generate interrupts on thehost system.

Instruction setThe processor has a fairly standard RISC-like instruction set. Most instructions canoperate on mono or poly operands and areexecuted by the appropriate executionunit. Some instructions are only relevant toeither the mono or poly execution unit, forexample all program flow control is handled by the mono unit.

The instruction set provides a standard set of functions on both mono and polyexecution units:

nn Integer and floating point adds, subtracts, multiplies, divides

nn Logical operations: and, or, not, exclusive-or, etc.

nn Arithmetic and logical shiftsnn Data comparisons: equal, not equal,

greater than, less than, etc.nn Data movement between registersnn Loads and stores between registers

and memory

To give a feel for the nature of assemblycode for the processor, a few lines of codeare shown below. This simple exampleloads a value from memory into a monoregister, gets the PE core number into aregister on each PE core and then addsthese two values together producing a different result on every PE core.

ld 0:m4, 0x3000 // load mono reg0 from mem

penum 8:p4 // PE core number intopoly reg 8

add 4:p4, 0:m4, 8:p4 // add;result in reg 4

6

I Cache Peripheral NetworkControl/Debug

Semaphore Unit

Poly Execution Unit

PIO Collection/Distribution

Thread Control

MonoExec Unit

Poly Scoreboard

PolyMCodedControl

Poly LS Control

Poly PIOControl

System Network

System Network

D Cache

PIO

SRAM

Reg File

FP-M

UL

FP-A

DD

DIV

/SQ

RT

ALU

MA

C

PIO

SRAM

Reg File

FP-M

UL

FP-A

DD

DIV

/SQ

RT

ALU

MA

C

PIO

SRAM

Reg File

FP-M

UL

FP-A

DD

DIV

/SQ

RT

ALU

MA

C

FIGURE 9: MULTI-THREADED ARRAY PROCESSOR

Architecture DetailsThe following sections describe the multi-threaded array processor in more detail. A block diagram of the processor is shownin Figure 9. This is a conceptual view, which reflects the programming model. In practice, the control and mono executionunits form a tightly-coupled pipeline whichprovides overall control of the processor.

Page 7: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

Control UnitThe control unit fetches instructions frommemory, decodes them and dispatchesthem to the appropriate functional unit.

The controller includes a scheduler to providehardware support for multi-threaded code.This is a vital part of the architecture forachieving the performance potential of theprocessor. Because of the highly parallelarchitecture of the poly execution unit,there can be significant latencies if all PEcores need to read or write external data.

When part of an application stalls becauseit is waiting for data from external memory,the processor can switch to another codethread that is ready to run. This serves tohide the latency of accesses and keep theprocessor busy.

The threads are prioritized: the processorwill run the highest priority thread that isready to run; a higher priority thread canpre-empt a lower priority thread when it becomes ready to run. Threads are synchronized (with each other and withhardware such as I/O engines) via hardware semaphores.

In the simplest case of multi-threadedcode, a program would have two threads:one for I/O and one for compute. By pre-fetching data in the I/O thread, the programmer (or the compiler) can ensurethe data is available when it is required bythe execution units and the processor canrun without stalling.

Execution UnitThis section describes the commonaspects of the poly and mono executionunits.

ALU OperationsInstructions for arithmetic and logical operations are provided in several versions:

nn Various sizes: 1, 2, 4 and 8 bytes;nn Various data types: signed and

unsigned integers, and single and double-precision floating point;

nn Hardware support for floating point and DSP functions.

Status RegisterAssociated with the ALU is a status register;this contains five status bits that provideinformation about the result of the last ALUoperation. When set, these bits indicate:

nn Most significant bit setnn Carry generatednn Overflow generatednn Negative resultnn Zero result

RegisterTo support operations on data of differentwidths, the registers in the PE cores can beaccessed very flexibly. The register files arebest thought of as an array of bytes whichcan be addressed as registers of 1 to 8bytes wide.

The mono and poly registers areaddressed in a consistent way using byteaddresses and widths specified in bytes.The mono register file is 16 bits wide andso all addresses and widths must be a multiple of 2 bytes. There are no alignmentrestrictions on poly register accesses.

7

CSX PROCESSOR ARCHITECTURE

FIGURE 10: POLY-DIE CHIP

Page 8: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

Addressing modesLoad and store instructions are used to transfer data between memory and registers. In the case of the mono executionunit, these transfer data to and from memory external to the processor.

Poly loads and stores transfer databetween the PE core register file and PEcore memory. Data is transferred betweenthe PE cores and external memory usingthe I/O functions described later.

There are three addressing modes forloads and stores which can be used forboth mono and poly data.These are:

Direct: The address to be read/written isspecified as an immediate value.

Indirect: The address is specified in a register.

Indexed: The address is calculated fromadding an offset to a base address in aregister. The offset must be an immediatevalue.

Conditional codeThe main difference between code for themono and the poly execution units is theway that conditional code is executed. Themono unit uses conditional jumps tobranch around code, typically based onthe result of the previous instructions. Thismeans that mono conditions affect bothmono and poly operations. This is just likea standard RISC architecture.

The poly unit uses a set of enable bits(described in more detail below) to controlwhether each PE core will have its statechanged by any instructions it executes.This provides per-PE core predicated operation. The enable state can be changedbased on the result of a previous operation.

The following sections describe the architectural features of the two executionunits in more detail.

Mono execution unitAs well as handling mono data, the monounit is responsible for program flow control(branching), thread switching and othercontrol functions. The mono execution unitalso has overall control of I/O operations.Results from these operations are returnedto registers in the mono unit.

Conditional executionThe mono execution unit handles conditionalexecution in the same way as a traditionalprocessor. A set of conditional and unconditional jump instructions use theresult of previous operations to jump overconditional code, back to the start of loops, etc.

Multi-threaded executionThe processor supports several hardwarethreads. There is a hardware scheduler inthe control unit and the mono executionunit maintains multiple banks of critical registers for fast context switching.

The threads are prioritized (0 being highestpriority). Control of execution between thethreads is performed using semaphoresunder programmer control. Higher prioritythreads will only yield to lower priority threadswhen they are stalled on yielding instructions(such as semaphore wait operations).Lower priority threads can be pre-emptedat any time by higher priority ones.

Semaphores are special registers that canbe incremented or decremented withatomic (non-interruptible) operationscalled signal and wait. A signal instructionwill increment a semaphore. A wait willdecrement a semaphore unless the semaphore is 0, in which case it will stalluntil the semaphore is signalled by anotherthread. Semaphores can also be accessedby hardware units (such as the I/O controllers)to synchronize these with software.

8

Page 9: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

Poly Execution UnitThe poly execution unit is an array of PolyExecution (PE) cores. Each PE core in thepoly execution unit is similar to a VLIWprocessor and consists of:

nn Multiple function units such as:ll Floating point unit (FPU)ll Integer multiply-accumulate (MAC)

unitll Integer arithmetic/logic unit (ALU)

nn A register filenn Status and enable registersnn A block of fast, private memorynn An inter-PE core communication pathnn One or more I/O channels

Load and store instructions move databetween a PE core register file and PEcore memory, while the function unitsoperate on data in the register file. Data istransferred in to, and out of, the PE coresmemory using I/O instructions.

The multiple functional units can operateconcurrently so that, for example, an integeroperation can be performed simultaneouslywith a floating point operation.

Conditional behaviorThe SIMD nature of the PE core array prohibits each PE core having its ownbranch unit (branching being handled bythe mono execution unit). Instead, each PEcore can control whether its state shouldbe updated by the current instruction byenabling or disabling itself; this is ratherlike the predicated instructions in someRISC CPUs.

Enable stateA PE core's enable state is determined bya number of bits in the enable register. If allthese bits are set to one, then a PE core isenabled and executes instructions normally.If one or more of the enable bits is zero,then the PE core is disabled and mostinstructions it receives will be ignored(instructions on the enable state itself, forexample, are not disabled).

The enable register is treated as a stack,and new bits can be pushed onto the topof the stack allowing nested predicatedexecution. The result of a test, either a 1 ora 0, is pushed onto the enable stack. Thisbit can later be popped from the top of thestack to remove the effect of that condition.This makes handling nested conditionsand loops very efficient. Note that,although the enable stack is of fixed size,the compiler handles saving and restoringthe state automatically, so there are no limitations on compiled code. When programming at the assembler level, it isthe programmer’s responsibility to managethe enable stack.

InstructionsConditional execution on the poly execu-tion unit is supported by a set of poly con-ditional instructions: if, else, endif,etc. These manage the enable bits to allowdifferent PE cores to execute each branchof an if...else construct in C, for example.These also support nested conditions bypushing and popping the condition valueon the enable stack.

As a simple example, consider the followingcode fragment:

// disable all PEs where reg 32 isnon-zero

if.eq 32:p1, 0 // pushing resultonto stack

// increment reg 8 on enabled PEs

add 8:p4, 8:p4, 1

// return all PEs to originalenable state

endif // pop enable stack

Here, the initial if instruction comparesthe two operands on each PE core. If theyare equal it pushes 1 onto the top of theenable stack. This leaves those PE coresenabled if they were previously enabledand disabled if they were already disabled.If the two operands are not equal, a 0 ispushed onto the stack, this disables thecorresponding PE cores.

9

CSX PROCESSOR ARCHITECTURE

Page 10: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

The following add instruction is sent to allPE cores, but only acted on by those thatare still enabled. Finally, the endifinstruction pops the enable stack, returningall PE cores to their original enable state.

Normally, none of this is visible to the programmer writing code in C: conditionalcode just does what’s expected.

Forced loads and storesPoly loads and stores are normally predicated by the enable state of the PEcore. However, because there are instanceswhere it is necessary to load and storedata regardless of the current enable state,the instruction set includes forced loadsand stores. These will change the state ofthe PE core even if it is disabled.

I/O mechanismsI/O instructions transfer data between PEcore memory and devices outside the CSXprocessor. Programmed I/O (PIO) extendsthe load/store model: it is used for transfersof small amounts of data between PE corememory and external memory.

I/O architectureThe I/O systems consist of three parts:Controller, Engine and Node.Controller: The PIO controller decodes I/Oinstructions and coordinates with the restof the control unit and the mono processor.The controller synchronizes with softwarethreads via semaphores.Engine: The I/O engines are basically DMAengines which manage the actual datatransfer. There is a Controller and Engine for eachI/O channel. A single Controller can manageseveral I/O Engines.Node: There is an I/O Node in each PEcore. The I/O Engine activates each Nodein turn allowing to serialize the data transfers.The Nodes provide buffering of data tominimize the impact of I/O on the performance of the PE cores.

PIO is closely coupled to program executionand is used to transfer data to and from theoutside world (e.g. external memory,peripherals or the host processor).

The PIO mechanism provides a number ofaddressing modes:Direct addressed: Each PE core providesan external memory address for its data.This provides random access to data.Strided: The external memory address isincremented for each PE core.In each case, the size of data transferred toeach PE core is the same.

When multiple PE cores are accessingmemory then the transfers can be consolidated so as to perform the minimumnumber of external accesses. So, forexample, if half the processors are readingone location and the other half readinganother, then only two memory readswould be performed. In fact, consolidationcan be better than that: because the bus transfers are packetized, even transfers from nearby addresses can beeffectively consolidated.

SwazzleFinally, the PE cores are able to communicatewith one another via what is known as theswazzle path that connects the register fileof each PE core with the register files of itsleft and right neighbors. On each cycle, PE coren can perform a register-to-register transfer to either its left or right neighbor, PE coren-1 or PE coren+1, while simultaneously receiving data from theother neighbor.

Instructions are provided to shift data left orright through the array, and to swap databetween adjacent PE cores.

The enable state of a PE core affects itsparticipation in a swazzle operation in thefollowing way: if a PE core is enabled, then itsregisters may be updated by a neighboringPE core, regardless of the neighboring PE core’s enable state. Conversely, if a PEcore is disabled, its register file will not be altered by a neighbor under any circumstance. A disabled PE core will stillprovide data to an enabled neighbor.

The data written into the registers of the PEcores at the ends of the swazzle path canbe set by the mono execution unit.Alternatively, the two ends can be connectedto form a circular path.

10

Page 11: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

Host InterfaceThe interfaces to the host system are usedfor three basic purposes: initialization andbooting, access to host services anddebugging.

InitializationThere are a number of stages of initializationrequired to start code running on theprocessor. These are normally handledtransparently by the development tools,but an overview is provided here as back-ground information.

First, the application code (including boot-strap) is loaded into memory.

Next, the host system initializes the state ofthe control unit, caches and mono executionunit by a series of writes to the host/debugport. The last of these specify the startaddress of the code to execute and tell theprocessor to start fetching instructions.

Finally, the boot code does any remaininginitialization of the processor including thePE cores (e.g. setting the PE core numberingbefore running the application code.

Host servicesOnce the application program is running itwill need to access host resources, suchas the file system. To support this, a protocolis defined between the run-time librariesand a device driver on the host system.This is interrupt based and allows the coderunning on the processor to make calls toan application or the operating system running on the host.

DebuggingThe processor includes hardware supportfor breakpoints and single-stepping. Theseare controlled via registers accessedthrough the control interface.

This interface also allows the debugger,running on the host, to interrogate andupdate the state of the processor to supportfully interactive debugging.

Case StudyTo illustrate the use of the multi-threadedarray processor in a real device, the architecture of the CSX600 is described inthis section.

The CSX600 is fabricated on a 0.13µprocess and runs at clock speeds between200 MHz and 250 MHz.

CSX600 architecture

Processor coreThe MTAP processor in the CSX600 hasthe following specification:Generalnn 8 Kbyte instruction cache: 4-way,

512 lines x 4 instructions, with manual and auto pre-fetch

nn 4 Kbyte data cache: 4-way, 256 lines x 16 bytes

11

CSX PROCESSOR ARCHITECTURE

FIGURE 11: CSX600 SYSTEM-ON-CHIP (SOC)ARCHITECTURE

Page 12: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

Mono execution unitnn 64 bit FPUnn 64 byte register filenn Support for 8 threads

Poly execution unitnn Array of 96 PE cores nn Superscalar 64 bit FPU on each PE corenn 16 bit MAC per PE corenn 6 Kbytes SRAM per PE corenn 128 byte register file per PE core

Performancenn Over 25 GFLOPS single or

double precisionnn 25,000 MIPSnn 25 GMAC/s integer multiply-accumulatenn 96 Gbytes/s internal memory bandwidth

Memory systemExternal memory is connected via a 64 bitDDR2 SDRAM interface. When used with72 bit wide DRAM modules this providesError Checking and Correction (ECC).Each processor supports up to 4 Gbytes of local DRAM.

The processor supports 64 bit addressingso that large data sets can be processed.The 64 bit address space is flexiblymapped into a 48 bit physical addressspace. For embedded systems and backward compatibility, a simple 32 bitaddressing mode is provided.

On-chip SRAM is included for fast accessto frequently accessed code and data.

The integrated DMA controller can be programmed to transfer data to and fromthe external memory interface and anyother device on the ClearConnectNetwork-on-Chip (NoC, whether on thesame chip or elsewhere in the system).

Bridge portsThe NoC is made available at two portswhich can be interconnected with no gluelogic to construct multi-processor systems.

This enables system performance to be scaled to meet the requirements of the application.

These ports use double data rate interfaces to minimize the pin count. If notfully used, they can be selectively turnedoff to reduce power consumption.

Data can be transferred directly from onebridge port to the other without impactingother transactions on the ClearConnectNoC. This allows efficient multi-processorsystems to be constructed.

The bridge ports can also be used to connect to a Field Programmable GateArray (FPGA) to provide other functions and interfaces.

Interrupt and Semaphone Unit(ISU)The Interrupt and Semaphore Unit supports the synchronization of threadswith external events. Both pin and messagesignaled interrupts are supported for flexiblesupport of multiple devices in various host environments.

Host/debug port (HDP)A host interface allows the CSX600 to communicate with, and be controlled by,the system’s host processor. This port canalso be used as a hardware and softwaredebug port as it provides full access to allthe internal registers on the device.

ClearConnect® NoCThe various subsystems are interconnectedusing the ClearConnect on-chip network tointerface the subsystems on the chip. This supports multiple independent datatransfers; for example, the processor canaccess data in the on-chip SRAM at thesame time as data is transferred to theDDR2 interface from one of the bridge ports.

12

Page 13: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

ApplicationsHigh Performance ComputingClearSpeed’s CSX processors are used asapplication accelerators for HPC, deliveringperformance that unlocks the developmentof completely new algorithms for applicationssuch as financial modeling and a host ofscientific research disciplines. Because ofits low power consumption, the technologycan be applied to systems from workstations to clusters delivering compute performance beyond the limits of standard systems.

Application AccelerationIn HPC systems, the CSX600 is typicallyused as a coprocessor to accelerate theinner loops of an application. The CSX600-based ClearSpeed Advance acceleratorboard is shown in Figure 12.

The use of the CSX600 as a coprocessor isillustrated in Figure 13. Here an application isrunning on the host system. Key functions of the application are ported tothe CSX600 and the main program effectively calls the code on the CSX600 via a communication library. This sends the data to be processed from host memory to the ClearSpeed Advance board andwaits for the results. By pipelining multiple operations, it is possible to overlap thecommunication with computation.

Libraries will be provided to allow the codeon the coprocessor to be called from C,C++ or FORTRAN.

13

CSX PROCESSOR ARCHITECTURE

FIGURE 12: CLEARSPEED ADVANCE BOARD

Host OS

General purpose CPUe.g. AMD, Intel

CSX600 codefor ClearSpeed

Advance™ Accelerator

Sof

twar

eH

ardw

are

Shared library APIse.g. BLAS, LAPACK etc

Host Application

User's View

Shared librariesfor CPU

e.g. ACML, MKL

CSXL Shared library runtime suite for

Advance™ accelerator

Device driver

FIGURE 13: CLEARSPEED ADVANCETM APPLICATION ACCELERATION

Page 14: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

Example: Drug DockingThe developing field of in-silico drug discovery uses computer simulation ratherthan the ‘wet science’ of test tubes toexplore the behavior of potential new drugs.

Bristol University’s Department ofBiochemistry has produced their own software, called Dockit, for this application.This FORTRAN code performs molecularsimulation of the interactions of molecules.For example, the docking of a ligand molecule (drug) into a protein.

This simulation requires the calculation ofthe interaction energies of all atoms of onemolecule with all the atoms of the other.This process is repeated for thousands ofconfigurations of the molecules. This naturally results in a massively data-parallel problem ideally suited to the CSX processor architecture.

ClearSpeed worked with the team at BristolUniversity to identify the functions in theprogram that take the most compute time.These were then rewritten in C for theCSX600 and the original FORTRAN sub-routines replaced with code to communicatewith the CSX processor; passing the datato be processed and getting the results back.

On the CSX600, each PE core is allocateda different configuration of protein and ligand, and performs all of the atom-atominteraction energy calculations. The memoryavailable within a PE core is insufficient tohold the details of an entire molecule; however, the molecule can be split intopieces and each piece processed in turn.The atom data is effectively ‘streamed’through the PE cores. The fetching ofatoms can be efficiently overlapped withthe atom-atom processing. However, evenwith the floating point accelerated processor the task is compute bound soI/O bandwidth is not an issue.

The calculations currently performed in mostapplications of this sort are a simplistic

model of the interaction of atoms. The processing power that the CSX600 makesavailable will allow more sophisticatedmodels to be developed (e.g. taking intoaccount quantum effects), improving theaccuracy and thus value of the results.

Embedded systemsThe low power dissipation of the processoris also a key benefit in embedded systemswhere it can be used as a DSP in applicationssuch as defense (e.g. radar and sonar processing) and medical imaging.

CommunicationsThe array processor architecture is ideallysuited to packet processing for networkingsystems as well as portable and satellitecommunications.

Network processingIn this case, the data to be processed consists of a stream of packets of variablesize. Each packet consists of header information and a data payload. In its simplest form, the problem is to examinevarious fields in the header, do some sort oflookup function and route the packet to theappropriate output port. This processingmust be done in real time.

In the network processing application, anI/O channel can be used for continuouslystreaming packets into, and out of, PE corememory. Within each PE core, the softwaremaintains several buffers so that, while one set of packets is being processed, the previous set can be output and, simultaneously, the next set can be loaded.

PIO channels can be used by the PE coresto send requests to another subsystem thathandles lookups. This could on-chip or off-chip, using table lookup hardware or asolution such as Content AddressableMemory (CAM).

14

Page 15: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

Software development kitFor customers who need greater flexibilityor wish to port proprietary code, ClearSpeedprovides a set of software development tools.

The ClearSpeed Software Development Kit (SDK) is a suite of tools and librariesdesigned to enable the rapid developmentof application software for CSX processors.The tools are centered around a C compilerand the industry-standard GDB symbolicdebugger.

ConclusionThe twin demands of functionality and ever increasing data volumes demandpowerful yet flexible processing solutions.ClearSpeed’s CSX processors supply asolution to these next generation needs inthe form of a hardware and software platform that can be rapidly integratedwith new and existing industry standardsystems.

15

CSX PROCESSOR ARCHITECTURE

FIGURE 14: CLEARSPEED DEBUGGER

Page 16: CLEARSPEED WHITEPAPER: CSX PROCESSOR ARCHITECTUREjamiepacker.eu/docs/ClearSpeed_Architecture_Whitepaper_Feb07v2.… · CSX PROCESSOR ARCHITECTURE Abstract This paper describes the

CLEARSPEED TECHNOLOGY

www.clearspeed.com

Copyright 2007 ClearSpeed Technology plc (“ClearSpeed”).

All rights reserved.

All information in this document is provided only as general informationin connection with ClearSpeed products. Except as provided inClearSpeed’s terms and conditions of sale for such products,ClearSpeed assumes no liability whatsoever, and ClearSpeed disclaims any express or implied warranty relating to sale and/or useof ClearSpeed products, including liability or warranties relating tofitness for a particular purpose, merchantability, or infringement of any patent, copyright, or other intellectual property right.ClearSpeed may make changes to specifications, product descriptions, and plans at any time, without notice.

ClearSpeed, ClearConnect and Advance are trademarks or registered trademarks of ClearSpeed Technology plc or its groupcompanies. All other marks are the property of their respective owners.

PN-1110-0702