A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems

Andrea Cappelli
F. Campi, R. Guerrieri, A. Lodi, M. Toma, A. La Rosa, L. Lavagno, C. Passerone, R. Canegallo

Nice, France, April 22, 2003


Page 2: Outline

Motivations
XiRisc: a VLIW Processor
PiCoGA: A Pipelined Configurable Gate Array
Software Development Environment
Results & Measurements
Conclusions

Page 3: Motivations

Increased on-chip Transistor density

Increased Integration costs

Strong limitations in power supply

Severe power consumption constraints

[Figure: millions of transistors per chip and technology (nm) by year, 1997 to 2009]

Increased Algorithmic complexity

Quest for performance and

flexibility

[Figure: algorithm complexity, Moore's law and battery capacity by year, 1997 to 2009]

Page 4: Embedded systems algorithms analysis

90% of computational complexity is concentrated in small kernels covering small parts of the overall code

Many algorithms show significant instruction-level parallelism
Performance is improved by multiple parallel data paths

Operand granularity is typically different from 32-bit
A traditional ALU is power-inefficient

Significant improvements can be obtained by extending embedded processors with application-specific function units

Reconfigurable computing to achieve maximum flexibility
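As a rough illustration of why accelerating the small kernels pays off, Amdahl's law can be applied to the 90% figure above. This is only a sketch under stated assumptions: it treats the 90% share of computational complexity as a 90% share of execution time (f = 0.9) and uses a hypothetical kernel speed-up s = 10; neither mapping is stated on the slide.

```latex
S_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}}, \qquad
f = 0.9,\; s = 10 \;\Rightarrow\; S_{\text{overall}} = \frac{1}{0.1 + 0.09} \approx 5.3, \qquad
\lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1 - f} = 10
```

Under these assumptions the remaining 10% of the code bounds the overall gain at 10x, however fast the kernels become.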

Page 5: Existing Architectures

Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically remapped depending on the performed algorithm

1: Coprocessor model
2: Function unit model

Page 6

32-bit load/store RISC architecture (5-stage pipeline)

VLIW elaboration: concurrent fetch and execution of two 32-bit instructions per cycle

Extended instruction set RISC architecture: set of specialized function units implementing DSP-specific operations

Function unit approach: the reconfigurable device fits in a classical RISC pipeline
Low communication overhead
Exploits very high resource parallelism

Page 7: Architecture

Duplicated instruction decode logic (2 symmetrical data channels)

Duplicated commonly used function units (ALU and Shifter)

All other function units are shared (DSP operations, Memory handler)

A tightly coupled pipelined configurable Gate Array

Page 8: Dynamic Instruction Set Extension

pGA-load: specific operation to transfer data from a configuration cache to the PiCoGA
Fields: pGA-load, region specification, configuration specification

pGA-op: 32-bit and 64-bit operations to launch the execution inside the PiCoGA (data exchange through the register file)
32-bit fields: pGA-op, Source 1, Source 2, Dest 1, Dest 2, operation specification
64-bit fields: pGA-op, Source 1, Source 2, Source 3, Source 4, Dest 1, Dest 2, operation specification

Page 9: PiCoGA: a Pipelined Configurable Gate Array

Embedded function unit for dynamic extension of the Instruction Set

Two-dimensional array of LUT-based Reconfigurable Logic Cells
Each row implements a possible stage of a customized pipeline, independent and concurrent with the processor
Up to 4x32-bit input data and up to 2x32-bit output data from/to the register file

[Figure: PiCoGA]

Page 10: DFG-based elaboration

Row elaboration is activated by an embedded control unit
Execution enable signal for each pipeline stage

PiCoGA operation latency is dependent on the operation performed
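To make the link between a data-flow graph (DFG) and an operation-dependent latency concrete, here is a minimal sketch in C. It is a simplified model, not the PiCoGA's actual mapping or timing: it assumes each DFG node occupies the row just below its latest producer and that the number of rows used stands in for the latency. The names `DfgNode`, `DFG_INPUT` and `dfg_assign_rows` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DFG_INPUT (-1)   /* operand comes from the register file, not from another node */

/* Hypothetical DFG node with two operands; each operand is either
   DFG_INPUT or the index of an earlier node in the array. */
typedef struct {
    int pred[2];
} DfgNode;

/* Assigns each node to a row (1-based) and returns the number of rows used.
   Nodes must be listed in topological order (producers before consumers).
   Assumption: one DFG level per row; this is a model, not the real mapper. */
int dfg_assign_rows(const DfgNode *nodes, size_t n, int *row)
{
    int depth = 0;
    for (size_t i = 0; i < n; i++) {
        int r = 1;                            /* fed only by register inputs */
        for (int k = 0; k < 2; k++) {
            int p = nodes[i].pred[k];
            if (p == DFG_INPUT)
                continue;
            assert(p >= 0 && (size_t)p < i);  /* topological order required */
            if (row[p] + 1 > r)
                r = row[p] + 1;               /* one row below its latest producer */
        }
        row[i] = r;
        if (r > depth)
            depth = r;
    }
    return depth;
}
```

For a graph in which two independent nodes feed a third, which in turn feeds a fourth, the function places the nodes in rows 1, 1, 2 and 3 and returns 3, so a deeper graph yields a longer modeled latency.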

Page 11: PiCoGA Configuration

Goal: to reduce cache misses due to PiCoGA configuration

Multi-context programming (4 cache layers/planes inside the array)
Dedicated Configuration Cache with a high-bandwidth bus to the PiCoGA (192 bits)
Partial Run-Time Reconfiguration (a region is configured while another one is active)
Configuration is completely concurrent with processor elaboration

[Figure: Configuration Cache connected to the PiCoGA, which holds four configuration layers (Layer 1 to Layer 4)]
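The benefit of keeping four configuration layers inside the array can be sketched as a small bookkeeping model in C. This is an illustration under assumptions, not the device's control logic: the type `ContextStore`, the functions `ctx_init` and `ctx_activate`, and the round-robin replacement policy are all hypothetical, and a newly loaded configuration is simply treated as active at once (the concurrent, partial reconfiguration described above is not modeled).

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_LAYERS 4          /* from the slide: 4 layers/planes inside the array */
#define EMPTY (-1)

typedef struct {
    int resident[NUM_LAYERS]; /* configuration id held by each layer, or EMPTY */
    int active;               /* layer currently selected */
    int next_victim;          /* round-robin replacement pointer (assumption) */
} ContextStore;

void ctx_init(ContextStore *s)
{
    for (int i = 0; i < NUM_LAYERS; i++)
        s->resident[i] = EMPTY;
    s->active = 0;
    s->next_victim = 0;
}

/* Selects configuration `id`. Returns true if it was already resident
   (only a layer switch is needed), false if it had to be fetched from
   the configuration cache into a layer. */
bool ctx_activate(ContextStore *s, int id)
{
    assert(id >= 0);
    for (int i = 0; i < NUM_LAYERS; i++) {
        if (s->resident[i] == id) {
            s->active = i;
            return true;
        }
    }
    int v = s->next_victim;               /* overwrite the oldest-loaded layer */
    s->resident[v] = id;
    s->next_victim = (v + 1) % NUM_LAYERS;
    s->active = v;
    return false;
}
```

In this model, an application that alternates among up to four operations pays the load cost once per operation and afterwards only switches layers.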

Page 12: The Software Development Environment

[Figure: development flow diagram with blocks labeled Initial C code, Profiling, Computation kernel extraction, PiCoGA mapping, pGA-op, Latency information, Assembler-level Scheduler and Executable code]

Page 13: Software Simulation

Goals: check the correctness of the algorithm and evaluate performance

In the source code, a pGA-op is described using a pragma directive:

    #pragma pGA shift_add 0x12 5 c a b
    c = ( a << 2 ) + b
    #pragma end

The directive is expanded into conditional code:

    /**************************************/
    /* Shift_add mapped on PiCoGA         */
    /**************************************/
    #if defined(PiCoGA)
    ...
    asm("pGA-op 0x12 ...")
    ...
    /**************************************/
    /* Emulation function _shift_add      */
    /**************************************/
    #else
    void _shift_add(){
    ...
    c = ( a << 2 ) + b
    ...
    }
    #endif
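The same two-path pattern can be written as a compilable sketch for a host machine. The following is an illustration only: the macro `HYPOTHETICAL_PICOGA_TARGET` and the function names are assumptions, and the hardware branch is left as a compile-time error because the operand syntax of the pGA-op assembly is elided on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Emulation of the slide's shift_add operation: c = (a << 2) + b.
   Unsigned 32-bit arithmetic, so the result wraps modulo 2^32. */
uint32_t shift_add_emulated(uint32_t a, uint32_t b)
{
    return (a << 2) + b;
}

/* Single entry point used by application code. */
uint32_t shift_add(uint32_t a, uint32_t b)
{
#if defined(HYPOTHETICAL_PICOGA_TARGET)
    /* The real tool flow would place the pGA-op inline assembly here;
       its operands are elided on the slide, so it is not reproduced. */
#   error "pGA-op inline assembly must be supplied by the target toolchain"
#else
    return shift_add_emulated(a, b);   /* software emulation path */
#endif
}

/* Small self-check of the emulation path. */
void shift_add_selfcheck(void)
{
    assert(shift_add(3u, 5u) == 17u);  /* (3 << 2) + 5 */
}
```

Keeping both paths behind one function name lets the same application source be checked for correctness on a workstation before the reconfigurable hardware is involved.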

Page 14: Software Simulation

Two special instructions are defined to support emulation:

    ...
    topga ...
    jal _shift_add
    fmpga ...
    ...

topga saves the current state and passes arguments to the emulation function; the clock cycle count is halted while the function runs

fmpga copies the emulation function result(s) and restores registers; the cycle count is incremented with the latency value of the pGA-op

Overall performance is evaluated by counting elaboration cycles
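The cycle accounting described above can be sketched as a tiny model in C. It is not the authors' simulator: the names `SimClock`, `sim_tick`, `sim_topga` and `sim_fmpga` are hypothetical, the state save/restore and argument passing are not modeled, and topga and fmpga are assumed to cost no cycles of their own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t cycles;   /* simulated clock cycle count */
    bool halted;       /* true while an emulation function is running */
} SimClock;

/* Called once per simulated instruction: counts only when not halted. */
void sim_tick(SimClock *c)
{
    assert(c != NULL);
    if (!c->halted)
        c->cycles++;
}

/* topga: entering the emulation function, so stop counting. */
void sim_topga(SimClock *c)
{
    assert(c != NULL);
    c->halted = true;
}

/* fmpga: leaving the emulation function; resume counting and charge
   the latency of the emulated pGA-op instead of the emulation cost. */
void sim_fmpga(SimClock *c, unsigned latency)
{
    assert(c != NULL);
    c->halted = false;
    c->cycles += latency;
}
```

With this scheme, three ordinary instructions, an emulated operation of latency 5 whose emulation function executes ten instructions, and one more ordinary instruction are counted as 3 + 5 + 1 = 9 cycles; the ten emulation instructions do not contribute.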

Page 15: Results and Measurements

[Figure: Normalized Energy Histogram comparing Only VLIW with VLIW + PiCoGA for DES, CRC, Median Filter and Motion Prediction; vertical axis from 0 to 1; bar labels 0.076, 0.27, 0.15, 0.22]

75% of energy consumption for a VLIW architecture is due to accesses to instruction and data memory

Strong reduction of accesses to instruction memory

Speed-ups for several signal processing cores:

DES     CRC    Median Filter   Motion Estimation   Motion Prediction   Turbo Codes
13.5x   4.3x   7.7x            12.4x               4.5x                12x

Page 16: Conclusions

XiRisc: VLIW RISC architecture enhanced by a run-time reconfigurable function unit

PiCoGA: pipelined, run-time configurable, row-oriented array of LUT-based cells

Specific software development toolchain
Speed-ups range from 4.3x to 13.5x
Up to 93% energy consumption reduction