36
Page 1 JR.S00 1 Lecture 14: (Re)configurable Computing Case Studies Prof. Jan Rabaey Computer Science 252, Spring 2000 The contributions of Andre Dehon (Caltech) and Ravi Subramanian (Morphics) to this slide set are gratefully acknowledged JR.S00 2 Summary of Previous Class Configurable Computing using “programming in space” versus “programming in time” for traditional instruction-set computers Key design choices Computational units and their granularity Interconnect Network (Re)configuration time and frequency Next class: Some practical examples of reconfigurable computers

Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 1

JR.S00 1

Lecture 14: (Re)configurable Computing

Case Studies

Prof. Jan RabaeyComputer Science 252, Spring 2000

The contributions of Andre Dehon (Caltech) and Ravi Subramanian (Morphics)to this slide set are gratefully acknowledged

JR.S00 2

Summary of Previous Class

• Configurable Computing using “programming in space” versus “programming in time” for traditional instruction-set computers

• Key design choices– Computational units and their granularity– Interconnect Network– (Re)configuration time and frequency

• Next class: Some practical examples of reconfigurable computers

Page 2: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 2

JR.S00 3

Applicability of Configurable Processing

• Stand-alone computational engines– E.g. PADDI, UCLA Mojave

• Adding programmable I/O to embedded processors

– E.g. Napa 2000

• Augmenting the instruction set of processors– E.g. GARP, RAW

• Providing programmable accelerator co-processors to embedded micro’s and DSP

– Chameleon, Pleiades, Morphics

JR.S00 4

Stand-Alone ComputationalEnginesUCLA Mojave System

Template Matching for AutomaticTarget Recognition

I960 Board

Page 3: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 3

JR.S00 5

As Programmable Interface and I/O Processor

• Logic used in place of – ASIC environment

customization– external FPGA/PLD

devices

• Example– bus protocols– peripherals– sensors, actuators

• Case for:– Always have some system

adaptation to do – varying glue logic requirements

– Modern chips have capacity to hold processor + glue logic

– Reduces part count– Valued added must now be

accommodated on chip (formerly board level)

JR.S00 6

Example: Interface/Peripherals

• Triscend E5

Page 4: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 4

JR.S00 7

Model: IO Processor

• Array dedicated to servicing IO channel

– sensor, lan, wan, peripheral

• Provides– protocol handling– stream computation

» compression, encrypt

• Looks like IO peripheral to processor

• Case for:– many protocols, services– only need few at a time– dedicate attention, offload

processor

JR.S00 8

IO Processing

• Single threaded processor– cannot continuously monitor multiple data pipes (src,

sink)– need some minimal, local control to handle events– for performance or real-time guarantees , may need to

service event rapidly– E.g. checksum (decode) and acknowledge packet

Page 5: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 5

JR.S00 9

Source: National Semiconductor

NAPA 1000 Block Diagram

RPCReconfigurablePipeline Cntr

ALPAdaptive Logic

Processor

SystemPort

TBTToggleBusTM

Transceiver

PMAPipeline

Memory Array

CR32CompactRISCTM

32 Bit Processor

BIUBus Interface

Unit

CR32PeripheralDevices

ExternalMemoryInterface SMA

ScratchpadMemory Array

CIOConfigurable

I/O

JR.S00 10

Source: National Semiconductor

NAPA 1000 as IO Processor

SYSTEMHOST

NAPA1000

ROM &DRAM

ApplicationSpecific

Sensors, Actuators, orother circuits

System Port

CIO

Memory Interface

Page 6: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 6

JR.S00 11

I/O Stream Processor

Philips Nexperia NX-2700A programmable HDTVmedia processor

Combines Trimedia VLIW withConfigurable media co-processors

JR.S00 12

Model: Instruction Augmentation• Observation: Instruction Bandwidth

– Processor can only describe a small number of basic computations in a cycle

» I bits →2I operations– This is a small fraction of the operations one could do even in

terms of w ⊗w→w Ops» w22(2w) operations

– Processor could have to issue w2 (2 (2w) -I) operations just to describe some computations

– An a priori selected base set of functions could be very bad for some applications

Page 7: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 7

JR.S00 13

Instruction Augmentation

• Idea:– provide a way to augment the processor’s instruction set– with operations needed by a particular application– close semantic gap / avoid mismatch

• What’s required:– some way to fit augmented instructions into stream– execution engine for augmented instructions

» if programmable, has own instructions– interconnect to augmented instructions

JR.S00 14

“First” Instruction Augmentation

• PRISM– Processor Reconfiguration through Instruction Set

Metamorphosis

• PRISM-I– 68010 (10MHz) + XC3090– can reconfigure FPGA in one second!– 50-75 clocks for operations

[Athanas+Silverman: Brown]

Page 8: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 8

JR.S00 15

PRISM-1 Results

Raw kernel speedupsJR.S00 16

PRISM

• FPGA on bus• access as memory mapped peripheral• explicit context management• some software discipline for use• …not much of an “architecture” presented to

user

Page 9: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 9

JR.S00 17

PRISC

• Takes next step– what look like if we put it on chip?– how integrate into processor ISA?

• Architecture:– couple into register file as “superscalar” functional unit– flow-through array (no state)

[Razdan+Smith: Harvard]

JR.S00 18

PRISC

• ISA Integration– add expfu instruction– 11 bit address space for user-defined expfu instructions– fault on pfu instruction mismatch

» trap code to service instruction miss– all operations occur in clock cycle– easily works with processor context switch

» no state + fault on mismatch pfu instr

Page 10: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 10

JR.S00 19

PRISC Results

• All compiled• working from MIPS

binary• <200 4LUTs ?

– 64x3

• 200MHz MIPS base

Razdan/Micro27 JR.S00 20

Chimaera

• Start from PRISC idea– integrate as functional unit– no state– RFUOPs (like expfu)– stall processor on instruction miss, reload

• Add– manage multiple instructions loaded – more than 2 inputs possible

[Hauck: Northwestern]

Page 11: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 11

JR.S00 21

Chimaera Architecture

• “Live” copy of register file values feed into array

• Each row of array may compute from register values or intermediates (other rows)

• Tag on array to indicate RFUOP

Results• Compress 1.11

• Eqntott 1.8

• Life 2.06 (160 hand parallelization)

[Hauck/FCCM97] JR.S00 22

Instruction Augmentation

• Small arrays with limited state– so far, for automatic compilation

» reported speedups have been small– open

» discover less-local recodings which extract greater benefit

Page 12: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 12

JR.S00 23

GARP

Identified Problems:• Single-cycle flow-through

– not most promising usage style

• Moving data through Register File to/from array

– can present a limitation » bottleneck to achieving high computation rate

[Hauser+Wawrzynek: UCB]JR.S00 24

GARP

• Integrate as coprocessor– similar bandwidth to processor as FU– own access to memory

• Support multi-cycle operation– allow state– cycle counter to track operation

• Fast operation selection– cache for configurations– dense encodings, wide path to memory

Page 13: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 13

JR.S00 25

GARP• ISA -- coprocessor operations

– issue gaconfig to make a particular configuration resident (may be active or cached )

– explicitly move data to/from array» 2 writes, 1 read

– processor suspend during coprocessor operation» cycle count tracks operation

– array may directly access memory» processor and array share memory space

• cache/mmu keeps consistency

» can exploit streaming data operations

JR.S00 26

GARP

• Processor Instructions

Page 14: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 14

JR.S00 27

GARP Array

• Row oriented logic– denser for datapath

operations

• Dedicated path for – processor/memory data

• Processor not have to be involved in array ⇔memory path

JR.S00 28

GARP Results

• General results– 10-20x on stream, feed-

forward operation– 2-3x when data-

dependencies limit pipelining

[Hauser+Wawrzynek/FCCM97]

Page 15: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 15

JR.S00 29

PRISC/Chimera … GARP

• PRISC/Chimaera– basic op is single cycle:

expfu (rfuop )– no state– could conceivably have

multiple PFUs?– Discover parallelism =>

run in parallel?– Can’t run deep pipelines

• GARP– basic op is multicycle

» gaconfig

» mtga

» mfga

– can have state/deep pipelining

– ? Multiple arrays viable?– Identify mtga/mfga w/ corr

gaconfig ?

JR.S00 30

Common Theme

• To get around instruction expression limits– define new instruction in array

» many bits of config … broad expressability» many parallel operators

– give array configuration short “name” which processor can callout

» …effectively the address of the operation

But – Impact of using reconfiguration at Instruction Level seems limited⇒ Explore opportunities at larger granularity levels (basic block, task, process)

Page 16: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 16

JR.S00 31

Applicability of Configurable Processing

• Stand-alone computational engines– E.g. PADDI, UCLA Mojave

• Adding programmable I/O to embedded processors

– E.g. Napa 2000

• Augmenting the instruction set of processors– E.g. GARP, RAW

• Providing programmable accelerator co-processors to embedded micro’s and DSP

– Chameleon, Pleiades, Morphics

JR.S00 32

Example: Chameleon Reconfigurable Co-Processor (network, communication applications)

Reconfigurable LogicArray of 32-bit Data Path

Operators & Control Logic

Data MemoryPCI

InterfaceMemory

ControllerJTAG

Debugging Port

ARC CPU Instruction Cache

Bus Manager& DMA

Controllers

Local StoreMemory(LSM)

Background Configuration Plane

Configuration bit stream

Wide Internal communications bus

Multiple banks of I/O

Page 17: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 17

JR.S00 33

Reconfigurable Processor Tools Flow

Customer Application / IP

(C code)

ARCObject Code

C Compiler

RTLHDL

Linker

Chameleon Executable

C Model Simulator

Configuration Bits

Synthesis & Layout

C DebuggerDevelopment

Board

JR.S00 34

Heterogeneous Reconfiguration

ReconfigurableLogic

ReconfigurableDatapaths

adder

buffer

reg0

reg1

muxCLB CLB

CLBCLB

DataMemory

InstructionDecoder

&Controller

DataMemory

ProgramMemory

Datapath

MAC

In

AddrGen

Memory

AddrGen

Memory

ReconfigurableArithmetic

ReconfigurableControl

Bit-Level Operationse.g. encoding

Dedicated data pathse.g. Filters, AGU

Arithmetic kernelse.g. Convolution

RTOSProcess management

Page 18: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 18

JR.S00 35

Multi-granularity Reconfigurable Architecture: The Berkeley Pleiades Architecture

Communication Network

ControlProcessor

ArithmeticProcessor

ArithmeticProcessor

ArithmeticProcessor

ConfigurableDatapath

ConfigurableLogic

Configuration Bus

Network Interface

DedicatedArithmetic

Co

nfiguratio

n

Satellite Processor

• Computational kernels are “spawned” to satellite processors• Control processor supports RTOS and reconfiguration• Order(s) of magnitude energy-reduction over traditional programmable architectures

JR.S00 36

Matching Computation and Architecture

AddressGen AddressGen

Memory Memory

MAC MAC

ControlProcessor

L CG

Convolution

Two models of computation:communicating processes + data-flow

Two architectural models:sequential control+ data-driven

Page 19: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 19

JR.S00 37

Execution Model of a Data-Flow Kernel

for(i=1;i<=L;i++)

for(k=i;k<=L;k++)

phi[i][k]= phi[i-1][k-1]

+in[NP-i]*in[NP-k]

-in[NA-1-i]*in[NA-1-k];

endstart

Embedded processor

AddrGen

MEM: in

ALU

ALU

AddrGen

MEM: phiMPY MPY

• Distributed control and memory

Code se g

Code se g

JR.S00 38

Reconfigurable Kernels for W-CDMA

• Dominant kernel M(MTX) requires arra y of MACs and segmented memories

• Additional o perations such as sqrt (x), 1/x, and Trellis decodin g may be im plemented usin g FPGA or cordic satellite

Page 20: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 20

JR.S00 39

Inter-Satellite Communication• Data-driven execution

– A satellite processor is enabled only when input data is ready

• Data sources generate data of different types: scalars, vectors,matrices

• Data computing processors handle data inputs of different types

end-of-vector token

MPY

MPY1

nn

MACn

n1

AddrGen Memory

Embeddedprocessor

1

11

Data sources Data computing processorsJR.S00 40

Impact of Architectural Choice

1870

Str

ongA

RM

131

Nor

ma

lize

d E

ner

gy /

sta

ge [n

J]

TM

S3

20C

2xx

E nergy /stage

49

TM

S32

0LC

54x

1000

100

10000

10

21u

Str

ongA

RM

10u

Nor

ma

lize

d D

ela

y/st

age

[s]

TM

S32

0C

2xx

D elay /stage

3.8u

TM

S32

0LC

54x

10u

1u

100u

100n

18.5

TM

S32

0LC

54x

No

rmal

ize

d E

ne

rgy*

De

lay

/ sta

ge

[Js*

e-1

4]

10

1

100

1000 E nergy*D elay /stage

137

TM

S32

0C2x

x0 .1

3970

Str

ongA

RM

10000Example: 16 point Complex Radix-2 FFT (Final Stage)

13

570n 0.75

Ple

iade

s

Ple

iade

s

Ple

iade

s

Page 21: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 21

JR.S00 41

Adaptive Multi-User Detector for W-CDMAPilot Correlator Unit Using LMS

AG

MULSUB

ADDMEM

MEM

MEM

MEMAG

MUL

MUL

MUL

Filter

Coefficient Update

MEM

MEMAG

ACC

ACC

MAC

MAC

MUL

MUL

SUB

SUB

MULSUB

ADD

MUL

MUL

MUL

SUB

SUB

alt

alt

alt

alt

alt

alt

alt

s_r

s_i

y_r

y_iADD

ADD

Zmf_r

Zmf_i

s_r

s_iZmf_r

Zmf_i

y_r

y_i

JR.S00 42

Architecture Comparison

LMS Correlator at 1.67 MSymbols Data RateComplexity: 300 Mmult/sec and 357 Macc/sec

Note: TMS implementation requires 36 parallel processors to meet data rate -validity questionable

16 Mmacs/mW!

Page 22: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 22

JR.S00 43

Maia: Reconfigurable Baseband Processor for Wireless

• 0.25um tech: 4.5mm x 6mm

• 1.2 Million transistors

• 40 MHz at 1V

• 1 mW VCELP voice coder

• Hardware

• 1 ARM-8

• 8 SRAMs & 8 AGPs

• 2 MACs

• 2 ALUs

• 2 In-Ports and 2 Out-Ports

• 14x8 FPGAJR.S00 44

Reconfigurable Interconnect Exploration

N Inputs

B Buses

M Outputs

Multi-Bus

cluster

cluster

cluster

Hierarchical MeshMesh

Module

Page 23: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 23

JR.S00 45

Software Methodology Flow

Algorithms

Kernel Detection

Estimation/Exploration

Partitioning

Software CompilationReconfig. Hardware Mapping

Interface Code Generation

Power & Timing Estimation of Various Kernel Implementations

PDA Models

PremappedKernels

Acceleratorµproc &

Behavioral

C++ Module Libraries

C++

SUIF+ C-IF

JR.S00 46

Hardware-Software Exploration

Macromodel call

Page 24: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 24

JR.S00 47

Implementation Fabrics for Protocols

BU

F

Memory

Slot_Set_Tbl2x16

addr

BU

F

slot_set<31:0>

Slot_no<5:0>

Slotstart

Pktend

RACHreq

RACHakn

W_ENA

R_ENAupdate

idle

writereadslotset

RACH

idle

A protocol = Extended FSM

Intercom TDMA MAC JR.S00 48

Intercom TDMA MACImplementation alternatives

• ASIC: 1V, 0.25 µm CMOS process• FPGA: 1.5 V 0.25 µm CMOS low-energy FPGA• ARM8: 1 V 25 MHz processor; n = 13,000• Ratio: 1 - 8 - >> 400

AS IC FP GA AR M8P ower 0.26m W 2.1m W 114m WE nergy 10.2pJ/op 81.4pJ/op n*457pJ /op

Page 25: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 25

JR.S00 49

The Software-Defined Radio

ReconfigurableDataPath

FPGA EmbeddeduP

Dedicated FSM

DedicatedDSP

JR.S00 50

An Industrial Example:Basestation for Cellular Wireless

800 MHz 1900 MHz

A B A B A D B CFE

RF/IFTuner

Block-SpectrumA/D

RF/IFTuner

Block-SpectrumA/D

RF/IFTuner

Block-SpectrumA/D

AntennaSystem

multiple sectorsmulti-band

RF/IFTuner

Block-SpectrumA/D

RF/IFTuner

Block-SpectrumA/D

RF/IFTuner

Block-SpectrumA/D

AntennaSystem

multiple sectorsmulti-band

HIGH-SPEED DIGITAL BUS

Modular/Parameterizable•per carrier•per TDMA time-slot•per CDMA code

T1 / E

1

...

Page 26: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 26

JR.S00 51

BTS Signal Processing Platforms+LJK�6SHHG'LJLWDO%XV

$�'

&RQYHUVLRQ

&RPP$JHQW

&RPP

$JHQW

'�$

&RQYHUVLRQ Standard A, F1

N DSP/CPUs

&38

Standard A, F2

Standard B, F1...

...

Hardwired ASICs

JR.S00 52

Basestation of the Next Generation

WidebandRF

Data Networks

10/100or

Gbit

ATM

Page 27: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 27

JR.S00 53

Coexistence of Multiple Standards In Product Supply Chain

2.5G– GPRS– HCSD– IS-95 MDR– IS-95 HDR– IS-136 HS

3G– ETSI UTRA – ARIB W-CDMA– TIA cdma2000– W-TDMA (UWC)

2G– GSM– DCS1800– PCS1900– IS-95– IS-54B– IS-136– PDC

CIRCUIT PACKET VOICE DATANARROWBAND WIDEBAND

JR.S00 54

Wideband CDMA: MOPS? No, GOPS!

Function MIPSDigital RRC Channel 3600S earc her 2100RA K E 1050M ax im al Rat io Com biner 24Channel E s t im ator 12A G C, A F C 10Deinterleaver 15Turbo Coder 90TOTAL 6901

Single 384 kbps ARIB W-CDMA Channel

Source: J. Kohnen et al. “Baseband Solution for WCDMA,” Proc. IEEE

Communication Theory Workshop, May 1999, Aptos, USA.

Page 28: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 28

JR.S00 55

The common approach to hardware design involves:multiple ASIC’s to support each standard.

• Hardwired implementation is not scalable or upgradeable to new standards.• This approach costs time in a time-to-market dominated world.• Creating new chipsets for every technology combination critically challenges available design resources!

HW Multistandard Solutions

DigitalHardwired

ASICIF RF

DigitalHardwired

ASICIF RF

DigitalHardwired

ASICIF RF

AnalogProgrammable Unique

Combinations

DSP

Control Processor

JR.S00 56

SW Multistandard SolutionApplying instruction-set processor architectures to

all baseband processing would be desireable...

…but is simply not an good implementation for base stations:-Unacceptably high cost per channel

-Unacceptably large power per channel

This is definitely not a viable implementation for terminals

AnalogProgrammable

IF RF

IF RF

IF RF

DSP

Control Processor

Page 29: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 29

JR.S00 57

The Law of Diminishing Returns• More transistors are being thrown at improving

general-purpose CPU and DSP performance

• Fundamental bounds are being pushed– limits on instruction-level parallelism– limits on memory system performance

• Returns per transistor are diminishing– new architectures realizing only 2-3 instructions/clock– increasingly large caches to hide DRAM latency

JR.S00 58

Embedded Logic

Evolution– Increasing fixed-function hardwired content of systems– Core+Logic becomes de-facto design architecture– Move to deep sub-micron technology

» rapidly increasing product integration cycles» increasingly constrained design resources» sharp increases in cost of “trying out” an idea- NRE

– Design methodologies optimized for random logic, homogeneous architectures, and lower speed signal processing (I.e. control-flow dominated systems)

– Verification issues dominate design cycle time

Growing Design Cycle Times At Odds With Shrinking Product Cycle Times

Page 30: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 30

JR.S00 59

FPGA the Solution?Cellular Handset Using Current FPGA

JR.S00 60

Some Interesting Observations

Don’t use more transistors to stretch general-purpose performance, whether for CPUs, DSPs, or

reconfigurable logic.

Don’t use more time to designdedicated hardwired solutions in cases where

mass customizationis what the market demands.

Page 31: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 31

JR.S00 61

View The “Reconfigurability” Problem From The System Level

What Are the Application-Specific Performance Needs ?

– What are the applications targeted?– What algorithms are essential to achieving the performance goals?– What are the functions at the heart of these algorithms?– Which functions yield poor price-performance with general-purpose

MOPS and system memory models?– What is the embedded systems programmer’s model?– Best performance at what cost:

» Area - instruction-level parallelism, memory hierarchy» Power- energy requirements on a function basis» Time- quality and ease of programming for app development» Pain- forward opportunity vs backward compatibility

JR.S00 62

Define optimal architecture for efficient implementation

Successfully Using Reconfigurability

Application-Specific Leverage

Focus on first on applications and constituent algorithms, not the silicon architecture !Wireless Communications Transceiver Signal Processing

Minimize the hardware reconfigurability to constrained set

Maximize the software parameterizability and ease of use of the programmer’s model for flexibility

Page 32: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 32

JR.S00 63

Application-Specific MOPS in Digital Communications

RF/IF

TDMAWideband Signal

Processing Engine

CDMAWideband Signal

Processing Engine

Wideband ChannelDecoder Engine

Programmable DSP

Microprocessor

DigitalDowncon version

and Channelization

JR.S00 64

Morphics’ DRL ArchitectureHeterogeneous Multiprocessing Engine Using Application-

Specific Reconfigurable Logic

m

R

m

O

Clk

Enableoutput

input

input

m

R

m

O

Clk

Enableoutput

input

input

m

R

m

O

Clk

Enableoutput

input

input

Small Granularity Kernel

Large Granularity Kernel

DA

TA

FLO

W

Page 33: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 33

JR.S00 65

DRL Kernels

DATA SEQUENCER

DATA MEMORY

PARAMETERIZABLECONFIGURABLE

ALU

JR.S00 66

Mapping Software to Target Architecture

DATA SEQUENCER

DATA MEMORY

PARAMETERIZABLECONFIGURABLE

ALU

DATA SEQUENCER

DATA MEMORY

PARAMETERIZABLECONFIGURABLE

ALU

Page 34: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 34

JR.S00 67

Programmer’s Guide

Document provided with each processor to enable application development, configuration, and system control via host processor.

Includes complete • description of each API function

call• system control functions• variables, parameters• coding examples, and

performance realized (i.e. ROC curves)

/* morphics soft API usage examples: *//* search a pilot set in search set *//* maintainance mode. */

...set_ptr = get_next_set();if (no_need_to_throttle()){

search_set(set_ptr);}...void search_set(PILOT_TYPE *pilot){

/* search threshold is assumed to be set in other places */morphics_searcher_set_win_size(pilot->win_size);morphics_searcher_set_pn(pilot->pn);morphics_searcher_set_int_len(pilot->int_len);

}

/* finger re-assignments */...fing[i].pos = morphics_demod_get_fin_pos(i);...distance = calc_fing_movement(&fing[i], new_pn);...fing[i].slew = distance;fing[i].pn = new_pn;morphics_demod_set_fing_iq(i, fing[i].pn);while ( slew_not_done(morphics_demod_get_status()) ){

morphics_demod_set_fing_slew(i, fing[i].slew);}...

JR.S00 68

Key Pieces of Design Methodology

System-level Profiling• Analyze sequences of operations (arithmetic, memory access, etc)• Analyze communication bottlenecks• Key flexible parameters (algorithm v architecture parameters)

Architecture-level Profiling• ALU/kernel definition (sequences of operators)• Memory profile• Type of configurability required for flexibility• Macro-sequencer development

Implementation• SW- programmer’s model developed at architecture specification stage• SW- API proven out via behavioral models & demonstrator hardware• VLSI-focus on regular predictable timing and routability• VLSI- embedded reconfigurability in an ASIC flow

Page 35: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 35

JR.S00 69

CDMA Modem Analysis

JR.S00 70

MOPS Breakdown Analysis

Page 36: Summary of Previous Class Lecture 14: (Re)configurable ...bwrcs.eecs.berkeley.edu/Classes/CS252/Notes/lec14-config.pdf · – Chameleon, Pleiades, Morphics JR.S00 4 ... I960 Board

Page 36

JR.S00 71

Prototype Demonstrator

RISC Microprocessor

Configurable Kernel(s)

Data Router

JR.S00 72

Summary

• Configurable computing is finding its way into the embedded processor space

• Best suited (so far) for – Flexible I/O and Interface functionality– providing task-level acceleration of “parametizable” functions

• Improvement of IP seems limited• Software flow still subject to improvement• Might become more interesting with the emergence of

low-current devices (TFT, organic transistors, molecular computing)

DO NOT FORGET CONFIGURATION OVERHEAD