
Transmedia Reconf IRISA Cairn - Inriapeople.rennes.inria.fr/.../presentations/CGReconf_IRISA_Cairn.pdf · Collaboration between IRISA/Cairn on DART ... WCDMA/UMTS Receiver 3900 MOPS



1

CAIRN Project-Team

Energy-Efficient Reconfigurable System-on-Chip DART Coarse-Grain Reconfigurable Architecture

Olivier Sentieys [email protected]

with contribution from Raphaël David (CEA List), Sébastien Pillement (IRISA)

2

Agenda   Motivations and Challenges   Dynamically Reconfigurable Architectures

  Anatomy of an RSoC   From Applications to Architecture

  Coarse-grain Reconfigurable Architecture   DART architecture   (Mozaic platform)   Morea architecture

  Conclusion

A cairn in Bréhat


3

[L. Ducousso, STMicroelectronics]

State-of-the-art SoC at   HDTV Set Top Box

  Domain-specific SoC   Functionality is inside

  Various applications and standards inside

  MPEG2, H264   Satellite, Wifi/LAN   Hard disk, …

  65nm, 150 MTr   1 B$ mask set   60 weeks design   6-18 months lifetime

  Heterogeneity   16 processors, 38 IPs   5-6000 MIPS   140 memory blocks   5 Gbytes/s on-chip

interconnection network

  HW: 5M RTL code lines   SW: 60M code lines

•  OS, Middleware, HAL, Firmware


4

Challenges and limitations   High-performance applications

  e.g. H264 codec, 802.11n MIMO, …   Energy and Power constraints

  Battery life, manufacturing cost   Rapidly changing application standards

  SW updates vs. HW redesign   Compilation and synthesis tools targeting

heterogeneous SoC   Technological impacts

  Manufacturing problems, transient errors, silicon bugs


5

A road for reconfigurable chips

• Dynamically adapt the hardware to the application
  • energy-performance-cost trade-off
• Self-adapting devices
  • continuously adapt to changing environments
• Other advantages
  • regularity of the layout
  • high-performance, parallel
  • error and fault tolerance

Fresh SoC from CEA with DART IP from IRISA

  “Flexible Software on Flexible Hardware”

6

Agenda   Motivations and Challenges   Dynamically Reconfigurable Architectures

  Anatomy of an RSoC   From Applications to Architecture

  Coarse-grain Reconfigurable Architecture   DART architecture   (Mozaic platform)   Morea architecture

  Conclusion

A cairn in Bréhat


7

HW Processor Memory Hierarchy

Fine grain Reconfigurable

Coarse grain Reconfigurable HW

Reconfigurable system-on-chip   Programmable processors, specialized HW blocks   Reconfigurable hardware

  fine-grain, coarse-grain   "on-the-fly ASIC"

  Reconfigurable interconnect and memory structures

8

HW Processor Memory Hierarchy

Fine grain Reconfigurable

Coarse grain Reconfigurable HW

Reconfigurable system-on-chip   Multithreaded applications

  Thread compilation to reconfigurable hardware   Fixed-point specification

  Reconfiguration management   Hardware abstraction layer   Static or dynamic (at run-time) reconfiguration


9

Design Space

RECONFIGURABLE ARCHITECTURES (R-SoC)

FINE GRAIN (FPGA)

MULTI GRANULARITY (Heterogeneous)

COARSE GRAIN

Processor + Coprocessor

Tile-Based Architecture

Coarse Grain Coprocessor

Fine Grain Coprocessor

Island Topology

Hierarchical Topology

Linear Topology

Hierarchical Topology

Mesh Topology

•  Chameleon •  REMARC •  Morphosys •  PACT XPP

•  Pleiades •  Garp •  FIPSOC •  Triscend E5 •  Triscend A7 •  Xilinx Virtex-II Pro •  Altera Excalibur •  Atmel FPSIC

•  Xilinx Virtex •  Xilinx Spartan •  Atmel AT40K •  Lattice ispXPGA

•  Altera Stratix •  Altera Apex •  Altera Cyclone

•  Systolic Ring •  RaPiD •  PipeRench

•  DART •  FPFA

•  RAW •  CHESS •  MATRIX •  KressArray •  Systolix PulseDSP

•  aSoC •  E-FPFA

[Bossuet03]

10

High Performance (12 GOPS) Low Power (500 mW)

24MOPS/mW@12GOPS
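The efficiency target above is throughput divided by power. A minimal sketch of that arithmetic follows; the function name `mops_per_mw` and its parameters are illustrative and do not come from the slides.

```c
#include <assert.h>

/* Energy efficiency in MOPS/mW: sustained throughput (in MOPS)
 * divided by power (in mW). Names are hypothetical, chosen for
 * this example only. */
static double mops_per_mw(double mops, double power_mw)
{
    assert(power_mw > 0.0);   /* a zero or negative power makes no sense */
    return mops / power_mw;
}
```

With the figures of this slide, `mops_per_mw(12000.0, 500.0)` gives 24, which matches the 24 MOPS/mW quoted for 12 GOPS at 500 mW.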

Source

Data

Audio

Video

Source Coding

V34, V8, H225, H245, ...

EFR, AMR, CELP, RPE-LTP, ...

MPEGx, H26x, ... Channel Coding

Viterbi, Turbo coder, Reed Solomon, ...

Access Technique

TDMA, FDMA, W-CDMA, ...

Modulation

PSK, MSK, ASK, QAM, ...

Viterbi, turbo dec., Reed Solomon, ...

Channel Decoding Access Technique

TDMA, WCDMA, ...

Demodulation

PSK, MSK, ASK, QAM, ...

3G Wireless Terminal   Flexibility

  Applications   Services

  Multiple granularity   Arithmetic   Logic


11

Reconfigurable Architectures

Image

Music Demult. Multiple

Access Channel Decoder

Demodul. Equalizer

Source Decoder

Voice

Processor Processor

Reconfigurable Coprocessor

time

Wireless Multimedia Receiver

12

Processing Model

T3

T1

T2b T2a T2c

RA4

t

T1

T2a

T3

T2b [adapted from Leray08]

RA: Reconfigurable Area CM: Configuration Management

RA5

RA2 RA3 RA1

T2c


13

Agenda   Motivations and Challenges   Dynamically Reconfigurable Architectures

  Anatomy of an RSoC   From Applications to Architecture

  Coarse-grain Reconfigurable Architecture   DART architecture   (Mozaic platform)   Morea architecture

  Conclusion

A cairn in Bréhat

14

DART Architecture

  Architecture Principles of DART

  Compilation Workflow

  Validation and Silicon Prototype


15

Overall Objectives
• Coarse-grained reconfigurable architecture
• Energy-efficiency
• Dynamic reconfiguration (4 to 20 cycles)
• Compilation from a C code specification (no place and route)

16

Energy Efficiency

  Technological parameter   CS

  Applicative parameters   Nop.Fclk , α

  Potential optimisations   Actrl , Amem , Aop , α , VDD


17

Cost of control
• Minimize the configuration data volume (Actrl)
  • Limited number of operation types and data formats
  • Various processing patterns
  • Reconfiguration at the data-path level (rather than at the gate level, as in an FPGA)
• Reduce the frequency of reconfigurations (α)
  • Loop body
    • Limited number of operations
    • Regular patterns
  • Each loop can be implemented as a single configuration that is maintained for the whole processing time

18

Example: Motion Estimation (ME)   Video coding MPEGx, H26x

Motion Vector (u,v)

Reference Block NxN

Matched Block NxN

N+2p

Search Window

p

sadmin = MAXINT; mvx = 0; mvy = 0;
for (u = -p; u <= p; u++) {
  for (v = -p; v <= p; v++) {
    sad = 0;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        sad = sad + ABS(BR(i,j) - FR(i+u,j+v));
        /* if (sad >= sadmin) break; */
      }
    }
    if (sad < sadmin) { sadmin = sad; mvx = u; mvy = v; }
  }
}
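The full-search loop above can be written as a self-contained C function. This is a minimal sketch under stated assumptions: 8-bit pixels, a row-major N x N reference block, and a row-major (N+2p) x (N+2p) search window. The names `MotionVector` and `full_search` are illustrative and do not appear in the slides.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Best displacement found and its SAD (hypothetical type). */
typedef struct { int mvx, mvy, sad; } MotionVector;

/* ref: n x n reference block, row-major.
 * win: (n + 2p) x (n + 2p) search window, row-major.
 * Returns the displacement (u, v) in [-p, p] that minimizes the
 * sum of absolute differences (SAD). */
static MotionVector full_search(const unsigned char *ref,
                                const unsigned char *win,
                                int n, int p)
{
    assert(n > 0 && p >= 0);
    const int stride = n + 2 * p;          /* width of the search window */
    MotionVector best = { 0, 0, INT_MAX };

    for (int u = -p; u <= p; u++) {
        for (int v = -p; v <= p; v++) {
            int sad = 0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    int a = ref[i * n + j];
                    int b = win[(i + u + p) * stride + (j + v + p)];
                    sad += abs(a - b);
                }
            }
            if (sad < best.sad) {          /* keep the first best match */
                best.sad = sad;
                best.mvx = u;
                best.mvy = v;
            }
        }
    }
    return best;
}
```

The two inner loops form the regular kernel (subtract, absolute value, accumulate), while the comparison against the running minimum is the irregular part executed once per candidate position.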


19

Data access cost

• Minimize memory access cost (Amem)
  • Storage capacity
  • High bandwidth
  • Memory hierarchy
• Minimize the number of memory accesses (α)
  • Minimize temporary data storage
  • Avoid redundant accesses to data
  • Local registers

[Figure: energy per memory access (pJ per access, axis from 0 to 160) versus memory size (number of words)]

20

Operator

Reconfigurable Operators

Operation 2 Operation 0

( α)

Operation 1

Input 1 Input 2

Output

Control

( Aop)


21

System Architecture of DART

Data Memory

Instruction Memory

I/O Ctrl

Cluster 3 Cluster 4

Cluster 1 Cluster 2

Task Controller

Configuration Memory

22

Cluster Architecture

Config. Memory FPGA

DMA Ctrl

Configuration Controller

RDP1

RDP2

RDP3

RDP4

RDP5

RDP6

Data Memory

Segmented Network


23

reg1 reg2 FU1 FU2 FU3 FU4

Multi-Bus Crossbar Network

Data Mem1

Data Mem2

Data Mem3

Data Mem4

AG1 AG2 AG3 AG4

HW Loop Management

Global Bus

Reconfigurable Data Path Architecture • FU1 • FU2 • Crossbar • Bus • AG1 • Loop

24

reg1 reg2 FU1 FU2 FU3 FU4

Multi-Bus Crossbar Network

Data Mem1

Data Mem2

Data Mem3

Data Mem4

AG1 AG2 AG3 AG4

HW Loop Management

Global Bus

Reconfigurable Data Path Architecture

92 bits

34 bits

826 bits to reconfigure the arithmetic resources of a cluster


25

Irregular and Regular Software

for (n=0; n<1024; n++) {
  tmp = 0;
  for (i=0; i<N; i++) {
    tmp += x[i]*h[N-i];
  }
  y[n] = tmp<<6;
  X[0] = x[n]+128;
}

• Irregular Processing
  • little parallelism
  • little regularity
  • less complex
• Regular Processing
  • massively parallel
  • very regular
  • complex

26

rec

4 cycles

Mem3

- X

Configuration 2

y(i)=(x(i)-x(i-1))²

Mem1

Configuration 1

tmp+=x(i)*h(N-i);

X +

Mem1 Mem2

HW Reconfiguration   DART potential is fully exploited

  Optimal flexibility of operators and network   Use of registers   Multiple DPR chaining via segmented network


27

Configuration 1

C=A+B

+

Mem1 Mem2

rec

1 cycle

Configuration 2

E=C*D

X

Mem4 Mem1

SW Reconfiguration   Reduced flexibility of the DPR

  Operator configuration   Operator source configuration   No operator or DPR chaining

28

SCMD Single Configuration Multiple Data

• Irregular processing has little parallelism
  • Implementation on one DPR
• Massively parallel processing is very regular
  • Redundancy in DPR configurations
• The configuration data stream can be reduced if the regularity is exploited
  • Simultaneous broadcast of common configuration data to several DPRs
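The broadcast idea can be illustrated with a small behavioural model. This is only a sketch, assuming one validation bit per data path that decides whether it latches the broadcast word; the `Cluster` type, the 32-bit word width and the function `scmd_broadcast` are hypothetical and not taken from the DART design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_RDP 6   /* six reconfigurable data paths per cluster */

/* One configuration latch per data path (hypothetical model). */
typedef struct { uint32_t config_latch[NUM_RDP]; } Cluster;

/* Broadcast a single configuration word to every data path whose
 * validation bit is set in valid_mask (bit k selects data path k+1).
 * Returns how many data paths latched the word. */
static int scmd_broadcast(Cluster *c, uint32_t word, unsigned valid_mask)
{
    assert(c != NULL);
    int latched = 0;
    for (int k = 0; k < NUM_RDP; k++) {
        if (valid_mask & (1u << k)) {
            c->config_latch[k] = word;
            latched++;
        }
    }
    return latched;
}
```

In this model, configuring six data paths identically costs one word on the configuration bus instead of six, which is the saving that SCMD exploits for regular, massively parallel kernels.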


29

SCMD at work

RDP1

RDP2

RDP6 configuration

data

configuration

data

configuration

data

Configuration bits

RDP1 Validation

RDP2 Validation

RDP6 Validation

LATCH

LATCH

LATCH

30

DART Architecture

  Architecture Principles of DART

  Compilation Workflow

  Validation and Silicon Prototype


31

Compilation Workflow

SystemC Simulation (SCDART) •  BA-CA Simulation • Performance Estimation

Synthesis (gDART) •  DFG scheduling •  Operator binding •  HW configuration generation

Compilation (cDART, ACG) •  SW configuration generation •  Code compilation for address generators

Front-End (SUIF) •  Code Optimisation •  Code Extraction

C Code

32

Compiler front-end   Currently based on SUIF   High-level source optimisations   Parallelism extraction

  Partial loop unrolling   Semi-automatic partitioning

  Regular processing (loops) •  HW configurations

  Irregular processing and data management •  SW configurations and AG instructions


Compilation Front End

SUIF

C Code

void main(){...
for (n=0;n<1024;n++){
  tmp=0;
  for (i=0;i<64;i+=2){
    Mem1=x[i];Mem2=h[N-i];}
  for (j=0;j<32;j+=2){
    Mem3=x[j];Mem4=h[j];
    z[j]=Mem5;}
  y[n]=Mem6<<6;
  X[0]=x[n]+128;
}
…}

void main(int X0, int H0, …, int *Y){
  int tmp;
  tmp=tmp+X0*H0;
  tmp=tmp+X1*H1;
  *Y=tmp;
}

void main(int X0, int H0, …, int *Z0, int *Z1){
  *Z0=X0-H0;
  *Z1=X1-H1;
}

Loop body 1 Loop body 2 Irregular processing

Regular code extraction

Partial loop unrolling

void main(){
...
for (n=0;n<1024;n++){
  tmp=0;
  for (i=0;i<64;i++) //unroll 2
    tmp+=x[i]*h[N-i];
  for (j=0;j<32;j++) //unroll 2
    z[j]=x[j]-h[j];
  y[n]=tmp<<6;
  X[0]=x[n]+128;
}
...

SUIF Front-end

void main(){
...
for (n=0;n<1024;n++){
  tmp=0;
  for (i=0;i<64;i+=2){
    tmp+=x[i]*h[N-i];
    tmp+=x[i+1]*h[N-(i+1)];
  }
  for (j=0;j<32;j+=2){
    z[j]=x[j]-h[j];
    z[j+1]=x[j+1]-h[j+1];
  }
  y[n]=tmp<<6;
  X[0]=x[n]+128;
}
...
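Partial loop unrolling must not change the result of the loop. The sketch below isolates the multiply-accumulate loop of the slide in two functions, one rolled and one unrolled by 2, so that they can be compared. The function names are illustrative, and the code assumes that `h` holds at least n+1 elements because the loop reads `h[n]` down to `h[1]`.

```c
#include <assert.h>

/* Rolled form: tmp += x[i] * h[n - i] for i = 0 .. n-1.
 * h must provide at least n + 1 elements. */
static int mac_rolled(const int *x, const int *h, int n)
{
    int tmp = 0;
    for (int i = 0; i < n; i++)
        tmp += x[i] * h[n - i];
    return tmp;
}

/* Same loop unrolled by 2: two multiply-accumulates per iteration,
 * which exposes two independent products to the data path.
 * n must be even for this simple version (no epilogue loop). */
static int mac_unrolled2(const int *x, const int *h, int n)
{
    assert(n % 2 == 0);
    int tmp = 0;
    for (int i = 0; i < n; i += 2) {
        tmp += x[i]     * h[n - i];
        tmp += x[i + 1] * h[n - (i + 1)];
    }
    return tmp;
}
```

Both functions return the same value for any even `n`; the unrolled loop body is what gets extracted as "Loop body 1" in the figure.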

34

Framework

ARMOR model of

DART Compilation Gateways

Specialized Information

Source Code

Optimized Binary Code

Compilation Library

Code selection

Register allocation

Scheduling

Retargeting compilation framework CALIFE


35

[Figure: data-flow graphs made of multiplications (*) and additions (+)]

HW configuration generation
• gDART transforms the nested loops (regular processing) into HW configurations
• Based on classical techniques used in high-level synthesis
  • Loop reduction, merging, …
  • Graph depth reduction
  • Resource binding, memory allocation
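Graph depth reduction can be illustrated on a four-term sum of products. This is a generic sketch of the classical tree-height-reduction idea, not gDART's actual algorithm; all function names are illustrative.

```c
#include <assert.h>

/* Chained accumulation: ((p0 + p1) + p2) + p3, three adders in series. */
static int dot4_chain(const int x[4], const int h[4])
{
    int acc = x[0] * h[0];
    acc += x[1] * h[1];
    acc += x[2] * h[2];
    acc += x[3] * h[3];
    return acc;
}

/* Balanced tree: (p0 + p1) + (p2 + p3), two adder levels only. */
static int dot4_tree(const int x[4], const int h[4])
{
    int left  = x[0] * h[0] + x[1] * h[1];
    int right = x[2] * h[2] + x[3] * h[3];
    return left + right;
}

/* Adder depth of a chain of n terms. */
static int chain_depth(int n)
{
    assert(n >= 1);
    return n - 1;
}

/* Adder depth of a balanced tree of n terms: ceil(log2(n)). */
static int tree_depth(int n)
{
    assert(n >= 1);
    int depth = 0;
    for (int width = 1; width < n; width *= 2)
        depth++;
    return depth;
}
```

Both forms compute the same integer result, but the balanced tree shortens the critical path from 3 additions to 2 for four terms, at the cost of needing the adders in parallel.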

36

Simulator
• SCDART is a bit-accurate and cycle-accurate simulator developed in SystemC at the Register Transfer Level
  • Verification
  • Power and performance estimation


37

DART Architecture

  Architecture Principles of DART

  Compilation Workflow

  Validation and Silicon Prototype

38

DART Architecture

• 5-10 GOPS/cluster @ 130nm
• 300 mW @ 200MHz
• 16 MOPS/mW @ 5 GOPS
• Simulator, Compiler Tools
• Delivered as an RTL model
• Circuit (ST 130nm) in June 2005

Config Mem. FPGA

Ctrl DMA

Ctrl

RDP1

RDP2

RDP3

RDP4

RDP5

RDP6

Data. Mem.

Segmented Network

reg reg FU1 FU2 FU3 FU4

Fully Connected Network

Data mem1

Data mem2

Data mem3

Data mem4

AG1 AG2 AG3 AG4

Loop Management

  3G/UMTS Mobile Terminal   802.11a (Channel Est.)

  STMicroelectronics   CEA LIST/LETI


39

Fresh Circuit (CEA)
• 4G mobile terminals
• Technology: ST 0.13 µm
• CPU core: ARM946
• 4.8 Mgates
• Chip area = 80 mm²
• Package: TBGA 420
• Core power supply: 1.2 V

Silicon prototype of DART
• Complex SoC including a DART accelerator
• Collaboration between IRISA/Cairn (DART architectural design, synthesis, validation), CEA List (validation, integration) and CEA Leti (integration, back-end)

40

WCDMA Receiver

D.C. h(n)

Nyquist Filter

A D

AGC

s(n)

RRC

Rake Receiver - Synchronisation - Channel estimation - Decoding

WCDMA/UMTS Receiver

3900 MOPS 500 MOPS


41

Complete receiver on a DART cluster

Filtering (54613 cy.)

9 cy.

Synchronisation Fchip (4608 cy.)

9 cy.

Synchro. Fsymb (36 cy.)

9 cy.

Channel estim. (8 cy.)

3 cy.

Decoding (2560 cy.)

3 cy.

114.8 mW ⇔ 38.8 MOPS/mW @ 6.2 GOPS

[Pie chart: slice labels 1%, 9%, 6%, 5%, 79%. Legend: Instruction reading and decoding; Data access in the DPRs; Data access in the cluster; Address generator; Operators]

42

[Plot: Log2(Texec) versus number of symbols (1 to 1,000,000, logarithmic axis) for C64, DART and Xc200E, with the real-time limit]

Positioning DART

  DSP is not real time   Reconfiguration (2.7ms) overhead for the FPGA

  Processing of several symbols (> 150 symbols)

  Temporary results (> 1.2Mbits) in memory

C64: 1.5 MOPS/mW, Xc200E: 3 MOPS/mW DART: 39MOPS/mW


43

[Bar chart: number of cycles (axis from 0 to 400) for CPU64, TM-1000, 320C62, Ring-8, Ring-64, Pentium I, Pentium II, NEC V830, DART and DART SWP; categories shown: VLIW, Superscalar, Reconfigurable]

DCT implementation

44

Motion estimation   Video coding MPEGx, H26x

Motion Vector (u,v)

Reference Block NxN

Matched Block NxN

N+2p

Search Window

p

sadmin = MAXINT; mvx = 0; mvy = 0;
for (u = -p; u <= p; u++) {
  for (v = -p; v <= p; v++) {
    sad = 0;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        sad = sad + ABS(BR(i,j) - FR(i+u,j+v));
        /* if (sad >= sadmin) break; */
      }
    }
    if (sad < sadmin) { sadmin = sad; mvx = u; mvy = v; }
  }
}


45

Motion estimation   SAD calculation

HW Configuration

- ABS

BR FR

sadmin = MAXINT; mvx = 0; mvy = 0;
for (u = -p; u <= p; u++) {
  for (v = -p; v <= p; v++) {
    sad = 0;
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        sad += ABS(BR(i,j) - FR(i+u,j+v));
      }
    }
    if (sad < sadmin) { sadmin = sad; mvx = u; mvy = v; }
  }
}

+

46

Gantt diagram for ME (tentative)

SAD on search window

SAD

Update Reference MB Current MB

HW SW

min

HW

256 cy 256 cy

12 cy

47 cy 1 cy

3 cy

16 cy

  N=8


47

Power distribution inside a DART cluster during ME

[Pie chart: slice labels 1%, 25%, 9%, 16%, 49%. Legend: Instruction reading and decoding; Data access inside the DPRs; Data access inside the cluster; Address generation; Operators]

48

Conclusions 1/2
• Definition of an architecture that is dynamically reconfigurable at the functional level
  • High performance
  • Controlled power consumption
  • Flexibility
  • Minimized impact of reconfiguration
    • Performance
    • Power consumption
  • Hierarchical organisation
    • Exploitation of parallelism


49

Conclusions 2/2
• Definition of a development flow
  • C front-end
  • Semi-automatic partitioning
  • Mixed compilation / architectural synthesis method
  • RTL simulation and power consumption estimation
• Validation
  • Comparison of the different reconfiguration paradigms
    • Performance
    • Power consumption
    • Reconfiguration cost

50

Ongoing work
• Mozaic platform (PhD thesis of Julien Lallet)
  • DART remains specific to one application domain
  • Make generic
    • the structure of the architecture
    • the reconfiguration mechanisms
  • Specification in an ADL
    • RTL code generation
• MOREA architecture (PhD thesis of Erwan Grâce)
  • Optimisation of the memory hierarchy
  • "Reconfigurable" memory and address generators
  • Multi-cluster system study
  • Reconfiguration management
  • FPGA prototype (in progress)


51

Perspectives for transmedi@
• Specialisation to the application domain
  • Video, audio
  • Transcoding, multi-standard
• Management of multiple video streams
• "Co-processing" mode
• Interface management

CAIRN Project-Team Energy-Efficient Reconfigurable System-on-Chip

DART Coarse-Grain Reconfigurable Architecture

Olivier Sentieys [email protected]

with contribution from Raphaël David (CEA List), Sébastien Pillement (IRISA)


53

Appendix  Architecture   Tools   Validation

54

SIMD Multiplier Architecture

16

16-bit Booth-Wallace mul/add

8-bit carry-save mul/add

Input A

Input B

Output

SIMD

16

32

L L L L L : Latch

Shifter

8-bit carry-save mul/add

Mux

Demux

OP

Shift


55

ALU architecture

Arithmetic Unit ADD, SUB, ABS

Input A

Input B

Output

SIMD

Shifter

Logic Unit AND, OR, …

Mux

Demux

Command

Output Shift

Shifter

L L L L

Input Shift

56

Fully connected network

Global Bus

Mem

FU1

FUs + registers

2:1

4:1 14:1

2:1

4:1 14:1

FU4

2:1

4:1 14:1

2:1

4:1 14:1


57

Connections to global bus

11:1 11:1

decod

11:1 11:1

Configuration Even Bus

Odd Bus Mem. FUs Reg.

2 4 4

‘ Z ’

58

Segmented network

RDP i

Configurable Interconnection

RDP i+1


59

Mem 1 64x16

Data Mem1

decod

@

Instr

datapath @ 1

Seq1

Data Mem4

decod

@

Instr

datapath @ 4

Seq4

Zero - overhead loop support

Mem ‘  64x16

Address generation unit
• Generates the address sequences for the data processed inside the DPRs
• Addressing modes:
  • Pre- or Post-Increment, Modulo, Bit reverse, …
• Hardware loop management
  • Up to 4 nested loops
  • Loop body of up to 8 instructions
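Two of the addressing modes listed above, post-increment with modulo and bit reverse, can be modelled in a few lines of C. This is a behavioural sketch only; the function names and the unsigned-integer address representation are assumptions made for the example and do not describe the actual AG instruction set.

```c
#include <assert.h>

/* Post-increment with modulo: return the current address, then
 * advance the pointer by `step`, wrapping at `modulo`
 * (circular-buffer addressing). */
static unsigned ag_post_inc_mod(unsigned *ptr, unsigned step, unsigned modulo)
{
    assert(ptr && modulo > 0);
    unsigned addr = *ptr;
    *ptr = (*ptr + step) % modulo;
    return addr;
}

/* Bit-reversed addressing: reverse the low `bits` bits of `index`,
 * as used to reorder FFT inputs or outputs. */
static unsigned ag_bit_reverse(unsigned index, unsigned bits)
{
    unsigned reversed = 0;
    for (unsigned b = 0; b < bits; b++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}
```

With a 64-word local memory, calling `ag_post_inc_mod(&p, 1, 64)` repeatedly walks the memory as a circular buffer, and `ag_bit_reverse(i, 6)` maps a 6-bit index to its bit-reversed address.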

60

Address generation unit

Mem @

64x16b

Seq

RI decod

@

data

R2 R3

+/++/-/--/NoP modulo

NoP/ Bit_reverse

MUX1 $1

Push N, M

R4 R5

R0 R1

R6 R7

MUXA MUXB

latch @ data

MUX2

MUXC


61

Sequencer

PC

++ -M/NOP

M1+M2+M3+M4 M2+M3+M4

M3+M4

M4

push

LIFO_minus_M

load

threshold

pop

M

Cd_minus_M

clk

Pointer

reset push

62

M1 N1 M2 N2 M3 N3 M4 N4

Cpt 1 Cpt 2 Cpt 3

CPT

++

empty

reset

=N ?

pop

Data_out

Data_in

push

LIFO

load

=M ? load

pop

+ +

M N

Cd_minus_M

+

Hardware loop management


63

Appendix   Architecture

 Tools   Validation

64

The cDART and ACG compilers

ACG cDART

Compilation Compilation

Data access extraction

Irregular code extraction

Irregular processing + data manipulation

SW instructions Address generation instructions

Parser assembler -> AG codes

Parser assembler -> SW config

Compilation


65

FU2 FU3 FU4 FU1

network

Mem1 Mem2 Mem3 Mem4

Armor model of DART

AG1 AG2 AG3 AG4

Mem5 Mem6 Mem7 Mem8

AG5 AG6 AG7 AG8

Mem21 Mem22 Mem23 Mem24

AG21 AG22 AG23 AG24

Cluster Memory

Memory Controller

66

Appendix   Architecture   Tools   Validation


67

Task vs. Operation parallelism

FIR, 6 DPRs

47 %

Rake Receiver 6

DPRs, 9 %

4 cy Other

threads, 6 DPRs,

44 %

11 cy

4 DPRs, 59 %

2 DPRs, 27 %

6 DPRs, 41%

11 cy

4 cy 2 DPR, 32 %

1 DPR, 53 %

1 DPR, 59 %

11 cy

4 cy

4 DPRs, 59 %

6 DPRs, 41%

68

Reconfiguration cost

  Configuration data   1x1.4 Mbits for the FPGA

  Control data   Ncyclex256 bits for the DSP
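The DSP control volume quoted above grows linearly with the cycle count. A minimal sketch of that arithmetic follows, assuming one 256-bit control word per cycle as stated on the slide; the constant and function names are illustrative.

```c
#include <assert.h>
#include <limits.h>

/* From the slide: Ncycle x 256 bits of control data for the DSP. */
enum { DSP_CONTROL_BITS_PER_CYCLE = 256 };

/* Total control data volume, in bits, read by the DSP over
 * `ncycle` cycles. */
static unsigned long long dsp_control_bits(unsigned long long ncycle)
{
    /* guard against overflow of the multiplication */
    assert(ncycle <= ULLONG_MAX / DSP_CONTROL_BITS_PER_CYCLE);
    return ncycle * DSP_CONTROL_BITS_PER_CYCLE;
}
```

For example, `dsp_control_bits(1000)` is 256000 bits, whereas the FPGA pays its configuration volume once per reconfiguration regardless of how many cycles the task then runs.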

[Bar chart, logarithmic axis from 1 to 100,000,000: data volume (bits) for Configuration (filter), Control (filter), Configuration (Rake) and Control (Rake) on C64, Xc200E and DART. Bar labels in extraction order: 14423016, 520, 13107200, 53248, 14423016, 1716, 2010624, 208]


69

Implementation of the WCDMA/FIR on a DART cluster with the SIMD mode

Nb of DPR   Nb cy/sample   Resource usage rate (%)   DPR usage rate (%)   Nb configuration instructions
3           7              90                        82.7                 8
4           5              100                       59.1                 9
5           5              84                        59.1                 12
6           4              92                        47.3                 11

[Chart: number of cycles needed to process a sample, resource usage rate (%), DPR usage rate (%) and number of configuration instructions versus number of allocated DPRs (3 to 6)]

70

Implementation of the WCDMA/Rake on a DART cluster with the SIMD mode

Nb of DPR   Nb cy/symbol (x100)   Resource usage rate (%)   DPR usage rate (%)   Nb configuration instructions
1           15.51                 100                       17.8                 4
2           7.78                  100                       8.9                  4
3           5.21                  100                       5.9                  4
4           5.21                  75                        5.9                  4
5           5.21                  60                        5.9                  4
6           2.63                  100                       3                    4

[Chart: number of cycles (/50) needed to process a symbol, resource usage rate (%), DPR usage rate (%) and number of configuration instructions versus number of allocated DPRs (1 to 6)]