Architecture-Level Synthesis for Automatic Interconnect Pipelining


Jason Cong, Yiping Fan, Zhiru Zhang

VLSI CAD Lab

Computer Science Department

University of California, Los Angeles

{cong, fanyp, zhiruz}@cs.ucla.edu

Funded by GSRC, NSF, and Altera Corp.

Outline

Motivation

Our contributions

RDR-Pipe micro-architecture

• Regular Distributed Register micro-architecture with interconnect pipelining

Synthesis flow and algorithms

• MCAS-Pipe: automatic interconnect pipelining and sharing

Experimental results

Conclusions

Interconnect Bottleneck in Nanometer Designs

[Figure: chip map showing the regions reachable in 1, 2, 3, 4, and 5 clock cycles; distance axis marked at 11.4, 22.8, and 28.3]

Challenge: single-cycle full-chip communication will no longer be possible

Not supported by the current CAD toolset

ITRS’01 0.07um technology

5.63 GHz across-chip clock

800 mm2 die (28.3mm x 28.3mm)

IPEM BIWS estimations

Buffer size: 100x

Driver/receiver size: 100x

On the semi-global layer (Tier 3), a signal can travel up to 11.4mm in one cycle

5 clock cycles are needed from corner to corner
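The five-cycle figure can be checked with a short calculation. This is a rough sketch that assumes the corner-to-corner route follows a Manhattan (horizontal plus vertical) path, which the slide does not state; the function name is illustrative.

```python
import math

def cycles_to_cross(die_side_mm, reach_per_cycle_mm):
    """Clock cycles needed to go from one die corner to the opposite one."""
    # Assumption: the wire is routed as a Manhattan path, so its length is
    # one die width plus one die height.
    manhattan_mm = 2 * die_side_mm
    return math.ceil(manhattan_mm / reach_per_cycle_mm)

# Values from the slide: 28.3mm x 28.3mm die, 11.4mm reach per cycle.
print(cycles_to_cross(28.3, 11.4))
```

With the slide's numbers the path is 56.6mm long, which rounds up to five cycles at 11.4mm per cycle.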

Related Work

Retiming with placement or floorplanning

• Retiming + multilevel partitioning [Cong et al., ICCAD’00] and coarse placement [Cong et al., DAC’03]

• Retiming + floorplanning [Chong & Brayton, IWLS’01]

• Retiming + placement for FPGAs [Singh & Brown, FPGA’02]

Global wire pipelining in the Itanium(TM) processor [McInerney et al., ISPD’00]

Buffer and flip-flop insertion in RTL [Lu et al., DATE’02] [Cocchini, ICCAD’02]

Limitations of Exploring Multicycle Communication at the Logic/Physical Level

The minimum clock period achievable by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit [Papaefthymiou, MST’94]

• In a loop: 4 logic cells, 2 registers

• Cell delay = 1ns

• Interconnect delay = 1ns

• DR ratio = (Dlogic + Dint) / #Registers = (4+4)/2 = 4ns

• Clock period ≥ 4ns

Interconnect pipelining by flip-flop insertion?

Requires a considerable amount of manual rework on the original RTL descriptions
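The DR-ratio bound from the loop example can be reproduced in a few lines. This is a minimal sketch; the function names and the list-based loop description are illustrative and not part of the original work.

```python
def dr_ratio(logic_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: (Dlogic + Dint) / #Registers."""
    return (sum(logic_delays_ns) + sum(wire_delays_ns)) / num_registers

def clock_period_lower_bound(loops):
    """The clock period cannot go below the largest DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)

# Loop from the slide: 4 cells at 1ns, 4 wire segments at 1ns, 2 registers.
slide_loop = ([1.0] * 4, [1.0] * 4, 2)
print(clock_period_lower_bound([slide_loop]))

# Hypothetical variant: the same loop with 4 registers instead of 2.
print(dr_ratio([1.0] * 4, [1.0] * 4, 4))
```

The first call gives the 4ns bound stated on the slide. The second shows how the bound would drop if the loop held more registers, which is the effect flip-flop insertion aims for.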

Our Approach

Consideration of multicycle communication during architectural (or behavioral) synthesis [Cong et al., ISPD’03] [Cong et al., ICCAD’03]

Regular Distributed Register (RDR) micro-architecture

• Highly regular

• Direct support of multicycle on-chip communication

MCAS: Architectural Synthesis for Multi-cycle Communication

• Efficiently maps the behavioral descriptions to RDR uArch

• Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning

This work

Extension of RDR and MCAS for interconnect pipelining

Regular Distributed Register Micro-Architecture

[Figure: grid of islands of width Wi and height Hi joined by global interconnect; each island holds a Local Computational Cluster (LCC) with units such as ALU, MUL, and MUX, an FSM, and a register file with banks labeled 1 cycle, 2 cycles, …, k cycles]

Distribute registers to each “island”

Choose the island size such that local computation and communication in each island can be done in a single cycle

Use register banks: registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication in each island

Wiring Overhead in RDR Designs

Data transfers r1→r3 and r2→r4 are overlapped

Two dedicated global wires are needed

[Figure: ALU1 and MUL1 connected by interconnects with a delay of 2 cycles; a five-cycle schedule (Cycle 1 to Cycle 5) of + and * operations bound to ALU1 and MUL1, with registers r1 to r4 marked as sender or receiver registers]
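Why overlapped transfers cost extra wires can be sketched by counting how many transfers occupy the interconnect at the same time. The sketch assumes that, without pipelining, each transfer holds its wire for the full interconnect delay; the start cycles are hypothetical, since the figure's exact values did not survive.

```python
def wires_needed(start_cycles, occupancy):
    """Peak number of transfers in flight, i.e. dedicated wires required."""
    events = []
    for start in start_cycles:
        events.append((start, 1))               # wire acquired
        events.append((start + occupancy, -1))  # wire released
    # Tuples sort so that a release at cycle t comes before an acquire at t.
    events.sort()
    in_flight = peak = 0
    for _, delta in events:
        in_flight += delta
        peak = max(peak, in_flight)
    return peak

# Two 2-cycle transfers starting one cycle apart overlap: two wires.
print(wires_needed([1, 2], occupancy=2))
# The same transfers started two cycles apart could reuse one wire.
print(wires_needed([1, 3], occupancy=2))
```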

Architectural Solution: RDR-Pipe

Keep the intra-island structures

Inter-island pipeline register station (PRS) for global communications

PRS performs autonomous store-and-forward

Synchronous design

No global control signal needed for PRS

[Figure: grid of islands (LCC, FSM, Reg. File) with V channels, H channels, and Pipeline Register Stations (PRS); inset of a PRS with labels 1 to 6]

Reducing Wiring Overhead in RDR-Pipe

Data transfers are pipelined

One wire with a pipeline register is enough

[Figure: ALU1 and MUL1 connected by a 2-cycle communication link; the five-cycle schedule (Cycle 1 to Cycle 5) of + and * operations, with registers marked as sender, receiver, or pipeline registers]
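The store-and-forward behavior of a pipelined link can be illustrated with a small cycle-by-cycle simulation. This is a sketch under assumptions, not the authors' design: the link is modeled as a plain shift register with one stage per cycle of delay, the last stage playing the role of the receiver register, and the function name is hypothetical.

```python
def simulate_pipelined_link(issued, delay, n_cycles):
    """Shift values through a link that takes `delay` cycles end to end.

    issued: {cycle: value} driven by the sender register in that cycle.
    Returns {cycle: value} visible in the receiver register.
    """
    # stages[:-1] are pipeline registers, stages[-1] is the receiver register.
    stages = [None] * delay
    arrivals = {}
    for cycle in range(1, n_cycles + 1):
        if stages[-1] is not None:
            arrivals[cycle] = stages[-1]
        # Clock edge: every stage forwards its value, with no global control.
        stages = [issued.get(cycle)] + stages[:-1]
    return arrivals

# Two transfers issued in consecutive cycles share one 2-cycle link.
print(simulate_pipelined_link({1: "a", 2: "b"}, delay=2, n_cycles=5))
```

Both values arrive two cycles after they were issued, although they were in flight on the same wire at the same time.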

Synthesis Flow: MCAS-Pipe System

[Figure: MCAS-Pipe flow from C / VHDL to RTL VHDL & floorplan constraints. Steps shown: CDFG generation; resource allocation & functional unit binding; scheduling-driven placement; placement-driven rescheduling & rebinding; global interconnect sharing; register and port binding; datapath & FSM generation. Intermediate data labeled CDFG, ICG, and Locations]

Global interconnect sharing

• After scheduling and functional unit binding

• Before register and port binding

• Enables multiple data communications to share a physical link (a wire with pipeline registers)

Advantages over MCAS

• Expected to reduce global wiring demand

• No multicycle path constraint needed

Global Interconnect Sharing

Conflicted data transfers: two physical links are needed to support the concurrent data transfers

[Figure: islands A and B with D = 2, producers pe and pg, consumers ce and cg; a seven-cycle schedule (Cycle 1 to Cycle 7) with sender, pipeline, and receiver registers marked]

Compatible data transfers: only one physical link is required to support the scheduled data transfers

[Figure: the same islands and seven-cycle schedule, with the two transfers sharing one link]

Now, two producer registers can be merged, since their lifetimes become compatible

[Figure: islands A and B with D = 2; pe and pg share one register; consumers ce and cg]

Global Pipelined Interconnect Minimization

Definitions

• Data links: pipelined global interconnects

• Channel: set of data links between two islands

• Width of a channel: number of its data links

• Data transfer: movement of data from a producer to a consumer

Architectural assumption

Channels cannot share interconnects

Theorem

Global pipelined interconnects are minimized if and only if the width of every channel is minimized
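The theorem follows from the assumption that channels do not share interconnects. A short derivation, writing $C$ for the set of channels, $w(c)$ for the width of channel $c$, and $w^{*}(c)$ for its minimum feasible width (these symbols are introduced here for illustration only):

```latex
W_{\text{total}} = \sum_{c \in C} w(c),
\qquad w(c) \ge w^{*}(c) \quad \forall c \in C
\;\Longrightarrow\;
W_{\text{total}} \ge \sum_{c \in C} w^{*}(c),
\quad \text{with equality iff } w(c) = w^{*}(c) \ \ \forall c \in C .
```

The lower bound can be reached for all channels at once only if each channel can be scheduled independently of the others, which is assumed here.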

Transfer Scheduling for a Single Channel

A decision problem formulation

Given:

• A channel $(A, B)$ containing $m$ data links

• A data transfer set $\{e \mid p_e \in A \text{ and } c_e \in B\}$, where each transfer $e$ is associated with an arrival time $T(p_e)+1$, a deadline $T(c_e)-D(A, B)$, and unit effective occupancy time

Fact: for every time slot, at most one transfer can be issued on a data link

Objective: to find a feasible transfer schedule on these data links

Transfer scheduling is polynomially solvable

A special real-time scheduling problem [J. Blazewicz, 1979]

• Binary search for the minimum feasible channel width $m$

• For each width, apply Earliest-Deadline-First (EDF) scheduling: $O(n \log n)$

• Overall time complexity: $O(n \log^2 n)$
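The two steps above (an EDF feasibility check wrapped in a binary search on the channel width) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: all function names are hypothetical, time slots are assumed to be integers, and the sample producer and consumer times are made up.

```python
import heapq

def transfer_window(t_producer, t_consumer, delay):
    """Issue window of a transfer: arrival T(p_e)+1, deadline T(c_e)-D(A,B)."""
    return (t_producer + 1, t_consumer - delay)

def edf_schedule(windows, m):
    """Issue unit-length transfers on m data links, earliest deadline first.

    windows: list of (arrival, deadline) integer slots, both inclusive.
    Returns {transfer index: (slot, link)}, or None if a deadline is missed.
    """
    order = sorted(range(len(windows)), key=lambda i: windows[i][0])
    ready = []                      # min-heap of (deadline, index)
    schedule = {}
    nxt, slot = 0, 0
    while nxt < len(order) or ready:
        if not ready:               # nothing waiting: jump to the next arrival
            slot = windows[order[nxt]][0]
        while nxt < len(order) and windows[order[nxt]][0] <= slot:
            idx = order[nxt]
            heapq.heappush(ready, (windows[idx][1], idx))
            nxt += 1
        for link in range(m):       # at most one transfer per link per slot
            if not ready:
                break
            deadline, idx = heapq.heappop(ready)
            if deadline < slot:
                return None
            schedule[idx] = (slot, link)
        slot += 1
    return schedule

def min_channel_width(windows):
    """Binary search for the smallest m that EDF can schedule."""
    if not windows:
        return 0
    lo, hi = 1, len(windows)
    if edf_schedule(windows, hi) is None:
        return None                 # some window is empty (deadline < arrival)
    while lo < hi:
        mid = (lo + hi) // 2
        if edf_schedule(windows, mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical channel with D(A, B) = 2 and made-up (T(p_e), T(c_e)) pairs.
windows = [transfer_window(tp, tc, 2) for tp, tc in [(0, 3), (0, 3), (1, 5)]]
print(windows, min_channel_width(windows))
```

The binary search relies on feasibility being monotone in the width: adding a data link never makes a feasible schedule infeasible.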

EDF-Based Transfer Scheduling Example

[Figure: six data transfers (1 to 6) plotted against time slots and assigned to Data Link 1 and Data Link 2 under two orderings]

Ordered by left edge: failed for 2 data links!

Ordered by Earliest-Deadline-First: successfully scheduled onto 2 data links
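The contrast between the two orderings can be reproduced on a small instance. The six windows below are hypothetical, chosen to show the effect, and are not the ones in the slide's figure; "left edge" is interpreted here as serving transfers in order of arrival time, which is an assumption.

```python
import heapq

def greedy_schedule(windows, m, priority):
    """Issue unit-length transfers on m links, lowest priority value first.

    windows: list of (arrival, deadline) integer slots, both inclusive.
    Returns {transfer index: slot}, or None if a deadline is missed.
    """
    order = sorted(range(len(windows)), key=lambda i: windows[i][0])
    ready, slots = [], {}
    nxt, slot = 0, 0
    while nxt < len(order) or ready:
        if not ready:
            slot = windows[order[nxt]][0]
        while nxt < len(order) and windows[order[nxt]][0] <= slot:
            idx = order[nxt]
            heapq.heappush(ready, (priority(windows[idx]), idx))
            nxt += 1
        for _ in range(min(m, len(ready))):
            _, idx = heapq.heappop(ready)
            if windows[idx][1] < slot:
                return None
            slots[idx] = slot
        slot += 1
    return slots

# Hypothetical (arrival, deadline) windows for six transfers.
windows = [(1, 1), (1, 3), (1, 3), (2, 2), (2, 2), (3, 3)]

left_edge = greedy_schedule(windows, 2, priority=lambda w: w[0])
edf = greedy_schedule(windows, 2, priority=lambda w: w[1])
print("left edge:", left_edge)
print("EDF:", edf)
```

On this instance the arrival-ordered pass spends slot 2 on a transfer that could have waited, so a transfer with deadline 2 is missed, while EDF fits all six transfers on two links.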

Experiment Settings

[Figure: Conventional, MCAS, and MCAS-Pipe flows from C / VHDL to Altera QuartusII + Stratix. Steps shown: CDFG generation; conventional scheduling; functional unit allocation & binding; scheduling-driven placement; placement-driven rebinding & rescheduling; global interconnect sharing; register and port binding; datapath & control generation. Also shown: uArch. spec. and target clock period]

RTL VHDL files (for all flows)

Floorplan constraints (for MCAS and MCAS-Pipe)

Multicycle path constraints (for MCAS only)

Experimental Results: Register and LE Usage

Designs   Node#   MCAS            CONV / MCAS     MCAS-Pipe / MCAS
                  Reg#    LE      Reg#    LE      Reg#    LE
PR        46      31      1181    0.71    0.95    1.19    0.95
WANG      52      40      1435    0.63    0.81    1.20    0.85
LEE       53      29      988     0.76    0.96    1.00    0.84
MCM       98      57      2467    0.75    1.00    1.05    1.19
HONDA     101     41      2542    0.83    0.90    1.05    1.01
DIR       152     44      2260    0.75    0.95    1.05    1.01
Average   -       -       -       0.74    0.93    1.09    0.98

Design environment: Altera QuartusII, Stratix EP1S40

MCAS vs. Conventional flow: uses more registers and logic elements (LE)

MCAS-Pipe vs. MCAS: slightly more registers, and comparable logic element cost
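The Average row can be recomputed from the six per-design ratios in the table. The sketch below uses decimal arithmetic and assumes the averages were rounded half-up to two places, which the slide does not state; the function name is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Ratios from the table, in column order:
# CONV/MCAS Reg#, CONV/MCAS LE, MCAS-Pipe/MCAS Reg#, MCAS-Pipe/MCAS LE.
ratios = {
    "PR":    ("0.71", "0.95", "1.19", "0.95"),
    "WANG":  ("0.63", "0.81", "1.20", "0.85"),
    "LEE":   ("0.76", "0.96", "1.00", "0.84"),
    "MCM":   ("0.75", "1.00", "1.05", "1.19"),
    "HONDA": ("0.83", "0.90", "1.05", "1.01"),
    "DIR":   ("0.75", "0.95", "1.05", "1.01"),
}

def column_averages(rows):
    """Arithmetic mean of each column, rounded half-up to two decimals."""
    count = Decimal(len(rows))
    averages = []
    for column in zip(*rows.values()):
        mean = sum(Decimal(v) for v in column) / count
        averages.append(mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
    return averages

print(column_averages(ratios))
```

Under that rounding assumption the four results agree with the Average row: 0.74, 0.93, 1.09, and 0.98.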

Experimental Results: Performance

Design environment: Altera QuartusII, Stratix EP1S40

MCAS vs. Conventional flow: 36% reduction in clock period and 30% in total latency

MCAS-Pipe vs. MCAS: comparable design performance (4% better)

[Figure: two bar charts over PR, WANG, LEE, MCM, HONDA, DIR, and Average, comparing Conventional, MCAS, and MCAS-Pipe. Left panel: clock period (ns), axis 0 to 12. Right panel: total latency (ns), axis 0 to 600]

Interconnect Structure of Altera’s Stratix

[Figure: Stratix routing resources: local wires LL and LO; H4, V4, H8, and V8 wires; global wires V16 and H24]

Experimental Results: Wirelength

Wire types

LL, LO: local wires; H4, V4, H8, V8: short global wires

V16, H24: long global wires

MCAS-Pipe vs. MCAS:

28.8% long global wires reduction, 19.3% total wirelength reduction

[Figure: bar chart over PR, WANG, LEE, MCM, HONDA, DIR, and Average, axis 0 to 1.4, with series LL+LO, H4+V4, H8+V8, V16+H24, and Total]


Conclusions

High-level automatic on-chip interconnect pipelining

RDR-Pipe: extension of RDR micro-architecture

• Micro-architecture supporting interconnect pipelining

MCAS-Pipe: enhancement of MCAS synthesis system

• Adds a novel global interconnect sharing algorithm to effectively reduce the global wiring

Matches or exceeds the RDR-based approach in performance

Greatly reduces wiring demand (28.8% reduction in long global wires, 19.3% in total wirelength)

Thank you
