Upload
lynda
View
34
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Simultaneous Time Slack Budgeting and Retiming for Dual-Vdd FPGA Power Reduction. Yu Hu 1 , Yan Lin 1 , Lei He 1 and Tim Tuan 2 1 EE Department, UCLA 2 Xilinx Research Lab Presented by Yu Hu Partially supported by NSF. Address comments to [email protected]. Outline. - PowerPoint PPT Presentation
Citation preview
Simultaneous Time Slack Budgeting and Retiming for Dual-Vdd FPGA Power Reduction
Simultaneous Time Slack Budgeting and Retiming for Dual-Vdd FPGA Power Reduction
Yu Hu1, Yan Lin1, Lei He1 and Tim Tuan2
1EE Department, UCLA
2Xilinx Research LabPresented by Yu HuPresented by Yu Hu
Partially supported by NSF. Partially supported by NSF.
Address comments to [email protected] comments to [email protected]
OutlineOutline
Background, Motivation and Problem Formulation
Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments]
Simultaneous Vdd Level Assignment and Retiming
Experimental Results
Conclusions
BackgroundBackground
Existing FPGAs are power inefficient compared to ASICs.
Interconnect is the dominant component of FPGA power dissipation (dynamic and leakage). [Li, TCAD‘05]
Power aware FPGA architectures and CAD algorithms have been studied extensively. CAD algorithms to minimize power-delay product
[Lamoureux, ICCAD’03] Configuration inversion for leakage reduction
[Anderson, FPGA’04] Vdd-programmable FPGA logic blocks [Li, FPGA’04] [Li, DAC’04] Vdd-programmable FPGA interconnects [Li, ICCAD’04] [Gayasen, FPL’04] [Anderson, ICCAD’04] [Lin, DAC’05]
Island style and mixed wire segment length.
Routing switch/connection block (Two PMOS power transistors M3 and M4 are inserted between the tri-state buffer and VddH, VddL power rails, respectively.) [Li, ICCAD’04]
Level converter free in routing tree (Guarantee that no VddL switch
drives VddH switches.) with LEAST area and power penalty[Lin, TCAD’06].
Vdd Programmable Interconnect Arch.Vdd Programmable Interconnect Arch.
Limitation of Existing ApproachesLimitation of Existing Approaches
The most recent work [Lin, DAC'05] for programmable dual-Vdd FPGA considers timing slack budgeting to maximize power reduction Uniform wire segment length was assumed, and can not be extended to
mixed wire segment. Vdd level assignment was performed in combinational sub-circuits.
Simultaneous retiming and timing budgeting has been studied to reduce area or improve performance.[Yeh, DAC'03] [Yeh, ICCAD'03] Power reduction has not been considered. Post-layout flip-flop binding constraints were not addressed.
Call for Simultaneous Vdd Assignment and RetimingCall for Simultaneous Vdd Assignment and Retiming
[Lin, DAC'05] performs Vdd level assignment in the combinational sub-circuit, which limits the searching space.
Simultaneous retiming and Vdd assignment explores larger searching space to extract more useful timing slack for VddL switches insertion.
4 (0) 4 (0)
3 (1)3 (1
)
4 (0) 4 (0)
2 (2)4 (0
)
4 (0)
4 (0)
3 (1)
3 (1
)
VddL Switch Inserted !
All VddH Switches !
Combinational Assign Sequential
Assignment
Movable with Retiming!
2 units slack needed for a VddL switch insertionInterconnect
DelayTiming Slack
Major ContributionsMajor Contributions
Power-aware post-layout re-synthesis processes:Sequential vs. Simultaneous
Vdd level assignment for mixed wire segments FPGAs.53% interconnect power reduction is achieved compared to single Vdd designs.
Simultaneous retiming and interconnect Vdd assignment with flip-flop binding constraints. Up to 20% further interconnect power reduction is achieved compared to sequential flow.
Synthesis / Placement / routing
Simultaneous retiming and interconnect Vdd assignment
CLB level Vdd assignment
Min-clock retiming
Interconnect Vdd assignment
Global refinement
Problem Formulations Problem Formulations
[ Dual-Vdd Level Assignment Problem ]
Given: placement and routing results of a FPGA design
Find: A Vdd-level assignment to each interconnect switch
Objective: Minimize interconnect (dynamic and leakage) power
Constraints: Meet the delay target Tspec
Vdd-level converters are inserted ONLY at CLB inputs/outputs
[ Simultaneous Retiming and Dual-Vdd Level Assignment Problem ]
Same to Dual-Vdd level assignment problem in addition to:
Retiming as an extra design freedom
Satisfy post-layout flip-flop binding constraints.
OutlineOutline
Background, Motivation and Problem Formulation
Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments] Interconnect Power Reduction Estimation LP Based Vdd-level Assignment Algorithm
Simultaneous Vdd Level Assignment and Retiming
Experimental Results
Conclusions
Delay and Power Model for InterconnectDelay and Power Model for Interconnect
Delay Model Intrinsic delay and effective driving resistance of
switch has been pre-characterized using SPICE. Elmore delay is used to calculate routing delay.
Interconnect Power Model Dynamic power Pd(Vddjj)=0.5fclk*C*Vddjj
2
Leakage power Pl(Vddjj) is pre-characterized using SPICE
Interconnect power reduction estimation is the essential part of dual-Vdd assignment algorithm.
Review of Vdd Level Assignment Algorithm [Lin, DAC'05]Review of Vdd Level Assignment Algorithm [Lin, DAC'05]
Chip-level timing slack allocation (LP)
Net-level Bottom-up Vdd assignment
Refinement
Interconnect Vdd assignment
...
...
The net-level bottom-up Vdd assignment guarantees the
legalization of final solutions.[Lin, DAC’05]
Leverage all extra slack with VddL switches
[Lin, DAC’05]
f(1) = .5f(2) = 1f(3) = 1f(4) = 1.5
VddL possibility for switches S2=3
S1=1
b1
b2
b3
b4
Timing Slack assigned at sinks
S2=3S1=1
b1
b2
b3
b4
Vdd assignment base on estimation
Plr(j)=f(j)*ΔPlr(j)
Pdr(j)=0.5fclk*C(j)* f(j)*ΔVdd2
Power reduction estimation
Interconnect power reduction estimation
Problem remained: How to calculate VddL possibility for
mixed wire segment?
VddL Possibility CalculationVddL Possibility Calculation
Represent timing slack in number of switches: si = Li * ( Si / Di )
si is the number of VddL switches can be inserted in the path from source to jth sink in the routing tree.
Li is the number of switches along this path. si : how many switches can be turned to VddL along source-
to-sink-i path for the given timing slack Si.
VddL possiblity for switch j at sink i based on load capacity: f(i,j) = si * (cij / Ci) Key idea: distribute timing slack to each switch based on cap.
L2 = 3D2 = 12s2 = 3*(10/12)=5/2
S2=10
S1=6
b1, 8x
b2, 8x
b3, 16x
b4, 16x f(2,2) = 1f(2,3) = 1f(2,4) = 1/2
Power Reduction Estimation for Mixed Wire SegmentsPower Reduction Estimation for Mixed Wire Segments
The lower bound estimation [Y. Lin, DAC'05] for interconnect power reduction is no longer valid for mixed wire segments.
Our solution: develop the upper bound estimation of VddL switch number
Consistent upper bound of power reduction Remove the non-linear term "min" and the corresponding extra
LP constraints from lower bound estimation
S = 2.7
b2, 8x, need 1.0 slack
b1, 16x, need 1.8 slackfn(i,1) = 0.9
fn(i,2) = 0.5lower bound of VddL switches = 0.9 + .5 = 1.4
S = 2.7
Consume 1.0
Sum up all VddL possibility
1.7 slack left -1.8 needed!
Only 1.0 VddL switch assignment
Problem here: Lower bound > actual number!
LP formulation for dual-Vdd Level AssignmentLP formulation for dual-Vdd Level Assignment
Basic timing constraints
Slack constraints
Objective function Dynamic power reduction upper bound
Leakage power reduction upper bound
Slack upper bound
Slack constraints
Slack non-negative
Arrival time for prim-output
Arrival time for prim-input
Arrival time constraints
OutlineOutline
Motivation
Problem Formulations
Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments]
Simultaneous Vdd Level Assignment and Retiming MILP formulation for retiming FPGA circuits Extra constraints for post-layout FPGA retiming Link between MILP retiming to timing budgeting
Experimental Results
Conclusions
Retiming for FPGARetiming for FPGA
Retiming graph is a directed cyclic graph.
Given the retiming graph G=(V, E), a retiming is an integer-valued vertex-labeling r: V → Z.
A weight is w(u,v) associated with edge e(u,v) denotes the number of FFs in that edge.
After retiming (re-labeling of vertices): w'(u,v) = w(u,v) + r(v) – r(u)
Retiming
SUBBLK_ IPIN
SUBBLK_ IPIN
SUBBLK_OPIN
FF_NODE
(C)
SUBBLK_ IPIN
SUBBLK_ IPIN
SUBBLK_OPIN
FF_NODE
(B)
Link between MILP retiming & timing budgetingLink between MILP retiming & timing budgeting
Extend MILP formulation in [Leiserson, Algorithmica’91] to link arrival time with retiming labeling
Timing slack in edge (u,v) is represented by
Timing slack in connection from sink Sk to the source of routing tree Ri
The real value a(v) assigned in node v is its arrival time
after retiming
R(v) = r(v) + a(v) /c,
linearize
Keep both placement and routing unchanged after retiming. No FFs in global interconnect (inter-CLB) No FFs in local interconnect (intra-CLB and inter-SLICE)
Within a single SLICE, only FF_NODE → SUBBLK_OPIN edges allow FF insertion.
Extra constraints in MILP formulation:
SUBBLK_ IPIN
SUBBLK_ IPIN
SUBBLK_OPIN
FF_NODE
(A)
Retiming Constraints 1: Placement and Flip-Flop Binding ConstraintsRetiming Constraints 1: Placement and Flip-Flop Binding Constraints
No way to assign this FF in any SLICE physically!
SUBBLK_ IPIN
SUBBLK_ IPIN
SUBBLK_OPIN
FF_NODE
(C)
SUBBLK_ IPIN
SUBBLK_ IPIN
SUBBLK_OPIN
FF_NODE
(B)
FF# can be further reduced!
The only timing edge that can insert FFs
Delay constraints for timing edges within SLICE:
Retiming Constraints 2: LUT Delay and FF Setup &Hold Time ConstraintsRetiming Constraints 2: LUT Delay and FF Setup &Hold Time Constraints
SUBBLK_ IPIN
SUBBLK_ IPIN
SUBBLK_OPIN
FF_NODE
(A)Delay = T_comb
Delay = 0
SUBBLK_ IPIN
SUBBLK_ IPIN
SUBBLK_OPIN
FF_NODE
(B)
Delay = T_seq_out
Delay = T_seq_ inLUT delay FF setup
time + LUT delay
FF hold time
FF# in edge (e)
OutlineOutline
Motivation
Problem Formulations
Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments]
Simultaneous Vdd Level Assignment and Retiming
Experimental Results Dual-Vdd Assignment for FPGAs with Mixed Wire
Segments Simultaneous Vdd Level Assignment and Retiming A runtime Efficient Post-Layout Re-Synthesis CAD Flow
Conclusions
Experimental SettingExperimental Setting
Cluster-based Island Style FPGA Structure Size-10 cluster and size-4 LUT 100% buffered interconnects, subset switch block 60% length-4 and 40% length-8l wire segments 25x buffer for length-4 and 10x buffer for length-8
ITRS 100nm technology, 1.3v for VddH and 0.8v for VddL
Use VPR [Betz-Rose-Marquardt] for placement and routing
Use fpgaEva-LP2 [Lin et al, FPGA’05] for power calculation Considering short-circuit power, glitch power and input vector 8% average error compared to SPICE simulation
10 biggest sequential MCNC benchmarks are tested
Use mosek [student license, www.mosek.com] to solveLP and MILP
EdTLC-LP algorithm achieves 85% VddL assignment.
EdTLC-LP algorithm achieves 53% interconnect power reduction for mixed length interconnect wire on average.
Experimental Results: Dual-Vdd Assignment for FPGAs with Mixed Wire SegmentsExperimental Results: Dual-Vdd Assignment for FPGAs with Mixed Wire Segments
S-VDD DUAL-VDD
Circuit Cluster# power VddL power clock Runtime (s)
tseng 131 10.2 96% 3.9(-62%) 13.1 32
dsip 162 111.4 72% 66.5(-40%) 6.2 44
diffeq 195 10.2 88% 4.0(-61%) 13.5 41
s298 256 23.3 83% 10.9(-53%) 24.3 96
bigkey 294 96.1 73% 57.9(-40%) 7.0 89
elliptic 421 35.9 90% 14.9(-58%) 17.3 229
frisc 595 29.6 98% 11.1(-63%) 23.7 479
s38584.1 704 121.1 92% 51.0(-58%) 11.9 421
s38417 847 125.8 82% 67.2(-47%) 15.7 709
clma 1358 155.4 75% 76.9(-51%) 23.2 2735
ave 496 77.7 85% 36.4(-53% ) 15.5 488
Simultaneous Retiming and Slack Budgeting vs. Sequential Approach (Delay-Optimal Retiming + Slack Budgeting)
[Those circuits with VddL < 85% are selected]
Simultaneous approach gains 5% on average and up to 20% further power reduction compared to sequential one.
Experimental Results – Simultaneous vs. SequentialExperimental Results – Simultaneous vs. Sequential
SEQUENTI AL APPROACH SI MULTANEOUS
circuit VddL power cl ock VddL power
dsi p 72 % 66. 5 6. 2( 0 %) 71 % 66. 5( 0 %)
s298 83 % 10. 9 24. 3( 0 %) 83 % 10. 8(- 1 %)
bi gkey 73 % 57. 8 6. 8(- 2 %) 79 % 46. 1(- 20 %)
s38417 82 % 66. 7 15. 7( 0 %) 82 % 66. 1(- 1 %)
cl ma 75 % 76. 9 23. 2( 0 %) 75 % 74. 4(- 3 %)
ave 77 % 55. 76 15. 24(- 1 %) 78 % 52. 8(- 5 %)
Runtime Efficient CAD FlowRuntime Efficient CAD Flow
VddL > 85% ?Min-clock retiming gains little?
Synthesis / Placement / routing
Min-clock retiming
Simultaneous retiming and interconnect Vdd assignment
Interconnect Vdd assignment
CLB level Vdd assignment
Global refinement
Yes
No
SOLUTION Do Simultaneous Procedure
only when necessary
Simultaneous approach has 10x more runtime overhead compared to place&route.
DO NOT need to perform simultaneous approach for every single design.
Indicators for simultaneous gain High percentage of VddL assignment will not lead to gain from simultaneous approach
Little gain from min-clock retiming indicates little room for improvement by simultaneous approach
Synthesis / Placement / routing
Simultaneous retiming and interconnect Vdd assignment
CLB level Vdd assignment
Min-clock retiming
Interconnect Vdd assignment
Global refinement
OutlineOutline
Motivation
Problem Formulations
Chip-level Vdd-level Assignment Algorithm [for mixed length wire segments]
Simultaneous Vdd Level Assignment and Retiming
Experimental Results
Conclusions
ConclusionsConclusions
A chip-level dual-Vdd assignment algorithm for mixed length wire segment. Experimental results show that reduces interconnect power by 53% on average compared to single-Vdd FPGA designs.
A MILP based simultaneous timing budgeting and retiming formulation which further reduces interconnect power up to 20% compared to min-clock retiming followed Vdd assignment.
A runtime efficient post-layout re-synthesis CAD flow. Do simultaneous procedure only when necessary.
Thank you!
Q/A
Extra Slides for Q/A
EdTLC-LP : Net-level Bottom-up AssignmentEdTLC-LP : Net-level Bottom-up Assignment
Theorem: the bottom-up assignment is optimal
Perform bottom-up assignment within each tree to leverage the allocated slacks
d
Bottom-up assignment Assign low-Vdd to switches in the routing tree in a bottom-up fashion Slack is reduced by in each step Stop the process until no slack left
Major ContributionsMajor Contributions
Present a tight estimation of power reduction upper bound for mixed-length interconnect in FPGAs.
Develop a linear programming (LP) based slack budgeting and Vdd level assignment algorithm for mixed length interconnect FPGAs. The experimental results show 53% interconnect power
reduction on average compared to single Vdd interconnects.
Propose a mixed integer and linear programming (MILP) based simultaneous retiming and slack budgeting for power reduction while considering placement and flip-flop (FF) binding constraints. The experimental results show up to 20% interconnect power
reduction compared to the sequential approach (retiming followed by slack budgeting).
Island style routing architecture.
Mixed wire length (60% length 4 wire and 40% length 8 wire).
Routing switch/connection block (Two PMOS power transistors M3 and M4 are inserted between the tri-state buffer and VddH, VddL power rails, respectively.)
Level converter free (Guarantee that no VddL switch drives VddH switches.)
Vdd Programmable Interconnect Arch.Vdd Programmable Interconnect Arch.
source
Timing Slack vs. VddL Switch NumberTiming Slack vs. VddL Switch Number
Timing Slack Sij of a connection between source and jth sink in routing tree Ri, = the amount of delay which could be added to this connection without increasing the cycle time Tspec.
Timing Slack Sij indicates the number of VddL switches.
Useful Slack: Timing Slack Sij is bounded due to the number of switches in connection between source and jth sink in routing tree Ri,. Extra slack will NOT lead to more VddL switches!
Timing Slack Bounding Constraint: 0 ≤ Sij ≤ Dij
[Dij is the delay increase of the path from source to jth sink by setting VddL to all the switches in this path]
s1=1s2=1
b1
b2
b3
b4
s2=4s1=2
b1
b2
b3
b4Useful Slack
= 3
1 unit slack is needed for
VddL
Retiming for LUT based FPGARetiming for LUT based FPGA Retiming graph is a directed cyclic graph.
Given the retiming graph G=(V, E), a retiming is an integer-valued vertex-labeling r: V → Z.
A weight is w(u,v) associated with edge e(u,v) denotes the number of FFs in that edge.
After retiming (re-labeling of vertices): w'(u,v) = w(u,v) + r(v) – r(u)
LUT1
LUT2
FF
PI1
PI2
PO
A B
C
D E
FA B
C
D
I1I2
CLB1
I1 I2
SLI
CE
1
G HO1
CLB2
I3
O2
G
F
HO1
I3
O2PO
PI1PI2
(a) FPGA Circuit (b) Retiming GraphS
LIC
E2
LUT1
LUT2
PI1
PI2
PO
A B
C
D E
FA B
C
D
I1I2
CLB1
I1 I2
SLI
CE
1
G HO1
CLB2
I3
O2
G
F
HO1
I3
O2PO
PI1PI2
(a) FPGA Circuit (b) Retiming Graph
SLI
CE
2
FF r(G) = 1
r(D) = 1
r(F) = 1
Retiming
MILP Based Retiming FormulationExtended from MILP formulation [Leiserson, Algorithmica’91]MILP Based Retiming FormulationExtended from MILP formulation [Leiserson, Algorithmica’91]
Let G = (V, E, d, w) be a synchronous circuits, and let c be a positive real number. Then there exists a retiming r of G such that Φ(Gr) ≤ c if and only if there exists an assignment of a real value a(v) and an integer value r(v) to each vertex v such that the following conditions are satisfied:
Let R(v) = r(v) + a(v) /c, then this formula can be rewritten as
Runtime Efficient CAD FlowRuntime Efficient CAD Flow Runtime overhead of post-layout re-synthesis processes
High percentage of VddL assignment after won‘t lead to gain from RTSB
Little gain from min-clock retiming indicates little room for improvement by RTSB
CRT+SB RTSB
circuit total b-Search total
tseng 43 4 44
dsip 74 9 66
diffeq 192 124 85
s298 654 470 230
bigkey 302 77 254
elliptic 575 145 473
frisc 1286 171 2402
s38584 1072 193 1092
s38417 1687 1029 863
clma 9736 2427 8580
ave 1562 465 1409
Placement / routing
Min-clock retiming
RTSB
Vdd assignment
Chip-level vdd assignment
Global refinement
VddL > 85% ?Min-clock retiming gains little?
No
Yes