Upload
merilyn-gibson
View
221
Download
0
Tags:
Embed Size (px)
Citation preview
Design, Synthesis and Evaluation of Heterogeneous FPGA
with Mixed LUTs and Macro-Gates
Yu Hu1, Satyaki Das2, Steve Trimberger2, and Lei He1
1. Electrical Engineering Dept., UCLA
2. Research Labs, Xilinx Inc.
Presented by Yu HuPresented by Yu Hu
Address comments to [email protected] comments to [email protected]
Outline
Introduction
Design of the Macro-gates
Synthesis for the Proposed FPGA Architecture
Comparison of Heterogeneous FPGA Architectures
Conclusions and Future Work
Heterogeneity in FPGA Architectures
Heterogeneity among SLICEs Programmable logic and routing Tiles are not identical
soft logic fabric [Kaviani, FPGA’96]] hard structures [Jamieson, FPL’05]
Dedicated hard structures e.g. DSP e.g memory block
Heterogeneity within a SLICE Programmable logic and routing Tiles (SLICEs) are identical Different logics exist within a SLICE
e.g. LUTs with different size [Cong, FPGA’99] e.g. mixed PLAs and LUTs [Cong, TODAES’05] e.g. mixed macro-gates and LUTs
(source: Jamieson@FPL’05)
Heterogeneous FPGA with Macro-Gates
There exists programmability and cost trade-off between LUTs and macrogates Xilinx V4 benefits from small gates (MUX2, XOR2) built in
SLICEs.
The benefit of wider macro-gates Effectiveness of the incorporation of wider logic functions (macro
gates) is not clear.
Our contributions Design a new FPGA architecture with mixed LUTs and macro-
gates Propose a new automatic synthesis flow for mapping a circuit to
the proposed FPGA architecture Evaluate the architecture and show that the proposed
architecture reduces delay and area by 16.5% and 30%, respective, compared to the LUT-only architecture.
Outline
Introduction
Design of the Macro-gates
Synthesis for the Proposed FPGA Architecture
Comparison of Heterogeneous FPGA Architectures
Conclusions and Future Work
Overview of Macro-Gate Design
Key problem Select the logic functions for the macro-gate
Problem formulation: Input: a set of training circuits, which have been
mapped to K-input LUTs Output: N K-input Boolean functions: f1 , … , fN Objective: Maximize the number of logics (in the
training circuit set) which can be implemented by f1 , … , fN
The proposed solution Ranking of the logic functions for a set of training
circuits
NPN-Class Diagram: Organization of Logics Canonical and efficient representation of all NPN classes
NPN-Equivalent: functional equivalency under inputs negation, permutation or output negation
E.g., f(a,b,c)=a+bc, g(a,b,c)=b’a+b’c
NPN-Cofactor relationship is indicated
DAG: easy to manipulate
It becomes impractical to compute for more than 6-input functions! Solution: Utilization NPN-Class Diagram
Level3: 3-inputLevel2: 2-input
Level1: 1-input
Level0: constant
Wider inputs
UND: Utilization NPN-Class Diagram UND is an DAG, sub-graph of NCD
Help for scoring and ranking functions ab’c’+a’bc’
ab’c’+a’bc’ / 1 / xx% abc
ab / 0 / xx%
a / 0 / xx%
ab’+a’b / 0 / xx%
-0- / 0 / xx%
abc/ 1 / xx%
ab’+a’b
a
Implementation capability
Appearance frequency
functionality
UND: Utilization NPN-Class Diagram
ab’c’+a’bc’
ab’c’+a’bc’ / 1 / xx% abc
ab / 0 / xx%
a / 0 / xx%
ab’+a’b / 0 / xx%
-0- / 0 / xx%
abc/ 1 / xx%
ab’+a’b
aab’+a’b / 1 / xx%
a / 1 / xx%
a / 1 / 25%
ab’+a’b / 1 / 50%
UND: Utilization NPN-Class Diagram
Calculate Implementation Capability
ab’c’+a’bc’
ab’c’+a’bc’ / 1 / 75% abc
ab / 0 / 25%
-0- / 0 / xx%
abc/ 1 / 50%
ab’+a’b
a
Fanout cone ofab’c+a’bc’
The topology property (DAG) of UND enables us to efficiently explore
different metrics for functionality ranking, e.g., utilization rate.
Recap: Overall Flow for Macro-Gate Design
Map with LUT-N
Extract logic functions
Generate UtilizationNPN Diagram
Calculate score For logic functions
Rank logic functions
ab’c’+a’bc’ / 1 / xx%
ab / 0 / xx%
a / 0 / xx%
ab’+a’b / 0 / xx%
-0- / 0 / xx%
abc/ 1 / xx%
ab’+a’b / 1 / xx%
a / 1 / xx%
F
f
g
d
e
h
b
a
c
LUTLUT
LUTLUT
LUTLUT
and2(3)
inv(1)
nand2(2)
000000100000000000000100000000000000100000000000000100000000000000100000000000000100000000000000
……
a / 1 / 25%
ab’+a’b / 1 / 50%
ab’c’+a’bc’ / 1 / 75%
ab / 0 / 25%
-0- / 0 / xx%
abc/ 1 / 50%
1+1*1/2=1.5
1
1*1/2=0.5
1+1*1/3=1.33 1+1*2/3+1*1/3=2
Best function: ab’c’+a’bc’
Proposed Macro-Gates and FPGA Architecture
For IWLS’05 benchmarks, the following four 6-input functions have the highest ranks GI1=a b c d e f (AND-6) GI2=a’ b’ c’ + b c f’ + b c’ d’ + b’ c e (MUX-4) GI3=a b' c d' e + b c e f + d e f GI4=a b' + a' c d' + b' c' + e' + f‘
It can implement over 50% of logic functions in IWLS’05 benchmarks.
The architecture of the proposed macro-gate and FPGA SLICE are
Outline
Design of the Embedded Macro-gates
Synthesis for the Proposed FPGA Architecture
Technology Mapping for Heterogeneous FPGAs
SAT-based Packing
Place and Routing
Comparison of Heterogeneous FPGA Architectures
Conclusions and Future Work
Functional & Structural Cut Enumeration
ab
d
zyx
c
w
a=(x+y)’b=y+wz
d=ab=(x+y)’(y+wz)=x’y’wz
Is x’v’wz in library?
4-input macro gate lib
000000100000000000000100000000000000100000000000000100000000000000100000000000000100000000000000
……Yes
Phase1:Enumerate and label cuts from PIs to Pos Check the feasibility of a cut w.r.t. the macro-gate
Phase2:Select best choice from POs to Pis
A general yet efficient solution is SAT based Boolean matching Exploiting Symmetry in SAT-Based Boolean Matching for Heterogeneous FPGA Technology
Mapping , Session 5C.1, ICCAD 07
Key in Technology Mapping: Balance Resource Utilization
Asymmetric architecture causes problem to resource utilization
Exclusively use of one logic resource leads to lots of unused fabric
Simple yet effective solution : Change LUT-MG ratio by adjusting their area weights. Precise calibration is hard to reach by this approach.
0
1000
2000
3000
4000
5000
6000
1:1 1:0.95 1:0.9 1:0.8 1:0.5 1:0.1
MG# LUT6#
Total# too large!
Hard to obtain precise calibrationObjective
architecture:LUT6:MacroGate6
=1:1
Best LUT-MG ratio= 1:1
LUT-MG ratio = LUT#/MG#
Post-Mapping Area Recovery (motivation example)
Given: Target architecture = LUT6 + MG6 LUT-MG ratio in target architecture = 1:1 LUT# < MG# in the mapped design Intrinsic delay (LUT6 : MG6) = 5:4
Objective: balance LUT MG number without increasing delay
LUT6
MG6
MG6
MG6
MG6
5 / 5
4 / 5
9 / 9
9 / 13
17 / 17
13 / 13
PI POMG6
MG68 / 9
Post-Mapping Area Recovery (motivation example)
Given: Target architecture = LUT6 + MG6 LUT-MG ratio in target architecture = 1:1 LUT# < MG# in the mapped design Intrinsic delay (LUT6 : MG6) = 5:4
Objective: balance LUT MG number without increasing delay
LUT6
MG6
MG6
MG6
5 / 5
4 / 5
9 / 9
10 / 13
17 / 17
13 / 13
PI POMG6
MG68 / 9
LUT6
Post-Mapping Area Recovery (motivation example)
Given: Target architecture = LUT6 + MG6 LUT-MG ratio in target architecture = 1:1 LUT# < MG# in the mapped design Intrinsic delay (LUT6 : MG6) = 5:4
Objective: balance LUT MG number without increasing delay
LUT6
MG6 MG6
5 / 5
5 / 5
9 / 9
10 / 13
18 / 17
14 / 13
PI POMG6
10 / 9
LUT6
LUT6 LUT6 Timing target violation!
Timing slack budgeting is necessary!
Post Mapping Area Recovery by Timing Budgeting
Formulated as an Integer Linear Programming (ILP) Problem
Objective (minimize gap between target and actual LUT-MG ratios): min |m2+…+m7-7/2|
Arrival time constraints: ai+dj+bj<=aj
Clock period target: ai<=17
LUT assignment with given timing slack: (5-4)*mj<=bj, mj={0,1}
LUT6
MG6 MG6
a1
a6
a5
a2
a3
a4
PI POMG6
a7
MG6
MG6 MG6
Easy to be generalized to handle arch with multiple macro gates with different input pin numbers
Outline
Design of the Embedded Macro-gates
Synthesis for the Proposed FPGA Architecture
Technology Mapping for Heterogeneous FPGAs
SAT-based Packing
Comparison of Heterogeneous FPGA Architectures
Conclusions and Future Work
SAT-Based Packing
Motivation Traditional packing tools, e.g., T-VPack, hard-codes the architecture
specification of a SLICEs…. Re-impalement from scratch when architecture changes
Propose a unified implementation of the packers for different architectures: easy to perform architecture exploration!
The architecture dependent sub-problem in packing Structural feasibility checking for a sub-circuit to the SLICE
Solution Solve the problem of validating SLICE packing as a local
place&route problem A SAT solver is used to carry out the validation checking
Example of SAT-Based SLICE Packing Examples of constraints: (for each classes of constraint…)
Placement and routing choice variables: X@A, X@B, U5@N10
Exclusively constraint: (¬X@A) (¬X@B)∨
Presence constraint: (X@A) (¬X@B)∨
Input/Output constraint: X@A → U5@N10
Routing constraint: G0 →out U∧ 5@N10) → U5@N12
Recap: Overall Synthesis Flow
Area weightSetting
Cut-based Mapping
Area-BalanceTrade-off?
Y
NPost-mappingArea recovery
LUT6
MG6
MG6
MG6
MG6
MG6
MG6
LUT6
MG6
MG6
MG6
MG6LUT6
LUT6
MG6
MG6
MG6
LUT6
LUT6
packing
F
f
g
d
e
h
b
a
c
LUTLUT
LUTLUT
LUTLUT
LUT
LUT
LUT
Outline
Motivation and Objectives
Methodology for Logic Function Exploration
Technology Mapping for Heterogeneous FPGAs
Evaluation of Heterogeneous FPGA Architectures
Conclusions and Future Work
Experimental Setting Design library parameters [Cong, TODAES’05]
Benchmark set: IWLS 2005
Four architectures are compared: LUT4, LUT4 + macro gate, LUT6, and LUT6 + macro gate Synthesize the proposed macro-gate by SIS1.2 Delay and area model
Interconnect delay is igonired
Delay Comparisons
Compared to LUT4, LUT4+MG reduces both logic depth and delay by 9.2%.
Compared to LUT6, LUT6+MG reduces delay by 30% while increasing logic depth by 36.5%. A LUT6 can implement more logics than a macro-gate
Logic depth
7.867.14
5.48
7.48
0
2
4
6
8
10
delay
7.86 7.14
10.959.14
02468
1012
Logic Area Comparisons
Compared to LUT4, LUT4+MG reduces logic area by 12.5%.
Compared to LUT6, LUT6+MG reduces logic area by 16.9%.
PLB#
6406
29853711
2142
01000200030004000500060007000
Area
7346 6408
16816
11849
02000400060008000
1000012000140001600018000
Outline
Motivation and Objectives
Methodology for Logic Function Exploration
Technology Mapping for Heterogeneous FPGAs
Comparison of Heterogeneous FPGA Architectures
Conclusions and Future Work
Conclusions
Conclusions A novel FPGA architecture with the mixed LUTs and macro-
gates is proposed A synthesis flow for the proposed architecture is implemented The preliminary experimental results show the effectiveness of
the proposed architecture for the area and delay reduction
Future Work Perform the physical design for the synthesized circuits and
compare the routing costs, architecture evaluation considering interconnect delay
Study the effectiveness of the power reduction for the proposed architecture
Macro-gates with wider inputs will be examined