FPGA Routing
Dr. Philip BriskDepartment of Computer Science and Engineering
University of California, Riverside
CS 223
2
Routing Resource Graph (RRG)
www.eecg.toronto.edu/~aling/ece1718/project/fang/route_rr_graph.png
2-LUT
in1 in2
out
wire1
wire2
wire3 wire4source
sink
out
wire4
wire2
in2in1
wire1
wire3
3
FPGA Routing• Disjoint-path problem (on the RRG); NP-complete
• Input:– Graph G(V, E)– Set of sources S = {s1, s2, …, sm}
– Set of sets of sinks T = {T1, T2, …, Tn}, Ti = {ti1, ti
2, …, tik}
• Solution
– Finds paths from each source si to all sinks in Ti
– Paths emanating from different sinks must be disjoint (cannot shared any vertices or edges)
• Objective(s)– Minimize delay, wirelength, etc.
4
Disjoint and Non-disjoint Paths
s1
s2
t21
t11
5
Disjoint and Non-disjoint Paths
This route is illegal!
s1
s2
t21
t11
Two nets are routed on the same FPGA segment(remember, RRG vertices represent wires and CLB I/Os)
6
Disjoint and Non-disjoint Paths
This route is legal!
s1
s2
t21
t11
7
LUT Input Equivalence (1/3)
s1
s2
2-LUTin1
in2
s1
s2
2-LUTin1
in2
LUT inputs are interchangeable
8
LUT Input Equivalence (2/3)
s1
s2
2-LUTin1
in2
s1
s2
2-LUTin1
in2
s1
s2 t21 = in2
t11 = in1 t2
1 = in1
t11 = in2
s1
s2
Overly restrictive disjoint-path problem formulation
9
LUT Input Equivalence (3/3)
s1
s2
2-LUTin1
in2
s1
s2
2-LUTin1
in2
Represent all inputs of a LUT as one RRG sink, t
ts1
s2
PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs
Larry McMurchie and Carl EbelingInternational Symposium on FPGAs, 1995
Architecture and CAD for Deep Submicron FPGAs
(Section 2.2.3, pp. 25-30)
Vaughn Betz, Jonathan Rose, Sandy MarquardtSpringer, 1999
Basic Plan
• Route nets one-by-one– Each net is routed by a priority-driven breadth-first
search– Multiple nets may use the same nodes and edges– Node/edge costs depend on current usage and past
usage history• Repeat for a fixed number of iterations (say 200)– Success: A disjoint routing solution for for all nets– Failure: No disjoint solution found after a fixed number
of iterations
Routability Cost Function (per Node)
• C(n) = p(n) × (b(n) + h(n))
– b(n): base cost of routing through node n• Typically the intrinsic delay of the circuit element corresponding to
the node
– h(n): history cost of routing through node n, based on previous router iterations
– p(n): depends on the number of signals presently routed through node n in the current iteration
Timing-Routability Cost Function
First-Order Congestion
s1
s3
s2
At1
t3
t2
B
C
1
1
1
2
3
4
3
1
1
1
2
3
4
3
First-Order Congestion
s1
s3
s2
At1
t3
t2
B
C
1
1
1
2
3
4
3
1
1
1
2
3
4
3
All paths route through B
• Increase the penalty cost of B per iteration due to sharing
• In a future iteration, it will be cheaper to route signal 1 through A, rather than B
• In a future iteration, it will be cheaper to route signal 2 through C, rather than B
• Requires gradual increase in cost of sharing nodes
Second-order Congestion
s1
s3
s2
At1
t3
t2
B
C
1
2
2
1
1
2
1
2
1
1
Routing Order: 1, 2, 3
Second-order Congestion
Routing Order: 1, 2, 3A
B
C
1
2
2
1
1
2
1
2
1
1
s1
s3
s2
t1
t3
t2
Second-order Congestion
Routing Order: 1, 2, 3A
B
C
1
2
2
1
1
2
1
2
1
1
s1
s3
s2
t1
t3
t2
Second-order Congestion
Routing Order: 1, 2, 3A
B
C
1
2
2
1
1
2
1
2
1
1
Two paths route through C
• Increasing the history cost through C will push signal 2 onto an alternative path through B
s1
s3
s2
t1
t3
t2
Second-order Congestion
Routing Order: 1, 2, 3A
B
C
1
2
2
3
3
2
1
2
3
3
Two paths route through C
• Increasing the history cost through C will push signal 2 onto an alternative path through B
s1
s3
s2
t1
t3
t2
Second-order Congestion
Routing Order: 1, 2, 3A
B
C
1
2
2
3
3
2
1
2
3
3
Two paths route through B
• Increasing the history cost through B will push signal 1 onto an alternative path through A
• Increasing the history cost through B will push signal 2 back onto the path through C
s1
s3
s2
t1
t3
t2
Second-order Congestion
Routing Order: 1, 2, 3A
B
C
3
4
2
3
3
4
3
2
3
3
Two paths route through B
• Increasing the history cost through B will push signal 1 onto an alternative path through A
• Increasing the history cost through B will push signal 2 back onto the path through C
s1
s3
s2
t1
t3
t2
Second-order Congestion
Routing Order: 1, 2, 3A
B
C
3
4
2
3
3
4
3
2
3
3
Two paths route through C
• Increasing the history cost through C will push signal 2 back through B
s1
s3
s2
t1
t3
t2
Second-order Congestion
Routing Order: 1, 2, 3A
B
C
3
4
2
5
5
4
3
2
5
5
Two paths route through C
• Increasing the history cost through C will push signal 2 back through B
s1
s3
s2
t1
t3
t2
Pseudocode
Priority driven BFS• Computed from the source
to every sink of net(i)
Connect the newly found path to the current sink to the partly-computed routing tree RT(i) for net I
Timing-Routability Cost Function
Represents the estimated criticality of the path from source I to sink j on net(i)• net(i) may fan out to multiple sinks with varying criticality, depending on placement
Delay of critical path
Negotiated A* Routing for FPGAs
Russell TessierFifth Canadian Workshop on Field Programmable
Devices, 1998
Basic Idea
A* Cost Function
• Cost at a node ni
fi = gi + di
gi = fi-1 + ci
Cost of routing from the source to ni
Estimated cost of routing from ni to the sink
Total cost of previous path
Cost of the next candidate node
A* vs. BFS
A*: fi = gi + di = (fi-1 + ci) + di
BFS: fi = gi
Hybrid: fi = (1 - α)gi + αdi
= (1 - α)(fi-1 + ci) + αdi
Good Ideas, But…
• Algorithm and implementation assumed disjoint switch block– i.e., disjoint global routing domains
• Key issue:– Which routing domains (reachable from the
source) are explored in what order?
A Fast Routability-driven Router for FPGAs
Jordan S. Swartz, Vaughn Betz, Jonathan RoseInternational Symposium on FPGAs, 1998
Architecture and CAD for Deep Submicron FPGAs
(Section 4.3, pp. 76-80)
Vaughn Betz, Jonathan Rose, Sandy MarquardtSpringer, 1999
Speed Enhancements
• Directed depth-first search– Similar to A*, but arguably more aggressive– Key Issues:
• Cost function• Sink (target) ordering for multi-terminal nets
• Route classification– Low-stress– Difficult– Impossible
Cost Function
Cost(n) = C(n) + Costprev + α(ΔD)– Costprev: cost of track segments used to reach the
current segment– C(n): (base) cost of using the track segment
• Increases more rapidly than the original PathFinder algorithm– (ΔD): estimated cost of routing from the current track
segment to the destination • Based on Manhattan (XY) distance)
– α: Direction factor• BFS: α = 0• Directed search: α > 0
Updated Node (Base) Cost Function
• C(n) = p(n) × (b(n) + h(n))– b(n): base cost – h(n): history cost– p(n): usage cost
• C(n) = p(n) × b(n) × h(n) + Bendcost(n, m)– Multiplying b(n) and h(n) eliminates normalization– Penalize global routes that take a lot of turns• These routes are less likely to use long wires
Penalty Cost Function
• Not specified in the original PathFinder paper
p(n) = 1 + max(0, [occupancy(n) – capacity(n)] × pfac)• occupancy(n): number of nets using resource n• capacity(n): maximum number of nets that can use
resource n (typically 1)• pfac: 0.5 for the first iteration; increase by 1.5x to 2x each
subsequent iteration
History Cost Function
• Not specified in the original PathFinder paper
• occupancy(n): number of nets using resource n• capacity(n): maximum number of nets that can use
resource n (typically 1)• hfac: constant; any value between 0.2 and 1 works well
Target Selection
Closest Sink First• Uses fewer track segments
Furthest Sink First• Uses more track segments
Net Order
• Route nets in order of decreasing fan-out
• High fanout nets tend to span the whole FPGA– Easier to route when there is less congestion do to
other nets routed earlier
• Low fanout nets tend to be more localized– Relatively easy to route, even in the presence of
some congestion
Binning
• Only the portions of the net close to the target sink should be expanded– E.g., only expand within
Bin 4 in this example
• Shown to be effective for sinks with more than 50 targets
Bounding Box
• Define a bounding box around source and sinks• Restrict the route for each net to no more than 3
channels outside of the bounding box
Routing High-fanout Nets
Difficulty Prediction
• Since you can’t know Wmin first without routing, use an estimate based on a wirelength prediction model
Routability Results
Routing Times as W increases
Routing Time vs. W (clma Benchmark)
BFS vs. Directed Search (w/Binning)
Difficulty Prediction
• Westimate took less than 1 second to compute
• Some prediction mistakes will happen
• Prediction accuracy was 84%
Fuzzy Prediction
Architecture and CAD for Deep Submicron FPGAs(Section 2.2.4, pp. 31-34,
Section 4.4, pp. 80-89, Section 4.6.4, pp. 99-102 )
Vaughn Betz, Jonathan Rose, Sandy MarquardtSpringer, 1999
VPR Timing-Driven Router
Example Timing Graph
Example Timing Graph
12ns
Timing and Slack
Process vertices in forward topological order
Process vertices in reverse topological order(Required time for any output node is Dmax)
How much can I slow down this circuit element (node) without increasing the critical path?
Elmore Delay of a Source-Sink Path
• Td,i: intrinsic delay of element i (buffer); 0 otherwise
• Ri: equivalent resistance of element i (Rwire, Rbuf, Rpass)
• C(subtreei): total downstream capacitance not isolated by buffers
Equivalent Circuits
Example: M Wire Segments connected by Buffers
Linear and Elmore Delay Models agree!
Example: Chain of Pass Transistors
Linear
Quadratic
Fanout
Introduce Elmore Delay into VPR-PathFinder’s Node Cost Function
Parameters that control how slack affects congestion-delay tradeoff
… And to Support Directed Search• Expected cost to route from a wire segment n to the target
sink j
– The connection will be completed using wires of the same type as node n • Length, connecting switch type, etc.
– There is no congestion along the shortest path to the target • p(m) and h(m) are both 1 for all nodes m on the path that completes the
connection.
Why Optimize Elmore Delay?
Loops
Timing-Driven Router
Timing-Driven Routing of High-Fanout Nets
Xun Chen, Jianwen Zhu, Minxuan ZhangInternational Conference on Field Programmable
Logic and Applications, 2011
Motivation (for Binning, Earlier)
Angle-based Pruning
Route the trunk first, then each leaf
NetDecomposition
Results
Routing Algorithms for FPGAs with Sparse Intra-Cluster Routing Crossbars
Yehdhih Moctar, Guy Lemieux, Philip BriskInternational Conference on Field Programmable
Logic and Applications, 2012
71
Full Crossbar Intra-cluster Routing2-LUT BLE
in1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
72
RRG
Full Crossbar Intra-cluster Routing
RRG
73
Full Crossbar RRG
Don’t model the RRG inside each CLB!
74
Sparse Crossbar Intra-cluster Routing (1/6)
2-LUT BLEin1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
• RRG must include the intra-cluster routing
75
Sparse Crossbar Intra-cluster Routing
RRG
76
Sparse Crossbar RRG
RRG encompasses the entire FPGA
77
Why is this Hard?
• Key Issues relating to RRG size– May run out of memory • If you are not routing in the cloud
– Long router runtimes• Large search space leads to slow convergence• Thrashing
• Objective– Route without expanding the whole RRG
78
Routing Algorithms• Extensions to PathFinder routing algorithm
– Routes one net at a time via wavefront expansion
• Baseline– Extend the RRG with intra-cluster routing at each CLB
• Selective RRG Expansion (SERRGE)– Route nets to CLB inputs– Dynamically expand the RRG to finish the route
• Partial Pre-Routing (PPR)– Pre-route all nets within CLBs– Complete global routes to CLB inputs
79
Baseline Router2-LUT BLE
in1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
• Inputs of the same LUT are equivalent!
Global RRG
80
SERRGE in Action (1/8)
Start with a global RRG
s
t
81
SERRGE in Action (2/8)
s
t
Route from s’s CLB to t’s CLB
82
SERRGE in Action (3/8)
s
t
Locally expand the RRG with part of t’s CLB
83
2-LUT BLEin1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
Maintain one copy of the intra-cluster topology
SERRGE in Action (4/8)
t
84
2-LUT BLEin1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
Expand the fanout of the CLB input being routed
SERRGE in Action (5/8)
t
85
2-LUT BLEin1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
Complete the route if possible; if not, try again
SERRGE in Action (6/8)
t
86
2-LUT BLEin1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
Track used CLB and BLE inputs; remember the route
SERRGE in Action (7/8)
t
87
2-LUT BLEin1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
Deallocate unused RRG resources
SERRGE in Action (8/8)
t
88
SERRGE Summary• Global routing uses standard PathFinder
• Selectively expand portions of the RRG to complete routes from CLB inputs to BLE inputs
• Storage Requirement– Global RRG (standard PathFinder requirement)– One copy of the intra-cluster routing topology– All intra-cluster routes computed thus far
• Fairly challenging to implement!
89
2-LUT BLEin1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
Process CLBs one at a time
PPR in Action (1/4)
t1
t2
90
2-LUT BLEin1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
Compute local routes for each BLE
PPR in Action (2/4)
t1
t2
91
2-LUT BLEin1
in2
2-LUT BLEin1
in2
CLB inputs
Local feedbacks
CLB inputs now form equivalence classes.
PPR in Action (3/4)
t1
t2
92
PPR in Action (4/4)
Perform global routing w/equivalence class constraints on CLB inputs
s
t
93
PPR Summary
• Pre-route each CLB
• Propagate LUT equivalences to CLB inputs– A baseline router on the full-blown RRG would still need to
satisfy BLE input-pin equivalence constraints
• Route as normal
• Much easier implementation than SERRGE
Critical Path Delayac
_ctr
l
aes_
core
des_
area
mem
_ctr
l
pci_
brid
ge32 sp
i
syst
emca
es
syst
emcd
es
usb_
func
t
wb_
conm
axgeommean
012345678
p = 40%
ns
ac_c
trl
aes_
core
des_
area
mem
_ctr
l
pci_
brid
ge32 sp
i
syst
emca
es
syst
emcd
es
usb_
func
t
wb_
conm
ax
geomean
012345678
p = 75%
ns
PPR SERRGE Baseline
RRG Size (MB)ac
_ctr
l
aes_
core
des_
area
mem
_ctr
l
pci_
brid
ge32 sp
i
syst
emca
es
syst
emcd
es
usb_
func
t
wb_
conm
axgeommean
0
10
20
30
40
50
60
p = 40%
MB
ac_c
trl
aes_
core
des_
area
mem
_ctr
l
pci_
brid
ge32 sp
i
syst
emca
es
syst
emcd
es
usb_
func
t
wb_
conm
ax
geomean
0
10
20
30
40
50
60
p = 75%
MB
PPR SERRGE Baseline
Router Runtime (s)ac
_ctr
l
aes_
core
des_
area
mem
_ctr
l
pci_
brid
ge32 sp
i
syst
emca
es
syst
emcd
es
usb_
func
t
wb_
conm
axgeomean
0
200
400
600
800
1000
1200
p = 40%
Seco
nds
ac_c
trl
aes_
core
des_
area
mem
_ctr
l
pci_
brid
ge32 sp
i
syst
emca
es
syst
emcd
es
usb_
func
t
wb_
conm
ax
geomean
0
200
400
600
800
1000
1200
Seco
nds
PPR SERRGE Baseline
p = 75%