CSBP: A Fast Circuit Similarity-Based Placement for FPGA Incremental Design
and Design Space Exploration
Xiaoyu Shi (1), Dahua Zeng (1), Yu Hu (2), Guohui Lin (1), Osmar R. Zaiane (1)
(1) Dept. of Computing Science, University of Alberta
(2) Dept. of Electrical and Computer Engineering, University of Alberta
Presented by Xiaoyu Shi
Please address comments to [email protected]
Outline
Introduction
Circuit Similarity-Based Placement
Experimental Results
Conclusion and Future Work
Introduction
Field Programmable Gate Array (FPGA)
  Ease of design, low start-up costs and fast manufacturing turnaround time.
  FPGA capacity has reached the million-gate level.
  Modern FPGA designs suffer from long compilation times.
FPGA placement
  Determines which logic block within an FPGA should implement each of the logic blocks required by the circuit.
  Has a significant impact on performance and routability in nanometer circuit designs.
  The optimization goal is to minimize criteria such as wire length, critical delay and area.
  Has become the bottleneck of modern FPGA circuit design [Chen’06].
Up-to-date fast placement algorithms
  Placement efficiency as a single synthesis phase has been studied extensively for decades.
  State-of-the-art work includes multi-core [Ludwin’08], embedding-based [Gopalakrishnan’06], partitioning-based [Maidee’05], multi-level [Sankar’99] and simulated annealing [Betz’97] approaches.
Xilinx SPARTAN-6 board
Reusable Info in CAD
Incremental design for FPGAs
  Design preservation is the key to incremental design.
  Similarity among circuits exists because functional changes or optimizations are small, and they generally leave the topology of the modified circuit similar to that of the original circuit [Krishnaswamy’09].
[Figure: Incremental design process for FPGAs. Initial design, then Iteration 1, Iteration 2, Iteration 3, …, Final iteration, then Final design; the iterations cover optimizations and changes due to verification, timing, etc.]
Reusable Info in CAD (Cont.)
Design space exploration for FPGAs
  FPGA design offers a variety of customizations by varying design parameters.
  Local similarity and global similarity exist in design space exploration.
[Figure: Constant multiplier blocks by CMU SPIRAL [Puschel’04]]
Data Mining
Overview
  The key to data mining is extracting patterns and useful information from data, including text, graphs and circuits.
  It has been studied extensively since the 1950s and has been applied to many domains, such as business, science and health care.
  Graph mining, including graph pattern mining, graph classification and graph compression, is an active research area in data mining [Borgwardt’08].
Graph similarity
  It quantitatively defines the topological similarity between two graphs.
  It has been used in many applications, such as web searching [Kleinberg’99], social network mapping [Watts’99] and chemical structure matching [Hattori’03].
Graph Similarity
Summary of graph similarity measures

Measure | Description | Time complexity | Global topology
Isomorphism [Pelillo’02] | Identifying a bijection between the nodes of two graphs which preserves (directed) adjacency | NP-Hard | Yes
Edit distance [Bunke’99] | Given a cost function on edit operations, determine the minimum cost transformation from one graph to another | NP-Hard | Yes
Common subgraph [Fernandez’01] | Identifying the largest isomorphic subgraphs of two graphs | NP-Hard | Yes
Iterative methods [Blondel’04] | Two graph elements are similar if their neighborhoods are similar | Cubic | Yes
Statistical methods [Alberta’02] | Assessing aggregate measures of graph structure, degree distribution, diameter, betweenness measures | Linear | No
Iterative methods
  They have lower computational complexity and still consider global topological information.
  They take advantage of graph sparsity.
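The idea behind the iterative methods ("two graph elements are similar if their neighborhoods are similar") can be sketched in a few lines. The block below is a minimal illustration of a Blondel-style update, assuming plain adjacency lists and a fixed, even number of iterations; the function name `similarity_matrix` and its parameters are hypothetical, and this is not the circuit-specific variant used by CSBP. Looping over edge pairs rather than dense matrices is what lets the update benefit from sparsity.

```python
import math

def similarity_matrix(adj_a, adj_b, iterations=10):
    """Blondel-style node similarity between two directed graphs.

    adj_a, adj_b: adjacency lists {node: [successors]}; every node is a key.
    Returns (nodes_b, nodes_a, S) where S[i][j] scores nodes_b[i] vs nodes_a[j].
    An even iteration count is used because the even iterates are the ones
    that converge in this scheme.
    """
    nodes_a, nodes_b = sorted(adj_a), sorted(adj_b)
    ia = {v: k for k, v in enumerate(nodes_a)}
    ib = {v: k for k, v in enumerate(nodes_b)}
    edges_a = [(ia[u], ia[v]) for u in nodes_a for v in adj_a[u]]
    edges_b = [(ib[u], ib[v]) for u in nodes_b for v in adj_b[u]]

    S = [[1.0] * len(nodes_a) for _ in nodes_b]
    for _ in range(iterations):
        new = [[0.0] * len(nodes_a) for _ in nodes_b]
        for p, q in edges_b:          # edge p -> q in graph B
            for r, s in edges_a:      # edge r -> s in graph A
                new[p][r] += S[q][s]  # similar successors make the sources similar
                new[q][s] += S[p][r]  # similar predecessors make the targets similar
        # Normalize by the Frobenius norm so the scores stay bounded.
        norm = math.sqrt(sum(x * x for row in new for x in row)) or 1.0
        S = [[x / norm for x in row] for row in new]
    return nodes_b, nodes_a, S

# Usage: compare a three-node path with itself.
path = {1: [2], 2: [3], 3: []}
rows, cols, scores = similarity_matrix(path, path)
best = [cols[max(range(len(cols)), key=lambda j: scores[i][j])] for i in range(len(rows))]
```

Each pass costs time proportional to the product of the two edge counts, so sparse graphs such as circuit netlists are much cheaper than the dense worst case.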
Circuit Similarity
Circuit similarity
  We define circuit similarity to describe how closely the topological structures of two circuits resemble each other.
  We adapt the iterative methods of graph similarity.
  It exists in several CAD phases, such as placement, routing and verification.
  It can be widely used to accelerate FPGA design tasks, such as incremental design and design space exploration.
Motivating Example
Circuit similarity algorithm
[Figure: Graph G and Graph G’]
Similarity score matrix for G and G’

       V7    V8    V9    V10   V11   V12   V13   V14   V15   V16
V’7    0.92  0.25  0.48  0.15  0     0     0     0.42  0.06  0
V’8    0     0.73  0     0     0.05  0     0.39  0     0.17  0.06
V’9    0     0.39  0     0     0.4   0     0.73  0     0.06  0.48
V’10   0.48  0     0.89  0.25  0.3   0.12  0.14  0.06  0.33  0.09
V’11   0     0     0.11  0.48  0     0.86  0     0.36  0.17  0
V’12   0     0     0.3   0.34  0.64  0.25  0.39  0.34  0.15  0.42
V’13   0.48  0.25  0.07  0.4   0     0.36  0     0.88  0.06  0
V’14   0.4   0.39  0.29  0.15  0.15  0.18  0.12  0.46  0.59  0.06
V’15   0     0.12  0.09  0     0.63  0     0.36  0     0.27  0.82
Motivating Example (Cont.)
Circuit similarity-based placement
  The initial placement of the new circuit design (G’) is generated by computing the similarity between the original (G) and modified circuits and finding the corresponding node matching.
  A low-temperature simulated annealing is applied to further refine the results.
  The proposed circuit similarity algorithm can be used to speed up placement, which allows faster incremental design and design space exploration.
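To make the matching step concrete, the sketch below turns the similarity score matrix of the motivating example into a node correspondence and then into an initial placement. It uses a simple greedy pass over the scores, which is a stand-in chosen for brevity and not the maximum matching solver used by CSBP; the function names and the `old_sites` coordinates are hypothetical.

```python
# Scores from the motivating example: rows are nodes of G', columns nodes of G.
COLS = ["V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15", "V16"]
SCORES = {
    "V'7":  [0.92, 0.25, 0.48, 0.15, 0, 0, 0, 0.42, 0.06, 0],
    "V'8":  [0, 0.73, 0, 0, 0.05, 0, 0.39, 0, 0.17, 0.06],
    "V'9":  [0, 0.39, 0, 0, 0.4, 0, 0.73, 0, 0.06, 0.48],
    "V'10": [0.48, 0, 0.89, 0.25, 0.3, 0.12, 0.14, 0.06, 0.33, 0.09],
    "V'11": [0, 0, 0.11, 0.48, 0, 0.86, 0, 0.36, 0.17, 0],
    "V'12": [0, 0, 0.3, 0.34, 0.64, 0.25, 0.39, 0.34, 0.15, 0.42],
    "V'13": [0.48, 0.25, 0.07, 0.4, 0, 0.36, 0, 0.88, 0.06, 0],
    "V'14": [0.4, 0.39, 0.29, 0.15, 0.15, 0.18, 0.12, 0.46, 0.59, 0.06],
    "V'15": [0, 0.12, 0.09, 0, 0.63, 0, 0.36, 0, 0.27, 0.82],
}

def greedy_matching(scores, cols):
    """Pair each new node with an old node, highest scores first (greedy stand-in)."""
    triples = [(s, new, old)
               for new, row in scores.items()
               for old, s in zip(cols, row) if s > 0]
    triples.sort(key=lambda t: -t[0])
    match, used = {}, set()
    for _, new, old in triples:
        if new not in match and old not in used:
            match[new] = old
            used.add(old)
    return match

def initial_placement(match, old_sites):
    """Seed each matched new node at the site of its old counterpart."""
    return {new: old_sites[old] for new, old in match.items() if old in old_sites}

match = greedy_matching(SCORES, COLS)
# Hypothetical coordinates for the old placement, one site per node of G.
old_sites = {name: (k % 4, k // 4) for k, name in enumerate(COLS)}
seed = initial_placement(match, old_sites)
```

On this matrix every node of G’ has a distinct best-scoring partner, so the greedy pass pairs V’7 with V7, V’10 with V9 and V’13 with V14, and leaves V10 unmatched. On less clear-cut matrices a greedy pass can be suboptimal, which is one reason to prefer a proper matching solver.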
Motivating Example (Cont.)
A real example
  For circuit “des”, the reference configuration (synthesized using the “resyn3” script in ABC) has 1245 CLBs and 1501 nets, while the new configuration (synthesized using the “rwsat2” script in ABC) has 1215 CLBs and 1471 nets.
  The results show that CSBP successfully finds the internal node correspondence.
[Figure: Placement layouts comparison of circuit “des”. (a) Placement of reference config; (b) Initial placement using CS; (c) Final placement using CS; (d) Initial placement using VPR; (e) Final placement using VPR]

Status of placement results of circuit “des”

            Wire   Delay (E-05)   Critical Delay (E-08)   Runtime (s)
CS-init     306    5.93           -                       -
VPR-init    1087   14.00          -                       -
CS-final    237    5.08           8.28                    13.38
VPR-final   221    4.98           10.10                   28.42
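The final rows of the “des” table can be read as relative changes against VPR. The short calculation below is only arithmetic on the numbers quoted in the table; the dictionary keys are hypothetical names for the table columns.

```python
# Final placement figures for circuit "des", as quoted in the table above.
cs_final = {"wire": 237, "delay": 5.08e-5, "critical_delay": 8.28e-8, "runtime_s": 13.38}
vpr_final = {"wire": 221, "delay": 4.98e-5, "critical_delay": 10.10e-8, "runtime_s": 28.42}

def relative_change(new, ref):
    """Percentage change of `new` with respect to `ref` (positive means larger)."""
    return (new - ref) / ref * 100.0

wire_pct = relative_change(cs_final["wire"], vpr_final["wire"])
delay_pct = relative_change(cs_final["delay"], vpr_final["delay"])
crit_pct = relative_change(cs_final["critical_delay"], vpr_final["critical_delay"])
speedup = vpr_final["runtime_s"] / cs_final["runtime_s"]

print(f"wire {wire_pct:+.1f}%, delay {delay_pct:+.1f}%, "
      f"critical delay {crit_pct:+.1f}%, speedup {speedup:.2f}x")
```

For this one circuit, CS ends with about 7.2% more wire and 2.0% more delay than VPR, about 18.0% less critical delay, and roughly a 2.12x shorter runtime.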
Circuit Similarity CAD Flow
[Figures: CAD flow for design space exploration; CAD flow for incremental design]
Circuit Similarity Algorithm
Iterative similarity algorithm
  We employ the iterative similarity algorithm for undirected molecular graphs [Rupp’07].
  We adapt it to directed circuit graphs: the I/O pins are fixed, and the similarity of fanin and fanout nodes is computed separately, based on constraints unique to circuits.
If (|in(vi)| < |in(v’j)| and |out(vi)| < |out(v’j)|)
Summary of variables
Performance Enhancement
Support constraint
  The support of a node is the set of nodes with predefined matchings in the transitive fanin or fanout cone of this node.
  Formally, if v ∈ G and v’ ∈ G’, the support constraint requires:
  where β ∈ (0,1].
Level constraint
  A topological sort and a reverse topological sort label each internal node with two values.
  Formally, if v ∈ G and v’ ∈ G’, the level constraint requires:
  where Bl and Br are two nonnegative integers.
Effectiveness of the pruning techniques
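The two level labels used by the level constraint can be computed with one forward and one reverse topological pass. The sketch below assumes the labels are longest-path depths from the inputs and from the outputs, and it assumes a simple window test as the pruning predicate; the exact inequality is not reproduced on the slide, so `passes_level_window` is only one plausible reading. All function names are hypothetical.

```python
from collections import deque

def levels(adj):
    """Longest-path depth of every node, computed with Kahn's topological sort."""
    indeg = {v: 0 for v in adj}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    level = {v: 0 for v in adj}
    queue = deque(v for v in adj if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return level

def reverse(adj):
    """Flip every edge so a forward pass measures depth from the outputs."""
    rev = {v: [] for v in adj}
    for u in adj:
        for v in adj[u]:
            rev[v].append(u)
    return rev

def level_labels(adj):
    """Label each node with (depth from inputs, depth from outputs)."""
    fwd, bwd = levels(adj), levels(reverse(adj))
    return {v: (fwd[v], bwd[v]) for v in adj}

def passes_level_window(label_v, label_w, bl, br):
    # ASSUMPTION: a candidate pair is kept only if both labels differ by at
    # most bl and br respectively; the slide does not show the actual formula.
    return abs(label_v[0] - label_w[0]) <= bl and abs(label_v[1] - label_w[1]) <= br

# Usage on a tiny circuit: inputs a and b feed c, which feeds output d.
labels = level_labels({"a": ["c"], "b": ["c"], "c": ["d"], "d": []})
```

Under this reading, Bl = Br = 0 only compares nodes with identical labels, while larger bounds admit more candidate pairs at a higher cost.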
Incremental Design
CAD flow
  Two-iteration CAD flow.
  The CSBP flow (a) and the from-scratch flow (b) are compared.
  Optimization “imfs” reduces the number of CLBs by 2%.
Settings
  Two versions of CSBP are compared:
    A high-quality version (CS) with β = 0.5, inner_num = 1 and Bl = Br = 1;
    A turbo version (CS-t) with β = 1, inner_num = 0.1 and Bl = Br = 0.
  CSBP is implemented in C and evaluated on the 20 largest MCNC benchmarks.
  The results are averaged over 5 runs on a Linux server with a dual-core 2.19GHz CPU and 5GB of memory.
  The CS2 package [Goldberg’97] is used for the maximum matching problem.
CAD flow for incremental design
Results
Initial placement results
  Bounding box cost (bb cost) and delay cost are compared.
  The initial placement results generated using CS are much better than VPR’s initial results and very close to VPR’s final results.
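For readers unfamiliar with the bounding box cost, the sketch below computes its core quantity: the half-perimeter of the smallest rectangle enclosing each net, summed over all nets. This is a simplified illustration, assuming integer grid coordinates; the bb cost reported by VPR additionally scales each net by a correction factor for its number of terminals, which is omitted here. The function and variable names are hypothetical.

```python
def bounding_box_cost(nets, placement):
    """Sum of half-perimeter bounding boxes over all nets.

    nets: list of nets, each a list of block names.
    placement: {block name: (x, y)} grid coordinates.
    """
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Usage with three blocks and two nets (made-up coordinates).
placement = {"a": (0, 0), "b": (2, 1), "c": (1, 3)}
nets = [["a", "b"], ["a", "b", "c"]]
cost = bounding_box_cost(nets, placement)
```

A placement seeded from a similar, already optimized circuit tends to keep connected blocks close together, which is why its initial bb cost starts far below that of a random initial placement.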
[Figures: Comparisons of initial bb cost; Comparisons of initial delay cost. x-axis: the 20 MCNC benchmarks (alu4, apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584, seq, spla, tseng); y-axis: Percentage (0% to 100%); series: CS-init, VPR-final, VPR-init]
CS reduces bb cost by 72% on avg. compared to VPR.
CS reduces delay cost by 53% on avg. compared to VPR.
Results (Cont.)
Post-routing results comparison
  A low-temperature annealing is applied to the initial results.
  Wire length, critical delay and area are compared.
  The results demonstrate the effectiveness of the pruning techniques, which do not affect the quality significantly.
[Figures: Wire length; Area; Critical delay. x-axis: the 20 MCNC benchmarks; series: CS-t, CS, VPR]
CS increases the area by 2% on avg.
CS increases the wire length by 3% on avg.
CS increases the crit. delay by 6% on avg.
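The low-temperature annealing that refines the similarity-based initial placement can be sketched as follows. This is a toy version under stated assumptions: the cost is plain half-perimeter wire length, moves only swap two placed blocks, and the schedule is a fixed geometric cooling from a small starting temperature. The function names and every parameter value are hypothetical and are not taken from VPR or CSBP.

```python
import math
import random

def hpwl(nets, placement):
    """Half-perimeter wire length summed over all nets."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def refine(nets, placement, temperature=0.5, cooling=0.9,
           moves_per_temp=50, min_temperature=0.01, seed=0):
    """Refine a good initial placement with a short, cool annealing run."""
    rng = random.Random(seed)
    current = dict(placement)
    cost = hpwl(nets, current)
    best, best_cost = dict(current), cost
    blocks = sorted(current)
    while temperature > min_temperature:
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)
            current[a], current[b] = current[b], current[a]   # try a swap
            new_cost = hpwl(nets, current)
            delta = new_cost - cost
            # Accept improvements always, and uphill moves with a small probability.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(current), cost
            else:
                current[a], current[b] = current[b], current[a]  # undo the swap
        temperature *= cooling
    return best, best_cost

# Usage: four blocks on a 2x2 grid with a deliberately poor starting order.
start = {"a": (0, 0), "b": (1, 1), "c": (1, 0), "d": (0, 1)}
nets = [["a", "b"], ["b", "c"], ["c", "d"]]
refined, refined_cost = refine(nets, start)
```

Starting cool keeps the search close to the seeded placement, so most of the benefit of the matched initial solution is preserved. Here `moves_per_temp` loosely plays the role that `inner_num` plays in the settings listed earlier: fewer moves per temperature trade quality for speed.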
Results (Cont.)
Runtime comparison
  Only placement time is compared.
  CS-t achieves a 31x speedup on average, and up to 91x.
  Larger speedups are expected as circuits become larger.
[Figure: Speedups compared to VPR. y-axis: Speedups (0 to 100); series: CS-t, CS, VPR]
Design Space Exploration
CAD flow
  Logic-level and algorithm-level design spaces are studied separately.
  The CSBP flow (a) and the from-scratch flow (b) are compared.
Settings
  The logic-level design space consists of 19 configurations generated by 19 ABC (1) synthesis scripts in abc.rc.
  The algorithm-level design space consists of 18 configurations of a constant multiplier generated by CMU SPIRAL [Puschel’04], varying the bits from 7 to 25 (2).
  Both CS and CS-t are evaluated.
  The benchmarking environments are the same as for logic-level design space exploration.
[Figure: CAD flow for design space exploration]
(1) http://www.eecs.berkeley.edu/~alanmi/abc/
(2) Bit = 16 is abandoned due to ABC crash
Logic-level Sample Synthesis Scripts

Alias | Script
resyn | "b; rw; rwz; b; rwz; b"
resyn2 | "b; rw; rf; b; rw; rwz; b; rfz; rwz; b"
resyn2a | "b; rw; b; rw; rwz; b; rwz; b"
src_rw | "st; rw -l; rwz -l; rwz -l"
src_rs | "st; rs -K 6 -N 2 -l; rs -K 9 -N 2 -l; rs -K 12 -N 2 -l"
choice | "fraig_store; resyn; fraig_store; resyn2; fraig_store; fraig_restore"
rwsat | "st; rw -l; b -l; rw -l; rf -l"
compress | "b -l; rw -l; rwz -l; b -l; rwz -l; b -l"
share | "st; multi -m; fx; resyn2"
http://www.eecs.berkeley.edu/~alanmi/abc/
[Figure: chart over the 19 synthesis scripts (resyn, resyn2, resyn2a, resyn3, compress, compress2, choice, choice2, rwsat, rwsat2, shake, share, src_rw, src_rs, src_rws, resyn2rs, compress2rs, resyn2rsdc, compress2rsdc); y-axis: 0 to 2500; series: CS, CS-t, VPR]
Logic Level Results
Initial results comparison
  The numbers of CLBs and levels vary widely in the logic-level design space.
  Circuit “dsip” is shown as an example.
  Bounding box cost and delay cost are compared for the initial placement results.
Initial bb cost of “dsip”
Critical delay
Characteristics of logic-level design space
[Figure: Initial delay cost of “dsip” over the 19 synthesis scripts; y-axis: 0.00E+00 to 4.00E-04; series: CS, CS-t, VPR]
CS reduces bb cost by 76% on avg.
CS reduces delay cost by 48% on avg.
Logic Level Results (Cont.)
Final placement results
  Wire length and critical delay of circuit “dsip” are compared.
  The final results produced by CS and CS-t are very close to or better than VPR’s, with 32% overhead for wire length and 20% improvement for critical delay.
[Figures: Final wire length comparison of “dsip”; Final critical delay comparison of “dsip”. x-axis: the 19 synthesis scripts; y-axis: Percentage (0% to 100%); series: CS-t, CS, VPR]
Logic Level Results (Cont.)
Design space shape characterization
  We compare the minimal, median and maximal wire length and critical delay produced by CS, CS-t and VPR.
  We also compare the shapes of each configuration over 19 designs.
  The almost identical curves show that CSBP is able to accurately depict the shape of a design space.
[Figures: Shape of minimal wire length of 20 circuits over 19 designs (series: vpr-min, cs-min, cs-t-min); Shape of minimal crit. delay of 20 circuits over 19 designs (series: vpr-min, cs-min, cs-t-min); Shape of final wire length of circuit “dsip” (series: vpr, cs, cs-t)]
Logic Level Results (Cont.)
Runtime comparison
  Only placement time is compared.
  CS-t achieves a 30x speedup on average, and up to 100x.
  In practice, one can take advantage of the significant speedup of CS-t to perform quick design space exploration.
Runtime comparison (“*” marked time is measured with a timeout)
[Figure: Speedups compared to VPR. x-axis: the 20 MCNC benchmarks; y-axis: Speedups (0 to 100); series: CS, CS-t, VPR]
Algorithm Level Results
Experimental settings
  The algorithm-level design is a constant multiplier.
  The design parameter explored in our experiments is the number of fractional bits, varying from 7 to 25 (1).
  CMU SPIRAL is used to generate the RTL design based on the Hcub algorithm [Voronenko’07].
Experimental results
  The initial and final placement results are similar to those of logic-level space exploration.
  CS and CS-t achieve 7x and 30x speedups compared to VPR, respectively.
Characteristics of algorithm-level design space generated by CMU SPIRAL
An example of a constant parallel multiplier
(1) Bit = 16 is abandoned due to ABC crash
Algorithm Level Results (Cont.)
Wire length-delay space comparison
  The Pareto points, which are the optimal configurations in a design space, are of most interest to IC designers.
  CS and VPR find the same Pareto points.
  Bits = 24 is used as the reference circuit.
[Figures: Wire length-delay space of VPR; Wire length-delay space of CS. x-axis: Wire length; y-axis: Estimated critical delay; plotted configurations: B7, B8, B9, B10, B12, B14, B15, B17, B18, B19, B21, B22, B23, B25]
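Extracting the Pareto points from a wire length-delay space is a small computation once each configuration has its two metrics. The sketch below assumes that smaller is better for both metrics; the function name and the configuration data are made up for illustration and are not the measurements behind the plots.

```python
def pareto_points(points):
    """Return the names of configurations that no other configuration dominates.

    points: {name: (wire_length, critical_delay)}, smaller is better for both.
    One point dominates another if it is no worse in both metrics and strictly
    better in at least one.
    """
    front = []
    for name, (w, d) in points.items():
        dominated = any(
            (w2 <= w and d2 <= d) and (w2 < w or d2 < d)
            for other, (w2, d2) in points.items() if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

# Made-up configurations: (wire length, estimated critical delay in seconds).
configs = {
    "cfg_a": (100, 3.0e-7),
    "cfg_b": (150, 2.5e-7),
    "cfg_c": (200, 2.6e-7),   # worse than cfg_b in both metrics
    "cfg_d": (120, 3.5e-7),   # worse than cfg_a in both metrics
}
front = pareto_points(configs)
```

If a fast placer yields the same set as the slow reference placer, as reported above for CS and VPR, the fast placer can be used to pick the configurations worth implementing in full.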
Future Work
Improvement to CSBP
  Integrate predefined matchings, for example name matching, into CSBP to further enhance both the efficiency and the quality of the design.
Other applications
  Study the effectiveness of applying the circuit similarity algorithm to other applications, such as routing and sequential verification for FPGAs.
Conclusion
Proposed an efficient circuit similarity algorithm.
Developed CSBP, a fast circuit similarity-based placement for FPGAs.
Applied CSBP to incremental design and design space exploration.
Open-source tool available at: http://webdocs.cs.ualberta.ca/~xshi/soft.html
Applied CSBP to incremental design for FPGAs
  CSBP is able to reduce engineering effort by capturing the similarity from the previous design iterations.
  CSBP is 31x faster than VPR.
Applied CSBP to design space exploration for FPGAs
  CSBP can precisely depict the shape of a design space and pinpoint the optimal designs.
  CSBP is 30x faster than VPR.