Making Good Points: Application-Specific Pareto-Point Generation for Design Space Exploration using Rigorous Statistical Methods
David Sheldon, Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
Parameterized Component: Cache
[Figure: configurable cache illustrating line concatenation; labels: counter, bus, W1, 16 bytes, off-chip memory; 4 physical lines filled when line size is 32 bytes]
[Zhang/Vahid/Najjar, ISCA 2003, ISVLSI 2003, TECS 2005]
[Figure: normalized energy (axis 0% to 120%) per benchmark (padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjepg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg, art, mcf, parser, vpr, Ave) for cache configurations cnv8K4W32B, cnv8K1W32B, and cfg8Kwcwslc; off-scale bars labeled 127%, 620%, and 126%]
40% average savings
David Sheldon, UC Riverside 3 of 19
FPGA Systems are Often Built from Parameterized Components
Parameterized components include:
Caches (e.g., size, associativity, line size)
Processors
Co-processors
Buses (e.g., bit width, network-on-chip structure)
[Figure: FPGA system containing uP, cache, MPEG Enc, DSP, and bus components, each with its own config setting]
Microblaze Soft-Core Processor – Design Space due to Parameters
[Figure: design space of the Microblaze soft-core processor; axes: cycles (millions) and equivalent LUTs (thousands)]
520 points, generated over 10 days
~35 min per point
<1 min to execute
Remaining time was spent in synthesis and place and route
Pareto points: points for which no other point exists that is better in all metrics.
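The Pareto-point definition above can be illustrated with a small filter. This is a minimal sketch, assuming two metrics that are both minimized (equivalent LUTs and cycles) and using the common dominance test (no worse in every metric, strictly better in at least one). The function names and the sample measurements are hypothetical and are not data from the slides.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (equivalent LUTs, cycles); lower is better in both


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every metric and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_points(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, sorted by the first metric."""
    return sorted(p for p in set(points) if not any(dominates(q, p) for q in points))


# Hypothetical (LUTs, cycles) measurements, made up for illustration.
designs = [(100, 9.0), (120, 6.0), (150, 6.5), (180, 4.0), (110, 9.5)]
print(pareto_points(designs))
```

Here (150, 6.5) is dropped because (120, 6.0) is better in both metrics, and (110, 9.5) is dropped because (100, 9.0) is.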
Pareto Points Differ Per Application and Per Criteria
[Figure: one platform running App a1 and App a2; panels (a) and (b) plot Energy against Time for configurations c1, c2, c3, ..., marking the Pareto points and the points of interest to Designer A and Designer B]
Previous Work: Parameter Interdependency Graph
Platune [Givargis/Vahid 2002]: introduced the parameter interdependency graph
Edges: parameters are dependent
Nodes not connected: parameters are independent
Search dependent parameters exhaustively; compose local Pareto points into global points
Greatly reduces the search space if parameters are independent
Good results, 44 hours
Randomized Approaches
Pareto Simulated Annealing (PSA) [Talarico 2006]: good results, 6 hours
Genetic Algorithms [Ascia 2005]: good results, 4 hours
Platune’s Architecture
[Figure: MIPS, I$, D$, and MEM connected by the CPU–I$, CPU–D$, and $–MEM buses, annotated with parameters: cache size, assoc., and line size; bus code and address code; supply voltage]
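To see why independent parameters shrink the search, the interdependency graph can be split into connected clusters that are searched separately. The sketch below is illustrative only: the parameter value counts and edges are hypothetical, and it counts just the cluster-local exhaustive searches, not the extra evaluations needed to compose local Pareto points into global ones.

```python
from math import prod
from typing import Dict, List, Set, Tuple


def clusters(params: List[str], edges: List[Tuple[str, str]]) -> List[Set[str]]:
    """Group parameters into connected components of the interdependency graph."""
    neighbours: Dict[str, Set[str]] = {p: set() for p in params}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen: Set[str] = set()
    result: List[Set[str]] = []
    for start in params:
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:  # depth-first walk over dependent parameters
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        result.append(component)
    return result


# Hypothetical number of values per parameter and hypothetical dependencies.
levels = {"i$ size": 4, "i$ assoc": 3, "d$ size": 4, "d$ assoc": 3, "voltage": 5}
edges = [("i$ size", "i$ assoc"), ("d$ size", "d$ assoc")]

groups = clusters(list(levels), edges)
exhaustive = prod(levels.values())                           # 4*3*4*3*5
clustered = sum(prod(levels[p] for p in g) for g in groups)  # 12 + 12 + 5
print(exhaustive, clustered)
```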
Our Approach
We developed:
A Design-of-Experiments (DoE)-based technique to automatically generate a parameter interdependency graph
Relieves the designer of that burden
A technique to generate Pareto points via an algorithm based on the edge weights of the parameter interdependency graph
Improves speed versus Platune
Called the DoE-Based Pareto-Point Generator (DPG)
[Figure: Energy (J) versus Time (sec); additional labels: Time, Performance]
Design of Experiments (DoE)
[Figure: DoE experiment table over the parameters i$ size, i$ assoc, d$ size, d$ line, d$ assoc, m-i$ code, m-i$ a code, $-m code, and supply voltage, shown beside Platune's architecture; example entries: 2k, 8, 8k, 8, 32, Bi, Bi, Bi, 4.1]
DoE generates a set of orthogonal experiments that allows for statistical analysis of the search space
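The slides do not say which experimental design is used. As one illustration of an orthogonal set of two-level experiments, the sketch below builds a design from a Sylvester Hadamard matrix, which is a swapped-in textbook construction rather than the authors' method. Function names are hypothetical. For nine two-level parameters it yields 16 experiments instead of the 512 of a full factorial; such a small design separates main effects only, and estimating pairwise interactions would need more runs.

```python
from typing import List


def hadamard(order: int) -> List[List[int]]:
    """Sylvester construction of a Hadamard matrix; order must be a power of two."""
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def two_level_design(num_factors: int) -> List[List[int]]:
    """Rows are experiments, columns are factors, entries are levels -1 or +1."""
    order = 1
    while order < num_factors + 1:
        order *= 2
    # Drop the all-ones first column and keep the next num_factors columns.
    return [row[1:num_factors + 1] for row in hadamard(order)]


design = two_level_design(9)  # nine two-level parameters
for experiment in design:
    print(experiment)
```

Every column is balanced (equal numbers of -1 and +1) and every pair of columns has a zero dot product, which is the orthogonality property that makes the later statistical analysis possible.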
DPG Algorithm
Subsequent DoE analysis determines the main effects of the parameters
[Figure: Y-bar marginal means plot (y-axis 0 to 0.00016); x-axis: effect levels -1 and 1 for each of i$ size, i$ assoc, d$ size, d$ line, d$ assoc, m-i$ code, m-i$ a code, $-m code, and supply voltage]
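The main effect of a two-level parameter is commonly estimated as the mean response at level +1 minus the mean response at level -1, which is what a marginal means plot visualizes. The following is a minimal sketch under that assumption; the function name, the tiny 2x2 design, and the energy values are hypothetical.

```python
from typing import Dict, Sequence


def main_effects(design: Sequence[Sequence[int]],
                 response: Sequence[float],
                 names: Sequence[str]) -> Dict[str, float]:
    """Mean response at level +1 minus mean response at level -1, per factor."""
    effects = {}
    for j, name in enumerate(names):
        high = [y for row, y in zip(design, response) if row[j] == 1]
        low = [y for row, y in zip(design, response) if row[j] == -1]
        effects[name] = sum(high) / len(high) - sum(low) / len(low)
    return effects


# Hypothetical full factorial over two parameters, with made-up energy values.
design = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
energy = [10.0, 6.0, 9.0, 5.0]
effects = main_effects(design, energy, ["i$ size", "i$ assoc"])
print(effects)
```

In this made-up data, raising i$ size lowers energy by 4.0 on average while raising i$ assoc lowers it by 1.0, so i$ size has the larger main effect.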
DPG Algorithm (cont.)
Compute the weight of each pair of nodes
Sort edges in decreasing weight:
DK, (I$ assoc, CPU-I$ address code)
DI, (I$ assoc, CPU-I$ code)
IK, (CPU-I$ code, CPU-I$ address code)
IQ, (CPU-I$ code, $-MEM address code)
KQ, (CPU-I$ address code, $-MEM address code)
...
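The slides do not state how an edge weight is computed. One possibility, assumed here purely for illustration, is the magnitude of the two-factor interaction effect between the two parameters. The sketch below ranks edges that way on a hypothetical full factorial over three parameters A, B, and C with a made-up response; none of the names or numbers come from the slides.

```python
from itertools import combinations, product
from typing import List, Sequence, Tuple


def interaction_effect(design: Sequence[Sequence[int]],
                       response: Sequence[float], i: int, j: int) -> float:
    """Mean response where factors i and j share a level minus mean where they differ."""
    same = [y for row, y in zip(design, response) if row[i] * row[j] == 1]
    diff = [y for row, y in zip(design, response) if row[i] * row[j] == -1]
    return sum(same) / len(same) - sum(diff) / len(diff)


def ranked_edges(design: Sequence[Sequence[int]], response: Sequence[float],
                 names: Sequence[str]) -> List[Tuple[float, str, str]]:
    """All parameter pairs, sorted by decreasing assumed weight."""
    edges = [(abs(interaction_effect(design, response, i, j)), names[i], names[j])
             for i, j in combinations(range(len(names)), 2)]
    return sorted(edges, key=lambda e: e[0], reverse=True)


# Hypothetical response with a strong A-B interaction and a weak B-C interaction.
design = list(product([-1, 1], repeat=3))
response = [2.0 * a * b + 0.5 * b * c + a for a, b, c in design]
edges = ranked_edges(design, response, ["A", "B", "C"])
print(edges)
```

The A-B pair sorts first, which is the kind of ordering the edge list above expresses.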
DPG Algorithm (cont.)
Pairwise merge of nodes
Creates a sparse set of Pareto points
The designer can direct the tool to fill in the regions of interest
[Figure: Energy versus Time, showing the original Pareto points and the filled-in Pareto points]
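One plausible reading of a pairwise merge, sketched below, is that each node carries a list of partial configurations, and merging two nodes evaluates the cross product of their configurations and keeps only the Pareto-optimal combinations. This is an assumption, not the authors' stated procedure; the `evaluate` cost model, the parameter names, and the values are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]
Metrics = Tuple[float, float]  # (time, energy); lower is better in both


def pareto(configs: List[Config], evaluate: Callable[[Config], Metrics]) -> List[Config]:
    """Keep the configurations whose metrics no other configuration dominates."""
    scored = [(evaluate(c), c) for c in configs]
    keep = []
    for metrics, config in scored:
        dominated = any(
            all(a <= b for a, b in zip(other, metrics)) and other != metrics
            for other, _ in scored
        )
        if not dominated:
            keep.append(config)
    return keep


def merge(node_a: List[Config], node_b: List[Config],
          evaluate: Callable[[Config], Metrics]) -> List[Config]:
    """Combine every configuration of one node with every configuration of the other."""
    combined = [{**a, **b} for a, b in product(node_a, node_b)]
    return pareto(combined, evaluate)


def evaluate(config: Config) -> Metrics:
    """Toy stand-in for a simulator run (hypothetical cost model)."""
    time = 8 / config["size"] + 2 / config["assoc"]
    energy = float(config["size"] + config["assoc"])
    return time, energy


merged = merge([{"size": 1}, {"size": 2}, {"size": 4}],
               [{"assoc": 1}, {"assoc": 2}], evaluate)
print(merged)
```

Of the six combinations, only size 1 with assoc 2 is discarded, because size 2 with assoc 1 is faster at the same energy.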
Platune – Pareto Graph with Fill-in
[Figure: Energy versus Time for the jpeg benchmark; series: Platune, Single Factor, DoE IOT, DPG - 3 value, DPG - fill in]
Platune – Pareto Graph with Fill-in
[Figure: Energy (J) versus Time (sec) for the b1_histogram benchmark; series: Platune, DPG]
Interdependency Graph Comparison: Manual vs. Automated
[Figure: manually and automatically generated interdependency graphs for jpeg, b1_histogram, and g3fax]
Platune Results
[Figure: run time in hours (axis 0 to 6) for SF, DoE IOT, DPG, Genetic, PSA, and Platune; the off-scale bar is labeled 44]
DPG is 30x faster than Platune and 2.5x faster than Genetic Algorithms
Xilinx Microblaze Soft-Core Processor
Tuned the Microblaze for various benchmarks
Exhaustive data generated for 12 benchmarks for comparison
The Microblaze also has a configurable cache, which allows for over 3,000 configurations.
For these tests we used previously generated results, which cover only 64 configurations.
[Figure: Microblaze with configurable units bs, FPU, div, mul, MSR, and PCMP]
Network on Chip – Results
DPG also works on larger design spaces
DPG Scales Well
Number of Parameters   DPG Analysis Phase   Total Design Space   Percent of Design Space
6                      34                   64                   53.13%
10                     67                   1,024                6.54%
15                     136                  32,768               0.42%
20                     234                  1,048,576            0.02%
25                     353                  33,554,432           0.001%
30                     497                  1,073,741,824        0.00005%
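The percentages in the table follow from dividing the analysis-phase count by the total design space. The totals listed equal 2 raised to the number of parameters, so the sketch below assumes two values per parameter; the function name is hypothetical and the counts are copied from the table.

```python
from typing import List, Tuple


def percent_explored(num_params: int, analysis_count: int) -> float:
    """Share of a two-level design space covered by the analysis phase, in percent."""
    total = 2 ** num_params  # assumes every parameter has exactly two values
    return 100.0 * analysis_count / total


# (number of parameters, DPG analysis phase count), as listed in the table.
rows: List[Tuple[int, int]] = [(6, 34), (10, 67), (15, 136), (20, 234), (25, 353), (30, 497)]
for n, count in rows:
    print(n, count, 2 ** n, percent_explored(n, count))
```

The analysis count grows slowly with the number of parameters while the design space doubles with each one, so the explored fraction falls from about half at 6 parameters to well under a millionth at 30.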
Conclusion
The DoE-Based Pareto-Point Generation (DPG) algorithm quickly finds good Pareto points
Results were better and obtained faster than with Platune or randomized techniques
Approach is easier to use – no designer knowledge of parameter interdependencies is needed
Useful for FPGAs as well as other parameterized systems, such as SOCs synthesized to ASICs, parameterized SOCs, etc.