A Decomposition Algorithm to Structure Arithmetic Circuits
Ajay K. Verma, Philip Brisk, Paolo Ienne
Ecole Polytechnique Fédérale de Lausanne (EPFL)
International Workshop on Logic and Synthesis
August 1, 2009
Logic Optimization Strategies
[Figure: Ripple-Carry Adder; Carry-Lookahead Adder]
• Logic synthesis tools
– Local optimization via Boolean minimization
• Architectural transformation
– Not with “traditional” logic synthesis
Naïve Leading Zero Detector
x_i is TRUE if the (i+1)th most-significant bit is the leading non-zero bit (a’ denotes the complement of a)
x_i = a_15’ a_14’ … a_(15-(i-2))’ a_(15-(i-1))’ a_(15-i)
Convert xi to a binary number
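A minimal Python sketch of the naïve detector, assuming a 16-bit input with a15 as the most-significant bit. The function names are illustrative and not from the slides. The binary conversion is taken here to be the index of the single TRUE x_i, which equals the number of leading zeros.

```python
from typing import List, Optional


def leading_one_hot(a: int, width: int = 16) -> List[int]:
    """Return x[0..width-1]; x[i] is 1 iff the (i+1)th most-significant bit is the leading 1."""
    # bits[0] is the most-significant bit of a.
    bits = [(a >> (width - 1 - i)) & 1 for i in range(width)]
    x = []
    all_higher_zero = 1  # AND of the complements of all more-significant bits
    for bit in bits:
        x.append(all_higher_zero & bit)
        all_higher_zero &= 1 - bit
    return x


def leading_zero_count(a: int, width: int = 16) -> Optional[int]:
    """Convert the one-hot x vector to a number; None if the input is all zeros."""
    x = leading_one_hot(a, width)
    if not any(x):
        return None
    return x.index(1)
```

At most one x_i is TRUE for any input, so the conversion step is a plain one-hot encoder.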
Optimized LZD [Oklobdzija 1994]
Comparison
Naïve LZD: 0.36 ns (427 μm²)
Optimized LZD: 0.30 ns (392 μm²)
Optimized LZD is 16% faster, 8% smaller
Outline
• Algorithmic Overview
• Progressive Decomposition Algorithm…
– … and its shortcomings
– [Verma et al., DAC 2007]
• New Algorithm
• Experimental Results
• Conclusion
Input Condensation
• Leader Expressions
– Sufficient to evaluate expression
– Once evaluated, you can discard input bits
– Works for circuits with “effective online algorithms”
[Figure: one big circuit (IN → OUT) is replaced by a block computing leader expressions L, with |L| < |IN|, feeding a smaller circuit that produces OUT; leader expressions are then recursively computed again for the smaller circuit]
8:4 Parallel Counter
[Figure: 8:4 parallel counter; the signals s and c are leader expressions]
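A small Python sketch of input condensation on the 8:4 parallel counter. It assumes that s and c denote the sum and carry of a full adder over a three-bit subset of the inputs; that reading, and all function names, are assumptions rather than details from the slides. The check confirms that once s and c are known, the three input bits can be discarded.

```python
from itertools import product


def full_adder(a, b, c):
    """Sum and carry of three bits: two leader expressions replacing three inputs."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def count_direct(bits):
    """Reference 8:4 parallel counter: number of ones among the 8 inputs."""
    return sum(bits)


def count_via_leaders(bits):
    """Same count, computed from the leader expressions and the remaining 5 bits."""
    s, c = full_adder(*bits[:3])
    return s + 2 * c + sum(bits[3:])


def leaders_suffice():
    """Exhaustively compare both counters over all 256 input assignments."""
    return all(
        count_direct(b) == count_via_leaders(b)
        for b in product((0, 1), repeat=8)
    )
```

Here the block has 3 inputs and 2 outputs, so |L| < |IN| holds for this subset.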
Hierarchical Circuit Construction
Use leader expressions as building blocks to impose hierarchy
Progressive Decomposition
• Choose a subset of input bits
– How many bits?
– Many different combinations?
• Find leader expressions
– Optimize via Boolean ring properties
– Find identities
• Discard dependent expressions
– Among x, y, z, discard z if z = f(x, y)
• Rewrite circuit in terms of leader expressions
• Recursively process the remaining circuit
Progressive Decomposition: Shortcomings and Concerns
• [Verma et al., DAC 2007]
• Entire algorithm based on Reed-Muller Form
– Rewrite ‘your’ optimizer if you use, e.g., AIGs or BDDs
– Exponential blowup for leading one detector
• Cannot optimize multipliers
• Cannot optimize “structurable circuits” surrounded by peripheral logic
Compound Circuits
[Figure: two versions of a compound circuit; labels M1, M2, E1, E2, s1, s2, sign, neg, not, and, xor, out; bit widths 48, 19, 19, 4, 1]
g72x
0.82 ns (7998 μm²) versus 0.94 ns (5142 μm²): 12% faster, 55% larger
Support Sets
• Progressive Decomposition
– Support sets are subsets of one another or disjoint
– Blocks must always reduce the number of inputs
Support Sets
• New Approach
– Support sets may overlap
– Relaxes input condensation constraint
– Both conditions are necessary to support multipliers
New Algorithm: Overview
• Supports any representation with minimization algorithms
– We use BDDs
• Use SAT to check functional dependency
– [Lee et al., ICCAD 2007]
• Restrict Operator computes generalized cofactors
– [Coudert and Madre, ICCAD 1990]
• Entropy-based delay estimator
– [Macii et al., GLS-VLSI 1999]
– Imprecise, but effectively computes relative delays
Input Bit Selection via Random Sampling
Example 1:
a5 a4 a3 a2 0 1
b5 b4 b3 b2 1 1
Complexity of a 4-bit adder
Example 2:
a5 1 a3 a2 a1 0
b5 b4 1 1 b1 b0
Complexity of a 6-bit adder
• Select every combination of k input bits for k < 6
• Randomly assign values to the bits
• Estimate the complexity of the resulting circuit
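The selection loop above can be sketched in Python as follows. This is only an illustration: the slides use an entropy-based delay estimator, whereas the sketch substitutes a toy metric (the number of free variables the residual function still depends on). The names `support_size`, `score_groups`, and `demo_function` are hypothetical, and lower scores are assumed to be preferable.

```python
import random
from itertools import combinations, product


def support_size(f, free_vars, fixed):
    """Toy complexity metric (a stand-in, not the slides' estimator):
    count the free variables that the residual function depends on."""
    count = 0
    for v in free_vars:
        others = [u for u in free_vars if u != v]
        for values in product((0, 1), repeat=len(others)):
            env = dict(fixed)
            env.update(zip(others, values))
            if f({**env, v: 0}) != f({**env, v: 1}):
                count += 1
                break
    return count


def score_groups(f, variables, k, samples=8, seed=0):
    """For every group of k input bits, average the metric over random assignments."""
    rng = random.Random(seed)
    scores = {}
    for group in combinations(variables, k):
        free = [v for v in variables if v not in group]
        total = 0
        for _ in range(samples):
            fixed = {v: rng.randint(0, 1) for v in group}
            total += support_size(f, free, fixed)
        scores[group] = total / samples
    return scores


def demo_function(env):
    """Hypothetical example expression: (a AND b) XOR (c AND d)."""
    return (env["a"] & env["b"]) ^ (env["c"] & env["d"])
```

Calling `score_groups(demo_function, "abcd", 2)` scores all six pairs of input bits; a real flow would swap in a proper delay or complexity estimator.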
Computing Leader Expressions
E – input expression
B – chosen input bits
S – leader expressions found thus far
R – remaining bits
Is E functionally dependent on S and R?
1. Randomly sample R’s assignment space
2. Find missing leader expressions using SAT [Lee et al., ICCAD 2007]
- Satisfying assignments provide missing leader expressions
- S contains all leader expressions if no satisfying assignments exist
Redundant Leader Expressions
• Leader expression is an input bit
– Non-disjunctive decomposition is required
– Remove from the set
• Leader expression contains no useful information
– Remove from the set
Example: a0b1 + a1b0 with B = {a0, b0}; the assignment a1 = b1 = 1 yields a0 + b0
Cannot help to compute the original expression
Redundant Leader Expressions
• Generalized cofactors
– E = (a + b)x + (ab + c)y
– g = a + b + c
– E |g=0 = 0
– E |g=1 = (a + b)x + (ab + c)y = E
• The general case is a reduction to SAT
– Problem instances tend to be small
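The generalized-cofactor example can be checked exhaustively with a short Python sketch. It assumes that “+” denotes OR and juxtaposition denotes AND, which is the reading under which E |g=0 = 0 holds; the function names are illustrative.

```python
from itertools import product


def E(a, b, c, x, y):
    """E = (a + b)x + (ab + c)y, reading + as OR and juxtaposition as AND."""
    return ((a | b) & x) | (((a & b) | c) & y)


def g(a, b, c):
    """Candidate leader expression g = a + b + c."""
    return a | b | c


def check_cofactors():
    """Return (E vanishes wherever g = 0, E equals g*E + g'*0 everywhere)."""
    vanishes_off_g = True
    shannon_holds = True
    for a, b, c, x, y in product((0, 1), repeat=5):
        e, gv = E(a, b, c, x, y), g(a, b, c)
        if gv == 0 and e != 0:
            vanishes_off_g = False
        # Shannon expansion with cofactors E|g=1 = E and E|g=0 = 0.
        rebuilt = (gv & e) | ((1 - gv) & 0)
        if rebuilt != e:
            shannon_holds = False
    return vanishes_off_g, shannon_holds
```

Both checks pass, so E itself is a valid cofactor for g = 1. Expanding around g therefore does not simplify E, which is presumably why such a g is treated as redundant.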
Rewrite Original Expression
• Rewrite as Shannon expansion using the Restrict operator
– Generalized cofactors are not unique
– The order in which the cofactors of each leader expression are computed may affect the result
• For each cofactor:
– Estimate the delay
– Estimate delay of F based on Shannon expansion
• Select the cofactor that leads to the minimal estimated delay of F
F = g(F |g=1) + g’(F |g=0)
D(F) ≤ max{D(F |g=1), D(F |g=0), D(g)} + D(mux)
[Figure: multiplexer controlled by g selecting between F |g=1 and F |g=0 to produce F]
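A minimal Python sketch of the selection step, treating the right-hand side of the delay formula as the estimate for F. The function names, the default D(mux) value, and the candidate delays in the usage note are hypothetical.

```python
def estimated_delay(d_g, d_pos, d_neg, d_mux):
    """max{D(F|g=1), D(F|g=0), D(g)} + D(mux) for one Shannon expansion."""
    return max(d_pos, d_neg, d_g) + d_mux


def pick_leader(candidates, d_mux=1.0):
    """Pick the candidate with the smallest estimated delay of F.

    candidates maps a name to the tuple (D(g), D(F|g=1), D(F|g=0)).
    """
    return min(
        candidates,
        key=lambda name: estimated_delay(*candidates[name], d_mux),
    )
```

For example, with `{"g1": (2.0, 5.0, 4.0), "g2": (3.0, 3.0, 2.0)}` the estimates are 6.0 and 4.0, so `pick_leader` returns `"g2"`. Only the relative order of the estimates matters here.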
Rinse, Repeat
• We picked this set of input bits to optimize!
• We generated a set of leader expressions.
• Local optimization using your favorite LS tool can’t hurt.
• The leader expressions are now frozen, and the block that computes them is optimized.
• Optimize the remaining circuit.
Experimental Setup
Four flows (numbered 1 to 4 on the slide):
• Circuit written by hand
• Known Arithmetic Circuits
• Prog. Decomp.
• New Algorithm
Synopsys Design Compiler
- compile_ultra
- minimize delay
Artisan Standard Cells, UMC (90 nm)
Critical Path Delay
[Bar chart: critical path delay in ns (axis 0 to 1). x-axis labels: 16-bit ADD 12-bit 3ADD 8x8-bit MUL 16-bit SHIFT ADPCM g72x SAD. Series: Original; Progressive Decomposition; Our Algorithm; Library/Manual Implementation. Annotations: “Optimized for Area, Not Delay”; “Progressive Decomposition Fails”]
Area
[Bar chart: area in μm² (axis 0 to 14000). x-axis labels: 16-bit ADD 12-bit 3ADD 8x8-bit MUL 16-bit SHIFT ADPCM g72x SAD. Series: Original; Progressive Decomposition; Our Algorithm; Library/Manual Implementation. Annotations: “Optimized for Area, Not Delay”; “Progressive Decomposition Fails”]
Conclusion
• Technique to structure arithmetic circuits
– Fixes shortcomings of Progressive Decomposition
• Our approach is orthogonal to classical Boolean minimization techniques
• Discovered new implementation of a k-input MAX function
– Similar structure to LZD Circuit
– Will appear at ICCAD 2009
Computing Leader Expressions
Original Variables
B = {b1, …, bm}
R = {r1, …, rn}
Dummy Variables
C = {c1, …, cm}
S = {s1, …, sn}
Extra Constraints
[Lee et al., ICCAD 2007]
ri = si, 1 ≤ i ≤ n
For each leader expression ej:
ej(b1, …, bm) = ej(c1, …, cm)
Two different input assignments must produce different outputs:
E(b1, …, bm, r1, …, rn) ≠ E(c1, …, cm, s1, …, sn)
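The dependency check can be illustrated in Python with brute-force enumeration standing in for the SAT solver. The sketch looks for two assignments that agree on the remaining bits and on every leader expression found so far, yet give different values of E. The function names and the example expression are hypothetical.

```python
from itertools import product


def find_counterexample(E, leaders, m, n):
    """Search for (b, c, r) with e_j(b) == e_j(c) for all j but E(b, r) != E(c, r).

    Returns None when E is functionally dependent on the leaders and R.
    Brute force replaces the SAT query, so this only suits tiny m and n.
    """
    for r in product((0, 1), repeat=n):
        seen = {}  # leader signature -> (first b with that signature, its E value)
        for b in product((0, 1), repeat=m):
            signature = tuple(e(b) for e in leaders)
            value = E(b, r)
            if signature in seen and seen[signature][1] != value:
                return seen[signature][0], b, r
            seen.setdefault(signature, (b, value))
    return None


def threshold2(b, r):
    """Hypothetical E: TRUE when at least two of the input bits are 1."""
    return int(sum(b) + sum(r) >= 2)


def xor2(b):
    return b[0] ^ b[1]


def and2(b):
    return b[0] & b[1]
```

With only `xor2` as a leader expression, `find_counterexample(threshold2, [xor2], 2, 2)` returns a witness pair, which signals a missing leader expression. After adding `and2`, it returns `None`.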
Input Bit Selection
[Figure: E and E’ with B = {x, y, z}; cofactors E’ |xyz=000, E’ |xyz=001, …, E’ |xyz=111]
• Compute every combination of k input bits for k < 6
• Assign values to x, y, z using random sampling
• Use delay estimator for each E’
• The complexity of E’ is the metric by which we evaluate each group of input bits