
A Decomposition Algorithm to Structure Arithmetic Circuits

Ajay K. Verma, Philip Brisk, Paolo Ienne

Ecole Polytechnique Fédérale de Lausanne (EPFL)

International Workshop on Logic and Synthesis

August 1, 2009

Logic Optimization Strategies

(Figure: ripple-carry adder vs. carry-lookahead adder)

• Logic synthesis tools
  – Local optimization via Boolean minimization

• Architectural transformation
  – Not performed by “traditional” logic synthesis
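Both adders in the figure compute the same carries, so the gap between them is structural rather than Boolean. A minimal Python sketch, assuming the textbook generate/propagate formulation (g = a AND b, p = a XOR b; all helper names are illustrative), computes the carries once as a ripple chain and once in the flattened lookahead form and checks that they agree:

```python
def bits(x: int, width: int) -> list[int]:
    """Little-endian bit list of x."""
    return [(x >> i) & 1 for i in range(width)]


def ripple_carries(a: int, b: int, width: int, cin: int = 0) -> list[int]:
    """c[i+1] = g[i] | (p[i] & c[i]): each carry waits for the previous one."""
    A, B = bits(a, width), bits(b, width)
    c = [cin]
    for i in range(width):
        g, p = A[i] & B[i], A[i] ^ B[i]
        c.append(g | (p & c[i]))
    return c


def lookahead_carries(a: int, b: int, width: int, cin: int = 0) -> list[int]:
    """Flattened form: every carry is an AND-OR of g, p and cin only."""
    A, B = bits(a, width), bits(b, width)
    g = [A[i] & B[i] for i in range(width)]
    p = [A[i] ^ B[i] for i in range(width)]
    c = [cin]
    for i in range(width):
        # Term that propagates the carry-in through bits 0..i.
        acc = cin
        for k in range(i + 1):
            acc &= p[k]
        # Terms where bit k generates and bits k+1..i propagate.
        for k in range(i + 1):
            t = g[k]
            for j in range(k + 1, i + 1):
                t &= p[j]
            acc |= t
        c.append(acc)
    return c
```

Both functions return the carry into every bit position plus the carry-out; only the dependency structure differs, since no lookahead carry is computed from another carry.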

Naïve Leading Zero Detector

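xi is TRUE if the (i+1)th most-significant bit of the 16-bit input a15 … a0 is the leading non-zero bit:

$$x_i = \bar{a}_{15}\,\bar{a}_{14} \cdots \bar{a}_{15-(i-2)}\,\bar{a}_{15-(i-1)}\,a_{15-i}$$

Encode the one-hot vector x0 … x15 as a binary number (the leading-zero count i)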
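A direct Python transcription of the two steps above, as a sketch: each x_i is an AND of complemented higher bits, and the encoder ORs the one-hot flags into a 4-bit count. Returning 16 for an all-zero input is a convention assumed here, not something the slide specifies; the function names are illustrative.

```python
WIDTH = 16


def leading_one_flags(a: int) -> list[bool]:
    """x[i] is True iff bit 15-i is 1 and bits 15 .. 15-(i-1) are all 0."""
    msb_first = [(a >> (WIDTH - 1 - k)) & 1 for k in range(WIDTH)]  # msb_first[k] = a_{15-k}
    return [all(b == 0 for b in msb_first[:i]) and msb_first[i] == 1
            for i in range(WIDTH)]


def encode_one_hot(x: list[bool]) -> int:
    """Binary encoder: output bit j is the OR of every x[i] whose index i has bit j set."""
    out = 0
    for j in range(WIDTH.bit_length() - 1):  # 4 output bits for 16 flags
        if any(x[i] for i in range(WIDTH) if (i >> j) & 1):
            out |= 1 << j
    return out


def naive_lzd(a: int) -> int:
    x = leading_one_flags(a)
    return encode_one_hot(x) if any(x) else WIDTH  # all-zero input: no leading one
```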

Optimized LZD [Oklobdzija 1994]
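A sketch of the divide-and-conquer idea usually associated with [Oklobdzija 1994], assuming a power-of-two width: each half reports a valid flag and a position, and combining two halves needs only an OR of the valid flags, a multiplexer on the positions, and one extra position bit. It models the recursion, not the exact gate-level circuit in the figure; the names are illustrative.

```python
def lzd_tree(msb_first: list[int]) -> tuple[bool, int]:
    """Return (valid, count) for a bit vector listed MSB first.

    valid is False when every bit is 0; count (the number of leading zeros)
    is meaningful only when valid is True. The length must be a power of two.
    """
    n = len(msb_first)
    if n == 1:
        return msb_first[0] == 1, 0
    half = n // 2
    v_hi, p_hi = lzd_tree(msb_first[:half])
    v_lo, p_lo = lzd_tree(msb_first[half:])
    if v_hi:
        # Leading one lies in the upper half: the new position bit is 0.
        return True, p_hi
    # Upper half is all zeros: the new position bit is 1, i.e. add `half`.
    return v_lo, half + p_lo


def lzd16(a: int) -> tuple[bool, int]:
    return lzd_tree([(a >> (15 - k)) & 1 for k in range(16)])
```

Because p_lo is always smaller than half, adding half simply sets the next position bit, so the recursion depth, and the delay, grows with log2 of the width.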


Comparison

Naïve LZD: 0.36 ns (427 μm²)

Optimized LZD: 0.30 ns (392 μm²)

The optimized LZD is 16% faster and 8% smaller
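The percentages follow from the two measurements, each taken relative to the naïve design. A quick check in Python, rounding down, which appears to be how the slide rounds:

```python
naive_delay_ns, naive_area_um2 = 0.36, 427
opt_delay_ns, opt_area_um2 = 0.30, 392

faster_pct = (naive_delay_ns - opt_delay_ns) / naive_delay_ns * 100   # about 16.7
smaller_pct = (naive_area_um2 - opt_area_um2) / naive_area_um2 * 100  # about 8.2
print(f"{int(faster_pct)}% faster, {int(smaller_pct)}% smaller")
```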

Outline

• Algorithmic Overview

• Progressive Decomposition Algorithm…
  – … and its shortcomings
  – [Verma et al., DAC 2007]

• New Algorithm

• Experimental Results

• Conclusion

Input Condensation

• Leader Expressions
  – Sufficient to evaluate the expression
  – Once they are evaluated, the input bits can be discarded
  – Works for circuits with “effective online algorithms”

(Figure: IN → one big circuit → OUT becomes IN → leader expressions L, with |L| < |IN| → smaller circuit → OUT; leader expressions are then computed recursively again on the smaller circuit.)
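One way to see why |L| < |IN| is possible: two assignments to the chosen bits need different leader values only if they induce different residual functions of the remaining bits. A brute-force Python sketch (exponential, for illustration only; `min_leader_count` and `popcount8` are hypothetical names) counts those distinct residual functions; k Boolean leader expressions can separate at most 2^k of them, so ceil(log2(count)) leaders are needed, and encoding the class index in that many bits is enough.

```python
import math
from itertools import product


def min_leader_count(f, n_b: int, n_r: int) -> int:
    """Fewest Boolean leader expressions over the n_b chosen bits that let us evaluate f.

    Each assignment b of the chosen bits induces a residual function r -> f(b, r);
    leader expressions must separate assignments whose residual functions differ.
    """
    residuals = set()
    for b in product((0, 1), repeat=n_b):
        residuals.add(tuple(f(b, r) for r in product((0, 1), repeat=n_r)))
    return math.ceil(math.log2(len(residuals))) if len(residuals) > 1 else 0


def popcount8(b, r):
    """Example: number of ones among 3 chosen bits b and 5 remaining bits r."""
    return sum(b) + sum(r)
```

For the 8-input population count with three chosen bits, min_leader_count(popcount8, 3, 5) returns 2: three inputs condense into two leader expressions, which is the role the sum and carry play in the 8:4 parallel counter that follows.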

8:4 Parallel Counter

(Figure: 8:4 parallel counter; the intermediate signals s and c are the leader expressions.)
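A sketch, assuming the counter is assembled from full and half adders whose sum and carry outputs (s, c) act as leader expressions: each full adder condenses three bits of one weight into one bit of the same weight and one bit of the next weight. This is the standard carry-save reduction; the exact arrangement in the figure may differ, and the helper names are illustrative.

```python
def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Condense three bits of equal weight into (sum, carry)."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def half_adder(x: int, y: int) -> tuple[int, int]:
    return x ^ y, x & y


def parallel_counter(bits: list[int]) -> list[int]:
    """Count the ones in `bits`; returns the count as a little-endian bit list."""
    columns = {0: list(bits)}  # weight -> bits of that weight still to be condensed
    out = []
    w = 0
    while w in columns:
        col = columns.pop(w)
        # Replace three bits of weight w by s (weight w) and c (weight w + 1).
        while len(col) >= 3:
            s, c = full_adder(col.pop(), col.pop(), col.pop())
            col.append(s)
            columns.setdefault(w + 1, []).append(c)
        if len(col) == 2:
            s, c = half_adder(col.pop(), col.pop())
            col.append(s)
            columns.setdefault(w + 1, []).append(c)
        out.append(col[0] if col else 0)
        w += 1
    return out
```

For 8 inputs this uses three full adders and one half adder at weight 0, and always produces 4 output bits, hence “8:4”.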

Hierarchical Circuit Construction

Use leader expressions as building blocks to impose hierarchy


Progressive Decomposition

• Choose a subset of input bits
  – How many bits?
  – Which of the many different combinations?

• Find leader expressions
  – Optimize via Boolean ring properties
  – Find identities

• Discard dependent expressions
  – e.g., for leader expressions x, y, z with z = f(x, y), discard z

• Rewrite circuit in terms of leader expressions

• Recursively process the remaining circuit
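The “discard dependent expressions” step can be illustrated by brute force: a candidate z can be dropped when some function f reproduces it from the other leader expressions, i.e. no two input assignments agree on x and y yet disagree on z. A Python sketch with hypothetical names, standing in for the identity-finding step rather than reproducing it:

```python
from itertools import product


def is_dependent(z, others, n_inputs: int) -> bool:
    """True iff z(v) is a function of (g(v) for g in others) over all inputs v."""
    seen = {}
    for v in product((0, 1), repeat=n_inputs):
        key = tuple(g(v) for g in others)
        if seen.setdefault(key, z(v)) != z(v):
            return False  # same values of the others, different z
    return True


# Illustrative leader candidates over inputs v = (a, b).
def x(v):
    return v[0] ^ v[1]  # a XOR b


def y(v):
    return v[0] & v[1]  # a AND b


def z(v):
    return v[0] | v[1]  # a OR b, which equals x OR y
```

Here is_dependent(z, [x, y], 2) is True, so z would be discarded, while y alone cannot be recovered from x.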

Progressive Decomposition: Shortcomings and Concerns

• [Verma et al., DAC 2007]

• Entire algorithm is based on the Reed-Muller form
  – Requires rewriting ‘your’ optimizer, e.g., if you use AIGs or BDDs
  – Exponential blowup for the leading-one detector

• Cannot optimize multipliers

• Cannot optimize “structurable circuits” surrounded by peripheral logic
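One way to see the Reed-Muller concern, assuming the positive-polarity (AND/XOR) form: every complemented literal becomes (1 ⊕ a), so the leading-one flag x_i, which has i complemented literals, expands into 2^i product terms. The sketch below computes the form from a truth table with the standard binary Möbius transform; the function names are illustrative.

```python
def reed_muller_terms(truth_table: list[int]) -> int:
    """Number of product terms in the positive-polarity Reed-Muller (ANF) form.

    truth_table[m] is f at minterm m (bit k of m = value of variable k);
    the in-place butterfly is the binary Moebius transform over GF(2).
    """
    coeffs = list(truth_table)
    n = len(coeffs).bit_length() - 1
    for k in range(n):
        step = 1 << k
        for m in range(len(coeffs)):
            if m & step:
                coeffs[m] ^= coeffs[m ^ step]
    return sum(coeffs)


def leading_one_flag(i: int, width: int) -> list[int]:
    """Truth table of x_i: bits width-1 .. width-i are 0 and bit width-1-i is 1."""
    table = []
    for m in range(1 << width):
        upper_zero = all(((m >> (width - 1 - k)) & 1) == 0 for k in range(i))
        table.append(int(upper_zero and ((m >> (width - 1 - i)) & 1) == 1))
    return table
```

For a 16-bit detector, x15 alone would need 2^15 = 32768 terms in this form.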

Compound Circuits

(Figure: two variants of a compound circuit with inputs M1, M2 (48) and E1, E2 (19, 19), built from sign, neg/not, and, and xor blocks producing s1, s2 and out.)

g72x: 0.82 ns (7998 μm²) vs. 0.94 ns (5142 μm²), i.e., 12% faster but 55% larger

Support Sets

• Progressive Decomposition
  – Support sets are subsets of one another or disjoint
  – Blocks must always reduce the number of inputs

Support Sets

• New Approach
  – Support sets may overlap
  – Relaxes input condensation constraint
  – Both conditions are necessary to support multipliers
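The difference between the two regimes can be checked mechanically. A Python sketch (`is_laminar` and the example families are hypothetical) tests whether a family of support sets is laminar, i.e. every pair is nested or disjoint, as Progressive Decomposition requires; the supports of the partial products a_i b_j of a 2x2 multiplier overlap without nesting, so they fail the test.

```python
from itertools import combinations


def is_laminar(supports: list[set[str]]) -> bool:
    """True iff every pair of support sets is nested or disjoint."""
    for s, t in combinations(supports, 2):
        if not (s <= t or t <= s or s.isdisjoint(t)):
            return False
    return True


# Illustrative adder-style family: supports nest or are disjoint.
adder_supports = [{"a0", "b0"}, {"a0", "b0", "a1", "b1"}, {"a2", "b2"}]
# Partial products of a 2x2 multiplier share inputs without nesting.
multiplier_supports = [{"a0", "b0"}, {"a0", "b1"}, {"a1", "b0"}, {"a1", "b1"}]
```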

New Algorithm: Overview

• Supports any representation with minimization algorithms
  – We use BDDs

• Use SAT to check functional dependency
  – [Lee et al., ICCAD 2007]

• The Restrict operator computes generalized cofactors
  – [Coudert and Madre, ICCAD 1990]

• Entropy-based delay estimator
  – [Macii et al., GLS-VLSI 1999]
  – Imprecise, but effectively computes relative delays

Input Bit Selection via Random Sampling

(Example for a 6-bit adder: fixing the low-order bits a1a0 = 01 and b1b0 = 11 leaves the complexity of a 4-bit adder; fixing the scattered bits a4 = 1, a0 = 0, b3 = 1, b2 = 1 leaves the complexity of a 6-bit adder.)

• Select every combination of k input bits for k < 6

• Randomly assign values to the bits

• Estimate the complexity of the resulting circuit
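A brute-force Python sketch of this selection loop for a small adder. The complexity metric is a hypothetical stand-in for the entropy-based estimator: the total number of free inputs that each output bit still depends on. With it, the two assignments in the example score 28 (low-order bits) and 29 (scattered bits), so the low-order choice is preferred, though only narrowly under this crude metric. All names are illustrative.

```python
import itertools
import random


def input_names(width: int) -> list[str]:
    return [f"{p}{i}" for p in "ab" for i in range(width)]


def adder_value(assign: dict[str, int], width: int) -> int:
    """a + b, whose width + 1 bits (including carry-out) are the outputs."""
    a = sum(assign[f"a{i}"] << i for i in range(width))
    b = sum(assign[f"b{i}"] << i for i in range(width))
    return a + b


def total_support(fixed: dict[str, int], width: int) -> int:
    """Crude complexity proxy: sum over output bits of how many free inputs
    that bit still depends on once the `fixed` inputs are assigned."""
    free = [n for n in input_names(width) if n not in fixed]
    depends = [set() for _ in range(width + 1)]
    for values in itertools.product((0, 1), repeat=len(free)):
        assign = {**fixed, **dict(zip(free, values))}
        base = adder_value(assign, width)
        for var in free:
            diff = base ^ adder_value({**assign, var: assign[var] ^ 1}, width)
            for bit in range(width + 1):
                if (diff >> bit) & 1:
                    depends[bit].add(var)
    return sum(len(d) for d in depends)


def rank_selections(width: int, k: int, samples: int = 3, seed: int = 0):
    """Score every k-subset of inputs by its average cost over random assignments."""
    rng = random.Random(seed)
    scored = []
    for combo in itertools.combinations(input_names(width), k):
        cost = sum(total_support({n: rng.randint(0, 1) for n in combo}, width)
                   for _ in range(samples))
        scored.append((cost / samples, combo))
    return sorted(scored)
```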

Computing Leader Expressions

E – input expression

B – chosen input bits

S – leader expressions found thus far

R – remaining bits

Is E functionally dependent on S ∪ R?

1. Randomly sample R’s assignment space

2. Find missing leader expressions using SAT [Lee et al., ICCAD 2007]
  - Satisfying assignments provide missing leader expressions
  - S contains all leader expressions if no satisfying assignments exist
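The dependency check can be prototyped without a SAT solver by exhaustive search over the same kind of query: look for two assignments to B that agree on every leader expression found so far and, with the same remaining bits R, give different values of E. A Python sketch with hypothetical names; a returned triple is a witness that S is still incomplete, and None means E is functionally dependent on S ∪ R.

```python
from itertools import product


def find_witness(E, leaders, n_b: int, n_r: int):
    """Search for b, c (assignments to B) and r (assignment to R) with
    e(b) == e(c) for every leader e but E(b, r) != E(c, r).

    Returns (b, c, r), or None when E is determined by the leaders and R.
    """
    assignments = list(product((0, 1), repeat=n_b))
    for b, c in product(assignments, repeat=2):
        if any(e(b) != e(c) for e in leaders):
            continue  # the leaders already tell b and c apart
        for r in product((0, 1), repeat=n_r):
            if E(b, r) != E(c, r):  # both sides share r
                return b, c, r
    return None


def popcount(b, r):
    return sum(b) + sum(r)


def fa_sum(b):
    return b[0] ^ b[1] ^ b[2]


def fa_carry(b):
    return (b[0] & b[1]) | (b[0] & b[2]) | (b[1] & b[2])
```

With only the full-adder sum as a leader, the first witness found is b = (0, 0, 0), c = (0, 1, 1): both have sum 0 but different population counts, so a second leader expression (the carry) is still missing.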

Redundant Leader Expressions

• Leader expression is an input bit
  – Non-disjunctive decomposition is required
  – Remove from the set

• Leader expression contains no useful information
  – Remove from the set

Example: E = a0b1 + a1b0 with B = {a0, b0}. Sampling a1 = b1 = 1 gives the candidate leader expression a0 + b0, which cannot help to compute the original expression.

Redundant Leader Expressions

• Generalized cofactors
  – E = (a + b)x + (ab + c)y
  – g = a + b + c
  – E|g=0 = 0
  – E|g=1 = (a + b)x + (ab + c)y = E
  – Splitting on g gains nothing, so g is redundant

• The general case is a reduction to SAT
  – Problem instances tend to be small
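A brute-force Python check of this example, with hypothetical helper names: a pair (F1, F0) is a valid choice of generalized cofactors when F1 agrees with E wherever g = 1 and F0 agrees with E wherever g = 0, which guarantees E = g·F1 + g'·F0. The Restrict operator picks one such pair heuristically; this sketch only verifies a given pair.

```python
from itertools import product


def E(a, b, c, x, y):
    return ((a | b) & x) | ((a & b | c) & y)


def g(a, b, c, x, y):
    return a | b | c


def zero(*_):
    return 0


def is_generalized_cofactor_pair(F, guard, F1, F0, n: int) -> bool:
    """F1 must match F on the care set guard = 1, and F0 must match F where guard = 0."""
    for v in product((0, 1), repeat=n):
        if guard(*v) == 1:
            if F1(*v) != F(*v):
                return False
        elif F0(*v) != F(*v):
            return False
    return True
```

Both (E, 0) and (E, E) pass, which also illustrates that generalized cofactors are not unique; in either case the positive cofactor is E itself.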

Rewrite Original Expression

• Rewrite as Shannon expansion using the Restrict operator
  – Generalized cofactors are not unique
  – The order in which the cofactors of each leader expression are computed may affect the result

• For each cofactor:
  – Estimate the delay
  – Estimate the delay of F based on the Shannon expansion

• Select the cofactor that leads to the minimal estimated delay of F

$$F = g\,(F|_{g=1}) + g'\,(F|_{g=0})$$

$$D(F) \approx \max\{D(F|_{g=1}),\ D(F|_{g=0}),\ D(g)\} + D(\mathrm{mux})$$

(Figure: a multiplexer controlled by g selects F|g=1 or F|g=0 to produce F.)
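The selection rule can be stated in a few lines of Python. This sketch assumes per-candidate delay estimates are already available from some estimator; the numbers, the candidate labels, and the unit mux delay are made up for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str     # which cofactor choice / computation order this is
    d_pos: float  # estimated delay of F|g=1
    d_neg: float  # estimated delay of F|g=0
    d_g: float    # estimated delay of the leader expression g


def shannon_delay(c: Candidate, d_mux: float) -> float:
    """D(F) ~ max{D(F|g=1), D(F|g=0), D(g)} + D(mux)."""
    return max(c.d_pos, c.d_neg, c.d_g) + d_mux


def pick_cofactor(candidates: list[Candidate], d_mux: float = 1.0) -> Candidate:
    """Keep the candidate with the smallest estimated delay of F."""
    return min(candidates, key=lambda c: shannon_delay(c, d_mux))


# Hypothetical estimates for two orders of the Restrict computations.
options = [Candidate("order A", 3.0, 5.0, 2.0), Candidate("order B", 4.0, 4.0, 2.0)]
```

Order B wins (5.0 vs. 6.0) even though order A has the fastest single cofactor, because the mux must wait for the slowest input.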

Rinse, Repeat

• We picked this set of input bits to optimize!

• We generated a set of leader expressions.

• Local optimization using your favorite LS tool can’t hurt.

• The leader expressions are now frozen, and the block that computes them is optimized.

• Optimize the remaining circuit.

Experimental Setup

(Figure: four flows, labeled 1 to 4, draw on circuits written by hand, known arithmetic circuits, Prog. Decomp., and the new algorithm; all designs are synthesized with Synopsys Design Compiler (compile_ultra, minimize delay) using Artisan standard cells, UMC (90 nm).)

Critical Path Delay

(Figure: bar chart of critical-path delay in ns, axis 0 to 1, for 16-bit ADD, 12-bit 3ADD, 8x8-bit MUL, 16-bit SHIFT, ADPCM, g72x, and SAD; series: Original, Progressive Decomposition, Our Algorithm, Library/Manual Implementation; annotations: “Optimized for Area, Not Delay” and “Progressive Decomposition Fails”.)

Area

(Figure: bar chart of area in μm², axis 0 to 14000, for 16-bit ADD, 12-bit 3ADD, 8x8-bit MUL, 16-bit SHIFT, ADPCM, g72x, and SAD; series: Original, Progressive Decomposition, Our Algorithm, Library/Manual Implementation; annotations: “Optimized for Area, Not Delay” and “Progressive Decomposition Fails”.)

Conclusion

• Technique to structure arithmetic circuits
  – Fixes shortcomings of Progressive Decomposition

• Our approach is orthogonal to classical Boolean minimization techniques

• Discovered a new implementation of a k-input MAX function
  – Similar structure to the LZD circuit
  – Will appear at ICCAD 2009

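Computing Leader Expressions

Original variables: B = {b1, …, bm}, R = {r1, …, rn}

Dummy variables: C = {c1, …, cm}, S = {s1, …, sn}

Extra constraints [Lee et al., ICCAD 2007]:

$$r_i = s_i, \quad 1 \le i \le n$$

For each leader expression ej:

$$e_j(b_1, \dots, b_m) = e_j(c_1, \dots, c_m)$$

$$E(b_1, \dots, b_m, r_1, \dots, r_n) \neq E(c_1, \dots, c_m, s_1, \dots, s_n)$$

A satisfying assignment gives two different input assignments that agree on every leader expression ej and on the remaining bits, yet produce different values of E.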

Input Bit Selection

(Figure: for the chosen bits B = {x, y, z}, the expression E yields the cofactors E’|xyz=000, E’|xyz=001, …, E’|xyz=111.)

• Compute every combination of k input bits for k < 6

• Assign values to x, y, z using random sampling

• Use the delay estimator on each E’: the complexity of E’ is the metric by which we evaluate each group of input bits
