CS 150 – Spring 2008 – Lec #15: Ckt Performance - 1

Circuit Performance and Adders

Recap from last time

Hardware Design is Complicated Because We Want Circuits to Go Fast

Combinational Logic: Used a Simple Model of Delay
• Integer delay on each gate
• Reduction of circuit to directed acyclic graph
• Delay of circuit (= clock period) is longest path in graph

Making Circuits Go Fast = Shortening Longest Path
• Exploit asymmetry between path lengths
• Shorten longest path by
  • Introducing redundant logic
  • Moving logic from long to short paths

We will see a different technique today!


Delay Model of a Circuit

Translate circuit into graph

Weights on nodes are delay through gates

Delay through circuit is longest path through graph

Easy, linear-time algorithm
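A minimal sketch of the longest-path computation, assuming the circuit is given as a dictionary of gate delays and a dictionary of fan-in lists. The function name `circuit_delay`, the gate names `g1` to `g3`, and the wiring are hypothetical and are not taken from the slide's figure.

```python
from graphlib import TopologicalSorter

def circuit_delay(gate_delay, fanin):
    """Return the longest node-weighted path through a gate graph.

    gate_delay: dict mapping node name -> delay (0 for primary inputs)
    fanin:      dict mapping node name -> list of nodes that drive it
    """
    arrival = {}
    # static_order() yields every node after all of its drivers,
    # so each node and edge is visited once (linear time).
    for node in TopologicalSorter(fanin).static_order():
        latest_input = max((arrival[d] for d in fanin.get(node, [])), default=0)
        arrival[node] = latest_input + gate_delay.get(node, 0)
    return max(arrival.values())

# Hypothetical circuit: inputs A..D, gate g1 (delay 2), gates g2 and g3 (delay 1)
delays = {"A": 0, "B": 0, "C": 0, "D": 0, "g1": 2, "g2": 1, "g3": 1}
drivers = {"g1": ["A", "B"], "g2": ["g1", "C"], "g3": ["g2", "D"]}
print(circuit_delay(delays, drivers))  # 4
```

The arrival time at each node is the latest arrival among its drivers plus its own delay, so the circuit delay is the largest arrival time.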

[Figure: example circuit with signals A, B, C, D and gate delays 2, 1, 1, shown next to its translation into a weighted graph]


Circuit Performance Model

[Figure: Latches -> Combinational Logic -> Latches]

Inputs stabilize at 0

Logic finishes when last output stabilizes


Circuit Performance Model

Outputs of latches are stable only at clock edge

Inputs to latches must be stable by next clock edge

Time between clock edges must be > delay of combinational logic

[Figure: Latches -> Combinational Logic -> Latches]


Adders

Highly-studied circuit, so a case study in design

“Ripple-carry” adder: standard adder where carry ripples from one bit to another
• Longest path for n-bit adder is O(n)
• Number of gates for n-bit adder is O(n)

“Carry Lookahead”: accelerate carry chain
• Collapse carry into all bits
• O(log n) delay (optimal!)
• O(n^3) gates (terrible!)

Practical compromise is block-accelerated adders
• Block-carry lookahead
• Carry-select adder
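As a reference point for the faster adders, here is a small bit-level sketch of a ripple-carry adder. The function names are illustrative and it models behavior only, not gate timing.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, n, cin=0):
    """Add two n-bit values bit by bit; returns (n-bit sum, carry_out)."""
    total, carry = 0, cin
    for i in range(n):
        # The carry into bit i is only known after bit i-1 is done,
        # which is why the longest path grows as O(n).
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry

print(ripple_carry_add(5, 9, 4))  # (14, 0)
```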


Hierarchical Carry lookahead

PG and GG are used as the propagate and generate inputs to the hierarchical block

[Figure: three m-bit CLA adders, producing (PG0, GG0), (PG1, GG1), and (PG2, GG2), feed a Carry Lookahead Block]


Synopsis of Hierarchical Carry-Lookahead

• n-bit adder, m-bit blocks, n/m blocks
• Delay is 2 log n + 2 log m
• Size is max(nm^2, (n/m)^3)
• Best is m = n^(2/5)
• Delay is 14/5 log n, size is O(n^(9/5))


Analysis of the Carry-Lookahead Adder

n bit adder, m-bit blocks, n/m blocks

Delay through the adder: 2 * delay through the lookahead block + delay through the super-lookahead block
• Lookahead block: 2 log m
• Super-block: 2 log (n/m) = 2 log n – 2 log m
• Total: 2 * (2 log m) + 2 log n – 2 log m = 2 log n + 2 log m

Logic: scales like the lookahead blocks
• Size-p block: O(p^3) from before
• Two sizes of block: n/m blocks of size m, one block of size n/m
• Total: n/m * m^3 = nm^2 for the small blocks, (n/m)^3 for the super-block
• Choose m to minimize max(nm^2, (n/m)^3)
• Solution at m = n^(2/5), where both terms equal n^(9/5). Total is O(n^(9/5))
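A quick numeric check of the block-size choice, as a sketch only: it takes the size estimate max(nm^2, (n/m)^3) at face value with all constants set to 1, and the function names are hypothetical.

```python
def cla_size(n, m):
    """Size estimate from the analysis: max(n*m^2, (n/m)^3)."""
    return max(n * m * m, (n / m) ** 3)

def best_block_size(n):
    """Integer block size m in 1..n that minimizes the size estimate."""
    return min(range(1, n + 1), key=lambda m: cla_size(n, m))

n = 1024
m = best_block_size(n)
# For n = 1024, n^(2/5) = 16, and both terms equal n^(9/5) = 262144 there.
print(m, round(n ** (2 / 5)), cla_size(n, m))
```

For n = 1024 the brute-force minimum lands on m = 16, which matches n^(2/5).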


Carry Select Adder

“Combinational Speculative Execution”

Basic intuition: Adders spend time waiting to see what carry-in is

Therefore:
• Go ahead and guess each way
• Pick the right answer when the carry comes by


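Carry-Select Adder

Each block is doubled
• One block computes with carry-in = 0, the other with carry-in = 1
• Actual carry-in (carry-out from previous block) selects the result

[Figure: chain of m-bit blocks; each doubled block has a Block 0 (carry-in = 0) and a Block 1 (carry-in = 1), each producing m sum bits and 1 carry-out bit, with a multiplexer choosing between them]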


Analysis of Carry-Select Adder

Delay analysis: worst-case path is through Block0, then the control of the multiplexer chain
• O(m) gates in Block0
• O(p = n/m) gates in multiplexer chain

[Figure: Block0 followed by the doubled blocks Block1 (0 and 1), Block2 (0 and 1), ..., Blockp (0 and 1)]

Choose m to minimize max(n/m, m)

Minimum is to choose m = √n


Twelve-bit Carry-Select Example

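Problem: add -3 (0xffd, 111111111101) to 17 (0x011, 000000010001)

Use 4-bit carry-select blocks

[Figure: three 4-bit carry-select blocks adding 0xffd and 0x011; each upper block computes a (carry, sum) pair for carry-in = 0 and for carry-in = 1, and the incoming carry selects between them]

Result is 0xe (14)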
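The twelve-bit example can be replayed with a short behavioral sketch of a carry-select adder. The function names are illustrative, and each block is modeled with integer addition rather than gates.

```python
def block_add(a, b, cin, m):
    """Add two m-bit values plus carry-in; returns (m-bit sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << m) - 1), total >> m

def carry_select_add(x, y, n, m):
    """n-bit add built from m-bit blocks; n is assumed to be a multiple of m."""
    mask = (1 << m) - 1
    result, carry = 0, 0
    for lo in range(0, n, m):
        a, b = (x >> lo) & mask, (y >> lo) & mask
        # Both guesses are computed without waiting for the real carry
        # (in hardware they run in parallel; the first block needs only one copy).
        guess0 = block_add(a, b, 0, m)
        guess1 = block_add(a, b, 1, m)
        # The actual carry-in acts as the multiplexer select.
        s, carry = guess1 if carry else guess0
        result |= s << lo
    return result, carry

total, carry_out = carry_select_add(0xFFD, 0x011, n=12, m=4)
print(hex(total), carry_out)  # 0xe 1
```

The final carry-out of 1 is discarded in 12-bit two's-complement arithmetic, leaving 0x00e = 14.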


Hardware for the Carry Select Adder

• √n blocks, each of √n gates
• Additional hardware is √n multiplexers + an additional adder for each block but the first: n – √n additional adder bits
• Therefore √n + 2n – √n = 2n gates
• Exactly twice the size of an ordinary adder, but delay is √n instead of n


Carry-Bypass Adder

Like the carry-select adder, has O(√n) delay

But even more efficient (in terms of gates) than the carry-select
• Has only n + √n log n gates

However, it broke every timing analyzer…

Instead of shortening the longest path, made it longer!

How can this be? Isn’t the delay of the circuit the length of the longest path?...


What is the delay of the Circuit?

The delay of a circuit is the time that the last output settles

This can be the length of the longest path, but sometimes isn’t

The longest path is an upper bound on the delay of the circuit, but sometimes this isn’t tight


Example

Long paths are from x, y -> out through the bottom of the circuit

But no signal can travel down these paths!

[Figure: circuit with inputs x, y, z; one gate of delay 1 and six gates of delay 2]


Example

[Figure: the same circuit annotated with signal values as they settle at t=0, 1, 2, 3, 4, and 6]


Timing Analysis

[Figure: the same circuit, with inputs x, y, z and gate delays 1 and 2]

x y z   delay
0 0 0   6 (z->out)
0 0 1   5 (z->z’->out)
0 1 0   6 (z->out)
0 1 1   5 (z->z’->out)
1 0 0   6 (z->out)
1 0 1   6 (y->out)
1 1 0   6 (z->out)
1 1 1   6 (x,y->out)

Longest path is 8, but no signal ever travels down it!


What happened?

Long paths are false
• A->B requires z=1
• B->C requires z=0
• Conflict! No signal can propagate down this path

This analysis doesn’t quite work
• Analysis has to take into account delays
• Complete theory not understood till 1993

This is good enough for carry-bypass adder

[Figure: the same circuit with points A, B, and C marked along the long path]


Announcements

Prof. Pister will lecture on wireless protocol Thursday
• Need this for your project

Spring Break

Tuesday 4/1 – TBD

Thursday 4/3 – MT review

Tuesday 4/8 – MT 2


False Paths and Adders

Key idea: Don’t make critical paths in adder short
• That was the idea behind Carry Lookahead and Carry-Select adders
• Instead, make long paths false

Critical path is through the carry chain
• Only exercised when the propagate bit through every block is set (Question: is this likely?)
• Therefore: when a signal would propagate through the carry chain, skip the block!

Recall from block carry-lookahead adder:
• Group Propagate PG = P0P1P2P3
• When PG=1, have the carry skip the whole block!


Carry-Skip Block

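[Figure: an m-bit ripple-carry adder whose carry-out drives the 0 input of a multiplexer; Carry-in drives the 1 input; PG is the select; the multiplexer output is the carry-in to the next block]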


Suppose Carry-in Propagates to Carry-Out…

Then PG=1

So Path goes Through the 1-port of the MUX

[Figure: the carry-skip block shown step by step, with the path from Carry-in through the 1-port of the PG-controlled multiplexer highlighted]

Delay is 1-MUX delay, not 4 propagate delays!
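A behavioral sketch of one carry-skip block follows. It assumes the bit propagate is P_i = A_i xor B_i and that the multiplexer takes the rippled carry on its 0-port and the block's carry-in on its 1-port; the function name is hypothetical.

```python
def carry_skip_block(a, b, cin, m):
    """m-bit ripple block with a bypass mux; returns (sum, carry_out, pg)."""
    carry, total, pg = cin, 0, 1
    for i in range(m):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        p = ai ^ bi                      # bit propagate (assumed XOR form)
        pg &= p                          # group propagate PG = P0 P1 ... P(m-1)
        total |= (p ^ carry) << i        # sum bit
        carry = (ai & bi) | (p & carry)  # rippled carry
    # Multiplexer: when PG = 1 the 1-port passes carry-in straight through.
    carry_out = cin if pg else carry
    return total, carry_out, pg

print(carry_skip_block(0b1010, 0b0101, 1, 4))  # (0, 1, 1): carry skips the block
print(carry_skip_block(0b0011, 0b0001, 0, 4))  # (4, 0, 0): carry generated inside
```

The bypass never changes the result, since the rippled carry equals the carry-in whenever PG = 1. It only changes which path delivers the carry.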


Full Carry-Bypass Adder

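[Figure: Block 0, Block 1, ..., Block n/m in a chain, each followed by a PG-controlled bypass multiplexer (inputs 0 and 1); Carry-in enters at Block 0]

As before, n/m array of m-bit blocks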


Full Carry-Bypass Adder: Worst-case path

[Figure: the same chain, Block 0 through Block n/m - 1, with the worst-case path highlighted]

Worst-case path goes through m-1 bits of block 0, n/m - 2 multiplexers (1 gate each), and m-1 bits of block n/m - 1


Timing and Size Analysis

• Delay = 2 * (m – 1) + n/m – 2
• Choose m to minimize delay => m = √n
• We have Delay = 2 * (√n – 1) + √n – 2 = 3√n – 4

What’s the additional circuitry?
• log m gates to build PG (1 per block)
• 1 two-input multiplexer per block
• n/m blocks => n/m (log m + 1)
• m = √n => √n ((log n)/2 + 1)

Same delay as carry-select, but much smaller: (n + √n) vs 2n
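The two formulas can be evaluated directly for the adder widths of interest. This is only a sketch: the function names are made up here, logs are taken base 2, and every gate is counted as one unit.

```python
import math

def bypass_delay(n, m):
    """Worst-case delay from the analysis: 2*(m - 1) + n/m - 2."""
    return 2 * (m - 1) + n / m - 2

def bypass_extra_gates(n, m):
    """Extra circuitry from the analysis: (n/m) * (log2(m) + 1)."""
    return (n / m) * (math.log2(m) + 1)

for n in (16, 64, 256):
    m = math.isqrt(n)  # block size m = sqrt(n)
    # With m = sqrt(n) the delay collapses to 3*sqrt(n) - 4.
    print(n, m, bypass_delay(n, m), 3 * m - 4, bypass_extra_gates(n, m))
```

For n = 16 and m = 4 this gives a delay of 8 and 12 extra gates.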


Full Carry-Bypass Adder: Longest path

[Figure: the same chain with the longest topological path highlighted]

Longest path goes through all blocks and all multiplexers: m * n/m + n/m = n + n/m


Longest Path vs Circuit Delay

Longest path is n + √n

Worst-case path is O(√n)

Worst-case path for ripple-carry is n

Made things better, but a timing analyzer thinks it’s worse!
• Stimulated tremendous interest in timing analyzers!


Adder Summary

Adder                      Delay         Size
Ripple-Carry               n             n
Carry-Lookahead (full)     log n         n^3
Carry-Lookahead (block)    14/5 log n    n^(9/5)
Carry-Select               √n            2n
Carry-Bypass               √n            n + √n


A comment on n

Asymptotic results tell us what happens at infinity

For our purposes, n = 16, 32, 64
• Means: square root n = 4 – 8
• Means: log n = 4 – 6

For the sizes we are interested in, carry-select and carry-bypass are as fast as block CLA


Remaining Questions (just for fun)

How often does worst-case delay path occur in Carry-bypass adder?

How do we automatically analyze for false paths?


How often does (near) worst-case delay occur?

Worst-case delay: Pi = 1 for all i > j, small j
• Pi = Ai ⊕ Bi

How often is Pi = Ai ⊕ Bi = 1?

[Table: 3 x 3 grid of operand cases; columns Ai Large / Ai Small, Negative / Ai Small, Positive; rows Bi Large / Bi Small, Negative / Bi Small, Positive]

Only two of nine cases, but they happen frequently
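To see why a small negative plus a small positive operand is the bad case, the sketch below measures the longest run of propagate bits. It assumes the propagate is P_i = A_i xor B_i on n-bit two's-complement operands; the helper names are hypothetical.

```python
def propagate_bits(a, b, n):
    """Per-bit propagate word, P_i = A_i xor B_i, for n-bit operands."""
    mask = (1 << n) - 1
    return (a & mask) ^ (b & mask)

def longest_propagate_run(a, b, n):
    """Length of the longest run of consecutive bits with P_i = 1."""
    p, best, run = propagate_bits(a, b, n), 0, 0
    for i in range(n):
        run = run + 1 if (p >> i) & 1 else 0
        best = max(best, run)
    return best

print(longest_propagate_run(-3, 17, 12))  # 7: sign-extension 1s meet leading 0s
print(longest_propagate_run(3, 17, 12))   # 1: both operands small and positive
```

The sign-extension 1s of the small negative operand line up against the leading 0s of the small positive one, so every high-order bit propagates.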


How hard is it to analyze false paths?

Hard!
• Problem noticed in early timing verifiers in the 1970s
• Early researchers (Hitchcock, Jouppi, Ousterhout) used hand-done rules
  • Often wrong (if it’s hard to analyze automatically, it’s hard to guess right by hand)
• Next: “Static sensitization”
  • Assert “non-controlling” values on side inputs (0 for OR/NOR, 1 for AND/NAND)
  • Make sure assignments are consistent
• Problem: Values are changing!


Example

To sensitize a->d->f->g, note: a->d requires b=1

But b=1 => e=0, and f->g requires e=1

Similar argument says you can’t sensitize b->d->f->g

[Figure: circuit with inputs a, b, c, internal signals d, e, f, and output g; each gate has delay 1]


But…

[Figure: the same circuit]

t   a b c d e f g
0   0 0 0
1   0 0 0 0 1
2   0 0 0 0 1 0
3   0 0 0 0 1 0 0

Delay of the circuit is 3!

Path a->d->f->g really was true


Key Problem

All inputs are changing…
• a->d requires b=1 means b=1 stable at t=0
• But b changes to 0 at t=0
• Therefore, value of b is unknown (X)

Also, delays of gates are unknown
• “1” really means [0,1]

[Figure: the same circuit]


Key Idea: Derive Function for each time

[Figure: the same circuit, with a table of signals a, b, c, d, e, f, g against times 0, 1, 2, 3]

Three functions for each signal at each time: d = 1 at 1, d = 0 at 1, d = X at 1

(d = 1 at 1) = (a = 1 at 0) and (b = 1 at 0)

(d = 0 at 1) = (a = 0 at 0) or (b = 0 at 0)

(d = X at 1) = (d = 1 at 1) nor (d = 0 at 1)
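The three per-time functions can be written out for a single gate. This sketch assumes d is a unit-delay AND of a and b, which is what the equations for (d = 1 at 1) and (d = 0 at 1) imply, and takes the X case as "neither 1 nor 0". The function name and the use of the string 'X' for an unknown value are illustrative.

```python
def and_gate_at_next_step(a, b):
    """Three-valued AND with unit delay; each input is 0, 1, or 'X' (unknown).

    Returns (d_is_1, d_is_0, d_is_x) for the next time step.
    """
    d_is_1 = (a == 1) and (b == 1)   # (d=1 at 1) = (a=1 at 0) and (b=1 at 0)
    d_is_0 = (a == 0) or (b == 0)    # (d=0 at 1) = (a=0 at 0) or (b=0 at 0)
    d_is_x = not (d_is_1 or d_is_0)  # (d=X at 1) = neither of the above
    return d_is_1, d_is_0, d_is_x

print(and_gate_at_next_step(1, "X"))  # (False, False, True): output still unknown
print(and_gate_at_next_step(0, "X"))  # (False, True, False): 0 controls the AND
```

A controlling 0 on either input settles the output even while the other input is still X, which is what static sensitization misses.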


Delay of the Circuit

Delay of the circuit is the latest t such that the function (“output = X at t”) is not identically 0

Problem is NP-complete

Size of problem is linear in number of time slices x number of gates

Mathematical machinery fairly massive
• “Special Theory”: 1989 – handled symmetric gates, zero-lower-bounded delays (all signals were X until they hit their final values)
  • Other cases were conservatively approximated
• “General Theory”: 1993 – handled all gates, general delay models
  • Gave exact answers for all delay types

Still hasn’t quite reached industrial practice!
