EE241 - Spring 2006: Advanced Digital Integrated Circuits

Lecture 27+28: Final Project Presentations (2006-05-16)

Study of Subthreshold SRAM Operation using FinFETs

EE241 Final Project

Anupama Bowonder, Pratik Patel

Subthreshold SRAM Operation

Motivation behind Subthreshold SRAM Operation:

• Large area of chip devoted to SRAM cache

• SRAM supply voltage needs to scale with logic supply voltage

• Reduce leakage by operating at DRV during hold

• Reduce active power consumption

Impediments to Subthreshold SRAM Operation:

• Supply voltage scaling degrades cell stability

• Scaling increases sensitivity to process variations

– Variation induced local asymmetry, degraded SNM

– Spread in SNM over the whole SRAM array

• Impact of soft errors more significant at lower supply voltages

Previous Work

Leakage power reduction schemes

• Offset non-scalability of supply voltage

• Cell supply reduced to DRV during standby (H. Qin et al., 2004)

• Body biasing to control Vt of bulk transistors during standby (K. Osada et al., ISSCC 2003)

(K. Itoh, Hitachi)

Previous Work (cont’d): Increasing Cell Stability

(L. Chang, 2005) (Z. Guo, 2005)

[Figure: butterfly plots, axis Vsn2 (V); annotations "Not disturbed during a read" and "Cell area = 0.41 um2"]

14% SNM improvement with 13% area penalty

Cell stability vs. area tradeoff

Fin rotation to increase pull-down strength and improve SNM

FinFETs for Subthreshold Operation

• Better subthreshold swing – Improve Ion/Ioff in subthreshold

• Lower Ioff and hence leakage power

• Undoped Channel – no random dopant fluctuation

• Back gate Vt tuning

[Figure: FinFET device structure with Gate1, Gate2, Source (S), Drain (D), Tox and Tsi labeled, shown alongside an FDSOI device]

FinFET parameter             Value
Gate length                  35 nm
Gate oxide thickness         20 A
Body thickness               20 nm
Gate work-function (φm)      4.5-5 eV
Channel doping               10^16 cm^-3
Fin height                   30 nm

[Figure: Ion/Ioff at Vdd = 0.3 V (log scale, 1e1 to 1e5) vs. gate workfunction (eV, 4.4 to 5.2) for NFINFET and PFINFET; arrows mark the direction of decreasing Vt for NFIN and PFIN]

Standby Power of FinFET SRAM

Standby Power = Ileak ⋅ Vdd

Vdd (V)    Standby Power
1.0        7.493e-9
0.8        3.662e-9
0.5        1.002e-9
0.4        5.91e-10
0.3        3.189e-10

Leakage power reduction:

From 1 V to 0.3 V = 95.7%

From 1 V to 0.4 V = 92.1%

From 1 V to 0.5 V = 86.6%
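The quoted leakage power reductions follow directly from the standby power table. A minimal sketch that recomputes them, assuming the table values are transcribed correctly from the slide (the function and variable names are illustrative, not from the original project):

```python
# Standby power per supply voltage, transcribed from the slide
# (keys are Vdd in volts; values are in the units used by the source).
STANDBY_POWER = {
    1.0: 7.493e-9,
    0.8: 3.662e-9,
    0.5: 1.002e-9,
    0.4: 5.91e-10,
    0.3: 3.189e-10,
}


def leakage_reduction_percent(vdd_low, vdd_ref=1.0, table=STANDBY_POWER):
    """Percentage drop in standby power when scaling from vdd_ref to vdd_low."""
    return 100.0 * (1.0 - table[vdd_low] / table[vdd_ref])


def implied_leakage_current(vdd, table=STANDBY_POWER):
    """Leakage current implied by Standby Power = Ileak * Vdd."""
    return table[vdd] / vdd


if __name__ == "__main__":
    for vdd in (0.5, 0.4, 0.3):
        print(f"1 V -> {vdd} V: {leakage_reduction_percent(vdd):.1f}% reduction")
```

Dividing each table entry by its Vdd also gives the implied leakage current, which shows how much of the saving comes from lower leakage rather than from the lower supply alone.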

Data Retention Voltage of 6T FinFET SRAM

[Figure: hold butterfly curves, Vsn2 vs. Vsn1 (0 to 0.5 V), for Vdd = 0.5 V, 0.4 V, 0.3 V, 0.2 V and 0.1 V; DRV (60mV)]

[Figure: cross-coupled inverters (PL, NL, PR, NR) at VDD storing 1 and 0, with leakage currents Ioffp and Ioffn]

Read Stability of 6T FinFET SRAM

[Figure: 6T SRAM SNM vs. Vdd (read stability and hold stability); SNM (mV, 0 to 250) vs. Vdd (V, 0 to 0.6); series: Dynamic Read, Non-Dynamic Read, Hold]

[Figure: SNM vs. beta ratio; SNM (mV, 20 to 40) vs. beta ratio (0 to 6); data labels 29, 33, 37]

[Figure: read butterfly curve, Vsn1 (V) vs. Vsn2 (V), 0 to 0.3 V; 6T cell (ACL, ACR, PL, PR, NL, NR) with WL and both BLs at VDD; annotation ">2x"]

Current has a linear dependence on size, exponential dependence on Vt change

Cell Write Ability

• N-curve: one of the few metrics used to determine whether the cell is writeable. (C. Wann et al., 2005)

• Writeable cell: positive current at the node you write.

[Figure: cell write ability as a function of workfunction; I(vq) (A, 0 to 7e-6) vs. workfunction (eV, 4.55 to 4.8); region marked "Writeable"]

[Figure: N-curve, I(vq) (A, 0 to 2e-5) vs. V(q) (V, 0 to 0.35); 6T cell (ACL, ACR, PL, PR, NL, NR) with WL at VDD, one BL at VDD, the other BL at VSS, and the node voltage V(q) swept]

Impact of Process Variation on Read Stability

3σ Lg = 3σ Tsi = 12% Lg (ITRS 2005)

[Figure: log-scale curves (1e-10 to 1e-2) over 0 to 1.2 for the nominal, bestlg, worstlg, smallesttsi and largesttsi corners; annotations 96.6% and 90.1%]

Statistical Variations in SNM

[Figure: Read SNM Monte Carlo; SNM distribution density (0 to 9) vs. SNM (mV, 50 to 100)]

Mean = 74mV; Sigma = 10mV
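The reported Monte Carlo statistics (mean 74 mV, sigma 10 mV) can be turned into a rough yield estimate. The sketch below assumes the read SNM is normally distributed, which the slide does not state; the function names are hypothetical.

```python
import math

# Read SNM statistics reported on the slide.
MEAN_MV = 74.0
SIGMA_MV = 10.0


def fraction_below(snm_mv, mean=MEAN_MV, sigma=SIGMA_MV):
    """Fraction of cells with read SNM below snm_mv, ASSUMING a Gaussian SNM."""
    z = (snm_mv - mean) / sigma
    # Standard normal CDF written with the complementary error function.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def snm_at_k_sigma(k, mean=MEAN_MV, sigma=SIGMA_MV):
    """SNM of a cell sitting k standard deviations below the mean."""
    return mean - k * sigma


if __name__ == "__main__":
    print(f"3-sigma cell SNM: {snm_at_k_sigma(3):.0f} mV")
    print(f"fraction below 50 mV: {fraction_below(50.0):.4f}")
```

A Gaussian tail is only a first-order approximation; a large array needs many more Monte Carlo samples or a tail model to justify a yield number.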

The project formerly known as

Design Considerations of Logic Operating in the IC=1 Regime

An EE241 Class Project by C. Marcu, M. Mark, J. A. Richmond

UC Berkeley, Spring 2006

Introduction / Motivation / Outline

• Strong and weak inversion have been extensively studied
• Moderate inversion might be a sweet spot for energy/delay trade-offs
• Modeling, sensitivities to parameter variations, optimization of test circuit
• Is there actually any advantage to IC=1?

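Modeling the Inverter

Based on EKV

T_d = \frac{C \cdot V_{DD} \cdot k_{fit}}{IC \cdot I_S}

E_{OP} = V_{DD} \cdot I_{leak} \cdot T_d \cdot L_P + \alpha \cdot V_{DD}^2 \cdot C

V_{DD} = \frac{V_T + 2 n U_T \ln\left(e^{\sqrt{IC}} - 1\right)}{1 + \sigma}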
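To make the model concrete, here is a minimal sketch of the inverter model. It assumes the standard EKV interpolation, read from the slide as V_DD = (V_T + 2 n U_T ln(e^sqrt(IC) − 1)) / (1 + σ), and the delay T_d = C·V_DD·k_fit / (IC·I_S). All numeric parameter values below are illustrative assumptions, not values from the project.

```python
import math


def vdd_from_ic(ic, vt, n, ut=0.0259, sigma=0.0):
    """Supply voltage that places the device at inversion coefficient ic (ic > 0)."""
    return (vt + 2.0 * n * ut * math.log(math.exp(math.sqrt(ic)) - 1.0)) / (1.0 + sigma)


def ic_from_vdd(vdd, vt, n, ut=0.0259, sigma=0.0):
    """Inverse relation: inversion coefficient reached at a given supply voltage."""
    x = (vdd * (1.0 + sigma) - vt) / (2.0 * n * ut)
    return math.log1p(math.exp(x)) ** 2


def delay(c_load, vdd, k_fit, ic, i_spec):
    """Gate delay T_d = C * VDD * k_fit / (IC * I_S)."""
    return c_load * vdd * k_fit / (ic * i_spec)


if __name__ == "__main__":
    # Hypothetical device: VT = 0.3 V, slope factor n = 1.5, no DIBL (sigma = 0).
    vdd = vdd_from_ic(1.0, vt=0.3, n=1.5)
    print(f"VDD for IC=1: {vdd:.3f} V")
    print(f"delay: {delay(1e-15, vdd, 1.0, 1.0, 1e-6):.3e} s")
```

With these assumed parameters, IC=1 lands a few tens of millivolts above V_T, which is why moderate inversion sits between the subthreshold and strong-inversion design points.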

Sensitivity to Parameter Variations


Optimization: Adder

Simple full-adder using NAND & INV only

Modeling of Complex Gates: NAND

Logical effort

Leakage of transistor stack

Optimization: Software

• Based on work done by Dejan Markovic
• Extended to moderate and weak inversion by use of our model
• Optimizes VDD, VT and gate sizing to minimize energy for a given delay and activity factor

Results: Minimum EOP vs. Delay

• Delay and energy normalized to minimum delay and its corresponding maximum energy
• Significant energy savings within strong inversion
• Very little energy savings from moderate to weak inversion
• Higher potential for energy savings with lower activity factors

Results: Optimizing VDD, Vt, W

Results: What Knob to Turn?

Optimizing different parameters at a time

Conclusion

• Ultimate design goal tends to be minimizing energy for a given delay
• Our work provides the tools to optimize designs across all regions of operation
• Operating in IC=1 is therefore a result of the optimization, not a design target
• In fact, IC=1 is neither better nor worse than any other region for a digital designer, so don’t be afraid of it

A Self-Timed, Tunable State Machine Using Low Energy Pass Transistor Library

Matthew Pierson, Salman Suharwardy – Electrical Engineering and Computer Science - UC Berkeley

[Figure: state machine block diagram; Inputs feed Next State/Output Logic, which drives Outputs and the State Elements in the feedback path]

Problem and Motivations (Prior Work)

• Increased focus on low energy logic and regular structures
• Recovering performance at low supply voltages requires threshold scaling
• Low leakage architecture needed so that leakage does not dominate
• Regular structures needed to cope with manufacturing variability

[Figure: Configurable Logic Block (CLB) with Inputs and Stack]

“Design Considerations in CLBs for Deep Sub-Micron Technologies,” Louis Alarcon and Octavian Florescu, EE241 Spring 2005.

Problem and Motivations (Our Work)

Until now, CLB library only applied to clocked data path.

• Two different clocks required
• Control logic not yet explored

Our goal: Adapt this library to create a Self-Timed, Tunable State Machine

Why?

1. Explore applicability to control logic
– State machine is the most general form of control logic

2. Asynchronous gains
– No need for two clock networks that contribute heavily to energy dissipation
– Robustness to manufacturing variations

Solution

Start with Self-Timed Pipeline

[Figure: self-timed pipeline; stages i and i+1, each with Stack, sense amplifier (SA), enable (En), C-elements and Start/Done handshake signals, gated by !En(i+1) and !En(i+2)]

Evolved Into Optimized Pipeline

[Figure: optimized pipeline; one C-element per stage driven by Done(i-1), with Start(i)/Done(i), Stack(i), SA(i), En(i) and the gating signals !En(i+1), !En(i+2)]

Adapting the Library

1. Cross-coupled weak NMOSes added to stack outputs

2. Must generate Done signal in the stack. Stack is a low swing network because of NMOS pass transistors, so a delay line is really the only option.

PMOS transistors give us full swing on the Done signal, which is necessary for correct C-element operation at low supply/threshold voltages

[Figure: Done generation circuit; labels: Sense Amp Complete, Vdd + DeltaV, Done, Ground]
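The handshake in these pipelines relies on the C-element, whose output changes only when both inputs agree. A minimal behavioral sketch of a two-input Muller C-element (a generic textbook model, not the transistor-level circuit used in this project):

```python
class CElement:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        """Output follows the inputs when they agree; otherwise it holds its state."""
        if a == b:
            self.out = a
        return self.out


if __name__ == "__main__":
    c = CElement()
    # (Start, Done)-style input sequence: the output only rises once both are high.
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, "->", c.update(a, b))
```

The hold state is why a full-swing Done signal matters: a weak input level could be misread as "inputs agree" or "inputs disagree" and leave the element in the wrong state.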

Self-Timed State Machine

[Figure: self-timed state machine; pipeline stages i+1 and i+2 (Stack, SA, En, C-elements, Start/Done) plus a state element stage S (Stack_S, SA_S, Start_S, Done_S, Out_S) fed by Inputs and InputsDone]

Results from Self-Timed Machine

But what if we don’t want to go so slow at 300 mV supply voltage…

Threshold Voltage Scaling

Original clocked library implemented body bias threshold scaling and artificial threshold scaling…

Body Bias – Only works for about 50 mV of threshold scaling before body-source junction current becomes noticeable.

Artificial threshold scaling – Shifting up the output on the Sense Amplifier by DeltaV.

[Figure: sense amplifier (SA) supplied from Vdd + DeltaV; output swings from 0 to Vdd + DV]

Dynamic power remains the same

Making it work for the asynchronous version:

[Figure: Stack, SA(i), En(i) and C-element; EN and Done swing from 0 to Vdd + DV (taken from internal node, 0 -> Vdd+DV); Out swings from 0 to Vdd; labels: Delay, RootVdd]

Tuned Energy Range

Tuned Delay Range

Questions?

Platform-Based SRAM Standby Power Minimization

Xuening Sun, Tsung-Te Liu, David Chen

UC Berkeley EE241 Term Project, Spring 2006

EE241 – Project

Outline

• Motivation
• Prior Work
• Global Optimization
• Conclusion

Motivation: Scaling, Scaling, Scaling …

[Figure: SRAM surrounded by scaling challenges: noise, leakage, variability, velocity saturation]

“Where is the limit?”

Motivation: Standby Power Optimization

[Figure: design space with axes access time, circuit area and noise margin; design variables Wp, Lp, Wn, Ln, Vsbp, Vsbn ⇒ minimum power @ standby; cell schematic annotated with Wp/Lp, Wn/Ln, Vsbp and Vsbn]

Prior Work: DRV Measurement and Parametric Analysis

[Figure: measured DRV mean (mV, 120 to 220) on chip 1 vs. Vpb (V, -0.4 to 0.4) for Vnb = -0.4, -0.2, 0, 0.2, 0.4, and vs. Vnb (V) for Vpb = -0.4, -0.2, 0, 0.2, 0.4]

[Figure: DRV mean + 5*std (mV, 130 to 220) vs. Lp (um, 0.1 to 0.3) for Ln = 0.1, 0.15, 0.2, 0.25, 0.3, and vs. Ln (um) for Lp = 0.1, 0.15, 0.2, 0.25, 0.3]

[Figure: DRV mean + 5*std (mV, 190 to 250) vs. Wp (um, 0.2 to 0.5) for Wn = 0.325, 0.27, 0.215, 0.16, 0.105, and vs. Wn (um, 0.1 to 0.3) for Wp = 0.2, 0.3, 0.4, 0.5]

Courtesy of Huifang Qin, 90nm SRAM Test Chip: DRV and Leakage Measurement, 03/02/2006

Global Optimization: Design Platform

[Figure: platform layers labeled Circuit and System]

[Equation [3]: DRV model of the form DRV = a_1 log(...), where the argument combines the sizing terms (1 + β_1/l_n), (w_p + k_1), (1 + β_2/l_p), (w_p + k_2), the constant α_1, and exponentials in γ_1..γ_4, φ_1..φ_4, v_sbn and v_sbp]

I_D = i_S \exp\left(\frac{v_{GS} - V_T}{n v_{th}}\right)\left(1 - \exp\left(\frac{-V_{DD}}{v_{th}}\right)\right)   [2]

[Equation [1]: read access time t_read as a function of S_W and S_N with constants K, C_BL0, K_BL, K_read, A, B, C]

Area = \sum_i (W_i \times L_i)

Power = DRV \times \sum_j I_{Dj}

Performance = f\left[(W/L)_N, (W/L)_A, v_{sbn}\right]
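The cost functions of the platform are simple enough to sketch directly. The example below implements the subthreshold current model cited as [2], I_D = i_S exp((v_GS − V_T)/(n v_th)) (1 − exp(−V_DD/v_th)), together with the area and standby power sums. The DRV and read-time models are not reproduced here, and all parameter values and names are illustrative assumptions.

```python
import math


def subthreshold_current(i_s, v_gs, v_t, v_dd, n=1.5, v_th=0.0259):
    """Subthreshold drain current for a device with drain-source voltage v_dd."""
    return i_s * math.exp((v_gs - v_t) / (n * v_th)) * (1.0 - math.exp(-v_dd / v_th))


def cell_area(sizes):
    """Area = sum of W_i * L_i over the transistors; sizes is a list of (W, L)."""
    return sum(w * l for w, l in sizes)


def standby_power(drv, leakage_currents):
    """Power = DRV * sum of the leakage currents I_Dj, with the supply held at DRV."""
    return drv * sum(leakage_currents)


if __name__ == "__main__":
    drv = 0.1  # hypothetical retention voltage in volts
    # Hypothetical off devices (v_GS = 0) with drain-source voltage equal to DRV.
    leaks = [subthreshold_current(1e-6, 0.0, 0.3, drv) for _ in range(2)]
    print(f"standby power: {standby_power(drv, leaks):.3e} W")
    print(f"area: {cell_area([(0.1, 0.1), (0.2, 0.1)]):.3f} um^2")
```

An optimizer would minimize `standby_power` over the sizing and body-bias variables subject to area and performance constraints.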

Global Optimization: Program Demo

[Figure: cell schematic annotated with Wp/Lp, Wn/Ln, Vsbp and Vsbn]

Global Optimization: DRV

Lower delay costly;

P/NMOS asymmetry ⇒ higher DRV;

Smaller area possible;

Lower sensitivity at looser constraints;

Global minimum DRV: 35mV (74% reduction) @ Wn/Ln=0.1u/186u, Wp/Lp=354u/186u, Vsbn=0.4V, Vsbp=−0.4V.

Global Optimization: DRV Sensitivity

[Figure: DRV sensitivity plots; marker: Standard-Cell]

Global Optimization: Power

Power and DRV require different sizing;

Performance and area equally influential;

Insensitive to area beyond standard-cell size (next page);

Global minimum power: 0.3nW (70% reduction) @ Wn/Ln=0.1u/0.1u, Wp/Lp=0.2u/0.1u, Vsbn=Vsbp=0.

Global Optimization: Power Sensitivity

[Figure: power sensitivity plots; marker: Standard-Cell]

Conclusion (and Beyond)

A platform-based SRAM standby power optimization method is presented;

Global optimization yields lower DRV and standby power than single-dimensional parametric search;

Minimum power and minimum DRV require different sizing;

Sizing for globally minimum power exists with 70% power reduction, 5% area reduction, but 80% performance loss;

Presented method can be scaled to incorporate more design variables such as access transistor size and bit-line pre-charge;

Presented method can be scaled to incorporate process variation, where statistic-based modeling is a key component (primitive analysis on final report).

Acknowledgement

Special thanks to Huifang Qin for generous support, from test data and documents to technical discussion and evaluation;

Thanks to Rakesh Vattikonda and Yu Cao for providing statistical models;

Thanks to Dejan Markovic for early discussion on scope of work and help with MATLAB;

Thanks to Yanmei Li for discussion on platform-based design.

Key references:

[1] H. Qin, Y. Cao, D. Markovic, A. Vladimirescu, and J. Rabaey, “SRAM Leakage Suppression by Minimizing Standby Supply Voltage,” ISQED, 2004

[2] B.H. Calhoun and A. Chandrakasan, “Analyzing Static Noise Margin for Sub-threshold SRAM in 65nm CMOS,” ESSCIRC, 2005

[3] R. Vattikonda, “Modeling DRV of a SRAM Cell Under Variations,” Research Notes, April 2006

Appendix

More on Statistic-Based Optimization

[Flowchart: start; conditions; DRV_MAX and DRV_MEAN model; optimal sizing for minimum DRV_MEAN; optimal sizing for minimum DRV_MAX; match? (Yes: end; No: update DRV model)]

“How about power… ?”

1st-Order Verification

By fixing W and L to standard size, the presented optimization platform reduces to single-dimensional parametric sweep: Vpb and Vnb.

dots: measured DRV from the 90nm test chip;

line: parametric sweep using the reduced platform;

color code: same color (dots and line) represents same bias condition.

Ultra Low Power Clock Generation using Sub-Threshold MCML

Asako Toda • Anurag Pandit • Khang An Tran

EE241 FALL

Final Project Presentation

MCML in sub-Vth

Good

• Leakage
• Ultra Low Power
• Robust
• Easy

Bad

• Slow ?
• Big ?
• Model ?

Current Mode Logic (CML) in Sub-Vth

[Figure: CML gate with replica bias and tail current Iss; labels: ~nA, 10nA, ~100mV, ~10MΩ, 20MΩ; logic levels H = VDD, L = VDD–200mV]

Input – Output

[Figure: out+ and out- vs. ∆Vin in sub-Vth; input threshold ∆Vin_th = 3 x nVt = 117mV, compared with Vdsat and 3 x Vt]
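The 117 mV input threshold can be checked with a short calculation. The sketch below assumes the standard subthreshold differential-pair relation, Iout/Iss = tanh(∆Vin / (2 n Vt)); that relation is a textbook model and not necessarily the fitted a/b model used in this project. The thermal voltage of 26 mV is also an assumption.

```python
import math


def slope_factor_from_threshold(vin_th, vt=0.026):
    """Slope factor n implied by the constraint Vin_th = 3 * n * Vt."""
    return vin_th / (3.0 * vt)


def steering_fraction(delta_vin, n, vt=0.026):
    """Fraction of Iss steered to one branch of a subthreshold differential pair."""
    return math.tanh(delta_vin / (2.0 * n * vt))


if __name__ == "__main__":
    n = slope_factor_from_threshold(0.117)
    print(f"implied slope factor n: {n:.2f}")
    print(f"current steering at Vin_th: {steering_fraction(0.117, n):.3f}")
```

Under these assumptions the threshold corresponds to n of about 1.5, and an input of 3 nVt steers roughly 90% of the tail current, which is consistent with treating it as the minimum usable input swing.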

Input – Output Model

[Figure: out+ and out- vs. ∆Vin; output level VDD – 27mV]

[Figure: differential output current Iout vs. ∆V, saturating at Iss: Iout = Iss = 1/a @ ∆V: big; Iout = 0 @ ∆V = 0]

Input – Output Model Verification: Yes!

[Figure: model compared against Mathematica; output low level VDD – b x Iss]

Variation & Mismatch

a ~ 1 / Iss

a, b ~ Lp, 1/Wp

[Figure: MCML stage with (1) PMOS load, (2) NMOS input and (3) current source Iss; source voltage from 0 to VDD]

Variation

Sensitivity (example: NMOS, PMOS: min size)

[Figure: input-output curves, worst case]

Parameter   Sens   ∆Vod (10%)
Iss         0.4    40 mV
a           0.3    30 mV
b           0.13   13 mV (worst)

Constraint

Vin > Vin_th = 3 nVt

Vin = Vin_th + margin + ∆Vod

Matching

Constraint: Vos << Vos_limit

[Figure: offset Vos relative to Vos_limit; level VDD – ∆u]

Matching: Worst Case

pair ratio: r << 10%

[Figure: Vos vs. pair ratio r with Vos << Vos_limit; marked values ~50mV and ~10mV]

PDP

[Figure: Sub-Vth MCML vs. static CMOS; delay, power, PDP and EDP vs. td [sec] and N (number of stages); annotated values 1.59n (tp0), 3.10n, 7.56n and 30p (tp0), 47p, 34p; axis from 1 to 40f]

[Figure: PDP_MCML vs. PDP_INV at VDD=0.36]

Frequency Divider (VDD = 0.5V, Freq = 20MHz)

Fujishima, et al., JSSC, 1993

Type          Energy
MCML type     7fJ/cyc
CMOS Static   5fJ/cyc

Summary

Good

• PDP in Ultra Low Voltage
• Robust in Matching, Noise, and EM.
• Easy

Bad

• Slow in Middle Low Power
• Variation
• Cascading
• Big

Analysis of Razor

Timothy Loo, Vincent Ng

University of California, Berkeley

EE241 Spring 2006

5/8/2006

Motivation

Always correct logic not efficient

D. Ernst, N. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation”, MICRO-36, December 2003

Solution

Razor – an error correction mechanism

Challenges

• Optimal supply voltage strongly dependent on the program and input data
• Hold time increased by the amount of delay of the delayed clock
• Meeting both the setup time constraint and the hold time constraint is a challenge
• A strong tradeoff between potential benefit and buffers needed
• Need to analyze every path in the logic
• Will benefit increase or decrease with increased process variation?


Analyzing the Problem

• Add model corners using ee240 and ST technology models.
• Implement and verify a pipeline to simulate data propagation through an adder, a regular flip-flop, a multiplier, and a Razor flip-flop.
• Simulate circuit to determine longest and shortest paths to see how they affect Razor operation

Multiplier Path Variation: Simulated vs Extracted Data

[Figure: % change from TT (10 to 40) vs. technology (90, 65, 45); series: SS-ST, FF-ST, SS-240, FF-240]

Shortest-Longest Path Variation

[Figure: % variation (135 to 165) vs. technology (90, 65, 45)]

If Razor Shadow latch detects errors after 1/2 a clock cycle, the period can be 2/3 * Longest Path instead of 1!

Effects of Variation on Razor Efficiency

1/3 1/3 1/3

Limitation

Shortest Path Constraint may necessitate the use of delay buffers to prevent data corruption

Additional Power Use

Two Cases

• Logic can be modified such that shortest path is increased without increasing longest path
• Add buffers in logic to increase shortest path

Energy Penalty

Technology (nm)   Energy penalty
90                124fJ
65                245fJ
45                546fJ

90nm Total Power Simulation

Energy w/o Razor: 0.9pJ

Energy w/ Razor, w/o Overhead: 0.7pJ

Energy w/ Razor, w/ Overhead: 1.3pJ

Note: overhead may be amortized with larger logic blocks

Additional Clock Period

Worst Case:

Logic cannot be modified internally, therefore adding buffers increases both shortest and longest path delay. Clock Period Savings are reduced.

Boundary Requirement:

Shortest Path > Longest Path / 6.

Otherwise, the adjusted clock period (to satisfy the shortest path constraint) will be longer than the Longest Path!

Razor Efficiency

None of the 90, 65, 45nm models pass the shortest path requirement.

Tech   % Increase   % Over Original Clock Period
90     53%          2.2%
65     52%          1.3%
45     68%          12.0%

Setup Time Variations

[Figure: setup times vs. technology (90, 65, 45); % of total clock period (0 to 25); series: Main FF Setup Time, Shadow FF Setup Time]

Conclusion

• Razor topology is susceptible to shortest path signals corrupting the shadow latch.
• Variations increase dramatically from 65 to 45nm technology
• Benefits of Razor decrease as variations increase.

Soft Error Tolerant Logic Circuit Design

Mohammad Amin Arbabian, Debopriyo Chowdhury

What are soft errors?

Transient faults caused by high energy particles

Sources: Alpha particles from packaging material, thermal and fast neutrons from cosmic rays

[K.J. Hass et al., MWSCAS 1999]

Should we really care about SEE?

What really limits scaling? Wallmark [1962]: Power and SER

C↓ V↓ ⇒ Q↓↓ ⇒ SER↑↑

Source: Shivakumar P., IEEE 2002, N.J. Wang, UIUC

Upshot: Soft error rates per SRAM or latch bit grow slowly with scaling

But: The number of bits grows with Moore’s Law!

Caution: Custom, ASIC, FPGA designs that push the density envelope

Soft Errors in Logic Circuits

Single event transient (SET) vs upset (SEU)

SET in combinational circuits:

• Electrical masking
• Logical masking
• Latching window masking

[Figure: flip-flop to flip-flop logic paths (D, Q, Q’, CLK) showing an SET that is electrically masked, logically masked, or arrives outside the latching window]

Source: Krishnamohan et al., IEEE 2004

Masking becoming ineffective in nanometer circuits

Available Circuit Techniques:

• Triple Modular Redundancy
– Partial Error Masking / Cluster Sharing to decrease the area overhead
• Selective Node Engineering
– Adding capacitance
– Adding cross-coupled inverters
– Sizing
• Using timing slacks to resample data
• Conventional hardened latch

Annotations on the techniques: huge area, power overhead; large performance overhead; complex clocking scheme; SET protection

Modeling of Soft Errors

Circuit response depends not only on deposited charge but also on the amplitude, duration and shape of the current pulse, so accurate models are essential for reliable results.

[Figure: two injected pulses carrying the same charge; one flips the bit, the other does not]

Simulation done with inverter (NMOS: 0.51µ/90nm, PMOS: 0.88µ/90nm)

Total incident charge: 15fC (1mA, 15ps; 0.1mA, 150ps)

Same charge!!

Modeling Neutron Strike

Fast rise and gradual fall current waveforms:

• Trapezoid
• Double exponential
• Mix of linear and exp

Our choice:

I(t) = \frac{2Q}{T\sqrt{\pi}} \sqrt{\frac{t}{T}} \exp\left(-\frac{t}{T}\right)

• T fitted for 90nm process
• Simulated by the piece-wise linear function in Cadence

[Figure: strike current pulse; charge: 10fC, peak current ~300µA, 70ps]
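A short numerical check of the chosen waveform is sketched below. It assumes the commonly used single-time-constant form I(t) = (2Q / (T sqrt(pi))) sqrt(t/T) exp(−t/T), whose integral over time equals Q, and generates sample points of the kind a piece-wise linear source would need. The time constant derived here is computed from the quoted charge and peak current; it is not a value stated on the slide, and the function names are hypothetical.

```python
import math


def strike_current(t, q, tau):
    """Current at time t for injected charge q and time constant tau."""
    if t <= 0.0:
        return 0.0
    return (2.0 * q / (tau * math.sqrt(math.pi))) * math.sqrt(t / tau) * math.exp(-t / tau)


def peak_current(q, tau):
    """The waveform peaks at t = tau / 2."""
    return strike_current(tau / 2.0, q, tau)


def tau_for_peak(q, i_peak):
    """Time constant giving a requested peak current for charge q."""
    return q * math.sqrt(2.0 / (math.pi * math.e)) / i_peak


def pwl_points(q, tau, t_end, n):
    """n evenly spaced (time, current) samples for a piece-wise linear source."""
    return [(i * t_end / (n - 1), strike_current(i * t_end / (n - 1), q, tau)) for i in range(n)]


def total_charge(points):
    """Trapezoidal integral of the sampled current."""
    return sum(0.5 * (i0 + i1) * (t1 - t0) for (t0, i0), (t1, i1) in zip(points, points[1:]))


if __name__ == "__main__":
    tau = tau_for_peak(10e-15, 300e-6)
    print(f"tau for 10 fC and 300 uA peak: {tau * 1e12:.1f} ps")
    print(f"charge recovered from samples: {total_charge(pwl_points(10e-15, tau, 40 * tau, 4001)):.3e} C")
```

Because pulses with the same charge but different shapes give different circuit responses, fixing both Q and the peak (through T) pins down the stimulus more tightly than charge alone.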

Our Solution:

• SETs hurt only if they are captured by the latch
• SEU protection of latch is critical
• Aggressive pipelining means less logic, more flip-flops
• If we can have a latch that filters the transients, why slow all the nodes?

Concept of error-filtering latch: design 2 new SET and SEU tolerant latches

Proposed Latch 1

[Figure: latch schematic with input IN(d), CLKBAR, transistors MN1, MN2, MN3, MP1, MP2, MP3, a DCDE, a C-element and output Y]

• Adaptive SET protection and inherent SEU protection
• Very small area/power overhead
• Area overhead: 3-5% including data-path
• Power overhead: X 1.08

Adaptive Delay Control Unit

• Graph shows tradeoff between delay and QCRIT for Latch 1
• A four-bit Digitally Controlled Delay Element (DCDE)
• Delay variation: 30ps – 65ps in steps of 2-5ps

Possibility of adaptive soft error protection at run time

[Figure: QCRIT vs. delay for Latch 1; Qcritical (fC, 0 to 80) vs. delay (FO4 inverter, 0 to 2.5)]

Proposed Latch 2

• Concept of feedback used
• Latch does not respond to transient spikes
• Less delay overhead with some power penalty
• Power overhead: X 1.79

[Figure: Latch 2 schematic; annotation: Stronger]

Simulation Results (1)

• Statistical (Monte Carlo type) simulation with random charge injection based on stochastic modeling of neutron energy transfer
• Latch tested with 4-bit ripple carry mirror adder

[Figure: probability of failure (%, 0 to 8) vs. node number (1 to 12); series: Nominal, Protected (Scheme 1)]

[Figure: probability of failure (%, 0 to 90) over five points; axis labels: Node Number, Charge (fC)]

Total of 50,000 random charges

Average SER protection:

• 212% improvement in critical charge
• 57 times lower probability of failure

Simulation Results (2)

[Figure: probability of failure (%, 0 to 8) vs. node number (1 to 10); series: Nominal, Protected]

[Figure: probability of failure (%, 0 to 90) over five points; axis label: Node Number]

Average SER protection:

• 239% improvement in critical charge
• 79 times lower probability of failure

Similar improvement in reliability obtained using some ISCAS benchmark circuits as the data-path

What comes next?

• Scaling into sub-50 nm technology might be limited by soft error, if proper protection is not taken for logic circuits
• Proposed latch can be combined with node engineering techniques to yield enhanced protection for critical applications
• Soft error sensing and adaptive protection schemes are attractive

Impact of Logic Styles on a Viterbi Decoder in respect of Performance and Robustness

Ji-Hoon Park, Seung-Bum Suh

Topics and Contents

Topics

I. Design of Add-Compare-Select Unit with LVS Logic
II. Performance and Robustness Comparison with Different Logic Styles

Contents

I. Viterbi Decoder
II. Simulation Results – Performance and Robustness
III. Analysis – Delay and Variability
IV. Conclusion

Viterbi Decoder

[Figure: trellis butterfly; state metrics SM_i^n and SM_j^n at step n feed SM_{k1}^{n+1} and SM_{k2}^{n+1} at step n+1]

BM_{i,k1}^n = BM_{j,k1}^n

BM_{i,k2}^n = BM_{j,k2}^n

SM_k^{n+1} = \min\{ SM_i^n + BM_{i,k}^n,\ SM_j^n + BM_{j,k}^n \}

• ACS unit is a critical block in the Viterbi decoder
• Adder is a critical block in the ACS unit
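The add-compare-select recursion can be sketched in a few lines. This is a generic software model of the state-metric update SM_k^{n+1} = min{SM_i^n + BM_{i,k}^n, SM_j^n + BM_{j,k}^n}, not the LVS circuit studied in this project; the function names and the tie-breaking rule are assumptions.

```python
def acs(sm_i, sm_j, bm_ik, bm_jk):
    """Add-compare-select for one destination state k.

    Returns (new state metric, decision bit): decision 0 keeps the path
    from state i, decision 1 keeps the path from state j.
    """
    cand_i = sm_i + bm_ik  # add
    cand_j = sm_j + bm_jk
    if cand_i <= cand_j:   # compare (ties go to state i by assumption)
        return cand_i, 0   # select
    return cand_j, 1


def butterfly(sm_i, sm_j, bm_i_k1, bm_j_k1, bm_i_k2, bm_j_k2):
    """Update both destination states k1 and k2 of one trellis butterfly."""
    return acs(sm_i, sm_j, bm_i_k1, bm_j_k1), acs(sm_i, sm_j, bm_i_k2, bm_j_k2)


if __name__ == "__main__":
    print(butterfly(3, 5, 2, 1, 0, 2))
```

Each ACS needs two additions followed by a comparison in the same cycle, and the result feeds the next cycle, which is why the adder sets the critical path of the decoder.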

Performance of ACSs and Adders

Carry Implementations

I. LVS (Low-Voltage Swing)

II. Dynamic Manchester

III. Static Manchester

Delay (adder) = 79.81 ps

Delay (ACS) = 148.41 ps

Robustness of Adders

LVS – Most Robust

• Clocking of Sense Amplifier
• Sampling Level of Pass Transistor Output

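Analysis - Concepts

• Delay Expression (Linear Approximation)

t_p = \int_{v_1}^{v_2} \frac{C_L(v)}{i(v)}\,dv \cong \frac{C_L (v_2 - v_1)}{4}\left(\frac{1}{i_1} + \frac{1}{i_2}\right)

• Variability Expressions

\therefore \frac{\partial t_p}{\partial V_T} \cong -\frac{C_L (v_2 - v_1)}{4}\left(\frac{1}{i_1^2}\frac{\partial i_1}{\partial V_T} + \frac{1}{i_2^2}\frac{\partial i_2}{\partial V_T}\right)

\therefore \frac{\partial t_p}{\partial V_{DD}} \cong -\frac{C_L (v_2 - v_1)}{4}\left(\frac{1}{i_1^2}\frac{\partial i_1}{\partial V_{DD}} + \frac{1}{i_2^2}\frac{\partial i_2}{\partial V_{DD}}\right)

[Figure: current vs. voltage curve sampled at v_1 and v_2, with currents i_1 and i_2]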

Analysis – Comparison with Simulations

[Figure: delay (ps, 0 to 350) vs. Vth (V, 0.05 to 0.3); series: Delay Model, Simulation]

Analysis model and simulation results match

Analysis – Delay and Variability

[Table: closed-form expressions for t_p, ∂t_p/∂V_T and ∂t_p/∂V_DD as functions of V_DD and V_T, for the PARALLEL, SERIAL and PTL structures]

• Robustness: Serial Stack > Parallel >> PTL

Analysis – Vt and Vdd Sensitivity

[Figure: normalized Vt sensitivity ∂t_p/∂V_T (0 to 80) vs. Vdd (0 to 1.5) and Vt (0 to 0.4); series: PTL, Serial, Parallel]

[Figure: normalized Vdd sensitivity ∂t_p/∂V_DD (0 to 40) vs. Vdd (0 to 1.5) and Vt (0 to 0.4); series: PTL, Serial, Parallel]

Analysis – PTL Sampling Level vs. Robustness

PTL robustness is higher as the sampling level is lower

[Figure: normalized Vt sensitivity (PTL), ∂t_p/∂V_T vs. Vdd and Vt, for sampling levels @0.2V, @0.3V and @0.5V]

[Figure: PTL Vth variation simulation varying by sampling level; probability (0 to 0.25) vs. delay (ps, 0 to 250) for @0.2V, @0.3V and @0.5V]

Strong Robustness of LVS

Conclusion and Future Works

Conclusion

• LVS Logic Style
– Outperforms in performance and robustness
– Complex to design (clock timing is critical)
• Variation Analysis
– Lowering the sampling level in PTL and stacking the devices improve the robustness
– Explains the robustness differences between logic styles
– Explains LVS robustness

Future Works

• Develop Variability Analysis Models
– Apply to complex gate structures
– Apply to short-channel models
• Robustness vs. Clocking Strategy

Analysis and Design of High Performance and Low Power Current Mode Logic CMOS

Phillip Chin, Junjie Su, Xiaolan Zhong

Motivation

As technology scales, the following problems become prevalent:

Increased leakage current
Increased power consumption
Reduced noise margins

Today’s circuits require high performance at low power while addressing the above problems.

52

Current Mode Logic (CML)

In the past it was not used, because it consumed more power than voltage mode implementations.
However, today, circuits operate at higher frequencies.
CML can save a lot of power and offer better performance.

Current Mirror Difficulties

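Recall MOS current equations:

Saturation:
$$I_D = \frac{k'}{2}\frac{W}{L}\left(V_{GS}-V_T\right)^2\left(1+\lambda V_{DS}\right)$$

Subthreshold:
$$I_D = I_S\, e^{\frac{V_{GS}}{nkT/q}}\left(1-e^{-\frac{V_{DS}}{kT/q}}\right)\left(1+\lambda V_{DS}\right)$$

Lambda increases as technology scales, so small changes in VDS have a bigger impact on the drain current; the current mirror is therefore more sensitive to changes in VDS.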
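The VDS sensitivity of a current mirror can be illustrated with the saturation-region current equation from this slide. The sketch below is only an illustration under stated assumptions, not the authors' simulation: it assumes an ideal mirror with matched devices (same W/L, k', and VGS), so only the (1 + λ·VDS) factor differs between the reference and output branches. The function name, the λ values, and the voltages are hypothetical.

```python
def mirror_ratio(lam, vds_ref, vds_out):
    """Output/reference current ratio of a matched current mirror in saturation.

    With identical W/L, k' and V_GS, the saturation equation reduces the
    ratio to (1 + lam*vds_out) / (1 + lam*vds_ref).
    """
    return (1.0 + lam * vds_out) / (1.0 + lam * vds_ref)


# Hypothetical channel-length modulation values (1/V), chosen only for illustration.
for lam in (0.05, 0.2, 0.5):
    error = mirror_ratio(lam, vds_ref=0.4, vds_out=0.8) - 1.0
    print(f"lambda = {lam:.2f} 1/V -> mirror error = {error * 100:.1f}%")
```

Under these assumptions, a larger λ turns the same VDS mismatch into a larger copy error, which is the effect the slide points to.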

53

CML Adder Failure

Fails with smaller technology, depends heavily on current mirror accuracy

New CML AND Gate

Inspired by the previous adder
Sizing is very important
Can replace CMOS AND gate

54

Robustness Comparison

At lower frequencies, static CMOS has a better noise margin than CML.
At higher frequencies, the two are nearly identical.

[Figures: CML AND Gate at 1.25 GHz; Static CMOS AND at 1.25 GHz]

Delay and Power Comparison

[Figure: Power (fW) vs. Switching Frequency (GHz); curves: Static CMOS, CML]

[Figure: Delay (ps) vs. Switching Frequency (GHz); curves: Static CMOS, CML]

Delay (ps)
Freq (GHz)   CMOS   CML
0.667        117    129
0.833        117    129
1            117    129
1.25         117    129

Power (fW)
Freq (GHz)   CMOS   CML
0.667        17.5   3.66
0.833        22     3.53
1            26.5   4
1.25         32.4   3.62

55

Static CMOS vs. CML

CML outperforms static CMOS

[Figure: Delay*Power (ps*fW) vs. Switching Frequency (GHz); curves: Static CMOS, CML]

Delay*Power (ps*fW)
Freq (GHz)   CMOS   CML
0.667        2047   471.8
0.833        2574   455.4
1            3106   516
1.25         3791   467
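The delay-power products can be cross-checked against the delay and power values reported in this presentation. The sketch below is a minimal check, assuming the flattened tables read as delay = 117 ps (CMOS) and 129 ps (CML) at every frequency, with power in the units shown on the plot axis; the variable and function names are illustrative only.

```python
# Values transcribed from the delay and power tables of this presentation.
freq_ghz = [0.667, 0.833, 1.0, 1.25]
delay_ps = {"CMOS": 117.0, "CML": 129.0}  # constant across the four frequencies
power_fw = {
    "CMOS": [17.5, 22.0, 26.5, 32.4],
    "CML": [3.66, 3.53, 4.0, 3.62],
}


def delay_power(style):
    """Delay*power product (ps*fW) at each frequency for one logic style."""
    return [delay_ps[style] * p for p in power_fw[style]]


for f, cmos, cml in zip(freq_ghz, delay_power("CMOS"), delay_power("CML")):
    print(f"{f:5.3f} GHz  CMOS {cmos:7.1f}  CML {cml:6.1f}  ratio {cmos / cml:4.1f}x")
```

The recomputed products agree with the reported ones to within rounding of the tabulated power values.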

Conclusion

CML offers a new possible solution to the issue of scaling
CML AND gate offers better overall performance than its static CMOS counterpart
In the future, optimize current mirror and build bigger blocks

56

UC Berkeley Spring 2006

EE241 Final Presentation

Evaluation of Adiabatic Logic for New Process Technologies

Prof. Jan Rabaey

Karl Skucha, Babak Pahlavan, Arash Ghanadan


Motivation

Exploit property of Adiabatic Charging

$E = \frac{RC}{T}\, C\, V_{DD}^2$

Asymptotically Zero Power Dissipation

How close can we get?

What are the delay trade-offs?

Future Trends

Viable in new process technologies?

Benefits getting better or worse?

Possibility for mobile or ultra-low power applications.
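To make the adiabatic charging relation concrete, the sketch below compares E = (RC/T)·C·VDD² with the familiar C·VDD²/2 dissipated when a capacitor is charged abruptly from a fixed supply. It is a rough illustration only: the resistance, capacitance, and ramp times are hypothetical, and the adiabatic expression is meaningful only when T is much larger than RC.

```python
def adiabatic_energy(r, c, t, vdd):
    """Energy dissipated when C is charged through R by a ramp of duration t."""
    return (r * c / t) * c * vdd ** 2


def conventional_energy(c, vdd):
    """Energy dissipated when C is charged abruptly from a fixed supply."""
    return 0.5 * c * vdd ** 2


# Hypothetical values: 1 kOhm switch resistance, 100 fF load, 1 V supply.
R, C, VDD = 1e3, 100e-15, 1.0
for t in (1e-9, 10e-9, 100e-9):
    ratio = adiabatic_energy(R, C, t, VDD) / conventional_energy(C, VDD)
    print(f"T = {t * 1e9:5.0f} ns -> adiabatic / conventional = {ratio:.3f}")
```

The ratio equals 2RC/T, so slowing the ramp by 10x cuts the dissipated energy by 10x, which is the delay trade-off the slide asks about.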


57


Problem Statement

No formal study of Adiabatic circuits below 130nm technology node

No formal study of trends for new technology nodes such as 90nm, 65nm and 45nm

Power is one of the biggest issues

Are adiabatic circuits promising as technology scales down?


Design Families

Non-Adiabatic design family for comparison

Dynamic DCVSL

Adiabatic families

Positive Feedback Adiabatic Logic (PFAL)

Efficient Charge Recovery Logic (aka 2N2P)

58


Design Families: Dynamic DCVSL

Reasons Used

Same switching probability as Adiabatic circuits

Similar structure (differential, cross coupled PMOS)


Design Families: 2N2P

Reasons Used

Easy to implement

Small area

Rail-to-rail swing

Operation

Adiabatic charging and discharging with clock

4 sinusoidal clocks, 90 degrees out of phase
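The four-phase clocking can be pictured with a short sketch. This is an assumption-laden illustration, not the authors' test bench: each power clock is modeled as a sinusoid offset to swing between 0 and VDD, and the function name, frequency, and supply value are hypothetical.

```python
import math


def power_clocks(t, freq, vdd):
    """Return the four power-clock voltages at time t, each 90 degrees apart.

    Each clock is a sinusoid offset so that it swings between 0 and vdd.
    """
    return [
        0.5 * vdd * (1.0 + math.sin(2.0 * math.pi * freq * t - k * math.pi / 2.0))
        for k in range(4)
    ]


# Sample the clocks over one period of a hypothetical 100 MHz power clock.
freq, vdd = 100e6, 1.0
for step in range(5):
    t = step / (4 * freq)
    print([round(v, 3) for v in power_clocks(t, freq, vdd)])
```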

59


Design Families: PFAL

Reasons Used

High Performance [1]

Rail-to-Rail Swing

Operation

Functional blocks assist adiabatic charging

4 sinusoidal power clocks

[1] A. Vetuli, S. Di Pascoli and L. M. Reyneri: “Positive feedback in adiabatic logic”. Electronics Letters, Vol. 32, No. 20, Sep. 1996, pp. 1867ff.


Test Setup

Test Circuit: 4 input NAND decoder

Input Vector: 0101->1010->0101

Frequencies:
HF: 1GHz and 500MHz
MF: 250 and 100MHz
LF: 50 and 10MHz
VLF: 3 and 1MHz

Loads: 0, 50, 100, 150f

Sized NANDs and scaled voltage for lowest power dissipation

Maintain functionality and “full swing”

60


Energy vs. Delay Results (1/4)

DCVSL uses highest energy, 2N2P ~40%, PFAL ~22%

[Figure: Energy Per Cycle vs. Delay, 180nm, 100f load; curves: 2N2P, PFAL, DCVSL]

[Figure: Energy Per Cycle vs. Delay; curves: 2N2P, PFAL, DCVSL]

61




Leakage begins to dominate in low frequency for all three


[Figure: Energy Per Cycle vs. Delay; curves: 2N2P, PFAL, DCVSL]



All fail in high frequency, and leakage dominates in low frequency

Results promising but likely unreliable due to (much) higher power consumption and voltage requirements for 45nm vs. other models

[Figure: Energy Per Cycle vs. Delay, 45nm, 100f load; curves: 2N2P, PFAL, DCVSL]

62


Trend Results (1/2)

Adiabatic gain= (energy for DCVSL) / (energy for Adiabatic family)
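The gain definition maps directly to an energy saving. The sketch below is a minimal illustration with hypothetical function names; the example inputs reuse the relative energies quoted earlier in this presentation (2N2P at about 40% and PFAL at about 22% of the DCVSL energy).

```python
def adiabatic_gain(e_dcvsl, e_adiabatic):
    """Adiabatic gain = (energy for DCVSL) / (energy for the adiabatic family)."""
    return e_dcvsl / e_adiabatic


def energy_saving(gain):
    """Fraction of the DCVSL energy saved for a given gain: 1 - 1/gain."""
    return 1.0 - 1.0 / gain


# Relative energies, with DCVSL normalized to 1.0.
for family, relative_energy in (("2N2P", 0.40), ("PFAL", 0.22)):
    gain = adiabatic_gain(1.0, relative_energy)
    print(f"{family}: gain = {gain:.2f}, saving = {energy_saving(gain) * 100:.0f}%")
```

A gain of 4 corresponds to a 75% saving, so gains between roughly 3.3 and 5 correspond to savings of 70-80%.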

[Figure: High Freq Adiabatic Gain vs Model (180nm, 90nm, 65nm); series: 2N2P, PFAL, Average]
• HF gain ~2
• Little change

[Figure: Middle Freq Adiabatic Gain vs Model (180nm, 90nm, 65nm, 45nm); series: 2N2P, PFAL, Average]
• MF gain ~4
• Slight increase


Trend Results (2/2)

[Figure: Low Freq Adiabatic Gain vs Model; series: 2N2P, PFAL, Average]
• LF gain ~3.5
• Slight decrease

[Figure: Very Low Freq Adiabatic Gain vs Model; series: 2N2P, PFAL, Average]
• VLF gain ~4-2
• Large decrease

63


Results Summary

Adiabatic circuits consume 70-80% less power in medium to very low frequencies
If leakage dominates, the benefits decrease
At high frequencies, benefits also decrease
In smaller technologies, benefits decrease for low and very low frequencies
However, the medium frequency range remains promising.


Discussion & Conclusion

Discussion
More clock networks and clock generation circuits will increase clock distribution power
Always switching
Applicable mostly to high switching circuitry
Larger design overhead
~100 times lower static noise [2]

Conclusion
Power saving of 70-80% and increasing for frequencies in the 100-250MHz range for new technologies
Very low frequencies and high frequencies need different low-power solutions

[2] H. Mahmoodi-Meimand and A. Afzali-Kusha, “Low-power, low-noise adder design with pass-transistor adiabatic logic,” Proceedings of the 12th International Conference on Microelectronics (ICM 2000), 31 Oct.-2 Nov. 2000, pp. 61-64.

64


Q & A

Thank you for your time

Any questions?


Back up slides

65


Test Setup

Robustness for Adiabatic Circuits explained
Output signal goes within ~1% of rail
Output goes below VT before next clock reaches VT

[Figure: waveforms of Clk1, Clk2, and Output]


Simulation Results – 150f load (1/4)


[Figures: Energy Per Cycle vs. Delay plots for the 150f load; curves: 2N2P, PFAL, DCVSL]

Algebraic Coding for Reliable Computation using Unreliable Gates

Animesh Kumar

EE 241 Class Project

Instructor: Prof. Jan M. Rabaey

68

Introduction and motivation

• Combinatorial logic block in DSM circuits

• Feature size very small

• Smaller device => more susceptible to SER

Unreliable AND logic

• Simple AND gate – well-known problem

• Combinatorial model – correct controlled number of errors

• a, b, x are binary vectors

[Figure: AND gate with inputs a, b and output x; p = probability of failure]

P(x != a.b) = p, p > 0

69

Past approaches

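[Figure: three copies of the gate, each with failure probability p and inputs a, b, producing x1, x2, x3; a majority voter outputs x]

• TMR (Triple Modular Redundancy)

• Works for single-error per compute

• Wasteful for p small

[Figure: gate with failure probability p and inputs a, b]

• Encode using a linear error-control code

• An [n, k, d] code will correct ⌊(d-1)/2⌋ errors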
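The two past approaches can be quantified with a short sketch. It assumes, beyond what the slide states, that the three gates fail independently with probability p and that the majority voter itself is perfect; the function names are illustrative.

```python
def tmr_failure(p):
    """Probability that the majority of three independent gates is wrong.

    Assumes a perfect voter: the output is wrong when two or three of the
    gates fail.
    """
    return 3 * p ** 2 * (1 - p) + p ** 3


def correctable_errors(d):
    """Number of errors an [n, k, d] code corrects: floor((d - 1) / 2)."""
    return (d - 1) // 2


for p in (0.1, 0.01, 0.001):
    print(f"p = {p}: TMR failure probability = {tmr_failure(p):.2e}")

print(correctable_errors(3))  # minimum distance 3, e.g. Hamming [7,4,3]
print(correctable_errors(8))  # minimum distance 8, e.g. Reed-Muller [16,5,8]
```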

Interesting idea

[Figure: two gates with inputs a, b, each with failure probability p; outputs x and y]

a  b  x  y  XOR(x, y)
0  0  0  0  0
0  1  0  1  1
1  0  0  1  1
1  1  1  1  0

XOR(x, y) = XOR(a, b)

Approach: Encode over input and have gate-diversity

• XOR channel locates error(s)

• Correction?
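The identity behind the XOR channel can be checked exhaustively. In the sketch below, x is taken to be the AND output and y the OR output, as the truth table implies; `xor_channel` is a hypothetical helper name, and the noisy words are the ones from the Hamming example in this presentation.

```python
from itertools import product


def xor_channel(x_bits, y_bits):
    """Bitwise XOR of the AND-gate word x and the OR-gate word y."""
    return [x ^ y for x, y in zip(x_bits, y_bits)]


# For every input pair, AND xor OR equals XOR of the inputs.
for a, b in product((0, 1), repeat=2):
    assert (a & b) ^ (a | b) == a ^ b

# Noisy words from the Hamming example in this presentation.
x_noisy = [0, 0, 0, 1, 0, 0, 0]
y_noisy = [1, 1, 1, 1, 1, 1, 1]
print(xor_channel(x_noisy, y_noisy))
```

Because the code is linear, XOR(a, b) is itself a codeword, so the word printed here is a codeword with an error at the position where a gate failed.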

70

Correction – the easy part

[Figure: two gates with inputs a, b, each with failure probability p; outputs x and y]

if noisy (x, y) = (1, 1) or (0, 0)
then corrected (x, y) = (0, 1)

Example {Hamming [7,4,3] code}

a = 0 0 0 0 0 0 0
b = 1 1 1 1 1 1 1

Noisy outputs:
x = 0 0 0 1 0 0 0
y = 1 1 1 1 1 1 1
s = 1 1 1 0 1 1 1

Corrected outputs:
x = 0 0 0 0 0 0 0
y = 1 1 1 1 1 1 1

☺
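The easy correction rule can be sketched as follows. This is an illustration under an explicit assumption: the true XOR(a, b) word is taken as already recovered (in practice by decoding the noisy XOR word with the code's decoder, which is not implemented here), and `correct_easy` is a hypothetical name. Positions where the true XOR bit is 0 are the hard case and are left untouched.

```python
def correct_easy(x_noisy, y_noisy, xor_ab):
    """Fix positions where the true XOR bit is 1 but the noisy pair agrees.

    If XOR(a, b) = 1, then AND = 0 and OR = 1, so a noisy (0, 0) or (1, 1)
    must be corrected to (0, 1).
    """
    x_fixed, y_fixed = list(x_noisy), list(y_noisy)
    for i, (x, y, s) in enumerate(zip(x_noisy, y_noisy, xor_ab)):
        if s == 1 and x == y:
            x_fixed[i], y_fixed[i] = 0, 1
    return x_fixed, y_fixed


# Example from this presentation: a = all zeros, b = all ones.
a = [0] * 7
b = [1] * 7
xor_ab = [ai ^ bi for ai, bi in zip(a, b)]  # stands in for the decoded XOR word
x_noisy = [0, 0, 0, 1, 0, 0, 0]
y_noisy = [1, 1, 1, 1, 1, 1, 1]
print(correct_easy(x_noisy, y_noisy, xor_ab))
```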

Correction – hard part

[Figure: two gates with inputs a, b, each with failure probability p; outputs x and y]

if noisy (x, y) = (1, 0) or (0, 1)
then correct (x, y) is in {(0, 0), (1, 1)}

Question: Is it always possible to decode t-logic errors using a t-error correcting code?

Ans: NO

Proof:
a = 1 0 0 0 0 1 1
b = 1 0 0 1 1 0 0

x = 1 0 0 0 0 0 0
y = 1 0 0 0 0 0 0

x = 0 0 0 0 0 0 0
y = 1 0 0 0 0 0 0

71

Coverage results

Given (a, b), is there a pair (c, d) such that

a.b = xor(c.d, e)
a + b = xor(c + d, e)

with e (error) of small weight?

Results:
• Hamming [7,4,3] – 44% of (a,b) have some (c,d) interfering

• ReedMuller [16,5,8] – Tolerates two logic compute errors in each block (obtained by simulation)

Results so far …

• Showed a novel coding and gate-diversity method which detects and corrects a controlled number of errors

• Coverage results are promising

72

Conclusions & future work

• Exploring more complicated gate-network
• Proving feasibility of general bounded distance codes
• Efficient decoding methods
• Thinking about the overhead!

Power/Area Minimization of UWB Digital Baseband

Albert H. Chang, Rach Liu

University of California, Berkeley
EE 241 Spring 2006 Final Presentation

May 9th 2006

73

Motivation

Study of Power and Area Efficiency
- Design Example: UWB communication
- Efficiency is needed due to high throughput

Explore Micro-Architectural Techniques
- Parallelism
- Time-Multiplexing
- Pipelining

The effect of voltage scaling on:
- Active Power
- Leakage Power
- Overall Power

Scope of Project

Design Driver: UWB Digital Baseband
- Pulsed radio approach [1]

Goal: optimize power and area
- Fixed throughput constraint
- Simulation for all 3 corners

Simulink-to-Silicon Design Environment
- Functional verification of the algorithm
- Circuit-level characterization

[1] Mike Chen, “Ultra Wide-Band Baseband Design and Implementation,” M.S. Thesis, UC Berkeley, 2002

74

Design Methodology

Model & test the design in Simulink/XSG

ELDO Simulation of 90nm Technology
- Understand Power-Delay tradeoff for all corners

Run BWRC/INSECTA digital design flow for an FIR filter @ 1V to estimate Power/Area

Extrapolate to other points: P = 1, 2, 4, 8, 16
- Use Power-Delay tradeoff curve

Find the optimal level of parallelism and Vdd
- For the best Power/Area Tradeoff

Simulink Design Exploration

[Figure: Simulink design; labels: 1GHz, 64 taps]

75

Power of Major building blocks

Power Consumption Studies in terms of number of Slices

Case            Same frequency       Actual frequency
                Slices    Share      Slices           Share
PMF of F1       3496      26%        3496             84.86%
Others of F1    9980      74%        9980/16 = 623    15%
PMF of F2       4874      32.8%      4874             79.6%
Others of F2    9980      67.2%      9980/8 = 1247    20%
PMF of F16      31798     76%        31798            76%
Others of F16   9980      24%        9980/1 = 9980    24%

Same frequency: assuming everything running at the same frequency
Actual frequency: depending on the level of parallelism

Power of PMF block is about 80% of the overall system. Therefore, optimizing the PMF is the main task!
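The roughly 80% figure can be reproduced from the slice counts reported on this slide. The sketch below is a minimal check, assuming the flattened table reads as PMF slice counts of 3496, 4874, and 31798 for F1, F2, and F16 and 9980 slices for the other blocks, with the other blocks' count divided by 16/P at the actual frequency; the names are illustrative.

```python
OTHERS_SLICES = 9980
PMF_SLICES = {"F1": 3496, "F2": 4874, "F16": 31798}
PARALLELISM = {"F1": 1, "F2": 2, "F16": 16}
MAX_PARALLELISM = 16


def pmf_share(case, actual_frequency):
    """Fraction of the slice-weighted power attributed to the PMF block."""
    pmf = PMF_SLICES[case]
    others = OTHERS_SLICES
    if actual_frequency:
        # The slide scales the other blocks by 16/P: 9980/16 for F1, 9980/8 for F2.
        others = OTHERS_SLICES * PARALLELISM[case] / MAX_PARALLELISM
    return pmf / (pmf + others)


for case in ("F1", "F2", "F16"):
    same = pmf_share(case, actual_frequency=False)
    actual = pmf_share(case, actual_frequency=True)
    print(f"{case}: same frequency {same * 100:.1f}%, actual frequency {actual * 100:.1f}%")
```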

PMF Block, Extreme Case #1:F1 (fully time-multiplexed design)

“1x”

76

PMF Block, Extreme Case #2: F16 (fully parallel design)

Tradeoff between Throughput and Power Consumption

F1 synthesized in 90nm @ 1V, clock period = 3ns (fclk ≈ 333MHz). Estimates:
Active Power: 28.9mW
Leakage Power: 3.4mW

Power extrapolated from F1 results

Tradeoff between Throughput and Power for F1

Since there is only one level of parallelism, throughput = operating frequency

As the frequency and voltage increase, the overall power consumption increases because

Active Power ~ fclk*VDD^2

Leakage Power ~ VDD^3.3 (experimental data)
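The two proportionalities can be turned into a rough extrapolation from the F1 reference point. The sketch below is only an illustration of the scaling laws stated on this slide: it takes the 28.9mW / 3.4mW estimates at 1V and a 3ns clock period as the reference, and it does not model which frequency is actually achievable at a given voltage. The function name and the sample operating points are hypothetical.

```python
P_ACTIVE_REF_MW = 28.9   # F1 active power at 1 V, 3 ns clock period
P_LEAK_REF_MW = 3.4      # F1 leakage power at 1 V
V_REF = 1.0
F_REF_MHZ = 1e3 / 3.0    # 3 ns period, about 333 MHz


def f1_power_mw(vdd, f_mhz):
    """Return (active, leakage, total) power in mW scaled from the reference.

    Active power scales as fclk * VDD^2 and leakage as VDD^3.3.
    """
    active = P_ACTIVE_REF_MW * (f_mhz / F_REF_MHZ) * (vdd / V_REF) ** 2
    leakage = P_LEAK_REF_MW * (vdd / V_REF) ** 3.3
    return active, leakage, active + leakage


# Hypothetical operating points, for illustration only.
for vdd, f_mhz in ((1.0, F_REF_MHZ), (0.7, 200.0), (0.5, 100.0)):
    active, leakage, total = f1_power_mw(vdd, f_mhz)
    print(f"{vdd:.2f} V, {f_mhz:6.1f} MHz: active {active:5.2f}, leakage {leakage:4.2f}, total {total:5.2f} mW")
```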

77

Technology Characterization: ELDO FO4 Model

Extrapolate Data
Throughput normalized to 1 for the highest throughput @ each corner.
Normalize Delay to Delay @ 1V

Simulated F1 of 90nm @ 1V
Active Power: 28.9mW
Leakage Power: 3.4mW
Overall Power = 32.3mW
Throughput = 333MSamples/s

Reference Case: F1

[Figures: Delay (norm. to 1V), and Power (mW) vs. Throughput (norm.) for the reference case F1; corners SS, TT, FF]

78

Case Study: Parallelism P = 1, 2, 4, 8, 16

Active Power: ~ fclk*VDD^2

Leakage Power: ~ VDD^3.3, ~ P

Power increases as P increases

Power decreases as P increases

[Figures: active and leakage power curves for P = 1, 2, 4, 8, 16]

Need to look at VDD to better understand Pleakage…

A Closer Look: Leakage Power

Parallelism doesn’t help much as supply voltage saturates

The rate of voltage scaling decreases

As a result, F16 consumes more power here because the effect of block parallelism kicks in

79

Minimizing Overall Power

[Figure: overall power curves for F1, F2, F4, F8, F16, annotated with supply ranges V>0.5, 0.35<V<0.7, 0.3<V<0.4, 0.3<V<0.35, 0.25<V<0.3]

Choosing the Right Level of Parallelism

It is hard to compare different levels of parallelism
Therefore, normalize to F1 @ the same throughput

Find the appropriate level of parallelism
Throughput = “0.6” (0.6 = 200/333)

F2 has about 60% Power Saving

F4-F16 have about 70% Power Saving
There is no benefit from F8, F16

Will try F2 and F4

80

Parallelism 1, 2, 4, 8, 16: TT Corner

[Figure: Overall Power (Normalized to F1); curves: Active, Leakage, Total]

Design Choice: F4

Optimal Design Parameters

Final Voltage for F4 = 0.34V

[Figure: VDD and Power vs. Throughput]

Power reduction: 80%

81

Final Architecture

[Figure: block diagram with Serial to Parallel, PMF, Corr Block, and MAX stages; labels: 200 MHz, 12.5 MHz, 8~256, 20 bits, 320 bits, 16x15 bits, 16x23 bits, 27 bits, 30 bits, (8 blocks), (parallelism 4), pad limit; the PMF could use 1~16 different parallelism levels]

Final Voltage: 0.34V
Power: 6mW

Power Saving: 80% compared to F1

Throughput: 200 MSamples/s

Future Work

Tape out the Design
Experimental Verification

82

Limited Switch Dynamic Logic: VDD scaling

Josephine Chang

EE241, Spring 2006, final project

Prepared for Dr. Rabaey

Limited Switch Dynamic Logic

[Figure: Dynamic logic and LSDL circuits]

-Faster and smaller Co stack than static → allows Vdd scaling

-Lower power and more robust than dynamic → better power & variability immunity

83

Full Adder Test Circuit

[Figure: LSDL full adder]

-Co logic in dynamic; S logic in static

-A, B, and S loaded with inverters; Cout loaded with Ci

-Functionality verified with 45 input patterns
-Power measured with 1000 input patterns

-Delay measured from Ci to Co

-Compared to static CMOS & Dual rail domino

Optimized for PDP at 0.8V

[Figures: PDP (ns*uW), tprop (ns), and average power (uW) vs. W (nm); curves: Static, Dynamic, LSDL]

84

Vdd scaling

[Figures: PDP (ns*uW), tprop (ns), and average power (uW) vs. Vdd (V); curves: Static, Dynamic, LSDL]

VT variation at 0.65V

L = 39nm, 40nm, 41nm, 42nm, 43nm, 44nm, 45nm

L=45nm, VT=.24V (nom) L=44nm, VT=.225V

L=40nm, VT=.02V L=39nm, VT=-.02V

Dynamic failure:

LSDL failure:

L=43nm , VT=.21V

85

Variations on LSDL

Basic
Clocked feedback
Feed forward pulse
Controlled-Load Keeper

Summary

• LSDL power dissipation approaches static CMOS at low VDD.

• LSDL propagation delay worse than CMOS, but comparison was unfair (worst case for LSDL is much worse than nominal)

• 1-bit FA is perhaps not the best showcase demo for LSDL

86

Full Adder Test Circuit

Static Logic
Dynamic logic

That’s all!

• Excellent job …

Time for summer!
