EE241 - Spring 2006: Advanced Digital Integrated Circuits
Lecture 27+28: Final Project Presentations
Study of Subthreshold SRAM Operation using FinFETs
EE241 Final Project
Anupama Bowonder, Pratik Patel
Subthreshold SRAM Operation
Motivation behind Subthreshold SRAM Operation:
• Large area of chip devoted to SRAM cache
• SRAM supply voltage needs to scale with logic supply voltage
• Reduce leakage by operating at DRV during hold
• Reduce active power consumption
Impediments to Subthreshold SRAM Operation:
• Supply voltage scaling degrades cell stability
• Scaling increases sensitivity to process variations
– Variation-induced local asymmetry, degraded SNM
– Spread in SNM over the whole SRAM array
• Impact of soft errors more significant at lower supply voltages
Previous Work
Leakage power reduction schemes:
• Offset non-scalability of supply voltage
• Cell supply reduced to DRV during standby (H. Qin et al., 2004)
• Body biasing to control Vt of bulk transistors during standby (K. Osada et al., ISSCC 2003)
(K. Itoh, Hitachi)
Previous Work (cont’d): Increasing Cell Stability
(L. Chang, 2005) (Z. Guo, 2005)
• Cell stability vs. area tradeoff
• Not disturbed during a read
• Cell area = 0.41 um2
• 14% SNM improvement with 13% area penalty
• Fin rotation to increase pull-down strength and improve SNM
[Figures: butterfly curves with Vsn2 (V) axes]
FinFETs for Subthreshold Operation
• Better subthreshold swing: improves Ion/Ioff in subthreshold
• Lower Ioff and hence lower leakage power
• Undoped channel: no random dopant fluctuation
• Back-gate Vt tuning

FinFET parameters:
Gate length: 35 nm
Gate oxide thickness: 20 Å
Body thickness: 20 nm
Gate work function (φm): 4.5-5 eV
Channel doping: 10^16 cm^-3
Fin height: 30 nm

[Figure: FinFET structure with Gate1, Gate2, Source (S), Drain (D), Tox, Tsi; FDSOI]
[Figure: Ion/Ioff at Vdd = 0.3 V (log scale, 1.00E+01 to 1.00E+05) vs. work function (4.4 to 5.2 eV) for NFINFET and PFINFET; arrows mark decreasing Vt for NFIN and PFIN]
Standby Power of FinFET SRAM
Standby Power = Ileak ⋅ Vdd

Vdd (V)    Standby Power
1.0        7.493e-9
0.8        3.662e-9
0.5        1.002e-9
0.4        5.91e-10
0.3        3.189e-10

Leakage power reduction:
From 1 V to 0.3 V = 95.7%
From 1 V to 0.4 V = 92.1%
From 1 V to 0.5 V = 86.6%
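The reduction percentages on this slide can be reproduced from the standby power table. The sketch below assumes the table values are in the same units at every Vdd; the function name `reduction_percent` is illustrative and not from the slides.

```python
# Standby power values copied from the slide's table (units as given there).
standby_power = {
    1.0: 7.493e-9,
    0.8: 3.662e-9,
    0.5: 1.002e-9,
    0.4: 5.91e-10,
    0.3: 3.189e-10,
}

def reduction_percent(v_from, v_to, table=standby_power):
    """Percentage drop in standby power when Vdd is lowered from v_from to v_to."""
    return 100.0 * (1.0 - table[v_to] / table[v_from])

for v in (0.3, 0.4, 0.5):
    print(f"1.0 V -> {v} V: {reduction_percent(1.0, v):.1f}%")
```

Rounded to one decimal place, the three results agree with the 95.7%, 92.1% and 86.6% quoted on the slide.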
Data Retention Voltage of 6T FinFET SRAM
[Figure: hold butterfly curves, Vsn2 vs. Vsn1 (0 to 0.5 V), for Vdd = 0.5V, 0.4V, 0.3V, 0.2V, 0.1V; DRV (60mV)]
[Figure: cross-coupled inverters PL/PR and NL/NR at VDD storing 1 and 0, with leakage currents Ioffp and Ioffn]
Read Stability of 6T FinFET SRAM
[Figure: 6T SRAM SNM vs Vdd (Read Stability and Hold Stability); SNM (mV), 0 to 250, vs. Vdd (V), 0 to 0.6; curves: Dynamic Read, Non Dynamic Read, Hold]
[Figure: SNM vs Beta Ratio; SNM (mV), 20 to 40, vs. beta ratio, 0 to 6; labeled points 29, 33, 37]
[Figure: 6T cell in read condition with WL (VDD) and both BL (VDD); ACL/ACR, PL/PR, NL/NR; read butterfly curve Vsn1 (V) vs. Vsn2 (V), 0 to 0.3]
Current has a linear dependence on size but an exponential dependence on Vt change
>2x
Cell Write Ability
• N-curve: one of the few metrics used to determine whether the cell is writeable (C. Wann et al., 2005)
• Writeable cell: positive current at the node you write
[Figure: Cell write ability as a function of work function; I(vq) (A), 0 to 7.00E-06, vs. work function (eV), 4.55 to 4.8; region marked "Writeable"]
[Figure: N-curve; I(vq) (A), 0 to 2.00E-05, vs. V(q) (V), 0 to 0.35]
[Figure: 6T cell in write condition with WL (VDD), BL (VDD) and BL (VSS); ACL/ACR, PL/PR, NL/NR; node V(q)]
Impact of Process Variation on Read Stability
3σ Lg = 3σ Tsi = 12% Lg (ITRS 2005)
[Figure: log-scale curves (1.00E-10 to 1.00E-02) over 0 to 1.2 for nominal, bestlg, worstlg, smallesttsi, largesttsi; annotations 96.6% and 90.1%]
Statistical Variations in SNM
[Figure: Read SNM Monte Carlo; SNM distribution density (0 to 9) vs. SNM (mV), 50 to 100]
Mean = 74mV; Sigma = 10mV
The project formerly known as
Design Considerations of Logic Operating in the IC=1 Regime
An EE241 Class Project by C. Marcu, M. Mark, J. A. Richmond
UC Berkeley, Spring 2006
Introduction / Motivation / Outline
• Strong and weak inversion have been extensively studied
• Moderate inversion might be a sweet spot for energy/delay trade-offs
• Modeling, sensitivities to parameter variations, optimization of test circuit
• Is there actually any advantage to IC=1?
Modeling the Inverter
Based on EKV
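\[ T_d = \frac{C \cdot V_{DD} \cdot k_{fit}}{IC \cdot I_S} \]
\[ E_{OP} = V_{DD} \cdot I_{leak} \cdot T_d \cdot L_P + \alpha \cdot C \cdot V_{DD}^2 \]
\[ V_{DD} = \frac{V_T + 2 n U_T \ln\left(e^{\sqrt{IC}} - 1\right)}{1 + \sigma} \]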
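A minimal sketch of how the three inverter-model expressions could be evaluated together. It assumes the slide's formulas read as T_d = C·V_DD·k_fit/(IC·I_S), E_OP = V_DD·I_leak·T_d·L_P + α·C·V_DD², and V_DD = (V_T + 2·n·U_T·ln(e^√IC − 1))/(1 + σ). The function names and every numeric value in the usage lines are hypothetical and are not taken from the project.

```python
import math

def supply_voltage(ic, v_t, n, u_t, sigma):
    """V_DD that places the device at inversion coefficient ic (EKV-style interpolation).

    sigma is the DIBL-like factor in the denominator of the assumed expression.
    """
    return (v_t + 2.0 * n * u_t * math.log(math.exp(math.sqrt(ic)) - 1.0)) / (1.0 + sigma)

def delay(c, v_dd, k_fit, ic, i_s):
    """Gate delay T_d = C * V_DD * k_fit / (IC * I_S)."""
    return c * v_dd * k_fit / (ic * i_s)

def energy_per_op(v_dd, i_leak, t_d, l_p, alpha, c):
    """Leakage energy over l_p gate delays plus switching energy alpha * C * V_DD^2."""
    return v_dd * i_leak * t_d * l_p + alpha * c * v_dd ** 2

# Hypothetical numbers, for illustration only.
v_dd = supply_voltage(ic=1.0, v_t=0.3, n=1.3, u_t=0.026, sigma=0.05)
t_d = delay(c=1e-15, v_dd=v_dd, k_fit=1.0, ic=1.0, i_s=1e-6)
print(v_dd, t_d, energy_per_op(v_dd, 1e-9, t_d, 10, 0.1, 1e-15))
```

Sweeping `ic` in such a sketch is one way to see how delay and energy move between weak, moderate and strong inversion.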
Sensitivity to Parameter Variations
Optimization: Adder
Simple full-adder using NAND & INV only
Modeling of Complex Gates: NAND
Logical effort
Leakage of transistor stack
Optimization: Software
• Based on work done by Dejan Markovic
• Extended to moderate and weak inversion by use of our model
• Optimizes VDD, VT and gate sizing to minimize energy for a given delay and activity factor
Results: Minimum EOP vs. Delay
• Delay and energy normalized to minimum delay and its corresponding maximum energy
• Significant energy savings within strong inversion
• Very little energy savings from moderate to weak inversion
• Higher potential for energy savings with lower activity factors
Results: Optimizing VDD, Vt, W
Results: What Knob to Turn?
Optimizing different parameters at a time
Conclusion
• Ultimate design goal tends to be minimizing energy for a given delay
• Our work provides the tools to optimize designs across all regions of operation
• Operating in IC=1 is therefore a result of the optimization, not a design target
• In fact, IC=1 is neither better nor worse than any other region for a digital designer, so don’t be afraid of it
A Self-Timed, Tunable State Machine Using Low Energy Pass Transistor Library
Matthew Pierson, Salman Suharwardy – Electrical Engineering and Computer Science - UC Berkeley
[Figure: state machine block diagram with Inputs, Next State/Output Logic, State Elements, Outputs]
Problem and Motivations (Prior Work)
• Increased focus on low energy logic and regular structures
• Recovering performance at low supply voltages requires threshold scaling
• Low leakage architecture needed so that leakage does not dominate
• Regular structures needed to cope with manufacturing variability
[Figure: Configurable Logic Block with Inputs and Stack]
“Design Considerations in CLBs for Deep Sub-Micron Technologies,” Louis Alarcon and Octavian Florescu, EE241 Spring 2005.
Problem and Motivations (Our Work)
• Until now, CLB library only applied to clocked data path.
• Two different clocks required
• Control logic not yet explored
Our goal: Adapt this library to create a Self-Timed, Tunable State Machine
Why?
1. Explore applicability to control logic
   State machine is most general form of control logic
2. Asynchronous gains
   No need for two clock networks that contribute heavily to energy dissipation
   Robustness to manufacturing variations
Solution
Start with Self-Timed Pipeline
[Figure: self-timed pipeline, stages i and i+1: C-elements, Start_i, Done_i, Stack_i, SA_i, En_i, !En_i+1, !En_i+2, Start_i+1, Done_i+1, Stack_i+1, SA_i+1, En_i+1]
Evolved Into Optimized Pipeline
[Figure: optimized pipeline, stages i and i+1: C-elements, Done_i-1, Start_i, Done_i, Stack_i, SA_i, En_i, !En_i+1, !En_i+2, Start_i+1, Done_i+1, Stack_i+1, SA_i+1, En_i+1]
Adapting the Library
1. Cross-coupled weak NMOSes added to stack outputs
2. Must generate Done signal in the stack. Stack is a low-swing network because of NMOS pass transistors, so a delay line is really the only option.
[Figure: Done generation circuit between Vdd + DeltaV and Ground, with Sense Amp Complete input]
PMOS transistors give us full swing on the Done signal, which is necessary for correct C-element operation at low supply/threshold voltages
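The pipeline handshake relies on C-elements. As a behavioral reminder, a two-input Muller C-element drives its output to the common input value when both inputs agree and otherwise holds its previous state. The sketch below models only that logical behavior; the class name is illustrative and the model says nothing about the transistor-level implementation used in the project.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial  # stored output state

    def update(self, a, b):
        # Output follows the inputs only when they agree; otherwise it holds.
        if a == b:
            self.out = a
        return self.out

c = CElement()
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    print(a, b, c.update(a, b))
```

The hold behavior is why a full-swing Done signal matters: a degraded input level could be read as disagreement or agreement at the wrong time.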
Self-Timed State Machine
[Figure: self-timed state machine: pipeline stages with C-elements, Start_i+1, Done_i+1, Stack_i+1, SA_i, En_i, !En_i, !En_i+1, !En_i+2, SA_i+1, Stack_i+2, Start_i+2, Done_i+2, SA_i+2, En_i+2, closed through a State Element (StartS, DoneS, StackS, SAS, En, OutS) with Inputs and InputsDone]
Results from Self Timed Machine
But what if we don’t want to go so slow at 300 mV supply voltage…
Threshold Voltage Scaling
Original clocked library implemented body bias threshold scaling and artificial threshold scaling…
Body bias: only works for about 50 mV of threshold scaling before body-source junction current becomes noticeable.
Artificial threshold scaling: shifting up the output of the sense amplifier by DeltaV.
[Figure: SA supplied from Vdd + DeltaV; output swings between 0 and Vdd + DV]
Dynamic power remains the same
Making it work for the asynchronous version:
[Figure: C-element, Stack, SA_i, En_i; EN and Done swing 0 to Vdd + DV; Delay and Root swing 0 to Vdd; Done taken from internal node, 0 -> Vdd+DV; Out]
Tuned Energy Range
Tuned Delay Range
Questions?
Platform-Based SRAM Standby Power Minimization
Xuening Sun
Tsung-Te Liu
David Chen
UC Berkeley EE241 Term Project, Spring 2006
EE241 – Project
Outline
Motivation
Prior Work
Global Optimization
Conclusion
Motivation: Scaling, Scaling, Scaling …
[Figure: SRAM surrounded by Noise, Leakage, Variability, Velocity Saturation]
“Where is the limit?”
Motivation: Standby Power Optimization
[Figure: 6T cell with design variables Wp, Lp, Wn, Ln, Vsbp, Vsbn; constraints on Access Time, Circuit Area, Noise Margin ⇓ Minimum Power @ Standby]
Prior Work: DRV Measurement and Parametric Analysis
[Figure: chip 1, DRV mean (mV), 120 to 220, vs. Vpb (V), -0.4 to 0.4, for Vnb = -0.4, -0.2, 0, 0.2, 0.4]
[Figure: DRV mean (mV), 120 to 220, vs. Vnb (V), -0.4 to 0.4, for Vpb = -0.4, -0.2, 0, 0.2, 0.4]
[Figure: DRV mean + 5*std (mV), 130 to 220, vs. Lp (um), 0.1 to 0.3, for Ln = 0.1, 0.15, 0.2, 0.25, 0.3]
[Figure: DRV mean + 5*std (mV), 130 to 220, vs. Ln (um), 0.1 to 0.3, for Lp = 0.1, 0.15, 0.2, 0.25, 0.3]
[Figure: DRV mean + 5*std (mV), 190 to 250, vs. Wp (um), 0.2 to 0.5, for Wn = 0.325, 0.27, 0.215, 0.16, 0.105]
[Figure: DRV mean + 5*std (mV), 190 to 250, vs. Wn (um), 0.1 to 0.3, for Wp = 0.2, 0.3, 0.4, 0.5]
Courtesy of Huifang Qin, 90nm SRAM Test Chip: DRV and Leakage Measurement, 03/02/2006
Global Optimization: Design Platform
[Figure: design platform spanning Circuit and System levels, with parameters N, W/L, A, vsbn]
[Equation, from [3]: DRV = a1 log( ... ), a fitted model in wp, lp, ln, vsbn, vsbp with constants α1, β1, β2, γ1 to γ4, φ1 to φ4, k1, k2]
\[ I_D = i_S \exp\left(\frac{v_{GS} - V_T}{n v_{th}}\right)\left(1 - \exp\left(\frac{-V_{DD}}{v_{th}}\right)\right) \quad [2] \]
[Equation, from [1]: read time tread, a model in KSW, CBL0, KBLSW, Kread, SW, SN with constants A, B, C]
\[ \text{Area} = \sum_i (W_i \times L_i) \]
\[ \text{Power} = DRV \times \sum_j I_{Dj} \]
\[ \text{Performance} = f(W/L) \]
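The platform's cost functions can be sketched directly from the expressions on this slide: the subthreshold current from [2], Power = DRV × Σ I_D, and Area = Σ (W × L). The function names and all numeric values below are hypothetical; the fitted DRV and read-time models are not reproduced here.

```python
import math

def subthreshold_current(i_s, v_gs, v_t, n, v_th, v_dd):
    """I_D = i_S * exp((v_GS - V_T) / (n * v_th)) * (1 - exp(-V_DD / v_th))."""
    return i_s * math.exp((v_gs - v_t) / (n * v_th)) * (1.0 - math.exp(-v_dd / v_th))

def standby_power(drv, leakage_currents):
    """Power = DRV * sum of the per-device leakage currents."""
    return drv * sum(leakage_currents)

def cell_area(devices):
    """Area = sum of W * L over all devices, given as (W, L) pairs."""
    return sum(w * l for w, l in devices)

# Hypothetical operating point, for illustration only.
i_off = subthreshold_current(i_s=1e-7, v_gs=0.0, v_t=0.3, n=1.5, v_th=0.026, v_dd=0.2)
print(standby_power(0.2, [i_off, i_off]), cell_area([(0.2, 0.1), (0.1, 0.1)]))
```

An optimizer would evaluate functions like these over the sizing and body-bias variables while checking the area and performance constraints.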
Global Optimization: Program Demo
[Figure: 6T cell annotated with Wp/Lp, Wn/Ln, Vsbp, Vsbn]
Global Optimization: DRV
Lower delay costly;
P/NMOS asymmetry ⇒ higher DRV;
Smaller area possible;
Lower sensitivity at looser constraints;
Global minimum DRV: 35mV (74% reduction) @ Wn/Ln=0.1u/186u, Wp/Lp=354u/186u, Vsbn=0.4V, Vsbp=−0.4V.
Global Optimization: DRV Sensitivity
[Figure: DRV sensitivity; marker indicates Standard-Cell]
Global Optimization: Power
Power and DRV require different sizing;
Performance and area equally influential;
Insensitive to area beyond standard-cell size (next page);
Global minimum power: 0.3nW (70% reduction) @ Wn/Ln=0.1u/0.1u, Wp/Lp=0.2u/0.1u, Vsbn=Vsbp=0.
Global Optimization: Power Sensitivity
[Figure: power sensitivity; marker indicates Standard-Cell]
Conclusion (and Beyond)
A platform-based SRAM standby power optimization method is presented;
Global optimization yields lower DRV and standby power than single-dimensional parametric search;
Minimum power and minimum DRV require different sizing;
Sizing for globally minimum power exists with 70% power reduction, 5% area reduction, but 80% performance loss;
Presented method can be scaled to incorporate more design variables such as access transistor size and bit-line pre-charge;
Presented method can be scaled to incorporate process variation, where statistic-based modeling is a key component (primitive analysis on final report).
Acknowledgement
Special thanks to Huifang Qin for generous support, ranging from test data and documents to technical discussion and evaluation;
Thanks to Rakesh Vattikonda and Yu Cao for providing statistical models;
Thanks to Dejan Markovic for early discussion on scope of work and help with MATLab;
Thanks to Yanmei Li for discussion on platform-based design.
Key references:
[1] H. Qin, Y. Cao, D. Markovic, A. Vladimirescu, and J. Rabaey, “SRAM Leakage Suppression by Minimizing Standby Supply Voltage,” ISQED, 2004
[2] B.H. Calhoun and A. Chandrakasan, “Analyzing Static Noise Margin for Sub-threshold SRAM in 65nm CMOS,” ESSCIRC, 2005
[3] R. Vattikonda, “Modeling DRV of a SRAM Cell Under Variations,” Research Notes, April 2006
Appendix
More on Statistic-Based Optimization
[Flowchart: start; DRVMEAN model; optimal sizing for minimum DRVMEAN; DRVMAX conditions; update DRV model; optimal sizing for minimum DRVMAX; match? (Yes: end; No: repeat)]
“How about power… ?”
1st-Order Verification
By fixing W and L to standard size, the presented optimization platform reduces to single-dimensional parametric sweep: Vpb and Vnb.
dots: measured DRV from the 90nm test chip;
line: parametric sweep using the reduced platform;
color code: same color (dots and line) represents same bias condition.
Ultra Low Power Clock Generation using Sub-Threshold MCML
Asako Toda • Anurag Pandit • Khang An Tran
EE241 FALL Final Project Presentation
MCML in sub-Vth
Good
• Leakage
• Ultra Low Power
• Robust
• Easy
Bad
• Slow ?
• Big ?
• Model ?
Current Mode Logic (CML) in Sub-Vth
[Figure: sub-Vth CML gate with Replica Bias; tail current Iss ~ nA (10nA); load resistance ~10MΩ (20MΩ); swing ~100mV; output levels H = VDD, L = VDD–200mV]
Input – Output
[Figure: out+ and out- vs. ∆Vin in sub-Vth; threshold ∆Vin_th = 3 x nVt = 117mV; annotations Vdsat and 3 x Vt]
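The 117 mV input threshold follows from ∆Vin_th = 3 x nVt. The sketch below assumes Vt here is the thermal voltage (taken as 26 mV) and a subthreshold slope factor n = 1.5; neither value is stated on the slide, and they are chosen because together they reproduce the quoted figure.

```python
def input_threshold(n, v_thermal, factor=3.0):
    """Differential input needed to fully steer the tail current: factor * n * Vt."""
    return factor * n * v_thermal

# Assumed values: n = 1.5, thermal voltage 26 mV (not given on the slide).
print(f"{input_threshold(1.5, 0.026) * 1e3:.0f} mV")
```

With a different slope factor the threshold scales linearly, which is one reason the later variation analysis adds margin on top of Vin_th.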
Input – Output Model
[Figure: out+ and out- vs. ∆Vin; annotation VDD – 27mV]
[Figure: equivalent model with tail current Iss (saturated?); Iout vs. ∆V rising from 0 to 1/a]
Iout = Iss = 1/a @ ∆V: big
Iout = 0 @ ∆V = 0
Input – Output Model Verification (Mathematica): Yes !
[Figure: model vs. simulation; annotation VDD – b x Iss]
Variation & Mismatch
a ~ 1 / Iss
a, b ~ Lp, 1/Wp
[Figure: gate with PMOS Load, NMOS input, Current Source Iss (labels 1, 2, 3); Source Voltage from 0 to VDD]
Variation
[Figure: Input-Output curves with worst cases marked]
Sensitivity (Example: NMOS, PMOS: min size)
Parameter   Sens   ∆Vod (10%)
Iss         0.4    40 mV
a           0.3    30 mV
b           0.13   13 mV
Constraint
Vin > Vin_th = 3 nVt
Vin = Vin_th + margin + ∆Vod
Matching
Constraint: Vos << Vos_limit
[Figure: Vos relative to Vos_limit; level VDD – ∆u]
Matching: Worst Case
pair ratio: r << 10%
Vos << Vos_limit
[Figure: Vos vs. r, with levels ~50mV and ~10mV]
PDP
[Figure: Sub-Vth MCML vs. Static CMOS as a function of N (number of stages): Delay, Power, PDP, EDP; td [sec] labels 1.59n (tp0), 3.10n, 7.56n and 30p (tp0), 47p, 34p; axis labels 1, 40f]
[Figure: PDP_MCML vs. PDP_INV; VDD=0.36]
Frequency Divider (VDD = 0.5V, Freq = 20MHz)
Fujishima, et al, JSSC, 1993
MCML type: 7fJ/cyc
CMOS Static: 5fJ/cyc
Summary
Good
• PDP in Ultra Low Voltage
• Robust in Matching, Noise, and EM.
• Easy
Bad
• Slow in Middle Low Power
• Variation
• Cascading
• Big
Analysis of Razor
Timothy Loo, Vincent Ng
University of California, Berkeley
EE241 Spring 2006
5/8/2006
Motivation
Always-correct logic is not efficient
D. Ernst, N. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation”, MICRO-36, December 2003
Solution
Razor – an error correction mechanism
Challenges
• Optimal supply voltage strongly dependent on the program and input data
• Hold time increased by the amount of delay of the delayed clock
• Meeting both the setup time constraint and the hold time constraint is a challenge
• A strong tradeoff between potential benefit and buffers needed
• Need to analyze every path in the logic
• Will benefit increase or decrease with increased process variation?
Analyzing the Problem
• Add model corners using EE240 and ST technology models.
• Implement and verify a pipeline to simulate data propagation through an adder, a regular flip-flop, a multiplier, and a Razor flip-flop.
• Simulate circuit to determine longest and shortest paths to see how they affect Razor operation
Multiplier Path Variation
[Figure: Simulated vs Extracted Data; % change from TT, 10 to 40, vs. technology (90, 65, 45); series SS-ST, FF-ST, SS-240, FF-240]
[Figure: Shortest-Longest Path Variation; % variation, 135 to 165, vs. technology (90, 65, 45)]
Effects of Variation on Razor Efficiency
If the Razor shadow latch detects errors after 1/2 a clock cycle, the period can be 2/3 * Longest Path instead of 1!
[Figure: longest path divided into thirds: 1/3, 1/3, 1/3]
Limitation
Shortest path constraint may necessitate the use of delay buffers to prevent data corruption
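The 2/3 figure can be read as follows, under the assumption that the shadow latch samples half a period after the main clock edge: the longest path then only has to settle within 1.5 clock periods, so the period can shrink to Longest Path / 1.5. The function name and the generalization to other shadow delays are illustrative.

```python
def razor_clock_period(longest_path, shadow_delay_fraction=0.5):
    """Shortest period such that the longest path settles before the shadow latch samples.

    The shadow latch is assumed to sample shadow_delay_fraction * period after
    the main clock edge, so period * (1 + fraction) must cover the longest path.
    """
    return longest_path / (1.0 + shadow_delay_fraction)

print(razor_clock_period(3.0))  # period for a longest path of 3 time units
```

Under the same reading, any path shorter than the shadow delay could overwrite the shadow latch with next-cycle data, which is the shortest-path limitation noted on the slide.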
Additional Power Use
Two cases:
• Logic can be modified such that shortest path is increased without increasing longest path
• Add buffers in logic to increase shortest path

Energy Penalty
Tech   Energy
90     124fJ
65     245fJ
45     546fJ

90nm Total Power Simulation
Energy w/o Razor: 0.9pJ
Energy w/ Razor, w/o overhead: 0.7pJ
Energy w/ Razor, w/ overhead: 1.3pJ
Note: overhead may be amortized with larger logic blocks

Additional Clock Period
Worst case: logic cannot be modified internally, therefore adding buffers increases both shortest and longest path delay. Clock period savings are reduced.
Boundary requirement: Shortest Path > Longest Path / 6. Otherwise, the adjusted clock period (to satisfy the shortest path constraint) will be longer than the Longest Path!
Razor Efficiency
None of the 90, 65, 45nm models pass the shortest path requirement.

Tech   % Increase   % Over Original Clock Period
90     53%          2.2%
65     52%          1.3%
45     68%          12.0%

Setup Time Variations
[Figure: Setup Times vs Technology; % of total clock period, 0 to 25, vs. technology (90, 65, 45); series Main FF Setup Time, Shadow FF Setup Time]
Conclusion
• Razor topology is susceptible to shortest-path signals corrupting the shadow latch.
• Variations increase dramatically from 65 to 45nm technology.
• Benefits of Razor decrease as variations increase.
Soft Error Tolerant Logic Circuit Design
Mohammad Amin Arbabian, Debopriyo Chowdhury
What are soft errors?
Transient faults caused by high energy particles
Sources: Alpha particles from packaging material, thermal and fast neutrons from cosmic rays
[K.J. Hass et al., MWSCAS 1999]
Should we really care about SEE?
What really limits scaling? Wallmark [1962]: Power and SER
C↓ V↓ → Q↓↓ → SER↑↑
Source: Shivakumar P., IEEE 2002; N.J. Wang, UIUC
Upshot: Soft error rates per SRAM or latch bit grow slowly with scaling
But: The number of bits grows with Moore’s Law!
Caution: Custom, ASIC, FPGA designs that push the density envelope
Soft Errors in Logic Circuits
Single event transient (SET) vs upset (SEU)
[Figure: SET propagating through logic between clocked flip-flops (D, Q, Q’, CLK)]
SET in combinational circuits:
• Electrical masking
• Logical masking
• Latching window masking
[Figures: flip-flop pipelines illustrating each masking mechanism, with the Latching Window marked on CLK]
Source: Krishnamohan et al., IEEE 2004
Masking becoming ineffective in nanometer circuits
Available Circuit Techniques:
• Triple Modular Redundancy; Partial Error Masking / Cluster Sharing to decrease the area overhead
• Selective Node Engineering: adding capacitance, adding cross-coupled inverters, sizing
• Using timing slacks to resample data
• Conventional hardened latch
Huge area, power overhead
Large performance overhead
Complex clocking scheme
SET protection
Modeling of Soft Errors
Circuit response depends not only on deposited charge but also on the amplitude, duration and shape of the current pulse, so accurate models are essential for reliable results
Simulation done with inverter (NMOS: 0.51µ/90nm, PMOS: 0.88µ/90nm)
Total incident charge: 15fC (1mA, 15ps; 0.1mA, 150ps)
[Figure: two strikes with the same charge; one is labeled "Bit Flips!", the other "No Bit Flip!"]
Modeling Neutron Strike
Fast rise and gradual fall current waveforms:
• Trapezoid
• Double exponential
• Mix of linear and exp
Our choice:
\[ I(t) = \frac{2Q}{T\sqrt{\pi}} \sqrt{\frac{t}{T}} \exp\left(-\frac{t}{T}\right) \]
where Q is the deposited charge.
T fitted for 90nm process
Simulated by the piece-wise linear function in Cadence
[Figure: current pulse; Charge: 10fC, Peak Current ~300µA, 70ps]
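A small numerical check of the chosen pulse shape, assuming it has the form I(t) = (2Q/(T√π))·√(t/T)·exp(−t/T), which integrates to the deposited charge Q and peaks at t = T/2. The time constant of 16 ps used below is a hypothetical value picked so that a 10 fC strike peaks near the ~300 µA quoted on the slide; the fitted T for the 90nm process is not given.

```python
import math

def strike_current(t, q, tau):
    """Current pulse I(t) = (2q / (tau * sqrt(pi))) * sqrt(t / tau) * exp(-t / tau)."""
    if t <= 0.0:
        return 0.0
    return (2.0 * q / (tau * math.sqrt(math.pi))) * math.sqrt(t / tau) * math.exp(-t / tau)

def peak_current(q, tau):
    """The pulse peaks at t = tau / 2 (where d/dt of sqrt(t) * exp(-t / tau) is zero)."""
    return strike_current(tau / 2.0, q, tau)

def collected_charge(q, tau, t_end, steps=20000):
    """Midpoint-rule integral of the pulse from 0 to t_end."""
    dt = t_end / steps
    return sum(strike_current((k + 0.5) * dt, q, tau) for k in range(steps)) * dt

q, tau = 10e-15, 16e-12  # 10 fC from the slide; tau is an assumed value
print(peak_current(q, tau), collected_charge(q, tau, 40 * tau))
```

Sampling `strike_current` at a handful of time points is one way to build the piece-wise linear source mentioned on the slide.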
Our Solution:
• SETs hurt only if they are captured by the latch
• SEU protection of latch is critical
• Aggressive pipelining means less logic, more flip-flops
• If we can have a latch that filters the transients, why slow all the nodes?
Concept of error-filtering latch: design 2 new SET and SEU tolerant latches
Proposed Latch 1
[Figure: latch schematic with IN(d), Y, CLKBAR, MN1, MN2, MN3, MP1, MP2, MP3, DCDE, C-element]
• Adaptive SET protection and inherent SEU protection
• Very small area/power overhead
• Area Overhead: 3-5% including data-path
• Power Overhead: X 1.08
Adaptive Delay Control Unit
• Graph shows tradeoff between delay and QCRIT for Latch 1
• A four-bit Digitally Controlled Delay Element (DCDE)
• Delay variation: 30ps – 65ps in steps of 2-5ps
Possibility of adaptive soft error protection at run time
[Figure: QCRIT vs Delay for Latch 1; Qcritical (fC), 0 to 80, vs. delay (FO4 inverter), 0 to 2.5]
Proposed Latch 2
• Concept of feedback used
• Latch does not respond to transient spikes
• Less delay overhead with some power penalty
• Power Overhead: X 1.79
[Figure: latch schematic; label "Stronger"]
Simulation Results (1)
• Statistical (Monte Carlo type) simulation with random charge injection based on stochastic modeling of neutron energy transfer
• Latch tested with 4-bit ripple carry mirror adder
[Figure: Probability of Failure (%), 0 to 8, vs. Node Number, 1 to 12; series Nominal, Protected (Scheme1)]
[Figure: Probability of Failure (%), 0 to 90, vs. Node Number, 1 to 5; label Charge (fC)]
Total of 50,000 Random Charges
Average SER Protection:
• 212% improvement in critical charge
• 57 times less probability of failure
Simulation Results (2)
[Figure: Probability of Failure (%), 0 to 8, vs. Node Number, 1 to 10; series Nominal, Protected]
[Figure: Probability of Failure (%), 0 to 90, vs. Node Number, 1 to 5]
Average SER Protection:
• 239% improvement in critical charge
• 79 times less probability of failure
Similar improvement in reliability obtained using some ISCAS benchmark circuits as the data-path
What comes next?
• Scaling into sub-50 nm technology might be limited by soft errors if proper protection is not taken for logic circuits
• Proposed latch can be combined with node engineering techniques to yield enhanced protection for critical applications
• Soft error sensing and adaptive protection schemes are attractive
Impact of Logic Styles on a Viterbi Decoder in respect of Performance and Robustness
Ji-Hoon Park, Seung-Bum Suh
Topics and Contents
Topics
I. Design of Add-Compare-Select Unit with LVS Logic
II. Performance and Robustness Comparison with Different Logic Styles
Contents
I. Viterbi Decoder
II. Simulation Results – Performance and Robustness
III. Analysis – Delay and Variability
IV. Conclusion
Viterbi Decoder
[Figure: trellis butterfly: state metrics SM_i^n and SM_j^n update SM_k1^{n+1} and SM_k2^{n+1} through branch metrics BM_{i,k1}^n, BM_{j,k1}^n, BM_{i,k2}^n, BM_{j,k2}^n]
\[ SM_k^{n+1} = \min\left\{ SM_i^n + BM_{i,k}^n,\; SM_j^n + BM_{j,k}^n \right\} \]
• ACS Unit is a critical block in Viterbi Decoder
• Adder is a critical block in ACS Unit
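The state-metric recursion on the Viterbi Decoder slide is the add-compare-select (ACS) operation. A minimal behavioral sketch follows; the function name and the returned decision bit are illustrative, and the model ignores metric normalization and bit widths that a hardware ACS unit would need.

```python
def acs(sm_i, bm_ik, sm_j, bm_jk):
    """Add-compare-select for one trellis state k.

    Returns the new state metric min(sm_i + bm_ik, sm_j + bm_jk) and a
    decision bit (0 if the path from state i survives, 1 if from state j).
    """
    candidate_i = sm_i + bm_ik  # add
    candidate_j = sm_j + bm_jk  # add
    if candidate_i <= candidate_j:  # compare
        return candidate_i, 0  # select path from i
    return candidate_j, 1  # select path from j

print(acs(3, 2, 1, 5))
print(acs(4, 3, 1, 2))
```

Because the two additions feed the comparison directly, the adder delay sits on the critical path of the ACS unit, as the slide points out.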
Performance of ACSs and Adders
Carry Implementations
I. LVS (Low-Voltage Swing)
II. Dynamic Manchester
III. Static Manchester
Delay (adder) = 79.81 ps
Delay (ACS) = 148.41 ps
Robustness of Adders
LVS – Most Robust
• Clocking of Sense Amplifier
• Sampling Level of Pass Transistor Output
Analysis - Concepts
• Delay Expression (Linear Approximation)
\[ t_p = \int_{v_1}^{v_2} \frac{C_L(v)}{i(v)}\,dv \cong \frac{C_L (v_2 - v_1)}{4}\left(\frac{1}{i_1} + \frac{1}{i_2}\right) \]
• Variability Expressions
\[ \therefore \frac{\partial t_p}{\partial V_T} \cong \frac{C_L (v_1 - v_2)}{4}\left(\frac{1}{i_1^2}\frac{\partial i_1}{\partial V_T} + \frac{1}{i_2^2}\frac{\partial i_2}{\partial V_T}\right) \]
\[ \therefore \frac{\partial t_p}{\partial V_{DD}} \cong \frac{C_L (v_1 - v_2)}{4}\left(\frac{1}{i_1^2}\frac{\partial i_1}{\partial V_{DD}} + \frac{1}{i_2^2}\frac{\partial i_2}{\partial V_{DD}}\right) \]
[Figure: current vs. voltage with points (v_1, i_1) and (v_2, i_2)]
Analysis – Comparison with Simulations
[Figure: Delay (ps), 0 to 350, vs. Vth (V), 0.05 to 0.3; Delay Model vs. Simulation]
Analysis model and simulation results match
Analysis – Delay and Variability
[Table: closed-form expressions for t_p, ∂t_p/∂V_T and ∂t_p/∂V_DD in terms of V_DD and V_T for PARALLEL, SERIAL and PTL structures]
• Robustness: Serial Stack > Parallel >> PTL
Analysis – Vt and Vdd Sensitivity
[Figure: Normalized Vt Sensitivity; ∂t_p/∂V_T, 0 to 80, vs. Vdd (0 to 1.5) and Vt (0 to 0.4); surfaces PTL, Serial, Parallel]
[Figure: Normalized Vdd Sensitivity; ∂t_p/∂V_DD, 0 to 40, vs. Vdd (0 to 1.5) and Vt (0 to 0.4); surfaces PTL, Serial, Parallel]
Analysis – PTL Sampling Level vs. Robustness
PTL robustness is higher as the sampling level is lower
[Figure: Normalized Vt Sensitivity (PTL); ∂t_p/∂V_T, 0 to 80, vs. Vdd (0 to 1.5) and Vt (0 to 0.4), for sampling levels @0.2V, @0.3V, @0.5V]
[Figure: PTL Vth Variation Simulation varying by Sampling Level; probability, 0 to 0.25, vs. delay (ps), 0 to 250; @0.2V]
Strong Robustness of LVS
Conclusion and Future Works
Conclusion
• LVS Logic Style
  - Outperforms in Performance and Robustness
  - Complex to Design (Clock Timing is Critical)
• Variation Analysis
  - Lowering the Sampling Level in PTL and Stacking the Devices Improve the Robustness
  - Explains the Robustness Differences between Logic Styles
  - Explains LVS Robustness
Future Works
• Develop Variability Analysis Models
  - Apply to Complex Gate Structures
  - Apply to Short-Channel Models
• Robustness vs. Clocking Strategy
Analysis and Design of High Performance and Low Power Current Mode Logic CMOS
Phillip Chin, Junjie Su, Xiaolan Zhong
Motivation
As technology scales, the following problems are prevalent:
• Increased leakage current
• Increased power consumption
• Reduced noise margins
Today’s circuits require high performance at low power while addressing the above problems.
Current Mode Logic (CML)
• In the past it was not used, because it consumed more power than voltage mode implementations.
• However, today, circuits operate at higher frequencies.
• CML can save a lot of power and offer better performance.
Current Mirror Difficulties
Recall MOS current equations:
Saturation:
\[ I_D = \frac{1}{2} k' \frac{W}{L} (V_{GS} - V_T)^2 (1 + \lambda V_{DS}) \]
Subthreshold:
\[ I_D = I_S\, e^{\frac{V_{GS}}{n kT/q}} \left(1 - e^{-\frac{V_{DS}}{kT/q}}\right)(1 + \lambda V_{DS}) \]
Lambda is increasing, so small changes in VDS have a bigger impact and the current mirror is more sensitive to changes in VDS.
CML Adder Failure
Fails with smaller technology, depends heavily on current mirror accuracy
New CML AND Gate
Inspired from the previous adderSizing is very importantCan replace CMOS AND gate
Robustness Comparison
At lower frequencies, static CMOS has a better noise margin than CML.
At higher frequencies, the two are nearly identical.
[Figures: CML AND gate at 1.25 GHz; static CMOS AND gate at 1.25 GHz]
Delay and Power Comparison
[Figure: Power (fW) vs. switching frequency (GHz); series: static CMOS, CML]
[Figure: Delay (ps) vs. switching frequency (GHz); series: static CMOS, CML]

Delay (ps)
Freq (GHz)   CMOS   CML
0.667        117    129
0.833        117    129
1            117    129
1.25         117    129

Power (fW)
Freq (GHz)   CMOS   CML
0.667        17.5   3.66
0.833        22     3.53
1            26.5   4
1.25         32.4   3.62
Static CMOS vs. CML
CML outperforms static CMOS
[Figure: Delay*Power (ps*fW) vs. switching frequency (GHz); series: static CMOS, CML]

Delay*Power (ps*fW)
Freq (GHz)   CMOS   CML
0.667        2047   471.8
0.833        2574   455.4
1            3106   516
1.25         3791   467
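The delay-power product can be recomputed from the delay and power values reported in the slides. The sketch below does that, with the numbers transcribed from the tables; the row layout and function name are illustrative assumptions. Small differences from the reported products (for example at 1 GHz) are presumably due to rounding in the slide data.

```python
# Each row: (frequency in GHz, CMOS delay ps, CML delay ps,
#            CMOS power fW, CML power fW), transcribed from the slides.
ROWS = [
    (0.667, 117, 129, 17.5, 3.66),
    (0.833, 117, 129, 22.0, 3.53),
    (1.0, 117, 129, 26.5, 4.0),
    (1.25, 117, 129, 32.4, 3.62),
]


def delay_power_products(rows):
    """Return (frequency, CMOS delay*power, CML delay*power) per row."""
    return [
        (freq, d_cmos * p_cmos, d_cml * p_cml)
        for freq, d_cmos, d_cml, p_cmos, p_cml in rows
    ]


if __name__ == "__main__":
    for freq, pdp_cmos, pdp_cml in delay_power_products(ROWS):
        print(f"{freq:5.3f} GHz  CMOS={pdp_cmos:7.1f}  CML={pdp_cml:6.1f}  "
              f"ratio={pdp_cmos / pdp_cml:4.1f}x")
```

Although CML is slower by about 10% in delay, its much lower reported power gives it the smaller product at every listed frequency.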
Conclusion
CML offers a possible new solution to the issue of scaling
The CML AND gate offers better overall performance than its static CMOS counterpart
Future work: optimize the current mirror and build bigger blocks
UC Berkeley Spring 2006
EE241 Final Presentation
Evaluation of Adiabatic Logic for New Process Technologies
Prof. Jan Rabaey
Karl Skucha, Babak Pahlavan, Arash Ghanadan
Motivation
Exploit property of Adiabatic Charging
$E = \frac{RC}{T}\, C\, V_{dd}^2$
Asymptotically Zero Power Dissipation
How close can we get?
What are the delay trade-offs?
Future Trends
Viable in new process technologies?
Benefits getting better or worse?
Possibility for mobile or ultra-low power applications.
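As a rough illustration of "how close can we get", the sketch below compares the adiabatic charging energy E = (RC/T)·C·Vdd² with the conventional ½·C·Vdd² dissipated when a capacitor is charged abruptly through a switch. The resistance, capacitance, and ramp times are hypothetical values chosen for illustration, and the adiabatic expression is only meaningful when the ramp time T is much longer than RC.

```python
def adiabatic_energy(r_ohm, c_farad, t_ramp_s, v_dd):
    """Energy dissipated when charging C through R with a ramp of length T."""
    return (r_ohm * c_farad / t_ramp_s) * c_farad * v_dd ** 2


def conventional_energy(c_farad, v_dd):
    """Energy dissipated when charging C abruptly from a fixed supply."""
    return 0.5 * c_farad * v_dd ** 2


if __name__ == "__main__":
    R, C, VDD = 1e3, 100e-15, 1.0  # assumed: 1 kOhm, 100 fF, 1 V (RC = 0.1 ns)
    e_conv = conventional_energy(C, VDD)
    for t_ramp in (1e-9, 10e-9, 100e-9):
        e_adia = adiabatic_energy(R, C, t_ramp, VDD)
        print(f"T={t_ramp * 1e9:6.1f} ns  adiabatic={e_adia:.2e} J  "
              f"conventional={e_conv:.2e} J")
```

The dissipation falls in proportion to 1/T, which is the delay trade-off the slide asks about. This simple model ignores leakage, which the later results show becomes dominant at low frequencies.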
Problem Statement
No formal study of adiabatic circuits below the 130nm technology node
No formal study of trends for new technology nodes such as 90nm, 65nm and 45nm
Power is one of the biggest issues
Are adiabatic circuits promising as technology scales down?
Design Families
Non-Adiabatic design family for comparison
Dynamic DCVSL
Adiabatic families
Positive Feedback Adiabatic Logic (PFAL)
Efficient Charge Recovery Logic (ECRL, aka 2N2P)
Design Families: Dynamic DCVSL
Reasons Used
Same switching probability as Adiabatic circuits
Similar structure (differential, cross coupled PMOS)
Design Families: 2N2P
Reasons Used
Easy to implement
Small area
Rail-to-rail swing
Operation
Adiabatic charging and discharging with clock
4 sinusoidal clocks, 90 degrees out of phase
Design Families: PFAL
Reasons Used
High Performance [1]
Rail-to-Rail Swing
Operation
Functional blocks assist adiabatic charging
4 sinusoidal power clocks
[1] A. Vetuli, S. Di Pascoli, and L. M. Reyneri, “Positive feedback in adiabatic logic,” Electronics Letters, Vol. 32, No. 20, Sep. 1996, pp. 1867ff.
Test Setup
Test Circuit: 4-input NAND decoder
Input Vector: 0101->1010->0101
Frequencies:
HF: 1GHz and 500MHz
MF: 250MHz and 100MHz
LF: 50MHz and 10MHz
VLF: 3MHz and 1MHz
Loads: 0, 50, 100, 150 fF
Sized NANDs and scaled voltage for lowest power dissipation
Maintain functionality and “full swing”
Energy vs. Delay Results (1/4)
DCVSL uses highest energy, 2N2P ~40%, PFAL ~22%
[Figure: Energy per cycle vs. delay, 180nm, 100 fF load; series: 2N2P, PFAL, DCVSL]
Energy vs. Delay Results (2/4)
DCVSL uses highest energy, 2N2P ~30%, PFAL ~25%
[Figure: Energy per cycle vs. delay, 90nm, 100 fF load; series: 2N2P, PFAL, DCVSL]
Energy vs. Delay Results (3/4)
DCVSL uses highest energy, 2N2P ~40%, PFAL ~30%
Leakage begins to dominate in low frequency for all three
[Figure: Energy per cycle vs. delay, 65nm, 100 fF load; series: 2N2P, PFAL, DCVSL]
Energy vs. Delay Results (4/4)
All fail in high frequency, and leakage dominates in low frequency
Results promising but likely unreliable due to (much) higher power consumption and voltage requirements for 45nm vs. other models
[Figure: Energy per cycle vs. delay, 45nm, 100 fF load; series: 2N2P, PFAL, DCVSL]
Trend Results (1/2)
Adiabatic gain= (energy for DCVSL) / (energy for Adiabatic family)
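The gain defined above maps directly onto the percentage savings quoted elsewhere in the presentation: a gain of g means the adiabatic family uses 1/g of the DCVSL energy, which is a saving of 1 − 1/g. A minimal sketch, with illustrative function names:

```python
def adiabatic_gain(e_dcvsl, e_adiabatic):
    """Gain as defined on the slide: DCVSL energy over adiabatic energy."""
    return e_dcvsl / e_adiabatic


def energy_saving(gain):
    """Fraction of DCVSL energy saved for a given adiabatic gain."""
    return 1.0 - 1.0 / gain


if __name__ == "__main__":
    for gain in (2.0, 3.5, 4.0, 5.0):
        print(f"gain {gain:3.1f} -> saving {energy_saving(gain) * 100:4.1f}%")
```

Under this reading, a gain between roughly 3.3 and 5 corresponds to a saving of 70-80%, and a gain of 2 to a saving of 50%.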
[Figure: High-frequency adiabatic gain vs. technology model (180nm, 90nm, 65nm); series: 2N2P, PFAL, average]
• HF gain ~2
• Little change
[Figure: Middle-frequency adiabatic gain vs. technology model (180nm, 90nm, 65nm, 45nm); series: 2N2P, PFAL, average]
• MF gain ~4
• Slight increase
Trend Results (2/2)
[Figure: Low-frequency adiabatic gain vs. technology model (180nm, 90nm, 65nm, 45nm); series: 2N2P, PFAL, average]
• LF gain ~3.5
• Slight decrease
[Figure: Very-low-frequency adiabatic gain vs. technology model (180nm, 90nm, 65nm, 45nm); series: 2N2P, PFAL, average]
• VLF gain ~4 down to ~2
• Large decrease
Results Summary
Adiabatic circuits consume 70-80% less power at medium to very low frequencies
If leakage dominates, the benefits decrease
At high frequencies, the benefits also decrease
In smaller technologies, the benefits decrease for low and very low frequencies
However, the medium frequency range remains promising
Discussion & Conclusion
Discussion
More clock networks and clock generation circuits will increase clock distribution power
Always switching
Applicable mostly to high-switching circuitry
Larger design overhead
~100 times lower static noise [2]
Conclusion
Power saving of 70-80%, and increasing, for frequencies in the 100-250MHz range for new technologies
Very low frequencies and high frequencies need different low-power solutions
[2] H. Mahmoodi-Meimand and A. Afzali-Kusha, “Low-power, low-noise adder design with pass-transistor adiabatic logic,” Proceedings of the 12th International Conference on Microelectronics (ICM 2000), 31 Oct.-2 Nov. 2000, pp. 61-64.
Q & A
Thank you for your time
Any questions?
Back up slides
Test Setup
Robustness for adiabatic circuits explained:
Output signal goes within ~1% of rail
Output goes below VT before next clock reaches VT
[Figure: waveforms of Clk1, Clk2, and Output]
Simulation Results – 150f load (1/4)
[Figure: Energy per cycle vs. delay, 180nm, 150 fF load; series: 2N2P, PFAL, DCVSL]
Simulation Results – 150f load (2/4)
[Figure: Energy per cycle vs. delay, 90nm, 150 fF load; series: 2N2P, PFAL, DCVSL]
Simulation Results – 150f load (3/4)
[Figure: Energy per cycle vs. delay, 65nm, 150 fF load; series: 2N2P, PFAL, DCVSL]
Simulation Results – 150f load (4/4)
[Figure: Energy per cycle vs. delay, 45nm, 150 fF load; series: 2N2P, PFAL, DCVSL]
Algebraic Coding for Reliable Computation using Unreliable Gates
Animesh Kumar
EE 241 Class Project
Instructor: Prof. Jan M. Rabaey
Introduction and motivation
• Combinatorial logic block in DSM circuits
• Feature size very small
• Smaller device => more susceptible to SER
Unreliable AND logic
• Simple AND gate – well-known problem
• Combinatorial model – correct controlled number of errors
• a, b, x are binary vectors
[Figure: AND gate with inputs a, b, output x, and failure probability p]
p = probability of failure
P(x != a.b) = p, p > 0
Past approaches
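[Figure: three copies of the unreliable gate (inputs a, b; outputs x1, x2, x3; each with failure probability p) feeding a majority voter that produces x]
• TMR (Triple Modular Redundancy)
• Works for single-error per compute
• Wasteful for p small
[Figure: unreliable gate with inputs a, b and failure probability p]
• Encode using a linear error-control code
• An [n, k, d] code will correct ⌊(d-1)/2⌋ errors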
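The two past approaches can be sketched in a few lines. The sketch below assumes a perfect (error-free) majority voter and independent gate failures, neither of which the slides state explicitly; the function names are illustrative.

```python
def majority(x1, x2, x3):
    """Bitwise majority vote over three copies of a gate output."""
    return (x1 & x2) | (x1 & x3) | (x2 & x3)


def tmr_failure_probability(p):
    """Probability that at least two of three independent copies fail."""
    return 3 * p ** 2 * (1 - p) + p ** 3


def correctable_errors(d):
    """Errors correctable by bounded-distance decoding of an [n, k, d] code."""
    return (d - 1) // 2


if __name__ == "__main__":
    print("vote(1, 1, 0) =", majority(1, 1, 0))
    for p in (0.1, 0.01, 0.001):
        print(f"p={p}: TMR failure probability = {tmr_failure_probability(p):.2e}")
    for d in (3, 8):
        print(f"d={d}: corrects {correctable_errors(d)} error(s)")
```

TMR masks any single failing copy but always pays a threefold hardware cost, which is why the slide calls it wasteful when p is small.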
Interesting idea
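[Figure: two unreliable gates on the same inputs a, b, each with failure probability p: an AND gate producing x and an OR gate producing y]
a  b  x  y  XOR(x, y)
0  0  0  0  0
0  1  0  1  1
1  0  0  1  1
1  1  1  1  0
XOR(x, y) = XOR(a, b)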
Approach: Encode over input and have gate-diversity
• XOR channel locates error(s)
• Correction?
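The identity behind the XOR channel can be checked exhaustively: for error-free gates, XOR(AND(a, b), OR(a, b)) equals XOR(a, b) for every input pair. A minimal sketch that rebuilds the truth table (the function name is illustrative):

```python
from itertools import product


def truth_table():
    """Rows of (a, b, x, y, x XOR y) with x = a AND b and y = a OR b."""
    rows = []
    for a, b in product((0, 1), repeat=2):
        x = a & b
        y = a | b
        rows.append((a, b, x, y, x ^ y))
    return rows


if __name__ == "__main__":
    print("a b x y xor")
    for row in truth_table():
        print(*row)
```

Because a XOR b of two codewords of a linear code is itself a codeword, any gate error shows up as a deviation of x XOR y from a codeword, which is how the XOR channel locates errors.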
Correction – the easy part
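[Figure: AND gate producing x and OR gate producing y from inputs a, b, each with failure probability p]
if noisy (x, y) = (1, 1) or (0, 0) (error located where XOR(a, b) = 1)
then corrected (x, y) = (0, 1)
Example {Hamming [7,4,3] code}
a = 0 0 0 0 0 0 0
b = 1 1 1 1 1 1 1
Noisy outputs:
x = 0 0 0 1 0 0 0
y = 1 1 1 1 1 1 1
s = 1 1 1 0 1 1 1
Corrected outputs:
x = 0 0 0 0 0 0 0
y = 1 1 1 1 1 1 1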
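The easy case can be sketched end to end. This is not the author's decoder: it assumes a particular systematic generator matrix for the Hamming [7,4,3] code (the slides do not say which one was used) and brute-force nearest-codeword decoding of s = x XOR y. It corrects only positions where the decoded a XOR b is 1; positions where it is 0 are the ambiguous hard case and are left untouched.

```python
from itertools import product

# Assumed generator matrix of a systematic Hamming [7,4,3] code.
G = [
    (1, 0, 0, 0, 1, 1, 0),
    (0, 1, 0, 0, 1, 0, 1),
    (0, 0, 1, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1, 1),
]


def codewords():
    """All 16 codewords generated by G over GF(2)."""
    words = []
    for msg in product((0, 1), repeat=4):
        word = tuple(
            sum(m * g for m, g in zip(msg, column)) % 2 for column in zip(*G)
        )
        words.append(word)
    return words


def nearest_codeword(s):
    """Brute-force minimum Hamming distance decoding."""
    return min(codewords(), key=lambda c: sum(ci != si for ci, si in zip(c, s)))


def correct_easy(x, y):
    """Fix located errors at positions where the decoded a XOR b is 1."""
    s = tuple(xi ^ yi for xi, yi in zip(x, y))
    c = nearest_codeword(s)  # estimate of a XOR b
    x_fixed, y_fixed = list(x), list(y)
    for i, (si, ci) in enumerate(zip(s, c)):
        if si != ci and ci == 1:
            # a XOR b = 1 forces (AND, OR) = (0, 1) at this position.
            x_fixed[i], y_fixed[i] = 0, 1
    return tuple(x_fixed), tuple(y_fixed)


if __name__ == "__main__":
    noisy_x = (0, 0, 0, 1, 0, 0, 0)
    noisy_y = (1, 1, 1, 1, 1, 1, 1)
    print(correct_easy(noisy_x, noisy_y))
```

Run on the slide's example, the sketch decodes s = 1110111 to the all-ones codeword, locates the error in the fourth position, and restores x to all zeros.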
Correction – hard part
[Figure: AND gate producing x and OR gate producing y from inputs a, b, each with failure probability p]
if noisy (x, y) = (1, 0) or (0, 1)
then correct (x, y) is in {(0, 0), (1, 1)}
Question: Is it always possible to decode t logic errors using a t-error correcting code?
Ans: NO
Proof:
a = 1 0 0 0 0 1 1
b = 1 0 0 1 1 0 0
x = 1 0 0 0 0 0 0
y = 1 0 0 0 0 0 0
x = 0 0 0 0 0 0 0
y = 1 0 0 0 0 0 0
Coverage results
For a given (a, b):
Is there (c, d) such that
a.b = xor(c.d, e)
a + b = xor(c + d, e)
with e (error) of small weight?
Results:
• Hamming [7,4,3] – 44% of (a, b) have some (c, d) interfering
• Reed-Muller [16,5,8] – Tolerates two logic compute errors in each block (obtained by simulation)
Results so far …
• Showed a novel coding and gate-diversity method which detects and corrects a controlled number of errors
• Coverage results are promising
Conclusions & future work
• Exploring more complicated gate-network
• Proving feasibility of general bounded distance codes
• Efficient decoding methods
• Thinking about the overhead!
Power/Area Minimization of UWB Digital Baseband
Albert H. Chang, Rach Liu
University of California, Berkeley
EE 241 Spring 2006 Final Presentation
May 9th 2006
Motivation
Study of Power and Area Efficiency
• Design Example: UWB communication
• Efficiency is needed due to high throughput
Explore Micro-Architectural Techniques
• Parallelism
• Time-Multiplexing
• Pipelining
The effect of voltage scaling on:
• Active Power
• Leakage Power
• Overall Power
Scope of Project
Design Driver: UWB Digital Baseband
• Pulsed radio approach [1]
Goal: optimize power and area
• Fixed throughput constraint
• Simulation for all 3 corners
Simulink-to-Silicon Design Environment
• Functional verification of the algorithm
• Circuit-level characterization
[1] Mike Chen, “Ultra Wide-Band Baseband Design and Implementation,” M.S. Thesis, UC Berkeley, 2002
Design Methodology
Model & test the design in Simulink/XSG
ELDO simulation of 90nm technology
• Understand Power-Delay tradeoff for all corners
Run BWRC/INSECTA digital design flow for an FIR filter @ 1V to estimate Power/Area
Extrapolate to other points: P = 1, 2, 4, 8, 16
• Use Power-Delay tradeoff curve
Find the optimal level of parallelism and Vdd
• For the best Power/Area tradeoff
Simulink Design Exploration
[Figure: Simulink model; labels: 1GHz, 64 taps]
Power of Major building blocks
Power Consumption Studies in terms of number of Slices

Case            Same frequency (1)     Actual frequency (2)
                Slices    Share        Slices          Share
PMF of F1       3496      26%          3496            84.86%
PMF of F2       4874      32.8%        4874            79.6%
PMF of F16      31798     76%          31798           76%
Others of F1    9980      74%          9980/16=623     15%
Others of F2    9980      67.2%        9980/8=1247     20%
Others of F16   9980      24%          9980/1=9980     24%

(1) Assuming everything running at the same frequency
(2) Actual frequency, depending on the level of parallelism

Power of the PMF block is about 80% of the overall system. Therefore, optimizing the PMF is the main task!
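The shares in the slice study can be reproduced from the slice counts. The sketch below assumes, as the table suggests, that slice count is used as a proxy for power and that the "others" blocks are weighted by a divisor of 16/P to reflect their lower actual frequency relative to the PMF. The dictionary and function names are illustrative.

```python
OTHERS_SLICES = 9980
PMF_SLICES = {1: 3496, 2: 4874, 16: 31798}  # keyed by parallelism level P


def pmf_shares(parallelism):
    """Return the PMF share (same-frequency, actual-frequency) for F<P>."""
    pmf = PMF_SLICES[parallelism]
    # Same-frequency view: every block counted at full weight.
    same = pmf / (pmf + OTHERS_SLICES)
    # Actual-frequency view: "others" scaled down by 16 / P.
    others_actual = OTHERS_SLICES / (16 // parallelism)
    actual = pmf / (pmf + others_actual)
    return same, actual


if __name__ == "__main__":
    for p in (1, 2, 16):
        same, actual = pmf_shares(p)
        print(f"F{p}: same-frequency {same * 100:5.1f}%, "
              f"actual-frequency {actual * 100:5.1f}%")
```

In the actual-frequency view the PMF share lands between roughly 76% and 85% for all three cases, consistent with the "about 80%" statement.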
PMF Block, Extreme Case #1: F1 (fully time-multiplexed design)
[Figure: F1 block diagram, labeled “1x”]
PMF Block, Extreme Case #2: F16 (fully parallel design)
Tradeoff between Throughput and Power Consumption
F1 synthesized in 90nm @ 1V, clock period = 3ns. Estimates:
• Active Power: 28.9mW
• Leakage Power: 3.4mW
Power extrapolated from F1 results
Tradeoff between Throughput and Power for F1
Since there is only one level of parallelism, throughput = operating frequency
As the frequency and voltage increase, the overall power consumption increases because
Active Power ~ $f_{clk} \cdot V_{DD}^2$
Leakage Power ~ $V_{DD}^{3.3}$ (experimental data)
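The two proportionalities can be turned into a simple extrapolation from the synthesized F1 reference point. This is a sketch under stated assumptions, not the authors' extrapolation flow: it reads the 3 ns figure as the clock period (about 333 MHz), scales active power linearly with frequency and quadratically with VDD, scales leakage as VDD to the power 3.3, and does not model how the maximum clock frequency drops as VDD is lowered.

```python
F_REF_HZ = 1.0 / 3e-9        # reference clock, assuming a 3 ns period
V_REF = 1.0                  # reference supply voltage (V)
P_ACTIVE_REF_MW = 28.9       # F1 active power at the reference point
P_LEAK_REF_MW = 3.4          # F1 leakage power at the reference point


def active_power_mw(f_clk_hz, v_dd):
    """Active power scaled as f_clk * VDD^2 from the reference point."""
    return P_ACTIVE_REF_MW * (f_clk_hz / F_REF_HZ) * (v_dd / V_REF) ** 2


def leakage_power_mw(v_dd):
    """Leakage power scaled as VDD^3.3 from the reference point."""
    return P_LEAK_REF_MW * (v_dd / V_REF) ** 3.3


def total_power_mw(f_clk_hz, v_dd):
    return active_power_mw(f_clk_hz, v_dd) + leakage_power_mw(v_dd)


if __name__ == "__main__":
    for scale, v_dd in ((1.0, 1.0), (0.5, 0.7), (0.25, 0.5)):
        f_clk = scale * F_REF_HZ
        print(f"f={f_clk / 1e6:6.1f} MHz, VDD={v_dd:.2f} V -> "
              f"{total_power_mw(f_clk, v_dd):5.2f} mW")
```

At the reference point the model returns the 32.3 mW total reported for F1; other operating points are only as good as the two scaling assumptions.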
Technology Characterization: ELDO FO4 Model
Extrapolate Data
• Throughput normalized to 1 for the highest throughput @ each corner
• Delay normalized to delay @ 1V
[Figure: Delay (normalized to 1V)]
Reference Case: F1
Simulated F1 in 90nm @ 1V
• Active Power: 28.9mW
• Leakage Power: 3.4mW
• Overall Power = 32.3mW
• Throughput = 333MSamples/s
[Figure: Power (mW) vs. throughput (normalized), logarithmic axes; curves labeled SS, TT, FF]
Case Study: Parallelism P = 1, 2, 4, 8, 16
Active Power: ~ $f_{clk} \cdot V_{DD}^2$
Leakage Power: ~ $V_{DD}^{3.3}$, ~ P
Power increases as P increases
Power decreases as P increases
[Figure: power curves for P = 1, 2, 4, 8, 16]
Need to look at VDD to better understand Pleakage…
A Closer Look: Leakage Power
Parallelism doesn’t help much as supply voltage saturates
The rate of voltage scaling decreases
As a result, F16 consumes more power here because the effect of block parallelism kicks in
Minimizing Overall Power
[Figure: overall power curves for F1, F2, F4, F8, F16, annotated with supply-voltage ranges V>0.5, 0.35<V<0.7, 0.3<V<0.4, 0.3<V<0.35, 0.25<V<0.3]
Choosing the Right Level of Parallelism
It is hard to compare different levels of parallelism
• Therefore, normalize to F1 @ the same throughput
Find the appropriate level of parallelism
• Throughput = “0.6” (0.6 = 200/333)
• F2 has about 60% power saving
• F4-F16 have about 70% power saving
• There is no benefit from F8, F16
Will try F2 and F4
Parallelism 1, 2, 4, 8, 16: TT Corner
[Figure: Overall power (normalized to F1); panels: Active, Leakage, Total]
Design Choice: F4
Optimal Design Parameters
[Figure: VDD and power vs. throughput]
Final Voltage for F4 = 0.34V
Power reduction: 80%
Final Architecture
[Figure: block diagram of the final architecture. Labeled blocks: Serial to Parallel, PMF (could be 1~16 different parallelism levels), Corr Block, MAX, MAX. Labeled signals: 20 bits at 200 MHz, 320 bits, 16x15 bits, 16x23 bits, 27 bits, 30 bits, 12.5 MHz, 8~256, pad limit. Annotations: (8 blocks), (8 blocks), (parallelism 4)]
Final Voltage: 0.34V
Power: 6mW
Power Saving: 80% compared to F1
Throughput: 200 MSamples/s
Future Work
Tape out the design
Experimental verification
Limited Switch Dynamic Logic: VDD scaling
Josephine Chang
EE241, Spring 2006, final project
Prepared for Dr. Rabaey
Limited Switch Dynamic Logic
[Figure: dynamic logic stage and LSDL stage]
- Faster and smaller Co stack than static -> allows Vdd scaling
- Lower power and more robust than dynamic -> better power & variability immunity
Full Adder Test Circuit
[Figure: LSDL full adder schematic]
- Co logic in dynamic; S logic in static
- A, B, and S loaded with inverters; Cout loaded with Ci
- Functionality verified with 45 input patterns
- Power measured with 1000 input patterns
- Delay measured from Ci to Co
- Compared to static CMOS & dual-rail domino
Optimized for PDP at 0.8V
[Figure: PDP (ns*uW), tprop (ns), and average power (uW) vs. W (nm); series: Static, Dynamic, LSDL]
Vdd scaling
[Figure: PDP (ns*uW), tprop (ns), and average power (uW) vs. Vdd (V); series: Static, Dynamic, LSDL]
VT variation at 0.65V
L = 39nm, 40nm, 41nm, 42nm, 43nm, 44nm, 45nm
L=45nm, VT=.24V (nom)
L=44nm, VT=.225V
L=40nm, VT=.02V
L=39nm, VT=-.02V
Dynamic failure:
LSDL failure:
L=43nm, VT=.21V
Variations on LSDL
[Figure: LSDL variants: Basic, Clocked feedback, Feed forward pulse, Controlled-Load Keeper]
Summary
• LSDL power dissipation approaches static CMOS at low VDD.
• LSDL propagation delay worse than CMOS, but comparison was unfair (worst case for LSDL is much worse than nominal)
• 1-bit FA is perhaps not the best showcase demo for LSDL
Full Adder Test Circuit
[Figure: static logic and dynamic logic full adder schematics]
That’s all!
• Excellent job …
Time for summer!