Upload
ravindarsingh
View
215
Download
0
Embed Size (px)
Citation preview
7/29/2019 ch04_2
1/44
Chapter 4Low-Power VLSI Design
Jin-Fu Li
Advanced Reliable Systems (ARES) Lab.
Department of Electrical Engineering
National Central University
Jhongli, Taiwan
7/29/2019 ch04_2
2/44
2EE4012VLSI DesignNational Central University
Outline
Introduction
Low-Power Gate-Level Design Low-Power Architecture-Level Design
Algorithmic-Level Power Reduction RTL Techniques for Optimizing Power
7/29/2019 ch04_2
3/44
3EE4012VLSI DesignNational Central University
Introduction
Most SOC design teams now regard power as oneof their top design concerns
Why low-power design? Battery lifetime (especially for portable devices) Reliability
Power consumption Peak power
Average power
7/29/2019 ch04_2
4/44
4EE4012VLSI DesignNational Central University
Overview of Power Consumption
Average power consumption Dynamic power consumption
Short-circuit power consumption
Leakage power consumption
Static power consumption
Dynamic power dissipation during switching
Cinput
CdrainCinput
interconnect
7/29/2019 ch04_2
5/44
5EE4012VLSI DesignNational Central University
Overview of Power Consumption
Generic representation of a CMOS logic gate forswitching power calculation
inputerconnectdrain CCC int
pMOSnetwork
nMOS
network
Vout
2/
0 2/]))(()([
1 T T
T
outloadoutDD
outloadoutavg dt
dt
dVCVVdt
dt
dVCV
TP
VA
VB
VA
VB
7/29/2019 ch04_2
6/44
6EE4012VLSI DesignNational Central University
Overview of Power Consumption
The average power consumption can be expressedas
The node transition rate can be slower than theclock rate. To better represent this behavior, a
node transition factor ( ) should be introduced
The switching power expressed above are derivedby taking into account the output node loadcapacitance
CLKDDloadDDloadavg fVCVC
T
P 221
CLKDDloadTavg fVCP2
T
7/29/2019 ch04_2
7/44
7EE4012VLSI DesignNational Central University
Overview of Power Consumption
Cload
Cinternal
Vinternal
VA
VB
VA VB
Vout
Vinternal
VA
VB
Vout
CLKDD
ofnodes
iiiTiavg fVVCP
#
1
The generalized expression for the average power dissipation
can be rewritten as
7/29/2019 ch04_2
8/44
8EE4012VLSI DesignNational Central University
Gate-Level Design Technology Mapping
The objective of logic minimization is to reduce theboolean function.
For low-power design, the signal switching activityis minimized by restructuring a logic circuit
The power minimization is constrained by the
delay, however, the area may increase. During this phase of logic minimization, the
function to be minimized is
i
iii CPP )1(
7/29/2019 ch04_2
9/44
9EE4012VLSI DesignNational Central University
Gate-Level Design Technology Mapping
The first step in technology mapping is to decomposeeach logic function into two-input gates
The objective of this decomposition is to minimize the totalpower dissipation by reducing the total switching activity
038.0
ABCD
A 0.2
B 0.2
C 0.5
D 0.5
A 0.2B 0.2
C 0.5
D 0.5
0099.00196.0
0099.0
1875.0
0384.0
7/29/2019 ch04_2
10/44
10EE4012VLSI DesignNational Central University
Gate-Level Design Phase Assignment
A
B
High activity node
C
A
B
High activity node
C
7/29/2019 ch04_2
11/44
11EE4012VLSI DesignNational Central University
Gate-Level Design Pin Swapping
b
a
d
c
ba dc
b
a
d
c
ba dc
Switchin
gactivity
Switching
activity
ba
dc
ba
dc
7/29/2019 ch04_2
12/44
12EE4012VLSI DesignNational Central University
Gate-Level Design Glitching Power
Glitches spurious transitions due to imbalanced path delays
A design has more balanced delay paths has fewer glitches, and thus has less power dissipation
Note that there will be no glitches in a dynamic CMOSlogic
A
B
C
D
E
A
B
C
D
E
7/29/2019 ch04_2
13/44
13EE4012VLSI DesignNational Central University
Gate-Level Design Glitching Power
A chain structure has more glitches
A tree structure has fewer glitches
A
B
C
D
A
B
C
D
Chain structure
Tree structure
7/29/2019 ch04_2
14/44
14EE4012VLSI DesignNational Central University
Gate-Level Design Precomputation
Combinational LogicREG
R1
REG
R2
Combinational LogicREG
R1
REG
R2
Precomputation
Logic
7/29/2019 ch04_2
15/44
15EE4012VLSI DesignNational Central University
Gate-Level Design Precomputation
1-bit Comparator
(MSB)
REGR4
(n-1)-bitComparator
REG
R1
REG
R2
REG
R3
A
B
A
B
Precomputation logic
Enable
F
7/29/2019 ch04_2
16/44
16EE4012VLSI DesignNational Central University
Gate-Level Design Gating Clock
D Q D Q D Q D Q
D Q D Q D Q D Q
T
clk
clk
Fail DFT rulechecking
Add control pinto solve DFTviolation
problem
7/29/2019 ch04_2
17/44
17EE4012VLSI DesignNational Central University
Gate-Level Design Input Gating
+
clk
f1
f2
select
7/29/2019 ch04_2
18/44
18EE4012VLSI DesignNational Central University
Clock-Gating in Low-Power Flip-Flop
D QD
CK
Source: Prof. V. D. Agrawal
7/29/2019 ch04_2
19/44
19EE4012VLSI DesignNational Central University
Reduced-Power Shift Register
D Q D Q D Q
D QD QD Q
D Q
D Q
D
CK(f/2)
multiplexer
Output
Flip-flops are operated at full voltage and half the clock frequency.
Source: Prof. V. D. Agrawal
7/29/2019 ch04_2
20/44
20EE4012VLSI DesignNational Central University
Power Consumption of Shift Register
P = CVDD2f/n
Degree of parallelism, n
1 2 4
Normalizedpower
1.0
0.5
0.25
0.0
Deg. Ofparallelism
Freq(MHz)
Power(W)
1 33.0 1535
2 16.5 887
4 8.25 738
16-bit shift register, 2 CMOS
C. Piguet, Circuit and Logic Level
Design, pages 103-133 in W. Nebeland J. Mermet (ed.), Low Power
Design in Deep Submicron
Electronics, Springer, 1997.
Source: Prof. V. D. Agrawal
7/29/2019 ch04_2
21/44
21EE4012VLSI DesignNational Central University
Architecture-Level Design Parallelism
16x16
multiplier
RA
RB
fref
fref
32 16x16
multiplier
RA
R
fref/2
32
MUXfref/2
16x16
multiplier
R
B R
fref/2
fref/2
32
32
fref
16
16
16
16
16
16
Assume that With the same 16x16multiplier, the power supply canbe reduced from Vref to Vref/1.83.
refrefref
refparallel PfVCP 33.02
)83.1
(2.2 2
7/29/2019 ch04_2
22/44
22EE4012VLSI DesignNational Central University
Architecture-Level Design Pipelining
Half
multiplierREG(A ,B)
fref
32REG
Half
multiplier
32
refref
ref
refpipeline PfV
CP 36.0)83.1
(2.1 2
The hardware between the pipeline stages is reduced thenthe reference voltage Vrefcan be reduced to Vnew to maintainthe same worst case delay. For example, let a 50MHzmultiplier is broken into two equal parts as shown below. The
delay between the pipeline stages can be remained at 50MHzwhen the voltage Vnew is equal to Vref/1.83
7/29/2019 ch04_2
23/44
23EE4012VLSI DesignNational Central University
Architecture-Level Design Retiming
Retiming is a transformation technique used to change thelocations of delay elements in a circuit without affecting theinput/output characteristics of the circuit.
D D 2D
D
D
D
2D
y(n)x(n)
w(n)
y(n)x(n)
w1(n)
w2(n)
b
a
(2)
(1) (2)
(1)
a
(2)
(1) (2)
(1)
retiming
Two versions of an IIR filter.
b
7/29/2019 ch04_2
24/44
24EE4012VLSI DesignNational Central University
Architecture-Level Design Retiming
REG
fref
REG C3
(4ns)
Retiming for pipeline design
C1
(6ns)
C2
(2ns)
REG
fref
REG C3
(4ns)
C1
(6ns)
C2
(2ns)
7/29/2019 ch04_2
25/44
25EE4012VLSI DesignNational Central University
Architecture-Level Design Retiming
Clock cycle is 4 gate delays
Clock cycle is 2 gate delays
7/29/2019 ch04_2
26/44
26EE4012VLSI DesignNational Central University
Architecture-Level Design Power Management
C2
C1
C1_FREEZE
C2_FREEZE
C2
C1
C1_FREEZE
C2_FREEZE
7/29/2019 ch04_2
27/44
27EE4012VLSI DesignNational Central University
Architecture-Level Design Bus Segmentation
Avoid the sharing of resources Reduce the switched capacitance
For example: a global system bus A single shared bus is connected to all modules, this
structure results in a large bus capacitance due to The large number of drivers and receivers sharing the same
bus
The parasitic capacitance of the long bus line
A segmented bus structure Switched capacitance during each bus access is
significantly reduced Overall routing area may be increased
7/29/2019 ch04_2
28/44
28EE4012VLSI DesignNational Central University
Architecture-Level Design Bus Segmentation
Cbus
Cbus1
Cbus1
Bus
Interface
7/29/2019 ch04_2
29/44
29EE4012VLSI DesignNational Central University
Algorithmic-Level Design factivityReduction
Binary Code Gray Code Decimal Equivalent
000001010011
100101110111
000001011010
110111101100
0123
4567
Minimization of the switching activity, at high level, is one way toreduce the power dissipation of digital processors.
One method to minimize the switching signals, at the algorithmiclevel, is to use an appropriate coding for the signals rather thanstraight binary code.
The table shown below shows a comparison of 3-bit representationof the binary and Gray codes.
7/29/2019 ch04_2
30/44
30EE4012VLSI DesignNational Central University
State Encoding for a Counter
Two-bit binary counter: State sequence, 00 01 10 11 00 Six bit transitions in four clock cycles
6/4 = 1.5 transitions per clock
Two-bit Gray-code counter State sequence, 00 01 11 10 00 Four bit transitions in four clock cycles 4/4 = 1.0 transition per clock
Gray-code counter is more power efficient.G. K. Yeap, Practical Low Power Digital VLSI Design, Boston:
Kluwer Academic Publishers (now Springer), 1998.
Source: Prof. V. D. Agrawal
7/29/2019 ch04_2
31/44
31EE4012VLSI DesignNational Central University
Binary Counter: Original Encoding
Present
state
Next state
a b A B
0 0 0 1
0 1 1 0
1 0 1 1
1 1 0 0
A = ab + ab
B = ab + ab
A
B
a
b
CK
CLR
Source: Prof. V. D. Agrawal
7/29/2019 ch04_2
32/44
32EE4012VLSI DesignNational Central University
Binary Counter: Gray Encoding
Presentstate
Next state
a b A B
0 0 0 1
0 1 1 1
1 0 0 0
1 1 1 0
A = ab + ab
B = ab + ab
A
B
a
b
CK
CLR
Source: Prof. V. D. Agrawal
Th Bit C t
7/29/2019 ch04_2
33/44
33EE4012VLSI DesignNational Central University
Three-Bit Counters
Binary Gray-code
State No. of toggles State No. of toggles
000 - 000 -
001 1 001 1
010 2 011 1
011 1 010 1
100 3 110 1
101 1 111 1
110 2 101 1
111 1 100 1000 3 000 1
Av. Transitions/clock = 1.75 Av. Transitions/clock = 1
Source: Prof. V. D. Agrawal
N Bit Counter: Toggles in Counting Cycle
7/29/2019 ch04_2
34/44
34EE4012VLSI DesignNational Central University
N-Bit Counter: Toggles in Counting Cycle
Binary counter: T(binary) = 2(2N 1) Gray-code counter: T(gray) = 2N
T(gray)/T(binary) = 2N-1/(2N 1) 0.5Bits T(binary) T(gray) T(gray)/T(binary)
1 2 2 1.0
2 6 4 0.6667
3 14 8 0.5714
4 30 16 0.5333
5 62 32 0.5161
6 126 64 0.5079
- - 0.5000
Source: Prof. V. D. Agrawal
FSM State Encoding
7/29/2019 ch04_2
35/44
35EE4012VLSI DesignNational Central University
FSM State Encoding
11
01000.1
0.10.4
0.3
0.6 0.9
0.6
01
11000.1
0.10.4
0.3
0.6 0.9
0.6
Expected number of state-bit transitions:
1(0.3+0.4+0.1) + 2(0.1) = 1.0
Transitionprobability
based on
PI statistics
State encoding can be selected using a power-based cost function.
2(0.3+0.4) + 1(0.1+0.1) = 1.6
Source: Prof. V. D. Agrawal
FSM: Clock Gating
7/29/2019 ch04_2
36/44
36EE4012VLSI DesignNational Central University
FSM: Clock-Gating
Moore machine: Outputs depend only on thestate variables. If a state has a self-loop in the state transition
graph (STG), then clock can be stoppedwhenever a self-loop is to be executed.
Sj
Si
Sk
Xi/Zk
Xk/Zk
Xj/Zk
Clock can be stoppedwhen (Xk, Sk) combination
occurs.
Source: Prof. V. D. Agrawal
Clock Gating in Moore FSM
7/29/2019 ch04_2
37/44
37EE4012VLSI DesignNational Central University
Clock-Gating in Moore FSM
Combinational
logic
Latch
Clock
activation
logic
Flip-flops
PI
CK
PO
L. Benini and G. De Micheli,Dynamic Power Management,
Boston: Springer, 1998.
Source: Prof. V. D. Agrawal
Bus Encoding for Reduced Power
7/29/2019 ch04_2
38/44
38EE4012VLSI DesignNational Central University
Bus Encoding for Reduced Power
Example: Four bit bus 0000 1110 has three transitions. If bits of second pattern are inverted, then 0000
0001 will have only one transition.
Bit-inversion encoding for N-bit bus:
Number of bit transitions
0 N/2 N
N
N/2
0Numbe
rofbittrans
itions
afterin
versionenc
oding
Source: Prof. V. D. Agrawal
Bus-Inversion Encoding Logic
7/29/2019 ch04_2
39/44
39EE4012VLSI DesignNational Central University
Bus-Inversion Encoding Logic
Polarity
decisionlogic
Sent
data
Received
data
Bus register
Polarity bitM. Stan and W. Burleson, Bus-InvertCoding for Low Power I/O,IEEE
Trans. VLSI Systems, vol. 3, no. 1, pp.
49-58, March 1995.
Source: Prof. V. D. Agrawal
RTL Level Design Si l G ti
7/29/2019 ch04_2
40/44
40EE4012VLSI DesignNational Central University
RTL-Level Design Signal Gating
module decoder (a, sel);
input [1:0[ a;
ouput [3:0] sel;reg [3:0] sel;
always @(a) begincase (a)
2b00: sel=4b0001;
2b01: sel=4b0010;2b10: sel=4b0100;2b11: sel=4b1000;
endcaseend
endmodule
module decoder (en,a, sel);
input en;
input [1:0[ a;ouput [3:0] sel;reg [3:0] sel;
always @({en,a}) begincase ({en,a})
3b100: sel=4b0001;3b101: sel=4b0010;3b110: sel=4b0100;3b111: sel=4b1000;default: sel=4b0000;
endcaseend
endmodule
Decoder with enableSimple Decoder
RTL Level Design D t th R d i
7/29/2019 ch04_2
41/44
41EE4012VLSI DesignNational Central University
RTL-Level Design Datapath Reordering
Mux
Muxstable
glitchyA
7/29/2019 ch04_2
42/44
42EE4012VLSI DesignNational Central University
RTL-Level Design Memory Partition
128x32
MU
X
32
128x32
dout
32
32
dout
dout
qdpre_addr
clk
write
addr
din
din
addr
write
addr[7:0]addr[7:1]
addr0
8
noe
noe
RTL Level Design Memory Partition
7/29/2019 ch04_2
43/44
43EE4012VLSI DesignNational Central University
RTL-Level Design Memory Partition
ARM
Core
64K bytes
Data
Addr
R/W
Reads
Addr
Range64K28K 4K 32K
Application-driven memory partition
RTL-Level Design Memory Partition
7/29/2019 ch04_2
44/44
44EE4012VLSI DesignNational Central University
RTL-Level Design Memory Partition
ARM
CoreData
Addr
R/W
CS
Data
Addr
R/W
CS
Data
Addr
R/W
CS
Decoder
A power-optimal partitioned memory organization