ch04_2

Embed Size (px)

Citation preview

  • 7/29/2019 ch04_2

    1/44

    Chapter 4Low-Power VLSI Design

    Jin-Fu Li

    Advanced Reliable Systems (ARES) Lab.

    Department of Electrical Engineering

    National Central University

    Jhongli, Taiwan

  • 7/29/2019 ch04_2

    2/44

    2EE4012VLSI DesignNational Central University

    Outline

    Introduction

    Low-Power Gate-Level Design Low-Power Architecture-Level Design

    Algorithmic-Level Power Reduction RTL Techniques for Optimizing Power

  • 7/29/2019 ch04_2

    3/44

    3EE4012VLSI DesignNational Central University

    Introduction

    Most SOC design teams now regard power as oneof their top design concerns

    Why low-power design? Battery lifetime (especially for portable devices) Reliability

    Power consumption Peak power

    Average power

  • 7/29/2019 ch04_2

    4/44

    4EE4012VLSI DesignNational Central University

    Overview of Power Consumption

    Average power consumption Dynamic power consumption

    Short-circuit power consumption

    Leakage power consumption

    Static power consumption

    Dynamic power dissipation during switching

    Cinput

    CdrainCinput

    interconnect

  • 7/29/2019 ch04_2

    5/44

    5EE4012VLSI DesignNational Central University

    Overview of Power Consumption

    Generic representation of a CMOS logic gate forswitching power calculation

    inputerconnectdrain CCC int

    pMOSnetwork

    nMOS

    network

    Vout

    2/

    0 2/]))(()([

    1 T T

    T

    outloadoutDD

    outloadoutavg dt

    dt

    dVCVVdt

    dt

    dVCV

    TP

    VA

    VB

    VA

    VB

  • 7/29/2019 ch04_2

    6/44

    6EE4012VLSI DesignNational Central University

    Overview of Power Consumption

    The average power consumption can be expressedas

    The node transition rate can be slower than theclock rate. To better represent this behavior, a

    node transition factor ( ) should be introduced

    The switching power expressed above are derivedby taking into account the output node loadcapacitance

    CLKDDloadDDloadavg fVCVC

    T

    P 221

    CLKDDloadTavg fVCP2

    T

  • 7/29/2019 ch04_2

    7/44

    7EE4012VLSI DesignNational Central University

    Overview of Power Consumption

    Cload

    Cinternal

    Vinternal

    VA

    VB

    VA VB

    Vout

    Vinternal

    VA

    VB

    Vout

    CLKDD

    ofnodes

    iiiTiavg fVVCP

    #

    1

    The generalized expression for the average power dissipation

    can be rewritten as

  • 7/29/2019 ch04_2

    8/44

    8EE4012VLSI DesignNational Central University

    Gate-Level Design Technology Mapping

    The objective of logic minimization is to reduce theboolean function.

    For low-power design, the signal switching activityis minimized by restructuring a logic circuit

    The power minimization is constrained by the

    delay, however, the area may increase. During this phase of logic minimization, the

    function to be minimized is

    i

    iii CPP )1(

  • 7/29/2019 ch04_2

    9/44

    9EE4012VLSI DesignNational Central University

    Gate-Level Design Technology Mapping

    The first step in technology mapping is to decomposeeach logic function into two-input gates

    The objective of this decomposition is to minimize the totalpower dissipation by reducing the total switching activity

    038.0

    ABCD

    A 0.2

    B 0.2

    C 0.5

    D 0.5

    A 0.2B 0.2

    C 0.5

    D 0.5

    0099.00196.0

    0099.0

    1875.0

    0384.0

  • 7/29/2019 ch04_2

    10/44

    10EE4012VLSI DesignNational Central University

    Gate-Level Design Phase Assignment

    A

    B

    High activity node

    C

    A

    B

    High activity node

    C

  • 7/29/2019 ch04_2

    11/44

    11EE4012VLSI DesignNational Central University

    Gate-Level Design Pin Swapping

    b

    a

    d

    c

    ba dc

    b

    a

    d

    c

    ba dc

    Switchin

    gactivity

    Switching

    activity

    ba

    dc

    ba

    dc

  • 7/29/2019 ch04_2

    12/44

    12EE4012VLSI DesignNational Central University

    Gate-Level Design Glitching Power

    Glitches spurious transitions due to imbalanced path delays

    A design has more balanced delay paths has fewer glitches, and thus has less power dissipation

    Note that there will be no glitches in a dynamic CMOSlogic

    A

    B

    C

    D

    E

    A

    B

    C

    D

    E

  • 7/29/2019 ch04_2

    13/44

    13EE4012VLSI DesignNational Central University

    Gate-Level Design Glitching Power

    A chain structure has more glitches

    A tree structure has fewer glitches

    A

    B

    C

    D

    A

    B

    C

    D

    Chain structure

    Tree structure

  • 7/29/2019 ch04_2

    14/44

    14EE4012VLSI DesignNational Central University

    Gate-Level Design Precomputation

    Combinational LogicREG

    R1

    REG

    R2

    Combinational LogicREG

    R1

    REG

    R2

    Precomputation

    Logic

  • 7/29/2019 ch04_2

    15/44

    15EE4012VLSI DesignNational Central University

    Gate-Level Design Precomputation

    1-bit Comparator

    (MSB)

    REGR4

    (n-1)-bitComparator

    REG

    R1

    REG

    R2

    REG

    R3

    A

    B

    A

    B

    Precomputation logic

    Enable

    F

  • 7/29/2019 ch04_2

    16/44

    16EE4012VLSI DesignNational Central University

    Gate-Level Design Gating Clock

    D Q D Q D Q D Q

    D Q D Q D Q D Q

    T

    clk

    clk

    Fail DFT rulechecking

    Add control pinto solve DFTviolation

    problem

  • 7/29/2019 ch04_2

    17/44

    17EE4012VLSI DesignNational Central University

    Gate-Level Design Input Gating

    +

    clk

    f1

    f2

    select

  • 7/29/2019 ch04_2

    18/44

    18EE4012VLSI DesignNational Central University

    Clock-Gating in Low-Power Flip-Flop

    D QD

    CK

    Source: Prof. V. D. Agrawal

  • 7/29/2019 ch04_2

    19/44

    19EE4012VLSI DesignNational Central University

    Reduced-Power Shift Register

    D Q D Q D Q

    D QD QD Q

    D Q

    D Q

    D

    CK(f/2)

    multiplexer

    Output

    Flip-flops are operated at full voltage and half the clock frequency.

    Source: Prof. V. D. Agrawal

  • 7/29/2019 ch04_2

    20/44

    20EE4012VLSI DesignNational Central University

    Power Consumption of Shift Register

    P = CVDD2f/n

    Degree of parallelism, n

    1 2 4

    Normalizedpower

    1.0

    0.5

    0.25

    0.0

    Deg. Ofparallelism

    Freq(MHz)

    Power(W)

    1 33.0 1535

    2 16.5 887

    4 8.25 738

    16-bit shift register, 2 CMOS

    C. Piguet, Circuit and Logic Level

    Design, pages 103-133 in W. Nebeland J. Mermet (ed.), Low Power

    Design in Deep Submicron

    Electronics, Springer, 1997.

    Source: Prof. V. D. Agrawal

  • 7/29/2019 ch04_2

    21/44

    21EE4012VLSI DesignNational Central University

    Architecture-Level Design Parallelism

    16x16

    multiplier

    RA

    RB

    fref

    fref

    32 16x16

    multiplier

    RA

    R

    fref/2

    32

    MUXfref/2

    16x16

    multiplier

    R

    B R

    fref/2

    fref/2

    32

    32

    fref

    16

    16

    16

    16

    16

    16

    Assume that With the same 16x16multiplier, the power supply canbe reduced from Vref to Vref/1.83.

    refrefref

    refparallel PfVCP 33.02

    )83.1

    (2.2 2

  • 7/29/2019 ch04_2

    22/44

    22EE4012VLSI DesignNational Central University

    Architecture-Level Design Pipelining

    Half

    multiplierREG(A ,B)

    fref

    32REG

    Half

    multiplier

    32

    refref

    ref

    refpipeline PfV

    CP 36.0)83.1

    (2.1 2

    The hardware between the pipeline stages is reduced thenthe reference voltage Vrefcan be reduced to Vnew to maintainthe same worst case delay. For example, let a 50MHzmultiplier is broken into two equal parts as shown below. The

    delay between the pipeline stages can be remained at 50MHzwhen the voltage Vnew is equal to Vref/1.83

  • 7/29/2019 ch04_2

    23/44

    23EE4012VLSI DesignNational Central University

    Architecture-Level Design Retiming

    Retiming is a transformation technique used to change thelocations of delay elements in a circuit without affecting theinput/output characteristics of the circuit.

    D D 2D

    D

    D

    D

    2D

    y(n)x(n)

    w(n)

    y(n)x(n)

    w1(n)

    w2(n)

    b

    a

    (2)

    (1) (2)

    (1)

    a

    (2)

    (1) (2)

    (1)

    retiming

    Two versions of an IIR filter.

    b

  • 7/29/2019 ch04_2

    24/44

    24EE4012VLSI DesignNational Central University

    Architecture-Level Design Retiming

    REG

    fref

    REG C3

    (4ns)

    Retiming for pipeline design

    C1

    (6ns)

    C2

    (2ns)

    REG

    fref

    REG C3

    (4ns)

    C1

    (6ns)

    C2

    (2ns)

  • 7/29/2019 ch04_2

    25/44

    25EE4012VLSI DesignNational Central University

    Architecture-Level Design Retiming

    Clock cycle is 4 gate delays

    Clock cycle is 2 gate delays

  • 7/29/2019 ch04_2

    26/44

    26EE4012VLSI DesignNational Central University

    Architecture-Level Design Power Management

    C2

    C1

    C1_FREEZE

    C2_FREEZE

    C2

    C1

    C1_FREEZE

    C2_FREEZE

  • 7/29/2019 ch04_2

    27/44

    27EE4012VLSI DesignNational Central University

    Architecture-Level Design Bus Segmentation

    Avoid the sharing of resources Reduce the switched capacitance

    For example: a global system bus A single shared bus is connected to all modules, this

    structure results in a large bus capacitance due to The large number of drivers and receivers sharing the same

    bus

    The parasitic capacitance of the long bus line

    A segmented bus structure Switched capacitance during each bus access is

    significantly reduced Overall routing area may be increased

  • 7/29/2019 ch04_2

    28/44

    28EE4012VLSI DesignNational Central University

    Architecture-Level Design Bus Segmentation

    Cbus

    Cbus1

    Cbus1

    Bus

    Interface

  • 7/29/2019 ch04_2

    29/44

    29EE4012VLSI DesignNational Central University

    Algorithmic-Level Design factivityReduction

    Binary Code Gray Code Decimal Equivalent

    000001010011

    100101110111

    000001011010

    110111101100

    0123

    4567

    Minimization of the switching activity, at high level, is one way toreduce the power dissipation of digital processors.

    One method to minimize the switching signals, at the algorithmiclevel, is to use an appropriate coding for the signals rather thanstraight binary code.

    The table shown below shows a comparison of 3-bit representationof the binary and Gray codes.

  • 7/29/2019 ch04_2

    30/44

    30EE4012VLSI DesignNational Central University

    State Encoding for a Counter

    Two-bit binary counter: State sequence, 00 01 10 11 00 Six bit transitions in four clock cycles

    6/4 = 1.5 transitions per clock

    Two-bit Gray-code counter State sequence, 00 01 11 10 00 Four bit transitions in four clock cycles 4/4 = 1.0 transition per clock

    Gray-code counter is more power efficient.G. K. Yeap, Practical Low Power Digital VLSI Design, Boston:

    Kluwer Academic Publishers (now Springer), 1998.

    Source: Prof. V. D. Agrawal

  • 7/29/2019 ch04_2

    31/44

    31EE4012VLSI DesignNational Central University

    Binary Counter: Original Encoding

    Present

    state

    Next state

    a b A B

    0 0 0 1

    0 1 1 0

    1 0 1 1

    1 1 0 0

    A = ab + ab

    B = ab + ab

    A

    B

    a

    b

    CK

    CLR

    Source: Prof. V. D. Agrawal

  • 7/29/2019 ch04_2

    32/44

    32EE4012VLSI DesignNational Central University

    Binary Counter: Gray Encoding

    Presentstate

    Next state

    a b A B

    0 0 0 1

    0 1 1 1

    1 0 0 0

    1 1 1 0

    A = ab + ab

    B = ab + ab

    A

    B

    a

    b

    CK

    CLR

    Source: Prof. V. D. Agrawal

    Th Bit C t

  • 7/29/2019 ch04_2

    33/44

    33EE4012VLSI DesignNational Central University

    Three-Bit Counters

    Binary Gray-code

    State No. of toggles State No. of toggles

    000 - 000 -

    001 1 001 1

    010 2 011 1

    011 1 010 1

    100 3 110 1

    101 1 111 1

    110 2 101 1

    111 1 100 1000 3 000 1

    Av. Transitions/clock = 1.75 Av. Transitions/clock = 1

    Source: Prof. V. D. Agrawal

    N Bit Counter: Toggles in Counting Cycle

  • 7/29/2019 ch04_2

    34/44

    34EE4012VLSI DesignNational Central University

    N-Bit Counter: Toggles in Counting Cycle

    Binary counter: T(binary) = 2(2N 1) Gray-code counter: T(gray) = 2N

    T(gray)/T(binary) = 2N-1/(2N 1) 0.5Bits T(binary) T(gray) T(gray)/T(binary)

    1 2 2 1.0

    2 6 4 0.6667

    3 14 8 0.5714

    4 30 16 0.5333

    5 62 32 0.5161

    6 126 64 0.5079

    - - 0.5000

    Source: Prof. V. D. Agrawal

    FSM State Encoding

  • 7/29/2019 ch04_2

    35/44

    35EE4012VLSI DesignNational Central University

    FSM State Encoding

    11

    01000.1

    0.10.4

    0.3

    0.6 0.9

    0.6

    01

    11000.1

    0.10.4

    0.3

    0.6 0.9

    0.6

    Expected number of state-bit transitions:

    1(0.3+0.4+0.1) + 2(0.1) = 1.0

    Transitionprobability

    based on

    PI statistics

    State encoding can be selected using a power-based cost function.

    2(0.3+0.4) + 1(0.1+0.1) = 1.6

    Source: Prof. V. D. Agrawal

    FSM: Clock Gating

  • 7/29/2019 ch04_2

    36/44

    36EE4012VLSI DesignNational Central University

    FSM: Clock-Gating

    Moore machine: Outputs depend only on thestate variables. If a state has a self-loop in the state transition

    graph (STG), then clock can be stoppedwhenever a self-loop is to be executed.

    Sj

    Si

    Sk

    Xi/Zk

    Xk/Zk

    Xj/Zk

    Clock can be stoppedwhen (Xk, Sk) combination

    occurs.

    Source: Prof. V. D. Agrawal

    Clock Gating in Moore FSM

  • 7/29/2019 ch04_2

    37/44

    37EE4012VLSI DesignNational Central University

    Clock-Gating in Moore FSM

    Combinational

    logic

    Latch

    Clock

    activation

    logic

    Flip-flops

    PI

    CK

    PO

    L. Benini and G. De Micheli,Dynamic Power Management,

    Boston: Springer, 1998.

    Source: Prof. V. D. Agrawal

    Bus Encoding for Reduced Power

  • 7/29/2019 ch04_2

    38/44

    38EE4012VLSI DesignNational Central University

    Bus Encoding for Reduced Power

    Example: Four bit bus 0000 1110 has three transitions. If bits of second pattern are inverted, then 0000

    0001 will have only one transition.

    Bit-inversion encoding for N-bit bus:

    Number of bit transitions

    0 N/2 N

    N

    N/2

    0Numbe

    rofbittrans

    itions

    afterin

    versionenc

    oding

    Source: Prof. V. D. Agrawal

    Bus-Inversion Encoding Logic

  • 7/29/2019 ch04_2

    39/44

    39EE4012VLSI DesignNational Central University

    Bus-Inversion Encoding Logic

    Polarity

    decisionlogic

    Sent

    data

    Received

    data

    Bus register

    Polarity bitM. Stan and W. Burleson, Bus-InvertCoding for Low Power I/O,IEEE

    Trans. VLSI Systems, vol. 3, no. 1, pp.

    49-58, March 1995.

    Source: Prof. V. D. Agrawal

    RTL Level Design Si l G ti

  • 7/29/2019 ch04_2

    40/44

    40EE4012VLSI DesignNational Central University

    RTL-Level Design Signal Gating

    module decoder (a, sel);

    input [1:0[ a;

    ouput [3:0] sel;reg [3:0] sel;

    always @(a) begincase (a)

    2b00: sel=4b0001;

    2b01: sel=4b0010;2b10: sel=4b0100;2b11: sel=4b1000;

    endcaseend

    endmodule

    module decoder (en,a, sel);

    input en;

    input [1:0[ a;ouput [3:0] sel;reg [3:0] sel;

    always @({en,a}) begincase ({en,a})

    3b100: sel=4b0001;3b101: sel=4b0010;3b110: sel=4b0100;3b111: sel=4b1000;default: sel=4b0000;

    endcaseend

    endmodule

    Decoder with enableSimple Decoder

    RTL Level Design D t th R d i

  • 7/29/2019 ch04_2

    41/44

    41EE4012VLSI DesignNational Central University

    RTL-Level Design Datapath Reordering

    Mux

    Muxstable

    glitchyA

  • 7/29/2019 ch04_2

    42/44

    42EE4012VLSI DesignNational Central University

    RTL-Level Design Memory Partition

    128x32

    MU

    X

    32

    128x32

    dout

    32

    32

    dout

    dout

    qdpre_addr

    clk

    write

    addr

    din

    din

    addr

    write

    addr[7:0]addr[7:1]

    addr0

    8

    noe

    noe

    RTL Level Design Memory Partition

  • 7/29/2019 ch04_2

    43/44

    43EE4012VLSI DesignNational Central University

    RTL-Level Design Memory Partition

    ARM

    Core

    64K bytes

    Data

    Addr

    R/W

    Reads

    Addr

    Range64K28K 4K 32K

    Application-driven memory partition

    RTL-Level Design Memory Partition

  • 7/29/2019 ch04_2

    44/44

    44EE4012VLSI DesignNational Central University

    RTL-Level Design Memory Partition

    ARM

    CoreData

    Addr

    R/W

    CS

    Data

    Addr

    R/W

    CS

    Data

    Addr

    R/W

    CS

    Decoder

    A power-optimal partitioned memory organization