A Practical Guide to Low-Power Design: User Experience with CPF

ARM 1176-JZFS CPU-Based Low-Power Subsystem


ARM 1176-JZFS CPU-Based Low-Power Subsystem: Methodology to Reduce Electrical and Functional Failure in a Low-Power Design

By David Flynn, Fellow, ARM; Sachin Idgunji, Architect, ARM; Felix Jen, Manager, Design Implementation, UMC; Wen-Pin Lin, Senior Manager, UMC; and Vivek Shukla, Cadence Architect, Bangalore.

Abstract

Leakage control has become a major design issue because leakage currents drain a battery’s charge even when a device is inactive or in standby mode. Transistor scaling makes the problem worse: transistors in each new process generation leak more than those in previous generations.

A few years ago, designers began using power shut-off in their designs, and EDA suppliers provided low-power methodology solutions. However, power shut-off created a next level of issues: performance, wear-out of power switches, more complex power switch analysis, system-level performance impact from power-up time, test, and reliability. These issues require an accompanying ASIC implementation and verification methodology to reduce the risk of chip failure, both functional and electrical. We demonstrate the application of these techniques and the methodology on an ARM1176-JZFS CPU-based system targeted at a 65nm technology node, which achieves higher speed with lower leakage, using a methodology that reduces post-silicon electrical failure.

Overview of Ulterior Project

Figure 1: Collaboration and Contributors

[Figure 1 labels: "Collaboration Between Leaders to Deliver the Low Power Solution". Contributions shown: dual-Vt technology, 65nm technology models, and SoC implementation; IP and power management; memory compiler with memory shut-offs, standard cells, and PMK library; complete implementation methodology for the Ulterior CPU.]


Joint Collaboration Contributors: This effort was jointly executed by ARM, UMC, and Cadence to accomplish the following tapeout and silicon measurements.

• UMC: 65nm standard process
− Looking for performance and yield on the LP implementation

• ARM: ARM1176JZFS-based SoC to demonstrate power management on a high-performance design
− Low-power architecture
− Power management and low-power memory IP for managing leakage

• Cadence: CPU implementation
− Complete low-power tool and methodology support

UMC Technology Trends and Process Selection for Project

This section discusses the process parameters and process selection. Figure 2 compares the 65nm process node used in this project with the 90nm node it evolved from.

Figure 2: UMC Technology Trends

The Low Leakage (LL) process running at 1.2V delivers approximately half the performance of the Standard Process (SP) running at 1.0V (Figure 3).

Technology node                   L90 (1P9M)                  L65 (1P10M)
Process                           SP/LL                       SP/LL
Lithography                       193nm Dry                   193nm Dry
Core Voltage (V)                  1.0/1.2                     1.0/1.2
tox, Core (A)                     16/22                       12/19
tox, IO (A) (IO Vdd)              30/52/65 (1.8/2.5/3.3V)     30/52/65 (1.8/2.5/3.3V)
Physical Gate Length (nm)         70/80                       40/55
Salicide                          CoSi2                       NiSi
Interconnect                      Cu                          Cu
Inter/Intra Metal Dielectric      Low-k (k=2.9)               Low-k (k=2.9)
1X Metal Pitch (nm)               280                         200
6T SRAM Cell Size (um2)           1.16/0.99*                  0.499*/0.525*

* Cell non-shrinkable


Figure 3: Managing Performance per Watt

As shown in Figure 4, Low Leakage (LL) nodes offer a significant leakage gain (>80x) across the process space. Their leakage is highly sensitive to temperature because of the sub-threshold component. High Performance nodes gain an average of 25% in drive strength, a figure dominated by the process spread.

Figure 4: Low Leakage vs. Performance Node Tradeoffs

High Performance Nodes gain significantly (average 30%) across the corners. The power dissipation can be managed effectively with voltage scaling. Multi-channel devices can be used to reduce the leakage.

[Figure 3 plot: normalized Ioff (pA/um) versus intrinsic ring-oscillator delay (ps/stage) for 65SP at 1.0V, 65LL at 1.2V, 90SP at 1.0V, and 90LL at 1.2V.]

[Figure 4 panels: "Leakage Gain (LL vs SP)" and "Drive strength gain (SP vs LL)", each showing the NMOS and PMOS ratios across process corners (TT, SS, FF, SF, FS) and temperatures (-40, 25, 125).]


Figure 5: Performance Gain and Impact of Voltage Scaling

Delay for some key structures scales as V^(-α), with α in the range of 1.5 to 2. As shown in Figure 6 and Figure 7, temperature sensitivity decreases as the voltage is lowered (the Zero Temperature Coefficient point for the block is around 0.78V). Variability is highly sensitive to voltage and increases drastically at lower voltages, which can impact the functionality of the design.
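The voltage dependence of delay described above can be illustrated with a few lines of Python. This is only a sketch of the stated proportionality (delay scaling as V to the power of minus α, with α between 1.5 and 2); the function name, the 1.0V reference, and the sample voltages are illustrative assumptions, not measured project data.

```python
def relative_delay(v: float, v_ref: float = 1.0, alpha: float = 1.5) -> float:
    """Delay at supply voltage v relative to the delay at v_ref.

    Assumes delay is proportional to V**(-alpha), as stated in the text.
    """
    if v <= 0 or v_ref <= 0:
        raise ValueError("voltages must be positive")
    return (v / v_ref) ** (-alpha)


if __name__ == "__main__":
    # Sweep a few illustrative supply voltages for both ends of the alpha range.
    for alpha in (1.5, 2.0):
        for v in (1.2, 1.0, 0.9, 0.8):
            factor = relative_delay(v, 1.0, alpha)
            print(f"alpha={alpha:.1f}  V={v:.2f}V  delay x{factor:.2f}")
```

A larger α makes the delay penalty of a given voltage reduction steeper, which is why the exponent matters when budgeting performance for voltage scaling.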

Figure 6: Delay Dependencies

[Figure 5 panels: "Performance Gain (SP vs LL)" for Block 1 and Block 2 across process corners and temperatures; "Voltage scaling on performance and power", showing the SP-to-LL ratio of speed and dynamic power versus voltage (0.5V to 1.1V).]

[Figure 6 panels: "Voltage vs Delay (Average)", showing normalized delay versus voltage (0.75V to 1.25V) for Block 1, Block 2, and Block 3; "Delay sensitivity to temperature", showing delay slope with temperature versus voltage (0.85V to 1.1V).]


Figure 7: Variability
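The trend line printed on the variability plot appears in the extracted text as "y = 16.396x-3.0055". Assuming this is a power-law fit, y = 16.396 · x^(-3.0055), which is consistent with the statement that variability rises sharply at low voltage, the fitted curve can be evaluated as below. The power-law reading and the function name are assumptions; the units of y are the arbitrary units of the plot.

```python
def fitted_std_dev(core_voltage: float,
                   coeff: float = 16.396,
                   exponent: float = -3.0055) -> float:
    """Evaluate the assumed power-law trend line y = coeff * V**exponent."""
    if core_voltage <= 0:
        raise ValueError("core voltage must be positive")
    return coeff * core_voltage ** exponent


if __name__ == "__main__":
    # Voltages taken from the x-axis range of the plot (0.7V to 1.2V).
    for v in (0.7, 0.8, 0.9, 1.0, 1.1, 1.2):
        print(f"V={v:.1f}V  fitted std dev = {fitted_std_dev(v):.1f} (arbitrary units)")
```

Under this reading the exponent is close to -3, so halving the voltage would raise the fitted dispersion by roughly a factor of eight.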

ARM1176JZFS based SoC overview and Advance Leakage Control

Figure 8 illustrates the block diagram of the ARM 1176_1616 CPU and its subsystem (the Ulterior chip). Here are the key design features:

a. ARM1176 CPU with dual 16K caches

• With State Retention Power Gating (SRPG) leakage management
• Multi-voltage design (VRAM, VCPU, VSOC), but not DVS, although it will include level shifters
• Support for independent power/energy analysis
• Diagnostic SRPG error rate analysis

b. ARM AXI-based system-on-chip support logic

• SDRAM and Flash memory controllers
• IEM-based performance and leakage controllers
• Level-2 RAM on-chip memory (with BIST)

c. Linux OS port peripherals

• Demonstration of the entire system running real applications

[Figure 7 plot: "Voltage vs Variability", showing observed standard deviation (arbitrary units) versus core voltage (0.7V to 1.2V), with the dispersion data and a fitted trend line labeled y = 16.396x^(-3.0055).]


Figure 8: Architecture of ARM 1176

Ulterior Power Strategy

Ulterior consists of two switchable domains and one always-on domain. The VDDCore power net is a switched power grid derived from VDDCPU. VDDCPU itself can be switched off externally and can run at a different typical voltage than VDDSOC. VDDSOC is the always-on supply for the chip; it feeds the small amount of logic that must remain always on. VDDCore, which can be switched off, has multi-stage turn-on and turn-off control driven by the advanced leakage controller. All the flops in the VDDCore domain are retention flops, while all the memories in the VDDCPU domain can work in three low-power modes. Figure 9 shows the logical power domain definition for the ARM1176_1616 core.
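The relationship between the three supplies can be captured in a small data model. The sketch below is a hypothetical illustration in Python, not CPF syntax: the `Supply` class, the `SUPPLIES` table, and `is_powered` are invented names, and the only facts taken from the text are that VDDSOC is always on, VDDCPU can be switched off externally, and VDDCore is a switched grid derived from VDDCPU.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Supply:
    name: str
    switchable: bool
    parent: Optional[str] = None  # supply this one is derived from, if any


SUPPLIES: Dict[str, Supply] = {
    "VDDSOC": Supply("VDDSOC", switchable=False),                    # always on
    "VDDCPU": Supply("VDDCPU", switchable=True),                     # switched externally
    "VDDCore": Supply("VDDCore", switchable=True, parent="VDDCPU"),  # on-chip switched grid
}


def is_powered(name: str, switch_on: Dict[str, bool]) -> bool:
    """Return True if the named supply is live for the given switch settings."""
    supply = SUPPLIES[name]
    if supply.switchable and not switch_on.get(name, False):
        return False
    if supply.parent is not None:
        # A derived supply can only be live while its source supply is live.
        return is_powered(supply.parent, switch_on)
    return True


if __name__ == "__main__":
    settings = {"VDDCPU": True, "VDDCore": False}
    for supply_name in SUPPLIES:
        print(supply_name, "on" if is_powered(supply_name, settings) else "off")
```

Modeling VDDCore as a child of VDDCPU makes the dependency explicit: switching VDDCPU off externally takes VDDCore down with it, regardless of the on-chip switch state.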

Figure 9: Power Domains on Ulterior

[Figure 8 block labels: "ARM 1176_1616 SOC Block diagram". ARM1176JZ_1616 core (TrustZone-enabled ARM11 core with AMBA AXI interface, memory management, flexible DFT/MBIST, debug interface, 16K instruction cache, 16K data cache, TCRAM 0/1 interfaces, and instruction, data, DMA, and peripheral ports); JTAG debug; SRPG CPU/cache; 64-bit wide "Level-2" banked SRAM; 64-bit AXI interconnection matrix; ALVC leakage control; AHB/APB bridge; PLL and clock generation; Timer x2; UART x2; INTC; GPIO; 16-bit Flash; 32-bit SDRAM.]

[Figure 9 labels: "Ulterior ARM1176 Voltage domains (Logical)", overlaid on the ARM1176 macrocell floorplan of cache data, tag, valid, dirty, BTAC, TLB, and TCM RAMs: VDDCPU (can be switched off externally), VDDSOC (always-on logic), and VVDDCPU (gated logic). A note on the figure reads: "Need to check if all output ports are clamped".]


To reduce wear-out of the power switches while maintaining performance, Ulterior uses two kinds of power switch matrices: the weak network and the strong network. Each network is controlled separately by the Advance Leakage Controller (ALC): the weak network has 8 power shut-off control request inputs with acknowledges, and the strong network has one shut-off enable request. The weak resistive network brings up the virtual grid gently, with sufficient current to ensure that VVDD reaches 0.95*VDD at high temperature. The strong matrix is turned on once the virtual grid reaches 0.95*VDD, to reduce the IR drop. The weak network is implemented with 8 enables so that wear-out experiments can be carried out with 1, 2, 4, or 8 enables. The selection of acknowledge signals for all the power controls is based on STA measurements. Figure 10 shows one example sequence, where 8≥N≥1.
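The two-stage power-up described above amounts to a simple threshold decision: keep only the weak network on until the virtual rail reaches 95% of VDD, then enable the strong network. The Python sketch below shows that decision on a list of sampled VVDD values. The function name and the sample waveform are hypothetical; in the actual design the hand-off is driven by the ALC using acknowledge signals, not by sampling the rail voltage.

```python
from typing import Optional, Sequence


def strong_enable_index(vvdd_samples: Sequence[float],
                        vdd: float,
                        threshold: float = 0.95) -> Optional[int]:
    """Return the index of the first sample at which VVDD >= threshold * VDD.

    Returns None if the virtual rail never reaches the threshold, in which
    case the strong network should stay off.
    """
    target = threshold * vdd
    for index, vvdd in enumerate(vvdd_samples):
        if vvdd >= target:
            return index
    return None


if __name__ == "__main__":
    # Illustrative ramp of the virtual rail while only the weak network is on.
    ramp = [0.0, 0.4, 0.7, 0.9, 0.96, 1.0]
    print("enable strong network at sample", strong_enable_index(ramp, vdd=1.0))
```

Delaying the strong network until the rail is nearly charged is what keeps the rush current limited to what the weak switches can deliver.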

Figure 10: Power Switch Request and Acknowledge (8≥N≥1)

[Figure 10 signals: VDDCPU, VVDDCPU, PWR_REQ, PWR_ACK, CLOCK, N_ISOLATE, N_RESET, N_PWR_REQ[N], N_PWR_ACK[N].]

The memory subsystem contains 37 single-port memories; each memory can work in three low-power modes (Figure 11):

a. Standby mode (HALT)

• CEN disables the memory

b. Retention mode (SRPG)

• Power is supplied to the core array to retain state
• Power is off for the periphery for reduced leakage
• Outputs are clamped to zero

c. Shutdown mode (HIBERNATE)

• Power is off for the core and periphery for reduced leakage
• Outputs are clamped to zero
• Possible through both integrated MTCMOS and separated power sources for core and periphery
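The three memory modes listed above can be summarized as a lookup table. The following Python sketch is illustrative only: the class and field names are invented, and the HALT row assumes that both supplies stay on and the outputs are not clamped, since the text only says that CEN disables the memory in that mode.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryMode(Enum):
    HALT = "standby"
    SRPG = "retention"
    HIBERNATE = "shutdown"


@dataclass(frozen=True)
class ModeState:
    core_powered: bool
    periphery_powered: bool
    outputs_clamped: bool


MODE_TABLE = {
    # HALT row is an assumption: supplies on, outputs not clamped, CEN deasserted.
    MemoryMode.HALT: ModeState(core_powered=True, periphery_powered=True, outputs_clamped=False),
    # Retention: core array stays powered, periphery is off, outputs clamped to zero.
    MemoryMode.SRPG: ModeState(core_powered=True, periphery_powered=False, outputs_clamped=True),
    # Shutdown: core and periphery are both off, outputs clamped to zero.
    MemoryMode.HIBERNATE: ModeState(core_powered=False, periphery_powered=False, outputs_clamped=True),
}


def retains_state(mode: MemoryMode) -> bool:
    """Memory contents survive only while the core array is powered."""
    return MODE_TABLE[mode].core_powered


if __name__ == "__main__":
    for mode in MemoryMode:
        print(f"{mode.name:9s} retains state: {retains_state(mode)}")
```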

Figure 11: Memory Retention Power Gating

Implementation Overview

Figure 12 illustrates the Cadence CPF-based low-power implementation flow, with the following key highlights:

• A single CPF used from synthesis through the backend to power and timing sign-off
• Leakage optimizations in the synthesis and backend flows
• CPF-based MMMC flow in the Encounter platform
• PSO planning flow to meet performance, electrical, and power goals
• Automated power switch network simulation for multiple combinations
• CRC model-based SPICE simulation to reduce turnaround time (TaT) for complex power switch analysis
• Comprehensive IR/EM checks
• Low-power verification throughout the flow

[Figure 11 labels: word lines; column decoders; sense amps and I/O; CORE VDD, LOGIC VDD, CORE VSS (ground), and LOGIC VSS; PGEN and RETN controls driving HVt switches.]

Figure 12: Ulterior Implementation Flow

[Figure 12 flow labels: a common CPF drives RTL LP simulation and LP auto assertion generation/checks (Incisive Enterprise Simulator); PD-aware logic synthesis and DFT (Encounter RTL Compiler); PD-aware physical implementation (SoC Encounter); timing and SI signoff (Encounter Timing System); IR drop and power signoff (VoltageStorm-PE & DC); physical verification; CPF integration and quality check (Conformal Low Power); and LEC plus power checks (Conformal Low Power).]

While addressing low-power implementation and its verification, the methodology must also be adequate to deal with the challenges of maintaining performance and reliability. Here are the key issues addressed by the implementation methodology:

Power shut-off and MSV implementation

• Maintaining the system performance is a challenge
• The CPF-based methodology simplifies low-power insertion

Low-power verification

• Verifying the low-power implementation
• Through RTL and gate simulations
• Through formal checks

Ensuring reliability

• New power structures and strategies may lead to defects in the t>0 time
• These need to be taken care of in the design
• ARM has come up with a new approach in the design to avoid electrical failures
• How the implementation would support such a mechanism

Ulterior Implementation

Low-Power Verification

Low-power verification is the backbone of any low-power flow. Verification can be performed through dynamic simulation on the RTL and gate-level netlists, as well as through static checks.

Cadence Encounter® Conformal® Low Power verifies the correct implementation of low-power design techniques and validates the design using formal techniques (versus simulation) throughout the design process. It also decreases the risk of bugs escaping before a product goes out the door. Conformal Low Power accepts RTL or gate-level netlists, with or without explicit power and ground nets, together with a CPF file as input. It performs structural and rule-based checks to verify that the low-power implementation matches the power specification defined in the CPF file. Under Low Power Equivalency Checking, Conformal Low Power ensures that low-power optimizations do not introduce a technology-mapping bug or a logical bug into the design netlist. It reads the golden and revised designs along with their CPF files and checks logical equivalence without setting any constraints on the low-power control signals.

The RTL Conformal Low Power flow is used to verify the CPF. It reads the RTL and CPF as input and reports missing and redundant low-power rules with respect to the power architecture of the design.

The Conformal Low Power flows for the synthesis and physical netlists are used to verify the low-power implementation against the power specification defined in the CPF file. Since instances in the synthesis netlist do not have power and ground pins, their power domains are assigned based on the CPF definition. The power domains of instances in the physical netlist are assigned based on power and ground pin connectivity. The power domain consistency check (PDCIC) performs power-aware equivalence checking and checks the low-power cells. The PDCIC between the synthesis and physical netlists performs power-aware equivalence checking between the golden and revised designs. In this case, it assigns the power domains for the synthesis netlist using the CPF definitions, while the power domains for the physical netlist come from the power and ground pin connectivity.
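The idea behind comparing the two domain assignments can be shown with a toy example. The Python sketch below is not Conformal Low Power's algorithm or API; it only illustrates, under that stated simplification, what it means to compare a CPF-derived domain assignment with one inferred from power-pin connectivity. The function name, instance names, and the "<unassigned>" marker are hypothetical.

```python
from typing import Dict, List, Tuple


def domain_mismatches(cpf_domains: Dict[str, str],
                      pin_domains: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """List instances whose CPF-assigned domain differs from the domain
    inferred from power/ground pin connectivity.

    Each entry is (instance, cpf_domain, pin_domain); an instance missing
    from one side is reported with the marker "<unassigned>".
    """
    mismatches = []
    for instance in sorted(set(cpf_domains) | set(pin_domains)):
        from_cpf = cpf_domains.get(instance, "<unassigned>")
        from_pins = pin_domains.get(instance, "<unassigned>")
        if from_cpf != from_pins:
            mismatches.append((instance, from_cpf, from_pins))
    return mismatches


if __name__ == "__main__":
    # Hypothetical instance names; domain names follow the PDxxx style used in the text.
    synthesis_view = {"u_alu": "PDcore", "u_iso0": "PDcpu", "u_timer": "PDsoc"}
    physical_view = {"u_alu": "PDcore", "u_iso0": "PDsoc", "u_timer": "PDsoc"}
    for row in domain_mismatches(synthesis_view, physical_view):
        print("mismatch:", row)
```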


Figure 13: Verification

Figure 14: Ulterior Backend Flow Overview

[Figure 13 labels: starting from the RTL and CPF, Conformal LP verifies CPF consistency; Conformal Low Power equivalence checking follows logic synthesis and DFT (gate netlist, front-end signoff) and physical implementation (physical netlist, back-end signoff). Low-power check progress: CLP (RTL and CPF checks); LEC and power-aware LEC (RTL vs synthesis netlist); CLP (synthesis netlist); LEC and power-aware LEC (synthesis vs backend netlist); power-aware LEC including PDCIC (synthesis vs physical netlist); CLP (physical netlist); CLP unified (hierarchical plus top level) on the physical netlist.]

[Figure 14 flow steps: design import; load CPF and create RC and timing-optimization MMMC views; floorplanning of power domains and relative macro placement; end cap, well-tap, and power switch cell placement; power planning (PSO, well-tap hookup); placement (isolation, SRPG); always-on net synthesis for RETAIN and PSO control signals; power routing (sroute for level-shifter secondary and standard-cell preroute, NanoRoute for SRPG secondary and always-on nets); multi-mode pre-CTS optimization; domain-aware CTS; multi-mode post-CTS optimization; SI- and domain-aware NanoRoute; multi-mode post-route optimization; multi-mode post-route SI optimization; multi-mode hold optimization; multi-mode leakage optimization; SoC signoff timing and CeltIC checks.]


Ulterior PnR Flow

It is important that the methodology and flow used for place and route (P&R) capture the additional complexity introduced by the power strategy. The backend tool takes the power architecture information from the CPF file and can perform all the steps in an automated manner. Figure 14 captures the CPF-based automated low-power P&R flow used in the project.

The flow starts with the design import and loading of the CPF file. Inside the Cadence SoC Encounter® RTL-to-GDSII System, the CPF is loaded and committed on the design through the loadCPF and commitCPF commands, respectively.

The loadCPF command mainly captures the following information from the CPF:

− Low-power cells such as level shifter, isolation cell, power switch cell, SRPG, and always-on buffers
− All the power and ground nets
− The power domains with switchable attributes and their global connections
− Library attachments to the appropriate power domains
− Rules such as power switch, level shifter, isolation cell, and SRPG
− Different analysis views

The commitCPF command mainly creates the following information according to the loaded CPF:

− Creates power domains and defines their global connections
− Checks and inserts level shifters and isolation cells based on the rules
− Checks and replaces flip-flops with SRPGs based on the rules
− Creates the analysis views

Once the design is imported and the CPF is loaded, the following are the key steps performed by SoC Encounter:

(i) Low-power CPF flow and the MMMC settings
(ii) Different kinds of power shut-off (PSO) for the design
• On-chip PSO
− Column-based checkerboard PSO
− PSO for hard macros (memory)
• Off-chip PSO
− VDDCPU, the secondary domain for VDDCore, can also be shut off externally
(iii) Different kinds of level shifter implementation for the design
− Low-to-high level shifter
(iv) Isolation implementation for the PSO power domains
(v) State retention for the PSO power domains
(vi) Always-on net synthesis
(vii) Secondary power pin connection for the SRPG, always-on buffer, and level shifter cells
(viii) Placement in MMMC
(ix) On-chip variation (OCV) timing analysis mode
(x) Timing optimization and analysis in MMMC
(xi) Clock tree synthesis in MMMC
(xii) Domain-aware routing
(xiii) MMMC SI closure
(xiv) Hold timing optimization in MMMC
(xv) MMMC leakage optimization
(xvi) Running multiple-CPU processing to reduce the runtime for multiple-mode analysis

Figure 15: Floorplan and power plan

Figure 15 illustrates the floorplan and the power switch columns of the Ulterior design.


The key to the implementation is power network planning. As described in the power strategy section, two types of power switch network are needed: the weak network and the strong network. The weak network has eight different enables and the same number of acknowledgements.

Weak Network Enable

As shown in Figure 16, 8 weak enables feed 16 columns that are spread uniformly and interleaved; this reduces the rush current and brings up the power grid gently to 95% of VDD. Every vertical weak column skips a certain number of cell rows (13 rows, to be precise), and the skipped rows hold either a strong network switch cell or a weak network return-path switch cell.

Figure 16: Weak Network Enable

Weak Network Acknowledgement

As shown in Figure 17, there are 16 separate acknowledgements on the return path of the weak network. Of these 16 acknowledgements, 8 are connected to the ALC state machine, selected on the basis of STA measurements.

Figure 17: Weak Network Acknowledgement

[Figure 16 signals: REQ_WEAK_0 through REQ_WEAK_7.]

[Figure 17 signals: ACK_WEAK_0 through ACK_WEAK_7 and ACK_WEAK_0_1 through ACK_WEAK_7_1.]


Strong Network Request

The strong network is used to reduce the IR drop once the supply has ramped up to 95% of VDD through the weak power network. The strong network has more PSO switches than the weak network, but the number must not be so high that leakage through the switches becomes an issue. As shown in Figure 18, a single request for the strong network feeds 16 columns that are spread uniformly, and every column has 351 strong network cells. The implementation therefore ends up with 5616 strong power switches (16 × 351), and the total leakage through the PSO is 0.6mW.
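The switch count follows directly from the column figures quoted above, and the same numbers give a rough per-switch leakage estimate. The short Python check below makes one loud assumption: that the entire 0.6mW of PSO leakage is attributed to the strong switches alone, which slightly overstates the per-switch figure because the weak switches also leak.

```python
COLUMNS = 16                  # strong-network columns (from the text)
SWITCHES_PER_COLUMN = 351     # strong switches per column (from the text)
TOTAL_PSO_LEAKAGE_W = 0.6e-3  # total PSO leakage of 0.6mW (from the text)

total_strong_switches = COLUMNS * SWITCHES_PER_COLUMN

# Assumption: all PSO leakage is attributed to the strong switches.
leakage_per_switch_w = TOTAL_PSO_LEAKAGE_W / total_strong_switches

if __name__ == "__main__":
    print("strong switches:", total_strong_switches)
    print(f"approx. leakage per switch: {leakage_per_switch_w * 1e6:.3f} uW")
```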

Figure 18: Strong Network Request

Strong Network Acknowledgment

Figure 19 captures the return path of the strong network with 16 acknowledges; STA measurement has been performed to choose one out of the 16 to connect to the ALC controller.

Figure 19: Strong Network Acknowledge Return Path

[Figure 18 signal: REQ_STRONG.]

[Figure 19 signals: ACK_STRONG_0 through ACK_STRONG_7 and ACK_STRONG_0_1 through ACK_STRONG_7_1.]


Well Tap Cells

The standard cell library from ARM supports the back-bias technique for reducing leakage. This technique is not used in this project, but proper well taps still need to be inserted for bulk connections. Figure 20 shows the well tap placement in the design. The SoC Encounter system automatically takes care of the domain association of the well tap cells while inserting them.

Figure 20: Well Tap Cells

Isolation Cells Insertion and Placement

Isolation cells are inserted between the always-on and switchable domains based on the isolation rules and cells specified in the CPF file. As shown in Figure 21, in the Ulterior design, isolation cells have been placed between PDsoc and PDcore, PDcpu and PDcore, and PDsoc and PDcpu. These isolation cells are placed in the ON domain.
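A common way to reason about where isolation is required is to ask whether the driving domain can be off while the receiving domain is still on. The Python sketch below applies that general rule to the three domains named above. It is an illustration, not the CPF isolation rules used in the project: the mapping of PDsoc, PDcpu, and PDcore onto VDDSOC, VDDCPU, and VDDCore, and the assumption that PDcore is derived from PDcpu, are inferred from the power strategy description, and all function names are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed mapping: PDsoc is always on, PDcpu is switched externally,
# PDcore is a switched grid derived from PDcpu.
PARENT: Dict[str, Optional[str]] = {"PDsoc": None, "PDcpu": None, "PDcore": "PDcpu"}
SWITCHABLE: Dict[str, bool] = {"PDsoc": False, "PDcpu": True, "PDcore": True}


def ancestors(domain: str) -> List[str]:
    """Domains whose supply the given domain is derived from."""
    chain = []
    parent = PARENT[domain]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def can_power_down(domain: str) -> bool:
    return SWITCHABLE[domain] or any(SWITCHABLE[a] for a in ancestors(domain))


def needs_isolation(src: str, dst: str) -> bool:
    """True if src can be off while dst is still on."""
    if src == dst or not can_power_down(src):
        return False
    # If dst is derived from src, dst is off whenever src is off.
    return src not in ancestors(dst)


if __name__ == "__main__":
    for src in PARENT:
        for dst in PARENT:
            if needs_isolation(src, dst):
                print(f"isolate signals from {src} to {dst}")
```

Under these assumptions the rule flags the same three domain pairs that the text lists, with signals from PDsoc never needing isolation because their source is always on.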

Figure 21: Isolation Cells


SRPG Cells

There are ~40k state retention flops in the PDcore PSO power domain (Figure 22). While all the flops are inserted during synthesis, their placement and power connections are performed during P&R. The secondary power pin connection to always-on power for the state-retention flops is shown in Figure 23; the SRPG is a double-height cell with VSS at the bottom. The entire secondary power hookup for the SRPG cells was done using the Cadence NanoRoute router.

Figure 22: SRPG Cells Spread

Figure 23: SRPG Standard Cell


Ulterior Power Sign-Off

With the multiple power switch networks and multiple enable conditions for the weak network, this project required extensive power switch simulations, as illustrated in Figure 24. Here is a brief summary of the analysis performed:

• Power calculation
− Using the Common Power Engine (CPE) for power calculation for different modes and corners
− Using Powermeter to generate dynamic current waveforms with a vectorless approach

• Static and dynamic IR-drop analysis
− Using Cadence VoltageStorm power analysis for static IR-drop analysis and EM check with power from CPE
− Using VoltageStorm for dynamic IR-drop analysis with current waveforms from Powermeter

• Power gating analysis for the multiple request enable conditions
− Using Powermeter to generate SPICE decks and Ultrasim to run the power-up simulation
− Using VoltageStorm for dynamic IR-drop analysis with rush current from power-up
− Performing ECOs on the power switch network to bring the ramp-up time, rush current, and request-to-acknowledge time within the required specs

• Decap ECO flow and PSO ECO flow

Figure 24: Power Analysis Flow

[Figure 24 flow labels: inputs are the library (SPICE model, .cl, .lib, .spice) and design data from the front end (DEF, LEF, SPEF, TWF). Powermeter performs power calculation and power-up deck generation; Ultrasim runs the PSO SPICE simulation, producing .tm waveforms and a .pti peak rush current file; VStorm performs static IR/EM analysis and dynamic IR analysis, producing plots, reports, a block .cl for top-level analysis, and a decap ECO file.]

Except for the VStorm dynamic IR-drop analysis, all runs were done in 3 corners (FF, SS, TT). The VStorm dynamic IR-drop analysis was done in TT only. Window-based decap ECO was used to get a rough idea of how much decap was needed.


Figure 25 illustrates the static IR-drop plots and numbers. The average drop across the PSO varies between 15% and 25% of the total IR drop, so careful analysis is needed when planning the PSO strategy in order to maintain performance.

Figure 25: Static IR-Drop Analysis

Figures 26 and 27 illustrate the power switch network simulation results and waveforms. For simplicity, Figure 27 shows only the limited, optimal set of simulation results from the end of the project; during power switch network optimization, several combinations and corners were tried to arrive at the optimized power switch network. The rush current (Ipeak and Iavg) is minimal when only one weak enable is turned on, but the ramp-up time is 12X longer than with all 8 weak enables on. Similarly, the condition with 8 weak enables has 12X more rush current than the condition with one weak enable.


Figure 26: Power Switch Network Simulation Results

Figure 27: Power-Up Simulation Waveforms


Assembly and Packaging

Figure 28 illustrates the bond diagram of the Ulterior SoC. It uses a 352-pin package for the 4x4 square mm die. It contains the following:

• 180/244 signal pins
• 40 mixed-signal (power measurement)
• 52 power/ground

Figure 28: Assembly and Packaging


Ulterior Implementation Results

While silicon measurements are in progress, here are the statistics from the tape-out data:

Gate Count: 1.1M gates, Instance Count: ~300K, 37 memories, 3.09mm2

• Utilization is 80% for core

Performance: 615 MHz in WC corner (0.9V, SS, 125C)

Power Savings:
• Ulterior total leakage power savings: 3X
• VDDCore domain leakage power savings: 100X

2 types of power switch network:
• Weak: 14 columns, 8 enables, 280 total switches, 20 per column
• Strong: 16 columns, 1 enable (16 acks), 5616 total switches, 351 per column

Extensive power analysis:
• Ramp-up time for every corner (with combinations of weak enables), ranging from 8ns to 107ns
• Rush current control with multiple combinations (ranging from 39mA to 481mA)
• Current limit spec has been met for the HEADER follow pin through multiple iterations
• Drop through the switch: 6mV (average), range + 3mV
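The ramp-up and rush-current ranges reported above can be compared with the roughly 12X trade-off quoted in the power sign-off discussion. The Python sketch below simply computes the ratio between the extremes of each range; treating those extremes as the one-enable and eight-enable cases is an assumption, since the results do not say which corner or enable combination produced each value.

```python
def spread(low: float, high: float) -> float:
    """Ratio between the largest and smallest reported value."""
    if low <= 0:
        raise ValueError("low must be positive")
    return high / low


RAMP_UP_NS = (8.0, 107.0)        # ramp-up time range from the results
RUSH_CURRENT_MA = (39.0, 481.0)  # rush current range from the results

ramp_ratio = spread(*RAMP_UP_NS)
rush_ratio = spread(*RUSH_CURRENT_MA)

if __name__ == "__main__":
    print(f"ramp-up time spread: {ramp_ratio:.1f}x")
    print(f"rush current spread: {rush_ratio:.1f}x")
```

Both spreads land in the low teens, which is in line with the order-of-magnitude trade-off between ramp-up time and rush current described for the weak enables.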

David Flynn, a Fellow in R&D at ARM Ltd, has been with the company since 1991, specializing in System-on-Chip IP deployment and methodology. He is the original architect behind ARM’s synthesizable CPU family and the AMBA on-chip interconnect standard. His current research focus is low-power system-level design. He holds a number of patents in on-chip bus, low power and embedded processing sub-system design and holds a BSc in Computer Science from Hatfield Polytechnic, UK and a Doctorate in Electronic Engineering from Loughborough University, UK. He is currently Visiting Professor with the Electronics and Computer Science Department at Southampton University, UK.

Sachin Idgunji is a Principal Engineer in the Research Group at ARM Inc., specializing in systems and circuit design and analysis. His current research focus is variation analysis, low-power design, and statistical techniques. Prior to joining ARM, he was at Synopsys Inc., where he led several projects ranging from design specification through tape-out in the areas of graphics, networking, and embedded processing. Prior to Synopsys, Sachin worked at IBM Labs (India) and PCS-Data General. He has over 18 years of industry experience and holds a BE in Electronics Engineering from Shivaji University, India.


Felix Jen, a section manager in the Design Technology Support Section of IP Development and Design Support Division at UMC, has been with the company since 2002, with expertise mainly focusing on IC design implementation and design methodology.

Wen-Pin Lin, a Senior Technical Manager and Staff in IP Development and Design Support Division at UMC, joined the company in 2007. His expertise mainly focuses on deep sub-micron IC design implementation and design methodology.

Vivek Shukla serves as an R&D Architect at Cadence Design Systems, Bangalore. Before Cadence, Vivek worked at Beceem Communications, a startup in Wi-Max, where he led multiple tape-outs of Wi-Max 802.16e standard-compliant chips. Prior to Beceem, he had a 5-year stint at Intel, during which he worked on and led efforts for Ethernet chips, multi-core processors, and CSI development, and was responsible for processor methodologies in a wide range of areas including timing closure, custom design, and mixed-signal design. Prior to Intel, he worked at Motorola on DSP processors and led chip designs for automotive applications. He holds a B.Tech in Electronics and Communications Engineering from IT-BHU, India and has 2 design patents in the area of low power and high speed interconnect.